Tanaka Corpus

From EDRDG Wiki
Revision as of 04:30, 19 March 2010 by JimBreen (talk | contribs) (Importing the summary Tanaka document)

Introduction

This page provides brief documentation for the Tanaka Corpus of parallel Japanese-English sentences, and in particular for the modification and editing that have been carried out so that the corpus can be used as a source of examples in the WWWJDIC dictionary server and other systems.

The corpus was compiled by Professor Yasuhito Tanaka at Hyogo University and his students, as described in his Pacling2001 paper. At Pacling2001 Professor Tanaka released copies of the corpus, and stated that it is in the public domain. According to Professor Christian Boitet, Professor Tanaka did not think the collection was of a very good standard. (Sadly, Prof. Tanaka died in early 2003.)

At the 2002 Papillon workshop in Tokyo, Professor Boitet included a copy of the corpus on a CD distributed to participants. Jim Breen realised it had the potential to be a source of example sentences in the WWWJDIC server. He edited, reformatted and indexed the corpus and linked it at the word level to the dictionary function in the server (see below).

Compilation

Professor Tanaka's students were given the task of collecting 300 sentence pairs each. After several years, 212,000 sentence pairs had been collected.

From inspection, it appears that many of the sentence pairs were derived from textbooks, e.g. books used by Japanese students of English. Some are lines from songs; others come from popular books and Biblical passages.

The original collection contained large numbers of errors in both the Japanese and the English. Many were errors of spelling and transcription, although in a significant number of cases the Japanese or the English contained grammatical or syntactic errors, or the translations did not match at all.

The original file can still be downloaded (see below).

Initial Modifications to the Corpus

As mentioned above, the Tanaka Corpus was edited and adapted for use within the WWWJDIC dictionary server as a set of example sentences associated with words in the dictionary. To adapt the corpus for this role, the following edits were made:

  1. an initial regularization of the punctuation of the Japanese and English sentences was carried out, then duplicate pairs were removed, reducing the original file from 210,000 pairs to 180,000 pairs;
  2. sentences that differed only in orthography (e.g. kana/kanji usage, okurigana differences), numbers, proper names, or minor grammatical points such as plain/polite verb usage were reduced to single representative examples;
  3. sentences where the Japanese consisted only of a short statement in kana were removed;
  4. sentences with spelling errors, kana-kanji conversion errors, etc. were corrected;
  5. sentences where the English version did not match the Japanese were edited to make the two versions agree;
  6. where the sentences contained gender-specific language or words, the English portion was tagged with [M] or [F] respectively;
  7. sentences where the Japanese was too garbled to derive a valid English equivalent were removed.
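
As a rough illustration of steps 1 and 6, the sketch below shows one way duplicate removal and tag handling might be done. It is not the procedure actually used on the corpus, which this page does not describe in detail. It assumes that "regularization" can be approximated by Unicode normalization plus whitespace collapsing, and that an [M] or [F] tag may appear anywhere in the English text. The function names normalize, dedupe_pairs and split_gender_tag are hypothetical.

```python
import re
import unicodedata

# Matches an [M] or [F] tag together with any surrounding whitespace.
GENDER_TAG = re.compile(r"\s*\[(M|F)\]\s*")


def normalize(text):
    """Assumed stand-in for 'regularization': NFC-normalize the text,
    collapse whitespace runs to one space, and trim both ends."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def dedupe_pairs(pairs):
    """Return normalized (japanese, english) pairs, keeping only the
    first occurrence of each pair and preserving input order."""
    seen = set()
    unique = []
    for japanese, english in pairs:
        key = (normalize(japanese), normalize(english))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def split_gender_tag(english):
    """Return (sentence without tag, 'M', 'F' or None)."""
    match = GENDER_TAG.search(english)
    if match is None:
        return english, None
    return normalize(GENDER_TAG.sub(" ", english)), match.group(1)


if __name__ == "__main__":
    sample = [
        ("彼は学生です。", "He is a student."),
        ("彼は学生です。 ", "He  is a student."),  # differs only in spacing
        ("私は行く。", "I will go."),
    ]
    for pair in dedupe_pairs(sample):
        print(pair)
    print(split_gender_tag("I'm so tired. [F]"))
```

Exact matching after normalization only catches pairs that are identical apart from spacing. Near-duplicates of the kind described in step 2 (orthographic variants, changed names or numbers) would need fuzzier comparison or manual review.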

The process described above has continued, and at present the edited corpus has just over 150,000 sentence pairs.

Incorporation into the WWWJDIC Server

(The initial incorporation of the Tanaka Corpus in the WWWJDIC server is described in a paper presented to the 2003 Papillon workshop.)