Tanaka Corpus: Difference between revisions

From EDRDG Wiki
The corpus was compiled by Professor Yasuhito Tanaka at Hyogo University and his students, as described in his [http://www.edrdg.org/projects/tanaka/tanaka.pdf Pacling2001] paper. At Pacling2001 Professor Tanaka released copies of the corpus, and stated that it is in the public domain. According to Professor Christian Boitet, Professor Tanaka did not think the collection was of a very good standard. (Sadly, Prof. Tanaka died in early 2003.)


At the 2002 Papillon workshop in Tokyo, Professor Boitet included a copy of the corpus on a CD distributed to participants. Jim Breen realised it had potential to be a source of example sentences in the WWWJDIC server. He edited, reformatted and indexed the corpus and linked it at the word level to the dictionary function in the server (see below.)
 
The inclusion of the Corpus in the WWWJDIC server exposed it to a wide audience, and a number of other systems incorporated the corpus into their operation. It also began to be used in some research projects in natural language processing.
 
In 2006 the Corpus was incorporated into the [http://tatoeba.org/home Tatoeba Project] being developed by Trang Ho to provide a sentence-based multi-lingual resource. That project is now the "home" of the Corpus.
=Compilation=
Professor Tanaka's students were given the task of collecting 300 sentence pairs each. After several years, 212,000 sentence pairs had been collected.
To see an example of the sentence linking, [http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?1Q%C2%E7%B3%D8%C0%B8_1_ here] is the sentence display for 大学生. There is also a function for [http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?10 browsing] the sentences.
=Current Format (WWWJDIC)=
As mentioned above, the Tanaka Corpus is now being maintained within the [http://tatoeba.org/home Tatoeba Project]. The Japanese-English pairs are extracted weekly and formatted for use by WWWJDIC in the format below.
The file is in text format, with the Japanese in EUC-JP coding (UTF-8 encoding is also available).
The format is as follows:
# the file consists of pairs of lines, beginning with "A: " and "B: " respectively. There may also be comment lines which begin with a "#".
# the "A:" lines contain the Japanese sentence and the English translation, separated by a TAB character. At the end of the English translation is a sequence number identifying the sentence pair. It is in the format: #ID=nnnnnn. This sequence number is the identifier of the English sentence in the Tatoeba Project.
# the "B:" lines contain a space-delimited list of Japanese words found in the preceding sentence.
# the Japanese words in the "B:" lines can have the following appended:
## a reading in hiragana. This is to resolve cases where the word can be read different ways. WWWJDIC uses this to ensure that only the appropriate sentences are linked. The reading is in "round" parentheses.
## a sense number. This occurs when the word has multiple senses in the EDICT file, and indicates which sense applies in the sentence. WWWJDIC displays these numbers. The sense number is in "square" parentheses.
## the form in which the word appears in the sentence. This will differ from the indexing word if it has been inflected, for example. This field is in "curly" parentheses.
## a "~" character to indicate that the sentence pair is a good and checked example of the usage of the word. Words are marked to enable appropriate sentences to be selected by dictionary software. Typically only one instance per sense of a word will be marked. The WWWJDIC server displays these sentences below the display of the related dictionary entry.
The following example pair illustrates the format:
A: その家はかなりぼろ屋になっている。[TAB]The house is quite run down.#ID=25507
B: 其の{その} 家(いえ)[1] は 可也{かなり} ぼろ屋[1]~ になる[1]{になっている}
=Subset=
An automatically-generated subset of the edited corpus is also available. The subset contains those sentences with one or more words marked with a "~" (see above.)
=Tatoeba Project=
The Corpus is now maintained within the [http://tatoeba.org/home Tatoeba Project]. This project has extended the file to include many other languages, and many sentences are available in three or more languages. The project WWW site has extensive facilities for searching and editing the sentences, and has an active community of people entering and editing sentences.
=Copyright Issues=
Professor Tanaka originally placed the Corpus in the Public Domain, and that status was maintained for the versions used by WWWJDIC. In late 2009 the Tatoeba Project decided to move it to a Creative Commons [http://creativecommons.org/licenses/by/2.0/fr/deed.en_GB CC-BY] licence (that project is in France, where the concept of public domain is not part of the legal framework.) It can be freely downloaded and used provided the source is attributed.
=Downloads=
The original file is available from
* [ftp://ftp.monash.edu.au/pub/nihongo/tanakacorp_utf8.gz here] (in UTF8 coding) or
* [ftp://ftp.monash.edu.au/pub/nihongo/tanakacorp_euc.gz here] (in EUC-JP coding).
'''Please do not use these versions in projects'''.
The edited version used in the WWWJDIC server can be downloaded from:
* [http://www.csse.monash.edu.au/~jwb/examples.gz complete version] (EUC-JP).
This is the current file being used by the Monash WWWJDIC server. Each time it is updated a [http://www.csse.monash.edu.au/~jwb/examples_date date-stamp] is set;
* [http://www.csse.monash.edu.au/~jwb/examples.utf.gz complete version] (UTF-8)
* [http://www.csse.monash.edu.au/~jwb/examples_s.gz subset] (in EUC-JP).
Downloads can also be made from the [http://tatoeba.org/eng/pages/download-tatoeba-example-sentences Tatoeba database].

Revision as of 05:24, 19 March 2010

Introduction

This page provides some brief documentation for the Tanaka Corpus of parallel Japanese-English sentences, and in particular the modification and editing that has been carried out to enable use of the corpus as a source of examples in the WWWJDIC dictionary server and other systems.

The corpus was compiled by Professor Yasuhito Tanaka at Hyogo University and his students, as described in his Pacling2001 paper. At Pacling2001 Professor Tanaka released copies of the corpus, and stated that it is in the public domain. According to Professor Christian Boitet, Professor Tanaka did not think the collection was of a very good standard. (Sadly, Prof. Tanaka died in early 2003.)

At the 2002 Papillon workshop in Tokyo, Professor Boitet included a copy of the corpus on a CD distributed to participants. Jim Breen realised it had potential to be a source of example sentences in the WWWJDIC server. He edited, reformatted and indexed the corpus and linked it at the word level to the dictionary function in the server (see below.)

The inclusion of the Corpus in the WWWJDIC server exposed it to a wide audience, and a number of other systems incorporated the corpus into their operation. It also began to be used in some research projects in natural language processing.

In 2006 the Corpus was incorporated into the Tatoeba Project being developed by Trang Ho to provide a sentence-based multi-lingual resource. That project is now the "home" of the Corpus.

Compilation

Professor Tanaka's students were given the task of collecting 300 sentence pairs each. After several years, 212,000 sentence pairs had been collected.

From inspection, it appears that many of the sentence pairs have been derived from textbooks, e.g. books used by Japanese students of English. Some are lines of songs, others are from popular books and Biblical passages.

The original collection contained large numbers of errors, both in the Japanese and English. Many of the errors were in spelling and transcription, although in a significant number of cases the Japanese and English contained grammatical, syntactic, etc. errors, or the translations did not match at all.

The original file can still be downloaded (see below.)

Initial Modifications to the Corpus

As mentioned above, the Tanaka Corpus was edited and adapted to be used within the WWWJDIC dictionary server as a set of example sentences associated with words in the dictionary. In order to adapt the corpus for this role, it was edited as follows:

  1. an initial regularization of the punctuation of the Japanese and English sentences was carried out, then duplicate pairs were removed, reducing the original file from 210,000 pairs to 180,000 pairs;
  2. sentences which differed only in orthography (e.g. kana/kanji usage, okurigana differences), numbers, proper names, minor grammatical points such as plain/polite verb usage, etc. were reduced to single representative examples;
  3. sentences where the Japanese consisted of a short Japanese statement in kana were removed;
  4. sentences with spelling errors, kana-kanji conversion errors, etc. were corrected;
  5. sentences where the English version did not match the Japanese were edited to make the two versions agree;
  6. where the sentences contain gender-specific language or words, the English portion has been tagged with [M] or [F] respectively;
  7. sentences where the Japanese was too garbled to derive a valid English equivalent were removed.

The process described above has continued, and at present the edited corpus has just over 150,000 sentence pairs.
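
Step 1 above (regularization followed by duplicate removal) can be illustrated with a minimal Python sketch. The actual regularization rules used for the corpus are not documented on this page, so a simple whitespace clean-up stands in for them here; the function names `normalize` and `dedupe_pairs` are hypothetical.

```python
def normalize(text):
    """Collapse runs of whitespace and trim the ends.

    This is an assumed stand-in for the punctuation regularization
    that was actually applied to the corpus.
    """
    return " ".join(text.split())


def dedupe_pairs(pairs):
    """Yield each (japanese, english) pair once, in normalized form."""
    seen = set()
    for japanese, english in pairs:
        key = (normalize(japanese), normalize(english))
        if key not in seen:
            seen.add(key)
            yield key
```

Pairs are compared only after normalization, which is why regularizing first allows more duplicates to be detected.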

Incorporation into the WWWJDIC Server

(The initial incorporation of the Tanaka Corpus in the WWWJDIC server is described in a paper presented to the 2003 Papillon workshop.) In order to facilitate the linking of sentences in the Corpus to words in the online dictionary, a list of Japanese words and phrases was extracted from each sentence. This was carried out using the Chasen morphological analysis program. Compound words which had dictionary entries were recombined as necessary. At present about 27,000 unique Japanese words and phrases are indexed.

The list of words associated with each sentence is used by the WWWJDIC server to select examples of the usage of the words. In addition, users of the WWWJDIC server can search the Corpus using text strings in Japanese and/or English, and using regular expressions. WWWJDIC users could also submit corrections to sentences via a WWW feedback form. Several thousand corrections were submitted this way.

More information on the WWWJDIC use of the corpus is in the documentation.

To see an example of the sentence linking, here is the sentence display for 大学生. There is also a function for browsing the sentences.

Current Format (WWWJDIC)

As mentioned above, the Tanaka Corpus is now being maintained within the Tatoeba Project. The Japanese-English pairs are extracted weekly and formatted for use by WWWJDIC in the format below. The file is in text format, with the Japanese in EUC-JP coding (UTF-8 encoding is also available).
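
As a rough illustration, a gzip-compressed copy of the file could be read in Python along the following lines. The function name `read_examples` and the path passed to it are hypothetical; `euc_jp` is Python's codec name for EUC-JP, and `utf_8` would be used for the UTF-8 version.

```python
import gzip


def read_examples(path, encoding="euc_jp"):
    """Yield decoded text lines from a gzip-compressed examples file."""
    # "rt" opens the archive in text mode so each line is decoded for us.
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")
```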

The format is as follows:

  1. the file consists of pairs of lines, beginning with "A: " and "B: " respectively. There may also be comment lines which begin with a "#".
  2. the "A:" lines contain the Japanese sentence and the English translation, separated by a TAB character. At the end of the English translation is a sequence number identifying the sentence pair. It is in the format: #ID=nnnnnn. This sequence number is the identifier of the English sentence in the Tatoeba Project.
  3. the "B:" lines contain a space-delimited list of Japanese words found in the preceding sentence.
  4. the Japanese words in the "B:" lines can have the following appended:
    1. a reading in hiragana. This is to resolve cases where the word can be read different ways. WWWJDIC uses this to ensure that only the appropriate sentences are linked. The reading is in "round" parentheses.
    2. a sense number. This occurs when the word has multiple senses in the EDICT file, and indicates which sense applies in the sentence. WWWJDIC displays these numbers. The sense number is in "square" parentheses.
    3. the form in which the word appears in the sentence. This will differ from the indexing word if it has been inflected, for example. This field is in "curly" parentheses.
    4. a "~" character to indicate that the sentence pair is a good and checked example of the usage of the word. Words are marked to enable appropriate sentences to be selected by dictionary software. Typically only one instance per sense of a word will be marked. The WWWJDIC server displays these sentences below the display of the related dictionary entry.

The following example pair illustrates the format:

A: その家はかなりぼろ屋になっている。[TAB]The house is quite run down.#ID=25507

B: 其の{その} 家(いえ)[1] は 可也{かなり} ぼろ屋[1]~ になる[1]{になっている}
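
The format above can be parsed with a short Python sketch such as the following. It is not the code used by WWWJDIC. The names `IndexWord`, `parse_a_line` and `parse_b_line` are hypothetical, and the sketch assumes that the optional fields always appear in the order seen in the example (reading, sense number, sentence form, then "~") and that "[TAB]" in the example stands for a real TAB character.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class IndexWord:
    headword: str
    reading: Optional[str] = None   # from (...)
    sense: Optional[int] = None     # from [...]
    form: Optional[str] = None      # from {...}
    checked: bool = False           # trailing "~"


# Assumed field order: headword (reading) [sense] {form} ~
TOKEN = re.compile(
    r"^(?P<headword>[^(\[{~]+)"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<form>[^}]*)\})?"
    r"(?P<checked>~)?$"
)


def parse_a_line(line: str) -> Tuple[str, str, Optional[str]]:
    """Split an "A: " line into (japanese, english, id)."""
    body = line[len("A: "):]
    japanese, english = body.split("\t", 1)
    ident = None
    if "#ID=" in english:
        # The ID is kept as a string; its exact form is not assumed.
        english, _, ident = english.rpartition("#ID=")
    return japanese, english, ident


def parse_b_line(line: str) -> List[IndexWord]:
    """Split a "B: " line into its space-delimited index words."""
    words = []
    for token in line[len("B: "):].split():
        match = TOKEN.match(token)
        if match is None:
            raise ValueError(f"unrecognised index token: {token!r}")
        sense = match.group("sense")
        words.append(IndexWord(
            headword=match.group("headword"),
            reading=match.group("reading"),
            sense=int(sense) if sense else None,
            form=match.group("form"),
            checked=match.group("checked") is not None,
        ))
    return words
```

Applied to the example pair, `parse_b_line` returns six entries; the fifth (ぼろ屋) has sense 1 and is flagged as checked, and the last (になる) carries the sentence form になっている.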

Subset

An automatically-generated subset of the edited corpus is also available. The subset contains those sentences with one or more words marked with a "~" (see above.)
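
The selection rule can be sketched in Python as follows. This is only an illustration of the rule, not the script that generates the subset; the function name `select_checked_pairs` is hypothetical, and the sketch assumes that the "~" marker is the last character of an index word, as in the example above.

```python
def select_checked_pairs(lines):
    """Yield (a_line, b_line) pairs whose B line has a word marked "~"."""
    a_line = None
    for line in lines:
        if line.startswith("#"):
            continue  # comment line
        if line.startswith("A: "):
            a_line = line
        elif line.startswith("B: ") and a_line is not None:
            tokens = line[len("B: "):].split()
            if any(token.endswith("~") for token in tokens):
                yield a_line, line
            a_line = None
```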

Tatoeba Project

The Corpus is now maintained within the Tatoeba Project. This project has extended the file to include many other languages, and many sentences are available in three or more languages. The project WWW site has extensive facilities for searching and editing the sentences, and has an active community of people entering and editing sentences.

Copyright Issues

Professor Tanaka originally placed the Corpus in the Public Domain, and that status was maintained for the versions used by WWWJDIC. In late 2009 the Tatoeba Project decided to move it to a Creative Commons CC-BY licence (that project is in France, where the concept of public domain is not part of the legal framework.) It can be freely downloaded and used provided the source is attributed.

Downloads

The original file is available from

  • ftp://ftp.monash.edu.au/pub/nihongo/tanakacorp_utf8.gz (in UTF8 coding) or
  • ftp://ftp.monash.edu.au/pub/nihongo/tanakacorp_euc.gz (in EUC-JP coding).

Please do not use these versions in projects.

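The edited version used in the WWWJDIC server can be downloaded from:

  • complete version (EUC-JP): http://www.csse.monash.edu.au/~jwb/examples.gz
    This is the current file being used by the Monash WWWJDIC server. Each time it is updated, a date-stamp (http://www.csse.monash.edu.au/~jwb/examples_date) is set.
  • complete version (UTF-8): http://www.csse.monash.edu.au/~jwb/examples.utf.gz
  • subset (EUC-JP): http://www.csse.monash.edu.au/~jwb/examples_s.gz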

Downloads can also be made from the Tatoeba database.