[Note: This paper was written for the Papillon workshop in Sapporo in 2003. It has been slightly updated since then to reflect modifications in the operation of the examples in the WWWJDIC server, and to correct some out-of-date URLs.]
This paper describes a project in which the Tanaka corpus of matched Japanese-English sentence pairs has been linked to the WWWJDIC online Japanese-English dictionary. It describes the linking process in detail, analyzes the word coverage, and reports on the editing of the corpus to remove some of the errors it contains. The paper concludes that the Tanaka corpus can successfully provide a source of example sentences for a Japanese-English dictionary.
The practice of incorporating sentences or sentence fragments as part of a dictionary entry appears to have originated with Latin and Greek dictionaries compiled in the 16th and 17th centuries, where such passages served as citations from classical texts establishing the provenance of the words. The incorporation of such citations was adopted in later major English dictionaries by lexicographers such as Johnson and Webster, and is now regarded as an essential feature of authoritative mono-lingual dictionaries.
The development of comprehensive bilingual dictionaries from the mid-19th century, and more recently mono-lingual "learner's dictionaries", extended this practice to include selected or composed examples illustrating the usage of the words. Such examples are considered to be an essential component of such dictionaries. In one English-Japanese dictionary [1] in the author's possession, the body of each entry consists entirely of parallel English and Japanese sentences utilizing the headwords.
The development of extensive electronic corpora such as COBUILD [2] and BNC [3] has brought corpus linguistics to a prominent position in lexicography. In the context of learners' or bilingual dictionaries, such corpora tend to be used as an aid to the construction of examples, rather than as a direct source. Landau [4] comments that "What a corpus can do above all else - even when it cannot provide verbatim examples that can be used in a dictionary - is to give examples at the right level of complexity and in a framework that is typical so that the lexicographer can devise examples that are not silly, stilted, or clearly artificial." One of the editors of Taishukan's Unabridged Genius English-Japanese Dictionary, Kosei Minamide [5], writing about corpora and the examples used in that dictionary, states "Such corpora is (sic) liable to drown us in data", and adds "Because of the complicated problems concerning copyright and the extreme difficulty of finding entirely suitable examples in the corpus, we had most of the illustrative examples invented by native speakers."
There are no reported cases of electronic corpora being used directly for the provision of dictionary examples. The difficulty of using such corpora for this purpose can be seen from examination of some of the text samples from the online COBUILD collection for the word swimming:
against Douglas Stern, Doug Stern's Swimming Clinic Inc., the United States no-touch sex with clothes on [p] swimming - especially nude [p] smiling [p] induced cloud or magical blackness swimming in the air; it was simply likely to keep busy playing games, swimming, jeeping, or making crafts such as historic feat of winning Olympic swimming medals 12 years apart. Janet Silk Cup Derby (Hickstead) 1435b Swimming: National Champs & Euro Trials ( End. The quieter spots and the best swimming on one-mile Long Bay beach are at suitable physical exercise such as swimming or cycling. He will find that any in such a way that you feel you are swimming outdoors in an open-air pavilion. sun-splashed conservatory - even a swimming pool. An unforgettably exotic or and telephone. There's an indoor swimming-pool, sauna, solarium and
Clearly only one or two extracts in this sample contain useful material for example sentences, and in both cases some rewriting would be appropriate. It is only one sample, but it supports the views of Landau and Minamide.
When the author began compiling a Japanese-English dictionary file as part of the EDICT [6] project in 1991, there were immediate calls from users of the file and software for example sentences to be associated with the dictionary entries. The initial dictionary format file did not readily allow for the inclusion of such examples, so a structure for such examples was implemented, involving a simple marker in the text of the English translation which indicated the availability of further explanatory information and examples in a linked adjunct file. As the early stages of the EDICT project benefited from considerable voluntary effort, a call was made for the preparation and submission of examples and other explanatory material. None was forthcoming; it appeared that while the user community had sufficient interest and enthusiasm to submit lexical material, preparation of examples was not such a high priority.
In 1999 the JMdict project, which involved an expanded dictionary structure, was launched. From the beginning of the project it was intended to incorporate example sentences within entries, with elements reserved in the DTD for this purpose.
As reported at the PACLING2001 conference in a paper on the compilation of multilingual corpora [7], Professor Yasuhito Tanaka at Hyogo University had assembled over several years a collection of over 200,000 Japanese-English sentence pairs. The technique he employed was to encourage a number of students each to enter approximately 300 items, drawn from instructional texts and other available sources. The resulting corpus, which he stated was in need of considerable editing, was placed in the Public Domain. At the 2002 Papillon Workshop, Professor Christian Boitet provided a copy of the corpus to participants, with a view to it possibly being used as the foundation for a set of examples within the Papillon dictionary project.
The author examined the corpus and concluded that it did indeed have excellent potential for providing such examples, but that it also had a large number of errors which would need eventual correction. It was decided to conduct a trial in which the corpus would be used to provide usage examples for entries in the author's WWW Japanese-English dictionary server (WWWJDIC). [8]
The broad purpose of the trial was:
As provided, the corpus was a text file with alternating Japanese and English sentences. After code conversion, the sentence pairs were aggregated into tab-delimited single lines to aid sorting and inspection. It was immediately apparent that there were a large number of duplicate or near-duplicate pairs, differing only by such things as punctuation, or spelling errors in the English portion.
After some simple harmonization of the punctuation, mainly consisting of ensuring that the punctuation in the Japanese sentences used "JIS" characters, and in the English sentences used ASCII characters, occurrences of examples which duplicated another example with regard to the Japanese sentence were removed. Whilst this may on occasions have removed an example with a "correct" English sentence in favour of an incorrect sentence, it was considered that this could eventually be corrected at a later stage.
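The harmonization and de-duplication step described above can be sketched roughly as follows. This is a minimal illustration rather than the code actually used: the punctuation mapping and the function names are assumptions, and a real mapping would need more care (for example, a full stop inside a number in a Japanese sentence should not be converted).

```python
# Hypothetical punctuation maps; the marks actually harmonized are an assumption.
ASCII_TO_JIS = str.maketrans({".": "。", ",": "、", "?": "？", "!": "！"})
JIS_TO_ASCII = str.maketrans({"。": ".", "、": ",", "？": "?", "！": "!"})


def harmonize(japanese, english):
    """Use JIS punctuation in the Japanese sentence and ASCII in the English one."""
    return (japanese.strip().translate(ASCII_TO_JIS),
            english.strip().translate(JIS_TO_ASCII))


def dedupe_by_japanese(pairs):
    """Keep the first pair seen for each (harmonized) Japanese sentence."""
    seen = set()
    kept = []
    for japanese, english in pairs:
        japanese, english = harmonize(japanese, english)
        if japanese in seen:
            continue  # duplicate Japanese sentence; its English side is discarded
        seen.add(japanese)
        kept.append((japanese, english))
    return kept


pairs = [
    ("なぜいけないの.", "Why not?"),
    ("なぜいけないの。", "Why not？"),
    ("すごいよ。", "It's incredible."),
]
for japanese, english in dedupe_by_japanese(pairs):
    print(japanese + "\t" + english)
```

Because the first pair encountered is the one retained, a pair with a faulty English sentence can survive in place of a correct one, which is the trade-off noted above.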
The removal of this type of duplicated example reduced the file from an initial 203,000 sentence pairs to approximately 183,000. Further inspection at this stage revealed that a considerable number of errors and near-duplicates remained; however, the file was considered to be in a state that permitted at least a trial of its use as a source of example sentences for a dictionary. Further editing could, and did, take place in parallel with the implementation of the dictionary association.
The process of associating example sentences with dictionary entries, had it followed the same approach as with printed dictionaries (which was also the approach allowed for in the JMdict data structure), would have meant selecting one or two sentence pairs for each of approximately 20,000 words, and embedding them in the appropriate part of the dictionary database. This approach clearly has a number of problems:
Instead, an approach was adopted that achieved the same effect, i.e. the association of examples with dictionary entries, but which avoided the problems outlined above. The approach involved:
Given the size of the file, it was not considered efficient to search it each time a link was required. Also the fact that many of the words involved were verbs, adjectives, etc., which often appeared in the sentences in inflected forms, would greatly complicate such a search. In order to expedite the linking process, it was decided to pre-process the sentences to identify the target words within the examples that could be used to attract links from the dictionary entries. Thus each example in the corpus would be extended so that it consisted of the triplet: (Japanese-sentence, English-sentence, word-list).
The extraction of the words in each sentence was carried out initially using the Chasen [9] morphological analyzer from Nara Institute of Science and Technology (NAIST). Each sentence was passed through the Chasen program, and the extracted words which contained at least one kanji were retained. Using a package such as Chasen had the advantage of bringing most of the inflected forms of words back to the plain (dictionary) form, and also of accurately segmenting the text so that trailing okurigana, etc. were retained. Extracted kana-only words were not retained initially, as for the most part they consisted of particles, conjunctions, etc. which have little relevance to the dictionary entries. It was recognized that a number of words which are always or often written with kana could end up being overlooked, but it was considered that they could be revisited at a later stage. (See section 7 below.)
An additional analysis was carried out to extract all sequences of katakana from the sentences, on the assumption that these would typically be loan-words.
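The filtering applied to the analyzer output can be illustrated with a short sketch. It does not call Chasen itself: the list of dictionary-form words is assumed to have been produced already by the morphological analyzer, and the function names and Unicode ranges are illustrative assumptions.

```python
import re

# Basic CJK ideograph block; treated here as "kanji" (an assumption).
KANJI = re.compile(r"[\u4e00-\u9fff]")
# Runs of katakana letters plus the prolonged sound mark (ー).
KATAKANA_RUN = re.compile(r"[\u30a1-\u30fa\u30fc]+")


def index_words(sentence, analyzer_words):
    """Keep analyzer words containing at least one kanji, then add katakana runs."""
    words = [w for w in analyzer_words if KANJI.search(w)]
    for run in KATAKANA_RUN.findall(sentence):
        if run not in words:
            words.append(run)
    return words


def make_triplet(japanese, english, analyzer_words):
    """Build the (Japanese-sentence, English-sentence, word-list) triplet."""
    return (japanese, english, index_words(japanese, analyzer_words))


# analyzer_words stands in for the dictionary forms returned by the analyzer.
triplet = make_triplet(
    "ドアを開ける。",
    "Open the door.",
    ["ドア", "を", "開ける"],
)
print(triplet)
```

Kana-only words such as particles are dropped by the kanji test, which is why words normally written in kana need the separate treatment discussed later.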
The examples in the extended corpus were thus converted into the following format:
The process described above identified approximately 660,000 word occurrences in the 177,500 sentences in the corpus at the time of writing, i.e. a mean of 3.7 words per sentence. In total approximately 23,000 unique words were identified. Of the unique words, approximately 3,500 do not occur as head-words in the JMdict/EDICT dictionary files. On inspection these words are for the most part proper names or verbs in the potential form (see below).
The frequency distribution of words is given in the following table.
No. of occurrences | No. of words | Examples |
1 | 8,025 | アーカイブ, グラム, 愛憎, 学外 |
2 | 3,132 | アンコール, レシート, 塩梅, 信託 |
3 | 1,803 | エラー, ブーム, 区役所, 標本 |
4 | 1,243 | キロメートル, バスタオル, 色白, 抜歯 |
5 | 828 | カトリック, テレビ局, 加工, 行く手 |
6-10 | 2,393 | ヒーター, 無用, チキン, 馬車 |
11-20 | 1,884 | パンフレット, 亡くす, 国籍, 服従 |
21-30 | 748 | トマト, レッスン, 乗り換える, 火災 |
31-100 | 1,653 | アドバイス, 稼ぐ, 協力, 前もって |
101-500 | 943 | ゲーム, 案内, 殺人, 事業 |
501-1000 | 109 | ドア, 美しい, 息子, 降る |
1000+ | 67 | 手紙, 電話, 問題, 彼女 |
A significant number of the infrequently-used words are proper names.
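A distribution of this kind can be derived from the word lists with a few lines of code. The sketch below is illustrative only; the function name is hypothetical, and placing a word with exactly 1,000 occurrences in the "501-1000" band is an assumption about how the bands were drawn.

```python
from collections import Counter

# (label, lowest count, highest count); None means no upper limit.
BANDS = [
    ("1", 1, 1), ("2", 2, 2), ("3", 3, 3), ("4", 4, 4), ("5", 5, 5),
    ("6-10", 6, 10), ("11-20", 11, 20), ("21-30", 21, 30),
    ("31-100", 31, 100), ("101-500", 101, 500),
    ("501-1000", 501, 1000), ("1000+", 1001, None),
]


def frequency_distribution(word_lists):
    """Count how many unique index words fall into each occurrence band."""
    occurrences = Counter(word for words in word_lists for word in words)
    table = Counter()
    for count in occurrences.values():
        for label, low, high in BANDS:
            if count >= low and (high is None or count <= high):
                table[label] += 1
                break
    return table


# One word list per example sentence.
word_lists = [["私", "経済"], ["私", "両親"], ["私"]]
for label, _, _ in BANDS:
    print(label, frequency_distribution(word_lists)[label])
```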
To enable the association of the example sentences with dictionary software, and the subsequent display of examples for a given word, an ancillary word-sentence index file was created and inverted. For example for the word 加工 it contains:
The integration of the corpus into the WWWJDIC server was carried out as follows:
It is apparent that there are a number of problems with the Tanaka corpus that need to be considered when using it as a source of dictionary-related examples.
The first is that a number of the sentences are short interjections, proverbs, quotations, aphorisms, etc. which, while they are of interest, are not necessarily useful for showing typical usage of words. Some examples of these are:
Fortunately the nature of most of these is obvious, but at some stage it may be useful to tag them as quotations, etc. lest the incautious learner be misled by them.
A more serious problem is presented by the presence in the sentence pairs of:
Errors in the Japanese sentences often comprise:
AIは人口知能の略です。 AI stands for artificial intelligence.
(人口 is incorrect. The word should be 人工 - also pronounced じんこう)
DNAのサンプルを畜えることは許されるべきではない。 Storing DNA samples should not be permitted.
DNAのサンプルを蓄えることは許されるべきではない。 Storing DNA samples should not be permitted.
(畜える is clearly a typing error for 蓄える)
Many of these are being detected in near-duplicate sentences, and others are being found as well. The file will obviously benefit from being thoroughly proof-read by Japanese native speakers.
Errors in the English sentences, apart from the mistranslations discussed below, largely consist of spelling errors, faulty capitalization, and incorrect punctuation. Many of the spelling errors were removed by carrying out a spell-check on the file; however, cases are still being found where incorrect words have been used.
コップが地面に落ちて砕けた。 The glass clashed to the ground.
Again, a thorough proof-reading would be required to detect and remove all these errors.
While a large number of duplications have been removed, a considerable number remain. These are due to such things as:
部屋は兎小屋みたいだけど。 But my place is like a rabbit hutch.
きつねの尾はウサギのより長い。 The tail of a fox is longer than that of a rabbit.
すごいよ。 It's incredible.
すごいわ。 That's wonderful.
なぜいけないか。 Why not?
なぜいけないの。 Why not?
In extreme cases sets of up to 130 such near-duplicate sentences have been detected.
Many of these cases can be detected by scanning the file with the sentences sorted by either the Japanese or English sentence. The approach being adopted is to eliminate the punctuation variations, retain the sentence form which makes the greatest use of kanji (as this will lead to more examples being available) and aim for a mix of register types, emphases, etc. across the sentence collection.
Another approach is to use a measure of similarity between sentences, based on the list of Japanese words in each sentence. Sentences with identical word-sets would be candidates for examination and possible reduction.
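A minimal sketch of this idea follows. Grouping by identical word-sets is taken directly from the description above; the Jaccard coefficient is added only as one possible graded measure and is an assumption, as are the function names and the illustrative word lists.

```python
from collections import defaultdict


def identical_word_set_groups(examples):
    """Group Japanese sentences whose index word-sets are identical."""
    groups = defaultdict(list)
    for japanese, _english, words in examples:
        groups[frozenset(words)].append(japanese)
    # Only groups with more than one sentence are candidates for examination.
    return [group for group in groups.values() if len(group) > 1]


def jaccard(words_a, words_b):
    """Shared words divided by all words; 1.0 means identical word-sets."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# The word lists here are illustrative, not taken from the actual index file.
examples = [
    ("すごいよ。", "It's incredible.", ["凄い"]),
    ("すごいわ。", "That's wonderful.", ["凄い"]),
    ("ドアを開ける。", "Open the door.", ["開ける", "ドア"]),
]
print(identical_word_set_groups(examples))
```

Sentences grouped this way would still need manual inspection, since two sentences can share a word-set and yet differ usefully in register or emphasis.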
Editing of the sentence collection is still being carried out, with approximately 6,000 near-duplicates removed so far.
In a number of sentences the English component has obviously been derived from a literal translation of the Japanese, with little regard to the validity of the result:
おまえを失いかけた時、俺は自分の汚れた心を見た。 When I was gonna be losing you on my mind found my heart in soil.
In other cases, the English, while correct grammatically, does not mean the same as the Japanese, perhaps because there has been some truncation:
As mentioned above, the Chasen morphological analysis package was used to extract target words from the Japanese sentences. In general this process was carried out very successfully; however, in a number of cases either incorrect or inappropriate segmentation of the text occurred.
私は経済的に両親からひとり立ちしている。 I am economically independent of my parents.
私 経済 的 両親 ひとり立ち
In general this is not a major problem, as the components themselves are usually dictionary entries; however, it is appropriate to aggregate at least some of these when they are detected, in order to increase the number of usable examples. In practice a relatively simple solution was available. Adjacent pairs of words in the indices were examined to determine whether the pair was both contiguous in the associated sentence and present, as a joined form, as a headword in the dictionary file. If both conditions were met, the pair of words was joined. This process resulted in 10,600 word pairs being joined, giving approximately 2,800 new unique index words.
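The pair-joining check can be sketched as below, using the 経済/的 example shown earlier. The function name and the small headword set are hypothetical; the contiguity test is a simple substring check rather than a position-aware one, and replacing the two components with the joined form (rather than keeping all three) is an assumption.

```python
def join_adjacent(sentence, words, headwords):
    """Join adjacent index words when the joined form is contiguous in the
    sentence and is itself a dictionary headword."""
    joined = []
    i = 0
    while i < len(words):
        if i + 1 < len(words):
            pair = words[i] + words[i + 1]
            if pair in sentence and pair in headwords:
                joined.append(pair)
                i += 2  # both components are consumed by the joined word
                continue
        joined.append(words[i])
        i += 1
    return joined


# Hypothetical stand-in for the set of headwords in the dictionary file.
headwords = {"私", "経済", "的", "経済的", "両親", "ひとり立ち"}
print(join_adjacent(
    "私は経済的に両親からひとり立ちしている。",
    ["私", "経済", "的", "両親", "ひとり立ち"],
    headwords,
))
```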
These will need to be identified and corrected, as the JMdict/EDICT dictionary files, in common with most printed dictionaries, do not usually carry these inflected forms as separate entries.
As discussed above, kana-only words were excluded from the original analysis, as it would have been difficult to separate useful words from conjunctions, particles, etc. However, a number of common words are usually written in kana alone, and it would be useful to be able to associate examples with them.
As the linkage employed in WWWJDIC uses the initial field in an entry, which usually has the kanji form of the word, the approach that has been followed is to add that form to the word list, leaving the kana form in the sentence. Several hundred such words, such as 一寸 (ちょっと) and 迚も (とても), as well as many words usually written only in kana, such as ずっと, けど, どうぞ and どうやって, were added by hand.
It became clear that identifying and adding kana-only words to the index lines by hand was quite time-consuming and inefficient. As foreshadowed in section 5, the issue was revisited with a programmed extension of the indices to include such words. The process applied was:
As a result of this process, the number of unique words covered by the indices was increased by about 1,300, and approximately 70,000 index words in total were added to the indices.
Direct association of dictionary head-words with example sentences containing those words does not immediately cater for situations where, for example, a single gairaigo (loanword) has multiple meanings, or where a word has more than one sense. For example, チップ can mean both "chip" and "tip", resulting in the following example sentences being selected:
Similarly, for お嬢さん, which can mean both "(your) daughter" and "young lady", we see:
The current approach to this problem is to append the sense-number to the index words in the examples file in the cases where more than one sense exists. Thus in the cases mentioned above, the full set of example sentences will be identified and displayed, but the senses will be stated. A more ideal solution, reserved for future implementation, is to allow the dictionary user to choose the sense for which example sentences are sought. (At the time of writing the file is being edited to add the sense numbers.)
The following example for the word 汚す, which has the senses of "to disgrace" and "to dirty", illustrates the approach taken.
There is also a potential problem with homonyms. For example, the kanji: 略 can be used to write two different words: ほぼ (almost; roughly; approximately), and りゃく (abbreviation; omission). There are different dictionary entries for these words, but without some special treatment, both would link to the same set of example sentences, as the link is normally based on the kanji headword. E.g.
Although such cases of true homonymy are relatively rare in Japanese, a solution, such as the extension of the indices to include reading, is required to avoid confusion. The approach that has been adopted is to allow for the appending of the reading to the index word, so that the linked example sentences only apply to the correct word. Thus for the sentence above, the index list comprises: 何 略(りゃく) 知る ユネスコ
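A reader of the index list needs to separate the optional reading from the index word. The following sketch shows one way this might be done for the format shown above; the function names are hypothetical, and the handling of sense numbers is not covered because their notation is not shown here.

```python
import re

# An index word, optionally followed by a reading in ASCII parentheses.
INDEX_WORD = re.compile(r"^(?P<word>[^()]+)(?:\((?P<reading>[^()]+)\))?$")


def parse_index_list(index_list):
    """Split a space-separated index list into (word, reading-or-None) pairs."""
    parsed = []
    for token in index_list.split():
        match = INDEX_WORD.match(token)
        if match is None:
            raise ValueError("unrecognized index word: " + token)
        parsed.append((match.group("word"), match.group("reading")))
    return parsed


def links_to(index_word, headword, reading):
    """True if the index word should link to the entry (headword, reading)."""
    word, required_reading = index_word
    return word == headword and required_reading in (None, reading)


index = parse_index_list("何 略(りゃく) 知る ユネスコ")
print(index)
print(links_to(index[1], "略", "りゃく"))
print(links_to(index[1], "略", "ほぼ"))
```

An index word without a reading links to every entry sharing that headword, so only the sentences containing genuinely ambiguous words need to be marked.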
A number of example sentences which include homonymous words have been manually identified and marked in this way to prevent misleading linkages occurring.
The file size of the corpus, including indices, is at present over 18 MB. While this is not a problem for server systems, there has been interest in having a smaller version of the corpus for use with PDA-based dictionary software.
While it would be possible to make a manual selection which included representative examples of words, this would be a major task, and would also have the disadvantage of breaking up the corpus at a time when it is still being edited. Ideally there should be a technique available which can automatically extract a suitable subset from the full corpus at any time.
One approach is to use the measure of similarity mentioned above to cull near-duplicates. As an interim step, a simple heuristic was trialled to determine the potential effectiveness of such an automatic subset generation.
The steps in the heuristic are:
This process resulted in a collection of 44,800 sentence pairs, i.e. a little over 25% of the size of the full file. From inspection it appears to have a reasonable coverage of the more common words, and as intended has a full coverage of the less common words. It is interesting to note that changing some of the parameters of the heuristic does not significantly alter the outcomes. For example, raising the occurrence threshold from 5 to 10 increases the size of the subset file by about 20%.
Removing proper names from the lists of index words is likely to reduce the size of the subset file.
The initial integration of the Tanaka corpus into the WWWJDIC server took place in August/September 2002, and has been operational since then. Minor revisions have been made since then, e.g. the introduction of a random selection of sentences as the initial display. An option which allows users to submit comments and corrections via a feedback form is about to be released.
Editing of the corpus has continued since its integration into the server, and the main server and its mirror sites have their files updated approximately weekly. The complete corpus with associated index words is available for download from the Monash site. The subset file is under consideration for inclusion in the dictionary module of the popular JWPce package.
Feedback from the WWWJDIC user community has been very positive, with many responses that the example sentences are very useful in the study of Japanese. The errors in the file do not appear to be causing undue difficulty; in fact, they seem to be more often a source of amusement.
The examples displayed for a selection of words were compared with those in several printed Japanese-English dictionaries. In general the results were comparable. The WWWJDIC/Tanaka case did not usually provide examples for as many words; however, for some words it provided a much wider choice of examples. The following sets from WWWJDIC and the recent Sanseido "Grand Concise" (a Japanese-English dictionary designed for the domestic Japanese market) for the word 裏側/うらがわ - the reverse; other side; lining, illustrate a typical comparison.
Grand Concise
月の裏側 the back [hidden] side of the moon
封筒の裏側 the reverse side of an envelope
人生の裏側をのぞく get a peep of life on the seamy side
家の裏側へ回ってください Please come around to the back of the house
There have been a number of requests for the inclusion of examples using words which are not currently in the corpus. A small number of sentences have been added, and consideration is being given to a subsequent project to identify missing common words and extract suitable sentences from available corpora.
This paper describes a project in which the Tanaka corpus of matched Japanese-English sentence pairs has been linked to an online Japanese-English dictionary. It has demonstrated that the corpus is capable of serving very well as the basis for example sentences in an electronic dictionary, and has indicated several avenues for improving and expanding the corpus.
The project has also demonstrated the viability and advantages of the approach of maintaining the example corpus as a separate entity from the lexicon, and only linking the two at the time of displaying an extended entry.