A WWW JAPANESE DICTIONARY

J.W. Breen
School of Computer Science & Software Engineering
Monash University.

1. Introduction

Since 1991, the author has been engaged in the EDICT (Electronic DICTionary) project to develop a series of computer-based Japanese-English dictionaries, capable of being used both as traditional dictionaries and as semi-automated aids for reading Japanese text. The main EDICT glossary file now has over 95,000 entries, and has been joined by subject-specific files covering bio-medical terminology, legal terms, computing, telecommunications, business, etc., as well as a proper names file with 350,000 entries and a kanji database covering over 12,000 kanji. A variety of software packages have been released for use on a number of computer systems, and the files are used within several free or shareware Japanese word-processor systems. The files, which have also been used in a number of natural-language processing (NLP) and machine translation (MT) projects, are all available free of charge.

The development of the World-Wide Web as an information retrieval system on the Internet in 1993 opened the possibility of providing a comprehensive dictionary facility from a small number of servers. The facilities within the WWW to combine server-based software with text input from almost any browser have meant that an identical service can be provided regardless of the user's type of computer. Complex software distribution and installation are also avoided, and the central lexicographical databases can be expanded and the services enhanced without requiring software changes on the part of users.

A large number of WWW-based dictionaries have become available in the last decade, covering most of the world's major languages. Sites such as yourDictionary list several hundred such servers. Many dictionary servers emulate traditional published dictionaries in that entries can only be accessed using the specified head-words.

The first WWW-based dictionary using the EDICT files began operating in 1993, and since then approximately 10 different server systems have been developed to use these files. In addition there are several servers based on other Japanese dictionaries in electronic form, the most notable being the oddly-named Goo servers operated by NTT in Japan, and based on a set of dictionaries published by Sanseido.

This chapter describes the dictionary and related services provided by the author's WWWJDIC server which operates at Monash University, and from mirror servers in the USA, Canada, Poland, Germany and Japan. This server was initially designed to provide an integrated word and character dictionary, and related services such as text glossing. It has been extended to incorporate a number of additional facilities of assistance to students of Japanese.

2. Integration of Japanese Dictionaries

As with other languages, Japanese dictionaries, whether monolingual or bilingual, make use of ordered head-words to assemble and organize entries. In Japanese dictionaries the head-words are written either in the hiragana and katakana syllabaries, or in the case of some dictionaries intended for non-native speakers, in romanized Japanese.

In addition, the use of kanji characters necessitates the use of special character dictionaries which contain information about each kanji, plus a selection of words using that kanji. These dictionaries are ordered using some identifiable aspect of the individual kanji, such as a radical component shape and the count of strokes. A student of Japanese using dictionaries to assist with reading a text will typically have to switch between the two forms of dictionary in order to determine the meaning of a new word. This is often found by students to be a time-consuming and frustrating task.

The availability of dictionaries in file form and a database of kanji information enables a Japanese dictionary package to integrate them so that a user can move easily between the two. Also, dictionary packages with appropriate indexing are not limited to a single headword per entry, as is the case with printed dictionaries. Thus an entry in a Japanese dictionary could be accessed by its pronunciation, its full written form using kanji, or potentially by any kanji used to write the word.
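
The idea of reaching one entry through several keys can be sketched as follows. This is a minimal illustration only, not the author's implementation: the sample entries, the function names (is_kanji, build_index, lookup) and the use of an in-memory table are all assumptions made for the example, and the kanji test is a rough approximation based on a single Unicode block.

```python
from collections import defaultdict

# Hypothetical sample entries: (written form, reading, English gloss).
ENTRIES = [
    ("工場", "こうじょう", "factory"),
    ("向上", "こうじょう", "improvement"),
    ("工事", "こうじ", "construction work"),
]


def is_kanji(ch):
    # Approximation for this sketch: the CJK Unified Ideographs block only.
    return "\u4e00" <= ch <= "\u9fff"


def build_index(entries):
    """Map every access key to the set of entry numbers it can retrieve."""
    index = defaultdict(set)
    for number, (written, reading, _gloss) in enumerate(entries):
        index[written].add(number)      # full written form
        index[reading].add(number)      # pronunciation in kana
        for ch in written:
            if is_kanji(ch):
                index[ch].add(number)   # any kanji used to write the word
    return index


def lookup(index, entries, key):
    """Return the entries reachable from a key, in file order."""
    return [entries[n] for n in sorted(index.get(key, ()))]


INDEX = build_index(ENTRIES)
print(lookup(INDEX, ENTRIES, "こうじょう"))  # both homophones
print(lookup(INDEX, ENTRIES, "工"))          # every word written with 工
```

A printed dictionary has to choose one of these keys as its ordering; a file with an index of this kind does not.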

The author pioneered the integration of word and character Japanese electronic dictionaries with the release of the JDIC package for DOS computers in 1991. This was the beginning of a series of packages for a number of computer platforms which employed similar integration techniques.

3. Facilities

The WWWJDIC server provides the following facilities:

  1. a keyword search in one of the thirteen lexicographic files currently available. Each entry in the file typically consists of a jukugo (熟語: word or phrase written with several kanji), its reading in kana, and a short English gloss. The keywords entered in the search can either be in Japanese or in English. In the case of a Japanese keyword, it can be in kanji and kana, entered using an IME (Input Method Editor) or cut and paste from another screen or program, or entered in romaji. Figure i shows an example of a typical word search in WWWJDIC.

    fig1.gif
    Figure i: WWWJDIC result when searching for こうじょう.

    By following the links at the end of the display of each entry, the user can carry out a number of additional searches, e.g. in another dictionary or via a search engine, or as described below, view a selection of sample sentences using the word, or view a table of conjugations generated for each verb.

  2. a kanji selection facility, in which kanji can be identified by a wide variety of methods ranging from traditional bushu/stroke-count to coding systems such as Halpern's SKIP, De Roo codes, Four Corner etc. (1) Kanji readings and English senses can also be used. One novel feature is the classification of kanji according to their basic shape components, with kanji being identifiable by several components instead of a single bushu. Figure ii shows an example of the result of a kanji selection. The coded information after the kanji includes indexes into several dictionaries: Nelson, Halpern, Spahn & Hadamitzky, Morohashi, etc., as well as readings in Korean and Chinese.

    The ability to identify a kanji by a number of index methods is unique to dictionary software packages. Printed kanji dictionaries must use a primary indexing system for publication, and relegate other indices to appendices.

    In addition to providing information about a kanji, the server enables linking at the character level to other WWW-based databases of Japanese and Chinese characters.

    fig2.gif
    Figure ii: Kanji dictionary display for 番.

    An additional educational feature of the server is the facility to view a stroke-by-stroke animation of a kanji being written. Learning correct stroke order is considered an important element of kanji acquisition, and many instructional software packages include some support for this, often in the form of video clips. The WWWJDIC server uses animated images of approximately 2,000 kanji constructed by the author from the diagrams in the Kodansha "Kanji Learner's Dictionary" compiled by Jack Halpern. Mr Halpern kindly permitted the digitized version of the diagrams to be converted and used by the server.

    Figure iii shows the 番 kanji partially-written in the animated form.

    fig3.gif
    Figure iii: Animation of the writing of the Kanji 番.

  3. the capability for the user to move flexibly between the kanji-oriented and text-oriented dictionary files. For example, having identified a kanji, it is possible to retrieve entries in the dictionary files which contain that kanji, either in the first character position or in any position in a word. Similarly, it is possible to examine the details of any kanji from a retrieved dictionary entry. It is in this sense that the WWW dictionary is able to combine the features of both a Japanese-English/English-Japanese dictionary and a kanwa dictionary.

  4. the capability to annotate Japanese text with the English glosses of the words within it. The text can either be cut and pasted from another page or program, or can come from a selected WWW page. Figure iv shows an example of this facility. This is a major feature of the WWWJDIC server and is described in the following section.

4. Text Glossing

The ability to use dictionary files to gloss text is a powerful adjunct to computerized dictionaries. The files of the EDICT project have often been used for this purpose, with earlier examples including the author's JREADER program, Hatasa & Henstock's AutoGloss/J Package, Yamamoto's Mailgloss system, Kitamura & Tera's DLink system, etc.

In carrying out a glossing of Japanese text, a degree of processing of the text must be carried out beforehand, in particular to segment the text into its lexemes and to convert the inflected forms of words into their dictionary forms. These tasks are non-trivial for Japanese text, and have led to the development of powerful morphological analysis software tools such as ChaSen and JUMAN. These tools are generally too large and slow to use within a WWW server, where a rapid response is essential.

With WWWJDIC a simpler approach to segmentation has been employed in which the text is scanned to identify in turn each sequence of characters beginning with either a katakana or a kanji. The dictionary is searched using each sequence as key, and if a match is made, the sequence is skipped and the scan continues. In addition, a small supplementary dictionary file of words and phrases typically written in hiragana is also used. Thus the dictionary file itself plays a major role in the segmentation of the text in parallel with the accumulation of the glosses. The technique cannot identify grammatical elements and some words written only in hiragana; however, it is quite successful with gairaigo and words written using kanji.
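
A scan of this kind might look like the following sketch. It is an illustration under stated assumptions rather than the WWWJDIC code: the two small word lists are invented, the longest-match-first rule and the MAX_LEN limit are assumptions (the text above says only that each sequence is used as a search key), and the character tests rely on Unicode blocks rather than on the EUC coding used by the server.

```python
# Hypothetical sample lexicons; the real files are far larger.
DICTIONARY = {
    "日本語": "Japanese language",
    "日本": "Japan",
    "辞書": "dictionary",
    "テキスト": "text",
}
HIRAGANA_WORDS = {"これから": "from now on"}  # supplementary hiragana file
MAX_LEN = 8  # assumed upper bound on the length of a candidate sequence


def is_kanji(ch):
    return "\u4e00" <= ch <= "\u9fff"


def is_katakana(ch):
    return "\u30a0" <= ch <= "\u30ff"


def is_hiragana(ch):
    return "\u3040" <= ch <= "\u309f"


def longest_match(text, pos, lexicon):
    """Try progressively shorter sequences starting at pos."""
    for length in range(min(MAX_LEN, len(text) - pos), 0, -1):
        candidate = text[pos:pos + length]
        if candidate in lexicon:
            return candidate
    return None


def gloss(text):
    """Segment and gloss in one pass, using the dictionary as the segmenter."""
    results = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if is_kanji(ch) or is_katakana(ch):
            lexicon = DICTIONARY
        elif is_hiragana(ch):
            lexicon = HIRAGANA_WORDS
        else:
            lexicon = {}
        match = longest_match(text, pos, lexicon)
        if match:
            results.append((match, lexicon[match]))
            pos += len(match)   # skip the matched sequence
        else:
            pos += 1            # no entry: move on one character
    return results


print(gloss("これから日本語のテキストを辞書で読む"))
```

In this sketch the particles の, を and で and the inflected verb 読む are passed over without a gloss, which mirrors the limitation on grammatical elements noted above; inflected forms need the separate treatment described next.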

A further element of preprocessing of text is required for inflected forms of words, as the dictionary files only carry the normal plain forms of verbs and adjectives. An inverse stemming technique previously employed in the author's JREADER program is used here, wherein each sequence which could potentially be an inflected verb or adjective, e.g. a kanji followed by two hiragana, is treated as a potential case of an inflected word. Using a table of inflections, a list of potential dictionary form words is created and tested against the dictionary file. If a match is found, it is accepted as the appropriate gloss. The table of inflections has over 300 entries and is encoded with the type of inflection which is reported with the gloss. Although quite simple, this technique has been extensively tested with Japanese text and correctly identifies inflected forms in over 95% of cases. (In Figure iv this can be seen where 思います has been identified as an inflection of 思う.)
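
The inverse stemming step can be illustrated with a very small table. The sketch below is an assumption-laden example, not the JREADER or WWWJDIC table: the six rules, their labels, the sample dictionary and the function name deinflect are invented for illustration, whereas the real table has over 300 entries.

```python
# Hypothetical inflection table: (inflected ending, plain ending, type).
INFLECTIONS = [
    ("います", "う", "polite present (godan verb ending in う)"),
    ("きます", "く", "polite present (godan verb ending in く)"),
    ("ました", "る", "polite past (ichidan verb)"),
    ("ます", "る", "polite present (ichidan verb)"),
    ("かった", "い", "past (i-adjective)"),
    ("くない", "い", "negative (i-adjective)"),
]

# Hypothetical dictionary holding plain forms only.
DICTIONARY = {
    "思う": "to think",
    "書く": "to write",
    "食べる": "to eat",
    "高い": "high; expensive",
}


def deinflect(word):
    """Return (plain form, inflection type, gloss) for the first candidate
    found in the dictionary, or None if no candidate matches."""
    for ending, plain, kind in INFLECTIONS:
        if word.endswith(ending) and len(word) > len(ending):
            candidate = word[:-len(ending)] + plain
            if candidate in DICTIONARY:   # test against the dictionary file
                return candidate, kind, DICTIONARY[candidate]
    return None


print(deinflect("思います"))
print(deinflect("高かった"))
```

The dictionary acts as the filter: a rule may generate an implausible candidate such as 思いる, but it is discarded because no such entry exists.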

fig4.gif
Figure iv: Example of the glossing of words in Japanese text.

When preparing glosses of words in text, it is appropriate to draw on as large a lexicon as possible. For this reason, a combination of all the major files of the EDICT project is used, unlike the single word search function where users can select which glossary to use. This can introduce other problems, as an inappropriate entry may be selected. For example, for the word 人々 the ひとびと entry must be selected, not the much less common にんにん. To facilitate this, a priority system is employed in which preference is given in turn to entries from:

  1. a 20,000 entry file of more commonly used words;
  2. the remainder of the EDICT file;
  3. the other subject-specific files;
  4. the file of names, broken up into prioritized subfiles as Japanese names often have several pronunciations.
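
The priority ordering above amounts to consulting the files in a fixed sequence and accepting the first hit. A minimal sketch, assuming each file is held as a simple in-memory table; the file labels, sample entries and glosses are illustrative only:

```python
# Hypothetical stand-ins for the files, listed in priority order.
PRIORITY_FILES = [
    ("common words", {"人々": ("ひとびと", "people")}),
    ("rest of EDICT", {"人々": ("にんにん", "each person")}),
    ("subject files", {"抗体": ("こうたい", "antibody")}),
    ("names", {"田中": ("たなか", "Tanaka (surname)")}),
]


def select_entry(word):
    """Return (file label, reading, gloss) from the highest-priority file
    containing the word, or None if no file has it."""
    for label, lexicon in PRIORITY_FILES:
        if word in lexicon:
            reading, meaning = lexicon[word]
            return label, reading, meaning
    return None


print(select_entry("人々"))  # the common-word reading is found first
```

Because the search stops at the first file that contains the word, the less common reading of 人々 in the second table is never reached.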

5. Example Sentences

It is generally considered desirable for dictionaries, especially bilingual dictionaries used by students, to have representative clauses and sentences showing the usage of words. The EDICT file did not include such examples, and the compilation process using volunteers did not lend itself to the task of generating and including such examples.

In 2002 the author received a copy of a file of some 210,000 Japanese-English sentence pairs compiled by Professor Yasuhito Tanaka at Hyogo University and his students (see Pacling2001). The collection, which has been placed in the Public Domain, consists of material drawn largely from instructional texts. After editing to remove unsuitable and duplicated material, the remaining 180,000 sentence pairs were processed using the ChaSen morphological analysis system to extract the Japanese words from each sentence, a process that identified approximately 20,000 unique words. The collection of sentences was then integrated with the WWWJDIC server so that a user can link to the example sentences and view examples of a word's usage.

Figure v shows some of the sample sentences available for the verb 食べる (to eat).

fig5.gif
Figure v: Example sentences using 食べる.

The collection of sentences still requires considerable editing to remove errors and duplicated sentences, as well as reducing the number of pairs to something more manageable. Initial feedback, however, is that it is proving a useful addition for students of Japanese.

6. Verb Conjugations

A further extension to WWWJDIC for language education purposes is the option to see a table of verb conjugations for almost all of the approximately 9,000 verbs in the main EDICT dictionary file. As most Japanese verbs are quite regular, it was originally thought that such an option would be of limited use; however, the possibility received strong support from a sample of instructors and students.

The conjugation table is generated as required from a set of rules for each verb type, and relies on the verb classification being indicated in the dictionary file.

fig6.gif
Figure vi: Verb Conjugation Table for 食べる.
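
Rule-driven generation of this kind can be sketched briefly. The example below is illustrative only: it covers three verb classes and four forms, the class labels v1, v5k and v5u follow the part-of-speech tags used in the EDICT file, and the rule table layout and the function name conjugate are assumptions rather than the server's actual rule set.

```python
# For each verb class: the final kana to drop, then the ending for each form.
RULES = {
    "v1":  {"drop": "る", "polite": "ます", "negative": "ない",
            "te-form": "て", "past": "た"},
    "v5k": {"drop": "く", "polite": "きます", "negative": "かない",
            "te-form": "いて", "past": "いた"},
    "v5u": {"drop": "う", "polite": "います", "negative": "わない",
            "te-form": "って", "past": "った"},
}


def conjugate(plain, verb_class):
    """Build a small conjugation table from the plain form and its class."""
    rule = RULES[verb_class]
    final = rule["drop"]
    if not plain.endswith(final):
        raise ValueError(f"{plain} does not end in {final}")
    stem = plain[:-len(final)]
    return {form: stem + ending
            for form, ending in rule.items() if form != "drop"}


for form, text in conjugate("食べる", "v1").items():
    print(form, text)
```

The sketch depends entirely on the class label supplied with the verb, which is why the classification must be recorded in the dictionary file; irregular verbs would need their own classes and rules.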

7. Use of WWWJDIC by other systems

As well as the traditional user interface via a browser screen, another interface has been provided to enable other WWW-based systems to make requests to the WWWJDIC system. An interesting example of this is the Japanese Text Initiative at the University of Virginia library. As part of this project, a "portal" system has been developed which allows individual words to be selected from texts and passed to WWWJDIC for display of the meanings, etc.

A further interesting application of WWWJDIC has been its use via the NTT "DoCoMo" WAP mobile telephones in Japan. The DoCoMo telephones have a small screen and a built-in "micro-browser" which enables access to WWW services via NTT's proxy servers. In order to make WWWJDIC services accessible to DoCoMo users, a special interface with a smaller screen usage and abbreviated dialogue has been provided. In addition, an option to operate using the "Shift-JIS" coding commonly employed in Japan has been added, as the DoCoMo browser does not support other standard encodings such as EUC.

8. Conclusion

The WWW, with its ability to associate central data files and server software, and be accessed flexibly by innumerable users, has opened the possibility of extensive sophisticated dictionary facilities being provided to many people at little cost. These facilities can extend beyond those of traditional paper dictionaries by providing additional services such as integrated kanji and text dictionaries, access using several different keys and automated glossing of text, as well as providing integrated educational tools such as linking to text examples and generation of sample verb conjugations.

A singular advantage of WWW-based approaches is that they lend themselves to continual update and enhancement without having to burden users with new acquisitions and installations, or the developers with preparing and distributing new editions. The immediacy also serves to encourage feedback and suggestions from users, which ultimately can lead to a system better "in tune" with the user requirements than traditional publishing and production techniques can achieve.

At present many of the systems are experimental; however, as more extended lexicons become available online, and as server and browser software become more advanced, the WWW is likely to play an increasingly important role in language study and multi-lingual communications.


Footnotes

1. These are all numeric codes based on the stroke-counts of identifiable portions of kanji. Halpern's SKIP (System of Kanji Indexing by Patterns) is used to order and index kanji in his New Japanese-English Character Dictionary (Kenkyusha, Tokyo 1990) and Kanji Learner's Dictionary (Kodansha, Tokyo 1998). De Roo's code is used in his "2001 Kanji" (Bonjinsha). The Four Corner code was developed by Wang Chen in 1928 and is widely used in Chinese and Japanese dictionaries. As an example, the kanji 村 has a SKIP of 1-4-3, indicating a left-right division into 4-stroke and 3-stroke portions, a De Roo code of 1848 representing 木 (18) and 寸 (48), and a Four Corner code of 4490 because there is a 十 (4) at the top two corners and a 小 (9) at the bottom left.


Appendix: Technical Aspects of WWWJDIC

WWWJDIC operates as a CGI program running under the control of a WWW server. All the operational systems use the Apache server. The code is largely drawn from the author's XJDIC dictionary system for Unix/X11. In summary each dictionary file consists of a relatively simple text file which is searched using a form of binary search via an index file of sorted pointers to lexical tokens in the target file.
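
The arrangement of a plain text file plus an index of sorted pointers can be sketched as follows. This is a simplified illustration, not the XJDIC code: the three-line sample file, the tokenizing rule and the function names are assumptions, and the index is held in a Python list rather than in a separate file.

```python
import re

# Hypothetical three-entry dictionary file, held as EUC-JP bytes.
DATA = (
    "工場 [こうじょう] /factory/\n"
    "向上 [こうじょう] /improvement/\n"
    "辞書 [じしょ] /dictionary/\n"
).encode("euc_jp")


def build_index(data):
    """Collect the byte offset of every token, sorted by the text found there.

    Assumed tokenizing rule: a token is a run of ASCII letters or digits, or a
    run of bytes with the top bit set (Japanese characters in EUC-JP).
    """
    offsets = [m.start() for m in re.finditer(rb"[A-Za-z0-9\x80-\xff]+", data)]
    return sorted(offsets, key=lambda off: data[off:])


def search(data, index, key):
    """Binary-search the pointers for tokens starting with key and return
    the dictionary lines that contain them."""
    lo, hi = 0, len(index)
    while lo < hi:                        # find the first pointer >= key
        mid = (lo + hi) // 2
        if data[index[mid]:index[mid] + len(key)] < key:
            lo = mid + 1
        else:
            hi = mid
    lines = []
    while lo < len(index) and data.startswith(key, index[lo]):
        start = data.rfind(b"\n", 0, index[lo]) + 1
        end = data.find(b"\n", index[lo])
        if end == -1:
            end = len(data)
        line = data[start:end].decode("euc_jp")
        if line not in lines:
            lines.append(line)
        lo += 1                           # matching pointers are adjacent
    return lines


INDEX = build_index(DATA)
print(search(DATA, INDEX, "こうじょう".encode("euc_jp")))
print(search(DATA, INDEX, b"factory"))
```

Since every token is indexed, the same search serves Japanese and English keys alike, and all matches for a key sit next to each other in the sorted pointer list.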

As the total set of dictionary and index files used by WWWJDIC amounts to approximately 80 MB, it is important that the searches be efficient, and that a minimal amount of time be spent loading software. Initially it was intended that the searching be carried out by a permanently-running daemon at the request of the transient CGI program instances. This could have been implemented relatively easily as the XJDIC system has an option for its dictionary search module to be a daemon interacting with multiple user-interface client programs. In fact this proved not to be necessary for relatively efficient WWW operation, as the use of memory-mapped input/output has meant that the file system tends to keep the object code and pivot pages of the dictionary in disk cache to such an extent that there is little or no advantage in having a more complex client/daemon arrangement.

One of the issues in constructing a WWW-based dictionary system, in which there is inevitably an extended dialogue between the user and the system, is that CGI programs are essentially stateless, and hence some technique is needed to maintain information at the user level about the state of the dialogue. Many WWW systems use cookies, i.e. small files sent to the browser and stored on the user's system, for this purpose. In WWWJDIC the approach that has been employed is to embed state information in the HTML sent out to the browser such that the next transmission from the user returns that information and enables the server software to be initialized appropriately. For example, in Figure i only the first ten entries matching the こうじょう keyword have been displayed and the user is asked if more entries are to be displayed. If more are requested, the request returns the location in the index file for the current dictionary and the display of entries can proceed.
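
One common way of embedding state in the returned HTML is a set of hidden form fields, sketched below. Whether WWWJDIC uses hidden fields or some other encoding is not stated above, so this is an assumed mechanism; the field names (dict, key, next), the action URL and both function names are hypothetical.

```python
from html import escape
from urllib.parse import parse_qs, urlencode


def render_more_form(dictionary, key, next_index):
    """Return an HTML form that carries the dialogue state back to the server."""
    fields = {"dict": dictionary, "key": key, "next": str(next_index)}
    hidden = "\n".join(
        f'<input type="hidden" name="{escape(name)}" value="{escape(value)}">'
        for name, value in fields.items()
    )
    return (
        '<form method="post" action="/cgi-bin/example-dictionary">\n'  # hypothetical URL
        f"{hidden}\n"
        '<input type="submit" value="More entries">\n'
        "</form>"
    )


def restore_state(form_body):
    """Rebuild the state from the form data sent with the next request."""
    params = parse_qs(form_body)
    return {
        "dict": params["dict"][0],
        "key": params["key"][0],
        "next": int(params["next"][0]),   # position in the index file
    }


print(render_more_form("edict", "こうじょう", 10))
# Simulate what a browser would post back when the button is pressed.
posted = urlencode({"dict": "edict", "key": "こうじょう", "next": "10"})
print(restore_state(posted))
```

Each CGI invocation therefore starts with everything it needs in the request itself, and nothing has to be stored on the server or on the user's system between requests.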

All the Japanese text in the files is handled internally in the EUC (Extended Unix Code) in which each character is typically encoded as a pair of bytes each with the MSB set to distinguish them from the normal ASCII characters. Most characters are from the JIS X 0208 set, which encodes 6,355 kanji, all the kana and a number of special characters. Most WWW browsers can display these characters once the appropriate fonts are installed. In addition there are some kanji from the supplementary JIS X 0212 set, which has a further 5,801 kanji. As few browsers can support these kanji, the server software provides bit-mapped image files. Normally the generated HTML delivered to the browsers is in EUC coding and is identified by an appropriate "charset" value in the header as recommended by the W3C.
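
The byte-level convention described above can be illustrated with Python's standard euc_jp and shift_jis codecs. This is a minimal sketch with invented function names: it splits an EUC byte string into characters using only the lead byte, and re-encodes text in Shift-JIS as a client limited to that coding would require.

```python
def split_euc(data):
    """Split EUC-JP bytes into characters by inspecting the lead byte."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width = 1        # MSB clear: an ordinary ASCII character
        elif lead == 0x8F:
            width = 3        # prefix byte introducing a JIS X 0212 character
        else:
            width = 2        # JIS X 0208 pair (or 0x8E plus half-width katakana)
        chars.append(data[i:i + width])
        i += width
    return chars


def euc_to_shift_jis(data):
    """Re-encode EUC-JP bytes as Shift-JIS."""
    return data.decode("euc_jp").encode("shift_jis")


sample = "日本語 ok".encode("euc_jp")
for part in split_euc(sample):
    print(part.hex(), part.decode("euc_jp"))
print(euc_to_shift_jis(sample).hex())
```

The three kanji each occupy two bytes with the top bit set, while the space and the Latin letters remain single ASCII bytes, so mixed Japanese and English text can be scanned without ambiguity.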