Practical Issues and Problems in Building a Multilingual Lexicon

Problèmes et questions pratiques dans la construction d'un lexique multilingue

多言語辞書の編集における実際的問題

J.W. Breen

School of Computer Science & Software Engineering
Monash University.

Abstract. The issues and problems that need to be addressed when compiling a multilingual lexicon from a number of different sources are identified and discussed. The approach that has been taken with the JMDict Japanese-Multilingual Dictionary in handling these problems is described.

Introduction

Truly multilingual dictionaries are rare beasts in the world of lexicography for a number of reasons: limited market, head-word identification, editorial complexity, etc. A cooperatively compiled electronic dictionary operating primarily in an online environment is seen by many to have the potential to overcome many of these impediments while at the same time meeting the requirements of scholars, translators and students. One such electronic dictionary file is the XML-based JMDict (Japanese-Multilingual Dictionary)[1], which employs Japanese as the mediating language and has entries glossed in English, French, German and Russian. The file contains approximately 90,000 Japanese-English entries, of which over 10,000 have German and French material, and 2,000 have Russian material.

The development of the multilingual components of the JMDict file from its bilingual Japanese-English precursor (EDICT: Electronic DICTionary)[2] has inevitably focussed on the incorporation of material from other bilingual lexical sources. This has highlighted a number of issues which need to be addressed in the compilation of any such file. Many of these issues arise as a result of the other sources having been compiled employing differing approaches to the structure, selection and coding of lexical material. As the incorporation of the material is best carried out with minimal human intervention, and ideally can be repeated at later stages as the sources are further developed, the issues need to be identified and appropriate strategies selected and implemented in advance of the incorporation.

The Issues

Among the issues which need to be addressed with regard to the source lexicons are:

the coding and structure.
the inclusion policy for entries, e.g. whether the entry belongs in a general lexicon or a special-purpose lexicon, such as one of proper names or idioms.
the recording policy, particularly in such matters as the inflectional forms in Japanese and orthographical variants such as use of okurigana and kanji variants.
the identification and marking of Part-of-Speech information.
the identification and subsequent alignment of multiple senses.
the inclusion of usage examples.
the on-going editing of the combined material.

The nature of each of these issues and the general approach to its handling are described below.

Coding and Structure

An issue which posed a major challenge in the early stages of multilingual text processing is the coding of text in the various languages. Mixing Japanese text, which used the JIS X 0208 character set in one of its major encapsulations (EUC-JP or Shift-JIS), with languages such as French or German presented particular difficulties: these languages use the Latin alphabet with diacritics, none of which are in the JIS set, and neither encapsulation can be mixed with text coded in the full ISO-8859-1 character set. As compatibility with a Japanese text editor was of paramount importance, alternative representation methods, such as using oe for ö, were employed where possible.

To a large extent the arrival and increasing maturity of Unicode, and in particular the availability of editors and other utilities supporting Unicode has meant that the problem of conflicting codesets is well on the way to being solved. Legacy files, however, present the problem of converting the interim forms into the correct modern codes, something that often requires human intervention. In the case of a Japanese-German file that was incorporated into JMDict, a group of German-speaking volunteers examined and marked up approximately 10,000 entries in this way.
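The conversion problem can be illustrated with a minimal Python sketch. It is not the procedure the volunteers followed; the digraph table, the function name restore_word and the reference word list are all assumptions made for illustration. The point is that a digraph such as "ue" is ambiguous, so a program can convert only the cases a word list confirms and must set the rest aside for human review.

```python
import re

# Hypothetical table of interim ASCII digraphs and the characters they may stand for.
DIGRAPHS = {"ae": "ä", "oe": "ö", "ue": "ü", "Ae": "Ä", "Oe": "Ö", "Ue": "Ü"}
DIGRAPH_RE = re.compile("|".join(DIGRAPHS))


def restore_word(word, lexicon):
    """Return (word_to_use, needs_review) for one German word.

    `lexicon` is an assumed set of correctly spelled reference words.
    A word is converted only when the converted form is attested and the
    unconverted form is not; undecidable words are flagged for review.
    """
    if not DIGRAPH_RE.search(word):
        return word, False  # nothing that could be an interim form
    restored = DIGRAPH_RE.sub(lambda m: DIGRAPHS[m.group(0)], word)
    if restored in lexicon and word not in lexicon:
        return restored, False  # e.g. "schoen" becomes "schön"
    if word in lexicon and restored not in lexicon:
        return word, False  # e.g. "Feuer" must not become "Feür"
    return word, True  # undecidable: leave unchanged and flag it
```

With a reference list containing "schön" and "Feuer", the word "schoen" is converted, "Feuer" is left alone, and a name such as "Goethe", which the list cannot confirm either way, is flagged for an editor.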

The structure of target files also can present a challenge. Ideally, even if the files are not in a database structure, the lexical components should be able to be readily identified through the use of field separators and markers. While the major components are often easily identified, it is common to find such things as senses not well marked, or part-of-speech inconsistently coded.
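As an illustration of identifying components through field separators, the following Python sketch parses one line in the basic EDICT layout, assumed here to be a headword, an optional reading in square brackets, and glosses delimited by slashes. The names Entry and parse_line are hypothetical, and real files contain variations that this sketch does not handle.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    kanji: Optional[str]  # None for kana-only entries
    kana: str
    glosses: list


# Assumed layout: HEADWORD [READING] /gloss 1/gloss 2/.../
LINE_RE = re.compile(r"^(\S+)(?:\s+\[([^\]]+)\])?\s+/(.*)/\s*$")


def parse_line(line):
    """Parse one EDICT-style line; return None if it does not fit the layout."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # malformed line: leave it for manual inspection
    head, reading, body = m.groups()
    glosses = [g for g in body.split("/") if g]
    if reading is None:
        return Entry(None, head, glosses)  # kana-only headword
    return Entry(head, reading, glosses)
```

Returning None for lines that do not fit, rather than guessing, keeps poorly marked material visible so that it can be examined by hand.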

Inclusion Policy

Any lexicon needs a policy as to the extent of its coverage. In the case of JMDict the policy has been to include as many words as possible in use in modern Japanese, but to exclude all except a small number of proper names, as these are being catered for in other files such as ENAMDICT[3]; and to exclude words associated wholly with specialized fields as these also are being handled elsewhere, e.g. in the Life Sciences Dictionary[4].

Another policy issue is the extent to which common phrases and idiomatic expressions are included. In JMDict a "reasonable" number of such terms are included, as they are of considerable use to dictionary users. When dealing with a data source which follows other policies, e.g. mixing proper names with general entries, or including greater numbers of expressions, a decision needs to be made as to how to deal with these items. Simple matching can be used to filter non-conforming entries, but this carries a risk of overlooking a potentially useful set of material.

Recording Policy

It is usual for lexicons to apply a set of policies for the forms of lexical items recorded. A common practice, for example, is to record verbs in the infinitive for European languages or in the plain present tense for Japanese. Similarly for Japanese dictionaries, "true" adjectives (形容詞) are usually recorded in their uninflected form and quasi-adjectives (形容動詞) without the な ending. Adverbial forms of both are not often recorded, as they are derived forms.

In contrast to this formal lexicographic policy, it is common to find word-lists where inflected forms of verbs and adjectives, and adverbial forms are included, presumably to assist users who encounter the words in that form.

In a similar fashion, in Japanese dictionaries the verbs formed with the addition of する are typically not recorded with the する as part of the headword. This policy is also adopted in JMDict. Often word-lists do include the する, including cases of it in inflected forms such as して and しない. This also presents challenges to matching entries and aligning glosses.
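One way to handle such headwords during matching is sketched below in Python. This is an assumption about how matching could be done, not a description of the JMDict utilities; the names strip_suru and match_headword and the list of endings are hypothetical. The stripped form is accepted only if it actually occurs in the master file, because an ending such as して also occurs in words that are not する verbs.

```python
# Endings taken from the forms mentioned in the text; longest first.
SURU_ENDINGS = ("しない", "する", "して")


def strip_suru(headword):
    """Return (stem, True) if a する-type ending was removed, else (headword, False)."""
    for ending in SURU_ENDINGS:
        if headword.endswith(ending) and len(headword) > len(ending):
            return headword[: -len(ending)], True
    return headword, False


def match_headword(headword, master_headwords):
    """Return the master headword that `headword` corresponds to, or None."""
    if headword in master_headwords:
        return headword  # exact match takes priority
    stem, stripped = strip_suru(headword)
    if stripped and stem in master_headwords:
        return stem  # e.g. 勉強する matches 勉強
    return None  # no safe match: keep the entry out of the automatic merge
```

For a master set containing 勉強 and 話す, the forms 勉強する, 勉強して and 勉強しない all resolve to 勉強, while 話して is not matched, since its apparent stem 話 is not a master headword.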

Part-of-Speech Information

The JMDict file is largely populated with part-of-speech tags associated with the Japanese portion of each entry. This largely removes the requirement for such tagging in the material incorporated for other languages. Many files of lexical material include such tags in various forms, often using ambiguous representations that are difficult to detect and remove automatically.

Sense Details

The EDICT file, which was the precursor to JMDict, simply listed a set of English glosses for each Japanese word, and did not distinguish between multiple senses. The JMDict structure allows for delineation between senses, and work is proceeding to organize the glosses where necessary. Although highly desirable, this presents a challenge when incorporating matched entries from other sources, as such sources often do not mark the senses, and when they do, they may be in a different order from the ones in JMDict.

This problem may be avoided simply by grouping the different language material regardless of sense, but if a method could be determined to enable the grouping to be carried out at the sense level, the result would be more satisfactory.
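The two options can be pictured with a small Python sketch. The data layout and the names new_entry and attach_glosses are hypothetical and do not reflect the actual JMDict structure: glosses from another language are placed in a specific sense when an alignment is known, and otherwise fall back to an entry-level group that ignores sense.

```python
def new_entry(english_senses):
    """Build an illustrative entry from a list of per-sense English gloss lists."""
    return {
        "senses": [{"eng": list(glosses)} for glosses in english_senses],
        "unaligned": {},  # language code -> glosses not tied to any sense
    }


def attach_glosses(entry, lang, glosses, sense_index=None):
    """Attach glosses at sense level if an alignment is given, else at entry level."""
    if sense_index is not None and 0 <= sense_index < len(entry["senses"]):
        entry["senses"][sense_index].setdefault(lang, []).extend(glosses)
    else:
        entry["unaligned"].setdefault(lang, []).extend(glosses)
    return entry
```

Keeping the unaligned group separate means that material merged without sense information can later be moved into the appropriate sense by an editor without being confused with material that is already aligned.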

Usage Examples

A common feature of significant dictionaries is the inclusion of representative examples of the usage of a word in a short clause or sentence. The JMDict structure allows for such examples, which would ideally be located within the separate language glosses. As yet there are no examples in JMDict, and none in the files which are candidates for providing material. The development and editing of such examples will be a major task of any multilingual dictionary project.

On-going Editing

Most dictionaries undergo continual revision. With published dictionaries this usually takes place in cycles oriented around new editions. With on-line dictionaries the cycle-time can be reduced to the stage where the release of new versions is close to continuous. In the case of the Japanese and English components of JMDict and its legacy format EDICT, changes to the master files occur almost daily, and several public releases are made each year. Editorial control rests with one area, which at least ensures a consistent editorial policy and standard.

The extension of JMDict to multilingual components raises an additional challenge in that editing entries for the full range of languages requires skills in a wide range of languages. A possible solution is to have the lexicon set up for on-line update, as is proposed for Papillon. The effectiveness of such an approach has yet to be demonstrated, particularly in the important area of quality control for the entries. Indeed, the experience with the JMDict and EDICT files has been that the error rate among contributed submissions is unacceptably high, and that careful cross-checking and editing are necessary.

The JMDict Approach

The approach that has been followed with JMDict has been designed to meet as far as possible the issues outlined above. In summary, it can be described as follows:

the basic master file for JMDict contains only Japanese and English material. It is a text file in EUC coding, and is edited using a regular text editor and some update utilities. The XML-format JMDict file and the legacy EDICT file are generated from this file using utility software. This method was employed for several reasons:
1. at the time the JMDict project began, there were no XML editors available which were capable of handling the file (over 20 MB);
2. separating the master file from the distribution format has enabled the XML structure to be modified on several occasions without changing the master file.
the German, French and Russian components are derived from discrete files of Japanese-German, etc. material. The files are: the JDDICT[5] (Japanese-German Dictionary) file transcribed from a small Langenscheidt compiled by Wolfgang Hadamitzky; Jean-Marc Desperrier's Dictionnaire français-japonais[6], which results from a project to assign French glosses to the main EDICT entries; and Oleg Volkov's unreleased JR-EDICT (Japanese-Russian Electronic Dictionary). All these files are in UTF-8 coding and follow the basic EDICT format. The JDDICT file is not being revised; the other two, however, are undergoing expansion and editorial revision.
the process of generating a complete version of the JMDict file is as follows:
1. as required, the most recent versions of the contributing files are fetched;
2. the kanji and kana headwords in each entry are matched against the JMDict master file, and the JMDict entry sequence number is associated with the respective glosses in an interim update file;
3. the interim update files are merged with the JMDict master file;
4. the full XML format is generated and checked for validity and DTD conformance.
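Steps 2 and 3 of the process above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's utility software: the in-memory layout of the master file, the function names and the sequence number used in the usage note are all hypothetical.

```python
def build_index(master):
    """Map (kanji, kana) headword pairs to entry sequence numbers.

    `master` is assumed to be a dict: sequence number -> entry dict
    with "kanji", "kana" and "glosses" keys.
    """
    return {(e["kanji"], e["kana"]): seq for seq, e in master.items()}


def make_interim_updates(source_entries, index, lang):
    """Step 2: associate each matched source entry with a sequence number.

    `source_entries` is an iterable of (kanji, kana, glosses) tuples.
    Returns (updates, unmatched) so that unmatched headwords can be reviewed.
    """
    updates, unmatched = [], []
    for kanji, kana, glosses in source_entries:
        seq = index.get((kanji, kana))
        if seq is None:
            unmatched.append((kanji, kana))
        else:
            updates.append((seq, lang, glosses))
    return updates, unmatched


def merge_updates(master, updates):
    """Step 3: merge the interim updates into the master entries."""
    for seq, lang, glosses in updates:
        entry_glosses = master[seq].setdefault("glosses", {})
        entry_glosses.setdefault(lang, []).extend(glosses)
    return master
```

For example, with a master entry numbered 1000 for 辞書 [じしょ], a French source entry with the same headword pair yields the update (1000, "fre", ["dictionnaire"]), while a source entry whose headwords are absent from the master is reported as unmatched. Because the master file itself holds only Japanese and English material, the merge can be rerun from scratch whenever the contributing files change.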

The advantage of the process described above is that it leaves the critical language-dependent aspects in the hands of people who have the skills and motivation to handle it. The merge of material is greatly facilitated by the fact that the additional material is always based on a subset of the full JMDict entries, and has been compiled ab initio following consistent formats and principles. A high level of cooperation between the participants has also assisted greatly.

The next planned expansion to JMDict is to include material from the WaDokuJT[7] Japanese-German file compiled by Ulrich Apel. This is a large and rich file and while it has some structural similarities to the EDICT and JDDICT files, the differences are such that many of the issues described earlier in this paper will need to be addressed. An initial examination indicates that some 30-40,000 entries which do not cover material already in the JDDICT file may readily be included in a semi-automated fashion.

Conclusion

This paper has described and discussed the issues that need to be addressed when compiling a multilingual lexicon from a number of disparate sources.

The discussion has focussed on the compilation of the JMDict (Japanese-Multilingual Dictionary), which has adopted the approach of retaining the source files as independent entities under the control of editors skilled in particular language pairs, and building the JMDict file via an automated merge of material. The success of this approach suggests a model that may be applicable to other multilingual lexicon projects.