Japanese Dictionaries and Multiple Surface Forms: Issues and Solutions

A major challenge for compilers and maintainers of Japanese dictionaries, especially in the digital age, is the wide variety of surface forms that can be used in Japanese. Lexical items are written using a combination of one or more of three scripts: kanji (Chinese characters) and the hiragana and katakana syllabaries (kana). The variety in the surface forms can be due to:
- the choice of kanji; often two or more kanji are available with the same meaning and pronunciation.
- variations in writing parts of morphemes in kana.
- replacing a kanji in a term with its kana equivalent.

For example, the verb tekozuru (to have a hard time, etc.) is recorded in published dictionaries in four kanji forms: 手古摺る, 梃摺る, 手子摺る and 梃子摺る. Contemporary Japanese usage, however, as seen in the Google n-gram corpus, is that the verb is almost invariably written either in a mixed form (手こずる) or in kana alone (てこずる).

Similarly, many loanwords in Japanese, which are typically transcribed into the katakana script, have variant transcriptions. For example:
- "diamond" is usually transcribed as ダイヤモンド (daiyamondo), but is often rendered as ダイアモンド (daiamondo);
- "syndicalism" is usually transcribed as サンジカリズム or サンディカリズム, reflecting two approaches to the "di" syllable, and occasionally as サンディカリスム, where the "s" of "ism" is not voiced (published dictionaries vary as to which forms they cover).

The variety of surface forms presents a particular challenge for compilers of decoding dictionaries aimed at non-Japanese speakers, and for lexicons supporting text analysis and glossing software.

In this paper we explore approaches to identifying, selecting and presenting variant surface forms in entries in order to optimize the usefulness of the displayed entry, while at the same time enabling the entries to be correctly identified using all of the known forms.
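
The two goals stated above, displaying a useful subset of forms while matching on all of them, can be illustrated with a minimal sketch. It is not the structure of any particular dictionary: the class names, the `common` flag and the `build_index` helper are hypothetical and chosen only for illustration, and the flag values simply restate the tekozuru usage observation given earlier.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SurfaceForm:
    text: str
    common: bool = False  # hypothetical flag: form is in common contemporary use


@dataclass
class Entry:
    reading: str
    gloss: str
    forms: list[SurfaceForm] = field(default_factory=list)

    def display_forms(self) -> list[str]:
        """Forms to show as headwords: the common ones, or all if none is flagged."""
        common = [f.text for f in self.forms if f.common]
        return common or [f.text for f in self.forms]


def build_index(entries: list[Entry]) -> dict[str, list[Entry]]:
    """Map every known surface form, plus the reading, to its entries."""
    index: dict[str, list[Entry]] = {}
    for entry in entries:
        # A set avoids indexing the entry twice when a form equals the reading.
        for key in {entry.reading, *(f.text for f in entry.forms)}:
            index.setdefault(key, []).append(entry)
    return index


# Illustrative entry built from the tekozuru example in the text.
tekozuru = Entry(
    reading="てこずる",
    gloss="to have a hard time",
    forms=[
        SurfaceForm("手古摺る"),
        SurfaceForm("梃摺る"),
        SurfaceForm("手子摺る"),
        SurfaceForm("梃子摺る"),
        SurfaceForm("手こずる", common=True),
        SurfaceForm("てこずる", common=True),
    ],
)

index = build_index([tekozuru])
```

In this sketch a lookup on a rare kanji form such as `index["梃摺る"]` still reaches the entry, while `tekozuru.display_forms()` returns only the forms flagged as common. A real lexicon would need a principled source for that flag, such as corpus counts.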