Expanding the Lexicon: Harvesting Neologisms in Japanese

James BREEN
Monash University
Clayton 3800, Australia
jwb@csse.monash.edu.au

"A writer of dictionaries; a harmless drudge, that busies himself in tracing the original, and detailing the signification of words." --- Dr Johnson's definition of a lexicographer

Abstract

This paper describes an automated system for harvesting potential Japanese neologisms from texts collected from the WWW. The harvesting mechanism includes the retrieval and extraction of text, the parsing of the text to detect potential neologisms, and the evaluation of the candidate words.

1. Introduction

Detecting the emergence of neologisms (Johnson's "tracing the original") is one of the many interesting challenges in lexicography. Traditional techniques have relied on human readers noticing that a word has not previously been recorded. The relatively new phenomenon of having large amounts of recent text available for electronic retrieval has enabled a degree of automation to be applied to the search for, and detection of, neologisms.

Whereas in European languages the practice of using spaces between words means that the collection and analysis of word lists from texts is a relatively simple task, languages such as Japanese, which do not delineate word boundaries in the orthographical system, pose a particular problem when attempting to detect new words. Reading text in such a language requires parsing it into its components, a process which typically draws on knowledge of the orthography, grammar and morphology, and on the availability of a comprehensive lexicon. A neologism, as with any unknown word, will by definition not be in the lexicon, although it is possible its constituent morphemes will be present. One exception to this problem in Japanese is the class of "katakana words" written in the katakana syllabary. These include loan-words, non-Japanese names, etc., and are readily identifiable from the script used.

It is, of course, possible to detect neologisms through a systematic manual examination of recent texts [Chen, 2002]; however, with the increasing availability of recent texts in electronic form, it is appropriate to develop techniques to examine them automatically.

One possibility for the automated detection of neologisms is to take a brute-force approach, e.g. examining every possible digraph, trigraph, etc. in the text; however, this is not likely to be very practical. Apart from the katakana words mentioned above, most target new words are likely to be written using either kanji alone or combinations of kanji and hiragana. Examining strings of hiragana, which typically make up inflections, conjunctions, etc., is not likely to yield much, and blindly extracting digraphs from sequences of kanji is likely to yield false results, as it ignores the processes of affixation and compounding that play a large role in Japanese word formation [Tsujimura, 1996].

Another approach is to pass text through a parser and identify when the parser fails to associate a word with its lexicon. This approach could typically be applied to kanji compounds, although it could also be used for katakana words. It has been used with Chinese texts [Goh, 2003], but to date there have been no published studies for Japanese neologisms.

In the trials described in this paper, two forms of neologism have been targeted for harvesting:

  1. katakana words, using a simple extraction process based on the detection of the katakana script;
  2. kanji compounds, using a parser and attempting to detect from the parser output that a potential neologism has been encountered.

2. Text Corpus

In this trial a series of about 500 short articles from the Asahi Shinbun daily newspaper has been used. The articles cover the topics of politics, culture, life-style and international affairs. This set of articles has been chosen because:

  1. the language and orthography used are reasonably formal;
  2. the contexts are constrained, thus allowing for easier translation of neologisms;
  3. the layout of the Asahi Shinbun WWW site lends itself to a straightforward daily collection of new articles, and the structure of each article enables the text to be extracted using a simple script.

3. Katakana Words

The system used for harvesting katakana words is relatively simple, consisting of:

  1. examining each article and extracting contiguous sequences of katakana characters and associated characters such as the chouon (long vowel) and kurikaeshi (repetition mark);
  2. comparison with a list of known katakana words;
  3. generation of a cumulative list of unrecorded words, ranked by frequency of occurrence and tagged with the source article.
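
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used in the trial: the function name harvest_katakana, the article identifiers and the exact character ranges are hypothetical. The pattern accepts katakana letters (U+30A1 to U+30FA), the chouon (U+30FC) and the katakana repetition marks (U+30FD, U+30FE), and deliberately excludes the middle dot (U+30FB) so that dotted names are split into their parts.

```python
import re
from collections import Counter, defaultdict

# Assumed definition of a "katakana word": a katakana letter followed by any
# mix of katakana letters, the long-vowel mark and the repetition marks.
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA][\u30A1-\u30FA\u30FC-\u30FE]*")


def harvest_katakana(articles, known_words):
    """Return unrecorded katakana words ranked by frequency.

    articles    -- dict mapping a (hypothetical) article id to its text
    known_words -- set of katakana words already in the lexicon
    Each result is (word, count, sorted list of source article ids).
    """
    counts = Counter()
    sources = defaultdict(set)
    for article_id, text in articles.items():
        # Step 1: extract contiguous katakana sequences.
        for word in KATAKANA_RUN.findall(text):
            # Step 2: skip anything on the known-word list.
            if word in known_words:
                continue
            # Step 3: accumulate frequency and source tags.
            counts[word] += 1
            sources[word].add(article_id)
    return [(w, n, sorted(sources[w])) for w, n in counts.most_common()]


if __name__ == "__main__":
    sample = {
        "a1": "スンニ派のテレビ番組",
        "a2": "スンニ派とピープルパワー",
    }
    for row in harvest_katakana(sample, known_words={"テレビ"}):
        print(row)
```

Keeping the source tags alongside the counts makes it easy to return to the article when deciding whether a candidate is a name or a genuinely new word.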

The list of known words was compiled from the JMdict [Breen, 2004] and JMNEdict [Breen, 2005] files, supplemented by the katakana-only entries from the large Eijiro dictionary file. The list has over 205,000 katakana words.

In the first three weeks of the trial, approximately 1,080 katakana words appeared in the Asahi articles, of which approximately 280 were not on the known word list. The majority of these were transcriptions of names, e.g. Chinese and Korean politicians, locations in Iraq, etc. Some of the higher ranking words that may be suitable for adding to the lexicon are shown in Table 1.

Katakana        Meaning
スンニ          Sunni
ネーギン        Nagin (New Orleans mayor)
オコーナー      O'Connor (US judge)
ゼロメートル    sea level (zero metres)
ピープルパワー  people power
プロフィル      profile (normally プロフィール)

Table 1: Some sample new katakana words

4. Kanji Compounds

In examining whether lexicon-driven parsers could be used to detect neologisms, two parsers were tested against text containing a small set of known and artificial neologisms. The two parsers were:

  1. ChaSen [Matsumoto et al, 2002]. This Hidden Markov Model parser from the Nara Institute of Science and Technology is widely used in Japanese NLP. It is primarily a part-of-speech tagger, and for unknown words its approach is to assume that each character is a distinct morpheme, regardless of whether the character is in its morpheme files.

    ChaSen's lexicon (currently ipadic-2.7.0) is not particularly large. It consists of approximately 240,000 entries, of which 141,000 are names (person, place, organization), 74,000 are nouns (including verbal and adjectival), 15,000 are verbs, 2,000 are adjectives and 3,000 are adverbs.

  2. WWWJDIC [Breen, 2003]. This is the parser used in the text-glossing function in the author's WWW-based dictionary system. The glossing function uses a lexicon of approximately 675,000 entries, of which 435,000 are names, 140,000 are general Japanese-English entries drawn mainly from the JMdict file and 100,000 are from subject-specific glossaries covering bio-medical science, computing, law, engineering, etc. About 140,000 entries include katakana.

    Unlike ChaSen, the WWWJDIC parser makes little attempt to analyze the structure of the text; instead it attempts to identify and present complete words, phrases, expressions, etc. using a form of greedy longest-match algorithm.
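
To make the term concrete, the following is a minimal sketch of a generic greedy longest-match segmenter. It is not the WWWJDIC implementation, which also handles inflected forms and phrases; the function name and the toy lexicon are hypothetical. At each position it takes the longest lexicon entry that matches, and falls back to a single character when nothing matches.

```python
def greedy_longest_match(text, lexicon):
    """Segment text by repeatedly taking the longest lexicon entry.

    lexicon -- set of known surface forms (toy data in this sketch)
    Characters that start no lexicon entry are emitted one at a time.
    """
    max_len = max(map(len, lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


if __name__ == "__main__":
    toy_lexicon = {"富士山", "富士", "山", "登る", "丸", "一晩", "丸一晩"}
    print(greedy_longest_match("富士山に登るのに丸一晩", toy_lexicon))
```

With this toy lexicon the segmenter prefers 丸一晩 over 丸 followed by 一晩, which is the kind of whole-word matching that distinguishes a longest-match glosser from a morpheme-level tagger.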

The following are the results of parsing a short sentence, 富士山に登るのに丸一晩掛かった (It took all night to climb Mt Fuji), with each parser:

Text    Reading   Dictionary form  POS
富士山  フジサン  富士山           名詞-固有名詞-一般
に      ニ        に               助詞-格助詞-一般
登る    ノボル    登る             動詞-自立, 五段・ラ行, 基本形
の      ノ        の               名詞-非自立-一般
に      ニ        に               助詞-格助詞-一般
丸      マル      丸               名詞-一般
一晩    ヒトバン  一晩             名詞-一般
掛かっ  カカッ    掛かる           動詞-自立, 五段・ラ行, 連用タ接続
た      タ        た               助動詞, 特殊・タ, 基本形

Figure 1: Parse output from ChaSen

  • 富士山 【ふじさん】 (n) Mt Fuji
  • 登る 【のぼる】 (v5r) (1) to rise; to ascend; to go up; to climb; (2) to go to (the capital); (3) to be promoted; (4) to add up to; (5) to advance (in price); (6) to sail up; (7) to come up (on the agenda)
  • 丸一晩 【まるいちばん】 (n) whole night; all night
  • Possible inflected verb or adjective: (plain, past)
    掛かる 【かかる】 (v5r,vi) (1) to take (e.g., time, money, etc); (2) to hang

Figure 2: Parse output from WWWJDIC

The two parsers were tested using passages containing jukugo (kanji compound words) known not to be in their lexicons. As expected from ChaSen's documentation, the usual result was to treat the unknown word as a sequence of single kanji. The behaviour of the WWWJDIC parser was less predictable as its treatment depended on neighbouring kanji and compounds. From the tests it was concluded that the reporting of a sequence of two or more consecutive single kanji from ChaSen was the most reliable indication of a potential neologism.

There are, however, a number of circumstances in which ChaSen will deliver such sequences of single kanji. Strings of numerics cause this behaviour, as will a numeric followed by a counter, or a counter followed by 目 (ordinal number suffix). The affixation common in Japanese morphology also results in sequences of single kanji. For example, 元副首相 (former deputy prime minister) is parsed as 元 + 副 + 首相. In addition, the small size of the proper name files used by ChaSen leads to many names not being recognized as such, and they too are reported as single kanji.

The approach that has been applied has been:

  1. process each batch of new articles through ChaSen;
  2. collect each pair of adjacent kanji that ChaSen has reported as two separate single-character morphemes;
  3. remove any pairs which occur as two-kanji entries in the JMdict file (approx. 35,000 entries);
  4. remove any pairs on a "stop list" consisting of pairs collected from previous batches;
  5. classify remaining kanji pairs, and add them to the stop list.

Thus we obtain a system that can be trained to recognize and subsequently ignore the kanji pairs that are progressively collected.
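
The filtering loop can be sketched as follows. This is an illustration only: it assumes the ChaSen output has already been reduced to a list of morpheme surface strings per article, it treats "kanji" as the CJK Unified Ideographs range U+4E00 to U+9FFF, and the names process_batch, known_pairs and stop_list are hypothetical. The classification in step 5 is manual, so the sketch simply returns the new pairs for inspection and adds them to the stop list.

```python
def is_single_kanji(morpheme):
    """True if the morpheme is exactly one CJK ideograph (assumed range)."""
    return len(morpheme) == 1 and "\u4e00" <= morpheme <= "\u9fff"


def adjacent_kanji_pairs(morphemes):
    """Join each pair of neighbouring morphemes that are both single kanji."""
    return [
        a + b
        for a, b in zip(morphemes, morphemes[1:])
        if is_single_kanji(a) and is_single_kanji(b)
    ]


def process_batch(articles, known_pairs, stop_list):
    """Return new candidate pairs from a batch and update the stop list.

    articles    -- iterable of morpheme lists, one list per article
    known_pairs -- set of two-kanji words already in the dictionary
    stop_list   -- set of pairs seen in earlier batches (mutated in place)
    """
    candidates = set()
    for morphemes in articles:
        for pair in adjacent_kanji_pairs(morphemes):
            if pair in known_pairs or pair in stop_list:
                continue
            candidates.add(pair)
    # After (manual) classification, every pair goes onto the stop list
    # so that it is not reported again in later batches.
    stop_list |= candidates
    return sorted(candidates)


if __name__ == "__main__":
    batch = [["元", "副", "首相", "が", "入", "境", "した"]]
    stop = {"元副"}
    print(process_batch(batch, known_pairs=set(), stop_list=stop))
    print(process_batch(batch, known_pairs=set(), stop_list=stop))
```

Running the same batch twice shows the training effect: a pair reported in the first pass is suppressed in the second because it is now on the stop list.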

At the completion of the processing of approximately 500 articles totalling 280,000 characters, just over 500 unique unrecognized kanji pairs had been identified. Of these, 120 were Japanese names, 50 were Chinese or Korean names, 280 were juxtaposed numerics, counters, affixes, etc. which on inspection were not considered appropriate for further consideration, and 50 were classed as candidates to be newly recorded words. The kanji pairs were checked:

  1. by examining their use in the source document;
  2. by checking them against several electronic dictionaries;
  3. by checking their usage in WWW pages.

The candidates to be recorded as "new" words fell into several groups:

  1. previously unrecorded names, e.g. 武示 (たけし), 晃毅 (こうき), 潔重 (ゆきしげ), etc. Some of these were found in lists of candidate names in the recent election.
  2. words such as 米紙 (American press/newspapers) and 軍歴 (military service record) which have just appeared in recent dictionaries.
  3. abbreviations such as 日歯連 (from 日本歯科医師連盟 - Japan Dentists Federation), 国緊隊 (from 国際緊急援助隊 - Japan Disaster Relief Team) or 全総 (from 全国総合開発計画 - Comprehensive National Development Plan).
  4. newspaper-style formations such as 中韓 (China-Korea) or 仏誌 (French publication), which while immediately recognizable to a native speaker of Japanese are probably worth recording in a bilingual or multilingual dictionary.
  5. apparently new formations such as 入境 (border crossing or border entry) and 公助 (public assistance).

The analysis of these candidates and preparation of dictionary entries has to be carried out manually. In some cases the readings are available in the conversion files of Input Methods, and in some cases possible readings can be tested via WWW searches as people writing new or unusual words will sometimes add the reading. In general, the ability to use the WWW to check both the amount of usage of a word and the context of its usage is an immense boon in lexicography.

5. Status of the Harvesting System

The collection system described above is now operating automatically. Every evening (Japan time) a program script collects the target newspaper articles for that day and processes them, combining the results with those from previous days. The results are periodically examined and classified. Approximately 20 new articles are collected each day, and approximately 10 additional katakana words and 20 new kanji pairs are identified from each day's processing. The proportion of rejected kanji pairs appears to be slowly declining, probably because the more common pairs have already been encountered.

6. Future Directions

Two expansions of the present harvesting system are under consideration:

  1. adding additional material from other newspapers or other text sources;
  2. expanding the analysis to include longer kanji compounds resulting from affixation or compounding.

7. Conclusion

A relatively simple system has been developed and tested which has been successful in detecting and harvesting a number of previously unrecorded words from Japanese texts. The system is capable of automatic operation and accumulation of candidate words. The system has the potential to be expanded to cover a number of other types of Japanese words.

References

James Breen, 2003 A WWW Japanese Dictionary, in "Language Teaching at the Crossroads", Monash Asia Institute, Monash Univ. Press.

James Breen, 2004 JMdict: a Japanese-Multilingual Dictionary, COLING-2004 Multilingual Linguistic Resources Workshop, Geneva, August 2004 Also: http://www.csse.monash.edu.au/~jwb/jmdictart.html

James Breen, 2005 Japanese Multilingual Named Entity Dictionary, http://www.csse.monash.edu.au/~jwb/enamdict_doc.html

Lee Shiu Chen, 2002 Lexical Neologisms in Japanese, Australian Association for Research in Education Conference, Brisbane, 2002.

Goh Chooi Ling, Masayuki Asahara, Yuji Matsumoto, 2003 Chinese Unknown Word Identification Using Character-based Tagging and Chunking, 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, 2003.

Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Hiroshi Matsuda, Kazuma Takaoka, Masayuki Asahara, 2002 Morphological Analysis System ChaSen version 2.2.9 Manual, Nara Institute of Science and Technology. http://chasen.aist-nara.ac.jp/hiki/ChaSen/

Natsuko Tsujimura, 1996 An Introduction to Japanese Linguistics, Blackwell, 1996.