This paper describes an automated system for harvesting potential Japanese neologisms from texts collected from the WWW. The harvesting mechanism includes the retrieval and extraction of text, the parsing of the text to detect potential neologisms, and the evaluation of the candidate words.
Detecting the emergence of neologisms (Johnson's "tracing the original") is one of the many interesting challenges in lexicography. Traditional techniques have relied on human readers noticing that a word has not previously been recorded. The relatively new phenomenon of having large amounts of recent text available for electronic retrieval has enabled a degree of automation to be applied to the search for, and detection of, neologisms.
Whereas in European languages the practice of using spaces between words means that the collection and analysis of word lists from texts is a relatively simple task, languages such as Japanese, which do not delineate word boundaries in the orthographical system, pose a particular problem when attempting to detect new words. Reading text in such a language requires parsing it into its components, a process which typically draws on knowledge of the orthography, grammar and morphology, and on the availability of a comprehensive lexicon. A neologism, as with any unknown word, will by definition not be in the lexicon, although it is possible its constituent morphemes will be present. One exception to this problem in Japanese is the class of "katakana words" written in the katakana syllabary. These include loan-words, non-Japanese names, etc., and are readily identifiable from the script used.
It is, of course, possible to detect neologisms through a systematic manual examination of recent texts [Chen, 2002]; however, with the increasing availability of recent texts in electronic form, it is appropriate to develop techniques to examine them automatically.
One possibility for the automated detection of neologisms is to take a brute-force approach, e.g. examining every possible digraph, trigraph, etc. in the text; however, this is not likely to be very practical. Apart from the katakana words mentioned above, most target new words are likely to be written using either kanji alone or combinations of kanji and hiragana. Examining strings of hiragana, which typically make up inflections, conjunctions, etc., is not likely to yield much, and blindly extracting digraphs from sequences of kanji is likely to yield false results, as it ignores the processes of affixation and compounding that play a large role in Japanese word formation [Tsujimura, 1996].
Another approach is to pass text through a parser and identify when the parser fails to associate a word with its lexicon. This approach could typically be applied to kanji compounds, although it could also be used for katakana words. It has been used with Chinese texts [Goh, 2003], but to date there have been no published studies for Japanese neologisms.
In the trials described in this paper, two forms of neologism have been targeted for harvesting:
2. Text Corpus
In this trial a series of about 500 short articles from the Asahi Shinbun daily newspaper has been used. The articles cover the topics of politics, culture, life-style and international affairs. This set of articles has been chosen because:
3. Katakana Words
The system used for harvesting katakana words is relatively simple, consisting of:
The list of known words was compiled from the JMdict [Breen, 2004] and JMNEdict [Breen, 2005] files, supplemented by the katakana-only entries from the large Eijiro dictionary file. The list has over 205,000 katakana words.
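The katakana extraction and filtering step can be sketched as follows. This is an illustrative reconstruction, not the actual system code: the function names are invented, and the tiny known-word set stands in for the 205,000-entry list described above.

```python
import re

# A run of two or more katakana characters, including the
# prolonged-sound mark (chōonpu, ー) commonly used in loan-words.
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FAー]{2,}")

def harvest_katakana(text, known_words):
    """Return katakana sequences in `text` that are absent
    from the known-word list."""
    return [w for w in KATAKANA_RUN.findall(text) if w not in known_words]

# Toy example: one known word, one unrecorded name.
known = {"コンピュータ"}
print(harvest_katakana("新しいコンピュータとネーギン市長", known))
# → ['ネーギン']
```

Because katakana occupies its own Unicode block, a simple character-class match suffices to isolate candidate words; the real work lies in maintaining the known-word list.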
In the first three weeks of the trial, approximately 1,080 katakana words appeared in the Asahi articles, of which approximately 280 were not on the known word list. The majority of these were transcriptions of names, e.g. Chinese and Korean politicians, locations in Iraq, etc. Some of the higher ranking words that may be suitable for adding to the lexicon are shown in Table 1.
|Word||Meaning|
|ネーギン||Nagin (New Orleans mayor)|
|オコーナー||O'Connor (US judge)|
|ゼロメートル||sea level (zero metres)|
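The ranking of candidates mentioned above can be approximated by corpus frequency. The following is a minimal sketch under that assumption; the paper does not specify its actual ranking metric, and the function name and sample data are illustrative.

```python
from collections import Counter

def rank_candidates(occurrences, known_words):
    """Rank unknown katakana words by how often they occur
    in the harvested text (assumed ranking criterion)."""
    counts = Counter(w for w in occurrences if w not in known_words)
    return counts.most_common()

occs = ["ネーギン", "オコーナー", "ネーギン", "コンピュータ"]
print(rank_candidates(occs, {"コンピュータ"}))
# → [('ネーギン', 2), ('オコーナー', 1)]
```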
4. Kanji Compounds
In examining whether lexicon-driven parsers could be used to detect neologisms, two parsers were tested against text containing a small set of known and artificial neologisms. The two parsers were:
ChaSen's lexicon (currently ipadic-2.7.0) is not particularly large. It consists of approximately 240,000 entries, of which 141,000 are names (person, place, organization), 74,000 are nouns (including verbal and adjectival), 15,000 are verbs, 2,000 are adjectives and 3,000 are adverbs.
Unlike ChaSen, the WWWJDIC parser makes little attempt to analyze the structure of the text; instead it attempts to identify and present complete words, phrases, expressions, etc. using a form of greedy longest-match algorithm.
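Greedy longest-match segmentation of the kind just described can be sketched in a few lines. This is a generic illustration of the technique, not the WWWJDIC implementation itself; the lexicon and maximum word length are toy assumptions.

```python
def longest_match_segment(text, lexicon, max_len=8):
    """Segment `text` by repeatedly taking the longest prefix
    found in `lexicon`; unknown characters pass through singly."""
    i, out = 0, []
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])  # no match: emit the character alone
            i += 1
    return out

lex = {"富士山", "登る", "に"}
print(longest_match_segment("富士山に登る", lex))
# → ['富士山', 'に', '登る']
```

The characteristic weakness of this approach, noted above, is that its output for unknown compounds depends heavily on which neighbouring substrings happen to be in the lexicon.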
The following are the results of the parsing of a short sentence: 富士山に登るのに丸一晩掛かった (It took all night to climb Mt Fuji):
|Surface||Reading||Base form||POS details|
|登る||ノボル||登る||動詞-自立, 五段・ラ行, 基本形|
|掛かっ||カカッ||掛かる||動詞-自立, 五段・ラ行, 連用タ接続|
|た||タ||た||助動詞, 特殊・タ, 基本形|
The two parsers were tested using passages containing jukugo (kanji compound words) known not to be in their lexicons. As expected from ChaSen's documentation, the usual result was to treat the unknown word as a sequence of single kanji. The behaviour of the WWWJDIC parser was less predictable as its treatment depended on neighbouring kanji and compounds. From the tests it was concluded that the reporting of a sequence of two or more consecutive single kanji from ChaSen was the most reliable indication of a potential neologism.
There are, however, a number of circumstances in which ChaSen will deliver such sequences of single kanji. Strings of numerics cause this behaviour, as does a numeric followed by a counter, or a counter followed by 目 (ordinal number suffix). The affixation common in Japanese morphology also results in sequences of single kanji. For example 元副首相 (former deputy prime minister) is parsed as 元 + 副 + 首相. In addition, the small size of the proper name files used by ChaSen leads to many names not being recognized as such, and they too are reported as single kanji.
The approach that has been applied has been:
Thus we obtain a system that can be trained to recognize and subsequently ignore the kanji pairs that are progressively collected.
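The run detection and trainable ignore list described above might be sketched as follows. The function name, token sequence and ignore set are illustrative assumptions, not the system's actual code.

```python
import re

# A single character in the main CJK Unified Ideographs block.
SINGLE_KANJI = re.compile(r"^[\u4E00-\u9FFF]$")

def candidate_pairs(tokens, ignore_pairs):
    """Scan a parsed token sequence for runs of two or more
    consecutive single-kanji tokens, and report each adjacent
    pair not already on the trained ignore list."""
    cands, run = [], []
    for tok in tokens + [""]:  # sentinel flushes the final run
        if SINGLE_KANJI.match(tok):
            run.append(tok)
        else:
            if len(run) >= 2:
                for a, b in zip(run, run[1:]):
                    if a + b not in ignore_pairs:
                        cands.append(a + b)
            run = []
    return cands

# 元+副 has already been seen and rejected; 辞+任 is reported.
tokens = ["元", "副", "首相", "が", "辞", "任"]
print(candidate_pairs(tokens, {"元副"}))
# → ['辞任']
```

Each rejected pair is simply added to the ignore set, so the stream of candidates shrinks as the system is trained.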
At the completion of the processing of approximately 500 articles totalling 280,000 characters, just over 500 unique unrecognized kanji pairs had been identified. Of these, 120 were Japanese names, 50 were Chinese or Korean names, 280 were juxtaposed numerics, counters, affixes, etc. which on inspection were judged inappropriate for further consideration, and 50 were classed as candidates to be newly recorded words. The kanji pairs were checked:
The candidates to be recorded as "new" words fell into several groups:
The analysis of these candidates and preparation of dictionary entries has to be carried out manually. In some cases the readings are available in the conversion files of Input Methods, and in some cases possible readings can be tested via WWW searches as people writing new or unusual words will sometimes add the reading. In general, the ability to use the WWW to check both the amount of usage of a word and the context of its usage is an immense boon in lexicography.
5. Status of the Harvesting System
The collection system described above is now operating automatically. Every evening (Japan time) a program script collects the target newspaper articles for that day and processes them, combining the results with those from previous days. The results are periodically examined and classified. Approximately 20 new articles are collected each day, and approximately 10 additional katakana words and 20 new kanji pairs are identified from each day's processing. The proportion of rejected kanji pairs appears to be slowly declining, probably as a result of the more common pairs having already been encountered.
6. Future Directions
Two expansions of the present harvesting system are under consideration:
A relatively simple system has been developed and tested which has been successful in detecting and harvesting a number of previously unrecorded words from Japanese texts. The system is capable of automatic operation and accumulation of candidate words. The system has the potential to be expanded to cover a number of other types of Japanese words.
James Breen, 2003. A WWW Japanese Dictionary. In "Language Teaching at the Crossroads", Monash Asia Institute, Monash Univ. Press.
James Breen, 2004. JMdict: a Japanese-Multilingual Dictionary. COLING-2004 Multilingual Linguistic Resources Workshop, Geneva, August 2004. Also: http://www.csse.monash.edu.au/~jwb/jmdictart.html
James Breen, 2005. Japanese Multilingual Named Entity Dictionary. http://www.csse.monash.edu.au/~jwb/enamdict_doc.html
Lee Shiu Chen, 2002. Lexical Neologisms in Japanese. Australian Association for Research in Education Conference, Brisbane, 2002.
Goh Chooi Ling, Masayuki Asahara, Yuji Matsumoto, 2003. Chinese Unknown Word Identification Using Character-based Tagging and Chunking. 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, 2003.
Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Hiroshi Matsuda, Kazuma Takaoka and Masayuki Asahara, 2002. Morphological Analysis System ChaSen version 2.2.9 Manual. Nara Institute of Science and Technology. http://chasen.aist-nara.ac.jp/hiki/ChaSen/
Natsuko Tsujimura, 1996. An Introduction to Japanese Linguistics. Blackwell, 1996.