This paper describes a small experimental project to determine whether it is possible to discover hitherto unrecorded abbreviations in Japanese by mimicking the natural-language abbreviation process using a large set of source words, then using the WWW as a corpus of Japanese texts to determine whether the synthesized abbreviations exist. The process of formation of the abbreviations is described, along with the WWW search and the resulting analysis of text material. While only a small number of cases have been investigated completely, and the project is only at a proof-of-concept stage, a number of "new" abbreviations have been detected.
Any person learning Japanese as a foreign language quickly discovers that a large number of abbreviations are in regular use. Many loan-words are truncated, and even the governing party in Japan, the Liberal-Democratic Party (自民党: jimintou), has a name that is abbreviated from the rarely-used 自由民主党.
A problem for the learner is that these abbreviations are poorly lexicalized. Some only appear in newspaper headlines, and thus are considered unimportant by lexicographers. They rarely appear in learners' dictionaries, where space is at a premium, and are often overlooked in the major Japanese-English dictionaries, as these are produced primarily for the domestic Japanese market and few native Japanese speakers would wish to look up such an abbreviation in English.
This paper describes a project to determine whether natural-language processing techniques can be used to expand the representation of such abbreviations. The project involves synthesizing possible abbreviations, then searching a text corpus to determine whether each synthesized abbreviation is in use.
2. Abbreviation Formation in Japanese
One of the major processes in Japanese word formation is compounding, i.e. the combination of two or more words to form a new word. This process, which is common to many languages, may involve independent words, or may involve morphemes which are not normally independent words. Compounds in Japanese may involve native Japanese components, e.g. 近道 (chikamichi: shortcut), Sino-Japanese components, e.g. 殺人 (satsujin: murder, manslaughter), or hybrids of native, Sino-Japanese and loan-word components, e.g. 台所 (daidokoro: kitchen) and 石油ショック (sekiyushokku: oil shock) (Tsujimura, 1996).
Compounding extends to combining two existing compounds to form a new compound which is typically a noun, noun-phrase or multi-word expression consisting of four or more kanji, e.g. 為替 (kawase: exchange, money order) and 相場 (souba: market price) combine to form 為替相場 (exchange rate). There are many such extended compounds in use in Japanese, and a large number of them appear in dictionaries as independent entries. As confirmed by inspection of various lexicons, the majority of such extended compounds are made up of four kanji, typically formed from two two-kanji compounds, although there is also a large number of longer compounds.
Another word formation process is that of abbreviation (sometimes also called clipping), in which words are shortened. In many cases, particularly for long loan-words, the word is simply truncated, e.g. プラットホーム (purattohoomu: platform) becomes just ホーム, and スーパーマーケット (suupaamaaketto: supermarket) becomes just スーパー. For long compounds, the process typically involves selecting the first two morae of each constituent compound (Tsujimura, 1996). For example, 学生割引 (gakuseiwaribiki: student discount) is usually abbreviated to 学割 (gakuwari), with the latter occurring four times more often in Japanese WWW documents. In some ways this process is analogous to the creation and use of acronyms in languages which use alphabets.
While this form of abbreviation is by no means unique to Japanese (e.g. in the 1960s the Ministry of Technology in the UK was often referred to as "MinTech"), it appears that the process is more strongly embedded and more commonly employed than in many other languages.
Inspection of the lexicalization of this two-kanji class of abbreviations reveals:
3. Abbreviations - the Search Process
Given that a large number of four-kanji compounds have already been lexicalized, e.g. the JMdict file (Breen, 2004) has over 8,000 four-kanji compounds recorded, it is possible to use that set of compounds as the basis for an automated search of a Japanese corpus to determine if hitherto unrecorded abbreviations are in use.
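The synthesis of candidates from such a set of compounds can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes that each candidate is formed from the first character of each two-character half of the compound (as in 学生割引 → 学割), and that candidates already present in a lexicon are skipped. The function name and the sample data are hypothetical.

```python
def synthesize_candidates(compounds, lexicon):
    """Map each candidate abbreviation to the compounds it was derived from.

    Assumption: a candidate is the first character of each two-character
    half of a four-character compound, e.g. 学生割引 -> 学 + 割 = 学割.
    """
    candidates = {}
    for word in compounds:
        if len(word) != 4:
            continue  # only four-character compounds are handled in this sketch
        abbreviation = word[0] + word[2]
        if abbreviation in lexicon:
            continue  # already recorded, so not a "new" abbreviation
        # Several compounds may yield the same candidate; keep all sources.
        candidates.setdefault(abbreviation, []).append(word)
    return candidates


# Illustrative input only: two compounds and a one-entry lexicon.
print(synthesize_candidates(["学生割引", "工業技術"], {"学割"}))
# {'工技': ['工業技術']}
```

Keeping every source compound for a candidate is a deliberate choice in this sketch, since one two-kanji sequence may be derivable from more than one compound.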
The process employed in this project is:
The Japanese pages in the WWW were used as a corpus in this investigation for several reasons:
The WWW was examined using the Google search engine via its API (Application Program Interface), which provides for programmed searches. The API enables a number of filters to be set, and in this case the text language was limited to Japanese, i.e. only pages which Google has classified as being in Japanese. The language classification appears to be quite conservative; however, it was considered important to exclude pages containing Chinese or Korean, as the Google database holds pages in Unicode coding and thus false matches are possible for kanji search keys.
A further restriction was to ensure that the pair of characters were adjacent in the text. For poorly lexicalized terms the Google indices appear to handle kanji as separate tokens, and as a default the search may return a match on non-adjacent kanji. By specifying a key in quotation marks it is possible to restrict the match to a sequence of kanji; however, the match will still occur if the kanji are separated by space or punctuation characters, necessitating a finer analysis of the search results.
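One way such a finer check might be carried out on the text returned for a hit is sketched below. It is an assumption-laden illustration rather than the method actually used: the function name, the three labels and the list of separator characters are all hypothetical, and the separator list is not exhaustive.

```python
import re

# Characters that may sit between the two kanji in a false match:
# whitespace (\s also covers the ideographic space) and some common
# Japanese and ASCII punctuation. This list is an assumption.
SEPARATORS = r"[\s、。・,.:;()（）「」\[\]/-]"


def classify_match(candidate, text):
    """Return 'adjacent', 'separated' or 'none' for a two-kanji candidate."""
    if candidate in text:
        return "adjacent"  # the two kanji occur as one contiguous sequence
    first, second = candidate  # the candidate must be exactly two characters
    pattern = re.escape(first) + SEPARATORS + "+" + re.escape(second)
    if re.search(pattern, text):
        return "separated"  # matched only across space or punctuation
    return "none"


print(classify_match("学割", "学割サービスのご案内"))  # adjacent
print(classify_match("印電", "印 電話番号"))          # separated
```

Only hits classified as adjacent would then be passed on for further analysis.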
The examination of possible abbreviations proceeded in two stages:
The ordering of the candidates according to frequency of hits was done in order to concentrate on the more commonly occurring sequences which would, if valid, be worth including in a lexicon. If the process was successful for these candidates, it could be repeated for less-frequently occurring candidates. As Google ranks pages according to a weighting system based on, among other things, the number of hyperlinks pointing to a page, it is reasonable to expect that valid uses of an abbreviation would be discernible in the higher-ranked pages. Approximately 23% of the remaining candidates received over 1,000 hits in the Google search.
The text snippets supplied by the Google API typically contain about 70 characters of Japanese text surrounding the target word. The text in the snippets was stripped of residual HTML tags, then examined to determine:
In order to assess this classification, the analysis was applied to a small set of recognized abbreviations: 拡販, 学割, 郵貯 and 労組, and to a set of common Japanese compounds: 先生, 学校, 政府 and 工場. Table 1 shows the results from the 110 highest-ranking pages for those words. The column marked "Confidence" is the ratio of pages classified as either Strong or Moderate to the total number of pages containing the candidate, and may be seen as a crude measure of the precision of the technique.
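The confidence measure can be written as a small function. The sketch below assumes that the per-page classifications have already been counted; the dictionary keys, including the catch-all "other" class, are hypothetical labels, and the counts in the example are invented for illustration and are not taken from Table 1.

```python
def confidence(counts):
    """Ratio of Strong + Moderate pages to all pages containing the candidate."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no pages contained the candidate
    return (counts.get("strong", 0) + counts.get("moderate", 0)) / total


# Invented counts for illustration only.
example = {"strong": 40, "moderate": 30, "other": 30}
print(confidence(example))  # 0.7
```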
From this it is reasonable to conclude that a strong representation in the Strong and Moderate classifications may be grounds for concluding that the word exists and is in use.
To test this assumption, the analysis was carried out on a selection of possible abbreviations. Table 2 shows the results from candidates which had resulted in Google reporting several thousand matched pages, and Table 3 shows the results for candidates for which about two hundred matches were reported. The compound from which the abbreviation candidate was formed is also shown.
These results do not lend themselves to straightforward interpretation.
Many of the candidates in Table 2 with reasonably high confidence measures turn out to be valid words, but not all are abbreviations. For example, 国補, 合皮, 最賃, 高建, 国関 and 県病 are abbreviations of the original four-kanji compound; however, 国展 and 国都 are words formed independently, and 国計 and 工化 are abbreviations of other words.
While it is tempting to dismiss candidates such as 工技, which has a low confidence measure, inspection of the WWW pages that contain it reveals that it is used as an abbreviation of 工業技術 in such things as the titles of prefectural industrial research centres, e.g. the 工技ネット新潟 in Niigata. Similarly, 印電 resulted in one page where it clearly was used as an abbreviation of 印刷電信, but in all others the matches resulted from the juxtaposition of 印 (seal) and 電話番号 (telephone number) on forms.
In the cases of 財相, 最限 and 再制 in Table 3 the confidence measure was very high, although the number of hits was low. The result was skewed either to the strong or moderate classifications. On inspection the reasons for this become apparent:
Although only a relatively small number of cases has been examined in depth, it appears that provided the confidence measure is above about 0.60 and there is a reasonable representation of strongly classified hits, there is a good chance that a "new" word has been identified. At present this has only been tested for candidates with total hits in the thousands.
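This rule of thumb might be expressed as the following check. It is only a sketch: the 0.60 threshold comes from the discussion above, but the text gives no figure for a "reasonable representation" of strongly classified hits, so the `min_strong_share` parameter and its default value are assumptions chosen purely for illustration, as is the function name.

```python
def looks_like_new_word(strong, moderate, total,
                        min_confidence=0.60, min_strong_share=0.25):
    """Apply the rule of thumb to classification counts for one candidate.

    min_strong_share is a hypothetical stand-in for "a reasonable
    representation of strongly classified hits"; its value is not
    taken from the paper.
    """
    if total == 0:
        return False  # nothing to judge
    confidence = (strong + moderate) / total
    strong_share = strong / total
    return confidence > min_confidence and strong_share >= min_strong_share


# Invented counts for illustration only.
print(looks_like_new_word(strong=40, moderate=30, total=100))  # True
print(looks_like_new_word(strong=5, moderate=60, total=100))   # False
```

A candidate that passes such a check would still need its pages to be read to establish the meaning and context of the word.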
The situation with candidates with relatively low numbers of hits is far less clear. If the actual number of target pages is small, there is a risk of the results being influenced by input errors, spelling mistakes, etc. Also, the impact of having pages with related material, as was the case with 最限, becomes greater.
It is appropriate to question at this stage whether the line of investigation taken in this project is worthwhile. Of the original 8,000 candidates derived from four-kanji compounds, fewer than 1,500 meet the criteria of not already being in a lexicon and achieving a suitably large number of page hits. Of these, it is unlikely that more than 50% will result in a "new" word being lexicalized. As the validation usually requires reading several WWW pages to determine the meaning and context of the word, the overall process can be quite time-consuming.
If the purpose of the process is simply to expand the lexicon, there are probably easier and less time-consuming ways to do this, such as calling for donations of material from native speakers. However, as a method of detecting unrecorded abbreviations, it appears that the technique is worth applying to completion.
This project has demonstrated that it is possible to identify a number of Japanese abbreviations by synthesizing candidate abbreviations from longer compound words, then testing for their presence in WWW pages. A semi-automated process was developed which identified the candidates with a high likelihood of being either a valid abbreviation or a hitherto unrecorded neologism.
Breen, J.W., JMdict: a Japanese-Multilingual Dictionary, COLING-2004 Multilingual Linguistic Resources Workshop, Geneva, August 2004. Also: http://www.csse.monash.edu.au/~jwb/jmdictart.html
Breen, J.W. and Tokita, A., The WWW in Japan: a threat to cultural identity, or a domesticated system?, First International Conference on Cultures and Technologies in Asia, Mumbai, India, Feb 2004.
Keller, F. and Lapata, M., Using the Web to Obtain Frequencies for Unseen Bigrams, Computational Linguistics, Vol. 29, No. 3, September 2003, MIT Press.
Tsujimura, N. An Introduction to Japanese Linguistics, Blackwell, 1996.