
File of surface form variants



As some of you know, I have been using the MeCab morphological
analyzer in combination with the "UniDic" morphological lexicon
in my research on extracting lexemes in Japanese. One of the
great things about UniDic is the way it has a very wide coverage
of variant ways of writing Japanese words (AKA "surface forms").
It maps them back to single base forms. The variants include
alternative kanji, 交ぜ書き (mixed kanji/kana spellings) and
various mixtures with katakana.

Anyway, a while back I pushed the whole of UniDic (about 700k
entries) through a bit of sorting and filtering to see how many
of these variants belong to words that have current JMdict/EDICT
entries, but aren't yet recorded in those entries. The answer
was ~19k.
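
For anyone wanting to try something similar, here is a rough
sketch of that kind of filtering. It is not the actual scripts
used; the function name, the tuple layouts for the UniDic rows
and the JMdict entries, and the sample rows are all assumptions
made for illustration, and readings are assumed to be already
normalised to hiragana on both sides.

```python
from collections import defaultdict


def find_missing_variants(unidic_rows, jmdict_entries):
    """Return sorted (known_form, reading, missing_variant) tuples.

    Both input layouts are hypothetical, chosen for this sketch:
      unidic_rows:    iterable of (surface, reading, lexeme_id)
      jmdict_entries: iterable of (set_of_written_forms, reading)
    """
    # Group every UniDic surface form under its lexeme and reading.
    variants = defaultdict(set)
    for surface, reading, lexeme_id in unidic_rows:
        variants[(lexeme_id, reading)].add(surface)

    # Every (written form, reading) pair the dictionary already has.
    known = {(form, reading)
             for forms, reading in jmdict_entries
             for form in forms}

    candidates = []
    for (_, reading), surfaces in variants.items():
        present = {s for s in surfaces if (s, reading) in known}
        missing = surfaces - present
        # Keep only lexemes the dictionary already covers in part.
        if present and missing:
            anchor = min(present)
            candidates.extend((anchor, reading, s) for s in missing)
    return sorted(candidates)


# Sample data: lexeme 2 is a made-up row with no dictionary entry,
# so it should be skipped.
rows = [
    ("うら寂しい", "うらさびしい", 1),
    ("うら淋しい", "うらさびしい", 1),
    ("未登録", "みとうろく", 2),
]
entries = [({"心寂しい", "うら寂しい", "心淋しい"}, "うらさびしい")]

result = find_missing_variants(rows, entries)
for known_form, reading, variant in result:
    print(f"{known_form} [{reading}] /{variant}/")
# prints: うら寂しい [うらさびしい] /うら淋しい/
```

Matching on the (form, reading) pair rather than the form alone
is a design choice in this sketch; it avoids linking a variant to
an unrelated entry that happens to share a spelling.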

I have put this file at
http://www.csse.monash.edu.au/~jwb/unidic_updates.html

Many of these could be added to JMdict. For example, the
first: うら寂しい [うらさびしい] /うら淋しい/
is quite plausible, as we already have:
心寂しい; うら寂しい; 心淋しい 【うらさびしい】......
and うら淋しい gets 20M hits, more than some of the forms we
already have.

Of course I haven't any idea what to do with the 19k.
It's all a bit indigestible. I could turn it into a series of pages
with "update" links.

Any suggestions welcome.

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne