
Re: [edict-jmdict] [rare] tag for obscure kanji?



> Here are morpheme unigrams for 3B sentences, cut at 10, to give a feel for
> what is there at the bottom.
> https://tulip.kuee.kyoto-u.ac.jp/ngrams/3B/unigrams.gz

I thought it was an interesting mix of real words, place names, typos
and foreign words (though the boundary between Chinese and rare
Japanese words gets a bit blurry).

Some were verb inflections (e.g. 飛び抜けよう got 17), and some were
existing words with an honorific prefix added.

湯煮 got 14 (未定義語), お湯煮 (名詞普通名詞) also got 14. It seems to
be a real word: http://www.pride-fish.jp/uekatsu/yuni.html

鳫飛 (名詞普通名詞) got 17: a shogi move, apparently. It has its own
Wikipedia page.

But if you did want to trim the file size, you could filter out entries
that both have a low frequency and are tagged 未定義語?
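
A rough sketch of that filter, in case it helps. I have not checked the
actual layout of unigrams.gz, so this assumes one entry per line as
surface<TAB>part-of-speech<TAB>count; the function names and the
threshold of 20 are just placeholders of mine, not anything from the
corpus tools.

```python
import gzip

UNDEFINED = "未定義語"  # POS label used for tokens not in the dictionary


def keep_entry(pos, count, min_count=20):
    """Drop an entry only when it is BOTH rare and tagged as undefined."""
    return not (count < min_count and pos == UNDEFINED)


def filter_unigrams(lines, min_count=20):
    """Yield the lines that survive the filter.

    Assumed (hypothetical) line format: surface<TAB>pos<TAB>count.
    Lines that do not match are passed through untouched.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            yield line
            continue
        _surface, pos, count_text = fields
        try:
            count = int(count_text)
        except ValueError:
            yield line  # count column is not a number; leave the line alone
            continue
        if keep_entry(pos, count, min_count):
            yield line


def filter_file(src_path, dst_path, min_count=20):
    """Read a gzipped unigram file and write a trimmed gzipped copy."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        dst.writelines(filter_unigrams(src, min_count))
```

With that threshold, 湯煮 (未定義語, 14) would be dropped, while お湯煮 and
鳫飛 would stay because they have a real part-of-speech tag. Rare but
genuine words that happen to be 未定義語 would be lost too, so the
threshold would need some care.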

Darren