Re: [rare] tag for obscure kanji?




Some were verb inflections (e.g. 飛び抜けよう got 17), some were another
word but with an honorific prefix.

湯煮 got 14 (未定義語), お湯煮 (名詞普通名詞) also got 14. It seems to
be a real word: http://www.pride-fish.jp/uekatsu/yuni.html

The second one seems to be a failure of automatic vocabulary acquisition.
鳫飛 (名詞普通名詞) got 17: a shogi move, apparently. It has its own
wikipedia page.
Jumandic includes some Wikipedia titles as dictionary words.
Btw, I am still strongly pushing to add JMdict to the automatic dictionary word induction pipeline as well,
but nobody is working on it at the moment.

But if you did want to trim the file size, you could filter out those
that both have a low frequency and are 未定義語?

This is a good idea!
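
A rough sketch of that filter, in case it helps. The file layout is an assumption: one entry per line with tab-separated surface form, part-of-speech label, and count. The names `Entry`, `parse_line`, `filter_entries`, and the `min_count` threshold are made up for illustration and are not part of any existing tool.

```python
from typing import Iterable, Iterator, NamedTuple

# Part-of-speech label for words the analyzer could not find in its dictionary.
UNDEFINED_POS = "未定義語"


class Entry(NamedTuple):
    surface: str
    pos: str
    count: int


def parse_line(line: str) -> Entry:
    """Parse one line of the assumed 'surface<TAB>pos<TAB>count' layout."""
    surface, pos, count = line.rstrip("\n").split("\t")
    return Entry(surface, pos, int(count))


def filter_entries(entries: Iterable[Entry], min_count: int = 15) -> Iterator[Entry]:
    """Drop entries that are BOTH undefined words AND below min_count.

    Low-frequency entries with a known part of speech are kept, and so are
    undefined words that are frequent enough.
    """
    for entry in entries:
        if entry.pos == UNDEFINED_POS and entry.count < min_count:
            continue
        yield entry


if __name__ == "__main__":
    sample = [
        "湯煮\t未定義語\t14\n",
        "お湯煮\t名詞普通名詞\t14\n",
        "鳫飛\t名詞普通名詞\t17\n",
    ]
    for kept in filter_entries(map(parse_line, sample), min_count=15):
        print(kept.surface, kept.pos, kept.count)
```

With a threshold of 15 this would drop 湯煮 even though it looks like a real word, so the cutoff trades file size against losing some genuine rare words.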


Update on n-gram status.

Unigrams were easy, but Apache Spark started to fail on 2-grams and higher.
I managed to compute 2-grams after some configuration changes, but the 3-gram job just doesn't finish; it times out on shuffles. Managing multi-TB datasets is difficult even with tools that seem to be specialized for the task...

Arseny