
Re: [edict-jmdict] [rare] tag for obscure kanji?



Hi Arseny(?)

On Fri, 15 Mar 2019 at 14:31, eiennohito@gmail.com [edict-jmdict]
<edict-jmdict@yahoogroups.com> wrote:

> We have newer (and larger) 10B web corpus and I can compute n-gram statistics for it.
> The thing is I am not sure how to match JMDict entries with JUMAN/Juman++ analysis results.

Have you made an n-gram corpus from it using JUMAN/Jumandic or
MeCab/Unidic|IPADIC? That's really the starting point. With the old
Kyoto corpus, kindly supplied by Kurohashi-sensei, I followed the same
approach as was used for the Google one, the only difference being that
I used Unidic rather than IPADIC as the morpheme dictionary.

[This is an aside for people not working in Japanese NLP...
Making an n-gram corpus means taking a sentence like これは本です, breaking
it into morphemes, in this case これ は 本 です, and dividing it into
4 x 1-grams, 3 x 2-grams, 2 x 3-grams and 1 x 4-gram. All the n-grams
are then collated and counted, which is a very large processing task.
The result is a set of sorted files for the 1-grams, 2-grams, etc. with
their counts.

For my lookup system I squished the morphemes of the n-grams back
together and sorted them as text strings. Thus when I look up 本です in
the Kyoto corpus it returns 49039, which means the 2-gram 本 + です
occurred that many times in the corpus.]
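
For anyone who wants to see the aside above in concrete terms, here is a
minimal Python sketch of the idea. It is not the actual processing
pipeline: it assumes the sentences have already been segmented by a
morphological analyser, and the names ngrams, count_ngrams and the two
toy sentences are made up for illustration.

```python
from collections import Counter

def ngrams(morphemes, max_n=4):
    """Yield every contiguous n-gram of a morpheme list, for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(morphemes) - n + 1):
            yield tuple(morphemes[i:i + n])

def count_ngrams(segmented_sentences, max_n=4):
    """Collate and count the n-grams of all sentences."""
    counts = Counter()
    for morphemes in segmented_sentences:
        counts.update(ngrams(morphemes, max_n))
    return counts

# Toy input, already segmented (hypothetical data, not from any real corpus).
sentences = [
    ["これ", "は", "本", "です"],
    ["それ", "は", "本", "です"],
]
counts = count_ngrams(sentences)

# "Squish" each n-gram back into a plain string and sort, so that a
# lookup is a simple text-string match.
squished = sorted(("".join(gram), n) for gram, n in counts.items())
for text, n in squished:
    print(text, n)
```

A four-morpheme sentence yields 4 + 3 + 2 + 1 = 10 n-grams, and in this
toy data the squished string 本です ends up with a count of 2.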

> JMDict is not a morphological analysis dictionary as Unidic/Jumandic/IPAdic are and there are multiple many-to-many matchings possible. Phrases are problematic as well. We also don't really do disambiguation of hiragana.

The lookup approach I use gets around that. If I look up サルモネラ食中毒 (282) there
is really only one "サルモネラ食中毒" string to match against. It probably came from
the 3-gram サルモネラ + 食 + 中毒. Usually the match is one-to-one. (I sometimes
see multiple counts with the Google corpus, as IPADIC tended to mangle 複合動詞
(compound verbs), leading to different morpheme segmentations.)
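
To illustrate how one headword can come back with more than one count,
here is a rough Python sketch. It is only a guess at the shape of such a
lookup, not my actual system: the counts and the function names
build_surface_index and lookup are invented for the example.

```python
from collections import defaultdict

# Toy n-gram counts, invented for illustration only.
toy_counts = {
    ("サルモネラ", "食", "中毒"): 5,
    ("食べ", "始める"): 7,    # compound verb split into two morphemes
    ("食べ始める",): 3,       # same surface string as a single morpheme
}

def build_surface_index(ngram_counts):
    """Group n-grams by their squished surface string."""
    index = defaultdict(list)
    for gram, count in ngram_counts.items():
        index["".join(gram)].append((gram, count))
    return index

def lookup(index, headword):
    """Return all (segmentation, count) pairs whose surface form matches."""
    return sorted(index.get(headword, []), key=lambda item: -item[1])

index = build_surface_index(toy_counts)
print(lookup(index, "サルモネラ食中毒"))  # one segmentation, one count
print(lookup(index, "食べ始める"))        # two segmentations, two counts
```

Whether multiple counts for one string should be summed or reported
separately is a design choice; the sketch just returns them all.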

Anyway, I'm very interested to hear more about your corpus.

Cheers

Jim
--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/