Re: [rare] tag for obscure kanji?
Status update on n-grams.
I have computed 1-, 2-, 3- and 4-grams, with lower cutoffs as requested.
Sizes (gzipped):
1-grams: 46M (cutoff at 20)
2-grams: 2.4G (cutoff at 10)
3-grams: 18G (cutoff at 7)
4-grams: 43G (cutoff at 5)
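
For anyone unsure what a count cutoff means, here is a minimal sketch.
It is not the pipeline that produced the files above. It assumes that a
cutoff of k keeps only n-grams seen at least k times (whether the real
threshold is inclusive is an assumption), and the function name
count_ngrams and the toy tokens are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def count_ngrams(
    sentences: Iterable[Sequence[str]], n: int, cutoff: int
) -> Dict[Tuple[str, ...], int]:
    """Count n-grams over tokenized sentences and drop the rare ones.

    Assumption: an n-gram is kept when its count is >= cutoff.
    """
    counts: Counter = Counter()
    for tokens in sentences:
        # Slide a window of length n over each sentence; n-grams do not
        # cross sentence boundaries in this sketch.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= cutoff}


# Toy input: three already-tokenized "sentences".
toy = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
bigrams = count_ngrams(toy, n=2, cutoff=2)
```

With the toy input, ("b", "d") occurs only once and is dropped, while
("a", "b") and ("b", "c") survive. A lower cutoff keeps more of the rare
n-grams, so the output grows.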
Jim,
Jumandic treats conjugation differently from IPADic/Unidic (and that's
the reason why I chose to keep POS information; Jumandic morphemes make
sense only together with POS tags).
IMHO, Jumandic is more natural for language learners, and we try to
resolve most POS-level ambiguities during the analysis, while Unidic
leaves ambiguity in XXX−可能 tags. That's not being strict, in my opinion.
Well, I'm not saying that Jumandic is perfect. It is difficult to say
which approach is better; both have their pluses and minuses.
Arseny