
Re: [rare] tag for obscure kanji?




Status update on n-grams.

I have computed 1-, 2-, 3-, and 4-grams, with lower cutoffs as requested.

Sizes (gzipped):
1-grams: 46M (cutoff at 20)
2-grams: 2.4G (cutoff at 10)
3-grams: 18G (cutoff at 7)
4-grams: 43G (cutoff at 5)
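
To make the cutoff concrete, here is a minimal sketch of n-gram counting with a frequency cutoff. It is only an illustration, not the pipeline used to produce the files above: the function name `count_ngrams` is hypothetical, the input is assumed to be already tokenized sentences, and "cutoff at N" is assumed to mean that n-grams seen fewer than N times are dropped.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def count_ngrams(
    sentences: Iterable[Sequence[str]], n: int, cutoff: int
) -> Dict[Tuple[str, ...], int]:
    """Count n-grams over tokenized sentences and drop the rare ones.

    Assumption: an n-gram is kept when its count is >= cutoff.
    """
    counts: Counter = Counter()
    for tokens in sentences:
        # Slide a window of length n over the sentence;
        # n-grams never cross sentence boundaries.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= cutoff}


# Toy input; real tokens would be morphemes, possibly with POS tags attached.
toy = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
bigrams = count_ngrams(toy, n=2, cutoff=2)
```

With this toy input, the bigram ("b", "d") occurs only once and is removed by the cutoff of 2, while ("a", "b") and ("b", "c") are kept.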


Jim,

Jumandic treats conjugation differently from IPADic/Unidic, and that is why I chose to keep the POS information: Jumandic morphemes make sense only together with their POS tags.

IMHO, Jumandic is more natural for language learners, and we try to resolve most POS-level ambiguities during the analysis, while Unidic leaves ambiguity in its XXX−可能 tags. In my opinion, leaving that ambiguity unresolved is not being strict.

Well, I'm not saying that Jumandic is perfect. It is difficult to say which approach is better; both have their pluses and minuses.


Arseny