
Re: [edict-jmdict] [rare] tag for obscure kanji?



Hi!

The additional data compresses really well!
I also think that the POS and dictionary-form information would help you match conjugated words better. It would also give you some handle on orthographic variants (e.g. 美味しい vs 美味し〜い -- Juman++ normalizes those in most cases).
And throwing the data away (ignoring everything except the first field) is easier than adding it later if it wasn't there in the first place.
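
A minimal sketch of the "ignore everything except the first field" idea, assuming one record per line with tab-separated fields (the actual layout of the data is not specified in this message, and `first_field_only` is a hypothetical helper name):

```python
def first_field_only(lines, sep="\t"):
    """Yield only the first field of each non-empty record.

    Assumption: fields are separated by `sep` and the surface form
    comes first; POS, dictionary form, etc. are simply dropped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line:
            # maxsplit=1 avoids splitting the fields we discard anyway
            yield line.split(sep, 1)[0]


records = ["美味しい\t形容詞\t美味しい\n", "\n", "食べた\t動詞\t食べる\n"]
surfaces = list(first_field_only(records))
```

Going the other way (recovering POS or dictionary forms from surface strings alone) would need a full morphological analysis pass, which is the asymmetry mentioned above.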

In my experience with large noisy corpora, you really want to throw away the low-frequency stuff -- it is mostly garbage. The signal-to-noise ratio there is very bad.
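
As an illustration, a frequency cutoff could look roughly like this. It is a sketch only: the threshold `min_count=5` is an arbitrary placeholder rather than a recommended value, and `drop_rare` is a hypothetical name.

```python
from collections import Counter


def drop_rare(items, min_count=5):
    """Count items and keep only those seen at least `min_count` times."""
    counts = Counter(items)
    return {item: n for item, n in counts.items() if n >= min_count}


observed = ["食べる", "食べる", "食べる", "たべr", "美味しい", "美味しい"]
kept = drop_rare(observed, min_count=2)
```

For a corpus that does not fit in memory, the counting step would have to be done out of core (for example by sorting and counting in batches), but the filtering logic stays the same.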

Ah, I forgot to mention, it's 10B unique sentences.

About getting the data: it would be OK for you to download it with a rate limit (2MB/s is fine). It may take a couple of days, but that should not be too bad.

Arseny