
Re: [edict-jmdict] [rare] tag for obscure kanji?



First, thanks for preparing this data!

> From my experience with large noisy corpora, you really want to throw
> away the low frequency stuff -- it is mostly garbage. Signal-to-noise
> ratio there is very bad.

I'd also like to see the lower counts.

For the application at hand, taking a known word from a dictionary and
looking up its frequency, the noise does not matter so much, as we
never end up looking at it.
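
To illustrate what I mean, here is a rough sketch in Python. The counts,
the garbage tokens, and the names (corpus_counts, dictionary_words,
lookup_frequencies) are all made up for the example and are not taken
from the real data.

```python
from collections import Counter

# Hypothetical corpus counts. The last two entries stand in for the
# low-frequency garbage that a noisy corpus produces.
corpus_counts = Counter({
    "食べる": 120000,
    "躊躇": 850,
    "鬱蒼": 12,
    "ｗｗｗｗ": 3,
    "あ゛あ゛": 1,
})

# Hypothetical dictionary headwords we want frequencies for.
dictionary_words = ["食べる", "躊躇", "鬱蒼", "齟齬"]


def lookup_frequencies(words, counts):
    """Return the count for each known word, or 0 if the corpus never saw it."""
    return {word: counts.get(word, 0) for word in words}


freqs = lookup_frequencies(dictionary_words, corpus_counts)
for word, count in freqs.items():
    print(word, count)
```

Because the lookup is driven by the dictionary's word list, the garbage
entries are never consulted, while a rare but real word keeps its low
count instead of being cut off by a threshold.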

(Noise in the source means the tokenization might have gone wrong, which
introduces some error, but I don't see there is much you can do about that.)

> Ah, I forgot to mention, it's 10B unique sentences.

So duplicate sentences get thrown away?

How badly does that distort the frequency counts, I wonder?
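
A toy example of the effect I am wondering about, again in Python. The
sentences are invented and pre-split on spaces; I have no idea how the
real pipeline tokenizes or deduplicates, so treat this only as a sketch
of the general concern.

```python
from collections import Counter

# Invented, pre-tokenized sentences; the first one occurs three times.
sentences = [
    "よろしく お願い します",
    "よろしく お願い します",
    "よろしく お願い します",
    "辞書 を 引く",
]


def count_tokens(sents):
    """Count whitespace-separated tokens over an iterable of sentences."""
    counts = Counter()
    for sentence in sents:
        counts.update(sentence.split())
    return counts


raw_counts = count_tokens(sentences)          # every occurrence counted
unique_counts = count_tokens(set(sentences))  # duplicate sentences dropped

for token in ("よろしく", "辞書"):
    print(token, raw_counts[token], unique_counts[token])
```

In this toy case the words in the repeated set phrase lose two thirds of
their count, while words that only occur in one-off sentences are
unaffected. So I would guess the distortion mainly hits greetings and
other fixed expressions.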

Darren