
Re: [edict-jmdict] Mining the Tanaka corpus for more priority entries



On 2015-06-12 01:55, Jim Breen jimbreen@gmail.com [edict-jmdict] wrote:
Are you the "Luce" who has been making some very
useful edits to the dictionary recently?

Yes, that would be me, although I consider my edits mostly just housekeeping.

The Tanaka-based analysis looks very interesting and
promising. It's picked up many useful entries such as
すぐに which should be given priority tags. Your suggestion
of 10 or more occurrences warranting a priority tag looks
pretty good - it's just as valid as some of the other methods
used. We can also verify them against the Google/KM
n-grams; in fact I've been meaning to do a bulk match
of JMdict entries with those sets of counts to see if any
obvious ones are missed.

The list likely has a few entries that should be removed.
ジル ('Jill') is the only one that has caught my eye.

The good thing about the corpus, though, is that the words are (mostly) supplied with their readings as well. You can't, for example, catch 名(めい) via the n-gram method, since raw n-gram counts don't distinguish between readings.
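
For anyone who wants to reproduce this kind of count, here is a minimal sketch in Python. It is not the script used to build the list. It assumes the corpus index is available as "B:" lines whose space-separated tokens look like headword(reading)[sense]{surface}~, with every part after the headword optional. The function names and the regular expression are made up for illustration, and the threshold of 10 is just the cutoff discussed above.

```python
import re
from collections import Counter

# Assumed token shape in a "B:" index line: headword, then optional
# (reading), [sense number], {surface form} and a trailing "~" marker.
TOKEN_RE = re.compile(
    r"^(?P<head>[^(\[{~]+)"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[\d+\])?"
    r"(?:\{[^}]*\})?"
    r"~?$"
)


def count_entries(lines):
    """Count (headword, reading) pairs over all "B:" lines.

    The reading is an empty string when the token does not supply one.
    """
    counts = Counter()
    for line in lines:
        if not line.startswith("B: "):
            continue  # skip the "A:" sentence lines and anything else
        for token in line[3:].split():
            m = TOKEN_RE.match(token)
            if m:
                counts[(m.group("head"), m.group("reading") or "")] += 1
    return counts


def priority_candidates(counts, threshold=10):
    """Return the pairs seen at least `threshold` times, sorted."""
    return sorted(key for key, n in counts.items() if n >= threshold)
```

Keying the counter on the (headword, reading) pair rather than on the headword alone is what keeps 名(めい) separate from other readings of 名. Proper names such as ジル would still need to be weeded out by hand.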

I'll look into this more over coming days.

Thanks.

Thanks for the useful contribution. (I recommend others
look at the list linked from the posting.)

I recommend downloading the list somewhere safe; the pastebin site's expiration policies felt quite unclear to me.