
JMdict priority tags



Let me state at the outset: They are a mess.

The current tags are the accumulation over the last 15 years or so
of attempts to identify entries which are more common than others, and
which could identify a subset of approx. 20k entries of "common" terms,
approximating the size of a small-to-medium Japanese dictionary.

The current tags have come from wordlists, newspaper-mining, etc.
but there are many anomalies. It's really time to clean them up. We have
discussed possible better methods for identifying common terms in the
past without coming up with any good solutions. There are a few tasks
still hanging, such as cleaning up the ~1500 "spec2" entries, and Luce's
list of terms common to the Tatoeba sentences.

Last year I processed JMdict against the Google n-gram collection
and got the frequencies of all the surface forms and readings. I didn't
complete the analysis, and I have just been having a quick look at it.
The n-grams give a really good indication of what Japanese terms
are common or not. There's quite a bit of massaging needed - you need
to take into account things like multiple surface forms, usually-kana
terms, verb inflections, etc. (see the counts for 分かる below), but on
the whole I think using them as the basis for tagging would not be a bad
idea. In fact it may be a good idea to throw away some or all of the
present tags and start again.

Anyway, my quick look at the n-gram frequencies of JMdict terms shows
that if we picked a count of 250,000 as a threshold, there'd be about 26k
entries tagged as common. Interestingly, the 26k would only include about
16k of terms currently tagged, i.e. about 10k terms not currently tagged
would get them, and 7k currently tagged terms would lose them. That's
quite a change.
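
As a rough illustration of the comparison described above, here is a
minimal Python sketch. The function name, the entry ids and the toy counts
are all made up for the example; it is not the actual analysis, just the
set arithmetic behind the "kept / gained / lost" figures.

```python
# Hypothetical sketch: compare the current "common" tags with a
# threshold-based tagging. All names and sample data are illustrative.
THRESHOLD = 250_000  # the n-gram count threshold discussed above


def compare_tagging(ngram_counts, currently_tagged, threshold=THRESHOLD):
    """Split entries into kept / gained / lost under the new threshold."""
    proposed = {e for e, c in ngram_counts.items() if c >= threshold}
    return {
        "kept": proposed & currently_tagged,    # tagged now, still tagged
        "gained": proposed - currently_tagged,  # newly tagged
        "lost": currently_tagged - proposed,    # tag would be dropped
    }


# Toy data: entry ids mapped to made-up n-gram counts.
toy_counts = {"A": 900_000, "B": 300_000, "C": 120_000, "D": 5_000}
toy_tagged = {"A", "C"}

result = compare_tagging(toy_counts, toy_tagged)
for label in ("kept", "gained", "lost"):
    print(label, sorted(result[label]))
```

With real data, the sizes of the three sets would correspond to the
16k / 10k / 7k figures quoted above.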

Anyway, what I want to do is open this up for discussion, given that we
have access to what seems like a good set of data for frequencies, and
the tools to change the database without too much drama.

Looking forward to comments.

Jim

Appendix: Here are some counts for inflections of 分かる, just to
show that you can't rely solely on the plain forms of verbs and adjectives.

分かる            11513812
分かります         3114743
分かった           3502335
分かりました       1318792
分かって           5024600
分かりまして          9744
分かからなくて      405419
分からないで         54562
分かりませんで        2470
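
To make the point concrete, the counts above can be summed per entry. The
following Python sketch uses only the nine forms listed (not a complete
set of inflections, and only the 分かる spelling); the variable and
function names are just illustrative.

```python
# Counts copied from the list above; only these nine forms are covered.
WAKARU_COUNTS = {
    "分かる": 11513812,
    "分かります": 3114743,
    "分かった": 3502335,
    "分かりました": 1318792,
    "分かって": 5024600,
    "分かりまして": 9744,
    "分からなくて": 405419,
    "分からないで": 54562,
    "分かりませんで": 2470,
}


def aggregate(counts):
    """Total n-gram count over all listed inflected forms of one entry."""
    return sum(counts.values())


total = aggregate(WAKARU_COUNTS)
plain_share = WAKARU_COUNTS["分かる"] / total
print(total, round(plain_share, 3))
```

Even over this partial list, the plain form accounts for less than half
of the total, so a threshold applied to plain forms alone would
undercount verbs and adjectives.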

-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University