Priority tags revisited
A regular topic of discussion on this list, at least some years ago,
was the tagging in the JMdict database that leads to the allocation
of a "P" (priority/popular/common/etc.) tag in the EDICT editions. The
current tags came from a collection of sources, and some are a bit
shaky.
I was discussing a related matter with Charles Kelly a couple of days ago
and it occurred to me that the n-gram data I use on the server at UofMelb,
both Kyoto/Melbourne and Google, could be matched with the entries and
used as an indicator of the commonness of terms.
While the n-gram files are huge (36GB and 45GB), it was very simple to
scan them, pulling out the lines with high counts. For example, the KM n-gram
file has about 90k strings with counts above 15000. Many are useless
for matching against entries (を率いて, なかった方, etc.), but about 19k
match headwords of entries. What is interesting is that of these 19k, only
about 12k of those entries had "P" tags. Some of the other 7k came from
the match picking up less-common readings, etc., but there are plenty of
common terms there (まだまだ, 思われる, サーバー, etc.) which probably
deserve a "P" more than those that actually have them.
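
For anyone who wants to try the same kind of scan, here is a minimal sketch
in Python. It is not the script I ran. It assumes each n-gram line is
"string<TAB>count" (the real files may be laid out differently), and that
the headwords are available as a mapping from headword to whether the entry
has a "P" tag. The function names and that mapping are made up for
illustration.

```python
from typing import Dict, Iterable, List, Tuple

THRESHOLD = 15000  # the cut-off used in the example above


def high_count_strings(lines: Iterable[str],
                       threshold: int = THRESHOLD) -> Dict[str, int]:
    """Keep only the n-gram strings whose count is above the threshold.

    Assumed line format: "<string>\t<count>". Lines that do not fit
    that shape are skipped.
    """
    kept: Dict[str, int] = {}
    for line in lines:
        text, sep, count = line.rstrip("\n").rpartition("\t")
        if not sep:
            continue  # no tab, so not a "<string>\t<count>" line
        try:
            n = int(count)
        except ValueError:
            continue  # count field is not a number
        if n > threshold:
            # If a string appears more than once, keep its highest count.
            kept[text] = max(n, kept.get(text, 0))
    return kept


def split_by_p_tag(ngrams: Dict[str, int],
                   headwords: Dict[str, bool]
                   ) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Match high-count strings against headwords.

    `headwords` maps headword -> True if the entry has a "P" tag
    (a hypothetical structure, built however suits your copy of the data).
    Returns (tagged, untagged), each sorted by descending count.
    Strings that match no headword are dropped.
    """
    tagged: List[Tuple[str, int]] = []
    untagged: List[Tuple[str, int]] = []
    for text, n in ngrams.items():
        if text not in headwords:
            continue
        (tagged if headwords[text] else untagged).append((text, n))
    tagged.sort(key=lambda item: -item[1])
    untagged.sort(key=lambda item: -item[1])
    return tagged, untagged
```

Since the files are so large, passing an open file object as `lines` keeps
the scan streaming, so only the high-count strings are held in memory. The
`untagged` list corresponds to the 7k or so candidates mentioned above, and
would still need checking by hand.
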
I don't have the time or energy at the moment to do much with this
information. While it would be possible to add some tags based on this
information (perhaps a new tag category?), some evaluation might be
needed. For example, レンタルサーバー and カスタマーレビュー have quite
high n-gram counts, but this is probably due to that data being from WWW
pages. I doubt レンタルサーバー is really that common in everyday Japanese.
Food for thought.
Jim
--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University