Re: Wishlist item (Re: [edict-jmdict] "P" Markers - Google as corpus?
This seems to be turning into a conversation about
database-coding techniques/ethics, so maybe you two should
start a new subject line. As for the Google freq count
thing, let me just bounce some ideas off you guys (and
gals, if there are any sneaking around out back) and ask
some questions.
Where to go next:
1. Frequency counts for all the hiragana readings as well.
There is probably no way to check for false hits other
than by hand, which defeats the purpose. Is this a big
problem?
2. Alternative readings. There are 16000 or so non-joyo
characters in the edict file. These should be replaced
with hiragana (or katakana) to see which form is the most
widely used. Although that's not really too many entries,
I would like to find a way to automate it. Ideas?
3. Create a metric for the verb freq counts. If it were
up to me, since none of this is very exact anyway, I
would do a test run with a small sample of verbs, search
through all the permutations that generate a good
quantity of hits, average the resulting ratio, and apply
it to all the remaining verbs' hit counts. But I'm not
very "scientific" minded. Would this be worth doing for
every verb?
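
To make item 3 a bit more concrete, here is a rough Python
sketch of the sampling idea. It assumes "ratio" means the
total hits over all searched forms of a verb divided by the
hits for its dictionary form alone; that reading is my own
guess, and the function names and every number below are
made up for illustration, not real Google counts.

```python
from statistics import mean


def estimate_scale(sample):
    """Average the (all forms / dictionary form) hit ratio over a sample.

    `sample` maps each sampled verb to a pair
    (dictionary_form_hits, all_forms_hits). Verbs with zero
    dictionary-form hits are skipped to avoid dividing by zero.
    """
    ratios = [total / base for base, total in sample.values() if base > 0]
    if not ratios:
        raise ValueError("sample has no verbs with a non-zero hit count")
    return mean(ratios)


def apply_scale(base_hits, scale):
    """Scale the dictionary-form hit count of every remaining verb."""
    return {verb: round(hits * scale) for verb, hits in base_hits.items()}


# Hypothetical hit counts, invented purely for illustration.
sample = {
    "taberu": (100, 300),
    "nomu": (200, 400),
}
scale = estimate_scale(sample)
estimated = apply_scale({"hashiru": 40}, scale)
```

Looking at how widely the per-verb ratios in the sample
spread around the average would give some idea of whether
one number is good enough, or whether every verb really
needs its own full set of searches.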
If there are other things I should be considering, please
tell me. Thanks.
-Kale