
Re: Wishlist item (Re: [edict-jmdict] "P" Markers - Google as corpus?



OK now I understand (I think) what Stutzman Kale was saying.  Not 16,000 characters... but 16,000 EDICT entries containing non-joyo kanji.  Correct me if I'm wrong.

Wouldn't you also want to test, in a similar manner, variants of words written entirely in joyo kanji?  Surely many people write even joyo-only compounds in a variety of forms depending on their education level, personal preference, or an attempt to seem younger than they are.  I imagine that for many words the number of kanji written in kana instead would be a rough indication of the grade level achieved.  In other cases, maybe there are even regional preferences.

It sounds to me, though, like all of these ideas would begin to explode the number of Google pings needed to fully assess a new entry.  How does that impact the idea?
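
To put a rough number on that worry, here is a back-of-the-envelope sketch. It assumes (my assumption, not anything established in this thread) that every kanji in a word can independently be kept or swapped for kana, which gives an upper bound on the forms to query. The function names are made up for illustration.

```python
def variant_count(num_kanji, forms_per_kanji=2):
    """Upper bound on written forms of one word.

    Assumes each kanji is independently kept or replaced:
    forms_per_kanji=2 means kanji or hiragana,
    forms_per_kanji=3 means kanji, hiragana, or katakana.
    """
    return forms_per_kanji ** num_kanji


def total_queries(kanji_counts, forms_per_kanji=2):
    """Total search queries for a batch of words, one query per form.

    kanji_counts holds the number of kanji in each word.
    """
    return sum(variant_count(k, forms_per_kanji) for k in kanji_counts)


# A two-kanji compound has at most 4 kanji/hiragana forms,
# a four-kanji compound at most 16.
print(variant_count(2), variant_count(4))
```

So the growth is exponential in the number of kanji per word, but most entries are short, and limiting substitution to the non-joyo kanji keeps the exponent small.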



On Jan 23, 2007, at 6:43 AM, Jim Breen wrote:

>> 2. Alternative readings. There are 16000 or so non-joyo
>> characters in the edict file. This should be replaced
>> with hiragana (or katakana) to see what form is the most
>> widely used. Although that's not really too many entries
>> I would like to find a way to automate it. Ideas?

It would be a very nice little project to go through the entries
with non-Jouyou kanji, add the kana substitutes, then establish
via a search engine whether those constructs were actually used, and
if so how often. The substitution process is non-trivial, as Jim
Rose describes, but it might be possible to come up with several
alternatives and test each of them.
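
A minimal sketch of the "several alternatives" step, under some loud assumptions: the reading has already been aligned to each character (the non-trivial part, which this does not solve), the Jouyou set is passed in by the caller, and `hit_count` is a hypothetical caller-supplied function standing in for whatever search engine is used. None of these names come from EDICT or any real API.

```python
from itertools import product


def is_kanji(ch):
    """True if ch lies in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def kana_variants(segments, joyo):
    """Generate candidate written forms for one entry.

    segments: list of (surface, reading) pairs, already aligned by hand
              or by some other tool (assumed, not done here).
    joyo:     set of Jouyou kanji; anything outside it may be swapped.
    Each non-Jouyou kanji is either kept or replaced by its reading.
    """
    options = []
    for surface, reading in segments:
        if len(surface) == 1 and is_kanji(surface) and surface not in joyo:
            options.append((surface, reading))  # keep kanji, or use kana
        else:
            options.append((surface,))          # leave unchanged
    return ["".join(choice) for choice in product(*options)]


def rank_by_hits(variants, hit_count):
    """Order variants by a caller-supplied hit_count(form) -> int,
    most frequent first. hit_count is a placeholder for a real search."""
    return sorted(((v, hit_count(v)) for v in variants),
                  key=lambda pair: pair[1], reverse=True)


# 林 is Jouyou, 檎 is not, so only 檎 gets a kana alternative.
print(kana_variants([("林", "りん"), ("檎", "ご")], joyo={"林"}))
```

Katakana alternatives, or swapping the whole word to kana, would just be extra options in the same tuple for each segment.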