Re: [edict-jmdict] "P" Markers revisited



[Francis Bond ([edict-jmdict] "P" Markers revisited) writes:]
>> there is an interesting article in the journal of lexicography on an
>> equivalent of P-markers:

Yes, very interesting. No surprises there, I guess.

>> I could think of three ways of trying to improve the current ratings,
>> but as I don't use them myself, am not so motivated.
>> 
>> (a) we could do a corpus count (or web corpus count) and mark the most frequent
>>   - it is very hard to get a fully representative corpus.  Amano and
>> Kondo, for example found that in over ten years of newspaper text the
>> word 唐揚げ never appeared, although it is a very familiar word.

Every corpus will have some biases. Many are overly newspaper-oriented.

Apropos of 唐揚げ, the other kanji form - 空揚げ - does crop up in the
newspaper-based "wordfreq" list, but is not high enough ranked to get a P.
Doesn't score so well on Googits either.

>> (b) we could compare the vocabulary to the familiarity ratings in
>> 日本語語彙特性, and mark various bands of familiarity, although the IP issues
>> could be murky.
>> @Book{Goitokusei,
>>   author =	 "Shigeaki Amano and Tadahisa Kondo",
>>   title =	 "Nihongo-no Goi-Tokusei (Lexical properties of
>>                   Japanese)",
>>   publisher =	 "Sanseido",
>>   year =	 1999
>> }

IP on that would be messy. My "ichi" markings have the same problem.

>> (c) we should keep the current markers and correct individual ones
>> that appear to be wrong, preferably backing up our intuitions with
>> some kind of evidence (such as GPB "Ghits per billion documents"
>> http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html).
>> There is currently a way to add P-markers (spec1 and spec2: I don't
>> know the difference), but no way of removing them...  I am not sure if
>> it is worth adding another flag just to show this.

If we come up with a GPB metric, it could go in as a marker and join the
mix that generates the simple P markers. I have been using >= 1 MegaGoogits
as the trigger for assigning a "spec1". Obviously this threshold should rise
over time. I see Google has stopped stating the number of pages it indexes.
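
As a rough sketch of how the two measures relate (not the script I actually
use; the function names and the index size are placeholders, and the hit
counts are assumed to have been collected by hand or by some other means):

```python
# Hypothetical sketch: raw hit-count trigger vs. a normalised GPB figure.
# None of these names come from the EDICT/JMdict tools.

SPEC1_THRESHOLD = 1_000_000  # 1 MegaGoogit, the trigger described above


def qualifies_for_spec1(hits: int, threshold: int = SPEC1_THRESHOLD) -> bool:
    """Return True if the raw hit count meets the spec1 trigger."""
    return hits >= threshold


def ghits_per_billion(hits: int, index_size: int) -> float:
    """Normalise a hit count to hits per billion indexed documents.

    index_size is an assumption supplied by the caller, since the
    search engine no longer publishes the size of its index.
    """
    if index_size <= 0:
        raise ValueError("index_size must be positive")
    return hits * 1_000_000_000 / index_size


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    hits = 2_000_000
    assumed_index_size = 8_000_000_000
    print(qualifies_for_spec1(hits))
    print(ghits_per_billion(hits, assumed_index_size))
```

A fixed raw threshold drifts as the index grows, whereas the per-billion
figure stays comparable, provided some estimate of the index size is
available.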

>> I think IP problems rule out (b), if anyone has a big corpus then it
>> would be nice to try (a), but for the moment we should stick with (c).

Yes, I think it's the most accessible, and I have seen reports that the
word coverage is not too bad. Better than newspapers.

Cheers

Jim

-- 
Jim Breen                                http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology,               Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia                  Fax: +61 3 9905 5146
(Monash Provider No. 00008C)                ジム・ブリーン@モナシュ大学