Re: [edict-jmdict] "P" Markers - Google as corpus?
[Stutzman Kale (Re: [edict-jmdict] "P" Markers - Google as corpus?) writes:]
>> --- Jim Breen <Jim.Breen@infotech.monash.edu.au> wrote:
>> > We need some agreed mechanism for using/weighting
>> > metrics to come up
>> > with "common" words. The one I have used for the
>> > "P"s is quite crude.
>> > Take good old ほうれんそう, a word I learned very
>> > early in Japanese.
>> > At present it has an entry with 菠薐草, 法蓮草 and
>> > ほうれん草, with
>> > the 菠薐草 version tagged as "ichi1", thus getting
>> > it a P. The major
>> > dictionaries ONLY have the 菠薐草 version. ....
[...]
>> To solve this we could go through the file and replace any
>> 常用以外漢字 with hiragana, search and compare -- although
>> someone would have to teach me a more powerful method of
>> automation, other than Search/Replace. But it seems
>> pretty reasonable that "常用漢字 + kana" variants
>> would be used more often.
Well, it would be a nice little project for someone to:
- extract the entries with kanji parts containing 常用以外漢字;
- replace those kanji with kana (not trivial, what with 連濁, つ->っ, etc.);
- check some corpus, e.g. the WWW, to make sure those modified versions
  are actually used;
- where there is evidence, add them in as alternative headwords (provided
  they still have >= 1 kanji).
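
As a rough illustration of the first two steps and the ">= 1 kanji" condition, here is a minimal Python sketch. It is not an existing JMdict tool. It assumes that someone supplies the set of 常用漢字 (the `joyo` argument) and that each headword has already been split into (surface, reading) segments aligned with its actual reading. Taking the kana from the word's own reading is what sidesteps the 連濁 and つ->っ problem, but producing that alignment is the hard part and is not attempted here. The function names are hypothetical.

```python
def is_kanji(ch):
    # Assumption: only the basic CJK Unified Ideographs block is checked;
    # the iteration mark 々 and the extension blocks are ignored.
    return '\u4e00' <= ch <= '\u9fff'


def non_joyo_kanji(text, joyo):
    """Return the kanji in `text` that are not in the supplied `joyo` set."""
    return [ch for ch in text if is_kanji(ch) and ch not in joyo]


def kana_variant(segments, joyo):
    """Build a mixed kanji/kana candidate headword.

    `segments` is a list of (surface, reading) pairs, assumed to be
    aligned beforehand (by hand or by some other tool). Any segment
    containing a kanji outside `joyo` is replaced by its reading.
    Returns None if the result is unchanged or has no kanji left.
    """
    original = ''.join(surface for surface, _ in segments)
    variant = ''.join(
        reading if non_joyo_kanji(surface, joyo) else surface
        for surface, reading in segments
    )
    if variant == original or not any(is_kanji(ch) for ch in variant):
        return None
    return variant


# Example from this thread; the joyo set is cut down to one character here.
candidate = kana_variant([("菠薐", "ほうれん"), ("草", "そう")], {"草"})
print(candidate)
```

Each candidate produced this way would still need to be checked against a corpus before being added as a headword.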
>> > So should ほうれん草 be promoted to pride-of-place
>> > at the front of
>> > the ほうれんそう headwords? Should the official
>> > 菠薐草 be stripped of
>> > its "P"? What should be the overall policy?
>>
>> That's a tough one. I think it would be interesting if
>> the relative frequency information could be used to make
>> the most commonly seen variant first, because 1. It would
>> be less confusing for beginners and more informative for
>> everyone else 2. It would be (yet another) service wwwjdic
>> offers that other dictionaries don't.
I think that is a useful policy. My only concern is finding a really
appropriate relative frequency metric. Newspapers tend to the formal side
and the WWW is possibly biased in the colloquial direction.
Certainly JMdict/EDICT tends to be far stronger on including mixed
kanji/kana variants than commercial dictionaries.
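
One possible way to soften the bias of any single corpus, sketched in Python below, would be to convert each corpus's raw counts into within-corpus shares and then combine the shares with agreed weights. This is only an assumption about how such a metric might look, not an agreed policy. The function name, the corpus labels and the weights are all hypothetical, and the counts would have to come from real corpus queries.

```python
def rank_variants(counts_by_corpus, weights):
    """Order headword variants by a weighted relative frequency.

    counts_by_corpus: {corpus_name: {variant: raw_count}}
    weights:          {corpus_name: weight}; a missing corpus gets 0.
    Each corpus contributes the variant's share of that corpus's total,
    so a very large corpus does not swamp a small one by size alone.
    """
    scores = {}
    for corpus, counts in counts_by_corpus.items():
        total = sum(counts.values())
        if total == 0:
            continue  # no evidence from this corpus
        weight = weights.get(corpus, 0.0)
        for variant, count in counts.items():
            scores[variant] = scores.get(variant, 0.0) + weight * count / total
    # Highest combined score first; ties keep their first-seen order.
    return sorted(scores, key=lambda v: scores[v], reverse=True)
```

The first variant in the returned list would be the candidate for pride of place among the headwords. Choosing the weights is exactly the open policy question.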
Jim
--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology, Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学