Re: [edict-jmdict] Why two entries for 先生 in JMdict.xml?
On 28 February 2013 22:08, Jean-Christian Imbeault
<jc.imbeault@gmail.com> wrote:
> Was testing out my JMdict-based dictionary today and was surprised when looking up 先生 that it
> came up with an entry for シーセン. I checked the JMdict and there are two entries for 先生.
>
> Just curious but why aren't the two entries merged into one with multiple senses and
> readings? Perhaps if I understand the reasoning I can build my dictionary in a better way.
Entries are merged if they can be regarded as the same word. The rule that's
applied with JMdict when deciding whether a merge should happen is one I
documented nearly 9 years ago as the "2-out-of-3 rule". It works like this:
- we treat each entry as a triple consisting of <kanji-form, reading, meaning>
- if 2 of the 3 members of the triple are the same, merge them. If only one is
the same, don't. (If all 3 are the same, they are already one entry.)
So 川柳/せんりゅう/comic haiku is not merged with 川柳/かわやなぎ/riverside willow
because it's only 1/3.
合気道/あいきどう/aikido IS merged with 合氣道/あいきどう/aikido because it's 2/3.
The two 先生 entries are only 1/3 (the kanji part). The readings and meanings are
quite different.
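
As a rough illustration only, the rule can be sketched in Python. The names Entry, shared_members and should_merge are made up for this example and are not part of JMdict or any of its tools. The sketch also assumes each entry has exactly one kanji form, one reading and one meaning, and that "the same" means an exact string match; real entries are messier than that and are judged case-by-case.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical simplified entry: one kanji form, one reading, one meaning."""
    kanji: str
    reading: str
    meaning: str


def shared_members(a: Entry, b: Entry) -> int:
    """Count how many of <kanji-form, reading, meaning> are identical."""
    return sum(x == y for x, y in zip(a, b))


def should_merge(a: Entry, b: Entry) -> bool:
    """2-out-of-3 rule: merge when at least two members of the triple match.

    Assumption: a 3/3 match also returns True, since such entries are
    effectively the same entry already.
    """
    return shared_members(a, b) >= 2


senryuu = Entry("川柳", "せんりゅう", "comic haiku")
kawayanagi = Entry("川柳", "かわやなぎ", "riverside willow")
aikido_new = Entry("合気道", "あいきどう", "aikido")
aikido_old = Entry("合氣道", "あいきどう", "aikido")

print(should_merge(senryuu, kawayanagi))    # only the kanji form matches
print(should_merge(aikido_new, aikido_old)) # reading and meaning match
```

An exact comparison of meanings is the weakest part of this sketch, since deciding whether two glosses describe the same meaning is an editorial judgement rather than a string test.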
This rule appears to work quite well. It's really a guide, and things are
handled case-by-case, especially when there are multiple readings and/or
kanji forms. We usually don't merge on the basis of old, uncommon or
incorrect kanji forms, for example. For words written in kana alone, the
kana needs to be related, e.g. ダイヤモンド and ダイアモンド.
JMdict has more merged entries than many other dictionaries of a comparable size.
HTH
Jim
--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University