RE: Wishlist item (Re: [edict-jmdict] "P" Markers - Google as corpus?
olivier.binda wrote:
...
> Now, I completely agree that JMdict should only have one entry for
> these as it is the same word.
> Yet, when I'm producing my dictionary from JMdict, I have to produce
> (at least) 2 entries:
> one for いい
> one for よい
> Because the entries in my dictionary are sorted in alphabetical order
> of the romanized kana reading (to make it easy for westerners, as it is
> our native sorting system, and not かきくけこ...).
>
> So whenever there is more than 1 hiragana reading in a JMdict entry, I
> have to split it into 2+ entries for my dictionary,
> and the actual structure of JMdict, with kana and kanji not grouped,
> doesn't make it easy to decide which kanji should go with
> which hiragana reading (it's the reason why I implemented my function
> in the first place, to help the computer make that decision).
The <re_restr> elements should contain that information.
In the JMdict XML file, when a reading applies to only a
subset of the kanji, there is a <re_restr> element that
gives the valid kanji. For example, here is 1156010:...
<k_ele>
<keb>囲繞</keb>
</k_ele>
<k_ele>
<keb>囲にょう</keb>
</k_ele>
<r_ele>
<reb>いじょう</reb>
<re_restr>囲繞</re_restr>
</r_ele>
<r_ele>
<reb>いにょう</reb>
</r_ele>
This says that いにょう is a reading of both
囲繞 and 囲にょう, but that いじょう is restricted
to 囲繞.
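In code, the rule is roughly: a reading with no <re_restr> applies to
every kanji form in the entry, and a reading with <re_restr> applies only
to the kanji listed. Here is a minimal Python sketch of that rule. The
function name kanji_for_readings is made up for illustration, and the
<entry> wrapper is added around the fragment above so that it parses on
its own.

```python
import xml.etree.ElementTree as ET

# The fragment of entry 1156010 shown above, wrapped in <entry> so that
# it is a complete XML document (other elements of the entry omitted).
ENTRY_XML = """
<entry>
<k_ele><keb>囲繞</keb></k_ele>
<k_ele><keb>囲にょう</keb></k_ele>
<r_ele><reb>いじょう</reb><re_restr>囲繞</re_restr></r_ele>
<r_ele><reb>いにょう</reb></r_ele>
</entry>
"""


def kanji_for_readings(entry):
    """Return [(reading, [kanji, ...]), ...] for one <entry> element.

    Hypothetical helper, not part of any JMdict tool.
    """
    all_kanji = [k.findtext("keb") for k in entry.findall("k_ele")]
    pairs = []
    for r_ele in entry.findall("r_ele"):
        restricted = [e.text for e in r_ele.findall("re_restr")]
        # No <re_restr> means the reading applies to every kanji form.
        kanji = restricted if restricted else list(all_kanji)
        pairs.append((r_ele.findtext("reb"), kanji))
    return pairs


pairs = kanji_for_readings(ET.fromstring(ENTRY_XML))
for reading, kanji in pairs:
    print(f"{reading}【{'; '.join(kanji)}】")
# prints:
# いじょう【囲繞】
# いにょう【囲繞; 囲にょう】
```

Each (reading, kanji list) pair could then become one headword in a
dictionary sorted by reading.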
One way to do what you want might look like:
いじょう【囲繞】
(also: いにょう【囲繞; 囲にょう】)
1. [n,vs] surrounding; enclosure
...
いにょう【囲繞; 囲にょう】 see→いじょう【囲繞】
> Again, the structure of JMdict with kanji readings and kana readings
> doesn't help me see that my dictionary entries should be
> kana1+K1 and kana2 and not kana1+K1 and kana2+K1
There is also a <re_nokanji> tag that indicates when a
reading has no associated kanji.
You should also be aware of the <stagr> and <stagk>
tags, which indicate when a sense is only relevant to a
specific reading or kanji respectively.
The comments in the JMdict XML file's DTD are very useful
in figuring out the meaning of the various tags.
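When splitting an entry by reading, the same kind of check can decide
which senses to keep for each headword. The sketch below assumes that a
sense with no <stagk> or <stagr> applies to every kanji and reading of
the entry. The entry is invented for illustration, using the kana1,
kana2 and K1 placeholders from the question, and the function names are
hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented entry, not real JMdict data: K1, kana1 and kana2 are
# placeholders, and the glosses are made up.
SAMPLE_XML = """
<entry>
<k_ele><keb>K1</keb></k_ele>
<r_ele><reb>kana1</reb></r_ele>
<r_ele><reb>kana2</reb></r_ele>
<sense><gloss>meaning shared by both readings</gloss></sense>
<sense><stagr>kana2</stagr><gloss>meaning only for kana2</gloss></sense>
</entry>
"""


def sense_applies(sense, kanji, reading):
    """True if a <sense> element is valid for this kanji/reading pair."""
    stagk = [e.text for e in sense.findall("stagk")]
    stagr = [e.text for e in sense.findall("stagr")]
    # An empty restriction list is treated as "no restriction".
    kanji_ok = not stagk or kanji in stagk
    reading_ok = not stagr or reading in stagr
    return kanji_ok and reading_ok


def glosses_for(entry, kanji, reading):
    """First gloss of each sense that applies to the given headword."""
    return [
        sense.findtext("gloss")
        for sense in entry.findall("sense")
        if sense_applies(sense, kanji, reading)
    ]


entry = ET.fromstring(SAMPLE_XML)
print(glosses_for(entry, "K1", "kana1"))
print(glosses_for(entry, "K1", "kana2"))
```

With this sample, the kana1 headword gets only the shared sense, while
the kana2 headword gets both.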