
Re: [edict-jmdict] Ordering of kana parts




On Jul 23, 2010, at 5:19 PM, Jim Breen wrote:

I want to (re)visit the ordering of kana parts (Jim Rose
has asked for a ruling in
http://edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=2004760.1 )

As with the kanji parts, senses, etc. they should be in
descending frequency-of-use order. The question here is
whether the hiragana should be first (as it states the reading of the
kanji) or the (non-kanji) katakana should go first, because it is more common.

My view originally was that hiragana should go first, and that the
presence of a "uk" was enough to flag that the katakana was the
common term. Rene documented this at:
http://www.edrdg.org/wiki/index.php/Editorial_policy#Names_of_biological_species

Jim proposed putting the katakana first as these were really the common
forms. I went along with it, as I could see the argument for signalling that
to users.

THEN I modified WWWJDIC to apply a heuristic to displaying entries
with "uk" tags:
- if the kana part contains katakana terms, display them first, and leave
the hiragana attached to the kanji;
- if the kana part only has hiragana, display it first.

Thus we get displays such as:
カイガラムシ 《貝殻虫; 介殻虫》 【かいがらむし】 (n) (uk) scale insect
and
がいせん 《外専》 (n) (uk) (vulg) person physically attracted to foreigners...

This relatively simple change seems to get around the issue
of making the common kana form(s) more prominent. Sure
it's only in WWWJDIC, but other clients could easily do it.
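
For other clients, the display heuristic described above could be sketched roughly as follows. This is an illustration only, not the actual WWWJDIC code: the function names, the argument layout, and the fallback for entries without a "uk" tag are assumptions made for the example.

```python
def has_katakana(text):
    """Return True if text contains at least one katakana letter (U+30A1..U+30FA)."""
    return any("\u30a1" <= ch <= "\u30fa" for ch in text)


def display_entry(kanji, readings, tags, gloss):
    """Build a one-line display string for an entry (hypothetical helper).

    kanji, readings and tags are lists of strings; gloss is a string.
    """
    tag_text = " ".join("(%s)" % t for t in tags)
    kanji_text = "; ".join(kanji)

    if "uk" in tags and kanji:
        katakana = [r for r in readings if has_katakana(r)]
        others = [r for r in readings if not has_katakana(r)]
        if katakana:
            # Katakana terms go first; the hiragana stays attached to the kanji.
            parts = ["; ".join(katakana), "《%s》" % kanji_text]
            if others:
                parts.append("【%s】" % "; ".join(others))
        else:
            # Only hiragana readings: display them first.
            parts = ["; ".join(readings), "《%s》" % kanji_text]
    elif kanji:
        # Assumption: entries without "uk" use a plain kanji-first layout.
        parts = [kanji_text, "【%s】" % "; ".join(readings)]
    else:
        parts = ["; ".join(readings)]

    parts += [tag_text, gloss]
    return " ".join(p for p in parts if p)


print(display_entry(["貝殻虫", "介殻虫"], ["かいがらむし", "カイガラムシ"],
                    ["n", "uk"], "scale insect"))
```

In this sketch the output does not depend on whether the hiragana or the katakana reading comes first in the stored entry, since the readings are split by script before display.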

My feeling is that given this change, it would be best to revert to having
the hiragana field(s) first. It seems structurally cleaner.

Can we get a consensus on this?


Thanks for addressing this.

My first concern is how the data is interpreted outside of this group. Sure, we know to output the katakana first when it's marked [uk], but would the typical user of the file who is not a party to these discussions?  My understanding is that they will read something in the documentation of the structure of the dictionary and determine that the readings are ordered by frequency.  But alas, we now say they are ordered by frequency UNLESS a kanji compound is present.  So we violate our own ordering rule on the arbitrary discovery of a kanji compound and give arbitrary and misleading primacy to readings of the kanji, but we don't say WHY... which is tantamount to saying that we give arbitrary primacy to words written in kanji regardless of their actual use. Is it because we love kanji? Do we feel antipathy toward katakana?  I don't get it.  So the workaround for putting hiragana first whenever kanji are present is to make the documentation explicitly note that we are creating an exception to our own principle.  Justifying that exception is where I think this gets nebulous.

That is because I don't understand why this is "structurally cleaner".  I consider it structurally convoluted, though historically entrenched -> long-run view vs. short-run view.  As far as I can tell we would be creating this exception to ordering simply because that's how it was done in the past, by arbitrary rules that had no meaningful basis. Is that a valid reason to continue doing so?  Perhaps if we could get an estimate of how many taxonomic entries exist in this format now, and compare that to the ultimate number that could eventually exist in JMDICT, we might see that we are incorporating a structural flaw in organization due to having a short-run view of the data set.

My second concern is what effect, if any, this has on the EDICT legacy file, since that is, and probably will remain for some time, the principal file being incorporated into the world's various software platforms... those already here and those being born in the near future.  Is there an effect at all?  Maybe not.  But we should ponder that.

Either way, we need to set the rule now and hope that we pick the right direction.  Whatever is decided, I'm behind it 100% once it's decided.  Until then I'm for katakana first within the scope of taxonomic entries.

Jim R