
Re: [edict-jmdict] Aligning kana between headwords and readings



> For EDICT, and continuing into JMdict, I have tried to
> maintain a fairly strict 1-to-1 mapping of kana between the
> kanji part of an entry and the kana/reading part. In
> particular, I have always made the katakana portion match.
> Thus for ローマ字, the reading is ローマじ; not ろーまじ.
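The 1-to-1 kana mapping described above can be checked
mechanically. The following is only a rough sketch, not anything
taken from the EDICT or JMdict tooling: the name kana_aligned and
the chosen Unicode ranges are assumptions made for illustration.
It requires every kana run in the headword to reappear, in the
same script and in the same order, in the reading.

```python
import re

# A "kana run" is a stretch of hiragana, katakana, or the prolonged
# sound mark (ー). The ranges below are an assumption of this sketch.
KANA_RUN = re.compile(r"[\u3041-\u3096\u30A1-\u30FA\u30FC]+")


def kana_aligned(headword: str, reading: str) -> bool:
    """Return True if each kana run of the headword occurs verbatim,
    and in order, inside the reading (hypothetical helper)."""
    pos = 0
    for run in KANA_RUN.findall(headword):
        found = reading.find(run, pos)
        if found < 0:
            return False
        pos = found + len(run)
    return True


print(kana_aligned("ローマ字", "ローマじ"))  # katakana preserved
print(kana_aligned("ローマ字", "ろーまじ"))  # katakana lost
```

Under this check ローマじ passes for ローマ字 while ろーまじ does
not, which is the distinction the quoted policy draws.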

I don't see any merit in having the reading match the script of
the headword, and I would advocate having them all be normalized
to hiragana. They don't sound any different, so it's just useless
extra information that could just as well be inferred from the
headword.

Taking it a step further, is there any point in having the
reading use the same vowel extensions?

Why support ろーまじ, ろうまじ, and ろおまじ as three
separate entries, when they all sound exactly the same?
Personally, I would normalize the reading field and search
strings so that when searching for any word by how it sounds,
all homophones would match regardless of their particular
orthography.

I wonder whether you are not confusing different points.
Search strings in WWWJDIC (but not strings in the "translate
words" function) already pay no heed to hiragana vs katakana
differences.  Extending that so that ー is viewed as identical
to the appropriate long vowel in searches (while not something
I am in favour of) would not require or produce changes to
the WWWJDIC display or the (old) EDICT and EDICT2 dictionary
files.
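To make the distinction concrete, here is a minimal sketch of
search-time folding that leaves the stored readings untouched. It
is not WWWJDIC's actual code; the names to_hiragana,
expand_long_vowels and search_key are hypothetical, and "the
appropriate long vowel" is assumed here to mean the vowel of the
preceding kana (so ろー becomes ろお; matching ろう as well would
need a further rule).

```python
# Katakana ァ..ヶ sit exactly 0x60 code points above hiragana ぁ..ゖ.
KATAKANA_START, KATAKANA_END, OFFSET = 0x30A1, 0x30F6, 0x60

# Hiragana grouped by vowel; ん and っ carry no vowel and are omitted.
VOWEL_ROWS = {
    "あ": "ぁあかがさざただなはばぱまゃやらわゎ",
    "い": "ぃいきぎしじちぢにひびぴみりゐ",
    "う": "ぅうくぐすずつづぬふぶぷむゅゆるゔ",
    "え": "ぇえけげせぜてでねへべぺめれゑ",
    "お": "ぉおこごそぞとどのほぼぽもょよろを",
}
VOWEL_OF = {k: v for v, row in VOWEL_ROWS.items() for k in row}


def to_hiragana(text: str) -> str:
    """Fold katakana to hiragana; ー and everything else pass through."""
    return "".join(
        chr(ord(ch) - OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


def expand_long_vowels(text: str) -> str:
    """Replace ー with the vowel of the preceding kana, if it has one."""
    out = []
    for ch in text:
        if ch == "ー" and out and out[-1] in VOWEL_OF:
            out.append(VOWEL_OF[out[-1]])
        else:
            out.append(ch)
    return "".join(out)


def search_key(text: str) -> str:
    """Key used only for matching; never stored or displayed."""
    return expand_long_vowels(to_hiragana(text))


print(search_key("ローマじ"), search_key("ろーまじ"))
```

Both the stored reading and the query are passed through
search_key before comparison, so the dictionary files and the
display keep ローマじ exactly as entered.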

Changing the actual 'reading' for all entries to normalise
on full hiragana is problematic in a number of ways.

There are semi-anomalous entries (or 'specially
distinguished' entries, if you prefer) where some or all of the
headword is in the roman alphabet (fullwidth characters).

ALS 【エーエルエス】 (n) (abbr) amyotrophic lateral sclerosis (ALS)

Changing to hiragana would give the (false) impression that
えええるえす might be used to search for this word, besides
which えええるえす is just ugly. エーエルエス, on the other
hand, can be used to search for relevant sites (although you
do a lot better with the roman letters, which is why they are
given as the headword).