RE: [edict-jmdict] jmnedict non-kanji keb elements

To: edict-jmdict@***************
Subject: RE: [edict-jmdict] jmnedict non-kanji keb elements
From: Stuart McGraw <smcg4191@********>
Date: Sat, 26 Apr 2008 21:06:16 -0600

Jim Breen wrote:
> On 24/04/2008, Stuart McGraw <smcg4191@frii.com> wrote:
> >  However there are many keb elements that consist of
> >  characters that do not contain any kanji, for example at
> >  line 5501 in the 2008-04-24 JMnedict.gz file:
> >
> >  <entry>
> >  <k_ele>
> >  <keb>あふひ</keb>
> 
> That's a result of an ENAMDICT entry like:
> 
> あふひ [あうい] /Aui (g)/
> 
> My simplistic XML generator treats the あふひ as though it
> were a regular kanji version.
> 
> I was tempted to simply change the documentation  8-)
> What I have done is to move the あふひ into the <reb> for
> those cases and to add an "ok" tag to show it's (usually)
> the old kana form.

Thanks for the explanation.
I see that the expansion of the "ok" entity in jmnedict is
"old or irregular kana form".  In jmdict, there is also an
"ok" tag, but it expands to "out-dated or obsolete kana usage".

In my apps, I often use the entity string as a short-form
identifier for the tag when displaying entries.  For any
app that works with both jmdict and jmnedict, this creates
a conflict.  Of course I can make up my own app-specific
short-form string but I hate to do that as there are many
people already familiar with the jm{ne}dict forms.  I wonder
if it would be possible to use something like "oik" for the
jmnedict tag?

Alternatively, perhaps the meaning of the "ok" tag is
fundamentally the same in both jmdict and jmnedict (although
the inclusion of "or irregular" in the jmnedict expansion makes
me think not).  If that's the case, maybe the jmnedict and
jmdict entity expansion strings should be made identical?
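
To make the conflict concrete, here is a rough sketch of the
app-side workaround I would rather avoid.  The two expansion
strings are the ones quoted above; the function names and the
"ok(jmnedict)" label format are made up for illustration and
are not part of either file.

```python
# Expansion strings as quoted in this message.  Keying them by
# (dictionary, entity) is an app-specific assumption, not
# something jmdict or jmnedict defines.
EXPANSIONS = {
    ("jmdict", "ok"): "out-dated or obsolete kana usage",
    ("jmnedict", "ok"): "old or irregular kana form",
}

def conflicting_entities(expansions):
    """Return entity names whose expansion text differs between dictionaries."""
    seen = {}
    for (_source, name), text in expansions.items():
        seen.setdefault(name, set()).add(text)
    return {name for name, texts in seen.items() if len(texts) > 1}

def display_label(source, name, expansions=EXPANSIONS):
    """Short-form label: the bare entity name unless it is ambiguous."""
    if name in conflicting_entities(expansions):
        # Hypothetical label format; users who know the jm{ne}dict
        # tags will not recognize it.
        return f"{name}({source})"
    return name
```

With a distinct entity name such as "oik" in jmnedict, or with
identical expansion strings in both files, conflicting_entities()
would come back empty and the bare tag could be shown as-is.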

> Note that JMnedict is a quick-and-dirty conversion to XML
> of the ENAMDICT entries. If we ever get to a *real* database
> for the names, I'd like to rethink the name structures, as
> the JMdict-like one really doesn't work.
> 
> Consider 一栄 which has the following readings/transliterations:
> 
> （かずえ） Kazue (m,f)
> （いちえい） Ichiei (s,g)
> （かずえい） Kazuei (m)
> （かずよし） Kazuyoshi (m)
> （いちえ） Ichie (f)
> （いつえ） Itsue (f)
> （かずしげ） Kazushige (u)
> （かずひで） Kazuhide (m)
> 
> At present that becomes 8 different entries in JMnedict. Ideally
> it should be one entry, but the structure needs to align readings
> and transliterations.

Isn't that what <stagr> does in the jmdict structure?
(Assuming that assigning one transliteration per sense
is a reasonable thing to do.)
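
As a sketch of what I mean, here is a merged entry for 一栄
using jmdict-style elements, with each sense restricted to one
reading via <stagr>.  This is not the current jmnedict format;
the merged entry and the function name are hypothetical, and
only three of the eight readings are shown to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical merged entry: jmdict-style element names, one
# sense per transliteration, each tied to a reading by <stagr>.
ENTRY_XML = """
<entry>
  <k_ele><keb>一栄</keb></k_ele>
  <r_ele><reb>かずえ</reb></r_ele>
  <r_ele><reb>いちえい</reb></r_ele>
  <r_ele><reb>いちえ</reb></r_ele>
  <sense><stagr>かずえ</stagr><gloss>Kazue</gloss></sense>
  <sense><stagr>いちえい</stagr><gloss>Ichiei</gloss></sense>
  <sense><stagr>いちえ</stagr><gloss>Ichie</gloss></sense>
</entry>
"""

def glosses_by_reading(entry):
    """Map each reading to the glosses of the senses that apply to it."""
    readings = [reb.text for reb in entry.findall("r_ele/reb")]
    result = {reading: [] for reading in readings}
    for sense in entry.findall("sense"):
        restricted = [s.text for s in sense.findall("stagr")]
        # As in jmdict, a sense without <stagr> applies to every reading.
        for reading in (restricted or readings):
            result.setdefault(reading, []).extend(
                g.text for g in sense.findall("gloss"))
    return result

entry = ET.fromstring(ENTRY_XML.strip())
aligned = glosses_by_reading(entry)
```

Here aligned["かずえ"] holds only "Kazue", so the reading and
its transliteration stay paired inside a single entry.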
