
Re: [edict-jmdict] Errors in cross-references



On Sun, 4 Aug 2019 at 02:21, 'Adam Nohejl' adam@nohejl.name
[edict-jmdict] <edict-jmdict@yahoogroups.com> wrote:
> while I was trying to resolve cross-references in JMdict, I found three
> references referring to JMnedict entries:
>
> 1. <xref>フィレオフィッシュ</xref> in seq 2137630
> 2. <xref>NHK</xref> in seq 2176930
> 3. <xref>タミフル</xref> in seq 2648970

Yes, they all got muddled because the target entries were moved to the
name dictionary and the cross-references were left behind. I've cleaned them
up, although I'm wondering if the NHK entry should stay in JMdict. I'm
also suggesting 2137630 (フィレオ) move to the name dictionary as it's an
abbreviated product name.

> Additionally, there are two references that are repeated twice within
> the same entry and sense:
>
> 1. <xref>豚トロ・とんトロ・1</xref> is repeated twice in seq
> 2677780 (and the sense number is unnecessary)
> 2. <xref>無線呼出符号・1</xref> is repeated twice in seq 2827567
> (and the sense number is unnecessary)

Fixed those two. I suspect they were hiccups in the database software.
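
For anyone wanting to check for repeats like these, here is a minimal Python
sketch. It assumes the usual JMdict layout (entry / ent_seq / sense / xref);
the sample entry and its sequence number are made up for illustration, and
the function name duplicate_xrefs is hypothetical, not part of any JMdict tool.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample entry; the seq number and contents are illustrative only.
SAMPLE = """<JMdict>
<entry>
<ent_seq>9000001</ent_seq>
<r_ele><reb>れい</reb></r_ele>
<sense>
<xref>例・1</xref>
<xref>例・1</xref>
<gloss>example</gloss>
</sense>
</entry>
</JMdict>"""


def duplicate_xrefs(root):
    """Return (ent_seq, sense number, xref text) for xrefs repeated in one sense."""
    found = []
    for entry in root.iter("entry"):
        seq = entry.findtext("ent_seq")
        for sense_no, sense in enumerate(entry.findall("sense"), start=1):
            # Count identical xref strings within this single sense.
            counts = Counter(x.text for x in sense.findall("xref"))
            found.extend((seq, sense_no, text)
                         for text, n in counts.items() if n > 1)
    return found


dups = duplicate_xrefs(ET.fromstring(SAMPLE))
print(dups)
```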

> I have also noticed that there are about 4000 cross-references that
> specify a reading although the target entry has only one.

That's interesting, and it may be related to a software issue at the time the
entry was created or edited. I won't go hunting for them because the
changes we are considering for recording cross-references may remove
them. See below.
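
For reference, one way such redundant readings could be counted is sketched
below in Python. It rests on assumptions of mine: that an xref string is
"keb・reb" with an optional trailing sense number, and that a reading is
redundant when exactly one entry carries that keb and that entry has a single
reb. The sample entries and the name redundant_readings are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample data; seq numbers are illustrative only.
SAMPLE = """<JMdict>
<entry>
<ent_seq>9000001</ent_seq>
<k_ele><keb>例</keb></k_ele>
<r_ele><reb>れい</reb></r_ele>
<sense><gloss>example</gloss></sense>
</entry>
<entry>
<ent_seq>9000002</ent_seq>
<r_ele><reb>サンプル</reb></r_ele>
<sense><xref>例・れい</xref><gloss>sample</gloss></sense>
</entry>
</JMdict>"""


def redundant_readings(root):
    """Return (ent_seq, xref text) where the reading adds no information."""
    # Map each keb to the reading lists of all entries that contain it.
    by_keb = {}
    for entry in root.iter("entry"):
        rebs = [r.text for r in entry.iter("reb")]
        for keb in entry.iter("keb"):
            by_keb.setdefault(keb.text, []).append(rebs)

    hits = []
    for entry in root.iter("entry"):
        seq = entry.findtext("ent_seq")
        for xref in entry.iter("xref"):
            parts = xref.text.split("・")
            if len(parts) > 1 and parts[-1].isdigit():
                parts.pop()  # drop the trailing sense number
            if len(parts) != 2:
                continue  # no reading specified
            keb, reb = parts
            targets = by_keb.get(keb, [])
            # Redundant: the keb alone identifies one entry with one reading.
            if len(targets) == 1 and targets[0] == [reb]:
                hits.append((seq, xref.text))
    return hits


hits = redundant_readings(ET.fromstring(SAMPLE))
print(hits)
```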

> Last but not least, in the references to the following entries, the use
> of the centre-dot can be misleading for the parsing software (and the DTD
> disallows it: "The target keb or reb must not contain a centre-dot."):
>
> 1. シルキー・シャーク
> 2. シンガポール・スリング
> 3. カーゴ・スリング
> 4. ベイビー・スリング
> 5. マヌカ・ハニー
> 6. タックス・ヘイブン

I've made all those references simply point to the target entry via its
sequence number.
The XML now just has: "<xref>カーゴスリング</xref>", etc.
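
To illustrate why the centre-dot trips up parsers, here is a small Python
sketch of a naive splitter. The function split_xref is hypothetical and
assumes the xref text has the shape "keb-or-reb[・reb][・sense]"; real
consumers may parse it differently.

```python
def split_xref(text):
    """Split an xref string on the centre-dot into (parts, sense number)."""
    parts = text.split("・")
    sense = None
    # A trailing all-digit part is taken to be the sense number.
    if len(parts) > 1 and parts[-1].isdigit():
        sense = int(parts.pop())
    return parts, sense


for text in ["豚トロ・とんトロ・1", "カーゴ・スリング", "カーゴスリング"]:
    print(text, split_xref(text))
```

With a dotted headword the splitter returns two parts, which look exactly
like a kanji form plus a reading; the undotted form comes back as one part.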

> Maybe the best solution would be for the xref to contain only
> XML-structured information (seq, type and optional restriction to a
> particular sense/kanji/readings). As for the restriction to
> kanji/readings it could be done in much the same way senses can now be
> restricted using stagr/stagk along these lines:
>
> <!ELEMENT xref ((xtagk*, xtagr*)|xtags*)>
> -- or if only one kanji/reading/sense is enough: <!ELEMENT xref
> ((xtagk|xtagr|(xtagk, xtagr)|xtags)?)>
>
> <!ATTLIST xref seq CDATA #REQUIRED>
> <!ATTLIST xref type CDATA #IMPLIED>
>         <!-- Type of cross-reference, implied value "see". -->
> <!ELEMENT xtagk (#PCDATA)>
> <!ELEMENT xtagr (#PCDATA)>
>         <!-- These elements, if present, indicate that the cross-reference
>         is restricted to the lexeme represented by the keb and/or reb of
>         the entry identified by xref's seq attribute. -->
> <!ELEMENT xtags (#PCDATA)>
>         <!-- These elements, if present, indicate that the cross-reference
>         is restricted to particular senses (represented by their numbers)
>         of the entry identified by xref's seq attribute. -->

What I'm thinking of for the revised xref structure is:
(a) to define it purely in terms of the target entry and sense. Any
restrictions on kanji and reading within the target entry will be a matter
for the entry itself.
(b) to include the target entry's surface form and, if needed, the sense
number in the xref element as a text string. This would be primarily to help
generate derived formats such as EDICT. Apps could use this, or use the
sequence numbers, as they wish.

An example of this is <xref type="see" seq="1073760" sno="1">スライド・1</xref>
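
As a rough sketch of how an app might consume an xref of that shape, the
Python below reads the attributes and treats the text only as a display label.
The attribute names (type, seq, sno) come from the example above; the XRef
tuple, the read_xref function, the "see" default for a missing type and the
handling of a missing sno are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class XRef(NamedTuple):
    kind: str              # value of the type attribute
    seq: int               # sequence number of the target entry
    sense: Optional[int]   # target sense number, if given
    label: str             # surface form text, for display or EDICT output


def read_xref(elem):
    """Build an XRef from an <xref> element carrying seq/sno attributes."""
    sno = elem.get("sno")
    return XRef(
        kind=elem.get("type", "see"),  # assumed default when type is absent
        seq=int(elem.get("seq")),
        sense=int(sno) if sno is not None else None,
        label=elem.text or "",
    )


ref = read_xref(ET.fromstring(
    '<xref type="see" seq="1073760" sno="1">スライド・1</xref>'))
print(ref)
```

The target is resolved from seq and sno alone, so a centre-dot in the label
no longer matters to the parser.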

I expect the surface form put into the <xref> element would be the first
one in either the kanji or the reading section of the target entry.

Thanks for the feedback and suggestions. Most welcome.

Jim



-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/