Re: [edict-jmdict] Abbreviations (Was: Combining entries)
On 07/16/2010 09:20 PM, Jim Breen wrote:
>[...]
> I can see that others
[xref types]
> may be proposed, but I feel:
>
> (a) there is a really strong case for a simple double-headed "abbr"
> link between entries, one of which is an abbreviation of each other;
By "double-headed" do you mean two xrefs, one in each entry pointing
to the other?
> (b) the need is such that it justifies the extension of the see/ant
> exceptions to include it. We want to encourage people to enter them.
>
> (c) other xref types can probably be handled by the general
> [xref=type:value] construct, as they are more likely to be entered by
> skilled users.
I'm not sure I find the above all that convincing: people enter extra
characters now for the sake of consistent syntax and to maintain a
table-driven design (as opposed to hardwiring knowledge of every tag
into the parser). I've been reading the edict list for a number of years
and, unlike "synonym", don't think I've ever seen mention of an "abbr"
xref until a few days ago. I don't see any reason to believe that a
few months from now there won't be another xref type that is so needed
that it too has to be hardcoded into the parser.
But before I complain any more, I will take a look at the parser to
see what the options really are. I've been speaking from memory which
is a dangerous thing for me to do. :-)
>> I'll close by saying despite all the above I am not set on the syntax
>> I proposed, nor unmovingly opposed to an [abbr=...] tag -- I just wanted
>> to point out some of the factors that need to be considered.
>
> Appreciated.
>
> If we can "go forward" with this (to use our newish Prime Minister's
> catch-phrase), I'd be looking for the JMdict xml to have something like
> <abbr>ナニナニ</xref> (which in the long run may morph into
> <xref type="abbr">ナニナニ</xref> or <xref type="abbr" value="ナニナニ"/>)
I am discouraged to read this. As you'll recall, I have been
advocating for several years that the XML for cross refs give the
type as an attribute and include an explicit mention (also as an
attribute) of the referenced entry's sequence number. (See for
example this April 2007 post:
http://tech.groups.yahoo.com/group/edict-jmdict/message/1490)
Rather than adding a new element that will later need to be changed
again to the more general
<xref type="abbr" seq="nnnnnnn">ナニナニ</xref>
(or similar) form, ISTM that it would make sense to make the change
to the latter form now. Either all xrefs could be changed to this
form now, or for backward compatibility, it could be used only for
the new abbr xrefs with <see> and <ant> remaining but growing a
seq="nnnnnnn" attribute. (I believe that the common convention in
the xml/html world of ignoring unknown attributes would cause this
change to introduce at most only a very small amount of backward
incompatibility.)
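
As a rough illustration of the backward-compatibility point, here is a
minimal Python sketch. The XML fragment, the element layout and the
sequence numbers are all made up for the example (this is not actual
JMdict output), and both reader functions are hypothetical. It only shows
that a reader which looks at element text alone is unaffected when <see>
gains a seq attribute, while a newer reader can pick the attribute up.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: layout and sequence numbers are invented.
SAMPLE = """\
<sense>
  <see seq="1000001">ナニナニ</see>
  <xref type="abbr" seq="1000002">ナニ</xref>
</sense>
"""

def legacy_see_targets(sense):
    """Old-style reader: uses only the element text, never the attributes."""
    return [el.text for el in sense.findall("see")]

def seq_aware_xrefs(sense):
    """Newer reader: returns (type, seq, text) for <see>, <ant> and <xref>."""
    refs = []
    for el in sense:
        if el.tag in ("see", "ant"):
            xtype = el.tag            # type is implied by the element name
        elif el.tag == "xref":
            xtype = el.get("type")    # type is given as an attribute
        else:
            continue
        seq = el.get("seq")           # None when the attribute is absent
        refs.append((xtype, int(seq) if seq is not None else None, el.text))
    return refs

sense = ET.fromstring(SAMPLE)
print(legacy_see_targets(sense))
print(seq_aware_xrefs(sense))
```

A legacy reader of this kind would still miss the new <xref> element
entirely, which is the "very small amount" of incompatibility meant above.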
The current method of cross-referencing solely by kanji/reading has
a number of negative consequences. The resolution back to a sequence
number requires a two-step process when loading JMdict XML into a
JMdictDB database: 1) load everything but the xrefs, 2) run a second
process to resolve the xrefs against the loaded entries to generate
the database xrefs.
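
The two-step shape can be sketched as follows. This is a toy in-memory
stand-in, not the JMdictDB code: the entries, the "forms" field and both
function names are assumptions made up for the example. It shows why the
second pass is needed at all, and where a text-only xref stops being
resolvable without heuristics.

```python
from collections import defaultdict

# Invented entries: seq 2 and seq 3 share the same written form.
RAW = [
    {"seq": 1, "forms": ["ナニナニ"], "xrefs": ["ナニ"]},
    {"seq": 2, "forms": ["ナニ"], "xrefs": ["ナニナニ"]},
    {"seq": 3, "forms": ["ナニ"], "xrefs": []},
]

def load_entries(raw):
    """Step 1: load everything but the xrefs; index seq numbers by form."""
    by_form = defaultdict(list)
    for entry in raw:
        for form in entry["forms"]:
            by_form[form].append(entry["seq"])
    return by_form

def resolve_xrefs(raw, by_form):
    """Step 2: map each text-only xref back to a sequence number."""
    resolved, ambiguous = [], []
    for entry in raw:
        for target in entry["xrefs"]:
            candidates = [s for s in by_form.get(target, [])
                          if s != entry["seq"]]
            if len(candidates) == 1:
                resolved.append((entry["seq"], candidates[0]))
            else:
                # Zero or several matches: this is where heuristics come in.
                ambiguous.append((entry["seq"], target, candidates))
    return resolved, ambiguous

by_form = load_entries(RAW)
resolved, ambiguous = resolve_xrefs(RAW, by_form)
print(resolved)
print(ambiguous)
```

With a seq attribute in the XML, the second function would reduce to
reading a number, and the ambiguous case would not arise.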
The xref resolution code uses heuristics, is significantly complex,
is slow, still has bugs that need attention, and will have to be modified
to adapt to an <abbr> tag. I would dearly love to eliminate this code
so that the time spent maintaining it can be applied to the core JMdictDB
code. A second problem is that the heuristics used to identify the
correct xref target entry may result in different xrefs in my local
JMdictDB installation than exist in the EDRDG instance. This has significant
potential to cause confusion and waste time when I try to replicate
reported EDRDG problems locally.
When the master repository of JMdict data was a file, not including
target seq numbers for cross-refs was reasonable -- the master data
did not have that information either. But now, the EDRDG JMdictDB
database *does* have that information -- the xrefs (most anyway)
have been (or will be) resolved to a specific entry regardless of
the kanji/reading used when displaying it. The current XML discards
this information and supplies inferior, lossy information to the
file user, who then has to jump through hoops to recreate it.