
Re: [edict-jmdict] Abbreviations (Was: Combining entries)



On 07/16/2010 09:20 PM, Jim Breen wrote:
>[...]
> I can see that others [xref types] may be proposed, but I feel:
> 
> (a) there is a really strong case for a simple double-headed "abbr"
> link between entries, one of which is an abbreviation of the other;

By "double-headed" do you mean two xrefs, one in each entry pointing
to the other?

> (b) the need is such that it justifies the extension of the see/ant
> exceptions to include it. We want to encourage people to enter them.
> 
> (c) other xref types can probably be handled by the general
> [xref=type:value] construct, as they are more likely to be entered by
> skilled users.

I'm not sure I find the above all that convincing: people enter extra
characters now for the sake of consistent syntax and to maintain a 
table-driven design (as opposed to hardwiring knowledge of every tag 
into the parser).  I've been reading the edict list for a number of years 
and, unlike "synonym",  don't think I've ever seen mention of an "abbr"
xref until a few days ago.  I don't see any reason to believe that a 
few months from now there won't be another xref type that is so needed
that it too has to be hardcoded into the parser.

But before I complain any more, I will take a look at the parser to
see what the options really are.  I've been speaking from memory, which
is a dangerous thing for me to do. :-)

>> I'll close by saying despite all the above I am not set on the syntax
>> I proposed, nor unmovingly opposed to an [abbr=...] tag -- I just wanted
>> to point out some of the factors that need to be considered.
> 
> Appreciated.
> 
> If we can "go forward" with this (to use our newish Prime Minister's
> catch-phrase), I'd be looking for the JMdict xml to have something like
> <abbr>ナニナニ</abbr> (which in the long run may morph into
> <xref type="abbr">ナニナニ</xref> or <xref type="abbr" value="ナニナニ"/>)

I am discouraged to read this.  As you'll recall, I have been 
advocating for several years that the XML for cross refs give the 
type as an attribute and include an explicit mention (also as an 
attribute) of the referenced entry's sequence number.  (See for 
example this April 2007 post:
  http://tech.groups.yahoo.com/group/edict-jmdict/message/1490)

Rather than adding a new element that will later need to be changed 
again to the more general 
  <xref type="abbr" seq="nnnnnnn">ナニナニ</xref>
(or similar) form, ISTM that it would make sense to make the change
to the latter form now.  Either all xrefs could be changed to this 
form now, or for backward compatibility, it could be used only for 
the new abbr xrefs with <see> and <ant> remaining but growing a 
seq="nnnnnnn" attribute.  (I believe that the common convention in 
the xml/html world of ignoring unknown attributes would cause this 
change to introduce at most only a very small amount of backward 
incompatibility.)
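
To illustrate the backward-compatibility point, here is a minimal Python
sketch.  It assumes a consumer that uses a non-validating parser and reads
only the text of <see>; the sense fragments and the seq value are made-up
placeholders, not real JMdict data, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments; "1234567" is a placeholder, not a real seq number.
OLD = "<sense><see>ナニナニ</see></sense>"
NEW = '<sense><see seq="1234567">ナニナニ</see></sense>'

def legacy_targets(xml_text):
    """Mimic an existing consumer that reads only the text of <see>."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("see")]

def seq_aware_targets(xml_text):
    """A newer consumer that also picks up the seq attribute, if present."""
    root = ET.fromstring(xml_text)
    return [(el.text, el.get("seq")) for el in root.iter("see")]

# The legacy reader sees identical data with or without the attribute.
print(legacy_targets(OLD) == legacy_targets(NEW))
print(seq_aware_targets(NEW))
```

A consumer that validates against the DTD would still need the updated
DTD, which is presumably where the "very small amount" of incompatibility
would come from.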

The current method of cross-referencing solely by kanji/reading has a
number of negative consequences.  The resolution back to a sequence 
number requires a two-step process when loading JMdict XML into a 
JMdictDB database: 1) load everything but the xrefs, 2) run a second 
process to resolve the xrefs against the loaded entries to generate
the database xrefs.
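
The two-step shape can be sketched roughly as follows.  This is not the
JMdictDB code, only a toy model under simplifying assumptions: entries are
plain tuples, the seq numbers and forms are invented, and "resolution" is a
bare lookup by form with none of the real heuristics.

```python
from collections import defaultdict

# Hypothetical entries: (seq, kanji/reading forms, xref target texts).
ENTRIES = [
    (1000001, ["AAA"], ["BBB"]),
    (1000002, ["BBB"], []),
    (1000003, ["BBB"], []),      # a second entry sharing the same form
    (1000004, ["CCC"], ["AAA"]),
]

def load_entries(entries):
    """Step 1: index every entry by form; leave xrefs unresolved."""
    index = defaultdict(list)
    pending = []
    for seq, forms, xrefs in entries:
        for form in forms:
            index[form].append(seq)
        for text in xrefs:
            pending.append((seq, text))
    return index, pending

def resolve_xrefs(index, pending):
    """Step 2: map each xref text back to a seq number where possible."""
    resolved, ambiguous = [], []
    for source, text in pending:
        candidates = index.get(text, [])
        if len(candidates) == 1:
            resolved.append((source, candidates[0]))
        else:
            # Zero or several candidates: real code needs heuristics here.
            ambiguous.append((source, text, candidates))
    return resolved, ambiguous

index, pending = load_entries(ENTRIES)
resolved, ambiguous = resolve_xrefs(index, pending)
print(resolved)
print(ambiguous)
```

With a seq attribute in the XML, the second step and the ambiguous case
would both go away, since each xref would already name its target.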

The xref resolution code uses heuristics, is significantly complex, 
is slow, still has bugs that need attention, and will have to be modified 
to adapt to an <abbr> tag.  I would dearly love to eliminate this code 
so that the time spent maintaining it can be applied to the core JMdictDB 
code.  A second problem is that the heuristics used to identify the correct
xref target entry may result in different xrefs in my local JMdictDB 
installation than exist in the EDRDG instance.  This has significant 
potential to cause confusion and waste time when I try to replicate 
reported EDRDG problems locally.

When the master repository of JMdict data was a file, not including
target seq numbers for cross-refs was reasonable -- the master data
did not have that information either.  But now, the EDRDG JMdictDB 
database *does* have that information -- the xrefs (most anyway)
have been (or will be) resolved to a specific entry regardless of 
the kanji/reading used when displaying it.  The current XML discards 
this information and supplies inferior, lossy information to the 
file user, who then has to jump through hoops to recreate it.
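
As a rough sketch of what a non-lossy export could look like, the snippet
below writes an already-resolved xref in the
<xref type="abbr" seq="nnnnnnn"> form and reads it back.  The function
names and the seq value are hypothetical, and this says nothing about how
the actual JMdictDB exporter is structured.

```python
import xml.etree.ElementTree as ET

def export_xref(xref_type, target_seq, display_text):
    """Serialize a resolved xref, keeping the target's seq number."""
    el = ET.Element("xref", {"type": xref_type, "seq": str(target_seq)})
    el.text = display_text
    return ET.tostring(el, encoding="unicode")

def import_xref(xml_text):
    """Recover (type, seq, text) directly, with no lookup by kanji/reading."""
    el = ET.fromstring(xml_text)
    return el.get("type"), int(el.get("seq")), el.text

# 1234567 is a placeholder seq number, not a real entry.
xml_text = export_xref("abbr", 1234567, "ナニナニ")
print(import_xref(xml_text))
```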