Re: [edict-jmdict] pos/misc entities



By the way, could the master file and generating scripts be packed
up and uploaded, even if they're officially "unsupported" and
undocumented?

(FWIW: I don't have a strong opinion on either POS culling or
the culling inconsistency.  I think the entity problem is orders
of magnitude more important than either, and while I wouldn't
mind seeing the inconsistent culling fixed, I think it's by far
the least important part of the discussion.)

On Thu, Jun 12, 2008 at 03:17:57PM -0600, Stuart McGraw wrote:
> But if the app does anything with its internal
> representation of the entries, it would be a strange
> design that did not tag each sense with the POS that
> the sense *does* have.  Otherwise every operation that
> works with sense POS's (e.g. stripping out senses that
> are nouns, finding all the verb senses, etc), all have
> to implement the implied POS rule and worse, the code

Sure, since POS's are culled like this, you need to reverse the
transformation in code.  I can see cases where that would be
annoying; eg. "//entry/sense[pos/n]"[1] to select all noun
senses doesn't work; you'd need to make that query and then
iterate forward through every sense with no pos.  That's
algorithmically easy and fast, since you're just doing a forward
scan, and you'd presumably stick it in a function, eg.

 for x in entry.xpath(".//sense[pos/n]"):
   for y in pos_scan(x):
     # receives y = x and every following sibling node that has no pos
     ...

but I can see where it could be annoying.
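
To make that forward scan concrete, here is a minimal sketch of what pos_scan
might look like. It is only an illustration under some assumptions: it uses the
standard library's ElementTree rather than an lxml-style xpath(), it takes the
parent entry as an extra argument (ElementTree has no parent links), and the
sample data stores the POS as plain text ("n") rather than as a JMdict entity.
The names pos_scan, has_pos and noun_glosses are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample entry; real JMdict uses entities such as &n; inside <pos>.
SAMPLE = """
<entry>
  <sense><pos>n</pos><gloss>a</gloss></sense>
  <sense><gloss>b</gloss></sense>
  <sense><pos>v1</pos><gloss>c</gloss></sense>
  <sense><pos>n</pos><pos>vs</pos><gloss>d</gloss></sense>
  <sense><gloss>e</gloss></sense>
</entry>
"""

def pos_scan(entry, start):
    """Yield `start`, then each following <sense> sibling that has no <pos>."""
    senses = entry.findall("sense")
    i = senses.index(start)  # Elements compare by identity
    yield start
    for sense in senses[i + 1:]:
        if sense.find("pos") is not None:
            break  # an explicit <pos> ends the inherited run
        yield sense

def has_pos(sense, wanted):
    """True if the sense carries an explicit <pos> with the given text."""
    return any(p.text == wanted for p in sense.findall("pos"))

def noun_glosses(entry):
    """Collect the first gloss of every sense whose effective POS includes 'n'."""
    out = []
    for x in entry.findall("sense"):
        if has_pos(x, "n"):
            for y in pos_scan(entry, x):
                out.append(y.findtext("gloss"))
    return out

entry = ET.fromstring(SAMPLE)
print(noun_glosses(entry))
```

With the sample above, the senses glossed "b" and "e" are picked up even though
they carry no <pos> of their own, which is the part a bare XPath query misses.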

> No, there are other programs that want to generate
> XML and want it to be compatible with JMdict.  In my
> case I want to do so, so that I can textually compare
> XML generated from my database with the original XML
> that populated the database.  This provides a high-
> level validation of my XML parsing, data storage,
> data retrieval, and XML generation functions.

Not to say that your case isn't useful, but you're not describing
compatibility with JMdict; you're describing "outputting the exact
same file".  An XML file is compatible with JMdict if it follows the
DTD and the ad hoc rules (eg. only POS entities in <pos>), not by
mimicking the particular whitespace, POS culling, attribute order,
element order (other than eg. gloss priority), tag capitalization,
and whatever else.

I'd suggest that instead of outputting a text file and running diff,
you write an XML file, and then use a standalone SAX parser
to read both files and compare them at a node level (expanding
POS nodes before comparing).
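
As a rough sketch of that kind of node-level comparison: the code below uses
ElementTree instead of a SAX parser, purely to keep the example short, and all
of the function names (expand_pos, canon, same_entry) are made up for
illustration.  It assumes POS values are plain text and that the only
differences to ignore are whitespace, attribute order, the order of
differently-named child elements, and POS culling.

```python
import xml.etree.ElementTree as ET

def expand_pos(entry):
    """Copy the most recent explicit <pos> list onto each later sense that has none."""
    current = []
    for sense in entry.findall("sense"):
        own = [p.text for p in sense.findall("pos")]
        if own:
            current = own
        else:
            for i, text in enumerate(current):
                p = ET.Element("pos")
                p.text = text
                sense.insert(i, p)

def canon(elem):
    """Reduce an element to a comparable tuple.

    Children are stably sorted by tag, so the relative order of same-named
    elements (eg. glosses, senses) is kept but cross-tag order is ignored.
    """
    children = sorted((canon(c) for c in elem), key=lambda t: t[0])
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(children),
    )

def same_entry(xml_a, xml_b):
    """Compare two <entry> documents after undoing POS culling in both."""
    a, b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    expand_pos(a)
    expand_pos(b)
    return canon(a) == canon(b)

culled = ("<entry><sense><pos>n</pos><gloss>a</gloss></sense>"
          "<sense><gloss>b</gloss></sense></entry>")
explicit = ("<entry>\n <sense>\n  <gloss>a</gloss><pos>n</pos></sense>"
            "<sense><pos>n</pos><gloss>b</gloss></sense></entry>")
print(same_entry(culled, explicit))
```

A real JMdict comparison would also have to deal with the entity declarations
in the DTD, which this sketch sidesteps.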

> Again I disagree.  I don't, and think no one should,
> "expect" inconsistency or any other problem in software,
> free or otherwise.  When such is encountered in commercial

Well, everyone should expect problems with all software, of any
cost, since they happen.  :)  But that's no reason not to fix them.
To be fair, though, while this is a bit inconsistent, it's not a bug;
it follows the rules of the file and should cause no problems under
"in-spec" use.  So, don't forget that there's another aspect here:
fixing stuff (and keeping it fixed) takes time and energy, even
if it's easy, and this may just not be worth bothering with.

> > A real solution would be to add a new element around senses, because that's
> > what we mostly do with those POSes spreading from sense to sense : we group senses.
> 
> I recall that sub-senses have been mentioned here in
> the past.  It is an interesting idea but it is "changing
> the model".  I don't think it is necessary to come up with
> a new model to get an answer to the POS representation
> issue I am concerned about.

Do you mean eg. <senses><pos>&n;</pos><sense>...</sense></senses>?

This assumes that the ordering of senses is unimportant, and that it's
okay to sort them by POS in order to group them.  I don't know if senses
are in any particular order (a quick scan of the docs only mentions that
glosses are by importance, not senses), but grouping senses by part of
speech in order to factor out the POS tag is arbitrary (more than one
thing might want such grouping, and you can only do it once), and being
able to sort senses by importance is useful, even if the current data
doesn't do that.
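
A small illustration of the ordering concern, using made-up data (the
(pos, gloss) tuples and the group_by_pos helper are hypothetical, not anything
from JMdict): once senses are grouped by POS, an entry whose senses interleave
parts of speech can no longer be written out in its original order.

```python
def group_by_pos(senses):
    """Group (pos, gloss) pairs under their POS, keeping first-seen POS order."""
    groups = {}
    for pos, gloss in senses:
        groups.setdefault(pos, []).append(gloss)
    return groups

# Original sense order interleaves noun and verb senses.
senses = [("n", "first"), ("v1", "second"), ("n", "third")]

grouped = group_by_pos(senses)
# Writing the groups back out is all a <senses>-style wrapper could express.
flattened = [gloss for glosses in grouped.values() for gloss in glosses]
print(flattened)
```

Here the third sense moves ahead of the second, so any meaning carried by the
original sense order is lost.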

-- 
Glenn Maynard