
RE: [edict-jmdict] pos/misc entities



Jean-Luc Leger wrote:
On Thu, Jun 12, 2008 at 03:17:57PM -0600, Stuart McGraw wrote:
> Jean-Luc Leger wrote:
[...]
> * Any change would be a bandaid over a problem that
> should be solved more completely by using something
> like sub-senses (or super-senses).

Not exactly. I think using the explicit form would make a later change to
super-senses harder. That's my main problem with that solution.

Harder to change programs that generate/use JMdict?
I would think the code changes needed to incorporate super-
senses would be so extensive that it would not matter much
whether the original code processed explicit or implicit POS's.

[...]
> * All the information related to a sense is contained in
> that sense.

Sure, though it is an application concern.

But I think it is also a data organization / representation
issue, although the ease of converting between the distributed
and factored forms does seem to make it more of an aesthetic
issue than a practical one.

> * Doing so is consistent with the way other tags such as
> misc, dial, etc, are managed.  No other sense tags are
> propagated across senses.  (I acknowledge that no other
> tags are required in each sense as POS tags are.)

Well, all the others (except for lsource) are effectively specific to a sense.
POS is different. A word can have several POSes, and each of them can have
several senses.
So no, it is not automatic to manage POS like the other tags.
It may be our choice to do so, though.

OK but I was basing my suggestion on the current situation
in which POS *is* a per-sense tag.  That it may be better
organized as a per-super-sense tag, I will accept on faith
from you and the other 99.9% of people here who know more
about such things than I. :-)

But wouldn't the grouping of senses be done by similarity
of meaning, and the grouping by POS a by-product of that,
resulting from the fact that different POS's will generally
have (possibly slightly) different meanings?  That seems to
be the case in the English dictionaries I've looked at, but
perhaps that is not true in Japanese, say with a word that
can be used as a "n" and an "adj-na".

[...]
> * JMdict's primary purpose is to be machine parseable and
> explicit POS tags simplify that.

implicit POS is not a problem either

Agreed, not a problem, but explicit is simpler in the sense
that it does not require parsers or generators to remember
POS value sets between senses.
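
To illustrate the state a reader of the implicit form has to carry,
here is a minimal Python sketch that spreads POS values onto following
senses. It assumes each <entry> holds <sense> elements with zero or
more <pos> children, and it uses short plain-text POS values such as
"n" in place of JMdict's entities. The function name spread_pos is
hypothetical, and inherited <pos> elements are simply placed first in
the sense, which may not match the DTD's element order.

```python
import xml.etree.ElementTree as ET

def spread_pos(entry):
    """Copy the most recent POS set onto each later sense that has none."""
    current = []  # POS values remembered from the last sense that had any
    for sense in entry.findall("sense"):
        found = [p.text for p in sense.findall("pos")]
        if found:
            current = found
        else:
            # Sense has no POS of its own: insert the inherited values.
            for i, text in enumerate(current):
                pos = ET.Element("pos")
                pos.text = text
                sense.insert(i, pos)
    return entry

entry = ET.fromstring(
    "<entry>"
    "<sense><pos>n</pos><gloss>a</gloss></sense>"
    "<sense><gloss>b</gloss></sense>"
    "<sense><pos>adj-na</pos><gloss>c</gloss></sense>"
    "</entry>"
)
spread_pos(entry)
print([[p.text for p in s.findall("pos")] for s in entry.findall("sense")])
# prints [['n'], ['n'], ['adj-na']]
```

The variable "current" is the only extra state; with explicit POS on
every sense it would not be needed at all.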

[...]
> * It decreases the likelihood of errors.  It is too easy to
> add a new sense without a POS and have it unintentionally
> inherit an unintended POS.  Same thing likely to happen
> when reordering senses.  (Yes, the data is in master file,
> but same concern applies there, and some people use
> JMdict format for their own data.)

This should be checked by an input validation system.

How do you do such input validation?  A (non-AI/NLP) program
has no way to know what POS a sense should have so it can't
decide if a sense without an explicit POS is correctly defaulting
the previous sense's POS, or if an explicit POS should have
been entered but was forgotten.

I suppose you could have an entry input program that would
ask you to confirm the defaulted POS when one without an explicit
POS was entered, but I don't know how to get Windows Notepad
to do that. :-)

ISTM that requiring an explicit POS with each sense and
generating an error message when it is missing is a lot more
reliable.
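
As a rough sketch of that kind of check, assuming the same simplified
layout of <sense> elements with <pos> children (the function name
missing_pos is hypothetical), a validator for the explicit form only
has to look for senses with no POS at all:

```python
import xml.etree.ElementTree as ET

def missing_pos(entry):
    """Return the 1-based numbers of senses that have no <pos> element."""
    return [
        number
        for number, sense in enumerate(entry.findall("sense"), start=1)
        if sense.find("pos") is None
    ]

entry = ET.fromstring(
    "<entry>"
    "<sense><pos>n</pos><gloss>a</gloss></sense>"
    "<sense><gloss>b</gloss></sense>"
    "</entry>"
)
for number in missing_pos(entry):
    print(f"error: sense {number} has no POS")
# prints: error: sense 2 has no POS
```

Under the implicit rule the same input is valid, so a check like this
cannot tell a deliberate default from a forgotten tag.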

> * It is very easy to programmatically factor out common
> POSs from sequential senses when using parsed XML data.

I think it is easier to spread POS to the following senses.
XML tools (XSLT for example) make that easy. But they can't easily
factor out data (I think. I will try to do it, just to be sure of
the difficulty)

I don't know XSLT (although it is on my list to learn :-)
so I accept this.
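
For comparison, outside XSLT the factoring direction might look like
the following Python sketch. It makes the same simplifying assumptions
as before (plain-text POS values, <sense> elements directly under
<entry>); the function name factor_pos is hypothetical, and POS sets
are compared as ordered lists, so the same tags in a different order
would count as a change.

```python
import xml.etree.ElementTree as ET

def factor_pos(entry):
    """Drop <pos> elements that merely repeat the previous sense's POS set."""
    effective = None  # POS set currently in force under the default rule
    for sense in entry.findall("sense"):
        pos_elems = sense.findall("pos")
        found = [p.text for p in pos_elems]
        if not found:
            continue  # already implicit, nothing to remove
        if found == effective:
            for p in pos_elems:
                sense.remove(p)
        else:
            effective = found
    return entry

entry = ET.fromstring(
    "<entry>"
    "<sense><pos>n</pos><gloss>a</gloss></sense>"
    "<sense><pos>n</pos><gloss>b</gloss></sense>"
    "<sense><pos>adj-na</pos><gloss>c</gloss></sense>"
    "</entry>"
)
factor_pos(entry)
print([[p.text for p in s.findall("pos")] for s in entry.findall("sense")])
# prints [['n'], [], ['adj-na']]
```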

> * Most programs displaying JMdict data will likely do
> significant munging of the data so additional burden of
> factoring out common POS tags will be minimal.  (Those who
> don't want to do such processing are likely using EDICT).

Maybe. Maybe not ^^;
They will have to manage, anyway ^^

I always complain when I see other people claim some
feature is not needed because "nobody does that".  I think
that generally, most people don't have a good idea how
software (even their own) is used.  At best such statements
are usually just "gut feel", at worst, rationalizations
for doing (or not doing) what they want to.  So I probably
shouldn't have said that.
But it does sound plausible, even if there is no evidence
to support it. :-)

> * Making the change only to JMdict file by changing the
> master->XML generator is easy and safe to do.

What do you mean by "only to JMDict"?
You should make the change in the master file too,
so that no changes to the master->XML generator are needed.

I was referring to Jim's statement (b) in:
  + What I have toyed with is either:
  +
  + (a) amending my source database to have the POS on each sense, and
  + then amending the EDICT/EDICT2 generators to drop repetitions, but
  + retain a changed POS.
  +
  + (b) changing the XML generator to include a copy of the previous
  + sense's POS if none is explicitly stated.

If what you really want is to change only JMDict, then
I can make a program to create JMDict(explicit POS) from JMDict(implicit POS).

Yes, I agree that the transformation from either form
to the other is nearly trivial (except possibly the
explicit-to-implicit transform in the case of XSLT).

> * I suspect (though Jim would need to confirm) that changing
> the master data and EDICT generator is not as hard or
> risky as feared.

He answered.

Yes.  And I also wasn't aware of the use of your JMdict->EDICT
tools which seems to me to be a pretty solid reason not to
go with the explicit POS I am advocating for.

> * If the master file data were changed, there would be minimal
> and beneficial change to EDICT. Currently the 225 senses with
> repeated POS are also repeated in EDICT/wwwjdict.  Those spurious
> repetitions would be removed.

Definitely, we need to correct those inconsistent entries.
Do you have the list?

Yes, sent in a separate response.

Since for me, the inconsistent representation is the bigger
problem, I would be quite satisfied if those entries could be
changed to use the POS default rule.
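
For anyone who wants to look for such entries in their own data, a
sketch along the following lines could report them. It is not the
method used to produce the list mentioned above; it assumes the same
simplified structure as a real JMdict parse would need to be adapted
to, and the function name redundant_pos is hypothetical.

```python
import xml.etree.ElementTree as ET

def redundant_pos(entry):
    """Return 1-based numbers of senses whose explicit POS set repeats
    the POS set already in force from an earlier sense."""
    hits = []
    effective = None  # POS set inherited under the default rule
    for number, sense in enumerate(entry.findall("sense"), start=1):
        found = [p.text for p in sense.findall("pos")]
        if not found:
            continue  # sense relies on the default, which is consistent
        if found == effective:
            hits.append(number)
        effective = found
    return hits

entry = ET.fromstring(
    "<entry>"
    "<sense><pos>n</pos><gloss>a</gloss></sense>"
    "<sense><gloss>b</gloss></sense>"
    "<sense><pos>n</pos><gloss>c</gloss></sense>"
    "</entry>"
)
print(redundant_pos(entry))
# prints [3]
```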