
Re: [edict-jmdict] pos/misc entities



Jean-Luc Leger wrote:
Hi Jim,

On Thu, Jun 12, 2008 at 12:29:45PM +1000, Jim Breen wrote:
2008/6/12 Stuart McGraw <smcg4191@frii.com>:

Explicitly stating the PoS
in each sense is simple, regular, and orthogonal.
Yes, but from the point of view of humans reading the resulting
entry, rather a clutter.

What I have toyed with is either:

(a) amending my source database to have the POS on each sense, and
then amending the EDICT/EDICT2 generators to drop repetitions, but
retain a changed POS.

(b) changing the XML generator to include a copy of the previous
sense's POS if none is explicitly stated.

The latter would be the easiest.


(a) would be a pain to implement.
(b) would break your motto "JMdict must stay close to the master database",
    and I don't want these duplicates!

I am curious why this would be a problem for you?

The problem I see with the current situation is that
JMdict currently has entries like:

  <entry>
  ...
  <sense>
  <pos>&n;</pos>
  ...
  </sense>
  <sense>
  <pos>&n;</pos>
  ...
  </sense>

where the pos tags are actually duplicated, and
other entries like:

  <entry>
  ...
  <sense>
  <pos>&n;</pos>
  ...
  </sense>
  <sense>
  ...
  </sense>

where they are not.

Both entries will, when read by an application,
create similar internal data: an entry with two
noun senses.

But if the program wants to write the entries
back out in XML form, how does the program know
that it should suppress the noun pos in the second
sense of the second entry, but not do that in the
first entry?  It must record, when it parses the
XML, the fact that the noun pos was implied in the
second entry.  But that fact has zero information
content with regard to the semantic information
contained in the XML -- it is pure overhead, needed
solely to reproduce arbitrary formatting.  I think
it is better to dispense with the arbitrary formatting.
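
To illustrate the round-trip problem, here is a minimal Python sketch
using the standard xml.etree.ElementTree module. It is only an
illustration under stated assumptions: the two sample entries are
cut-down stand-ins for the ones above, the literal text "n" replaces
the &n; entity so that no DTD is needed, the gloss contents are
made-up placeholders, and read_senses is a hypothetical helper, not
part of any existing JMdict tool.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the two entries above. "n" replaces the &n;
# entity so the snippet parses without the JMdict DTD; the gloss text
# is placeholder content.
DUPLICATED = """<entry>
  <sense><pos>n</pos><gloss>first</gloss></sense>
  <sense><pos>n</pos><gloss>second</gloss></sense>
</entry>"""

IMPLIED = """<entry>
  <sense><pos>n</pos><gloss>first</gloss></sense>
  <sense><gloss>second</gloss></sense>
</entry>"""


def read_senses(xml_text):
    """Return one (pos_list, gloss_list) pair per sense.

    A sense with no <pos> element inherits the PoS of the previous sense.
    """
    senses = []
    previous_pos = []
    for sense in ET.fromstring(xml_text).findall("sense"):
        # Fall back to the previous sense's PoS when none is stated.
        pos = [p.text for p in sense.findall("pos")] or previous_pos
        glosses = [g.text for g in sense.findall("gloss")]
        senses.append((list(pos), glosses))
        previous_pos = pos
    return senses


# Both forms yield the same internal data, so nothing in that data says
# which form the entry had on input.
same = read_senses(DUPLICATED) == read_senses(IMPLIED)
```

Once both entries are read this way, the only way to write each one
back out in its original form is to carry an extra "this pos was
implied" flag per sense, which is the overhead described above.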

If not duplicating the pos tags is desirable,
that is fine (although it still seems to me that
the replicated pos form is simpler and more
canonical)... but it should be done consistently,
which means the first entry above should use
an implied pos in the second sense.

Again, that means either changing data in the
master file, or modifying the XML and EDICT
generators to suppress the pos tags in the cases
where they are now duplicated, yes?
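
For what it is worth, the suppression step in a generator could be
quite small. The following Python sketch assumes the generator already
holds an explicit PoS list for every sense; suppress_repeated_pos is a
hypothetical name and does not reflect how the actual XML or EDICT
generators are written.

```python
def suppress_repeated_pos(senses):
    """Blank every PoS list that repeats the previous sense's PoS.

    `senses` is a list of PoS lists, one per sense, all stated explicitly.
    The result has the same length; a suppressed PoS becomes an empty list,
    meaning "same as the previous sense".
    """
    output = []
    previous = None
    for pos in senses:
        if pos == previous:
            output.append([])          # repetition: leave it implied
        else:
            output.append(list(pos))   # first sense, or a changed PoS
        previous = pos
    return output


# Two noun senses: the second PoS is dropped.
two_nouns = suppress_repeated_pos([["n"], ["n"]])

# A changed PoS is retained, and so is a later return to the earlier one.
mixed = suppress_repeated_pos([["n"], ["n"], ["vs"], ["vs"], ["n"]])
```

Applied to every entry, this would give the consistent "implied pos"
output; leaving the step out would give the consistent "explicit pos"
output. Either way the choice is made in one place rather than entry
by entry.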