Re: [edict-jmdict] pos/misc entities



On Wed, Jun 11, 2008 at 11:02:50PM -0600, Stuart McGraw wrote:
> >> (a) amending my source database to have the POS on each sense, and
> >> then amending the EDICT/EDICT2 generators to drop repetitions, but
> >> retain a changed POS.
> >>
> >> (b) changing the XML generator to include a copy of the previous
> >> sense's POS if none is explicitly stated.
> >>
> >> The latter would be the easiest.
> >>
> > 
> > (a) would be a pain to implement
> > (b) would break your motto "JMDict must stay close to the master database"
> >     and I don't want these duplicates !!
> 
> I am curious why this would be a problem for you?
> 
> The problem I see with the current situation is that
> JMdict currently has entries like:
> 
>    <entry>
>    ...
>    <sense>
>    <pos>&n;</pos>
>    ...
>    </sense>
>    <sense>
>    <pos>&n;</pos>
>    ...
>    </sense>
> 

Only a few are like that.

> where the pos tags are actually duplicated, and
> other entries like:
> 
>    <entry>
>    ...
>    <sense>
>    <pos>&n;</pos>
>    ...
>    </sense>
>    <sense>
>    ...
>    </sense>
> 
> where they are not.

This form is good, and it is used almost everywhere in JMdict.

> 
> Both entries will, when read by an application,
> create similar internal data: an entry with two
> noun senses.

I hope they don't, but this is up to the application and what it wants to
do with the data.

> But if the program wants to write the entries
> back out in XML form, how does the program know
> that it should suppress the noun pos in the second
> sense of the second entry, but not do that in the
> first entry?  It must record, when it parses the
> xml, the fact that the noun pos was implied in the
> second entry.  But that fact has zero information
> content with regard to the semantic information
> contained in the xml -- it is pure overhead, needed
> solely to reproduce arbitrary formatting.  I think
> it is better to dispense with the arbitrary formatting.

I am not sure I understand the root of your problem.
The only program that generates the XML form, and for which consistency with
the data matters, is the JMdict generator that works from the master file.
That program already makes the distinction. Any other program can generate
an inconsistent JMdict; that is not a problem.
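
To make the point under discussion concrete, here is a minimal Python sketch
of how a reading application might expand implied POS tags. The list-of-dicts
layout, the placeholder glosses and the name expand_pos are assumptions made
for illustration; they are not taken from any JMdict tool. Under those
assumptions, both example entries end up as the same internal data, so a
program keeps track of which tags were implied only if it records that
separately.

```python
def expand_pos(senses):
    """Return a copy of senses in which an empty pos list inherits
    the pos of the previous sense (hypothetical helper)."""
    expanded = []
    previous = []
    for sense in senses:
        # Use the sense's own pos if present, otherwise the inherited one.
        pos = list(sense["pos"]) or list(previous)
        expanded.append({**sense, "pos": pos})
        previous = pos
    return expanded


# Placeholder data: first entry repeats the tag, second entry implies it.
explicit = [{"pos": ["n"], "gloss": ["a"]}, {"pos": ["n"], "gloss": ["b"]}]
implied = [{"pos": ["n"], "gloss": ["a"]}, {"pos": [], "gloss": ["b"]}]

same_after_expansion = expand_pos(explicit) == expand_pos(implied)
```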

> 
> If not duplicating the pos tags is desirable,
> that is fine (although it still seems to me that
> the replicated pos form is simpler and more
> canonical)... but it should be done consistently
> which means the first entry above should use
> an implied pos in the second sense.

Yes, it should, like so many things in EDICT/JMdict. But it is handmade
and free, so inconsistencies are to be expected. Moreover, it deals with
linguistic data, where inconsistency is more likely than anywhere else ^^;
And I am not even talking about the conflicts that come from having two
different goals (a Japanese dictionary _and_ a Japanese-to-English dictionary).

A real solution would be to add a new element around senses, because that is
mostly what we do with those POS tags spreading from sense to sense: we group
senses. But the expected answer to this is: wait until the master file has
been moved into a database.
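
As a rough illustration of the grouping idea, the Python sketch below collects
consecutive senses under one shared POS. No such grouping element exists in
JMdict; the function name group_senses, the dict layout and the placeholder
glosses are all hypothetical.

```python
def group_senses(senses):
    """Group consecutive senses that share a POS (hypothetical structure).
    A sense with an empty pos list joins the current group."""
    groups = []
    for sense in senses:
        pos = list(sense["pos"])
        # A new group starts at the first sense or when an explicit pos changes.
        starts_group = not groups or (bool(pos) and pos != groups[-1]["pos"])
        if starts_group:
            groups.append({"pos": pos, "senses": []})
        body = {key: value for key, value in sense.items() if key != "pos"}
        groups[-1]["senses"].append(body)
    return groups


# Placeholder data: two noun senses (the second implied), then one verb sense.
example = [
    {"pos": ["n"], "gloss": ["a"]},
    {"pos": [], "gloss": ["b"]},
    {"pos": ["v"], "gloss": ["c"]},
]
grouped = group_senses(example)
```

With a structure like this the POS would be stated once per group, so the
question of repeating or implying it per sense would not arise.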

> Again, that means either changing data in the
> master file, or modifying the xml and edict
> generators to suppress the pos tags in the cases
> where they are now duplicated, yes?

Yes, and I would rather have the data changed in the master file than modify
the generators. (The EDICT generator is now almost correct, though some parts
concerning POS and restrictions are so tricky that they could easily become
buggy again.)
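
For reference, the rule being discussed (drop a repeated POS, keep a changed
one) can be sketched in a few lines of Python. This is only an illustration
under assumed data structures, not the code of the actual generators, and the
name suppress_repeated_pos is hypothetical. It compares POS lists in order,
so ["n", "vs"] and ["vs", "n"] would count as different.

```python
def suppress_repeated_pos(senses):
    """Blank out a sense's pos when it equals the pos already in effect
    from earlier senses; keep it when it changes (hypothetical helper)."""
    output = []
    in_effect = []
    for sense in senses:
        pos = list(sense["pos"])
        if pos and pos == in_effect:
            # Repetition of the pos in effect: write it as implied.
            output.append({**sense, "pos": []})
        else:
            output.append({**sense, "pos": pos})
            if pos:
                in_effect = pos
    return output


# Placeholder data: noun, repeated noun, verb, implied verb.
example = [{"pos": ["n"]}, {"pos": ["n"]}, {"pos": ["v"]}, {"pos": []}]
written = [sense["pos"] for sense in suppress_repeated_pos(example)]
```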

	JL