Re: [edict-jmdict] Conjugations and PoS tags for だ, くれる

From: Olivier Binda <olivier.binda@**********>

Date: Fri, 25 Jul 2014 08:33:39 +0200

It is not so much that it makes table-driven conjugation simpler;
rather, it extracts and isolates a particular bit of information and
thus makes any machine processing simpler. That is one of the major
goals/benefits of organizing information in a structured form
like a database or XML.

I think your position (and 大辞林's) is perfectly reasonable when
the information is seen as being for presentation to a human
consumer, as-is. That is obviously the case with 大辞林, arguably
with Edict, but should it be the case with JMdict?

If an information source is also to be used for machine processing,
a human-friendly form that requires understanding a note becomes
sub-optimal. Another example is allowing glosses to have a 'lit'
tag rather than just throwing a "lit:" or "(lit)" or "literally"
in front of the gloss text. Another reason the former is preferable
to the latter is it is easier/simpler to generate the latter form
from the former than the reverse. In the くれる case it is easy
to present it as 'v1' verb (should you want to) if you are told
it is a 'v1-ik' verb. To go the other way requires code which
currently looks only at the PoS to also be given access to the
kanji, entry seq number or some other auxiliary information in
order to provide special handling to a tiny subset (of 1) of 'v1'
entries.
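The asymmetry described above can be sketched in a few lines. This is only an illustration, not anyone's actual implementation: the 'v1-ik' tag is the name used in this thread, and the function names and the exception list are hypothetical.

```python
# Easy direction: a consumer that does not care about the distinction
# can collapse the fine-grained tag using the PoS alone.
def display_pos(pos: str) -> str:
    """Present 'v1-ik' as plain 'v1'; leave every other tag unchanged."""
    return "v1" if pos == "v1-ik" else pos


# Hard direction: if the data only says 'v1', the consumer must carry
# its own exception list keyed on something other than the PoS.
# (Hypothetical list; an entry seq number would serve the same purpose.)
KURERU_EXCEPTIONS = {("呉れる", "くれる")}


def refine_pos(pos: str, kanji: str, reading: str) -> str:
    """Recover 'v1-ik' from 'v1' by consulting auxiliary information."""
    if pos == "v1" and (kanji, reading) in KURERU_EXCEPTIONS:
        return "v1-ik"
    return pos
```

Note that the reading alone would not be enough for the reverse direction: 暮れる is also read くれる but is an ordinary v1 verb, which is why the sketch keys on the kanji/reading pair.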

I completely agree with these points. IMO, JMdict is really targeted at developers.

Consumers of JMdict mostly aren't humans; they are computer algorithms.
Software developers take JMdict and turn the machine-readable bits into apps.
They mostly throw away the human-readable bits (or display them to English-speaking users).

99.99% of the people who use JMdict do so through these apps.

This is why I personally would love any move that makes JMdict significantly easier for machines to read.

<rant>
This is also why I am a bit concerned about the way i18n is handled and Hispadic is merged into JMdict:

IMO, the Dutch/German/Hispadic/French data should be merged once, with senses aligned whenever possible (using PoS; I had some success disambiguating with this), and then corrections by human editors could improve the accuracy and coverage of the data over time.

As I understand it, Hispadic is merged every day (without using PoS), which means that the Spanish senses are put into (and lost in) the first English sense.
The issues I have with that are:

- No sense alignment (my app breaks for Spanish users: I need sense alignment for glossing texts)
- The more time passes, the more JMdict drifts away from Hispadic (the PoS of JMdict change while those of Hispadic are frozen, so aligning senses through PoS becomes harder and fails more often)
- No chance for human editors/checks to improve the Spanish data
</rant>
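To make "aligning senses using PoS" concrete, here is a minimal sketch of one way it could work. It is not the procedure I actually use, nor the one used for the Hispadic merge; the function name and the data shapes are assumptions made for illustration. A foreign sense is attached to an English sense only when exactly one English sense shares a PoS tag with it; anything ambiguous is kept aside for a human editor instead of being dumped into the first sense.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


def align_by_pos(
    english_pos: List[Set[str]],
    foreign_senses: List[Tuple[Set[str], List[str]]],
) -> Dict[Optional[int], List[str]]:
    """Map English sense index -> foreign glosses.

    english_pos[i] is the PoS set of the i-th English sense.
    Each foreign sense is a (PoS set, glosses) pair.
    Glosses that cannot be aligned unambiguously go under the key None.
    """
    aligned: Dict[Optional[int], List[str]] = defaultdict(list)
    for pos, glosses in foreign_senses:
        # English senses sharing at least one PoS tag with this sense.
        candidates = [i for i, en in enumerate(english_pos) if en & pos]
        key = candidates[0] if len(candidates) == 1 else None
        aligned[key].extend(glosses)
    return dict(aligned)
```

The sketch also shows why drift hurts: once the PoS tags on the JMdict side change, fewer foreign senses find exactly one match and more of them fall into the unaligned bucket.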

Also, I have been computing JMdict glosses for other languages through wordnets (many thanks to the Open Multilingual Wordnet), with a success rate that is good enough for my apps/users but never excellent:
http://www.spartan-entertainment.com/android/languageSupport.html
Human-readable entries with "lit:", "(lit)" or "literally" in front of the gloss text, or with human-readable clarifications in parentheses at the end, make this much harder.

I would much rather have an XML attribute or tag providing such info in machine-readable form, avoiding filtering, processing, and painful disambiguation.
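As a rough illustration of what this would buy a consumer, the sketch below reads a sense whose literal gloss is marked with an attribute. The attribute name g_type, the function name and the sample XML are assumptions made up for this example; the sample is not copied from JMdict.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Illustrative sample only; not an actual JMdict entry.
SAMPLE_SENSE = """
<sense>
  <gloss>to be extremely busy</gloss>
  <gloss g_type="lit">to want to borrow even a cat's paw</gloss>
</sense>
"""


def split_glosses(sense_xml: str) -> Tuple[List[str], List[str]]:
    """Return (plain glosses, literal glosses) for one <sense> element."""
    sense = ET.fromstring(sense_xml)
    plain: List[str] = []
    literal: List[str] = []
    for gloss in sense.iter("gloss"):
        # The attribute carries the information; no text parsing is needed.
        target = literal if gloss.get("g_type") == "lit" else plain
        target.append(gloss.text or "")
    return plain, literal
```

With the marker inside the gloss text instead, the same consumer would need pattern matching for every spelling ("lit:", "(lit)", "literally", ...) before it could even start matching glosses against a wordnet.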

Olivier

I don't think it is at all a stretch to call くれる irregular, any
more than calling 行く irregular is. Obviously definitions of
"irregular" vary but I think a definition that requires words
of the same "conjugation class" to follow the same conjugation
rules is more useful than a squishy one that says they mostly
follow the same rules except for a few exceptions, whether few=1
or 1000.

The commonly used imperative form of くれる is くれ. This is
different from the imperative form of any other v1 verb. IMO
this information should be captured in a form that is easily
understandable by an algorithm, particularly since a mechanism
to do this (a unique PoS per conjugation type) is already in use
and applies to every other word (with 'aux-v' as a catchall for
"irregular in a way we don't care about capturing")
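A minimal sketch of what "a unique PoS per conjugation type" means for a table-driven conjugator follows. Only the imperative is covered; the 'v1-ik' tag is the name used in this thread, and the table and function names are hypothetical.

```python
# One table row per PoS tag: the suffix that replaces the final る.
IMPERATIVE_SUFFIX = {
    "v1": "ろ",     # regular ichidan, e.g. 食べる -> 食べろ
    "v1-ik": "",    # くれる -> くれ (the commonly used form)
}


def imperative(dictionary_form: str, pos: str) -> str:
    """Build the imperative from the dictionary form and the PoS tag only."""
    if pos not in IMPERATIVE_SUFFIX:
        raise ValueError(f"no imperative rule for PoS {pos!r}")
    if not dictionary_form.endswith("る"):
        raise ValueError(f"{dictionary_form!r} does not end in る")
    return dictionary_form[:-1] + IMPERATIVE_SUFFIX[pos]
```

The function never looks at the word itself to decide which rule applies, so no per-entry special case is needed once the tag is in the data.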

> [We are a bit spoiled by Japanese regularity. I remember when I was
> boning up on French prior to a sabbatical in France in 93/4 I wrote a
> verb conjugator to help me do drills. I gave up eventually, as it
> seemed every second verb that is regarded as regular has an odd twist
> somewhere.]
>
>> 2) いい could be handled simply by splitting it out from the entry
>> for よい and making よい [adj-i] while making いい an [exp] or [unc] with
>> an xref to よい and a note that いい doesn’t inflect. I don’t think an
>> additional PoS is needed and if one is added, it definitely
>> shouldn’t include よい; よい is just a regular old i-adjective. And いい
>> is just a modified version of よい that has no inflections of its
>> own, so I think it would be wrong to say that it has its own PoS
>> with its own inflection pattern that includes よくない, etc.; those
>> forms belong to the regular adjective よい.
>
> I found this suggestion, splitting the いい and よい into different
> entries, a bit radical to start with, but as I have thought it over,
> it's gained appeal. Part of the appeal is that we have a heap of
> entries with structures like XX[の|が][よい|いい], and they are rather
> messy with all the restrictions to line the kanji surface forms with
> the readings. I just added a lot more because I noticed that quite a
> few had crept in the noun tags, and as I corrected them I also added
> a lot of ...[よい|いい], forms. Splitting would certainly result in much
> cleaner entries, indeed I've never been happy with the rather messy
> いい/よい situation.
>
> I notice that GG5, apart from in the 良い entry itself, never writes いい
> as 良い. If we go ahead with a よい/いい split, and I have to say I'm
> tempted, I'm inclined not to use 良い in the kanji form for the いい.
> Thus we'd have entry pairs such as:
>
> 頭のいい [あたまのいい] /(exp,adj-f) (See 頭がいい) bright/intelligent/
> 頭の良い;頭のよい [あたまのよい] /(exp,adj-i) (See 頭のいい) bright/intelligent/

頭のいい is conjugatable, is it not? (Google turns up a lot of
頭がよかった's). That is not the case with other 'adj-f' words
(eg Ａ級). Why throw away information by taking entries that
currently form a distinct class (that conjugates consistently)
and rolling them into a larger, vaguer class? (This seems like
taking oddly conjugating verbs like する and calling them 'aux-v'
to avoid needing to maintain a 'vs-i' PoS tag.)

And would 'adj-f' also be the PoS tag for いい? Or would there
still be an 'adj-y' (or whatever) tag for いい? If not then one
again needs くれる-like tricks to provide a table of conjugations
for いい -- a word of fundamental interest to Japanese learners.
This seems a step in the wrong direction.

However, if an 'adj-f' tag makes sense in describing the use of
いい, then what about an additional PoS tag to indicate that
this is an 'adj-f' word that follows a specific conjugation
pattern?
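To illustrate what such a tag would enable, here is a small sketch of a conjugator that handles いい-final words from the PoS alone. The tag name 'adj-y' is the placeholder used above, not an existing tag, and only three forms are covered; the function and table names are hypothetical.

```python
# Endings shared by both classes once the stem is known.
ENDINGS = {"plain": "い", "negative": "くない", "past": "かった"}


def conjugate(word: str, pos: str, form: str) -> str:
    """Conjugate an i-adjective or an いい-final word from its PoS tag."""
    if form not in ENDINGS:
        raise ValueError(f"unknown form {form!r}")
    if pos == "adj-i":
        if not word.endswith("い"):
            raise ValueError(f"{word!r} does not end in い")
        return word[:-1] + ENDINGS[form]
    if pos == "adj-y":  # placeholder tag for the いい pattern
        if not word.endswith("いい"):
            raise ValueError(f"{word!r} does not end in いい")
        if form == "plain":
            return word
        # Inflected forms are built on the よ stem of よい.
        return word[:-2] + "よ" + ENDINGS[form]
    raise ValueError(f"no conjugation rule for PoS {pos!r}")
```

Under this sketch 頭のいい gives 頭のよかった for the past form, with no per-entry special handling.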

>[...]