
Conjugations and PoS tags for だ, くれる



I would like to re-raise an issue that was previously discussed 
here around 2010-10-17, Subject: "A 'cop' PoS tag?"
  https://groups.yahoo.com/neo/groups/edict-jmdict/conversations/topics/4315
I have been discussing the issue with Jim Breen in email and he 
suggested raising the issue here...

tl;dr...  I would like to request that だ and くれる get unique
PoS tags that convey the fact that they conjugate differently from
the other words that share their current PoS tags.

Back in 2010 I asked about a 'cop' PoS tag for だ because of 
its usefulness in conjugating words from JMdict.  It has become 
of more than theoretical interest to me lately because, as you 
may have seen, the JMdictDB submission system entry pages now 
have a "Conjugations" link.  (This is my own implementation; it 
was developed independently of Jim's conjugator code, although I 
made considerable use of the conjugations WWWjdic provides in 
building the data tables for my version.)

It seems to work well [*1] and provides conjugations for a 
number of PoS classes that WWWjdic doesn't.  However, there 
are problems with the following classes of words:

 良い・いい -- Jim has agreed to a special PoS for these.  The tag
    and its conjugations are already in my local code base, so
    this problem will go away soon. 

 くれる -- I would (as in 2010) like the 'v1i-k' PoS tag that 
    this word formerly had restored so that its irregular
    imperative form can be automatically generated.

 だ -- Right now this has an 'aux' tag which is not conjugatable
    at all.  In much the same way as WWWjdic conjugates 'vs' words 
    by conjugating an affixed する, I "conjugate" 'n' and 'adj-na'
    words by conjugating an affixed だ.  I'd prefer to instead
    just conjugate the word だ directly.  If だ had a PoS that
    identified it as being in a unique conjugation class, doing
    that would be much easier.
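
For illustration only, the affixed-だ approach described above can be
sketched roughly as follows.  The DA_FORMS table, the conjugation names
and the function name are hypothetical and are not taken from JMdictDB;
only a few well-known forms of だ are listed.

```python
# Hypothetical table of a few conjugated forms of the copula だ.
DA_FORMS = {
    "plain": "だ",
    "past": "だった",
    "negative": "ではない",
    "te": "で",
}

def conjugate_with_da(word, pos, conj):
    """Conjugate an 'n' or 'adj-na' word by affixing a conjugated だ."""
    if pos not in ("n", "adj-na"):
        raise ValueError("only 'n' and 'adj-na' words take an affixed だ")
    return word + DA_FORMS[conj]

print(conjugate_with_da("静か", "adj-na", "past"))
```

With a dedicated PoS for だ, the same forms could be produced by
conjugating the entry for だ itself instead of going through an affix.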

Below are some points from the previous discussion and my recent 
discussion with Jim in a Q&A format...

Why don't I just special-case だ and くれる?

Because the conjugator is table-driven: all the information 
needed to conjugate a word is obtained (primarily) from a 
table that is indexed by PoS and conjugation type.  The assumption 
is that the PoS in effect defines the rules for conjugating 
words of that class.  This table-driven approach allowed 
me to implement the conjugations feature completely in SQL 
which has advantages for the JMdictDB project.
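
A minimal sketch of the table-driven idea, in Python rather than SQL.
The table layout (characters to drop, suffix to add) and its contents
are assumptions made for illustration and are not the actual JMdictDB
tables; 'cop' and 'v1i-k' are the tags under discussion in this post,
not tags that だ and くれる currently carry.

```python
# Hypothetical table: (PoS, conjugation) -> (chars to drop, suffix to add).
CONJ_TABLE = {
    ("v1", "past"): (1, "た"),
    ("v1", "imperative"): (1, "ろ"),
    ("v1i-k", "past"): (1, "た"),
    ("v1i-k", "imperative"): (1, ""),    # くれる -> くれ (irregular)
    ("v5k", "past"): (1, "いた"),
    ("cop", "past"): (1, "だった"),
}

def construct(word, pos, conj):
    """Build a conjugated form using only what the table says."""
    drop, suffix = CONJ_TABLE[(pos, conj)]
    return word[:len(word) - drop] + suffix

print(construct("食べる", "v1", "imperative"))
print(construct("くれる", "v1i-k", "imperative"))
```

Because the PoS is the only key, no word-specific branch is needed:
giving くれる its own PoS is what lets its irregular imperative come
out of the same lookup as every other form.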

There also seems to be a dearth of open-source code for doing
Japanese conjugations.  The conjugations tables in JMdictDB are
open-source and need not be implemented in a database; they 
can easily be read by or embedded in code in any programming 
language to do conjugations.  The actual code needed to generate 
a conjugated form from the info extracted from the tables is 
trivial [*2].  They are thus of wider benefit than to just the 
JMdictDB project.  However, if every program that wanted to use 
them also had to include code to special-case some words, their 
value would be substantially reduced.
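
As a rough sketch of how such tables might be consumed outside a
database, the following reads a small CSV table and uses it directly.
The column names (pos, conj, drop, suffix) and the rows are invented
for this example and do not reflect the layout of the real JMdictDB
.csv files.

```python
import csv
import io

# Hypothetical CSV content; a real program would read this from a file.
CSV_TEXT = """pos,conj,drop,suffix
v1,past,1,た
v5k,past,1,いた
v5k,negative,1,かない
"""

def load_table(text):
    """Parse CSV text into a {(pos, conj): (drop, suffix)} dict."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        table[(row["pos"], row["conj"])] = (int(row["drop"]), row["suffix"])
    return table

TABLE = load_table(CSV_TEXT)

def conjugate(word, pos, conj, table=TABLE):
    """Drop the listed number of trailing characters, then add the suffix."""
    drop, suffix = table[(pos, conj)]
    return word[:len(word) - drop] + suffix

print(conjugate("書く", "v5k", "negative"))
```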

Aren't there a lot of words with unique conjugation rules
which will lead to a lot of unique PoS tags?

I don't know.  That point was raised in the 2010 discussion,
but I think most of the words mentioned were archaic.  AFAIK
the only common modern words in JMdict that violate the
PoS-defines-conjugation-rules assumption are the three mentioned 
above.

It would be nice to provide conjugations for archaic words as
well and there may be ways to do so (maybe by structuring the
PoS tags into a two-level hierarchy?) but I think being able to
uniformly handle just modern words has enough value to justify
being addressed independently.

Shouldn't 'cop' (copula) also apply to other entries like である?

Perhaps.  The word "copula" has meaning that extends beyond
the syntactical.  Perhaps some abbreviation other than 'cop' 
would be better for what I am requesting.  'da'?  'cop-da'?  
'da-predicate'?  I care less about the actual abbreviation used 
than that there is one that says that this word だ conjugates 
in this particular way.

Comments?

----
[*1]
Corrections will of course be gratefully received. 

[*2]
I wrote a simple command line demo program that conjugates 
words and runs completely independently of the JMdictDB database. 
The conjugation is done in the 4-line function construct() which
uses only info extracted from the conjugation tables (read 
from .csv files) to do its work.  See
  http://www.edrdg.org/~smg/cgi-bin/hgweb-jmdictdb.cgi/file/eb393788c541/python/conj.py#l237
Although the rest of the code is a little lengthy, that is mostly
because of copious comments, argument parsing, output formatting
and rearranging the tables for more efficient lookups.