Conjugations and PoS tags for だ, くれる
I would like to re-raise an issue that was previously discussed
here around 2010-10-17, Subject: "A 'cop' PoS tag?"
https://groups.yahoo.com/neo/groups/edict-jmdict/conversations/topics/4315
I have been discussing the issue with Jim Breen in email and he
suggested raising the issue here...
tl;dr: I would like to request that だ and くれる each get a unique
PoS tag that conveys the fact that they conjugate differently from
other words that share their current PoS tags.
Back in 2010 I asked about a 'cop' PoS tag for だ because of
its usefulness in conjugating words from JMdict. It has become
of more than theoretical interest to me lately because, as you
may have seen, the JMdictDB submission system entry pages now
have a "Conjugations" link. (This is my own implementation; it
was developed independently of Jim's conjugator code, although I
made considerable use of the conjugations WWWjdic provides in
building the data tables for my version.)
It seems to work well [*1] and provides conjugations for a
number of PoS classes that WWWjdic doesn't. However there
are problems with the following classes of words:
良い・いい -- Jim has agreed to a special PoS for these; the
tag and its conjugations are already in my local code base,
so this problem will go away soon.
くれる -- I would (as in 2010) like the 'v1i-k' PoS tag that
this word formerly had restored so that its irregular
imperative form can be automatically generated.
だ -- Right now this has an 'aux' tag which is not conjugatable
at all. In much the same way as WWWjdic conjugates 'vs' words
by conjugating an affixed する, I "conjugate" 'n' and 'adj-na'
words by conjugating an affixed だ. I'd prefer to instead
just conjugate the word だ directly. If だ had a PoS that
identified it as being in a unique conjugation class, doing
that would be much easier.
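To illustrate the affixing approach described for だ, here is a
minimal sketch in Python. It is not the JMdictDB code: the names
DA_FORMS, conjugate_da and conjugate_predicate, and the conjugation
labels, are hypothetical and chosen only for this example.

```python
# Hypothetical table of forms of だ; the keys are illustrative labels,
# not the conjugation names used by JMdictDB.
DA_FORMS = {
    "non-past": "だ",
    "past": "だった",
    "negative": "ではない",
    "te-form": "で",
}

def conjugate_da(conj):
    """Return the requested form of the word だ itself."""
    return DA_FORMS[conj]

def conjugate_predicate(word, conj):
    """'Conjugate' an 'n' or 'adj-na' word by affixing a conjugated だ."""
    return word + conjugate_da(conj)

print(conjugate_predicate("静か", "past"))   # 静かだった
print(conjugate_da("negative"))             # ではない
```

With a dedicated PoS tag on だ, the second function would be the only
thing needed for 'n' and 'adj-na' words, and だ itself would go
through the same lookup as any other conjugatable word.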
Below are some points from the previous discussion and my recent
discussion with Jim in a Q&A format...
Why don't I just special-case だ and くれる?
Because the conjugator is table-driven: all the information
needed to conjugate a word is obtained (primarily) from a
table that is indexed by PoS and conjugation type. The assumption
is that the PoS in effect defines the rules for conjugating
words of that class. This table-driven approach allowed
me to implement the conjugations feature completely in SQL
which has advantages for the JMdictDB project.
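As a rough sketch of what "table-driven" means here, assume a
simplified table keyed by (PoS, conjugation) whose value says how
many trailing characters to drop and what ending to add. The layout,
the conjugation names and the function name are my assumptions for
this example and are not the actual JMdictDB schema. The point is
that くれる needs no special-casing in code once its PoS selects
different rows than plain 'v1' does.

```python
# Hypothetical, simplified table:
# (PoS tag, conjugation) -> (trailing chars to drop, ending to add)
CONJ_TABLE = {
    ("v1", "past"): (1, "た"),
    ("v1", "imperative"): (1, "ろ"),
    # Rows for the tag requested for くれる: same past, different imperative.
    ("v1i-k", "past"): (1, "た"),
    ("v1i-k", "imperative"): (1, ""),
}

def construct(word, pos, conj):
    """Build a conjugated form using only data from the table."""
    drop, ending = CONJ_TABLE[(pos, conj)]
    return word[:len(word) - drop] + ending

print(construct("食べる", "v1", "imperative"))     # 食べろ
print(construct("くれる", "v1i-k", "imperative"))  # くれ
print(construct("くれる", "v1", "imperative"))     # くれろ (what a plain 'v1' tag yields)
```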
There also seems to be a dearth of open-source code for doing
Japanese conjugations. The conjugations tables in JMdictDB are
open-source and need not be implemented in a database; they
can easily be read by or embedded in code in any programming
language to do conjugations. The actual code needed to generate
a conjugated form from the info extracted from the tables is
trivial [*2]. The tables are thus of benefit beyond just the
JMdictDB project. However, if every program that wanted to use
them also had to special-case some words, their value would be
substantially reduced.
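To give an idea of how little code is needed outside a database, the
following sketch reads a conjugation table from CSV text and uses it
directly. The column names (pos, conj, drop, ending) and the sample
rows are invented for this example; the real JMdictDB .csv files are
laid out differently.

```python
import csv
import io

# Stand-in for a .csv file on disk; columns and rows are hypothetical.
CSV_TEXT = """pos,conj,drop,ending
v1,past,1,た
v1,negative,1,ない
v5k,past,1,いた
v5k,negative,1,かない
"""

def load_table(fileobj):
    """Read rows into a dict keyed by (pos, conj)."""
    table = {}
    for row in csv.DictReader(fileobj):
        table[(row["pos"], row["conj"])] = (int(row["drop"]), row["ending"])
    return table

def construct(word, pos, conj, table):
    """Drop the old ending and append the new one, per the table row."""
    drop, ending = table[(pos, conj)]
    return word[:len(word) - drop] + ending

table = load_table(io.StringIO(CSV_TEXT))
print(construct("書く", "v5k", "past", table))       # 書いた
print(construct("食べる", "v1", "negative", table))  # 食べない
```

Nothing here depends on the word being conjugated, only on its PoS,
which is why a word whose PoS does not determine its conjugation
forces every such program to add its own exceptions.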
Aren't there a lot of words with unique conjugation rules
which will lead to a lot of unique PoS tags?
I don't know. That point was raised in the 2010 discussion,
but I think most of the words mentioned were archaic. AFAIK
the only common modern words in JMdict that violate the
PoS-defines-conjugation-rules assumption are the three
mentioned above.
It would be nice to provide conjugations for archaic words as
well and there may be ways to do so (maybe by structuring the
PoS tags into a two-level hierarchy?) but I think being able to
uniformly handle just modern words has enough value to justify
being addressed independently.
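One possible reading of a "two-level hierarchy" (purely a sketch of
my own, not an agreed design) is that a specific tag names a parent
tag and only overrides the rows that differ. The PARENT mapping, the
table layout and the function names below are hypothetical, and I use
くれる rather than an archaic word only because its forms are not in
doubt.

```python
# Hypothetical child -> parent relation between PoS tags.
PARENT = {"v1i-k": "v1"}

# Hypothetical rows: (PoS, conjugation) -> (trailing chars to drop, ending).
TABLE = {
    ("v1", "past"): (1, "た"),
    ("v1", "imperative"): (1, "ろ"),
    ("v1i-k", "imperative"): (1, ""),   # the only row the child overrides
}

def lookup(pos, conj):
    """Try the specific tag first, then fall back to its parent tag."""
    while pos is not None:
        if (pos, conj) in TABLE:
            return TABLE[(pos, conj)]
        pos = PARENT.get(pos)
    raise KeyError(conj)

def construct(word, pos, conj):
    drop, ending = lookup(pos, conj)
    return word[:len(word) - drop] + ending

print(construct("くれる", "v1i-k", "past"))        # くれた (inherited from 'v1')
print(construct("くれる", "v1i-k", "imperative"))  # くれ (overridden)
```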
Shouldn't 'cop' (copula) also apply to other entries like である?
Perhaps. The word "copula" has meaning that extends beyond
the syntactical. Perhaps some abbreviation other than 'cop'
would be better for what I am requesting. 'da'? 'cop-da'?
'da-predicate'? I care less about the actual abbreviation used
than that there is one that says that this word だ conjugates
in this particular way.
Comments?
----
[*1]
Corrections will of course be gratefully received.
[*2]
I wrote a simple command line demo program that conjugates
words and runs completely independently of the JMdictDB database.
The conjugation is done in the 4-line function construct() which
uses only info extracted from the conjugation tables (read
from .csv files) to do its work. See
http://www.edrdg.org/~smg/cgi-bin/hgweb-jmdictdb.cgi/file/eb393788c541/python/conj.py#l237
Although the rest of the code is a little lengthy, that is mostly
because of copious comments, argument parsing, output formatting
and rearranging the tables for more efficient lookups.