Re: [edict-jmdict] Enhancements to the English translations
Hi Stuart
Thanks for the comments. I'll leave your text intact and insert comments
(sorry for the length of this).
On 12/04/07, Stuart McGraw <smcg4191@frii.com> wrote:
Jim Breen wrote:
> I want to raise some enhancements to the glosses/translational
> equivalents which Francis Bond and I have been discussing in
> recent months, and have begun doing some work towards.
> [...]
> (a) marking where the gloss is a direct translation and
> where it is an explanation. The example in the paper is 点/てん
> where "spot" is a translation and "counter for goods and items"
> is an explanation. Having these marked would be of benefit in a
> number of areas, including the process of reversal, i.e. using
> the file in an EJ direction.
Sometime in the past I suggested adding a PoS tag of "counter".
If that were done, and the counter sense of 点/てん had that
PoS, then tagging the gloss as "explanation" seems fine to me.
But if it retains its current "n-suf" PoS, then the only way to figure
out that it is a counter is to parse the English explanatory gloss,
which I don't think is ideal. Alternatively the gloss could be tagged
"counter" (rather than "explanation") with the understanding (on the
part of applications) that a "counter" gloss is a specialized
explanation that describes what the counter counts. But I think a
PoS would be better.
Yes, a "counter" PoS was on my RoundToIt list until a few minutes ago. I
took the plunge, grepped for all the entries containing "counter.*[for|used]"
and isolated the gloss in a sense of its own with a "ctr" PoS tag. There were
59 of them.
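
As an illustration only, the isolation step described above could be sketched in Python as follows. The function name and the word-boundary pattern are assumptions, not the script that was actually run. Note that in grep and most regex dialects `[for|used]` is a bracket expression matching a single character, so this sketch assumes the intent was the alternation "for" or "used".

```python
import re

# Assumed intent of the grep pattern: the word "counter" followed,
# later in the same gloss, by the word "for" or "used".
COUNTER_RE = re.compile(r"\bcounter\b.*\b(?:for|used)\b")

def split_counter_glosses(glosses):
    """Split one sense's glosses into (ordinary, counter) lists.

    The counter list is what would move into a sense of its own
    carrying the "ctr" PoS tag. Hypothetical helper for illustration.
    """
    ordinary = [g for g in glosses if not COUNTER_RE.search(g)]
    counter = [g for g in glosses if COUNTER_RE.search(g)]
    return ordinary, counter

# Example using the glosses quoted earlier in this thread.
ordinary, counter = split_counter_glosses(
    ["spot", "counter for goods and items"]
)
```

A real run would still need manual review, since a pattern like this can also catch glosses that merely mention a counter.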
I guess there will also be some entries without "equ" glosses,
only an "expl" one? For example, things like:
[1000000] ヽ 【くりかえし】 repetition mark in katakana
I guess that will be the case. It will probably be useful to consider
a default classification. There are probably more "equ" than "expl".
Something related that I've thought about in the past is
literal and idiomatic glosses. I am thinking of things like:
[2112870] 一目十行 outstanding reading ability (lit. one glance, ten lines)
This really seems to me like two glosses:
outstanding reading ability [idiomatic]
one glance, ten lines [literal]
where "[...]" indicates a marker (gloss meta-data) not textually
part of the gloss. (Feel free to substitute better words for "idiomatic"
and "literal".)
Such a pair would normally be displayed with the [literal] gloss visually
marked as in the current single gloss jmdict entry. But from a data
perspective, since nearly all glosses are "literal", perhaps the actual
lexicon glosses should have the idiomatic gloss tagged:
outstanding reading ability [idiomatic]
one glance, ten lines
and leave it to the UI to suppress the [idiomatic] tag and add a
[lit.] notation to the non-tagged gloss.
I think I'd take a slightly different tack here. Those (lit: ...)
annotations at present are only added when a literal translation
of the Japanese doesn't lead to the correct or meaningful English.
For that reason I think it's important that they remain explicitly
tagged.
Another example (that also involves an explanation gloss):
[2093310] 釈迦に説法 teach your grandmother to suck eggs; expression
meaning teaching something to someone who knows more than
you; lit. to lecture the buddha
teach your grandmother to suck eggs [idiomatic]
expression meaning teaching something to someone who knows more than you [expl]
to lecture the buddha
Yes, in that case I think we'd need a [lit] (or whatever) associated
with the "lecture the buddha".
> Implementing this probably involves:
> (i) adding another table to contain the marker
> (ii) in JMdict using the marker to generate an attribute for
> the gloss, e.g., <gloss g_type="equ">spot</gloss>.
> (iii) in the EDICT versions, either use the markers to
> generate subsets, e.g. just of direct translations, or add
> a tag like § or † to the text.
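
Steps (ii) and (iii) above might look roughly like this in Python. This is a minimal sketch under stated assumptions: glosses are held as (text, marker) pairs, the marker values "equ" and "expl" are the ones mentioned in this thread, and the function names are hypothetical rather than part of any existing JMdict tooling.

```python
import xml.etree.ElementTree as ET

def gloss_element(text, g_type=None):
    """Build a <gloss> element, adding g_type only when a marker exists."""
    elem = ET.Element("gloss")
    if g_type is not None:
        elem.set("g_type", g_type)
    elem.text = text
    return elem

def render_jmdict_glosses(tagged_glosses):
    """Step (ii): serialise (text, marker) pairs as JMdict-style XML strings."""
    return [
        ET.tostring(gloss_element(text, marker), encoding="unicode")
        for text, marker in tagged_glosses
    ]

def edict_subset(tagged_glosses, wanted="equ"):
    """Step (iii): keep only glosses carrying the wanted marker."""
    return [text for text, marker in tagged_glosses if marker == wanted]

ten = [("spot", "equ"), ("counter for goods and items", "expl")]
xml_lines = render_jmdict_glosses(ten)
direct_only = edict_subset(ten)
```

Keeping the marker outside the gloss text means the same stored data can feed both the XML attribute and the EDICT subsets.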
One question that needs to be answered is whether a gloss
will always have at most a single tag, or whether sometimes
multiple tags might be required (e.g. "idiomatic" and "mt" (see
below) on the same gloss). Since multiple tags are more expensive
in terms of resources and efficiency, it's desirable to not provide
for that if not necessary. However the choice made now is not
cast in stone and can be changed later if need be.
I can't see how it can be kept to just one in the long term. Already
in the multi-lingual versions we have gloss-language and gender.
At the SQL level they could be packed in the one tag and unpacked as
the XML, etc. is generated, if that could work.
For now, I've changed the database schema in the CVS development
sources to support a single gloss tag (since it is a trivial change.)
This and a bunch of other not-yet-committed changes will make it
over to Arakawa real soon now.
> (b) identifying and marking/breaking disjunctive glosses of the
> "A or B" variety. At present 田地 has "rice field or paddy" and
> this could potentially be rewritten "rice field/rice paddy".
>
> As mentioned in the paper, this has been experimented with and some
> good results achieved that could be used to do a semi-automatic
> conversion. In looking over the results, however, a few questions
> arose. While some of the splitting is uncontentious, some of it
> can lead to rather ugly text. For example, 削る has "to shave
> (wood or leather)" which on the screen or paper looks better
> than "to shave (wood); to shave (leather)".
>
> One thing that could be done is to encode these patterns in the
> gloss such that either form could be generated. For example,
> if [[...]] were to encapsulate a set of two or more "or members",
> the (database) entry could be: "to shave ([[wood]][[leather]])".
> From this could be generated dictionary entries which either
> were: "to shave (wood or leather)" where it was aimed at a
> human reader, or "to shave (wood)/to shave (leather)" where it was
> aimed at providing material for an NLP system, MT system, etc.
> The JMdict XML version(s) could do the same, or even maintain
> the markup, e.g.
> <gloss>to shave (<or_str>wood</or_str><or_str>leather</or_str>)</gloss>
>
> (c) marking representative arguments [this was not actually canvassed
> in the paper.] Examples of these are: "pair (of screens or vases, etc.)"
> and "success or failure (in examinations)". For various NLP uses
> it would be good to have these identifiable. Again some form
> of internal markup ({{..}}, <<..>>, \[..\], etc.) could be used
> which could convert to simple parentheses for the human-
> readable forms, and something parseable for other systems would
> be good.
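
To make the `[[...]]` idea in (b) concrete, here is a rough Python sketch of generating both output forms from one stored gloss. The function names are hypothetical, and the sketch assumes that "or members" are written as adjacent `[[...]]` groups with no nesting.

```python
import re

# A run of one or more adjacent [[...]] groups, e.g. "[[wood]][[leather]]".
OR_GROUP = re.compile(r"(?:\[\[[^\[\]]*\]\])+")
# A single [[...]] member; the capture group is the member text.
MEMBER = re.compile(r"\[\[([^\[\]]*)\]\]")

def human_form(gloss):
    """Render each run of members as 'a or b' for human readers."""
    def join(match):
        return " or ".join(MEMBER.findall(match.group(0)))
    return OR_GROUP.sub(join, gloss)

def split_forms(gloss):
    """Expand the gloss into one plain gloss per member, for NLP/MT use."""
    match = OR_GROUP.search(gloss)
    if match is None:
        return [gloss]
    results = []
    for member in MEMBER.findall(match.group(0)):
        expanded = gloss[:match.start()] + member + gloss[match.end():]
        # Recurse in case the gloss holds more than one run of members.
        results.extend(split_forms(expanded))
    return results

stored = "to shave ([[wood]][[leather]])"
for_people = human_form(stored)
for_machines = split_forms(stored)
```

With three or more members this naive join gives "a or b or c", so a production version would probably want commas; representative arguments as in (c) would need a separate marker.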
Another alternative might be just to represent the information
explicitly (with tagged glosses):
to shave (wood)
to shave (leather)
to shave (wood or leather) [human]
or (better?):
to shave (wood) [mt]
to shave (leather) [mt]
to shave (wood or leather)
Using encodings within the glosses (in the database) is problematical
(IMO) because although apps accessing the database will understand
them, the database itself won't, which may result in some tasks, which
could otherwise be done completely within the database (with queries),
to require the help of an external app.
A good point, but it's worth considering that the database is really
a source of data for other apps, and the generation of that data
would probably result in either the [mt] versions or the [human] versions
(where they are delineated) but not both.
I also wonder if there might arise other cases where simpler or more
orthogonal glosses are desirable for NLP/MT but which aren't simple
disjunctions.
Indeed there might, and for things like marking representative
arguments, I don't think there's an alternative to embedding them
in the text somehow.
The above proposal is not ideal either though. Besides the obvious
duplication of information and danger of inconsistency, if there were
other glosses in addition to the three shown above, there is no explicit
information that the first two and third glosses are mutually redundant
(and the additional ones presumably not). Not sure if this will be a real
problem in practice though.
I don't think there is a really clean solution, but
I think it will come down to which is easier to enter and maintain.
The "to shave ([[wood]][[leather]])" form certainly avoids duplication but
you may need to do a course in writing them. OTOH, we could envisage
just letting people put in glosses, and having a bot sniff them over
and propose amendments. That'd be a nice NLP application.
Thanks again.
Jim
--
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/