G'day,
thanks for the discussion.
>> Jeroen: would you be interested in maybe writing some draft
>> guidelines on how we want the final entries to be?
>>
>> E.g. for the following two entries: 勉強 and 勉強する
>> should they both xref each other? should they share senses?
>>
> I'm pretty sure that Jim's plan is just to separate the (n) and (vs)
> into different senses in the same entry, as was done recently
> for 敗北:
>
> 敗北 【はいぼく】 (n) (1) defeat; (vs,vi) (2) to be
> defeated; (P) [V][Ex][G][GI][S][A][W]
I think that misses the fact that there is some kind of connection
between the "defeat" and "to be defeated" senses.
Compare 勉強, where the three senses are completely different.
勉強 【べんきょう】 (n,vs) (1) study; (2) diligence;
  (3) discount; reduction;
I would like to explicitly note the defeat/be defeated connection.
Either in sub senses:
敗北 【はいぼく】 (1.1) [n] defeat; (1.2) [vs,vi] to be defeated;
勉強 【べんきょう】 (1) [n,vs] study; (2.1) [n] diligence;
  (2.2) [vs] be diligent; (3.1) [n] discount; reduction;
  (3.2) [vs] discount; reduce
or as two entries:
敗北 【はいぼく】 (1) [n] defeat; [link to 敗北する]
敗北する 【はいぼくする】 (1) [vs,vi] to be defeated; [link to 敗北]
勉強 【べんきょう】 (1) [n] study; (2) [n] diligence;
  (3) [n] discount; reduction; [link to 勉強する]
勉強する 【べんきょうする】 (1) [vs] study; (2) [vs] be diligent;
  (3) [vs] discount; reduce; [link to 勉強]
[BTW I think that 勉強 <-> diligence is not actually a good
entry.]
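To make the two-entry option concrete, here is a rough Python sketch of
what I have in mind. The class names (Sense, Entry) and the link() helper
are made up for illustration and are not part of JMDict; the point is only
that each entry keeps its own senses and that the link is stored on both
sides.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    pos: tuple      # part-of-speech tags, e.g. ("vs", "vi")
    glosses: tuple  # English glosses for this sense


@dataclass
class Entry:
    headword: str
    reading: str
    senses: list = field(default_factory=list)
    links: set = field(default_factory=set)  # headwords of linked entries


def link(a, b):
    """Record a symmetric cross-link between two entries (hypothetical helper)."""
    a.links.add(b.headword)
    b.links.add(a.headword)


def make_pair():
    """Build the 敗北 / 敗北する pair from the example above."""
    noun = Entry("敗北", "はいぼく", [Sense(("n",), ("defeat",))])
    verb = Entry("敗北する", "はいぼくする",
                 [Sense(("vs", "vi"), ("to be defeated",))])
    link(noun, verb)
    return noun, verb
```

Storing the link in both directions is redundant, but it means a lookup
starting from either form can reach the other without a separate index.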
I realize that either of these approaches adds some (a lot of)
redundancy, but I don't really mind. As long as verb-noun derivations
are not fully predictable, I think we should include both forms, and
link them to show the redundancy. On the other hand, it's only
electrons....
The thing I care most about is that someone looking for "destroy"
should find 絶滅(する), and currently that is not
possible.
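A small Python sketch of why the lookup fails today. The glosses for 絶滅
below are illustrative rather than copied from the file, and
build_english_index() is a name I made up: an English-to-Japanese index
built from noun glosses alone never gets a "destroy" key, while adding a
する entry with a verb gloss makes it findable.

```python
from collections import defaultdict


def build_english_index(entries):
    """Map each English gloss (minus a leading "to ") to its headwords."""
    index = defaultdict(list)
    for headword, glosses in entries:
        for gloss in glosses:
            key = gloss.lower()
            if key.startswith("to "):
                key = key[3:]
            index[key].append(headword)
    return dict(index)


# Illustrative data only (assumed glosses, not the real entry).
noun_only = [("絶滅", ["destruction", "extinction"])]
with_verb = noun_only + [("絶滅する", ["to destroy"])]

print("destroy" in build_english_index(noun_only))
print(build_english_index(with_verb).get("destroy"))
```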
>> I should be able to start an automatic mapping soon, and will then
>> ask for feedback as to how we can make it better.
>>
>> My basic plan is:
>> (a) look up all (n, vs)
>>
>> (b) for each English translation try to convert noun->verb (using
>> wordnet derivation links)
>> http://wordnet.princeton.edu/perl/webwn?o2=&o0=1&o7=&o5=&o1=1&o6=&o4=&o3=&s=destruction&i=3&h=100000#c
>>
>> (c) validate
>> - search for the verb - verb entries in other lexicons (EDR, ...)
>> - search for the verb - verb entries in other parallel texts
>>
>> If anyone has any ideas about how to improve any of the steps,
>> please let me know.
>>
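To make steps (a) to (c) concrete, here is a minimal Python sketch. It is
not the actual mapping code: NOUN_TO_VERB is a hand-written stand-in for
the WordNet derivation links, KNOWN_PAIRS is a stand-in for the other
lexicons used in validation, and the sample glosses are illustrative.

```python
# Hypothetical stand-in for WordNet noun -> verb derivation links.
NOUN_TO_VERB = {
    "destruction": ["destroy"],
    "study": ["study"],
    "defeat": ["defeat"],
}

# Hypothetical stand-in for (Japanese verb, English verb) pairs
# attested in some other lexicon or parallel text.
KNOWN_PAIRS = {("勉強する", "study"), ("絶滅する", "destroy")}


def propose_verb_glosses(entries):
    """entries: iterable of (headword, pos_tags, noun_glosses).

    Returns a list of (verb_headword, verb_gloss, validated) proposals.
    """
    proposals = []
    for headword, pos, glosses in entries:
        if not {"n", "vs"} <= set(pos):                # (a) keep only (n,vs)
            continue
        verb_head = headword + "する"
        for noun in glosses:
            for verb in NOUN_TO_VERB.get(noun, []):    # (b) noun -> verb
                validated = (verb_head, verb) in KNOWN_PAIRS  # (c) validate
                proposals.append((verb_head, verb, validated))
    return proposals
```

With the toy data, 敗北 yields the proposal "defeat" for 敗北する, which
fails validation. That is the case where the derivation alone gets the
direction wrong, so unvalidated proposals would still need checking by
hand.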
> Personally, I don't think that automation will work very well, and
> all (vs) entries (11000, I think Jim said) will have to be gone
> through by hand.have to link them.
>
> For example, the reason that 敗北 was given its own (vs) sense
> is that it's unclear from the (n) whether the (vs) means "to defeat"
> or "to be defeated". And there are all other sorts of vagaries that
> have to be dealt with. (How can an automation derive the (vs) form
> of "あっさり - easily; readily; quickly" from the base form?
> Especially since it's mostly used as an (adj-f) as "あっさりした",
> in which case it means more along the lines of "light", "simple",
> "plain", etc.
I think we won't be able to do this one automatically. And I really
think we need two entries for it...
> And that's *the first one on the list* when
> you search for (vs) in Edict.) I really doubt an automated system
> would be able to handle a huge number of these entries, and I think
> it's better to have nothing than to have something incorrect. But if
> it doesn't take you a lot of effort to set up such a system, I would
> like to see what it can do.
A previous attempt found matches for about 10,000, and we found them
useful in our MT research.
> I'm pretty sure I'm alone on this, but I don't think (vs) should even
> be given their own senses. A properly written (n) should make it
> painfully obvious what the (vs) is. In 敗北 above, the noun
> should be "defeat (i.e. being defeated)". This also helps people
> know they can't use it in a phrase like "Napoleon's defeat of the
> Habsburgs". あっさり doesn't work because it's missing a
> sense. (The fact that a lot of entries are incomplete doesn't bode
> well for automation either.)
Obvious for a good English speaker looking up the Japanese word,
certainly. I think the goal here is to make it more useful for (a)
non-natives, who don't necessarily know the English noun-verb
derivations, (b) computers (ditto), and (c) people looking up English
to find the Japanese.
> I think the (vs) senses just add redundancy and make things bulky and
> ugly. Having just the noun is so elegant.
Whereas I view the redundancy as a good thing. The more redundant,
the more robust.
If we add enough redundant links, it should be possible to collapse
the entries in the interface: e.g. see that
勉強 【べんきょう】 (1) [n] study;
and
勉強する 【べんきょうする】 (1) [vs] study;
are the same, and then show them together. This could also be
done off-line, to make an elegant version. Unfortunately, I don't
think we can reliably automate it the other way, which is why I want
to add all the information to JMDict proper. In particular, I think
we should add as many cross links as possible (in my ideal world
勉強 <-> 勉強する and study <-> study should both be linked), which
is why I want to discuss the structure we are aiming towards before
rushing off and making the new entries.
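As a rough illustration of the collapsing step, here is a Python sketch.
The dict layout and the collapse() function are assumptions of mine, not
an existing interface: two entries are merged for display only when they
are linked in both directions, the verb headword is the noun plus する,
and the glosses match.

```python
def collapse(noun, verb):
    """Return one combined display line, or None if the pair can't be merged."""
    mutually_linked = (verb["headword"] in noun["links"]
                       and noun["headword"] in verb["links"])
    if not mutually_linked:
        return None
    if verb["headword"] != noun["headword"] + "する":
        return None
    if noun["glosses"] != verb["glosses"]:
        return None  # senses differ, so keep the entries separate
    return "{} 【{}】 (n,vs) {};".format(
        noun["headword"], noun["reading"], "; ".join(noun["glosses"]))


benkyou = {"headword": "勉強", "reading": "べんきょう",
           "glosses": ["study"], "links": {"勉強する"}}
benkyou_suru = {"headword": "勉強する", "reading": "べんきょうする",
                "glosses": ["study"], "links": {"勉強"}}

print(collapse(benkyou, benkyou_suru))
```

A pair like 敗北 / 敗北する, where the glosses differ ("defeat" vs. "to be
defeated"), would come back as None and stay as two entries in the display.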
Yours,
--
Francis Bond <http://www2.nict.go.jp/x/x161/en/member/bond/>
NICT Language Infrastructure Group