
Re: [edict-jmdict] Text parsing funnies



Sorry, I probably didn't get the point. Did you just mean that, from the point of view of someone trying to understand the meaning of the phrase, 飲み足り+ない is a less useful segmentation than 飲み+足りない?

In my experience MeCab/Unidic is mostly consistent (compared to Ipadic) in segmenting compound verbs. Practically all lexical compound verbs are included in the dictionary (as "morphemes") and thus not segmented as V1+V2. Although perhaps not in every case, it seems to me like a more sensible approach from the semantic point of view.

Are you referring to masu-stems of the whole compound verbs or just V1s in "[Unidic] doesn't always include the masu-stems that way, particularly of 複合動詞"? AFAICT masu-stems (as well as irrealis/未然形/"nai-stems") are handled just fine by Unidic.

(Thanks for the link - I remember having read the paper, but forgot you are one of its authors :-).)

Best regards,

Adam

On 2019/10/06 at 13:34, Jim Breen jimbreen@gmail.com [edict-jmdict] <edict-jmdict@yahoogroups.com> wrote:

There's no "misusing" of the MeCab/Unidic output. The parse/gloss function using MeCab/Unidic works fine most of the time. Had 飲み足りる been in JMdict, it would have identified it OK.

It's a little odd that Unidic has 飲み足り included as a morpheme when it doesn't always include the masu-stems that way, particularly of 複合動詞. I guess it's because 飲み足りない is often encountered whereas 飲み足りる is rather rare.

I've done a fair bit of quantitative analysis of 複合動詞 over the years, and I agree that 足りる is not a productive V2. There's a paper from about 10 years ago at: http://www.edrdg.org/~jwb/paperdir/jcv.pdf and it shows the most productive V1s and V2s. The list of approx. 64k real and potential verbs identified (available from the ACL repository) has 飲み足りる at about no. 14,000.

Cheers

Jim

On Sat, 5 Oct 2019 at 21:20, 'Adam Nohejl' adam@nohejl.name [edict-jmdict] <edict-jmdict@yahoogroups.com> wrote:



Hi,

I've done some quantitative research on compound verbs (and also struggled with morphological analysis), so I may have some experience that could help:

As for the analysis 飲み足り+ない (MeCab/Unidic): It seems that the tool you are using (I cannot see the option for MeCab at the WWWJDIC web site) is misusing MeCab's output. If you want to feed the output into translation via a dictionary, you should be looking at the lemma (基本形), not the surface form/token (表層形), which would give you 飲み足りる+ない. (It's not going to help in this particular case since JMdict does not have an entry for 飲み足りる, but searching a dictionary for surface forms instead of lemmas just doesn't make much sense.) You can have a look at the full output of MeCab/Unidic online here:

http://chamame.ninjal.ac.jp/chamamebin/webchamame.php
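
The lemma-versus-surface-form point can be illustrated with a minimal Python sketch. It does not call MeCab: the token list is typed in by hand as an assumed analysis of 満ち足りない, and `Token`, `TOY_DICT`, `lookup` and the glosses are hypothetical names and data invented for illustration, not JMdict content or any tool's actual API.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    surface: str  # 表層形: the string as it appears in the text
    lemma: str    # 基本形: the dictionary form


# Hand-typed stand-in for analyzer output (assumption, not real MeCab output).
TOKENS = [Token("満ち足り", "満ち足りる"), Token("ない", "ない")]

# Toy dictionary keyed by dictionary form; the glosses are illustrative only.
TOY_DICT = {"満ち足りる": "to be content", "ない": "not"}


def lookup(tokens: list[Token], dictionary: dict[str, str],
           use_lemma: bool) -> list[tuple[str, Optional[str]]]:
    """Return (search key, gloss or None) for each token."""
    results = []
    for tok in tokens:
        key = tok.lemma if use_lemma else tok.surface
        results.append((key, dictionary.get(key)))
    return results


by_surface = lookup(TOKENS, TOY_DICT, use_lemma=False)
by_lemma = lookup(TOKENS, TOY_DICT, use_lemma=True)

for key, gloss in by_surface:
    print("surface:", key, "->", gloss or "no entry")
for key, gloss in by_lemma:
    print("lemma:  ", key, "->", gloss or "no entry")
```

In this toy setup the inflected surface form 満ち足り finds nothing, while its lemma 満ち足りる does; a lemma that is missing from the dictionary, as 飲み足りる is from JMdict, would still come back empty either way.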

V+足りる is not a very productive pattern (and usually occurs in negative phrases), so including 飲み足りない as a separate ("expression") entry may be a good idea. 広辞苑, for instance, has a phrase (成句) sub-entry 食い足りない in the 食う entry. The thing is that "食い足りない" has a figurative meaning "物足りない" too, which I guess is the reason for listing the phrase. (飲み足りない seems to have only a literal meaning.)

FYI, here are the token counts of V(連用形)+"足りる" in BCCWJ (which shows how unproductive it is). It's worth noting that MeCab/Unidic (which is used to analyze BCCWJ) recognizes all of them as a single unit (what BCCWJ calls SUW). "言い足りない" seems like another good candidate for addition if you decide to add "飲み足りない".

満ち足りる 150
飽き足りる 25
言い足りる 12
飲み足りる 12
遊び足りる 5
書き足りる 5
食べ足りる 5
寝足りる 5
食い足りる 4
話し足りる 4
洗い足りる 3
眠り足りる 2
焼き足りる 2
暴れ足りる 1
謝り足りる 1
歩き足りる 1
生み足りる 1
聞き足りる 1
切り足りる 1
喋り足りる 1
嘗め足りる 1
憎み足りる 1
煮足りる 1
見足りる 1

Only the following two seem to be analyzed as two morphological units by MeCab/Unidic:
しゃべり足りる 1 (but see 喋り足りる above)
取れ足りる 1 (occurs in "陰陽のバランスがとれ足りない", IMHO "should" be とり足りない.)
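
For a quick sense of scale, the single-unit counts listed above can be tallied with a short Python sketch. The figures are copied from the list as given; the variable names are made up for illustration, and the two split cases (しゃべり足りる, 取れ足りる) are left out.

```python
# Token counts of V(連用形)+足りる analyzed as a single unit, copied from the list above.
COUNTS = {
    "満ち足りる": 150, "飽き足りる": 25, "言い足りる": 12, "飲み足りる": 12,
    "遊び足りる": 5, "書き足りる": 5, "食べ足りる": 5, "寝足りる": 5,
    "食い足りる": 4, "話し足りる": 4, "洗い足りる": 3, "眠り足りる": 2,
    "焼き足りる": 2, "暴れ足りる": 1, "謝り足りる": 1, "歩き足りる": 1,
    "生み足りる": 1, "聞き足りる": 1, "切り足りる": 1, "喋り足りる": 1,
    "嘗め足りる": 1, "憎み足りる": 1, "煮足りる": 1, "見足りる": 1,
}

total_tokens = sum(COUNTS.values())                    # all occurrences
verb_types = len(COUNTS)                               # distinct compound verbs
singletons = sum(1 for n in COUNTS.values() if n == 1) # verbs seen only once
top_share = COUNTS["満ち足りる"] / total_tokens         # share of the most frequent verb

print(f"{verb_types} types, {total_tokens} tokens, {singletons} occurring once")
print(f"満ち足りる alone accounts for {top_share:.0%} of the tokens")
```

The tally comes to 24 verb types and 245 tokens, with 満ち足りる alone making up well over half of them.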

You can search for the concordances of the verbs/phrases above here: http://nlb.ninjal.ac.jp/search/ (The numbers are a little off: maybe a different version/analysis of the corpus.)

Best regards,

--
Adam Nohejl

On 5 Oct 2019, at 2:22, Jim Breen jimbreen@gmail.com [edict-jmdict] wrote:



We've been discussing the proposed term "飲み足りない". One
of the tests I apply is to see whether WWWJDIC's "Text Glossing"
function makes sense of a compound term like that.

The "traditional" glossing function works fine (summary):
飲み 【のみ】 (n) (1) (abbr) drink; drinking; ....
足りない 【たりない】 (adj-i) (1) insufficient; not enough; lacking; ....

When I try the alternative glossing function on edrdg, which uses
MeCab/Unidic for parsing instead of the usual greedy dictionary
string-matching, I get:
飲み足り 【ノミタリル】 Unknown morpheme - possible new entry
ない (aux-v) (1) not; (suf,adj-i) (2) emphatic suffix

which is (a) interesting and (b) not a lot of use. 飲み足り is
both fairly common (26477 in the Google n-grams) and not
lexicalized. It sort-of shows the perils of using sophisticated parsing
when text glossing, and is one of the reasons I've left the MeCab/Unidic
option in testing mode.
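
For contrast, "greedy dictionary string-matching" can be sketched in a few lines of Python. This is only a generic longest-match-first segmenter under assumed data, not WWWJDIC's actual implementation; `TOY_DICT` and `greedy_segment` are hypothetical names, and the glosses are abbreviated from the output quoted above.

```python
# Toy dictionary; glosses abbreviated from the "traditional" glossing output above.
TOY_DICT = {
    "飲み": "drink; drinking",
    "足りない": "insufficient; not enough; lacking",
    "ない": "not",
}


def greedy_segment(text: str, dictionary: dict[str, str]) -> list[str]:
    """Split text by always taking the longest dictionary match at each position."""
    max_len = max(len(word) for word in dictionary)
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in dictionary:
                segments.append(piece)
                i += n
                break
        else:
            # No entry starts here: pass the character through unmatched.
            segments.append(text[i])
            i += 1
    return segments


segments = greedy_segment("飲み足りない", TOY_DICT)
for seg in segments:
    print(seg, "->", TOY_DICT.get(seg, "no entry"))
```

With this toy dictionary the matcher yields 飲み + 足りない, the same split as the traditional glossing function, because it only ever sees strings that are already dictionary entries.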

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/




