Text parsing funnies



We've been discussing the proposed term "飲み足りない". One
of the tests I apply is to see whether WWWJDIC's "Text Glossing"
functions make sense of a compound term like that.

The "traditional" glossing function works fine (summary):
飲み  【のみ】 (n) (1) (abbr) drink; drinking; ....
足りない 【たりない】 (adj-i) (1) insufficient; not enough; lacking; ....

When I try the alternative glossing function on edrdg, which uses
MeCab/Unidic for parsing instead of the usual greedy dictionary
string-matching, I get:
飲み足り 【ノミタリル】 Unknown morpheme - possible new entry
ない (aux-v) (1) not; (suf,adj-i) (2) emphatic suffix

which is (a) interesting and (b) not a lot of use. 飲み足り is
both fairly common (26477 in the Google n-grams) and not
lexicalized. It sort-of shows the perils of using sophisticated parsing
when text glossing, and is one of the reasons I've left the MeCab/Unidic
option in testing mode.
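
For anyone unfamiliar with the difference, here is a rough sketch of the
two lookup styles. It is not the WWWJDIC code: the dictionary is a toy
one abbreviated from the glosses quoted above, the function names are
made up for illustration, and the real glossing function does more than
this. The second function does not call MeCab; it just takes the
segmentation reported above as given.

```python
# Toy dictionary (hypothetical), abbreviated from the glosses quoted above.
TOY_DICT = {
    "飲み": "のみ (n) drink; drinking",
    "足りない": "たりない (adj-i) insufficient; not enough; lacking",
    "ない": "(aux-v) not",
}


def greedy_gloss(text, dictionary):
    """Scan left to right, taking the longest dictionary key at each position."""
    max_len = max(len(key) for key in dictionary)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in dictionary:
                result.append((chunk, dictionary[chunk]))
                i += length
                break
        else:
            # Nothing in the dictionary starts here: pass the character through.
            result.append((text[i], None))
            i += 1
    return result


def gloss_tokens(tokens, dictionary):
    """Look up tokens that some other parser has already segmented."""
    return [(token, dictionary.get(token)) for token in tokens]


if __name__ == "__main__":
    print(greedy_gloss("飲み足りない", TOY_DICT))
    # Segmentation as reported for the MeCab/Unidic option above.
    print(gloss_tokens(["飲み足り", "ない"], TOY_DICT))
```

With the greedy scan the string splits into 飲み and 足りない, both of
which are entries. With the pre-segmented tokens, 飲み足り has no entry
to match, so only ない gets a gloss.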

Jim



-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/