
Text parsing funnies

To: edict-jmdict <edict-jmdict@***************>
Subject: Text parsing funnies
From: Jim Breen <jimbreen@*********>
Date: Sat, 5 Oct 2019 10:22:45 +1000

We've been discussing the proposed term "飲み足りない". One
of the tests I apply is to see whether WWWJDIC's "Text Glossing"
function makes sense of a compound term like that.

The "traditional" glossing function works fine (summary):
飲み  【のみ】 (n) (1) (abbr) drink; drinking; ....
足りない 【たりない】 (adj-i) (1) insufficient; not enough; lacking; ....

When I try the alternative glossing function on edrdg, which uses
MeCab/Unidic for parsing instead of the usual greedy dictionary
string-matching, I get:
飲み足り 【ノミタリル】 Unknown morpheme - possible new entry
ない (aux-v) (1) not; (suf,adj-i) (2) emphatic suffix

which is (a) interesting and (b) not a lot of use. 飲み足り is
both fairly common (26477 in the Google n-grams) and not
lexicalized. It sort-of shows the perils of using sophisticated parsing
when text glossing, and is one of the reasons I've left the MeCab/Unidic
option in testing mode.
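
For anyone unfamiliar with the "greedy dictionary string-matching" approach, here is a minimal sketch of the idea. It is not WWWJDIC's actual code: the tiny LEXICON, the greedy_gloss name and the max_len cut-off are all made up for illustration, and it does no de-inflection at all. It only shows why a longest-match scan splits the term into two known entries.

```python
# Toy lexicon: a hypothetical stand-in for a real dictionary such as JMdict.
LEXICON = {
    "飲み": "(n) drink; drinking",
    "足りない": "(adj-i) insufficient; not enough; lacking",
    "ない": "(aux-v) not",
}


def greedy_gloss(text, lexicon, max_len=8):
    """Scan left to right, taking the longest lexicon match at each position.

    Returns a list of (surface, gloss) pairs; gloss is None for a
    character that starts no lexicon entry.
    """
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in lexicon:
                out.append((chunk, lexicon[chunk]))
                i += length
                break
        else:
            # Nothing matched here: emit the single character unglossed.
            out.append((text[i], None))
            i += 1
    return out


for surface, gloss in greedy_gloss("飲み足りない", LEXICON):
    print(surface, gloss)
```

With this toy lexicon the scan takes 飲み first and then 足りない, because nothing longer starting at 飲 is in the lexicon. A morphological analyser makes its segmentation decision before any dictionary lookup, which is how it can end up with a unit that has no entry.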

Jim



-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/

Follow-Ups:
- Re: [edict-jmdict] Text parsing funnies
  - From: "Adam Nohejl" <adam@***********>
