
N-gram puzzles

To: edict-jmdict <edict-jmdict@***************>
Subject: N-gram puzzles
From: Jim Breen <jimbreen@*********>
Date: Wed, 15 Mar 2017 21:50:16 +1100

An occasional contributor (a 二世 lawyer in the US) sent me a proposal
for two entries: 公の秩序 and 善良の風俗, which are related to the
existing 公序 and 公序良俗 entries. Their counts in the Kyoto/Melb
n-grams are:

公序         10632
公序良俗     10143
公の秩序       752
善良の風俗     443

That's all pretty clear, and it's plain that most of the n-gram counts for
公序 are coming from the yoji 公序良俗.
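
As a rough check on that reading, here is a small sketch of the
arithmetic using the two Kyoto/Melb counts above. It assumes that every
occurrence of 公序良俗 also adds one to the 公序 count; the function name
is made up for illustration.

```python
def yoji_share(count_short, count_yoji):
    """Return (residual, share) for a short form and the compound containing it.

    Assumption: each occurrence of the compound also contributes exactly
    one count to the short form.
    """
    residual = count_short - count_yoji   # occurrences outside the compound
    share = count_yoji / count_short      # fraction explained by the compound
    return residual, share


# Kyoto/Melb counts quoted in this message
residual, share = yoji_share(10632, 10143)
print(residual, f"{share:.1%}")
```

On that assumption only a few hundred of the 公序 counts come from
anywhere other than the yoji.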

I happened to look at the counts in the (not public) Google n-grams, and
saw:

公序          7391
公序良俗    783669
公の秩序     36586
善良の風俗   11976

My first thought was that the counts were screwed up. Then I realised
why the two n-gram sets had such different counts. The Kyoto/Melb
ones were built from text segmented with the Unidic lexicon, which has
公序 and 良俗 as distinct morphemes. Thus MeCab treats 公序良俗 as
公序 + 良俗, and every occurrence of the yoji also counts towards 公序.
OTOH the Google n-grams were done using the older IPADIC lexicon,
which takes a broader view of what counts as a morpheme. I checked and
sure enough it has 公序良俗 as a single morpheme, so in that case MeCab
segments it that way and the yoji adds nothing to the 公序 count.
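
To make the effect concrete, here is a toy sketch of how the lexicon
changes the counts. It is not MeCab and does not use the real Unidic or
IPADIC data: the segmenter is a plain greedy longest-match, and the two
lexicons and the three-sentence corpus are invented for illustration.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation (a stand-in, not MeCab's algorithm).

    Characters not covered by any lexicon entry become one-character tokens.
    """
    max_len = max(len(w) for w in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                tokens.append(piece)
                i += n
                break
    return tokens


def count_sequence(corpus, lexicon, seq):
    """Count how often the token sequence `seq` occurs in the segmented corpus."""
    seq = list(seq)
    total = 0
    for sentence in corpus:
        toks = segment(sentence, lexicon)
        total += sum(
            toks[i:i + len(seq)] == seq
            for i in range(len(toks) - len(seq) + 1)
        )
    return total


# Hypothetical lexicons: FINE splits the yoji, COARSE lists it as one entry.
FINE = {"公序", "良俗", "に", "反する", "を", "乱す"}
COARSE = FINE | {"公序良俗"}

# Invented mini-corpus: two uses of the yoji, one of 公序 on its own.
CORPUS = ["公序良俗に反する", "公序良俗に反する", "公序を乱す"]

for name, lex in (("fine", FINE), ("coarse", COARSE)):
    print(name,
          "公序:", count_sequence(CORPUS, lex, ["公序"]),
          "公序+良俗:", count_sequence(CORPUS, lex, ["公序", "良俗"]),
          "公序良俗:", count_sequence(CORPUS, lex, ["公序良俗"]))
```

With the fine-grained lexicon the yoji shows up as a 2-gram and inflates
the 公序 1-gram; with the coarse one it is a 1-gram of its own and 公序
only counts the standalone uses.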

A bit of a trap. I far prefer Unidic for this reason.

Cheers

Jim



-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
