
Re: [edict-jmdict] Nihon vs. Nippon, Nihongo vs. Nippongo



On 31 March 2014 11:44, Darren Cook <darren@dcook.org> wrote:
I (Jim) wrote:
>> As for 日本語, if we look for にほんご and にっぽんご in the Google n-gram
>> corpus, i.e. a collection of Japanese words taken from the WWW along
>> with their counts, we see:
>>
>> にほんご 397256    にっぽんご 21818
>>
>> That demonstrates that both are used *by Japanese people* and that にほ
>> んご is much more common.

> To be more precise, it means Google found the string of hiragana
> characters "にっぽんご" and decided to put those characters together as
> a word 21,818 times. (?)

Not really. The Google n-gram corpus is drawn from a dump of all
Japanese WWW pages in mid-2007. The text was extracted, and
strings which passed various filters (e.g. they had to have a minimum
length, had to have a proportion of kana, etc.) were segmented using
MeCab/IPADIC, and the sequences of grams (i.e. morphemes) were counted.
See:  http://catalog.ldc.upenn.edu/LDC2009T08 and
http://catalog.ldc.upenn.edu/docs/LDC2009T08/README.utf8.english

The segmentation of sentences containing にっぽんご would probably
have resulted in にっぽん + ご as separate morphemes, and my lookup
system counts the adjacent pairs.
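
To illustrate what counting adjacent pairs means, here is a minimal
sketch. It assumes the text has already been segmented into morphemes;
the toy sentences and the function name are hypothetical, not taken
from the corpus or from the actual lookup system.

```python
from collections import Counter

def count_adjacent_pairs(segmented_sentences, first, second):
    """Count how often morpheme `first` is immediately followed by `second`."""
    bigrams = Counter()
    for morphemes in segmented_sentences:
        # zip pairs each morpheme with its right-hand neighbour
        bigrams.update(zip(morphemes, morphemes[1:]))
    return bigrams[(first, second)]

# Toy, hand-segmented input (hypothetical; not real corpus data).
toy_sentences = [
    ["にっぽん", "ご", "を", "まなぶ"],
    ["にっぽん", "の", "ご", "はん"],   # not adjacent, so not counted
    ["にっぽん", "ご", "の", "ほん"],
]

pair_count = count_adjacent_pairs(toy_sentences, "にっぽん", "ご")
print(pair_count)
```

Only sentences where ご directly follows にっぽん contribute to the
count, which is why the second toy sentence is ignored.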

> I wonder if practically all of those 21,800 hits are referring to the
> textbook:
>   http://ja.wikipedia.org/wiki/にっぽんご

Well, looking through the first 30 hits for にっぽんご on the WWW I see
that a few are for that text, but by no means all.

> Sometimes I search for a Japanese word on Google, and all the pages I
> get appear to be dictionaries! Sometimes I find a few blog posts,
> amongst them, and in the cases where the word I'm actually searching for
> was a typo, the characters are actually split across two words in the
> blog post.

I see that too. The process used in assembling the n-gram corpus is
quite different, and it doesn't bridge strings across punctuation and
whitespace.
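
A rough sketch of the difference, assuming an illustrative boundary set
(whitespace plus a few common punctuation marks). The real rules used
for the corpus may well differ, and both function names are
hypothetical.

```python
import re

# Assumed boundary characters, for illustration only.
BOUNDARY = re.compile(r"[\s、。！？!?,.「」]+")

def naive_match(text, target):
    """Match after deleting whitespace, so a hit can span two words."""
    return target in re.sub(r"\s+", "", text)

def match_within_span(text, target):
    """Match only inside stretches unbroken by punctuation or whitespace."""
    spans = [s for s in BOUNDARY.split(text) if s]
    return any(target in s for s in spans)

split_text = "にっぽん ごはん"        # the target would straddle the space
whole_text = "にっぽんごをまなぶ。"   # the target sits inside one span

print(naive_match(split_text, "にっぽんご"))        # bridges the gap
print(match_within_span(split_text, "にっぽんご"))  # does not
print(match_within_span(whole_text, "にっぽんご"))
```

The first function behaves like the search results described above,
where a hit can be assembled from two neighbouring words; the second
only finds the string when it occurs unbroken.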

Cheers

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University