
Re: [edict-jmdict] "P" Markers - Google as corpus?



> I guess you used the "Japanese language" restriction when you
> ran the searches?  I think many web pages are not tagged with
> the language.

I'm not sure how you are defining 'tagging' here, but the
difference is not that big.

田中さん 1,490,000 (no language restriction)
田中さん 1,360,000 (Japanese language restriction)
田中さん の 1,420,000

> An alternate way would be to restrict the search to .jp domains rather
> than language.  Of course this would result in inclusion of some undesirable
> pages (such as Chinese ones hosted in Japan) but hopefully these would
> be few enough to not affect the results.

You also lose a _lot_ of .com and such pages.

site:.jp 田中さん 1,220,000

As you can see, that is a bigger drop than the one from the
Japanese language restriction.
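
To put numbers on the two drops, here is a minimal sketch that works
out the fraction of hits lost under each restriction.  It uses only the
counts quoted above; the names (relative_drop and the constants) are
just for illustration.

```python
# Hit counts quoted in this message for 田中さん.
UNRESTRICTED = 1_490_000      # no language restriction
JAPANESE_ONLY = 1_360_000     # Japanese language restriction
JP_DOMAIN_ONLY = 1_220_000    # site:.jp restriction


def relative_drop(baseline: int, restricted: int) -> float:
    """Return the fraction of baseline hits lost under a restriction."""
    return (baseline - restricted) / baseline


print(f"language restriction: {relative_drop(UNRESTRICTED, JAPANESE_ONLY):.1%}")
print(f"site:.jp restriction: {relative_drop(UNRESTRICTED, JP_DOMAIN_ONLY):.1%}")
```

That comes to roughly 8.7% lost with the language restriction against
roughly 18.1% with site:.jp.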

In any case, I know from experience that a search not restricted
by language or domain, but including some kana, returns the
highest number of valid hits.  (If you _don't_ include kana,
nasty things happen.)
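
If the searches are being scripted, one way to catch kana-less queries
before they are sent is a check like the following.  This is only a
sketch: contains_kana is a name I made up, and it simply tests for
characters in the Unicode Hiragana and Katakana blocks.

```python
def contains_kana(query: str) -> bool:
    """Return True if the query has at least one hiragana or katakana character."""
    for ch in query:
        code = ord(ch)
        # Hiragana: U+3041..U+309F, Katakana block: U+30A0..U+30FF
        if 0x3041 <= code <= 0x309F or 0x30A0 <= code <= 0x30FF:
            return True
    return False


for query in ["田中さん", "田中"]:
    status = "ok" if contains_kana(query) else "no kana, results may be unreliable"
    print(f"{query}: {status}")
```

A kanji-only query such as 田中 is flagged, while 田中さん passes
because of the さん.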

Regarding headwords found in multiple entries, obviously
some of them should be merged, but I'm sure there are many
validly different words as well.  It is often the case that
one entry will be overwhelmingly more common than the other.
If you can confirm that, then you can effectively load all of
the Google results on to the more common of the two words.
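
As a rough sketch of that last step: once someone has confirmed by hand
which entry dominates, the whole headword count goes to that entry and
the others get zero.  The function name, the entry ids and the count in
the example are all hypothetical, not anything from the actual files.

```python
def load_hits_onto_dominant(total_hits: int,
                            entry_ids: list[str],
                            dominant_id: str) -> dict[str, int]:
    """Assign the whole headword hit count to the entry confirmed as dominant."""
    if dominant_id not in entry_ids:
        raise ValueError(f"{dominant_id!r} is not one of the entries for this headword")
    return {eid: (total_hits if eid == dominant_id else 0) for eid in entry_ids}


# Hypothetical example: two entries sharing one headword, made-up count.
print(load_hits_onto_dominant(50_000, ["entry_A", "entry_B"], "entry_A"))
```

The confirmation itself still has to be done by a person; the code only
records the outcome.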