
Re: [edict-jmdict] aruyouni



I don't know what ちゃせん is, but I'm rather sure that Google has some kind of corpus that it uses to parse input strings into constituent words.

Of course I could be dead wrong about this. This is what Jim and Paul specialize in.


Rene


On 18-Feb-08, at 9:44 PM, Jim Rose wrote:

What you're suggesting, then, is that an unquoted string is run through an analysis such as ChaSen before being run through a database? Otherwise, how would Google know to parse that string into three words?
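
To make the question concrete, here is a toy sketch of how an unquoted string could be split into words before any lookup. It uses greedy longest-match against a tiny lexicon. This is purely illustrative: the `segment` function and the three-word `LEXICON` are made up for this example, and nothing here is claimed to be how ChaSen or Google actually segment text.

```python
# Hypothetical mini-lexicon, just enough for the example string.
LEXICON = {"ある", "よう", "に"}


def segment(text, lexicon):
    """Split text by greedy longest-match against lexicon (toy approach)."""
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon:
                break
        else:
            # No lexicon entry starts here: emit the single character.
            piece = text[i]
        tokens.append(piece)
        i += len(piece)
    return tokens


print(segment("あるように", LEXICON))
```

With this lexicon the string comes out as three separate tokens, which is the kind of split that would let a search engine match the words independently.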





On Feb 18, 2008, at 8:29 AM, René Malenfant wrote:

Well, that comment was directed at Paul, but I just tried what you
suggested and I got the same 140,000,000 result as you.

The [G] link does not appear to be putting the search string in
quotation marks; i.e., it looks for あるように, not "あるように",
so it drastically overestimates the number of hits. (AFAICT, it
searches for any page that has ある、よう and に, but not
necessarily as "あるように". With three such common words, it's
basically returning every Japanese page on the web.)

If there's no technical difficulty preventing it, perhaps the [G]
links should be changed to use quotation marks in their search strings?
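
A minimal sketch of what that change might look like, assuming the [G] link is built by URL-encoding the headword into a Google search URL. The function name `google_search_url`, the `exact_phrase` flag, and the `https://www.google.com/search?q=` form are assumptions for illustration; how the links are actually generated is not shown in this thread.

```python
from urllib.parse import urlencode


def google_search_url(term, exact_phrase=True):
    """Build a search URL, optionally wrapping the term in double quotes."""
    # Quoting asks for the exact phrase instead of the separate words.
    query = f'"{term}"' if exact_phrase else term
    # urlencode percent-encodes the quotes (%22) and the UTF-8 bytes of the kana.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(google_search_url("あるように"))
print(google_search_url("あるように", exact_phrase=False))
```

The only difference between the two URLs is the pair of `%22` characters around the encoded term, so the change itself should be small if the links are generated along these lines.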

Rene