
"P" Markers - Google as corpus?



I think I heard talk about this here before, but as an
experiment I thought I would harvest the frequency
information out of a search engine (people seem to use
Google a lot here) and use that to rate how "common" words
are.  

I've taken the liberty of grabbing an edict.gz file and
stripping everything but the readings from it, while keeping
something to delineate that some words are just different
"readings" of the same word (probably this would be very
easy for you to do, Jim). I put quotation marks around each
word and ran it through this site's frequency checker:

http://www.linguistics.ucla.edu/people/hayes/QueryGoogle/
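For anyone who wants to reproduce the preprocessing, here is
a minimal Python sketch of the step described above. It is not
the exact script I used. It assumes each EDICT entry looks like
"KANJI [KANA] /gloss/..." or "KANA /gloss/..." and that the
file is EUC-JP encoded; the function names and the tab used as
the "same word" delimiter are made up for illustration.

```python
import gzip
import re

# Assumed EDICT entry shapes (an assumption, not checked against every line):
#   KANJI [KANA] /gloss/gloss/
#   KANA /gloss/
ENTRY = re.compile(r"^(\S+)\s+(?:\[([^\]]+)\]\s+)?/")


def quoted_forms(line):
    """Return the quoted written form and reading of one entry, or []."""
    m = ENTRY.match(line)
    if m is None:
        return []
    headword, reading = m.group(1), m.group(2)
    forms = [headword] if reading is None else [headword, reading]
    # Quotation marks make the search engine treat each form as a phrase.
    return ['"%s"' % form for form in forms]


def to_query_lines(lines):
    """One output line per entry; a tab joins forms of the same word."""
    out = []
    for line in lines:
        forms = quoted_forms(line)
        if forms:
            out.append("\t".join(forms))
    return out


def read_edict(path):
    """Read a gzipped EDICT file, assuming EUC-JP encoding."""
    with gzip.open(path, "rt", encoding="euc-jp", errors="replace") as f:
        return f.read().splitlines()
```

Something like to_query_lines(read_edict("edict.gz")) would
then give the list to feed to the frequency checker, with the
glosses dropped and the forms of each word kept on one line.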

The only problem is I got around 20% done before Google
blocked my automated queries for today (on this computer
at least) -- they are probably trying to shut down email
harvesters and such things.  I went to their help page and
they said to contact them before doing automatic searches,
then proceeded to leave nothing but a paper-mail address
to contact them by. 

Anyways, if someone wants to help me out (do a part of the
file to speed things up) please contact me, and/or if Jim
or anyone else is interested in the results please speak
up, and I'll post them somewhere when I'm finished.  
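
If several people do pitch in, the remaining word list could
be divided into contiguous parts roughly like this. This is
only a sketch: split_into_parts is a hypothetical helper, and
the number of parts would depend on how many people volunteer.

```python
def split_into_parts(items, n_parts):
    """Split items into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks each take one leftover item.
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts
```

Keeping the parts contiguous means the results can simply be
concatenated back together in the original order.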

-Kale