
Re: [edict-jmdict] English n-gram counts



> Very fast search!

Not too bad given the mass of data.

> ...For users less familiar with the corpus, it might be good to show on the page the total number of ngrams (1-5) so that people can calculate the relative frequency

Good idea. I'll do that.

Unlike the Japanese n-grams, which I merged into a single set so that things
like パンを食べた (a 4-gram) and パンを食べなかった (a 5-gram) would be close
together, I've kept the English ones in their original 1-5-gram blocks. So if
you look up "carn the crows" (150 n-grams), it only searches the 3-grams.

The five n-gram files are actually humongous text files. I access them through
an index file keyed on the first three characters of the n-gram, where each
entry points to the place in the file where that sequence starts. I seek to
that point and start reading. All quite compact and fast.
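
For anyone curious what that kind of lookup looks like, here is a minimal
sketch in Python. It is not the code behind the page, just an illustration
under some assumptions: each n-gram file is taken to be sorted, with one
"n-gram<TAB>count" entry per line, and the names build_index and lookup are
made up for the example. The index is held in a dict here rather than in a
separate index file.

```python
PREFIX_LEN = 3  # index on the first three characters of each n-gram


def build_index(path):
    """Map each 3-character prefix to the byte offset where its block starts.

    Assumes the file is sorted, so all lines sharing a prefix are contiguous.
    """
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            text = line.decode("utf-8").split("\t", 1)[0]
            # Record only the first offset seen for each prefix.
            index.setdefault(text[:PREFIX_LEN], offset)
    return index


def lookup(path, index, ngram):
    """Return the count stored for ngram, or None if it is not in the file."""
    key = ngram[:PREFIX_LEN]
    if key not in index:
        return None
    with open(path, "rb") as f:
        f.seek(index[key])  # jump straight to the block for this prefix
        for raw in f:
            text, _, count = raw.decode("utf-8").rstrip("\n").partition("\t")
            if not text.startswith(key):
                break  # read past the end of the block without a match
            if text == ngram:
                return int(count)
    return None
```

With a 3-gram file at some hypothetical path, usage would be
index = build_index(path) once, then lookup(path, index, "carn the crows")
per query. Only the block of lines sharing the first three characters is
ever read, which is why the file size hardly matters.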

> ... (and maybe mention that there is a cut-off:  low frequency ngrams will not appear).

Done.

Jim

>
> On Thu, Nov 28, 2019 at 10:39 AM Jim Breen jimbreen@gmail.com [edict-jmdict] <edict-jmdict@yahoogroups.com> wrote:
>>
>>
>>
>> Earlier today I was mentioning the Google English n-gram corpus
>> in the context of finding the frequency of certain phrases. I realised
>> that I'd implemented a system for searching that corpus years ago
>> for my gairaigo segmenter at:
>> http://nlp.cis.unimelb.edu.au/jwb/gairaigo.html
>> but I'd never actually made it more generally available. Here it is:
>>
>> http://nlp.cis.unimelb.edu.au/jwb/engngrams.html
>>
>> Someone may find it useful. (FWIW the actual corpus is about 55Gb.)
>>
>> Jim
>>
>> --
>> Jim Breen
>> Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
>> http://www.jimbreen.org/
>> http://nihongo.monash.edu/
>
>
>
> --
> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
> Division of Linguistics and Multilingual Studies
> Nanyang Technological University
>
> 



-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/


------------------------------------
Posted by: Jim Breen <jimbreen@gmail.com>
------------------------------------

