Re: [edict-jmdict] English n-gram counts
> Very fast search!
Not too bad given the mass of data.
> ...For users less familiar with the corpus, it might be good to show on the page the total number of ngrams (1-5) so that people can calculate the relative frequency
Good idea. I'll do that.
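
For anyone wanting to do that calculation by hand, here is a minimal sketch. The function name and the idea of passing in the total for the matching n-gram order are assumptions for illustration; the real totals would be whatever the page reports.

```python
def relative_frequency(count, total):
    """Return the share of all n-grams of one order that a given n-gram accounts for.

    count: occurrences of the n-gram of interest
    total: total occurrences of all n-grams of the same order (hypothetical
           parameter; take the value from the search page)
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


def per_million(count, total):
    """Same ratio scaled to occurrences per million n-grams, which is easier to read."""
    return relative_frequency(count, total) * 1_000_000
```

Comparing like with like matters here: a 3-gram count should be divided by the 3-gram total, not by the total across all five orders.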
Unlike the Japanese n-grams, which I merged into a single set so that things
like パンを食べた (a 4-gram) and パンを食べなかった (a 5-gram) would be close
together, I've kept the English ones in their original 1- to 5-gram blocks.
So if you look up "carn the crows" (150 n-grams), it searches only the 3-grams.
The five n-gram files are actually humongous text files. I access each one
through an index file keyed on the first three characters of the n-gram, which
points to where that sequence starts in the text file. I seek to that point
and start reading. All quite compact and fast.
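
The general idea of a three-character prefix index with seek-and-scan can be sketched roughly as follows. This is not the actual code behind the page; it assumes a sorted text file with one "n-gram<TAB>count" entry per line, and it keeps the index in an in-memory dict rather than a separate index file. The names build_index and lookup are made up for the example.

```python
PREFIX_LEN = 3  # index key: the first three characters of the n-gram


def build_index(path):
    """Map each prefix to the byte offset of the first line that starts with it.

    Assumes the file is sorted, so all lines sharing a prefix are contiguous.
    """
    index = {}
    offset = 0
    with open(path, "rb") as f:  # binary mode so offsets are exact byte positions
        for raw in f:
            text = raw.decode("utf-8").partition("\t")[0]
            index.setdefault(text[:PREFIX_LEN], offset)
            offset += len(raw)
    return index


def lookup(path, index, ngram):
    """Seek to the start of the prefix run and read until a match or the run ends."""
    prefix = ngram[:PREFIX_LEN]
    if prefix not in index:
        return None
    with open(path, "rb") as f:
        f.seek(index[prefix])
        for raw in f:
            text, _, count = raw.decode("utf-8").rstrip("\n").partition("\t")
            if text[:PREFIX_LEN] != prefix:
                break  # left the block of lines sharing this prefix
            if text == ngram:
                return int(count)
    return None
```

The index only needs one entry per distinct prefix, so it stays tiny compared with the data, and each lookup reads only the lines that share the query's first three characters.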
> ... (and maybe mention that there is a cut-off: low frequency ngrams will not appear).
Done.
Jim
>
> On Thu, Nov 28, 2019 at 10:39 AM Jim Breen jimbreen@gmail.com [edict-jmdict] <edict-jmdict@yahoogroups.com> wrote:
>>
>>
>>
>> Earlier today I was mentioning the Google English n-gram corpus
>> in the context of finding the frequency of certain phrases. I realised
>> that I'd implemented a system for searching that corpus years ago
>> for my gairaigo segmenter at:
>> http://nlp.cis.unimelb.edu.au/jwb/gairaigo.html
>> but I'd never actually made it more generally available. Here it is:
>>
>> http://nlp.cis.unimelb.edu.au/jwb/engngrams.html
>>
>> Someone may find it useful. (FWIW the actual corpus is about 55 GB.)
>>
>> Jim
>>
>> --
>> Jim Breen
>> Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
>> http://www.jimbreen.org/
>> http://nihongo.monash.edu/
>
>
>
> --
> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
> Division of Linguistics and Multilingual Studies
> Nanyang Technological University
>
>
--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/
------------------------------------
Posted by: Jim Breen <jimbreen@gmail.com>
------------------------------------