
Re: [edict-jmdict] [rare] tag for obscure kanji?

G'day Darren,
On Thu, 14 Mar 2019 at 18:35, Darren Cook darren@dcook.org
[edict-jmdict] <edict-jmdict@yahoogroups.com> wrote:

> > Google N-grams:
> > 眼鏡    2228370
> > めがね  1206845
> > メガネ  3949144
> Can someone remind me where those n-grams are, and their license?

They're from the LDC at UPenn. See: https://catalog.ldc.upenn.edu/LDC2009T08
I got access to it via the Univ. of Melbourne's LDC subscription.

The other set of n-grams, which I call the Kyoto/Melbourne N-gram Corpus,
was built from the Kyoto corpus (500M sentences).

From both of these corpora I generated lists of n-gram sequences and counts,
then sorted and merged them and set up ISAM-like indices. So the counts for
静かな came from the 2-gram "静か+な".

The UPenn/LDC licence is rather restrictive(*), so I've been a bit wary
about making the Google version widely known. The Kyoto people were more
relaxed. The public interface is via my CGI programs:
Google: http://nlp.cis.unimelb.edu.au/jwb/ngramcounts.html
Kyoto: http://nlp.cis.unimelb.edu.au/jwb/ngramcountswww.html
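
To illustrate the sort/merge/index step described above, here is a minimal
in-memory sketch. It is not the code behind the CGI programs: the sample
records, their counts, and the names build_index and lookup are all made up
for illustration, and a sorted list with binary search stands in for the
on-disk ISAM-like indices.

```python
import bisect

# Hypothetical sample records: (token sequence, count). The counts are
# invented for illustration and are not real corpus figures.
raw_records = [
    (("静か", "な"), 120),
    (("静か", "に"), 95),
    (("眼鏡",), 300),
    (("静か", "な"), 30),  # the same 2-gram arriving from a second input file
]


def build_index(records):
    """Merge duplicate n-grams, then return sorted keys with parallel counts."""
    merged = {}
    for tokens, count in records:
        # Assumption: the lookup key is the surface form obtained by
        # concatenating the tokens, so ("静か", "な") is found as "静かな".
        key = "".join(tokens)
        merged[key] = merged.get(key, 0) + count
    keys = sorted(merged)
    counts = [merged[k] for k in keys]
    return keys, counts


def lookup(index, surface):
    """Binary-search the sorted keys; return 0 when the form is absent."""
    keys, counts = index
    i = bisect.bisect_left(keys, surface)
    if i < len(keys) and keys[i] == surface:
        return counts[i]
    return 0


index = build_index(raw_records)
print(lookup(index, "静かな"))
```

With these made-up records the two "静か+な" rows are merged before
indexing, so a query for 静かな returns their combined count.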

> Is it possible to keep them in a column in the jmdict sql database?
> (Even better: scaled to a 0.0 to 1.0 range)
> And then have the option of a special export that will include them in
> the xml export?

In theory, yes, especially for the Kyoto ones. I'm a bit chary about
letting a repository of data derived from the Google ones be available
like that.
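
On the 0.0 to 1.0 scaling suggested in the quoted question, a minimal sketch
of one possible approach, assuming each form's count is simply divided by
the largest count among the forms being compared. The function name
scale_counts is hypothetical and the choice of scaling is an assumption;
nothing like this exists in the JMdict database. The input numbers are the
Google n-gram counts quoted earlier in this message.

```python
def scale_counts(counts):
    """Scale raw counts into the range 0.0 to 1.0 relative to the largest."""
    top = max(counts.values(), default=0)
    if top == 0:
        # No forms at all, or every count is zero: nothing to scale against.
        return {form: 0.0 for form in counts}
    return {form: n / top for form, n in counts.items()}


# Counts quoted earlier in this thread.
counts = {"眼鏡": 2228370, "めがね": 1206845, "メガネ": 3949144}
scaled = scale_counts(counts)
for form, value in scaled.items():
    print(form, round(value, 3))
```

Scaling within one entry's forms makes the most common form 1.0; scaling
against the whole corpus would need a different denominator, and probably a
logarithm, since the raw counts span several orders of magnitude.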

> I'd love these numbers on all entries, rather than trying to draw an
> arbitrary line between  uK, rK, ofk, etc.

You can see them via WWWJDIC, e.g. go to
and drop the "Links" menu and click on "N-gram counts".

> If the license does not allow it, does it allow end users to download
> the data and this merge on their own machine?

The raw data files are humungous (36GB and 45GB) as they contain
raw n-gram sequences with particles, broken-up verb inflections, etc.
The ~800k surface forms and readings in JMdict only cover a very small
proportion. In theory I could generate Kyoto counts for them all (I do
have batch utilities for that) but I've found doing it dynamically works fine.

See if the online access works for you.



(*) I asked people at Google about freeing up the Japanese n-grams, as their
later n-gram corpora don't have any such restrictions. I got sympathy, but
for them it's a 10-yo "20% project" by Kudo et al. and no-one can be bothered
putting in the effort to sort it out with the LDC.

Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University