
Re: [edict-jmdict] [rare] tag for obscure kanji?



G'day Darren,
On Thu, 14 Mar 2019 at 18:35, Darren Cook darren@dcook.org
[edict-jmdict] <edict-jmdict@yahoogroups.com> wrote:

> > Google N-grams:
> > 眼鏡    2228370
> > めがね  1206845
> > メガネ  3949144
>
> Can someone remind me where those n-grams are, and their license?

They're from the LDC at UPenn. See: https://catalog.ldc.upenn.edu/LDC2009T08
I got access to it via the Univ. of Melbourne's LDC subscription.

The other set of n-grams, which I call the Kyoto/Melbourne N-gram
Corpus, was built from the Kyoto corpus (500M sentences).

From both of these corpora I generated lists of n-gram sequences and
counts, then sorted and merged them and set up ISAM-like indices. So
the counts for 静かな came from the 2-gram "静か+な".
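
As a rough illustration of the sort/merge/index step described above,
here is a minimal Python sketch. It is not the actual tooling: the
names merge_counts and NgramIndex are hypothetical, an in-memory
sorted list with binary search stands in for the on-disk ISAM-like
indices, and any counts fed to it in this form are made up.

```python
import bisect
import heapq
from itertools import groupby
from operator import itemgetter


def merge_counts(*sorted_lists):
    """Merge several (ngram, count) lists, each already sorted by ngram,
    summing the counts of n-grams that occur in more than one list."""
    merged = heapq.merge(*sorted_lists, key=itemgetter(0))
    return [
        (ngram, sum(count for _, count in group))
        for ngram, group in groupby(merged, key=itemgetter(0))
    ]


class NgramIndex:
    """Sorted in-memory index; a stand-in for an ISAM-like file index."""

    def __init__(self, sorted_pairs):
        self.keys = [ngram for ngram, _ in sorted_pairs]
        self.counts = [count for _, count in sorted_pairs]

    def count(self, ngram):
        """Return the count for an n-gram (a tuple of tokens), or 0."""
        i = bisect.bisect_left(self.keys, ngram)
        if i < len(self.keys) and self.keys[i] == ngram:
            return self.counts[i]
        return 0
```

With an index built this way, a lookup for 静かな would be a query on
the token tuple ("静か", "な").
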
The UPenn/LDC licence is rather restrictive(*), so I've been a bit
wary about making the Google version widely known. The Kyoto people
were more relaxed. The public interface is via my CGI programs:
Google: http://nlp.cis.unimelb.edu.au/jwb/ngramcounts.html
Kyoto: http://nlp.cis.unimelb.edu.au/jwb/ngramcountswww.html

> Is it possible to keep them in a column in the jmdict sql database?
> (Even better: scaled to a 0.0 to 1.0 range)
>
> And then have the option of a special export that will include them in
> the xml export?

In theory, yes, especially for the Kyoto ones. I'm a bit chary about
letting a repository of data derived from the Google ones be available
like that.
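
On the 0.0 to 1.0 scaling: nobody in this thread has specified a
method, so the following is only one possible approach, sketched in
Python. It assumes log scaling against the largest count in the set,
and the function name scale_counts is hypothetical. The example
counts are the Google figures quoted above.

```python
import math


def scale_counts(counts):
    """Map raw counts to the range 0.0-1.0 using log scaling.

    The largest count in the input maps to 1.0 and a count of 0 maps
    to 0.0. Log scaling is an assumption here; linear scaling would
    push most entries very close to 0.
    """
    max_count = max(counts.values(), default=0)
    if max_count <= 0:
        return {form: 0.0 for form in counts}
    denom = math.log1p(max_count)
    return {form: math.log1p(c) / denom for form, c in counts.items()}


google_counts = {"眼鏡": 2228370, "めがね": 1206845, "メガネ": 3949144}
scaled = scale_counts(google_counts)
```

Note that scaling within a small set like this only ranks the three
forms against each other; a column in the database would presumably
need scaling against the largest count over all entries.
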

> I'd love these numbers on all entries, rather than trying to draw an
> arbitrary line between uK, rK, ofk, etc.

You can see them via WWWJDIC, e.g. go to
https://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1MDJ%B4%E3%B6%C0
and drop the "Links" menu and click on "N-gram counts".

> If the license does not allow it, does it allow end users to download
> the data and do this merge on their own machine?

The raw data files are humungous (36GB and 45GB) as they contain
raw n-gram sequences with particles, broken-up verb inflections, etc.
The ~800k surface forms and readings in JMdict only cover a very small
proportion. In theory I could generate Kyoto counts for them all (I do
have batch utilities for that) but I've found doing it dynamically works fine.

See if the online access works for you.

Cheers

Jim

(*) I asked people at Google about freeing up the Japanese n-grams, as
their later n-gram corpora don't have any such restrictions. I got
sympathy, but for them it's a 10-yo "20% project" by Kudo et al. and
no-one can be bothered putting in the effort to sort it out with the
LDC.

-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/