
Re: [edict-jmdict] [rare] tag for obscure kanji?



Hi,

On Tue, 19 Mar 2019 at 13:00, eiennohito@gmail.com [edict-jmdict]
<edict-jmdict@yahoogroups.com> wrote:

> I'm preparing n-grams for the whole 10B corpus; here are some teasers and some general information.

Sounds good.

> The output will be something like the files here:
> https://tulip.kuee.kyoto-u.ac.jp/ngrams/test-03/
> The files here are computed from ~3M sentences.
>
> The format is line-based, each line is [ngram]\t[count].
> [ngram] is a Jumandic-based morpheme ngram.
> I chose to keep morpheme information because it could be useful for post-processing.
> You can discard everything except surface information to get the "usual" n-grams.

But at the price of having a huge amount of additional data.

> The n-gram format is [morpheme]0x02[morpheme...];
> each [morpheme] consists of seven fields separated by the 0x01 byte.
> Fields are:
> * surface
> * reading
> * dictionary form
> * rough pos
> * fine pos
> * conjugation type
> * conjugation form
>
> The corpus itself is analyzed by Juman++.
> N-grams are sorted by frequency, then by the n-gram string.

I hope you are able to produce a subset of just the n-grams and their
counts. The other information is not that useful, IMO.
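
For what it's worth, producing that subset should be a cheap one-pass
job. A minimal sketch in Python, assuming gzipped UTF-8 input, the
0x01/0x02 separators described above, and surface being the first of
the seven fields; the space-joining of surfaces (as in the Google
n-grams) is my assumption, and the function name is just illustrative:

    import gzip

    FIELD_SEP = "\x01"     # separates the seven fields of a morpheme
    MORPHEME_SEP = "\x02"  # separates the morphemes of an n-gram

    def surface_ngrams(path):
        """Yield (surface-only n-gram, count) pairs, discarding the
        reading, dictionary form, POS and conjugation fields."""
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                ngram, count = line.rstrip("\n").split("\t")
                surfaces = [m.split(FIELD_SEP)[0]
                            for m in ngram.split(MORPHEME_SEP)]
                yield " ".join(surfaces), int(count)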

> I am going to cut off n-grams by frequency with a threshold of 100 / [ngram order].

As I'm sure you know, the Google ones have a cutoff of 20. The ones I
did from the smaller Kyoto corpus have no cutoff, as I wanted to capture
even the low counts.
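
For reference, 100 / [ngram order] works out to a cutoff of 100 for
unigrams, 50 for bigrams and 20 for 5-grams, i.e. the same as the
Google cutoff at that order. A quick sketch of such a filter, assuming
the order can be recovered by counting the 0x02 separators (the helper
name is hypothetical):

    def above_cutoff(ngram, count):
        """True if the n-gram survives the 100 / [ngram order]
        frequency threshold."""
        order = ngram.count("\x02") + 1
        return count >= 100 / order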

> Some size info:
> 10B sentences are something like 320GB of gzipped plain text.
> 10B analyzed sentences are something like 5.6TB of gzipped text.

I'm not surprised.

> High-order n-grams are going to be huge, and we should think about the best way to transfer them.

Yes, you usually end up shipping DVDs with those sorts of file sizes.

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/