Re: [edict-jmdict] [rare] tag for obscure kanji?



Good day, Jim

I'm preparing n-grams for the whole 10B-sentence corpus. Here are some teasers and general info.

The output will look something like the files here:
https://tulip.kuee.kyoto-u.ac.jp/ngrams/test-03/
The files here are computed from ~3M sentences.

The format is line-based, each line is [ngram]\t[count].
[ngram] is a Jumandic-based morpheme ngram.
I chose to keep morpheme information because it could be useful for post-processing.
You can discard everything except surface information to get the "usual" n-grams.

The n-gram format is [morpheme]0x02[morpheme...];
each [morpheme] consists of seven fields separated by a 0x01 byte.
Fields are:
* surface
* reading
* dictionary form
* rough pos
* fine pos
* conjugation type
* conjugation form
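
For illustration, a minimal Python sketch of reading this format and collapsing it to surface-only n-grams. It assumes the files are UTF-8 text and that the fields appear in the order listed above; the names parse_line, surface_counts and FIELDS are made up for this example and are not part of any released tooling. Counts are summed because several morpheme n-grams can share one surface string.

```python
from collections import Counter

FIELD_SEP = "\x01"     # separates the seven fields of one morpheme
MORPHEME_SEP = "\x02"  # separates the morphemes inside one n-gram

# Hypothetical field names, in the order given in the list above.
FIELDS = (
    "surface", "reading", "dictionary_form", "rough_pos",
    "fine_pos", "conjugation_type", "conjugation_form",
)


def parse_line(line):
    """Split one '[ngram]<TAB>[count]' line into (morphemes, count)."""
    # Split on the last tab so the count is always the final column.
    ngram, count = line.rstrip("\n").rsplit("\t", 1)
    morphemes = []
    for raw in ngram.split(MORPHEME_SEP):
        values = raw.split(FIELD_SEP)
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
        morphemes.append(dict(zip(FIELDS, values)))
    return morphemes, int(count)


def surface_counts(lines):
    """Drop everything except the surface forms and merge the counts."""
    totals = Counter()
    for line in lines:
        morphemes, count = parse_line(line)
        totals[tuple(m["surface"] for m in morphemes)] += count
    return totals
```

Typical use would be something like surface_counts(open(path, encoding="utf-8")), where path points at one of the uncompressed n-gram files.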

The corpus itself is analyzed by Juman++.
N-grams are sorted by frequency followed by the n-gram string.

I am going to cut off n-grams by frequency, with a threshold of 100 / [ngram order].
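
As a rough sketch of what that cutoff means per order: the code below assumes exact (non-rounded) division and that an n-gram is kept when its count is at least the threshold. Neither detail is stated above, and the function names are hypothetical.

```python
def cutoff_threshold(order, base=100):
    """Frequency threshold for n-grams of the given order: base / order."""
    if order < 1:
        raise ValueError("n-gram order must be at least 1")
    return base / order


def keep_ngram(count, order):
    """Assumed rule: keep an n-gram whose count reaches the threshold."""
    return count >= cutoff_threshold(order)


if __name__ == "__main__":
    for order in range(1, 6):
        print(order, cutoff_threshold(order))
```

Under these assumptions unigrams need 100 occurrences, bigrams 50, and 4-grams 25.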

Some size info:
10B sentences are roughly 320GB of gzipped plain text.
10B analyzed sentences are roughly 5.6TB of gzipped text.
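
To put those figures in per-sentence terms, a back-of-the-envelope calculation, assuming decimal units (1GB = 10^9 bytes, 1TB = 10^12 bytes) and treating the quoted sizes as exact:

```python
SENTENCES = 10 * 10**9           # 10B sentences
PLAIN_BYTES = 320 * 10**9        # 320GB of gzipped plain text
ANALYZED_BYTES = 5600 * 10**9    # 5.6TB of gzipped analyzed text

plain_per_sentence = PLAIN_BYTES / SENTENCES        # 32.0 bytes
analyzed_per_sentence = ANALYZED_BYTES / SENTENCES  # 560.0 bytes
blowup = ANALYZED_BYTES / PLAIN_BYTES               # 17.5x

if __name__ == "__main__":
    print(plain_per_sentence, analyzed_per_sentence, blowup)
```

So the analysis adds about a 17.5x size increase over the gzipped raw text.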

High-order n-grams are going to be huge, so we should think about the best way to transfer them.

Arseny