Re: [edict-jmdict] [rare] tag for obscure kanji?
Good day, Jim
I'm preparing n-grams for the whole 10B-sentence corpus. Here are some teasers and general info.
The files here are computed from ~3M sentences.
The format is line-based, each line is [ngram]\t[count].
[ngram] is a Jumandic-based morpheme ngram.
I chose to keep morpheme information because it could be useful for post-processing.
You can discard everything except surface information to get the "usual" n-grams.
The n-gram format is [morpheme]0x02[morpheme]..., i.e. morphemes joined by a 0x02 byte.
Each [morpheme] consists of seven fields separated by a 0x01 byte.
Fields are:
* surface
* reading
* dictionary form
* rough pos
* fine pos
* conjugation type
* conjugation form
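
To make the layout above concrete, here is a minimal Python sketch of a reader for one line. It assumes the files are UTF-8 text once gunzipped and that the 0x01 / 0x02 separators never occur inside a field. The names parse_line, surface_ngram and FIELD_NAMES are mine, not part of the data. The space used to join surfaces is also just a choice for display.

```python
from collections import Counter

MORPHEME_SEP = "\x02"  # separates morphemes inside one n-gram
FIELD_SEP = "\x01"     # separates the seven fields of one morpheme
FIELD_NAMES = (
    "surface", "reading", "dictionary_form",
    "rough_pos", "fine_pos", "conjugation_type", "conjugation_form",
)

def parse_line(line):
    """Split one '[ngram]\\t[count]' line into (list of field dicts, count)."""
    ngram, count = line.rstrip("\n").rsplit("\t", 1)
    morphemes = []
    for raw in ngram.split(MORPHEME_SEP):
        fields = raw.split(FIELD_SEP)
        if len(fields) != len(FIELD_NAMES):
            raise ValueError(f"expected 7 fields, got {len(fields)}: {raw!r}")
        morphemes.append(dict(zip(FIELD_NAMES, fields)))
    return morphemes, int(count)

def surface_ngram(morphemes, joiner=" "):
    """Keep only the surface forms, giving the 'usual' n-gram."""
    return joiner.join(m["surface"] for m in morphemes)

def surface_counts(lines):
    """Sum counts of n-grams that become identical once reduced to surfaces."""
    totals = Counter()
    for line in lines:
        morphemes, count = parse_line(line)
        totals[surface_ngram(morphemes)] += count
    return totals
```

Note that several morpheme n-grams can collapse to the same surface n-gram (same surface, different analysis), so their counts have to be summed, which is what surface_counts does.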
The corpus itself is analyzed by Juman++.
N-grams are sorted by frequency, then by the n-gram string.
I am going to cut off n-grams by frequency, with a threshold of 100 / [ngram order].
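
As an illustration of that cutoff, here is a small Python sketch. It assumes an n-gram is kept when its count is at least the threshold (whether the comparison ends up inclusive or strict is not decided here), and the function name keep_ngram is hypothetical. Comparing count * order against 100 avoids fractional thresholds such as 100 / 3.

```python
def keep_ngram(count, order, base=100):
    """Return True if an n-gram of the given order survives the cutoff.

    The threshold is base / order; the inclusive comparison is an assumption.
    """
    if order < 1:
        raise ValueError("n-gram order must be >= 1")
    return count * order >= base

# Smallest surviving count for orders 1..5 under this assumption.
min_counts = {
    order: next(c for c in range(1, 101) if keep_ngram(c, order))
    for order in range(1, 6)
}
```

With these assumptions a unigram needs 100 occurrences, a bigram 50, and a trigram 34.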
Some size info:
10B sentences are something like 320GB of gzipped plain text.
10B analyzed sentences are something like 5.6TB of gzipped text.
High-order n-grams are going to be huge, so we should think about the best way to transfer them.
Arseny