Re: [edict-jmdict] [rare] tag for obscure kanji?

Good day, Jim

I'm preparing n-grams for the whole 10B-sentence corpus; here are some teasers and general info.

The output will look something like the files here:
https://tulip.kuee.kyoto-u.ac.jp/ngrams/test-03/

The files here are computed from ~3M sentences.

The format is line-based: each line is [ngram]\t[count].

[ngram] is a Jumandic-based morpheme ngram.

I chose to keep morpheme information because it could be useful for post-processing.

You can discard everything except surface information to get the "usual" n-grams.

The n-gram format is [morpheme]0x02[morpheme...], that is, morphemes separated by a 0x02 byte.

Each [morpheme] consists of seven fields separated by a 0x01 byte.

Fields are:

* surface
* reading
* dictionary form
* rough pos
* fine pos
* conjugation type
* conjugation form
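
To make the format concrete, here is a minimal Python sketch of a reader for one line. It assumes the files are UTF-8 text and that the fields appear in exactly the order listed above; the names Morpheme, parse_line, surface_ngram and surface_counts are made up for this illustration and are not part of any released tool.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple, Tuple

FIELD_SEP = "\x01"     # separates the seven fields of one morpheme
MORPHEME_SEP = "\x02"  # separates the morphemes of one n-gram


class Morpheme(NamedTuple):
    # Field order follows the list in the message (assumed to be exact).
    surface: str
    reading: str
    dictionary_form: str
    rough_pos: str
    fine_pos: str
    conjugation_type: str
    conjugation_form: str


def parse_line(line: str) -> Tuple[List[Morpheme], int]:
    """Split one '[ngram]<TAB>[count]' line into morphemes and a count."""
    ngram, count = line.rstrip("\n").rsplit("\t", 1)
    morphemes = []
    for chunk in ngram.split(MORPHEME_SEP):
        fields = chunk.split(FIELD_SEP)
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}")
        morphemes.append(Morpheme(*fields))
    return morphemes, int(count)


def surface_ngram(morphemes: List[Morpheme]) -> Tuple[str, ...]:
    """Keep only the surface forms, giving a 'usual' n-gram."""
    return tuple(m.surface for m in morphemes)


def surface_counts(lines: Iterable[str]) -> Dict[Tuple[str, ...], int]:
    """Sum counts of morpheme n-grams that share the same surface n-gram."""
    totals: Counter = Counter()
    for line in lines:
        morphemes, count = parse_line(line)
        totals[surface_ngram(morphemes)] += count
    return dict(totals)
```

Note that once the other six fields are dropped, several morpheme n-grams can collapse into the same surface n-gram (for example, two analyses that differ only in reading), so their counts have to be summed, as surface_counts does.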

The corpus itself is analyzed by Juman++.

N-grams are sorted by frequency, then by the n-gram string.
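
As an illustration of that ordering, here is a small Python sketch. The message does not say which direction either key runs in, so this assumes descending frequency with ties broken by ascending code-point order of the n-gram string; sort_ngrams is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple


def sort_ngrams(entries: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Order (ngram, count) pairs by count, then by the n-gram string.

    Assumption: highest counts first, ties in ascending string order.
    """
    return sorted(entries, key=lambda entry: (-entry[1], entry[0]))
```

If the files turn out to use a different direction or a byte-wise comparison, only the key function needs to change.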

I am going to cut off n-grams by frequency, with a threshold of 100 / [ngram order].
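
A rough Python sketch of that cutoff rule follows. Two details are assumptions, since the message does not state them: the division is not rounded, and an n-gram whose count equals the threshold is kept. The function names are hypothetical.

```python
def cutoff_threshold(order: int, base: int = 100) -> float:
    """Frequency threshold for n-grams of the given order: base / order."""
    if order < 1:
        raise ValueError("n-gram order must be at least 1")
    return base / order


def passes_cutoff(count: int, order: int) -> bool:
    """Assumed rule: keep an n-gram when its count reaches the threshold."""
    return count >= cutoff_threshold(order)
```

Under these assumptions a unigram needs a count of at least 100, a bigram at least 50, and a 5-gram at least 20.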

Some size info:

10B sentences are something like 320GB of gzipped plain text.

10B analyzed sentences are something like 5.6TB of gzipped text.
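
For a sense of scale, those two figures can be turned into per-sentence averages. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and treats the quoted sizes as exact, which they are not.

```python
SENTENCES = 10 * 10**9            # 10B sentences
PLAIN_GZ_BYTES = 320 * 10**9      # ~320GB of gzipped plain text
ANALYZED_GZ_BYTES = 5600 * 10**9  # ~5.6TB of gzipped analyzed text

plain_per_sentence = PLAIN_GZ_BYTES / SENTENCES        # average bytes per sentence
analyzed_per_sentence = ANALYZED_GZ_BYTES / SENTENCES  # average bytes per sentence
blowup = ANALYZED_GZ_BYTES / PLAIN_GZ_BYTES            # analyzed vs. plain size
```

That works out to about 32 bytes per gzipped plain sentence, about 560 bytes per gzipped analyzed sentence, and a blow-up of roughly 17.5x from the analysis.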

High-order n-grams are going to be huge, so we should think about how best to transfer them.

Arseny
