Japanese Word Frequencies - Google N-Gram Corpus

James Breen
jimbreen@gmail.com
Introduction

This paper describes a WWW-based tool (https://www.edrdg.org/~jwb/ngramcounts.html) for deriving the frequency counts of sequences of Japanese text from the Google Japanese N-Gram Corpus. The paper is in several sections:

  1. the compilation of the original n-gram corpus;
  2. the contents, format and distribution of the corpus;
  3. the development of the WWW-based tool, including the data structure and format;
  4. the operation of the WWW-based tool.

There is also an appendix in which the author describes the processes which led to the development of the tool.

Compilation of the Google N-Gram Corpus

The corpus was compiled within Google Japan in 2007 in a "20%" or "side project" carried out by Taku Kudo and Hideto Kazawa, using a crawl of publicly accessible web pages made in July of that year. In summary, the n-gram counts were generated by extracting sentences from the crawled pages, segmenting them into morphemes (using the IPADIC lexicon, as noted in the Conclusion), and counting the resulting morpheme sequences.

A total of over 20 billion sentences were analyzed, with n-gram counts ranging from about 2.5 million unique 1-grams up to 776 million unique 5-grams.
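
As an illustration of the counting step only (the actual Google pipeline is not reproduced here), the sketch below assumes the sentences have already been segmented into morphemes and simply tallies every n-gram up to a given length:

from collections import Counter

def count_ngrams(segmented_sentences, max_n):
    """Count n-gram occurrences over morpheme-segmented sentences.

    segmented_sentences: iterable of lists of morphemes, e.g. [["学校", "の", "先生"], ...]
    max_n: the longest n-gram length to count.
    """
    counts = Counter()
    for morphemes in segmented_sentences:
        for n in range(1, max_n + 1):
            for i in range(len(morphemes) - n + 1):
                counts[tuple(morphemes[i:i + n])] += 1
    return counts

# Two toy "sentences" already split into morphemes.
sentences = [["学校", "の", "先生"], ["学校", "に", "行く"]]
for ngram, count in count_ngrams(sentences, 3).most_common(5):
    print(" ".join(ngram), count)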

Contents, Format and Distribution of the Corpus

Each set of n-grams was sorted and divided into files of 10 million items each, making a total of over 300 files. The compressed files amount to over 26 GB of data. The collection of files and documentation was initially distributed as a set of 6 DVDs. In the distribution, the files of n-grams and counts were placed in a directory structure as shown in the example below for the 2-grams.

data/2gms/
data/2gms/2gm.idx
data/2gms/2gm-0000.gz
data/2gms/2gm-0001.gz
data/2gms/2gm-0002.gz
data/2gms/2gm-0003.gz
data/2gms/2gm-0004.gz
data/2gms/2gm-0005.gz
data/2gms/2gm-0006.gz
data/2gms/2gm-0007.gz
data/2gms/2gm-0008.gz

The "2gm.idx" file is a text file listing the initial n-grams in the individual compressed data files.

2gm-0000.gz	! </S>
2gm-0001.gz	☆ 青のり
2gm-0002.gz	たくさん 臨時
2gm-0003.gz	も 追い返せ
2gm-0004.gz	デバイス CD
2gm-0005.gz	伯 朗
2gm-0006.gz	引き続き 指摘
2gm-0007.gz	発 叩き込も
2gm-0008.gz	高かろ ー
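
Because the data files are in sorted order, the index can be used to decide which compressed file needs to be opened for a particular n-gram. The following sketch (the function names are illustrative, and the two-column tab-separated layout shown above is assumed) picks the appropriate file with a binary search over the first entries:

import bisect

def load_index(idx_path):
    """Return parallel lists of file names and their first n-grams."""
    files, firsts = [], []
    with open(idx_path, encoding="utf-8") as f:
        for line in f:
            fname, first = line.rstrip("\n").split("\t")
            files.append(fname)
            firsts.append(first)
    return files, firsts

def file_for_ngram(ngram, files, firsts):
    """Pick the data file whose range of sorted n-grams would contain `ngram`."""
    # bisect_right locates the first file whose initial entry sorts after the
    # target, so the target belongs in the preceding file.
    pos = bisect.bisect_right(firsts, ngram) - 1
    return files[max(pos, 0)]

files, firsts = load_index("data/2gms/2gm.idx")
print(file_for_ngram("学校 の", files, firsts))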

The following is a sample of 3-grams from the 3gm-0027 file:

天白 保呂 町    42
天白 信仰 の    21
天白 側 つや消し    30
天白 児童 館    251

The n-gram corpus was first announced on the Google Japan blog site. The formal title of the corpus is "Japanese Web N-gram Version 1".

The initial distribution was made via the Language Resources Association (言語資源協会 - GSK) in Japan, with the DVD sets priced from ¥22,000 to ¥88,000. In 2009 the corpus was added to the catalogue of the Linguistic Data Consortium (LDC) at the University of Pennsylvania, where it is available by download only: free to LDC members and US$150 for non-members. The LDC site also includes the documentation of the corpus.

Development of a WWW-based Tool

The Japanese n-gram corpus is an invaluable resource for determining the relative frequencies of lexical items in Japanese. It is especially useful for examining the frequencies of text sequences, for such purposes as testing whether nouns are used in conjunction with the verb "する", whether verbs are used transitively, and so on.

As distributed, the corpus is relatively difficult to use for finding the frequencies of individual terms. This is due in part to:

  1. the structure of the data, with the terms spread over hundreds of files;
  2. the fact that many lexical items are made up of several morphemes, and the morpheme boundaries would need to be determined before searching for the appropriate n-gram sequence. For example, 循環器内科 (cardiovascular medicine) is a 3-gram (循環+器+内科), which may not be immediately obvious (see the segmentation sketch following this list).
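
One convenient way to find out how many morphemes, and hence which n-gram set, a term corresponds to is to run it through MeCab. A minimal sketch using the mecab-python3 bindings (the segmentation obtained will depend on the dictionary installed, e.g. IPADIC or UniDic):

import MeCab  # mecab-python3 bindings

# The "-Owakati" output format returns the morphemes separated by spaces.
tagger = MeCab.Tagger("-Owakati")

def to_ngram(term):
    """Return the term as a space-separated morpheme sequence plus its length."""
    morphemes = tagger.parse(term).split()
    return " ".join(morphemes), len(morphemes)

print(to_ngram("循環器内科"))   # expected (with IPADIC): ('循環 器 内科', 3)
print(to_ngram("公民権運動"))   # expected (with IPADIC): ('公民 権 運動', 3)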

While it is possible to use a tailored approach to extracting n-gram frequencies for a particular investigation, it would be much better to have a general, easy-to-use system which enables the corpus to be searched for segments of Japanese text and the matching counts returned.

Accordingly, an approach was developed which enables segments of Japanese text to be quickly matched against the corpus. This approach, which is described in detail in the Appendix, involved reorganizing the contents of the corpus and establishing an indexing system which enables the rapid location of specified text segments and the extraction of the counts. A software module was developed for locating and extracting corpus entries, and this module has been used both in web server programs and batch programs, with considerable success.

Operation of the WWW-based Tool

The WWW-based n-gram corpus tool is accessed via a single page at https://www.edrdg.org/~jwb/ngramcounts.html. Several tasks can be initiated from this page. All involve entering one or more strings of Japanese characters and, where appropriate, choosing one of the options.

  1. Basic Operation

    This is the default operation with no options selected. It returns the counts for the terms entered. For example, entering "学校 學校 字体" results in the following page:

    Google N-gram Corpus Counts
    
    学校	48641365
    學校	27255
    字体	232177
    

  2. (Option) Most common 10 (100) terms

    For the entered term, this option initiates a search for the most common 10 (or 100) n-gram sequences which begin with the characters provided, displayed in descending count order. There are sub-options for displaying the matches either in a block order according to the characters following the search key, or in a tree order, and for displaying the morpheme components of the matched sequences, as determined by MeCab/Unidic. As an example, the most common 10 sequences starting with 学校 are shown below:

    Top 10 N-grams Lookup for 学校 (Frequency Order)
    
    学校	48633879
    学校の	7319568
    学校に	4398391
    学校で	3872167
    学校を	1881651
    学校が	1507934
    学校へ	1320734
    学校は	1223917
    学校・	1098759
    学校から	1072889
    学校教育	989795
    

  3. (Option) Term counts with selected affixes

    In this option, a search is made for the specified term combined with a selection of common particles and other affixes; a sketch of how such query strings can be generated follows the list of options. In this example we see the (partial) results for 勉強.

    勉強	31502952
    勉強は	664738
    勉強が	840674
    勉強な	68089
    勉強の	1308524
    ...
    勉強を	2339108
    勉強する	1644637
    勉強して	3341856
    勉強しない	310272
    ...
    の勉強	5018765
    に勉強	1186926
    を勉強	2021162
    ...
    

  4. (Option) Term counts of verb inflections

    In this option, the search is carried out for 25 common inflections of a verb; a sketch of generating such inflected forms follows the list of options. The following example is for 食べる.

    食べる	19179416
    食べます	1168041
    食べない	1882110
    食べぬ	1561
    食べず	354272
    食べません	281265
    食べた	10032006
    食べました	3397730
    食べなかった	227567
    .....
    

  5. (Option) Raw n-gram counts

    In this option, the first 100 n-grams and counts starting from the search string are returned. The following example shows the terms starting with 勉強.

    勉強	2871
    勉強	31500081
    勉強々	115
    勉強々々	53
    勉強あ	191
    勉強あい	198
    勉強あいう	51
    勉強あいうえ	141
    勉強あいうえ	51
    勉強あいうえおかき	46
    勉強あいうえおかきくい	46
    勉強あいうえおっ	141
    勉強あいうえおって	141
    勉強あいうえおってなあに	141
    .....
    

    Note that both 勉強 and 勉強あいうえ have returned two counts. This is most likely the result of inconsistent segmentation when the n-grams were compiled: most of the time 勉強 was identified as a 1-gram, but on occasion it seems to have been treated as a 2-gram.

    Also note that the counts for 勉強あいうえおっ, 勉強あいうえおって and 勉強あいうえおってなあに are the same. This indicates that in the original text collection 勉強あいうえおっ and 勉強あいうえおって only occurred as part of 勉強あいうえおってなあに, and hence the 4-gram, 5-gram and 6-gram counts are identical.
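
The affix and inflection options both amount to generating a set of query strings from the entered term and looking each one up in the n-gram data in the same way as a directly entered term. The sketch below illustrates this generation for the forms visible in the samples above; the affix selection and the full set of 25 inflections actually used by the tool are not reproduced here, and only ichidan (る-dropping) verbs are handled:

# Illustrative lists only, based on the sample output above; they are not the
# tool's actual affix or inflection tables.
SUFFIXES = ["は", "が", "な", "の", "を", "する", "して", "しない"]
PREFIXES = ["の", "に", "を"]
ICHIDAN_ENDINGS = ["る", "ます", "ない", "ぬ", "ず", "ません", "た", "ました", "なかった"]

def affix_queries(term):
    """Candidate strings for the 'term with selected affixes' option."""
    return [term] + [term + s for s in SUFFIXES] + [p + term for p in PREFIXES]

def ichidan_inflections(dictionary_form):
    """Inflected forms of an ichidan verb, e.g. 食べる -> 食べます, 食べない, ..."""
    stem = dictionary_form[:-1]        # drop the final る: 食べる -> 食べ
    return [stem + ending for ending in ICHIDAN_ENDINGS]

print(affix_queries("勉強"))
print(ichidan_inflections("食べる"))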

Conclusion

The WWW-based tool has proved to be very useful in several areas, especially in evaluating potential terms for inclusion in dictionaries. It has made the underlying data in the Google N-gram Corpus available to many people who would otherwise have been unable to use it.

The main problem with the corpus and the search tool is that the data is from 2007 and does not capture neologisms and changes in word usage. Another problem arises from occasional segmentation issues. The IPADIC lexicon which was used for the task was known to have some flaws; for example, the term 日本語, which is usually regarded as a morpheme pair (日本+語), was treated as a single morpheme.

Appendix - Data Structure and Tool Development

As mentioned in the article, the structure of the data in the original corpus distribution does not cater very well for the process of looking up the frequency counts for single terms or sets of terms. Most users wishing to find the frequency of a term such as 公民権運動 (civil rights movement) are not really interested in the fact that it consists of 3 morphemes (公民+権+運動) and is hence classed as a 3-gram.

In order to make the data available for flexible identification of term frequencies, it was reorganized as follows:

  1. the contents of the 300-odd n-gram files were modified to convert the n-gram components into single text strings. At the same time n-grams which contained alphanumeric characters or non-text characters (other than the common nakaguro (middle-dot: ・) character) were discarded. A sketch of this step follows the list;
  2. the resulting terms and counts were sorted into a single file in character (Unicode code point) order. This resulted in a 36 GB text file containing 1.3 billion items. (This process was carried out by sorting many smaller files and merging the results.)
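
As an illustration of the conversion step, the sketch below assumes the distributed files are tab-separated (n-gram, then count) and uses a regular expression as a rough approximation of the "alphanumeric or non-text" filtering rule; it is not the script actually used, and the file names are illustrative:

import gzip
import re

# Rough approximation of the filtering rule: reject ASCII and full-width
# alphanumerics and ASCII punctuation. The nakaguro (・) is not matched and
# is therefore retained.
NON_TEXT = re.compile(r"[0-9A-Za-z０-９Ａ-Ｚａ-ｚ!-/:-@\[-`{-~]")

def convert_file(in_path, out_file):
    """Join each n-gram's morphemes into one string and filter unwanted items."""
    with gzip.open(in_path, mode="rt", encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").split("\t")
            term = ngram.replace(" ", "")     # "天白 児童 館" -> "天白児童館"
            if NON_TEXT.search(term):
                continue                      # discard alphanumerics etc.
            out_file.write(f"{term}\t{count}\n")

with open("3gms-joined.txt", "w", encoding="utf-8") as out:
    convert_file("data/3gms/3gm-0027.gz", out)

Note that joining the morphemes changes the relative ordering of the entries, so the converted files need to be re-sorted before the final merge; GNU sort under LC_ALL=C (so that the ordering matches the indexing scheme), followed by sort -m over the sorted pieces, is one way to do this.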

Although it would be possible to create a database containing the text strings and counts, and use database retrieval commands to find selected strings, this would result in an extremely large database. Instead a simpler approach was devised which enables text strings to be retrieved rapidly. The approach is as follows:

  1. an index file was created consisting of 64-bit integers representing the byte offsets of the first line in the file for each of the kana, kanji, etc. characters in the Basic Multilingual Plane of Unicode. A second index file was created containing the equivalent offsets for lines beginning with each initial pair of kana.
  2. to retrieve the n-gram counts for a term such as 公民権, the following process is applied:
    1. the text file containing all the n-gram data is opened in read-only mode and initialized for "memory-mapped" access;
    2. the offset to the first of the lines beginning with 公 is extracted from the index file;
    3. a "seek" command is issued using the offset to position the file-reading process to that line in the file;
    4. the file is then read sequentially until a match (if any) is made with the requested text sequence.

While it may seem that using sequential record reading is inefficient, in fact the process is quite rapid as it involves very few instructions and takes advantage of the efficiencies in the operating and file systems when handling read-only files. The two index files are very small in comparison with the n-gram data file.
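
The lookup module itself is not reproduced here. The sketch below illustrates the general approach under some simplifying assumptions: the data file is the sorted, UTF-8, tab-separated term/count file described above, the index is held as an in-memory dictionary keyed on the first character rather than as files of raw 64-bit offsets, and the file names are illustrative:

import mmap

def build_index(data_path):
    """Record the byte offset of the first line starting with each character."""
    offsets = {}
    pos = 0
    with open(data_path, "rb") as f:
        for line in f:
            first_char = line.decode("utf-8")[0]
            if first_char not in offsets:
                offsets[first_char] = pos
            pos += len(line)
    return offsets

def ngram_counts(data_path, offsets, term, limit=100):
    """Return up to `limit` (term, count) pairs for entries starting with `term`."""
    key = term.encode("utf-8")
    results = []
    with open(data_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Seek to the first line beginning with the term's first character,
            # then read sequentially until past the region that can match.
            mm.seek(offsets.get(term[0], 0))
            for raw in iter(mm.readline, b""):
                entry, count = raw.rstrip(b"\n").split(b"\t")
                if entry.startswith(key):
                    results.append((entry.decode("utf-8"), int(count)))
                    if len(results) >= limit:
                        break
                elif entry > key:
                    break   # the file is sorted, so no later line can match
    return results

offsets = build_index("ngrams.txt")
print(ngram_counts("ngrams.txt", offsets, "公民権"))

In the actual tool the per-character and per-kana-pair offsets are precomputed and stored in the two small index files, so only the memory-mapped data file needs to be consulted for each query.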

Acknowledgements

Many people have contributed to the development of the Japanese N-gram tool. Among them are Slaven Bilak (then at Google Japan) who alerted the author to the release of the corpus, Tim Baldwin and the people at the University of Melbourne who provided access to the LDC resources and supported the reorganization of the files, and the people in the JMdict team who made many useful suggestions and contributions.

Jim Breen
May 2024