This paper describes a WWW-based tool (https://www.edrdg.org/~jwb/ngramcounts.html) for deriving the frequency counts of sequences of Japanese text from the Google Japanese N-Gram Corpus. The paper is in several sections: the compilation of the corpus; its contents, format and distribution; the development of the WWW-based tool; and the operation of the tool.
There is also an appendix in which the author describes the processes which led to the development of the tool.
Compilation of the Google N-Gram Corpus
The corpus was compiled within Google Japan in 2007 as a "20%" (side) project carried out by Taku Kudo and Hideto Kazawa, using a crawl of publicly accessible web pages made in July of that year. In summary, the n-gram counts were generated by extracting sentences from the crawled pages, segmenting them into morphemes (using the MeCab morphological analyzer with the IPADIC lexicon), and counting the occurrences of the resulting morpheme sequences.
A total of over 20 billion sentences was analyzed, with the numbers of unique n-grams ranging from about 2.5 million 1-grams to a peak of 776 million 5-grams (the corpus extends to 7-grams).
Contents, Format and Distribution of the Corpus
Each set of n-grams was sorted and divided into files of 10 million items each, totalling over 300 files. The compressed files amount to over 26GB of data. The collection of files and documentation was initially distributed as a set of six DVDs. In the distribution, the files of n-grams and counts were placed in a directory structure as shown in the example below for the 2-grams.
    data/2gms/
    data/2gms/2gm.idx
    data/2gms/2gm-0000.gz
    data/2gms/2gm-0001.gz
    data/2gms/2gm-0002.gz
    data/2gms/2gm-0003.gz
    data/2gms/2gm-0004.gz
    data/2gms/2gm-0005.gz
    data/2gms/2gm-0006.gz
    data/2gms/2gm-0007.gz
    data/2gms/2gm-0008.gz
The "2gm.idx" file is a text file listing the initial n-grams in the individual compressed data files.
    2gm-0000.gz  ! </S>
    2gm-0001.gz  ☆ 青のり
    2gm-0002.gz  たくさん 臨時
    2gm-0003.gz  も 追い返せ
    2gm-0004.gz  デバイス CD
    2gm-0005.gz  伯 朗
    2gm-0006.gz  引き続き 指摘
    2gm-0007.gz  発 叩き込も
    2gm-0008.gz  高かろ ー
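Since the files are in string order, the idx file can be used to determine which compressed file should hold a given n-gram. The following is a minimal sketch of such a lookup in Python; the tab-separated idx layout is an assumption.

    import bisect

    def find_ngram_file(idx_path, target):
        """Return the name of the compressed 2-gram file whose range
        covers `target`, a space-separated morpheme pair such as "学校 の"."""
        names, firsts = [], []
        with open(idx_path, encoding="utf-8") as f:
            for line in f:
                name, first = line.rstrip("\n").split("\t", 1)  # assumed layout
                names.append(name)
                firsts.append(first)
        # The target belongs in the last file whose first entry sorts <= target.
        i = bisect.bisect_right(firsts, target) - 1
        return names[max(i, 0)]

    # find_ngram_file("data/2gms/2gm.idx", "学校 の")
    # -> "2gm-0005.gz" for the index listing above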
The following is a sample of 3-grams from the 3gm-0027 file:
    天白 保呂 町  42
    天白 信仰 の  21
    天白 側 つや消し  30
    天白 児童 館  251
The n-gram corpus was first announced on the Google Japan blog site. The formal title of the corpus is "Japanese Web N-gram Version 1".
The initial distribution was made via the Language Resources Association (言語資源協会 - GSK) in Japan, with the DVD sets priced from ¥22,000 to ¥88,000. In 2009 the corpus was included in the catalogue of the Linguistic Data Consortium at the University of Pennsylvania. From the LDC it is available only by download; it is free to members and costs US$150 for non-members. The LDC site also includes the documentation of the corpus.
Development of a WWW-based Tool
The Japanese n-gram corpus is an invaluable resource for determining the relative frequencies of lexical items in Japanese. It is especially useful for examining the frequencies of text sequences, for such purposes as testing whether nouns are used in conjunction with the verb "する", whether verbs are used transitively, and so on.
As distributed, the corpus is relatively difficult to use for finding the frequencies of individual terms. This is due in part to the sheer size of the files, the division of the data into separate file sets according to the number of morphemes in each n-gram, and the fact that a single term of interest is often segmented into several morphemes, so that its count may lie in the 2-gram or 3-gram files rather than among the 1-grams.
While it is possible to use a tailored approach to extracting n-gram frequencies for particular investigations, it would be much better if there were a general and easy-to-use system which enabled the corpus to be examined for the presence of segments of Japanese text and matching counts returned.
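For instance, a tailored extraction might scan one of the compressed files and total the counts of entries matching a term, along the following lines (a minimal sketch; the tab-separated record layout is an assumption):

    import gzip

    def count_term_in_file(path, term):
        """Total the counts of entries in one n-gram file whose surface
        form (morphemes joined without spaces) equals `term`."""
        total = 0
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                ngram, count = line.rstrip("\n").rsplit("\t", 1)
                if ngram.replace(" ", "") == term:
                    total += int(count)
        return total

    # count_term_in_file("data/3gms/3gm-0027.gz", "公民権運動")

Repeating such a scan over the 300-plus files for every query is clearly impractical.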
Accordingly, an approach was developed which enables segments of Japanese text to be quickly matched against the corpus. This approach, which is described in detail in the Appendix, involved reorganizing the contents of the corpus and establishing an indexing system which enables the rapid location of specified text segments and the extraction of the counts. A software module was developed for locating and extracting corpus entries, and this module has been used both in web server programs and batch programs, with considerable success.
Operation of the WWW-based Tool
The WWW-based n-gram corpus tool is accessed by a single page at https://www.edrdg.org/~jwb/ngramcounts.html. Several tasks can be initiated from this page. All involve entering one or more strings of Japanese characters and, as appropriate, choosing one of the options.
The default operation, with no options selected, returns the counts for the terms entered. For example, entering "学校 學校 字体" results in the following page:
    Google N-gram Corpus Counts
    学校  48641365
    學校  27255
    字体  232177
A second option searches for the 10 (or 100) most common n-gram sequences which begin with the characters provided, displayed in descending count order. Sub-options allow the matches to be displayed in block order, grouped by the characters following the search key, or in tree order; the morpheme components of the displayed sequences, as determined by MeCab/Unidic, can also be shown. As an example, the 10 most common sequences starting with 学校 are shown below:
    Top 10 N-grams Lookup for 学校 (Frequency Order)
    学校  48633879
    学校の  7319568
    学校に  4398391
    学校で  3872167
    学校を  1881651
    学校が  1507934
    学校へ  1320734
    学校は  1223917
    学校・  1098759
    学校から  1072889
    学校教育  989795
Another option searches for the specified term combined with a selection of common particles and other affixes. The following shows part of the results for 勉強:
    勉強  31502952
    勉強は  664738
    勉強が  840674
    勉強な  68089
    勉強の  1308524
    ...
    勉強を  2339108
    勉強する  1644637
    勉強して  3341856
    勉強しない  310272
    ...
    の勉強  5018765
    に勉強  1186926
    を勉強  2021162
    ...
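The exact set of particles and affixes checked by the tool is not listed here; the sketch below assembles query strings using only the combinations visible in the example output above.

    # Particles and affixes drawn from the example output; the
    # tool's actual list is longer.
    PARTICLES = ["は", "が", "な", "の", "を"]
    SUFFIXES = ["する", "して", "しない"]
    PREFIX_PARTICLES = ["の", "に", "を"]

    def particle_queries(term):
        """Assemble the strings whose counts the option would look up."""
        queries = [term]
        queries += [term + p for p in PARTICLES]          # 勉強は, 勉強が, ...
        queries += [term + s for s in SUFFIXES]           # 勉強する, 勉強して, ...
        queries += [p + term for p in PREFIX_PARTICLES]   # の勉強, に勉強, ...
        return queries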
A further option searches for 25 common inflections of a verb. The following example is for 食べる:
    食べる  19179416
    食べます  1168041
    食べない  1882110
    食べぬ  1561
    食べず  354272
    食べません  281265
    食べた  10032006
    食べました  3397730
    食べなかった  227567
    .....
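For an ichidan verb such as 食べる, these inflections can be generated by simple suffix substitution, as in the sketch below; it covers only some of the 25 forms, and godan verbs would need different stem handling.

    def ichidan_inflections(verb):
        """Generate common inflections of an ichidan verb; an
        illustrative subset of the 25 forms the tool checks."""
        stem = verb[:-1]                       # 食べる -> 食べ
        endings = ["る", "ます", "ない", "ぬ", "ず", "ません",
                   "た", "ました", "なかった"]
        return [stem + e for e in endings]

    # ichidan_inflections("食べる")
    # -> ['食べる', '食べます', '食べない', '食べぬ', '食べず', ...]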
A final option returns the first 100 n-grams and counts starting from the search string. The following example shows the terms starting with 勉強:
    勉強  2871
    勉強  31500081
    勉強々  115
    勉強々々  53
    勉強あ  191
    勉強あい  198
    勉強あいう  51
    勉強あいうえ  141
    勉強あいうえ  51
    勉強あいうえおかき  46
    勉強あいうえおかきくい  46
    勉強あいうえおっ  141
    勉強あいうえおって  141
    勉強あいうえおってなあに  141
    .....
Note that both 勉強 and 勉強あいうえ have returned two counts. This is most likely the result of inconsistent segmentation when the n-grams were compiled: most of the time 勉強 was identified as a 1-gram, but on occasion it seems to have been treated as a 2-gram.
Also note that the counts for 勉強あいうえおっ, 勉強あいうえおって and 勉強あいうえおってなあに are the same. This indicates that in the original text collection 勉強あいうえおっ and 勉強あいうえおって only occurred as part of 勉強あいうえおってなあに, and hence were counted identically as 4-grams, 5-grams and 6-grams.
Conclusion
The WWW-based tool has proved to be very useful in several areas, especially in evaluating potential terms for inclusion in dictionaries. It has made the underlying data in the Google N-gram Corpus available to many people who would otherwise have been unable to use it.
The main problem with the corpus, and hence the search tool, is that the data is from 2007 and does not capture neologisms or changes in word usage. Another problem arises from occasional segmentation issues: the IPADIC lexicon used for the morphological segmentation was known to have some flaws; for example, the term 日本語, which is usually regarded as a morpheme pair (日本+語), was treated as a single morpheme.
Appendix - Data Structure and Tool Development
As mentioned in the article, the structure of the data in the original corpus distribution does not cater very well for the process of looking up the frequency counts for single terms or sets of terms. Most users wishing to find the frequency of a term such as 公民権運動 (civil rights movement) are not really interested in the fact that it consists of 3 morphemes (公民+権+運動) and is hence classed as a 3-gram.
In order to make the data available for flexible identification of term frequencies, it was reorganized: the morpheme-separating spaces were removed so that each n-gram became a single text string, the entries from all the n-gram lengths were combined, and the resulting strings and their counts were sorted into a single sequence in character order.
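A minimal sketch of this reorganization is shown below; the source file layout and the tab-separated record format are assumptions, and in practice the 26GB of data would be sorted externally (e.g. with the Unix sort command under LC_ALL=C, whose byte order matches the code-point order assumed by the lookup sketch further on).

    import glob
    import gzip

    def reorganize(src_glob, out_path):
        """Rewrite the distributed n-gram files as single text strings
        with their counts, ready for external sorting."""
        with open(out_path, "w", encoding="utf-8") as out:
            for path in sorted(glob.glob(src_glob)):
                with gzip.open(path, "rt", encoding="utf-8") as f:
                    for line in f:
                        ngram, count = line.rstrip("\n").rsplit("\t", 1)
                        # Drop the morpheme-separating spaces:
                        # 天白 児童 館 -> 天白児童館
                        out.write(ngram.replace(" ", "") + "\t" + count + "\n")

    # reorganize("data/*gms/*gm-*.gz", "ngrams_all.txt")
    # then e.g.:  LC_ALL=C sort ngrams_all.txt > ngrams_sorted.txt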
Although it would be possible to create a database containing the text strings and counts, and use database retrieval commands to find selected strings, this would result in an extremely large database. Instead, a simpler approach was devised which enables text strings to be retrieved rapidly: the sorted strings and counts are held in a single large data file, and two small index files record the strings occurring at intervals through it. A lookup uses the indexes to find the approximate position of the target string in the data file, then reads records sequentially from that point until the string is matched or passed.
While it may seem that sequential record reading is inefficient, in practice the process is quite rapid, as it involves very few instructions and takes advantage of the efficiencies of the operating system and file system when handling read-only files. The two index files are very small in comparison with the n-gram data file.
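As an illustration, the following sketch implements this style of lookup with a single-level index (the actual tool uses two index files); the file names, the tab-separated record layout and the index format of string/byte-offset pairs are all assumptions for the purpose of the example.

    import bisect

    def load_index(path):
        """Read an index of "first-string<TAB>byte-offset" lines, one
        per block of the sorted n-gram data file.  (Assumed format.)"""
        keys, offsets = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                key, off = line.rstrip("\n").rsplit("\t", 1)
                keys.append(key)
                offsets.append(int(off))
        return keys, offsets

    def lookup(data_path, keys, offsets, term):
        """Seek to the block that could hold `term`, then read records
        sequentially until it is found or passed."""
        i = max(bisect.bisect_right(keys, term) - 1, 0)
        total = 0
        with open(data_path, "rb") as f:
            f.seek(offsets[i])   # offsets assumed to fall on record boundaries
            for raw in f:
                string, count = raw.decode("utf-8").rstrip("\n").rsplit("\t", 1)
                if string == term:
                    total += int(count)   # sum duplicated entries (cf. the 勉強 example)
                elif string > term:
                    break                 # sorted order: no further matches possible
        return total

    # keys, offsets = load_index("ngrams_sorted.idx")
    # lookup("ngrams_sorted.txt", keys, offsets, "学校")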
Acknowledgements
Many people have contributed to the development of the Japanese N-gram tool. Among them are Slaven Bilac (then at Google Japan), who alerted the author to the release of the corpus; Tim Baldwin and the people at the University of Melbourne, who provided access to the LDC resources and supported the reorganization of the files; and the people in the JMdict team, who made many useful suggestions and contributions.
Jim Breen
May 2024