WWW Search Engines and Japanese Text

James Breen
Clayton School of Information Technology, Monash University
email: Jim.Breen@infotech.monash.edu.au

(Paper presented at the Sixth Symposium on Natural Language Processing 2005 (SNLP 2005), Chiang Rai, Thailand, December 2005)

Abstract

This paper reports on an investigation of the handling of Japanese text by two major commercial WWW search engines: Google and Yahoo. While their handling was generally satisfactory, a number of problems were identified with both engines, as well as some differences between them.

1. Introduction

For many people using the WWW for linguistic purposes, such as translators and lexicographers, the major access mechanisms have been the commercial search engines, which regularly collect and index pages of text, a process which may also include language-specific activities such as conversion of inflections, removal of stop words, etc.

The use of search engines with WWW pages in Japanese, currently the second most common language on the WWW, raises some particular challenges. As Japanese is written without any inter-word markers such as spaces, a parser or segmenter must be used on the page text prior to indexing, and on occasions it may be appropriate to parse the search strings as well and conduct the search on separate component words. The parsing strategies in both cases may be critical to the success of the search engine.

Another challenge is the accurate discrimination between Japanese and Chinese as they use related character sets.

It must be recognized that the major commercial search engines have not been constructed to aid linguistic analysis of WWW pages, as their raison d'être is to present ranked "relevant" pages to the user, often accompanied by advertisements. As the user can view at most the first 1,000 pages identified, and has no control over the order of presentation, traditional criteria such as precision and recall cannot be applied.

In this paper we report on a study of the behaviour of two major search engines, Google and Yahoo/AltaVista, when dealing with Japanese text. Relatively little is publicly available about the language-specific NLP processing within these search engines, although it is known that both Google and Yahoo use Basis Technology Corp's Rosette Language Analyzer software (Basis, 2005). The main metric used is the number of pages that each search engine reports it has indexed with the requested search key(s). As a measure this is known to be both crude and unreliable (Véronis, 2005); however, it can be taken as a broad indication of the relative outcomes of the indexing and searching.

2. Language discrimination

Determining the language of a WWW page can be difficult, particularly when dealing with pages using the Latin alphabet, as few pages use the HTML language indicator. With Japanese, where the text typically contains characters from the hiragana and katakana syllabaries which are unique to that language, most identification should be relatively straightforward. In fact a user can ensure only Japanese pages are encountered by adding a hiragana character such as の (no), which is highly likely to appear in Japanese text. We noted that for all the search engines, the Japanese language restriction option generally had the same outcome as adding a の. For Yahoo, however, the restriction was only effective when using the yahoo.co.jp engine. That site and the google.co.jp site have been used in this study.
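The kana-based identification described above can be sketched as follows. This is a minimal illustration only, not the method used by either search engine; the function names and the three-way result labels are assumptions made for the example.

```python
def contains_kana(text: str) -> bool:
    """Return True if text has at least one hiragana or katakana letter."""
    for ch in text:
        code = ord(ch)
        # Hiragana letters: U+3041..U+3096; katakana letters: U+30A1..U+30FA
        if 0x3041 <= code <= 0x3096 or 0x30A1 <= code <= 0x30FA:
            return True
    return False


def guess_cjk_language(text: str) -> str:
    """Crude guess: 'ja' if kana is present, 'zh-or-ja' if only CJK
    ideographs are present, otherwise 'other'."""
    if contains_kana(text):
        return "ja"
    # CJK Unified Ideographs basic block: U+4E00..U+9FFF
    if any(0x4E00 <= ord(ch) <= 0x9FFF for ch in text):
        return "zh-or-ja"
    return "other"


print(guess_cjk_language("場合は未収入金へ"))
print(guess_cjk_language("社会"))
```

A short kanji-only string such as 社会 cannot be assigned to either language by this test alone, which is why the discrimination is only straightforward for text that contains kana.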

The language discrimination was tested using some short kanji compounds which are also used in Chinese, 社会 (society) and 世界 (the world), and some longer compounds unique to Japanese, 未収入金 (accounts receivable) and 転換社債 (convertible bond). The reported page counts were evaluated for several language specifications: None, Japanese, and Chinese (both traditional and simplified hanzi).

	Google				Yahoo
Search Query	None	ja	zh (s)	zh (t)	None	ja	zh (s)	zh (t)
社会	52.9M	27.4M	45.8M	6,370	481M	166M	315M	83,300
世界	117M	39M	53.3M	11.5M	762M	252M	465M	45.6M
未収入金	224k	224k	20,600	20,600	240k	240k	5	1
転換社債	572k	570k	1	2	803k	805k	3	1

Table 1. Language discrimination

Examination of samples of pages confirmed that these identifications were correct, except in the cases of 未収入金 and 転換社債 where all the sampled pages identified as Chinese were actually in Japanese. It is difficult to conclude whether this is a significant problem.

3. Parsing Issues

3.1 Search String Parsing

It is usual for search engines to provide options for "All Words" and "Exact Phrase" searches. The operation of these options was tested with a number of compound words, both complete and with the components separated. The results in Table 2 show the outcome for the 未収入金 and 転換社債 compounds used above, and also for the (loanword) katakana phrase コンクリートブロック (konkurîtoburokku: concrete block).

	All Words		Exact Phrase
Search Query	Google	Yahoo	Google	Yahoo
未収入金	224k	200k	185k	200k
未収+入金	224k	239k	N/A	N/A
転換社債	562k	760k	439k	760k
転換+社債	562k	760k	N/A	N/A
コンクリートブロック	165k	356k	165k	356k
コンクリート+ブロック	689k	1.23M	N/A	N/A

Table 2. Search string parsing

It is apparent from this that in the All Words option, Google is parsing long kanji compounds into their components and searching for pages containing both, whereas Yahoo treats the compounds as single words. This conclusion is supported when one examines the text samples returned by Google. The target word(s) are highlighted, and in the case of the text "場合は未収入金へ" the markup highlights 未収 and 入金 as two separate elements. As this markup is used in the results of both types of search, one can conclude that the components are indexed separately and an Exact Phrase search looks for adjacent occurrences.

In the case of the phrase コンクリートブロック, examination of the markup confirms that it too has been parsed into コンクリート (concrete) and ブロック (block) for indexing; however, no parsing is done on the search string. A search for a different fragmentation, e.g. コンクリー and トブロック, returns only a small number of results, which are usually caused by line-breaks in the source pages.
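The difference between the two search options over separately indexed components can be illustrated with a toy positional index. This is a sketch under stated assumptions: the pages are hypothetical and already segmented, and nothing here is claimed to reflect the actual index structures of Google or Yahoo.

```python
from collections import defaultdict


def build_index(pages):
    """Map each token to {page_id: [positions]}; pages are pre-segmented token lists."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, tokens in pages.items():
        for pos, tok in enumerate(tokens):
            index[tok][page_id].append(pos)
    return index


def all_words(index, terms):
    """Pages containing every term, in any position."""
    sets = [set(index[t]) if t in index else set() for t in terms]
    return set.intersection(*sets) if sets else set()


def exact_phrase(index, terms):
    """Pages where the terms occur as adjacent tokens in the given order."""
    hits = set()
    for page in all_words(index, terms):
        starts = index[terms[0]][page]
        if any(all(s + i in index[t][page] for i, t in enumerate(terms))
               for s in starts):
            hits.add(page)
    return hits


# Hypothetical pre-segmented pages.
pages = {
    "p1": ["場合", "は", "未収", "入金", "へ"],
    "p2": ["入金", "が", "未収", "の", "場合"],
    "p3": ["未収", "金"],
}
index = build_index(pages)
print(sorted(all_words(index, ["未収", "入金"])))
print(sorted(exact_phrase(index, ["未収", "入金"])))
```

In this toy index an All Words search for the two components matches any page containing both, while an Exact Phrase search matches only pages where they are adjacent, which is the behaviour inferred above from the Google result markup.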

It is also observed that the parsing of search strings in the All Words option seems to be restricted to extended compounds. When given an English sentence such as "Half of the melon was eaten", most search engines will remove the stop-words (of, the, was) and search for pages with (half, melon, eaten). An equivalent Japanese sentence, e.g. "メロンが半分食べられた", is always handled as a single string. Presumably this reflects either a tactical decision on the part of the search engine companies, or a need to limit the complexity of parsing search strings.

3.2 Interaction of Page and Search-key Parsing

Many kanji words are formed by an affixation process. For example the noun 可能性 (kanôsei: potentiality) is formed from 可能 (potential) and 性 (nature, gender), and 不動産 (fudôsan: real estate) is from 不動 (immobility) and 産 (products). Japanese morphological analyzers, such as ChaSen (Matsumoto et al., 2002), will usually segment such words. Search engines, however, usually index them in their entirety, with the result that searches based on their components will not detect them.

Search Query	Google	Yahoo
不動産	6.37M	76.6M
不動+産	653k	882k
不動+産 not 不動産	335k	604k
可能性	17M	112M
可能+性	9.95M	164M
可能+性 not 可能性	6.93M	53.2M

Table 3. Interaction of index and search parsing

The results from these two compounds can be interpreted as follows:

for 不動産, the whole compound has usually been indexed as a single word. Searching for the components 不動 and 産 returns a much smaller set of pages, a proportion of which also contain 不動産. (Interestingly in this case Google will not match with the common word 不動産屋 (realtor) because that is indexed as 不動 and 産屋 (maternity room));
for 可能性, Google has behaved the same as with 不動産. Yahoo, however, when searching for the components 可能 and 性, appears to have returned the number of pages containing 可能 and 性, and also those containing 可能性. This is confirmed by inspection of the text in the indicated pages. In fact the returned page count suggests that for 可能+性, the count might consist of the sum of the pages with 可能+性 and the pages with 可能性, even though these sets overlap.
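The figures in Table 3 can be cross-checked with simple arithmetic. The sketch below uses only the counts from the table (converted to thousands of pages); the function names are assumptions for illustration, and the calculation says nothing about how either engine actually derives its counts, which are in any case approximate.

```python
# Page counts copied from Table 3, in thousands of pages.
TABLE3 = {
    ("Google", "不動産"): {"whole": 6370, "parts": 653, "parts_not_whole": 335},
    ("Yahoo", "不動産"): {"whole": 76600, "parts": 882, "parts_not_whole": 604},
    ("Google", "可能性"): {"whole": 17000, "parts": 9950, "parts_not_whole": 6930},
    ("Yahoo", "可能性"): {"whole": 112000, "parts": 164000, "parts_not_whole": 53200},
}


def overlap_share(row):
    """Share of component-search hits that also contain the whole compound."""
    return (row["parts"] - row["parts_not_whole"]) / row["parts"]


def sum_gap(row):
    """Relative gap between the reported component-search count and
    (whole-compound count + component-search count excluding the compound)."""
    predicted = row["whole"] + row["parts_not_whole"]
    return abs(predicted - row["parts"]) / row["parts"]


for (engine, word), row in TABLE3.items():
    print(engine, word,
          f"overlap={overlap_share(row):.0%}",
          f"sum_gap={sum_gap(row):.0%}")
```

Only the Yahoo row for 可能性 has a component-search count that is within about 1% of the whole-compound count plus the "not" count; in the other three rows the component search returns far fewer pages than the whole compound.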

A further interaction between the parsing of page text and search string can be seen in the handling of extended inflections of verbs and adjectives. Table 4 shows the results for 暖かくなかったり (while (it) was not warm) and 食べられなかった (not eaten) for progressively reduced strings.

Search Query	Google	Yahoo	Remarks
暖かくなかったり	58	701k	Yahoo has matched on 暖かく + なかったり
暖かくなかった	1,080	3,030
暖かくなかっ	16	1.44M	Yahoo has matched on 暖かく + なかっ
暖かくなか	51	2.16M	Yahoo has matched on 暖かく + なか; Google has also matched on 暖かくなかっ
暖かくな	23k	15,800	Many matches are on 暖かくなる, etc.
暖かく	4.52M	11.5M	Usual adverb form
食べられなかった	339k	905k
食べられなかっ	332	206k	Yahoo has matched on 食べられな + かっ
食べられなか	547	326k	Yahoo has matched on 食べられ + なか
食べられな	547	1.35M	Yahoo has often matched on 食べられない, 食べられ + な, etc.
食べられ	1.41M	1.42M	Google matches include: 食べられない, 食べられなかった, etc. Yahoo has matched on 食べられます, 食べられて, etc.
食べら	20.3k	71.5k

Table 4. Inflections

From this it appears that both search engines are attempting to produce correct behaviour for the valid forms: 暖かく, 食べられなかった, etc. For the fragmentary forms, Google reports where they occur, which from inspection are largely cases of line breaks, abbreviations, etc., whereas Yahoo has often parsed the search string and returned large numbers of relatively irrelevant matches.

3.3 Parsing of Uncommon Words

As the parsers used by search engines are most likely based on lexicons, it is instructive to examine their behaviour with words not in their lexicons. This is illustrated by the handling of two rare kanji compounds, 印電 and 最限, discussed in (Breen, 2004).

	All Words		Exact Phrase
Search Query	Google	Yahoo	Google	Yahoo
印電	2.31M	80	1,590	78
最限	1.53M	315	518	315

Table 5. Uncommon word handling

The large number of pages indicated by Google is a result of the search keys being parsed into their constituent kanji, and the return of pages where those kanji are both singly indexed. It is only when "Exact Phrase" is selected that the results become meaningful.
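The fallback to single kanji can be illustrated with a greedy longest-match segmenter over a small lexicon. This is a sketch only: the lexicon is hypothetical, and longest-match is a stand-in technique chosen for brevity, not the algorithm of the analyzers actually used by the search engines.

```python
def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation; text not covered by the
    lexicon falls back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; size 1 always succeeds.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical lexicon containing only common compounds.
lexicon = {"未収", "入金", "転換", "社債"}
print(segment("未収入金", lexicon))
print(segment("印電", lexicon))
```

With this lexicon the known compound is split into its two components, while the rare compound 印電 falls apart into 印 and 電, so an All Words search would match any page containing both kanji anywhere.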

4. Other Issues

4.1 Orthographical Variants

Japanese allows for considerable flexibility in orthography. For example sashimi can be written 刺身 or 刺し身. Little attempt is made in search engines to index canonical forms of these words, with the result that different pages are found according to the form used. One exception is an attempt by Google to regularize pairs of variant forms of katakana loanwords (Google, 2005). Only a few such pairs, such as ダイアモンド/ダイヤモンド (diamond) and コンピュータ/コンピューター (computer), appear to be handled at present.
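One way a search front end could offer variant matching as an option is to expand a query to every spelling that shares a canonical form. The sketch below is illustrative only: the variant table is hypothetical, it contains just the pairs mentioned above, and the choice of which spelling serves as the canonical key is arbitrary.

```python
# Hypothetical variant table: each variant spelling maps to one canonical key.
VARIANTS = {
    "刺し身": "刺身",
    "ダイアモンド": "ダイヤモンド",
    "コンピューター": "コンピュータ",
}


def canonical(term: str) -> str:
    """Return the canonical spelling for a term (the term itself if unknown)."""
    return VARIANTS.get(term, term)


def expand_query(term: str) -> set:
    """All spellings sharing the term's canonical form, including the canonical one."""
    key = canonical(term)
    return {key} | {v for v, c in VARIANTS.items() if c == key}


print(sorted(expand_query("刺身")))
print(sorted(expand_query("コンピューター")))
```

A search for either spelling would then be run against both forms, so that the set of pages found no longer depends on the form the user happened to type.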

4.2 Non-text Characters

In addition to kanji, hiragana, katakana, alphabetics and numerics, Japanese text may contain other symbols which are regarded as part of the text. They include the kana repetition symbols (ヽ,ヾ,ゝ,ゞ), the kanji repetition symbol (々) and the kanji "zero" (〇). In the search engines these are usually treated as text characters, enabling, for example, the name of the poet 金子みすゞ (KANEKO Misuzu) to be searched. A surprising omission is Google's handling of the 々 symbol, which it ignores. This prevents indexing and searching of very common words such as 時々 (tokidoki: sometimes).
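The effect of ignoring the 々 symbol can be shown with two character classes, one that treats the repetition mark and the kanji zero as part of a kanji run and one that does not. This is a simplified sketch; the patterns and function name are assumptions, and real tokenizers are considerably more elaborate.

```python
import re

# Kanji run that also accepts the repetition mark 々 (U+3005) and 〇 (U+3007).
KANJI_RUN = re.compile(r"[\u4E00-\u9FFF\u3005\u3007]+")
# The same pattern without the two symbols, mimicking an indexer that ignores them.
KANJI_ONLY = re.compile(r"[\u4E00-\u9FFF]+")


def kanji_runs(text, pattern=KANJI_RUN):
    """Return the maximal kanji runs in text according to the given pattern."""
    return pattern.findall(text)


print(kanji_runs("時々は雨"))
print(kanji_runs("時々は雨", KANJI_ONLY))
```

With the narrower class the word 時々 is reduced to the single kanji 時, so it can be neither indexed nor searched as a word.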

4.3 Punctuation

Both search engines match search strings across punctuation and whitespace characters. For example, when searching for 印電 as an Exact Phrase, matches were returned for "雷神の印(電)太陽" and "空印、電爪、撃牙も". This is presumably because such punctuation characters are ignored during indexing. In any case it is unfortunate, as it leads to a number of incorrect matches.

On the other hand, neither search engine indexes across the HTML line-break (<br>), which unlike in English can be validly inserted mid-word in Japanese.
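The punctuation behaviour can be reproduced with a small comparison between a match that discards punctuation before searching and one that does not. This is a sketch of the presumed effect only; the use of Unicode general categories to decide what counts as punctuation is an assumption of the example, not a description of either engine.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode general category is
    punctuation (P*) or separator (Z*)."""
    return "".join(ch for ch in text
                   if unicodedata.category(ch)[0] not in ("P", "Z"))


def loose_match(phrase: str, text: str) -> bool:
    """Match after punctuation and separators have been discarded."""
    return phrase in strip_punctuation(text)


def strict_match(phrase: str, text: str) -> bool:
    """Match only if the phrase occurs contiguously in the original text."""
    return phrase in text


for sample in ("雷神の印(電)太陽", "空印、電爪、撃牙も"):
    print(sample, loose_match("印電", sample), strict_match("印電", sample))
```

Both sample strings match 印電 only under the loose comparison, which corresponds to the incorrect Exact Phrase matches reported above.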

4.4 Differences in Page Counts

It will be noticed that the two search engines used in this study produce quite different numbers of matched pages, with the differences ranging from quite small to an order of magnitude. Analysis of these differences is not the topic of this study. It is likely that both sets of counts of matched pages are only approximations.

4.5 Wildcards and Stemming

Some search engines allow for some form of wildcards in searches; for example, Google can use an "*" to indicate words to be ignored in a search for a phrase. Google also now allows for a degree of stemming of words in a search key. Neither of these features was detected in either search engine for Japanese search keys.

5. Conclusion

The following specific conclusions have been drawn from this investigation:

the language discrimination appears to be quite effective for Japanese, although it is of concern that with Yahoo the function is site-specific;
the parsing of page text, which is the key to the operation of search engines with Japanese text, appears to be quite effective;
the segmentation of search strings, which occurs with Google for some long kanji compounds and with Yahoo for long inflections, can produce confusing and erroneous results, to the extent that many users have commented they always use the Exact Phrase option. The occasional segmentation of search strings may be of little real benefit in Japanese;
there is considerable room for improvement in the area of searching for words with multiple orthographical variants, at least as an option;
the failure by Google to treat the 々 character as part of text is a flaw that should be remedied;
the practice of indexing Japanese text across punctuation, parentheses, etc. has no apparent justification, and leads to erroneous results.

In general both search engines are quite effective in their handling of Japanese, although there are clearly some problems and scope for improvement. Some of the shortcomings are possibly due to the systems being originally developed for languages like English. Users need to be aware of the nature of the parsing and indexing in order to make full use of the engines.

6. Caveat

The results reported in this paper were correct at the time of writing. Search engine software is being continually modified, and there is some evidence of search engine behaviour being modified as a result of critical comments in reports (Véronis, 2005). Accordingly the future behaviour of search engines may not be as reported here.

(March 2006 update. Google now handles the 々 character correctly.)

7. Acknowledgement

The assistance of Mr Paul Blay in the preparation of this paper is gratefully acknowledged.

References

Basis Technology Corp. 2005. Customers - Google, http://www.basistech.com/customers/

James Breen. 2004. Expanding the Lexicon: the Search for Abbreviations, Papillon (Multi-lingual Dictionary) Project Workshop, Grenoble, August-September 2004.

Google. 2005. Katakana no hyoukiyure, http://www.google.co.jp/intl/ja/help/basics.html

Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Hiroshi Matsuda, Kazuma Takaoka, and Masayuki Asahara. 2002. Morphological Analysis System ChaSen version 2.2.9 Manual. Nara Institute of Science and Technology. http://chasen.aist-nara.ac.jp/hiki/ChaSen/

Jean Véronis. 2005. Google's missing pages: mystery solved? http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html