(Paper presented at the Sixth Symposium on Natural Language Processing 2005 (SNLP 2005), Chiang Rai, Thailand, December 2005)
For many people using the WWW for linguistic purposes, such as translators and lexicographers, the major access mechanisms have been the commercial search engines, which regularly collect and index pages of text, a process which may also include language-specific activities such as conversion of inflections, removal of stop words, etc.
The use of search engines with WWW pages in Japanese, currently the second most common language on the WWW, raises some particular challenges. As Japanese is written without any inter-word markers such as spaces, a parser or segmenter must be used on the page text prior to indexing, and on occasions it may be appropriate to parse the search strings as well and conduct the search on separate component words. The parsing strategies in both cases may be critical to the success of the search engine.
Another challenge is the accurate discrimination between Japanese and Chinese as they use related character sets.
It must be recognized that the major commercial search engines have not been constructed to aid linguistic analysis of WWW pages, as their raison d'être is to present ranked "relevant" pages to the user, often accompanied by advertisements. As the user can view at most the first 1,000 pages identified, and has no control over the order of presentation, traditional criteria such as precision and recall cannot be applied.
In this paper we report on a study of the behaviour of two major search engines, Google and Yahoo/AltaVista, when dealing with Japanese text. Relatively little is publicly available about the language-specific NLP processing within these search engines, although it is known that both Google and Yahoo use Basis Technology Corp's Rosette Language Analyzer software (Basis, 2005). The main metric used is the number of pages that each search engine reports it has indexed with the requested search key(s). As a measure this is known to be both crude and unreliable (Véronis, 2005); however, it can be taken as a broad indication of the relative outcomes of the indexing and searching.
2. Language discrimination
Determining the language of a WWW page can be difficult, particularly when dealing with pages using the Latin alphabet, as few pages use the HTML language indicator. With Japanese, where the text typically contains characters from the hiragana and katakana syllabaries which are unique to that language, most identification should be relatively straightforward. In fact a user can ensure only Japanese pages are encountered by adding a hiragana character such as の (no), which is highly likely to appear in Japanese text. We noted that for all the search engines, the Japanese language restriction option generally had the same outcome as adding a の; for Yahoo, however, the restriction was only effective when using the yahoo.co.jp engine. That site and the google.co.jp site have been used in this study.
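The kana-based reasoning above can be sketched in a few lines of Python. This is only an illustration of the principle, not a description of how either search engine identifies languages. The function names and the three-way labels are hypothetical, and the check relies solely on the Unicode ranges for hiragana and katakana letters.

```python
def contains_kana(text: str) -> bool:
    """Return True if the text has at least one hiragana or katakana letter."""
    for ch in text:
        code = ord(ch)
        # Hiragana letters: U+3041..U+3096; katakana letters: U+30A1..U+30FA
        if 0x3041 <= code <= 0x3096 or 0x30A1 <= code <= 0x30FA:
            return True
    return False


def guess_language(text: str) -> str:
    """Crude guess (hypothetical labels): kana implies Japanese;
    ideographs alone are ambiguous between Chinese and Japanese."""
    if contains_kana(text):
        return "ja"
    # CJK Unified Ideographs basic block: U+4E00..U+9FFF
    if any(0x4E00 <= ord(ch) <= 0x9FFF for ch in text):
        return "zh-or-ja"
    return "other"
```

A string such as 場合は未収入金へ is labelled "ja" because of the hiragana は and へ, whereas a bare compound such as 社会 remains ambiguous, which is why page-level context is needed for short kanji-only keys.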
The language discrimination was tested using some short kanji compounds which are also used in Chinese, 社会 (society) and 世界 (the world), and some longer compounds unique to Japanese, 未収入金 (accounts receivable) and 転換社債 (convertible bond), and evaluating the reported pages for several language specifications: None, Japanese and Chinese (both traditional and simplified hanzi).
|Search Query||None||ja||zh (s)||zh (t)||None||ja||zh (s)||zh (t)|
Examination of samples of pages confirmed that these identifications were correct, except in the cases of 未収入金 and 転換社債 where all the sampled pages identified as Chinese were actually in Japanese. It is difficult to conclude whether this is a significant problem.
3. Parsing Issues
3.1 Search String Parsing
It is usual for search engines to provide options for "All Words" and "Exact Phrase" searches. The operation of these options was tested with a number of compound words, both complete and with the components separated. The results in Table 2 show the outcome for the 未収入金 and 転換社債 compounds used above, and also for the (loanword) katakana phrase コンクリートブロック (konkurîtoburokku: concrete block).
|All Words||Exact Phrase|
It is apparent from this that in the All Words option, Google is parsing long kanji compounds into their components and searching for pages containing both, whereas Yahoo treats the compounds as single words. This conclusion is supported when one examines the text samples returned by Google. The target word(s) are highlighted and in the case of the text: "場合は未収入金へ" the markup is: "場合は<b>未収</b><b>入金</b>へ". As this markup is used in the results of both types of search, one can conclude that the components are indexed separately and an Exact Phrase search looks for adjacent occurrences.
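The inferred behaviour, components indexed separately with Exact Phrase implemented as an adjacency test, corresponds to a conventional positional inverted index. The following Python sketch illustrates that idea only. The function names, the sample documents and the segmentation of the surrounding text are assumptions made for the example, and nothing here is known about the engines' actual data structures.

```python
from collections import defaultdict


def build_index(docs):
    """docs maps doc_id -> list of already-segmented tokens.
    Returns token -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            index[tok][doc_id].append(pos)
    return index


def all_words(index, terms):
    """Documents containing every term, in any position."""
    doc_sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


def exact_phrase(index, terms):
    """Documents where the terms occur at consecutive positions."""
    hits = set()
    for doc_id in all_words(index, terms):
        for start in index[terms[0]][doc_id]:
            if all((start + i) in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical documents; the split into 未収 + 入金 follows the markup quoted above.
docs = {
    "a": ["場合", "は", "未収", "入金", "へ"],
    "b": ["入金", "が", "未収", "の", "場合"],
}
index = build_index(docs)
```

With this index, an All Words search for 未収 and 入金 returns both documents, while an Exact Phrase search returns only document "a", where the two components are adjacent.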
In the case of the phrase コンクリートブロック, examination of the markup confirms that it too has been parsed into コンクリート (concrete) and ブロック (block) for indexing; however, no parsing is done on the search string. A search for a different fragmentation, e.g. コンクリー and トブロック, only returns a small number of results, which are usually caused by line-breaks in the source pages.
It is also observed that the parsing of search strings in the All Words option seems to be restricted to extended compounds. When given an English sentence such as "Half of the melon was eaten", most search engines will remove the stop-words (of, the, was) and search for pages with (half, melon, eaten). An equivalent Japanese sentence, e.g. "メロンが半分食べられた" is always handled as a single string. Presumably this is either a tactical decision on the part of the search engine companies, or a need to limit the complexity of parsing search strings.
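The contrast between the two sentences can be illustrated with a naive whitespace-based stop-word filter. This is a sketch under stated assumptions: the stop-word set contains only the three words mentioned above, and the function name is hypothetical. It is not a claim about how any search engine processes queries.

```python
STOP_WORDS = {"of", "the", "was"}  # illustrative only, taken from the example above


def content_terms(query):
    """Split on whitespace, lower-case, and drop stop-words."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]
```

Applied to "Half of the melon was eaten" this yields half, melon and eaten. Applied to メロンが半分食べられた it yields the whole sentence as one term, since there is no whitespace to split on, so any stop-word style treatment of Japanese queries would first require segmenting the search string.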
3.2 Interaction of Page and Search-key Parsing
Many kanji words are formed by an affixation process. For example the noun 可能性 (kanôsei: potentiality) is formed from 可能 (potential) and 性 (nature, gender), and 不動産 (fudôsan: real estate) is from 不動 (immobility) and 産 (products). Japanese morphological analyzers, such as ChaSen (Matsumoto et al., 2002), will usually segment such words. Search engines, however, usually index them in their entirety, with the result that searches based on their components will not detect them.
The results from these two compounds can be interpreted as follows:
A further interaction between the parsing of page text and search string can be seen in the handling of extended inflections of verbs and adjectives. Table 4 shows the results for 暖かくなかったり (while (it) was not warm) and 食べられなかった (not eaten) for progressively reduced strings.
|Search String||Google||Yahoo||Comments|
|暖かくなかったり||58||701k||Yahoo has matched on 暖かく + なかったり|
|暖かくなかっ||16||1.44M||Yahoo has matched on 暖かく + なかっ|
|暖かくなか||51||2.16M||Yahoo has matched on 暖かく + なか; Google has also matched on 暖かくなかっ|
|暖かくな||23k||15,800||Many matches are on 暖かくなる, etc.|
|暖かく||4.52M||11.5M||Usual adverb form|
|食べられなかっ||332||206k||Yahoo has matched on 食べられな + かっ|
|食べられなか||547||326k||Yahoo has matched on 食べられ + なか|
|食べられな||547||1.35M||Yahoo has often matched on 食べられない, 食べられ + な, etc.|
|食べられ||1.41M||1.42M||Google matches include: 食べられない, 食べられなかった, etc. Yahoo has matched on 食べられます, 食べられて, etc.|
From this it appears that both search engines are attempting to produce correct behaviour for the valid forms: 暖かく, 食べられなかった, etc. For the fragmentary forms, Google reports where they occur, which from inspection are largely cases of line breaks, abbreviations, etc., whereas Yahoo has often parsed the search string and returned large numbers of relatively irrelevant matches.
3.3 Parsing of Uncommon Words
As the parsers used by search engines are most likely based on lexicons, it is instructive to examine their behaviour with words not in their lexicons. This is illustrated by the handling of two rare kanji compounds, 印電 and 最限 (discussed in Breen, 2004).
|All Words||Exact Phrase|
The large number of pages indicated by Google is a result of the search keys being parsed into their constituent kanji, and the return of pages where those kanji are both singly indexed. It is only when "Exact Phrase" is selected that the results become meaningful.
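The fallback to single kanji is what one would expect from a simple lexicon-driven segmenter. The sketch below uses greedy longest-match segmentation purely as a stand-in technique. The actual parsers are not publicly documented, and the tiny lexicon and the function name are assumptions for illustration.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation. Any character not covered by a
    lexicon entry is emitted as a single-character token."""
    max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical lexicon containing only the components discussed in this paper.
LEXICON = {"未収", "入金", "転換", "社債"}
```

Under these assumptions 未収入金 segments into 未収 and 入金, while the out-of-lexicon compound 印電 falls apart into 印 and 電. An All Words search on those two single kanji would then match any page containing both characters anywhere, which is consistent with the inflated page counts.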
4. Other Issues
4.1 Orthographical Variants
Japanese allows for considerable flexibility in orthography. For example sashimi can be written 刺身 or 刺し身. Little attempt is made in search engines to index canonical forms of these words, with the result that different pages are found according to the form used. One exception is an attempt by Google to regularize pairs of variant forms of katakana loanwords (Google,2005). Only a few such pairs, such as ダイアモンド/ダイヤモンド (diamond) and コンピュータ/コンピューター (computer) appear to be handled at present.
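Indexing canonical forms could, in its simplest form, be done with a variant table consulted at both indexing and query time. The sketch below is hypothetical: the table holds only the pairs mentioned above, the choice of which member of each pair is canonical is arbitrary, and it does not describe Google's actual mechanism.

```python
# Hypothetical variant table: variant spelling -> arbitrarily chosen canonical form.
VARIANTS = {
    "ダイヤモンド": "ダイアモンド",
    "コンピューター": "コンピュータ",
    "刺し身": "刺身",
}


def canonical(token):
    """Map a token to its canonical form; unknown tokens are returned unchanged."""
    return VARIANTS.get(token, token)


def same_word(a, b):
    """True if two spellings normalize to the same canonical form."""
    return canonical(a) == canonical(b)
```

A table-driven approach like this only covers pairs that have been explicitly listed, which matches the observation that few variant pairs appear to be handled.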
4.2 Non-text Characters
In addition to kanji, hiragana, katakana, alphabetics and numerics, Japanese text may contain other symbols which are regarded as part of the text. They include the kana repetition symbols (ヽ,ヾ,ゝ,ゞ), the kanji repetition symbol (々) and the kanji "zero" (〇). In the search engines these are usually treated as text characters, enabling, for example, the name of the poet 金子みすゞ (KANEKO Misuzu) to be searched. A surprising omission is Google's handling of the 々 symbol, which it ignores. This prevents indexing and searching of very common words such as 時々 (tokidoki: sometimes).
Both search engines match search strings across punctuation and whitespace characters. For example, when searching for 印電 as an Exact Phrase, matches were returned for "雷神の印(電)太陽" and "空印、電爪、撃牙も". This is presumably because such punctuation characters are ignored during indexing. In any case it is unfortunate, as it leads to a number of incorrect matches.
On the other hand, neither search engine indexes across the HTML line-break (<br>), which unlike in English can be validly inserted mid-word in Japanese.
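The presumed cause of the matches across punctuation, namely punctuation being discarded before indexing, can be reproduced with a short Python sketch. The function name is hypothetical, and using Unicode general categories to decide what counts as punctuation is an assumption made for the example rather than a known property of either engine.

```python
import unicodedata


def strip_punctuation(text):
    """Drop characters whose Unicode general category is punctuation (P*)
    or separator (Z*), keeping all other characters in order."""
    return "".join(
        ch for ch in text if unicodedata.category(ch)[0] not in ("P", "Z")
    )
```

After stripping, both "雷神の印(電)太陽" and "空印、電爪、撃牙も" contain 印 immediately followed by 電, so a phrase search for 印電 over the stripped text would match them even though the original strings do not contain that compound.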
4.4 Differences in Page Counts
It will be noticed that the two search engines used in this study produce quite different numbers of matched pages, with the differences ranging from quite small to an order of magnitude. Analysis of these differences is not the topic of this study. It is likely that both sets of counts of matched pages are only approximations.
4.5 Wildcards and Stemming
Some search engines allow for some form of wildcards in searches; for example, Google has the ability to use an "*" to indicate words to be ignored in a search for a phrase. Google also now allows for a degree of stemming of words in a search key. Neither of these features was detected in either search engine for Japanese search keys.
The following specific conclusions have been drawn from this investigation:
The results reported in this paper were correct at the time of writing. Search engine software is being continually modified, and there is some evidence of search engine behaviour being modified as a result of critical comments in reports (Véronis, 2005). Accordingly the future behaviour of search engines may not be as reported here.
(March 2006 update. Google now handles the 々 character correctly.)
The assistance of Mr Paul Blay in the preparation of this paper is gratefully acknowledged.
Basis Technology Corp. 2005. Customers - Google, http://www.basistech.com/customers/
James Breen. 2004. Expanding the Lexicon: the Search for Abbreviations, Papillon (Multi-lingual Dictionary) Project Workshop, Grenoble, August-September 2004.
Google. 2005. Katakana no hyoukiyure, http://www.google.co.jp/intl/ja/help/basics.html
Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Hiroshi Matsuda, Kazuma Takaoka, and Masayuki Asahara. 2002. Morphological Analysis System ChaSen version 2.2.9 Manual. Nara Institute of Science and Technology. http://chasen.aist-nara.ac.jp/hiki/ChaSen/
Jean Véronis. 2005. Google's missing pages: mystery solved? http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html