The WWW in Japan: a threat to cultural identity,
or a domesticated system?

James Breen
Honorary Senior Research Fellow
School of Computer Science & Software Engineering
Monash University 3800
jwb@csse.monash.edu.au

Alison Tokita
Associate Professor
School of Languages, Cultures & Linguistics
Monash University 3800
Alison.Tokita@arts.monash.edu.au

Abstract

The dominance of English-speaking countries in research and development in communications and information technology has led to English being the default language for many technologies, and the limited 26-character alphabet often being the only character set available on all systems. This has often led to problems and extra challenges when these technologies are introduced into non-English speaking countries, and particularly in countries where non-alphabetic scripts are used. This has been evident in Japan, which has generally been an early adopter of such technologies, but which uses a writing system which was at first considered difficult or impossible to handle effectively with computers.

This paper reports on a study of the WWW in Japan which investigated: the structure of the underlying Internet in Japan and the extent to which it differs from models employed elsewhere; the relative size of the WWW in Japan compared with other countries and languages in East Asia; the levels to which the WWW uses non-Japanese orthography (e.g. English or romanized Japanese) and the applications of such text; and the nature of the use of the WWW in terms of organizations, individuals, etc., together with the levels of language employed in WWW pages.

Introduction

Computer Character Coding - from monolingual to multilingual

With the initial development of computers taking place primarily in the USA and the UK in the 1940s and 1950s, it is no surprise that the text-handling capabilities of early systems were confined to support of the English language, and in particular to the limited 26-character alphabet. This was also convenient at a time when computer storage was expensive and small. The very early computer systems used 5-bit alphabetic and numeric coding systems derived from teleprinter systems[1], and it was not until the early 1970s, some 30 years after the development of the first computers, that typical computer systems could handle both upper and lower case alphabetics, as well as numeric characters and a reasonable selection of punctuation characters.

The dominance of English-speaking countries in research and development in communications and information technology also led to English being the default language for many associated technologies, a situation that still persists. For example, most computer languages use English constructs, commands, library facilities, etc.

The coding systems which arrived in the early 1970s were mostly associated with a 7-bit code established by the US standards agency (now ANSI) and generally known by the acronym ASCII (American Standard Code for Information Interchange). This code provided 128 basic characters, and through an extension to 8 bits (by that period most computers could handle 8-bit characters), a further 128 characters could be added. Many countries adopted as a national standard the basic ASCII set together with an extension set suitable to the local writing system. The extension took the form of a complete alphabet in the case of languages such as Russian or Greek, or a set of combined alphabetic characters and diacritic marks in the cases of many other European languages. The Japanese equivalent of ASCII (now JIS X 0201[2]) included a modified form of the katakana syllabary, thus allowing Japanese text to be recorded and handled, albeit not in its usual form. Eventually many of these code systems were brought together by the International Organization for Standardization (ISO) as a consistent code family (ISO 8859).

The codes based on 8-bit characters, with their limit of 256 unique characters, were clearly inadequate for languages such as Chinese, Japanese and Korean which use large numbers of either Chinese characters (hanzi, kanji) or Hangul characters. During the 1970s, "double-byte character sets" (DBCS) were developed by the computer companies and standards bodies. By using pairs of bytes to represent each character, a total of over 65,000 characters could be defined. The Japanese DBCS code-set (now JIS X 0208[3]) defines 6,355 kanji, both the hiragana and katakana syllabaries, as well as several alphabets and many punctuation characters. Similar standards were developed in Taiwan (Big5), the PRC (GB2312) and Korea (KS X 1001).
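The code-space sizes mentioned above, and the difference between single-byte and double-byte coding, can be illustrated with a short sketch. It assumes Python and its built-in `shift_jis` codec; Shift_JIS is used here only as a convenient encoding that carries both the JIS X 0201 katakana and the JIS X 0208 kanji, and is not a tool used in the study.

```python
# Number of distinct codes available at each width discussed in the text.
ascii_codes = 2 ** 7         # 7-bit ASCII
eight_bit_codes = 2 ** 8     # ASCII plus an 8-bit national extension
double_byte_codes = 2 ** 16  # theoretical limit for pairs of bytes

halfwidth_kana = "ｶﾅ"  # two katakana from the JIS X 0201 single-byte set
kanji = "漢字"          # two kanji from the JIS X 0208 double-byte set

# Illustrative choice of encoding (an assumption, see the lead-in).
kana_bytes = halfwidth_kana.encode("shift_jis")
kanji_bytes = kanji.encode("shift_jis")

print(ascii_codes, eight_bit_codes, double_byte_codes)
print(len(kana_bytes), "bytes for", len(halfwidth_kana), "katakana")
print(len(kanji_bytes), "bytes for", len(kanji), "kanji")
```

Each single-byte katakana occupies one byte while each kanji occupies two, which is the trade-off that the double-byte sets accepted in order to reach several thousand characters.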

The coding systems described above were largely confined to the countries which developed them, and used exclusively in domestic computer systems. It was not until the 1980s that attention began to be given to the issues associated with international movement of computerized text and the combination of several languages in the one document. One approach that was taken for alphabetic codes was to develop techniques for combining existing national codes in the one document, leading to the current ISO 2022 code encapsulation standard.
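As a rough illustration of the code-switching idea behind ISO 2022, the sketch below encodes a mixed ASCII and Japanese string. It assumes Python's built-in `iso2022_jp` codec; ISO-2022-JP is a Japanese profile of the ISO 2022 mechanism and is chosen here only because it is readily available, not because it was used in the study.

```python
ESC = b"\x1b"  # the escape byte that introduces a character-set switch

mixed = "abc日本語"
encoded = mixed.encode("iso2022_jp")

# The ASCII part is stored as-is; an escape sequence then switches to the
# double-byte kanji set, and a second one switches back to ASCII at the end.
print(encoded)
print("escape sequences:", encoded.count(ESC))

decoded = encoded.decode("iso2022_jp")
print(decoded == mixed)
```

The escape sequences are what allow several national code sets to share one byte stream, at the cost of making the meaning of each byte depend on what preceded it.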

For the large sets of character codes used in Chinese, Japanese and Korean, the approach has been to attempt to establish a single code incorporating all the characters used in those languages. An early attempt was the Chinese Character Code for Information Interchange (CCCII) which, with a similar system proposed by Japan's National Diet Library, led to the East Asian Character Code (EACC) for bibliographic use, standardized as ANSI Z39.64-1989. This code was recognized as too limited for general use, and in the late 1980s work began on an expanded code-set which was eventually developed into the "unified CJK ideographs" first incorporated in the Unicode/ISO 10646 standard in 1993. Initially some 21,000 hanzi/kanji were included, and this has since been extended to nearly 70,000. Similarly approximately 4,000 hangul were initially included, which has been extended to over 10,000. (A thorough treatment of code sets for East Asian languages is available in Lunde[4].)

Character Codes and Networks

Although networks of computers began to be developed in the late 1950s, and by the early-mid 1970s were receiving considerable attention in developed countries, they were largely associated with military, government and industrial applications, and little or no attention was paid to issues associated with text coding. The first major initiative aimed at an integrated world-wide computer network was the ISO's Open Systems Interconnection (OSI) standards project, which began formally in 1977. This major project, which showed considerable initial potential but eventually faded out of existence as the Internet began to dominate in the early 1990s, paid very little attention to language coding issues, with many of its major elements only being able to handle European languages.

The Internet, with its overall architecture of a set of autonomous computer networks interconnected by some very basic communications protocols, was originally even more limited. Developed initially by research institutions in the US, much of the Internet's underlying command, message and address structure was, and still is, limited to use of the (English) alphabet plus numerals and some punctuation characters. While many elements of this structure are hidden from most users, one which is highly visible is the "domain address", which makes up part of email addresses, WWW addresses, etc. These addresses are associated with numerical network addresses used by the Internet routers, and the mapping between the domain addresses and the network addresses is carried out by a distributed application, the Domain Name System (DNS).

As the Internet began to be deployed throughout the world in the early 1990s, some users and networking organizations complained about the restriction of addresses to such a limited character set. After initial exploration of the issues by Dillon[5] and Duerst[6] in 1996, an experimental implementation of an expanded character set for domain names was carried out in the Center for Internet Research at the National University of Singapore[7]. The success of this experiment led to the Internet's controlling standards body, the Internet Engineering Task Force (IETF), formally establishing an Internationalized Domain Name (IDN) working group in January 2000. In March 2003 an Internet standard, Internationalizing Domain Names in Applications (IDNA)[8], was approved. The IDNA approach is to specify a standard conversion algorithm, referred to as the ASCII Compatible Encoding (ACE), between an international domain name written using Unicode/ISO 10646 and an internal format which only uses the limited subset of ASCII. Thus the large number of existing DNS implementations can continue to operate without change, and only software such as WWW browsers and email clients needs to be modified or enhanced if users wish to use such domain names.

The Internet Address Structure in Japan

The domain addresses of the Internet are structured as a hierarchy in order to facilitate distributed management of the addresses. At the top of the hierarchy are a set of "generic" domains (com, edu, org, net, etc.) and a much larger set of "country code" domains (au, jp, nz, ca, uk, etc.). These codes are based on the ISO country-name abbreviation codes. In many countries the top-level domain is divided into a set of "second-level" domains, again for purposes of management, commercialization, etc., although in some countries, e.g. France and Germany, there are no further levels in the hierarchy. Administration of the domains within each country is carried out by an organization approved by the Internet Corporation for Assigned Names and Numbers (ICANN).

In Japan the domain names are controlled by Japan Registry Services (JPRS), successor to the Japan Network Information Center (JPNIC), which carried out this role until mid-2003. From the inception of the Internet in Japan the .jp domain has been divided into second-level domains according to organization type, using a two-letter code (co, ac, ne, etc.). In 2001, JPNIC/JPRS introduced "General Use" domain names, which do not use a second-level code, and which can be written using either ASCII or Japanese (kanji and kana) according to the IDNA standard.

The current major domain-name types and the numbers of registered names are as set out below.

Domain Type                      Code     Number    Example
Higher Educational Institutions  .ac.jp   3,020     u-tokyo.ac.jp
Other Educational Institutions   .ed.jp   4,293     kaminokawa-h.ed.jp
Companies                        .co.jp   246,664   fujitsu.co.jp
Government                       .go.jp   812       kantei.go.jp
Organizations                    .or.jp   17,932    keidanren.or.jp
Network Service Providers        .ne.jp   17,482    gol.ne.jp
General Use (ASCII)              .jp      192,147   densha.jp
General Use (Japanese)           .jp      45,588    千代田不動産.jp

Figure 1. Internet Domain Structure in Japan.

The use of a hybrid organization/general-use domain-name structure is unique to Japan, and is a departure from the usual practices in domain-name administration.

The introduction of domain names in Japan in kanji and kana was one of the first applications of internationalized domain names. That Japan was an early adopter of such names is not surprising given the role that Japanese organizations played in the establishment of the IDN system[9]. In fact, by introducing such names prior to the formal establishment of the standards for the coding of non-ASCII names, JPNIC/JPRS had to take the risk of anticipating the actual form the coding would take. Initially the "Row-based ASCII Compatible Encoding (RACE)" method was adopted; however, as this was not selected as the standard, the coding had to be migrated to the standard form in mid-2003. Another problem with the early use of IDN was the lack of software to enable users to access such addresses. JPNIC/JPRS obtained from Verisign, a US software and domain registration company, a "plugin" for the Microsoft Internet Explorer browser which can use the ACE/RACE domain names, and made this freely available to users.

Although domain names in Japanese now make up about 8% of registered names in Japan, the actual usage of these names appears to be quite limited. From inspection of the domain-name databases it appears that many names are not yet associated with numerical network addresses, and hence are not in use. This is consistent with comment in the IT industry that many companies were registering their names for possible future use. Very few company advertisements with Japanese domain names have been detected in the print media or on billboards, and in several cases it has been noted that companies which did advertise such names have reverted to the ASCII forms. It is clearly too early to predict the success or otherwise of Japanese domain names, as there are still systemic impediments, in particular the very limited quantity of software which supports the special coding of such names.
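The share of Japanese-script names can be recomputed from the counts in Figure 1. The sketch below is only a check on the arithmetic; it assumes that the eight rows of Figure 1 are the whole population, whereas the table lists only the major domain-name types, so the true share would be slightly lower.

```python
# Registered-name counts copied from Figure 1.
registered = {
    "Higher Educational Institutions": 3_020,
    "Other Educational Institutions": 4_293,
    "Companies": 246_664,
    "Government": 812,
    "Organizations": 17_932,
    "Network Service Providers": 17_482,
    "General Use (ASCII)": 192_147,
    "General Use (Japanese)": 45_588,
}

total = sum(registered.values())
japanese_share = 100 * registered["General Use (Japanese)"] / total

print(f"total names listed: {total:,}")
print(f"Japanese-script share: {japanese_share:.1f}%")
```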

In the longer term, it will be important to follow the progress of the application of Japanese domain names, as it will be an indicator of the usefulness of IDN in a global sense. While the desire to have names available in a locally and culturally relevant script is understandable, its widespread adoption would inevitably result in the Internet developing into a set of semi-enclosed communities. A name such as 千代田.JP (CHIYODA.JP) would be difficult for a non-Japanese user to enter or even display, while the ACE form of this name (XN--MNQ89HQW2B.JP) does not have any useful mnemonic characteristics, and would be difficult to use in that form without errors.
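The conversion between the Japanese and ACE forms of the example name above can be reproduced with a short sketch. It assumes Python's built-in `idna` codec, which implements the IDNA standard of RFC 3490[8]; the choice of Python is purely illustrative.

```python
unicode_name = "千代田.jp"

# ToASCII: each non-ASCII label is converted to its "xn--" ACE form.
ace_name = unicode_name.encode("idna")
print(ace_name)

# ToUnicode: the ACE form converts back to the original name.
round_trip = ace_name.decode("idna")
print(round_trip)
```

The ACE form is what is actually stored in and looked up through the DNS, which is why only the software at the user's end needs to know about the conversion.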

The WWW in Japan

In considering issues associated with WWW usage, it is necessary to establish a technique for estimating the numbers of WWW pages available in various categories. A useful metric is the number of pages being reached by the major "search engines", which periodically examine the WWW, accumulate copies of pages, and carry out a degree of analysis of their contents. While the search engines cannot cover the entire WWW, as many pages are blocked by passwords, only generated after a user dialogue, etc. they do provide an indication of the size of the WWW, and a technique for categorizing text within the pages. At present, the "Google.com" search engine has the largest number of WWW pages (approximately 3.3 billion) under examination, and since it stores accumulated pages in Unicode coding and allows searching for pages containing text in any language, it is appropriate to use it as the primary measurement tool.

As expected, English clearly dominates in WWW pages. Examining the counts of pages containing typical English words such as "and", "if", "the", etc. indicates that approximately 2.4 billion pages contain some English material. As a contrast, an examination of a set of typical French words ("la", "le", "à", etc.) indicates that approximately 140 million pages contain some French. A similar examination using short kana sequences, which are (fortunately) unique to Japanese and an essential component of the written language, indicates 168 million pages contain が, 200 million pages contain を, 188 million pages contain は and 66 million pages contain です. From this it is reasonable to estimate that in the order of 200 million pages contain Japanese, which is about 1.6 pages per Japanese speaker, compared to about 2 pages per French speaker.
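The per-speaker figures above depend on speaker populations which are not stated in the text. As a rough check, the sketch below inverts the quoted ratios to show the populations they imply; the variable names are illustrative and the resulting figures are inferences, not data from the study.

```python
# (estimated pages in millions, quoted pages per speaker)
estimates = {
    "Japanese": (200, 1.6),
    "French": (140, 2.0),
}

# Implied speaker population in millions = pages / pages-per-speaker.
implied_speakers = {
    language: pages / per_speaker
    for language, (pages, per_speaker) in estimates.items()
}

for language, millions in implied_speakers.items():
    print(f"{language}: about {millions:.0f} million speakers implied")
```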

A similar comparison can be made between Japanese and Chinese using compound kanji/hanzi words which are common to and occur frequently in both languages. Testing 社会 (社會 in traditional Chinese characters): (society, public), 最近 (latest, nowadays), and 世界 (the world, society, universe) via the Google search engine using filters on domains (jp, cn (PRC), tw (Taiwan)) and languages (Japanese, Chinese-simplified characters, Chinese-traditional characters) we obtain the following:

Filter type   Filter value            社会/社會   最近   世界
Domain        cn                      1.9         0.67   1.69
Domain        tw                      0.68        0.38   0.90
Domain        jp                      4.04        3.69   3.37
Language      Japanese                6.18        6.08   3.88
Language      Chinese (simplified)    3.76        2.51   2.50
Language      Chinese (traditional)   1.24        0.85   1.54

Figure 2. Word Frequencies of Chinese & Japanese
(millions of pages containing the specified word).

Despite difficulties in making such cross-language comparisons, this data is consistent with a conclusion that Japanese text is present in WWW pages approximately 50% more often than Chinese, and along with the estimate of French pages above supports conclusions reached by market research companies, an example of which follows:

[Image: wwwstats.gif]

Figure 3. Estimate of WWW content by language.
Source: http://global-reach.biz/globstats/refs.php3

It is worth noting that not all WWW pages containing Japanese text are confined to the .jp domain of addresses. Many large Japanese companies, such as NEC, Sony and Fujitsu, choose to use addresses from the generic .com domain, while some Japanese organizations use addresses from the .org domain. This is the reason for the difference in page numbers between the Japanese pages and the jp domain pages which can be seen in Figure 2. It is also worth noting that Japanese pages are not often accessed by non-Japanese, the main exceptions being language learners and scholars of Japanese studies[10].

Compared to many of the application systems which operate over the Internet, such as email, the WWW was developed at a time when the issues associated with the coding of different languages were well recognized, and hence structures were in place from the beginning to enable languages such as Japanese to be readily handled in WWW pages in a standard fashion. Also, although the take-up of personal computers in Japan had initially been slower than in other developed countries, partly due to the difficulties associated with the input of Japanese text[11], by the time of the introduction of the WWW in the early 1990s these problems had largely been overcome, and computer usage was rising rapidly. Thus there seemed to be no barrier to the use of the Japanese script in WWW pages. It is appropriate to test this hypothesis by determining if there is significant use of English or of romanized Japanese in WWW pages originating in Japan and for domestic use.

Examination of WWW pages in the .jp domain does indicate a very low level of use of romanized Japanese. Common words such as kuruma and arimasu appeared in 12,300 and 1,200 pages respectively, which is less than 0.1% of the occurrences of the words in Japanese script (車 and あります). Inspection of a selection of pages which used significant amounts of romanized Japanese indicated that it was almost entirely confined to material aimed at tourists and language learners at an elementary level.

Similarly, inspection of WWW pages in the .jp domain which contain English text does not provide any evidence that English is used for primary communication. Many WWW pages in Japan do contain some English; however, it is clear that this material falls into two categories:

  1. use of English in product names, catch-phrases, slogans, etc. as is common in modern Japanese society. In these cases the English text is generally embedded within Japanese material;
  2. provision of English material aimed at English speakers. Many organizations provide an "English version" linked from their WWW pages, and containing a summary of the material in the main (Japanese) pages. In none of the pages surveyed did this take the place of pages in Japanese.

Japanese Text in WWW Pages

The Japanese language is rich and varied with many styles and levels of register which are reflected in such things as vocabulary, honorific forms and conjugations. It is instructive to examine the language being used in Japanese WWW pages to identify and quantify the styles, levels, etc. in use.

One method of carrying out such an examination would be to collect a large sample of pages, and examine the text, either manually or automatically, with a view to classification. Such a task would be complex, and would require significant resources to gather a large enough collection of pages to ensure the sample was random. As the only way of identifying a WWW page is to follow a link from another page, the inherent tree structure in WWW pages is a significant impediment to randomized selection of pages.

The method used in the present study is to draw on the collections of pages assembled by search engines, in this case Google, and examine the frequency of occurrence of selected words and phrases which can be associated with particular language styles and levels. For the purposes of this study, the key data is the number of pages in which a word or phrase occurs, not the ranking of the pages. In carrying out this examination, only the pages in the .jp domain have been examined, and in addition pages have been limited to those identified by Google as containing Japanese text, as it is necessary to avoid counting pages containing Chinese text.

Using common Japanese words such as が, を, etc. it is possible to establish the approximate distribution of WWW pages across the second-level domains in Japan. These are as set out in the following table.

Domain               ad    ac     co     go    or     ne     gr    ed    lg   geo
Percentage of pages  0.7   13.4   40.7   6.2   13.7   21.9   2.5   0.7   -    -

Figure 4. WWW Page distribution by domain

It is possible to determine the percentages of pages containing selected words. For example, for the words 経済 (keizai: economics; business; finance; economy), 政府 (seifu: government; administration) and 写真 (shashin: photograph), we see the following pattern of occurrences. (The figures are the percentages of pages in the domain containing the word.)

Domain ac co go or ne
経済 13 14 28 16 11
政府 7 10 25 13 8
写真 26 50 13 34 42

Figure 5a. WWW Page by selected word

The levels of occurrence of these words across the domains are consistent with what one would expect: that government pages are more likely to be dealing with administrative matters, and that private and company pages are more likely to have or be discussing photographs.

Turning to some examples of relatively formal Japanese, we examined ございます (gozaimasu: a polite form of the copula), いただきます (itadakimasu: a polite form of the humble verb to receive), and である (dearu: a plain form of the copula). The first two usually occur only in spoken language and reported speech, and the last usually only in written language.

Domain ac co go or ne
ございます 1 10 2 3 5
いただきます 2 7 4 5 4
である 39 38 49 41 36

Figure 5b. WWW Page by selected word

It is not surprising that the spoken forms occur relatively rarely in WWW pages. This is even more evident when colloquialisms are examined, for example すごく (sugoku: awfully, extremely) and どうかな (doukana: I wonder...).

Domain ac co go or ne
すごく 3 6 1 6 11
どうかな 0 1 0 1 1

Figure 5c. WWW Page by selected word

That these words occur at all in academic or government pages is surprising; however, on inspection it emerged that they were mostly due to reported speech in newsletters and similar publications.

Conclusion

This study has examined the adoption of the Internet and the WWW in Japan from the position of the impact of the underlying domination of the English alphabet, the adaptations to use Japanese scripts, and the usage of Japanese language in WWW material. It has also briefly examined the usage of the WWW in terms of numbers of pages compared with other languages.

The conclusions reached are that: (a) the English alphabet still dominates in structural aspects of the Internet and WWW, such as in addresses, and although there has been an attempt to address this in Japan, it has met with limited success to date; (b) the WWW has been fully developed in Japan in the local language and script in a manner which appears to be consistent with other communications media. There is no evidence of English or romanized Japanese playing a significant role in the use of the WWW by Japanese people; (c) the level of penetration of the WWW in Japan is high and is comparable with major European languages, although on a per capita basis is well behind English. The penetration is still significantly higher than is the case with Chinese both on absolute and per capita bases.

References

  1. International Telecommunication Union: CCITT International Telegraph Alphabet No. 2 (ITA2), ITU, Geneva, 1930.

  2. Japanese Industrial Standards Committee: JIS X 0201-1997 7-bit and 8-bit Coded Character Sets for Information Interchange, Japanese Standards Association, 1997 (originally designated JIS C 6220-1976).

  3. Japanese Industrial Standards Committee: JIS X 0208-1997 7-bit and 8-bit Coded Kanji Sets for Information Interchange, Japanese Standards Association, 1997.

  4. Lunde, Ken: CJKV Information Processing, O'Reilly & Associates Inc., 1999.

  5. Dillon, Michael: Multilingual Domain Names, Internet Draft (http://www.gtld-mou.org/gtld-discuss/mail-archive/01067.html), December 1996.

  6. Duerst, Martin: Internationalization of Domain Names, Internet Draft (http://www.w3.org/International/1996/draft-duerst-dns-i18n-00.txt), December 1996.

  7. Centre for Internet Research: iDNS - Internationalized Domain Name System (http://www.apng.org/idns/), National University of Singapore, 1998.

  8. Faltstrom, P.; Hoffman, P. and Costello, A.: Internationalizing Domain Names in Applications (IDNA), RFC 3490, The Internet Society, March 2003.

  9. Hotta, Hirofumi: Multilingual Domain Names - ITU Briefing Paper (http://www.itu.int/mlds/briefingpaper/itu/), Joint ITU/WIPO Symposium, Geneva, 2001.

  10. Gottlieb, Nanette: Globalisation and Language: Japanese Online, International Symposium on Globalisation, Localization and Japanese Studies in the Asia-Pacific Region, University of Sydney, November 2003.

  11. Unger, J. Marshall: The Fifth Generation Fallacy, Oxford University Press, New York, 1987.