School of Languages, Cultures & Linguistics
Monash University 3800
The dominance of English-speaking countries in research and development in communications and information technology has led to English being the default language for many technologies, and the limited 26-character alphabet often being the only character set available on all systems. This has often led to problems and extra challenges when these technologies are introduced into non-English speaking countries, and particularly in countries where non-alphabetic scripts are used. This has been evident in Japan, which has generally been an early adopter of such technologies, but which uses a writing system which was at first considered difficult or impossible to handle effectively with computers.
This paper reports on a study of the WWW in Japan which investigated: the structure of the underlying Internet in Japan and the extent to which it differs from models employed elsewhere; the relative size of the WWW in Japan compared with other countries and languages in East Asia; the extent to which the WWW uses non-Japanese orthography, e.g. English or romanized Japanese, and the applications of such text; and the nature of the use of the WWW in terms of organizations, individuals, etc. and the levels of language employed in WWW pages.
Computer Character Coding - from monolingual to multilingual
With the initial development of computers taking place primarily in the USA and the UK in the 1940s and 1950s, it is no surprise that the text-handling capabilities of early systems were confined to support of the English language, and in particular to its limited 26-character alphabet. This was also convenient at a time when computer storage was expensive and limited. The earliest computer systems used 5-bit alphabetic and numeric coding systems derived from teleprinter systems, and it was not until the early 1970s, some 30 years after the development of the first computers, that typical computer systems could handle both upper- and lower-case letters, as well as numeric characters and a reasonable selection of punctuation characters.
The dominance of English-speaking countries in research and development in communications and information technology also led to English being the default language for many associated technologies, a situation that still persists. For example, most computer languages use English constructs, commands, library facilities, etc.
The coding systems which arrived in the early 1970s were mostly associated with a 7-bit code established by the US standards agency (now ANSI) and generally known by the ASCII (American Standard Code for Information Interchange) acronym. This code provided 128 basic characters, and through an extension to 8 bits (by that period most computers could handle 8-bit characters), a further 128 characters could be added. Many countries followed the approach of adopting the basic ASCII set, together with an extension set suited to the local writing system, as a national standard. The extension took the form of a complete alphabet in the case of languages such as Russian or Greek, or a set of combined alphabetic characters and diacritic marks in the cases of many other European languages. The Japanese equivalent of ASCII (now JIS X 0201) included a modified form of the katakana syllabary, thus allowing Japanese text to be recorded and handled, albeit not in its usual form. Eventually many of these code systems were brought together by the International Organization for Standardization as a consistent code family (ISO 8859).
The codes based on 8-bit characters, with their limit of 256 unique characters, were clearly inadequate for languages such as Chinese, Japanese and Korean which use large numbers of either Chinese characters (hanzi, kanji) or Hangul characters. During the 1970s a number of "double-byte character sets" (DBCSs) were developed by computer companies and standards bodies. By using pairs of bytes to represent each character, a total of over 65,000 characters could be defined. The Japanese DBCS code-set (now JIS X 0208) defines 6,355 kanji, both the hiragana and katakana syllabaries, as well as several alphabets and many punctuation characters. Similar standards were developed in Taiwan (Big5), the PRC (GB2312) and Korea (KS X 1001).
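The byte-level difference between ASCII and the double-byte sets can be seen with Python's standard codecs (a small illustration; the choice of kanji is arbitrary):

```python
# An ASCII letter occupies one byte; a kanji occupies two bytes in the
# classic Japanese double-byte encodings (Shift_JIS and EUC-JP here).
letter = "A"
kanji = "社"  # an arbitrary common kanji

assert len(letter.encode("ascii")) == 1
for codec in ("shift_jis", "euc-jp"):
    encoded = kanji.encode(codec)
    print(codec, encoded.hex(), len(encoded))  # 2 bytes per kanji
```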
The coding systems described above were largely confined to the countries which developed them, and used exclusively in domestic computer systems. It was not until the 1980s that attention began to be given to the issues associated with international movement of computerized text and the combination of several languages in the one document. One approach that was taken for alphabetic codes was to develop techniques for combining existing national codes in the one document, leading to the current ISO 2022 code encapsulation standard.
For the large sets of character codes used in Chinese, Japanese and Korean, the approach has been to attempt to establish a single code incorporating all the characters used in those languages. An early attempt was the Chinese Character Code for Information Interchange (CCCII) which, with a similar system proposed by Japan's National Diet Library, led to the East Asian Character Code (EACC) for bibliographic use, standardized as ANSI Z39.64-1989. This code was recognized as too limited for general use, and in the late 1980s work began on an expanded code-set which was eventually developed into the "unified CJK ideographs" first incorporated in the Unicode/ISO 10646 standard in 1993. Initially some 21,000 hanzi/kanji were included, and this has since been extended to nearly 70,000. Similarly, approximately 4,000 hangul were initially included, which has been extended to over 10,000. (A thorough treatment of code sets for East Asian languages is available in Lunde.)
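As a small illustration of the unified code space, Python exposes a character's single Unicode code point directly; the basic CJK Unified Ideographs block of the 1993 standard occupies U+4E00 to U+9FFF:

```python
# One code point per character, shared across Chinese, Japanese and
# Korean usage, in the basic CJK Unified Ideographs block (U+4E00-U+9FFF).
for ch in "社会":
    cp = ord(ch)
    print(ch, f"U+{cp:04X}")
    assert 0x4E00 <= cp <= 0x9FFF
```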
Character Codes and Networks
Although networks of computers began to be developed in the late 1950s, and by the early-mid 1970s were receiving considerable attention in developed countries, they were largely associated with military, government and industrial applications, and little or no attention was paid to issues associated with text coding. The first major initiative aimed at an integrated world-wide computer network was the ISO's Open Systems Interconnection (OSI) standards project, which began formally in 1977. This major project showed considerable initial potential, but eventually faded out of existence as the Internet began to dominate in the early 1990s; it paid very little attention to language coding issues, with many of its major elements able to handle only European languages.
The Internet, with its overall architecture of a set of autonomous computer networks interconnected by some very basic communications protocols, was originally even more limited. Developed initially by research institutions in the US, much of the Internet's underlying command, message and address structure was, and still is, limited to use of the (English) alphabet plus numerals and some punctuation characters. While many elements of this structure are hidden from most users, one which is highly visible is the "domain address", which makes up part of email addresses, WWW addresses, etc. These addresses are associated with numerical network addresses used by the Internet routers, and the mapping between the domain addresses and the network addresses is carried out by a distributed application, the Domain Name System (DNS).
As the Internet began to be deployed throughout the world in the early 1990s, some users and networking organizations complained about the restriction of addresses to such a limited character set. After initial exploration of the issues by Dillon and Duerst in 1996, an experimental implementation of an expanded character set for domain names was carried out in the Center for Internet Research at the National University of Singapore. The success of this experiment led to the Internet's controlling standards body, the Internet Engineering Task Force (IETF) formally establishing an International Domain Name (IDN) task force in January 2000. In March 2003 an Internet standard: Internationalizing Domain Names in Applications (IDNA) was approved. The IDNA approach is to specify a standard conversion algorithm, referred to as the ASCII Compatible Encoding (ACE), between an international domain name written using Unicode/ISO 10646 and an internal format which only uses the limited subset of ASCII. Thus the large number of existing DNS implementations can continue to operate without change, and only software such as WWW browsers and email clients need to be modified or enhanced if users wish to use such domain names.
The Internet Address Structure in Japan
The domain addresses of the Internet are structured as a hierarchy in order to facilitate distributed management of the addresses. At the top of the hierarchy are a set of "generic" domains (com, edu, org, net, etc.) and a much larger set of "country code" domains (au, jp, nz, ca, uk, etc.). These codes are based on the ISO 3166 two-letter country codes. In many countries the top-level domain is divided into a set of "second-level" domains, again for purposes of management, commercialization, etc., although in some countries, e.g. France and Germany, there are no further levels in the hierarchy. Administration of the domains within each country is carried out by an organization approved by the Internet Corporation for Assigned Names and Numbers (ICANN).
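The hierarchy can be read directly from the labels of a name such as u-tokyo.ac.jp, right to left (a trivial sketch):

```python
# Splitting a domain name into its labels; the hierarchy runs from the
# rightmost label (country code) leftward.
name = "u-tokyo.ac.jp"
labels = name.split(".")[::-1]
print(labels)  # ['jp', 'ac', 'u-tokyo']: country code, second level, organization
```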
In Japan the domain names are controlled by the Japan Registry Service (JPRS), successor to the Japan Network Information Center (JPNIC), which carried out this role until mid-2003. From the inception of the Internet in Japan the .jp domain has been divided into second-level domains according to organization type, using a two-letter code (co, ac, ne, etc.). In 2001, JPNIC/JPRS introduced "General Use" domain names, which do not use a second-level code, and which can be written using either ASCII or Japanese (kanji and kana) according to the IDNA standard.
The current major domain-name types and the numbers of registered names are as set out below.
| Domain type | Suffix | Registered names | Example |
| Higher Educational Institutions | .ac.jp | 3,020 | u-tokyo.ac.jp |
| Other Educational Institutions | .ed.jp | 4,293 | kaminokawa-h.ed.jp |
| Network Service Providers | .ne.jp | 17,482 | gol.ne.jp |
| General Use (ASCII) | .jp | 192,147 | densha.jp |
| General Use (Japanese) | .jp | 45,588 | 千代田不動産.jp |
The use of a hybrid organization/general-use domain-name structure is unique to Japan, and is a departure from the usual practices in domain-name administration.
The introduction of domain names in Japan in kanji and kana was one of the first applications of internationalized domain names. That Japan was an early adopter of such names is not surprising given the role that Japanese organizations played in the establishment of the IDN system. In fact, by introducing such names prior to the formal establishment of the standards for the coding of non-ASCII names, JPNIC/JPRS had to take the risk of anticipating the actual form the coding would take. Initially the "Row-based ASCII Compatible Encoding" (RACE) method was adopted; however, as this was not the method ultimately selected for the standard (the Punycode encoding was), registered names had to be migrated to the standard form in mid-2003. Another problem with the early use of IDN was the lack of software to enable users to access such addresses. JPNIC/JPRS obtained from Verisign, a US software and domain registration company, a "plugin" for the Microsoft Internet Explorer browser which can use the ACE/RACE domain names, and made this freely available to users.
Although domain names in Japanese now make up about 8% of registered names in Japan, the actual usage of these names appears to be quite limited. From inspection of the domain-name databases it appears that many names are not yet associated with numerical network addresses, and hence are not in use. This confirms comments from the IT industry that many companies were registering their names for possible future use. Very few company advertisements have been detected in the print media or on billboards with Japanese domain names, and in several cases it has been noted that companies which did advertise such names have reverted to the ASCII forms. It is clearly too early to predict the success or otherwise of Japanese domain names, as there are still systemic impediments, in particular the very limited quantity of software which supports the special coding of such names.
In the longer term, it will be important to follow the progress of the application of Japanese domain names, as it will be an indicator of the usefulness of IDN in a global sense. While the desire to have names available in a locally and culturally relevant script is understandable, its widespread adoption would inevitably result in the Internet developing into a set of semi-enclosed communities. A name such as 千代田.JP (CHIYODA.JP) would be difficult for a non-Japanese user to enter or even display, while the ACE form of this name (XN--MNQ89HQW2B.JP) does not have any useful mnemonic characteristics, and would be difficult to use in that form without errors.
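The ACE conversion can be reproduced with Python's built-in "idna" codec, which implements the 2003 IDNA standard (the ToASCII/ToUnicode operations with the Punycode encoding); a brief sketch using the name discussed above:

```python
# Convert an internationalized domain name to its ASCII Compatible
# Encoding and back, using Python's built-in IDNA (2003) codec.
name = "千代田.jp"
ace = name.encode("idna")          # the xn--... form stored in the DNS
print(ace)
assert ace.startswith(b"xn--")     # ACE labels carry the xn-- prefix
assert ace.decode("idna") == name  # the conversion is lossless
```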
The WWW in Japan
In considering issues associated with WWW usage, it is necessary to establish a technique for estimating the numbers of WWW pages available in various categories. A useful metric is the number of pages being reached by the major "search engines", which periodically examine the WWW, accumulate copies of pages, and carry out a degree of analysis of their contents. While the search engines cannot cover the entire WWW, as many pages are blocked by passwords, only generated after a user dialogue, etc., they do provide an indication of the size of the WWW, and a technique for categorizing text within the pages. At present, the "Google.com" search engine has the largest number of WWW pages (approximately 3.3 billion) under examination, and since it stores accumulated pages in Unicode coding and allows searching for pages containing text in any language, it is appropriate to use it as the primary measurement tool.
As expected, English clearly dominates in WWW pages. Examining the counts of pages containing typical English words such as "and", "if", "the", etc. indicates that approximately 2.4 billion pages contain some English material. As a contrast, an examination of a set of typical French words ("la", "le", "à", etc.) indicates that approximately 140 million pages contain some French. A similar examination using short kana sequences, which are (fortunately) unique to Japanese and an essential component of the written language, indicates 168 million pages contain が, 200 million pages contain を, 188 million pages contain は and 66 million pages contain です. From this it is reasonable to estimate that in the order of 200 million pages contain Japanese, which is about 1.6 pages per Japanese speaker, compared to about 2 pages per French speaker.
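The per-speaker figures follow from simple division; the speaker populations used here (roughly 125 million for Japanese, 70 million for French) are assumptions chosen to match the ratios quoted above, not figures from the study:

```python
# Hypothetical speaker populations (assumptions for illustration only),
# divided into the page-count estimates given in the text.
japanese_pages, japanese_speakers = 200_000_000, 125_000_000
french_pages, french_speakers = 140_000_000, 70_000_000

print(round(japanese_pages / japanese_speakers, 1))  # 1.6 pages per speaker
print(round(french_pages / french_speakers, 1))      # 2.0 pages per speaker
```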
A similar comparison can be made between Japanese and Chinese using compound kanji/hanzi words which are common to and occur frequently in both languages. Testing 社会 (社會 in traditional Chinese characters): (society, public), 最近 (latest, nowadays), and 世界 (the world, society, universe) via the Google search engine using filters on domains (jp, cn (PRC), tw (Taiwan)) and languages (Japanese, Chinese-simplified characters, Chinese-traditional characters) we obtain the following:
Despite difficulties in making such cross-language comparisons, this data is consistent with a conclusion that Japanese text is present in WWW pages approximately 50% more often than Chinese, and along with the estimate of French pages above supports conclusions reached by market research companies, an example of which follows:
It is worth noting that not all WWW pages containing Japanese text are confined to the .jp domain of addresses. Many large Japanese companies, such as NEC, Sony and Fujitsu, choose to use addresses from the generic .com domain, while some Japanese organizations use addresses from the .org domain. This is the reason for the difference in page numbers between the Japanese pages and the jp domain pages which can be seen in Figure 2. It is also worth noting that Japanese pages are not often accessed by non-Japanese, the main exceptions being language learners and scholars of Japanese studies.
Compared to many of the application systems which operate over the Internet, such as email, the WWW was developed at a time when the issues associated with the coding of different languages were well recognized, and hence structures were in place from the beginning to enable languages such as Japanese to be readily handled in WWW pages in a standard fashion. Also, although the take-up of personal computers in Japan had initially been slower than in other developed countries, partly due to the difficulties associated with the input of Japanese text, by the time of the introduction of the WWW in the early 1990s these problems had largely been overcome, and computer usage was rising rapidly. Thus there seemed to be no barrier to the use of the Japanese script in WWW pages. It is appropriate to test this hypothesis by determining if there is significant use of English or of romanized Japanese in WWW pages originating in Japan and for domestic use.
Examination of WWW pages in the .jp domain does indicate a very low level of use of romanized Japanese. Common words such as kuruma and arimasu appeared in 12,300 and 1,200 pages respectively, which is less than 0.1% of the occurrences of the words in Japanese script (車 and あります). Inspection of a selection of pages which used significant amounts of romanized Japanese indicated that it was almost entirely confined to material aimed at tourists and language learners at an elementary level.
Similarly, inspection of WWW pages in the .jp domain which contain English text does not provide any evidence that English is used for primary communication. Many WWW pages in Japan do contain some English, however it is clear that this material falls into two categories:
Japanese Text in WWW Pages
The Japanese language is rich and varied with many styles and levels of register which are reflected in such things as vocabulary, honorific forms and conjugations. It is instructive to examine the language being used in Japanese WWW pages to identify and quantify the styles, levels, etc. in use.
One method of carrying out such an examination would be to collect a large sample of pages, and examine the text, either manually or automatically, with a view to classification. Such a task would be complex, and would require significant resources to gather a large enough collection of pages to ensure the sample was random. As the only way of identifying a WWW page is to follow a link from another page, the inherent tree structure in WWW pages is a significant impediment to randomized selection of pages.
The method used in the present study is to draw on the collections of pages assembled by search engines, in this case Google, and examine the frequency of occurrence of selected words and phrases which can be associated with particular language styles and levels. For the purposes of this study, the key data is the number of times a word or phrase occurs, not the ranking of the pages. In carrying out this examination, only pages in the .jp domain have been examined, and in addition pages have been limited to those identified by Google as containing Japanese text, since kanji are shared with Chinese and it is necessary to avoid counting pages containing Chinese text.
Using common Japanese words such as が, を, etc. it is possible to establish the approximate distribution of WWW pages across the second-level domains in Japan. These are as set out in the following table.
It is possible to determine the percentages of pages containing selected words. For example, for the words 経済 (keizai: economics; business; finance; economy), 政府 (seifu: government; administration) and 写真 (shashin: photograph), we see the following pattern of occurrences. (The figures are the percentages of pages in the domain containing the word.)
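The quantity reported is simply the share of a domain's pages containing the word; a minimal sketch (the counts below are hypothetical placeholders, not the study's figures):

```python
def percentage(word_pages: int, domain_pages: int) -> float:
    """Percentage of a domain's indexed pages that contain a given word."""
    return 100.0 * word_pages / domain_pages

# Hypothetical example: 120,000 of 2,000,000 go.jp pages containing 政府.
print(round(percentage(120_000, 2_000_000), 1))  # 6.0
```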
The levels of occurrence of these words across the domains are consistent with what one would expect: that government pages are more likely to be dealing with administrative matters, and that private and company pages are more likely to have or be discussing photographs.
Turning to some examples of relatively formal Japanese, we examined ございます (gozaimasu: a polite form of the copula), いただきます (itadakimasu: a polite form of the humble verb "to receive"), and である (dearu: a plain form of the copula). The first two usually occur only in spoken language and reported speech, while the latter usually occurs only in written language.
It is not surprising that the spoken forms occur relatively rarely in WWW pages. This is even more evident when colloquialisms are examined, for example すごく (sugoku: awfully, extremely) and どうかな (dou ka na: I wonder...).
That these words occur at all in academic or government pages is surprising, however on inspection it emerged that they were mostly due to reported speech in newsletters and similar publications.
This study has examined the adoption of the Internet and the WWW in Japan from the perspective of the impact of the underlying dominance of the English alphabet, the adaptations made to use Japanese scripts, and the use of the Japanese language in WWW material. It has also briefly examined the usage of the WWW in terms of numbers of pages compared with other languages.
The conclusions reached are that: (a) the English alphabet still dominates in structural aspects of the Internet and WWW, such as in addresses, and although there has been an attempt to address this in Japan, it has met with limited success to date; (b) the WWW has been fully developed in Japan in the local language and script, in a manner which appears to be consistent with other communications media; there is no evidence of English or romanized Japanese playing a significant role in the use of the WWW by Japanese people; (c) the level of penetration of the WWW in Japan is high and is comparable with major European languages, although on a per-capita basis it is well behind English. The penetration is still significantly higher than is the case with Chinese, on both absolute and per-capita bases.