This Technical Report is an expanded form of a Project Report submitted to the ILCAA in June 2001.
This report describes work carried out at the Institute for the Study of the Languages and Cultures of Asia and Africa (ILCAA) at the Tokyo University of Foreign Studies during 2001. The work was within a project for the development and application of software to identify, extract, manipulate and analyze characters in a number of old text documents in Chinese and Japanese. The project has led to the development of software tools and techniques for the isolation and extraction of characters, along with their coordinates; identification of characters; and analysis of the spatial characteristics of and between characters.
In terms of outcomes, the project has resulted in the development of
The project has demonstrated that tailored software can make a considerable contribution to the analysis of textual documents.
The current project at ILCAA is to identify and classify the hanzi in the texts according to orthographical variations. Scanned images of the texts are being used, and the individual hanzi are being extracted and catalogued using image manipulation software such as Adobe Illustrator.
The software developed for the Sekikei texts was aimed at extracting the characters automatically, thus avoiding tedious manual methods, and improving the overall productivity of the project.
The software typically processes each image by holding it in RAM at one pixel per byte. Since only a single bit is needed to represent a pixel in a black/white image, the remaining bits of each byte can be used to hold status information during processing.
The basic approach taken was to:
Some of the issues encountered in the development of the software were:
In order to determine the locations of the inter-column regions and the breaks between characters, it was necessary to deal with the noise in the images, which masks the locations of the characters themselves. Removal of such image noise is a common aspect of OCR software, and a common technique is to assume that connected sets of pixels below a certain size are redundant and can therefore be excised from the image. This approach was implemented in the present software with considerable success.
Figure 3 depicts the same image after a low level of noise removal and after stronger noise removal.
It will be noted that the edges of the images are very rough, probably an outcome of the transfer of the image to paper using ink. Some experimentation was carried out on smoothing these edges in an attempt to obtain a more "solid" image. There are a number of techniques available for carrying out such edge smoothing. Among them are:
The spatial filter technique was tested using image enhancement software developed in Professor Henry Wu's laboratory at Monash University. The left-hand character in Figure 4 has been enhanced using this technique. Note that the enhancement has also eliminated portions of the character.
The "sharp feature heuristic" was implemented as an algorithm which "walks" the black/white border of each character (or portion of a character) and fills in each fissure less than a few pixels in width; any residual noise within the character is removed at the same time. The right-hand character in Figure 4 has been enhanced using this technique.
Both noise removal and image enhancement need to be handled very carefully in cases of severe image degradation. For example, in Figure 5, the elements of the 樂 (modern form: 楽) character are difficult to discern as a result of both severe noise in the picture and degradation of the strokes. In cases such as this it is usually best if the image extraction is carried out manually.
The Sekikei software is at the stage where it can fairly reliably extract characters from pages, and with modest further effort could be extended to carry out the automatic identification and extraction of most of the characters in the scanned text. The software has been written in the C language for Unix/Linux platforms.
Work on this aspect of the project is currently suspended, as it was considered more appropriate to focus on another application of the software.
The Movable Type texts being studied were published in Japan from the late 16th century to the mid-17th century, initially by members of the Society of Jesus (Jesuits).
The texts were the first produced in Japan using movable metal type. Characters were initially carved in wood and the (relief) carvings were used to produce first clay then metal moulds. The final moulds were used to produce pieces of metal type (bronze or brass) for printing.
One of the reasons for studying the printed texts in detail is to determine key aspects of the type-setting techniques employed, and the relationship, if any, between the techniques used by the Jesuits and the techniques later used in the Edo period for printing Japanese texts.
The current text being studied is the "Guia do Pecador" (Guide for the Sinner) written in 1556 by Louis of Granada (1504-1588), a Spanish Dominican priest. The Japanese translation was published in 1599. Figure 6 shows the front page and first page of this publication. The document being studied is a scan of a photocopy of a copy in the Vatican library. The scan is at 300dpi, a resolution which matched the quality of the copy.
One aspect of the texts of considerable interest is the use of ligatures for many of the kana sequences, and also the use of inter-character spacing for punctuation and phrase/clause separation. Figure 7 shows examples of ligatures for the kana sequences いへ, よき, なり and ども.
One of the projects under way with these texts is to identify and characterize the ligatures, and in particular to attempt to identify "inherent ligatures", i.e. cases where two characters or ligatures were cast in a single piece of metal type. In addition the inter-character and inter-line spacing is being studied in an attempt to determine whether single "inter" pieces (i.e. strips of metal to separate the rows or columns) were inserted in the type-setting process.
Initially the measurements were to be carried out manually, using image manipulation software such as Adobe Illustrator. However it was decided to attempt to apply the techniques developed in the Sekikei analysis to this problem. At the same time it was decided to extend the software to include a Graphical User Interface (GUI) so that it could also be used on Windows systems.
In order to carry out the appropriate analysis of the texts, the following software has been developed:
Identification of Characters/Ligatures
As the precise boundaries of each printed character are required, simply searching for inter-column and inter-character spaces is not adequate.
The approach taken has been to:
A number of problems remain to be solved with regard to the correct vertical aggregation of character fragments, including:
Processing the entire Guia do Pecador document (some 240 A4 pages) to extract the characters and record their coordinates takes about 2 hours with the present software.
Analysis of the characters on each page typically requires the scanning of each column in turn. As the database of characters is not in this order, the analysis software finds the top right-hand character in the page of text, then "navigates" down the columns by calculating which is the closest following character. Each character is then compared with a small database of character information to determine if it is to be selected for study.
Character identification is done in two stages in order to speed up the evaluation:
The character analysis utility can operate in two modes:
Examination of the entire Guia do Pecador document for a small set of target characters takes on the order of two minutes.
The Search for Inherent Ligatures
As an initial test of the software an analysis of the いへ + とも sequence was made. This sequence occurs approximately 30 times in the document, and may have been cast as a single piece of type. Figure 9 shows an example of this pair.
The horizontal separation and vertical alignment of the pairs were identified and plotted in Figure 10.
From analysis of the plot, it appears that the cluster of measurements in the top left-hand corner of the plot is probably within the scope of a single piece of type (note that 5 pixels approximate to 1 mm). The few pairs with widely differing measurements are certainly using different pieces of type.
A brief study of the "inter" (inter-column) spacing was made by analyzing the character coordinates. In several places characters in adjacent columns were found to overlap.
In the text extract in Figure 11, the topmost kanji in the fourth column from the right (以) overlaps with two ふ kana in the third column by up to 17 pixels. This amount (approximately 2mm) may be due in part to factors such as warping of the paper, spread of ink, etc. However if repeated elsewhere in the document it tends to support a view that single inter spaces were not always used when the type was being set.
Some Issues Arising from the Analysis
A number of issues have emerged from the work so far:
The continuation of the analysis of the Guia do Pecador and other similar texts will involve additional work in a number of areas:
Most of the software has been written in the C language, and developed and tested using the Linux operating system. The Graphical User Interface module uses Tcl/Tk (Tool Command Language/Tool Kit). Tcl/Tk was chosen because it is available on Windows, Macintosh and Unix/Linux systems. The combination of C and Tcl/Tk has been effected using the `mktclapp' application environment.
The software has been ported to Windows 98 using the Cygwin (Cygnus GNU/Windows) system, which provides a Unix-like environment in Windows.
For further information about the software, refer to the installation and operation documentation.