Software Tools for Text Analysis

J.W. Breen

School of Computer Science & Software Engineering

Monash University

October 2001

This Technical Report is an expanded form of a Project Report submitted to the ILCAA in June 2001.

Abstract

This report describes work carried out at the Institute for the Study of the Languages and Cultures of Asia and Africa (ILCAA) at the Tokyo University of Foreign Studies during 2001. The work was within a project for the development and application of software to identify, extract, manipulate and analyze characters in a number of old text documents in Chinese and Japanese. The project has led to the development of software tools and techniques for the isolation and extraction of characters, along with their coordinates; identification of characters; and analysis of the spatial characteristics of and between characters.

Introduction

The Text Analysis software project was undertaken to determine whether special software techniques could be applied to assist the analysis of a number of texts being examined in the Information Resources Centre at ILCAA. Initially the target was to provide tools which could assist the manual processing of texts, removing some of the time-consuming manual aspects. However, as the project proceeded it became apparent that the techniques opened up different areas of application, and it was extended to include software which carried out some of the analysis as well.

In terms of outcomes, the project has resulted in the development of

  1. software tools and techniques for the isolation and extraction of characters, along with their coordinates;
  2. software for the:
    1. identification of characters (i.e. a form of OCR);
    2. analysis of the spatial characteristics of characters, in particular their size, shape, and precise placement within the source document;
    3. analysis of the spatial relationships between adjacent characters, and between columns of characters.

The project has demonstrated that tailored software has a considerable contribution to make in assisting the analysis of textual documents.

The Sekikei Texts

The first texts for which software was developed were the KaiCheng Sekikei (開成石経): Confucian canons carved in stone in about 837 CE during the Tang Dynasty in China, with the intention of setting the standards for both the texts and the character (glyph) forms. The stones, of which there are several thousand remaining, have been restored on several occasions. Copies have been published, usually based on inked-over sheets of paper, which show the script in a white-on-black form. An example of one page of these texts is in Figure 1.

The current project at ILCAA is to identify and classify the hanzi in the texts according to orthographical variations. Scanned images of the texts are being used, and the individual hanzi are being extracted and catalogued using image manipulation software such as Adobe Illustrator.

The software developed for the Sekikei texts was aimed at extracting the characters automatically, thus avoiding tedious manual methods, and improving the overall productivity of the project.

[Image: sekiex2.gif]

Figure 1: Example page from the KaiCheng Sekikei

The software typically processes each image by holding it in RAM as one pixel per byte. Since only a single bit is necessary for a pixel in a black/white image, the remaining bits can be used to hold status information during processing.
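
As a rough illustration of this layout, a byte-per-pixel buffer with spare bits reserved for processing status might be declared along the following lines (the type and flag names are illustrative only, not taken from the report's source code):

    /* Sketch of a one-byte-per-pixel image buffer.  Bit 0 holds the
     * pixel value; the remaining bits are free for processing status.
     * All names here are illustrative, not from the actual software. */
    #include <stdlib.h>

    #define PIX_INK     0x01   /* bit 0: the black/white value itself  */
    #define PIX_VISITED 0x02   /* bit 1: pixel already examined        */
    #define PIX_REMOVED 0x04   /* bit 2: excised during noise removal  */

    typedef struct {
        int width, height;
        unsigned char *pix;    /* width * height bytes, one per pixel  */
    } Image;

    static Image *image_new(int width, int height)
    {
        Image *im = malloc(sizeof *im);
        im->width = width;
        im->height = height;
        im->pix = calloc((size_t)width * height, 1);
        return im;
    }

    /* Access the byte for the pixel at column x, row y. */
    static unsigned char *pixel(Image *im, int x, int y)
    {
        return &im->pix[(size_t)y * im->width + x];
    }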

The basic approach taken was to:

  1. identify the "pages" of text within the scanned documents (typically these are in pairs on each document, surrounded by white borders);
  2. separate the columns of text (usually six per page);
  3. separate the characters (usually ten per column); a sketch illustrating steps 2 and 3 follows this list.
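
The report does not spell out the mechanism used to locate these breaks, but a technique consistent with the white-pixel density statistics mentioned below is a projection profile: counting the white (ink) pixels in each pixel column of the page and looking for low-density runs. A minimal sketch, assuming one byte per pixel with non-zero meaning white, and with the threshold left as a parameter:

    /* Sketch: projection profile for locating gaps between text columns
     * in a white-on-black page image.  The same idea applied along the
     * other axis, within one text column, locates character breaks.
     * Assumes one byte per pixel, non-zero = white (ink). */
    #include <stdio.h>

    /* Count the white pixels in each pixel column of the page. */
    static void column_profile(const unsigned char *pix, int width,
                               int height, long *profile)
    {
        for (int x = 0; x < width; x++) {
            long count = 0;
            for (int y = 0; y < height; y++)
                if (pix[(long)y * width + x])
                    count++;
            profile[x] = count;
        }
    }

    /* Report runs of pixel columns whose white density falls below a
     * threshold; such runs are candidate gaps between text columns. */
    static void report_gaps(const long *profile, int width, long threshold)
    {
        int start = -1;
        for (int x = 0; x <= width; x++) {
            int low = (x < width) && (profile[x] < threshold);
            if (low && start < 0)
                start = x;
            else if (!low && start >= 0) {
                printf("gap: columns %d to %d\n", start, x - 1);
                start = -1;
            }
        }
    }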

Some of the issues encountered in the development of the software were:

  1. the amount of "background" noise within the images. This appears to be due to such things as patterns in the paper, unevenness in the ink, the condition of the surface of the stone, etc. As accurate determination of the characters has employed statistical techniques using the density of white pixels on a black background, the treatment of noise is an important issue;
  2. the degree to which the images of the characters themselves are degraded, both through the effects of noise, as described above, and through damage to and restoration of the stones;
  3. other image artifacts, such as the edges of the stones, chips in the stones, folds in the paper, etc.
Figure 2 depicts a "raw" image. Note the background noise, fold mark, etc.

[Image: stage0.gif]

Figure 2: Raw image of a character

In order to determine the locations of the inter-column regions and the breaks between characters, it was necessary to deal with the levels of noise in the images, as the noise has the effect of masking the locations of the characters themselves. Removal of such image noise is a common aspect of OCR software, and a common technique is to assume that connected sets of pixels below a certain size are redundant and can therefore be excised from the image. This approach was implemented in the present software with considerable success.
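
A minimal sketch of this small-component removal, assuming the one-byte-per-pixel buffer described earlier (non-zero = white ink); the 4-connectivity, the flood fill and the size threshold are illustrative choices rather than the report's actual code:

    /* Sketch: remove 4-connected white regions smaller than min_size
     * pixels.  pix holds one byte per pixel, non-zero = white (ink). */
    #include <stdlib.h>

    static void remove_small_components(unsigned char *pix, int w, int h,
                                        long min_size)
    {
        static const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
        unsigned char *mark = calloc((size_t)w * h, 1);
        long *stack  = malloc((size_t)w * h * sizeof *stack);
        long *member = malloc((size_t)w * h * sizeof *member);

        for (long start = 0; start < (long)w * h; start++) {
            if (!pix[start] || mark[start])
                continue;
            long top = 0, n = 0;
            stack[top++] = start;
            mark[start] = 1;
            while (top > 0) {
                long p = stack[--top];
                long x = p % w, y = p / w;
                member[n++] = p;
                for (int d = 0; d < 4; d++) {
                    long nx = x + dx[d], ny = y + dy[d];
                    if (nx >= 0 && nx < w && ny >= 0 && ny < h) {
                        long q = ny * w + nx;
                        if (pix[q] && !mark[q]) {
                            mark[q] = 1;
                            stack[top++] = q;
                        }
                    }
                }
            }
            if (n < min_size)               /* too small: treat as noise */
                for (long i = 0; i < n; i++)
                    pix[member[i]] = 0;
        }
        free(member); free(stack); free(mark);
    }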

Figure 3 depicts the same image after a low level of noise removal and after stronger noise removal.

[Image: noise.gif]

Figure 3: Character after two levels of noise removal

It will be noted that the edges of the images are very rough, probably an outcome of the transfer of the image to paper using ink. Some experimentation was carried out on smoothing these edges in an attempt to obtain a more "solid" image. There are a number of techniques available for carrying out such edge smoothing. Among them are:

  1. determining a mathematical equation for each edge, e.g. a Bézier curve or B-spline, then re-establishing the edge using this derived curve;
  2. applying a spatial filter to the overall image using transform coding (these techniques are commonly used in image compression and transmission);
  3. using heuristics to remove sharp features on the image borders.

The spatial filter technique was tested using image enhancement software developed in Professor Henry Wu's laboratory at Monash University. The left-hand character in Figure 4 shows a character that has been enhanced using this technique. Note that the process has also resulted in portions of the character being eliminated.

The "sharp feature heuristic" was implemented by developing an algorithm which "walks" the black/white border of each character or portion of a character and fills in each fissure less than a few pixels in width. The right-hand character in figure 4 shows a character that has been enhanced using this technique. At the same time any residual noise within the character is removed.

[Image: smooth.gif]

Figure 4: Character after two types of image enhancement.

Both noise removal and image enhancement need to be handled very carefully in cases of severe image degradation. For example, in Figure 5, the elements of the 樂 (modern form: 楽) character are difficult to discern as a result of both severe noise in the picture and degradation of the strokes. In cases such as this it is usually best if the image extraction is carried out manually.

[Image: badchar.gif]

Figure 5: Badly degraded character with a high level of noise.

The Sekikei software is at the stage where it can fairly reliably extract characters from pages, and with modest further effort could be brought to the stage of automatically identifying and extracting most of the characters in the scanned texts. The software has been written in the C language for Unix/Linux platforms.

Work on this aspect of the project is currently suspended, as it was considered more appropriate to focus on another application of the software.

Movable Type Texts

Introduction

The Movable Type texts being studied were published in Japan from the late 16th century to the mid-17th century, initially by members of the Society of Jesus (Jesuits).

The texts were the first produced in Japan using movable metal type. Characters were initially carved in wood, and the (relief) carvings were used to produce first clay and then metal moulds. The final moulds were used to produce pieces of metal type (bronze or brass) for printing.

One of the reasons for studying the printed texts in detail is to determine key aspects of the type-setting techniques employed, and the relationship, if any, between the techniques used by the Jesuits and the techniques later used in the Edo period for printing Japanese texts.

The current text being studied is the "Guia do Pecador" (Guide for the Sinner), written in 1556 by Louis of Granada (1504-1588), a Spanish Dominican priest. The Japanese translation was published in 1599. Figure 6 shows the front page and first page of this publication. The document being studied is a scan of a photocopy of a copy in the Vatican Library. The scan is at 300 dpi, a resolution which matched the quality of the copy.

[Image: gext.gif]

Figure 6: Pages from the Guia do Pecador

Ligatures

One aspect of the texts of considerable interest is the use of ligatures for many of the kana sequences, and also the use of inter-character spacing for punctuation and phrase/clause separation. Figure 7 shows examples of ligatures for the kana sequences いへ, よき, なり and ども.

The Project

One of the projects under way with these texts is to identify and characterize the ligatures, and in particular to attempt to identify "inherent ligatures", i.e. cases where two characters or ligatures were cast in a single piece of metal type. In addition the inter-character and inter-line spacing is being studied in an attempt to determine whether single "inter" pieces (i.e. strips of metal to separate the rows or columns) were inserted in the type-setting process.

[Image: ligsamp.gif]

Figure 7: Examples of kana ligatures

Initially the measurements were to be carried out manually, using image manipulation software such as Adobe Illustrator. However, it was decided to attempt to apply the techniques developed in the Sekikei analysis to this problem. At the same time it was decided to extend the software to include a Graphical User Interface (GUI) so that it could also be used on Windows systems.

Software

In order to carry out the appropriate analysis of the texts, the following software has been developed:

  1. utility software which can process a scanned image of a page of text, identify the boundaries of each character and write to disk a file containing the locations of each character, and the individual character images. This utility can operate either via a GUI, or as a stand-alone program controlled by a script file or command-line options.
  2. utility software to analyze the character locations in sequence to determine such things as:
    1. the identity of the characters/ligatures (as required);
    2. the spacing between particular characters;
    3. the spacing between columns.
Figure 8 depicts the editing controls of the software, and a portion of an image being processed.

[Image: chared1.gif]

Figure 8: Example screens of the software in operation

Identification of Characters/Ligatures

As the precise boundaries of each printed character are required, simply searching for inter-column and inter-character spaces is not adequate.

The approach taken has been to:

  1. identify the horizontal and vertical coordinates of each character fragment, using a form of the algorithm developed for the Sekikei texts, which "walks" the black/white boundary of each shape in the image, accumulating the coordinates of the character's horizontal and vertical extremities;
  2. apply a rule base to aggregate the identified fragments of a character or ligature so that they can be defined by a single set of coordinates. The rule base is necessary to combine the fragments appropriately, without gratuitously combining the fragments with adjacent characters. The rule base covers such things as:
    1. overlapping fragments;
    2. small fragments close to each other;
    3. adjacent fragments.
    with the combination being subject to overall constraints on the sizes of characters. The nature of the script is such that a fairly tight horizontal constraint can be applied, while the presence of ligatures leads to some comparatively tall characters.
The most reliable method of combining the fragments involves a sequence of operations with varying parameters defining amounts of overlap.
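
A minimal sketch of one such pass, merging fragment bounding boxes that overlap or nearly overlap, subject to a width constraint; the slack and width parameters are placeholders, and the real rule base distinguishes more cases (overlapping, small-and-close, adjacent) than this single rule:

    /* Sketch: one aggregation pass over fragment bounding boxes.  Two
     * boxes are merged if they overlap or lie within `slack` pixels of
     * each other, provided the merged box does not exceed `max_width`.
     * The thresholds and the single merge rule are illustrative. */
    typedef struct { int x1, y1, x2, y2; int live; } Box;

    static int boxes_touch(const Box *a, const Box *b, int slack)
    {
        return a->x1 <= b->x2 + slack && b->x1 <= a->x2 + slack &&
               a->y1 <= b->y2 + slack && b->y1 <= a->y2 + slack;
    }

    static void merge_pass(Box *box, int n, int slack, int max_width)
    {
        for (int i = 0; i < n; i++) {
            if (!box[i].live)
                continue;
            for (int j = i + 1; j < n; j++) {
                if (!box[j].live || !boxes_touch(&box[i], &box[j], slack))
                    continue;
                int x1 = box[i].x1 < box[j].x1 ? box[i].x1 : box[j].x1;
                int x2 = box[i].x2 > box[j].x2 ? box[i].x2 : box[j].x2;
                if (x2 - x1 > max_width)   /* would breach the tight    */
                    continue;              /* horizontal constraint     */
                box[i].x1 = x1;
                box[i].x2 = x2;
                box[i].y1 = box[i].y1 < box[j].y1 ? box[i].y1 : box[j].y1;
                box[i].y2 = box[i].y2 > box[j].y2 ? box[i].y2 : box[j].y2;
                box[j].live = 0;           /* fragment j absorbed into i */
            }
        }
    }

Running such a pass repeatedly with progressively looser slack values mirrors the sequence of operations with varying parameters mentioned above.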

A number of problems still remain to be solved with regard to the correct vertical aggregation of character fragments, including:

  1. the failure to complete some tall characters, usually ligatures, when there has been an incomplete printing of part of the character;
  2. the erroneous joining of characters where they appear to overlap, possibly due to the spread of ink.

Processing the entire Guia do Pecador document (some 240 A4 pages) to extract the characters and record their coordinates takes about 2 hours with the present software.

Character Analysis

Analysis of the characters on each page typically requires the scanning of each column in turn. As the database of characters is not in this order, the analysis software finds the top right-hand character in the page of text, then "navigates" down the columns by calculating which is the closest following character. Each character is then compared with a small database of character information to determine if it is to be selected for study.
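
A sketch of this navigation step; the column tolerance, the use of bounding-box centres and the structure name are illustrative assumptions rather than details of the actual utility:

    /* Sketch: from the current character, pick the nearest unvisited
     * character below it in the same column.  Returns -1 when the end
     * of the column is reached. */
    #include <stdlib.h>

    typedef struct { int x1, y1, x2, y2; int visited; } CharBox;

    static int next_in_column(const CharBox *c, int n, int cur, int col_tol)
    {
        int best = -1;
        long best_gap = -1;
        for (int i = 0; i < n; i++) {
            if (c[i].visited || i == cur)
                continue;
            /* same column: horizontal centres within col_tol pixels */
            int dx = (c[i].x1 + c[i].x2) / 2 - (c[cur].x1 + c[cur].x2) / 2;
            if (abs(dx) > col_tol)
                continue;
            /* must start at or below the current character */
            long gap = c[i].y1 - c[cur].y2;
            if (gap < 0)
                continue;
            if (best < 0 || gap < best_gap) {
                best = i;
                best_gap = gap;
            }
        }
        return best;
    }

When no further character is found in the current column, the search restarts from the topmost unvisited character in the next column to the left.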

Character identification is done in two stages in order to speed up the evaluation:

  1. initially the height and width of a character are checked to see if they are close to any in the database. At present a 5-pixel range either side of the database character is used (approximately 1 mm);
  2. for characters that are close in dimensions, a series of pixel-by-pixel comparisons is made, positioning the character under test over the one in the database. The comparison is made successively for up to a 10-pixel offset in both the horizontal and vertical axes, and the lowest level of mismatch is selected. If this mismatch is below 8% of the pixels in the character, it is accepted as a match; a sketch of this two-stage comparison follows below.
The above approach has been tested exhaustively, and shown to be highly reliable for small numbers of characters in the database (up to 10), and reasonably reliable for larger numbers.
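
A minimal sketch of the two-stage comparison; the 5-pixel size window, the 10-pixel alignment search and the 8% threshold follow the description above, while the bitmap layout and names are assumptions:

    /* Sketch of the two-stage match.  Character images are assumed to
     * be one byte per pixel, non-zero = ink; the structure and helper
     * names are illustrative. */
    #include <stdlib.h>

    typedef struct { int w, h; const unsigned char *pix; } Glyph;

    /* Stage 1: dimensions within 5 pixels either side. */
    static int size_close(const Glyph *a, const Glyph *b)
    {
        return abs(a->w - b->w) <= 5 && abs(a->h - b->h) <= 5;
    }

    /* Count mismatching pixels with the candidate shifted by (ox, oy). */
    static long mismatch(const Glyph *ref, const Glyph *cand, int ox, int oy)
    {
        long bad = 0;
        for (int y = 0; y < ref->h; y++)
            for (int x = 0; x < ref->w; x++) {
                int cx = x + ox, cy = y + oy;
                int c = (cx >= 0 && cx < cand->w && cy >= 0 && cy < cand->h)
                            ? (cand->pix[cy * cand->w + cx] != 0) : 0;
                if ((ref->pix[y * ref->w + x] != 0) != c)
                    bad++;
            }
        return bad;
    }

    /* Stage 2: best alignment over a +/-10 pixel window; accept if the
     * mismatch is below 8% of the reference character's pixels. */
    static int glyphs_match(const Glyph *ref, const Glyph *cand)
    {
        if (!size_close(ref, cand))
            return 0;
        long best = -1;
        for (int oy = -10; oy <= 10; oy++)
            for (int ox = -10; ox <= 10; ox++) {
                long m = mismatch(ref, cand, ox, oy);
                if (best < 0 || m < best)
                    best = m;
            }
        return best * 100 < 8L * ref->w * ref->h;
    }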

The character analysis utility can operate in two modes:

  1. a "training" mode in which it steps through selected pages of the document and under instruction from the user, adds character details to the database along with identification information;
  2. in automatic mode in which it processes a complete page, recording information about matches to a log file. In this mode the utility will process a series of pages under the control of a command script.

Examination of the entire Guia do Pecador document for a small set of target characters takes on the order of two minutes.

The Search for Inherent Ligatures

As an initial test of the software an analysis of the いへ + とも sequence was made. This sequence occurs approximately 30 times in the document, and may have been cast as a single piece of type. Figure 9 shows an example of this pair.

[Image: ifutomo2.gif]

Figure 9: A kana and kanji sequence under analysis as a possible inherent ligature

The horizontal separation and vertical alignment of the pairs were identified and plotted in Figure 10.

[Image: ifutomoplot.gif]

Figure 10: Plot of the horizontal separation and vertical alignment of the pair of characters.

From analysis of the plot, it appears that the cluster of measurements in the top left-hand corner of the plot is probably within the scope of a single piece of type (note that 5 pixels correspond to approximately 1 mm). The few pairs with widely differing measurements are certainly using different pieces of type.

Inter Spacing

A brief analysis of the "inter" (inter-column) spacing was made using the recorded character coordinates. In several places it was found that characters in adjacent columns overlap.

In the text extract in Figure 11, the topmost kanji in the fourth column from the right (以) overlaps with two ふ kana in the third column by up to 17 pixels. This amount (approximately 2 mm) may be due in part to factors such as warping of the paper, spread of ink, etc. However, if repeated elsewhere in the document, it would tend to support the view that single inter spaces were not always used when the type was being set.
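
A sketch of how such an overlap can be read directly from the recorded character coordinates; the bounding-box structure and its field names are illustrative assumptions:

    /* Sketch: horizontal overlap, in pixels, between the bounding boxes
     * of two characters in adjacent columns (0 if they do not overlap).
     * Coordinates are page pixel coordinates. */
    typedef struct { int x1, y1, x2, y2; } BBox;

    static int horizontal_overlap(const BBox *a, const BBox *b)
    {
        int left  = a->x1 > b->x1 ? a->x1 : b->x1;   /* max of left edges  */
        int right = a->x2 < b->x2 ? a->x2 : b->x2;   /* min of right edges */
        return right > left ? right - left : 0;
    }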

[Image: mottop.gif]

Figure 11: Page of text with overlapping characters in adjacent columns.

Some Issues Arising from the Analysis

A number of issues have emerged from the work so far:

  1. the current project has clearly demonstrated the effectiveness and precision of tailored software techniques for carrying out detailed analysis of the scripts in the movable type texts.

  2. it has also demonstrated that the analysis is subject to many factors, such as the state of the copy of the text, variations in the printing, etc., which add to the complexity of the task of drawing appropriate conclusions from the analysis.

Future Work

The continuation of the analysis of the Guia do Pecador and other similar texts will involve additional work in a number of areas:

  1. creation of a new digital version of the source document with greater precision than the current copy. Fortunately a copy of high quality has recently been acquired by Sophia University in Tokyo, and permission has been obtained to make a direct digital image;
  2. development of a model for ligature analysis which can cater for variations in individual characters caused by such things as the spread of ink, the shrinkage of the paper, etc. The model should make the comparison in such a way as to distinguish between the character placement that occurs when characters are cast as a single piece of type and that which occurs when they are set by hand;
  3. enhancement of the character aggregation rules to overcome some remaining problems; Artificial Intelligence techniques such as training a neural network may be applicable;
  4. refinement of the character identification techniques to improve reliability, particularly for larger sets of characters.

Software Details

Most of the software has been written in the C language, and developed and tested using the Linux operating system. The Graphical User Interface module uses Tcl/Tk (Tool Command Language/Tool Kit). Tcl/Tk was chosen because it is available on Windows, Macintosh and Unix/Linux systems. The combination of C and Tcl/Tk has been effected using the `mktclapp' application environment.

The software has been ported to Windows 98 using the Cygwin (Cygnus GNU/Windows) system, which provides a Unix-like environment in Windows.

For further information about the software, refer to the installation and operation documentation.