Difference between revisions of "KANJIDIC Project"

From EDRDG Wiki
(Introduction)
(The KANJIDIC Project)
Line 17: Line 17:
 
** a [http://www.edrdg.org/kanjidic/kd2examph.html sample entry]
 
 
* the KANJIDIC file, which is in [https://en.wikipedia.org/wiki/Extended_Unix_Code#EUC-JP EUC-JP] coding and covers the 6,355 kanji in JIS X 0208. For this there is the
 
 +
 
** [http://www.edrdg.org/kanjidic/kanjidic_doc.html original documentation]
 
 
* the KANJD212 file, which also is in EUC-JP coding and covers the 5,801 kanji in JIS X 0212. For this there is the
 
 
** [http://www.edrdg.org/kanjidic/kanjd212_doc.html original documentation]
 
 
There is also a [http://www.edrdg.org/kanjidic/kanjidic.html combined overview] of the KANJIDIC/KANJD212 files.
 
 +
==Content & Format==
 +
The database and distributed data files contain an entry for each of the kanji, with each entry containing a number of fields of data about the kanji. The data is described in the following table. The format of the distributed files is as follows:
 +
* the KANJIDIC and KANJD212 files are text files with one line per kanji and the information fields separated by spaces. The format of each line is:
 +
** the kanji itself followed by the hexadecimal form of the JIS "ku-ten" coding, e.g. "亜 3021";
 +
** information fields beginning with one- or two-letter codes as per the table below. For example "S10" indicates a stroke count of 10;
 +
** the Japanese readings of the kanji. ON readings (音読み) are generally in ''katakana'' and KUN readings (訓読み) in ''hiragana''. An exception is the set of ''kokuji'' for measurements such as centimetres, where the reading is in ''katakana''. There may be several classes of reading fields, with ordinary readings first, followed by members of the other classes, if any. The current other classes, and their tagging, are:
 +
***where the kanji has special ''nanori'' (i.e. name) readings, these are preceded by the marker "T1";
 +
***where the kanji is a radical, and the radical name is not already a reading, the radical name is preceded by the marker "T2".

Revision as of 03:06, 6 September 2018

The KANJIDIC Project

(Note that this page is in the process of being rewritten, so please be patient with any aspects that seem incomplete.)

Introduction

The KANJIDIC project, which began in 1991, has the goal of compiling and distributing comprehensive information on the kanji used in Japanese text processing. It covers the 13,108 kanji in three main Japanese standards:

Three data files are distributed by this project:

  • the KANJIDIC2 file, which is in XML format and Unicode/UTF-8 coding, and contains information about all 13,108 kanji. For this file the following information is available:
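  • the KANJIDIC file, which is in EUC-JP coding and covers the 6,355 kanji in JIS X 0208. For this there is the
    • original documentation
  • the KANJD212 file, which is also in EUC-JP coding and covers the 5,801 kanji in JIS X 0212. For this there is the
    • original documentation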

There is also a combined overview of the KANJIDIC/KANJD212 files.
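
Because the KANJIDIC and KANJD212 files use EUC-JP rather than UTF-8, a program has to name the encoding explicitly when opening them. The following is a minimal Python sketch of that step, not part of the project's own tooling. The function name `read_kanjidic_lines` and the abbreviated sample line are assumptions made for illustration; a temporary file stands in for the real download.

```python
import os
import tempfile


def read_kanjidic_lines(path):
    """Yield the non-empty lines of an EUC-JP encoded text file (hypothetical helper)."""
    with open(path, encoding="euc_jp") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line


# Abbreviated, made-up sample line; real entries carry many more fields.
sample_bytes = "亜 3021 S7\n".encode("euc_jp")

# Write the bytes to a temporary file so the reader can be exercised end to end.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(sample_bytes)
    tmp_path = tmp.name

lines = list(read_kanjidic_lines(tmp_path))
os.remove(tmp_path)
print(lines)
```

The same function should work on either legacy file, since both are described as EUC-JP text; KANJIDIC2 is XML in UTF-8 and would need an XML parser instead.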

Content & Format

The database and distributed data files contain an entry for each of the kanji, with each entry containing a number of fields of data about the kanji. The data is described in the following table. The format of the distributed files is as follows:

  • the KANJIDIC and KANJD212 files are text files with one line per kanji and the information fields separated by spaces. The format of each line is:
    • the kanji itself followed by the hexadecimal form of the JIS "ku-ten" coding, e.g. "亜 3021";
    • information fields beginning with one- or two-letter codes as per the table below. For example "S10" indicates a stroke count of 10;
    • the Japanese readings of the kanji. ON readings (音読み) are generally in katakana and KUN readings (訓読み) in hiragana. An exception is the set of kokuji for measurements such as centimetres, where the reading is in katakana. There may be several classes of reading fields, with ordinary readings first, followed by members of the other classes, if any. The current other classes, and their tagging, are:
      • where the kanji has special nanori (i.e. name) readings, these are preceded by the marker "T1";
      • where the kanji is a radical, and the radical name is not already a reading, the radical name is preceded by the marker "T2".
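
The line format described above can be split into its parts with a short routine. The sketch below is illustrative only and covers just the elements listed here: the kanji, the JIS code, the coded information fields, and the readings with their "T1" and "T2" classes. The names `parse_kanjidic_line`, `has_kana` and `stroke_count` are hypothetical, the sample line is abbreviated, and the rule "a token containing kana is a reading" is an assumption of this sketch rather than something the file format guarantees.

```python
def has_kana(token):
    """Return True if the token contains any hiragana or katakana character."""
    return any("\u3040" <= ch <= "\u30ff" for ch in token)


def parse_kanjidic_line(line):
    """Split one KANJIDIC/KANJD212 line into the parts described above."""
    tokens = line.split()
    entry = {
        "kanji": tokens[0],
        "jis_hex": tokens[1],
        "fields": [],         # coded information fields such as "S7"
        "readings": [],       # ordinary ON and KUN readings
        "nanori": [],         # name readings, introduced by "T1"
        "radical_names": [],  # radical names, introduced by "T2"
    }
    current = "readings"
    for token in tokens[2:]:
        if token == "T1":
            current = "nanori"
        elif token == "T2":
            current = "radical_names"
        elif has_kana(token):
            entry[current].append(token)
        else:
            entry["fields"].append(token)
    return entry


def stroke_count(entry):
    """Return the value of the first "S" field as an int, or None if absent."""
    for field in entry["fields"]:
        if field.startswith("S") and field[1:].isdigit():
            return int(field[1:])
    return None


# Abbreviated sample line for illustration; real entries carry many more fields.
sample = "亜 3021 S7 ア つ.ぐ T1 や つぎ"
entry = parse_kanjidic_line(sample)
print(entry["kanji"], entry["jis_hex"], stroke_count(entry))
print(entry["readings"], entry["nanori"], entry["radical_names"])
```

Anything in a real line that falls outside the elements described above is not handled specially here and would end up in `fields`.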