Kanji and Reading Information Fields
Introduction
The JMdict dictionary structure, both database and XML file, begins with two fields/entities which contain:
- the surface form(s) of the terms (if any) that contain one or more kanji or special characters;
- the surface form(s) of the readings or hiragana/katakana versions of the terms.
A number of information fields may be associated with these surface forms to provide information about their source, status, etc. These fields take the form of a tag associated with the form: in the database, they appear, for example, as "[ateji]", and in the XML they are included in the <ke_inf> element for kanji forms and the <re_inf> element for reading forms.
This page describes the major information fields associated with the kanji and reading terms.
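As a rough illustration of how these tags sit alongside the forms in the XML, the following Python sketch reads a small hand-written fragment. The fragment is an assumption made for this example: it copies the <k_ele>/<keb>/<ke_inf> and <r_ele>/<reb>/<re_inf> layout, but writes the tag codes as plain text, whereas the distributed file uses XML entities that need the DTD to resolve. The helper names forms_with_tags and render are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the shape of JMdict entries (illustrative only).
# Tag codes are plain text here so the fragment parses without the DTD.
SAMPLE = """
<JMdict>
  <entry>
    <k_ele><keb>合気道</keb></k_ele>
    <k_ele><keb>合氣道</keb><ke_inf>oK</ke_inf></k_ele>
    <r_ele><reb>あいきどう</reb></r_ele>
  </entry>
  <entry>
    <k_ele><keb>客車</keb></k_ele>
    <r_ele><reb>きゃくしゃ</reb></r_ele>
    <r_ele><reb>かくしゃ</reb><re_inf>rk</re_inf></r_ele>
  </entry>
</JMdict>
"""


def forms_with_tags(entry):
    """Return (kanji_forms, reading_forms), each a list of (text, [tags])."""
    kanji = [
        (k.findtext("keb"), [inf.text for inf in k.findall("ke_inf")])
        for k in entry.findall("k_ele")
    ]
    readings = [
        (r.findtext("reb"), [inf.text for inf in r.findall("re_inf")])
        for r in entry.findall("r_ele")
    ]
    return kanji, readings


def render(text, tags):
    """Format a form the way the database view shows it, e.g. 合氣道[oK]."""
    return text + "".join(f"[{tag}]" for tag in tags)


for entry in ET.fromstring(SAMPLE).findall("entry"):
    kanji, readings = forms_with_tags(entry)
    print("; ".join(render(text, tags) for text, tags in kanji + readings))
```

Keeping the tags as a list per form, rather than per entry, mirrors the point above that each tag belongs to one specific surface form.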
Kanji Information Fields
[ateji] Form With Kanji Used Phonetically
The [ateji] (当て字) tag is used with kanji forms where one or more of the kanji are used to phonetically represent native or borrowed words with less regard to the underlying meaning of the characters. Approximately 600 entries have kanji forms with this tag.
A typical example is the term 寿司 (すし: sushi), where the kanji are used for the す and し readings and have no relationship to the meaning of the term.
[iK] Form Containing Irregular Kanji
The [iK] tag is associated with kanji forms and is currently defined as a "term containing irregular kanji usage". It is currently used in about 650 entries. It probably should be redefined as "form containing irregular kanji".
The purpose of the tag is to inform users that one or more of the kanji in the surface form are incorrect. The tag dates from the early days of the project and is typically associated with kanji that look very similar to the correct ones; for example, the 旧弊 entry also has the 旧幣 form with an [iK] tag.
Most instances of the [iK] tag can probably be replaced with the more recently introduced [sK] tag, as there is little reason to keep these erroneous forms visible.
There are cases where the tag will continue to be useful. For example, in the 痤瘡 (acne) entry, the 座瘡 form is also included. This form was commonly used, along with the ざ瘡 form, as the correct "痤" kanji was not initially available in the kanji character codes. As the 座瘡 form is still widely used, it probably should continue to be visible.
[oK] Form Containing 旧字体/Old Kanji
The [oK] tag is associated with kanji forms and is currently defined as a "term containing outdated kanji or kanji usage". It is currently used in about 700 entries. It probably should be redefined as "form containing a kyūjitai/old character form" or something similar. (This has been discussed in issue #103.)
The purpose of the tag is to inform users that one or more of the kanji in the surface form is of an older variety (旧字体 or similar) and that forms using more recent kanji (e.g. 新字体) are in use. An example is the 合気道/合氣道 entry, where the 合氣道 form is tagged as [oK]. Other kanji pairs which could lead to use of the tag include 国/國, 学/學 and 竜/龍.
[rK] Form Containing Rarely-used Kanji
The [rK] tag is associated with kanji forms and is currently defined as "rarely-used kanji form". It is currently used in 3,006 entries.
The purpose of the tag is to inform users that the form is rarely used in comparison with other kanji form(s) or kana-only forms, but is being kept visible because it occurs in major references such as 国語辞典. It would typically be added to forms that occur with frequencies less than 3% of those of the more common forms.
An example is the 付近 (neighborhood) entry, which also has the 附近 form with an [rK] tag. The respective n-gram counts are 7,671,290 (95.3%) and 85,884 (1.1%). The 附近 form is in most reference dictionaries, and hence should not be hidden.
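The comparison in this example can be sketched in a few lines of Python. This is only an illustration: the function name and the 3% constant are taken from the description above rather than from any project tooling, and the sketch assumes the threshold is applied to the ratio between the rare form's count and the most common form's count. The percentages quoted above appear to be shares of all forms in the entry, so they differ slightly from this ratio.

```python
RK_RATIO_THRESHOLD = 0.03  # "less than 3%" as described above (assumed cut-off)


def frequency_ratio(rare_count, common_count):
    """Ratio of a rare form's n-gram count to the most common form's count."""
    return rare_count / common_count


# N-gram counts quoted above for 付近 and 附近.
ratio = frequency_ratio(85_884, 7_671_290)
below_threshold = ratio < RK_RATIO_THRESHOLD

print(f"附近 occurs at {ratio:.1%} of the frequency of 付近")
print("below the [rK] threshold:", below_threshold)
```

Falling below the threshold only makes a form a candidate: as noted above, the tag is meant for forms that also occur in major references.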
The [rK] tag is not typically used for forms in entries tagged as being archaic.
There has been discussion of this tag in the associated GitHub issue.
[sK] Kanji Form Recommended as Search-Only
The [sK] tag is associated with kanji forms and is currently defined as "search-only kanji form". It is currently used in almost 4,000 entries.
The purpose of the tag is to indicate to applications using the database that the form should be used as a lookup key for the entry, but should not be included in a regular entry display due to its relative rarity when compared with the other forms.
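A minimal sketch of how an application might honour this distinction is shown below. The Form and Entry classes and the two function names are hypothetical and are not part of any JMdict tooling. The sample data uses the forms of the 向こう岸 entry described on this page, and the [sk] tag for kana forms is treated the same way.

```python
from dataclasses import dataclass, field

SEARCH_ONLY_TAGS = {"sK", "sk"}  # search-only kanji and kana tags


@dataclass
class Form:
    text: str
    tags: list = field(default_factory=list)


@dataclass
class Entry:
    forms: list


def build_index(entries):
    """Map every surface form, search-only or not, to the ids of its entries."""
    index = {}
    for entry_id, entry in enumerate(entries):
        for form in entry.forms:
            index.setdefault(form.text, set()).add(entry_id)
    return index


def display_forms(entry):
    """Return the forms to show in a regular display, omitting search-only ones."""
    return [f.text for f in entry.forms if not SEARCH_ONLY_TAGS & set(f.tags)]


mukougishi = Entry([
    Form("向こう岸"),
    Form("向う岸", ["io"]),
    Form("向岸", ["io"]),
    Form("むこう岸", ["sK"]),
    Form("向こうぎし", ["sK"]),
])

index = build_index([mukougishi])
print(index["むこう岸"])          # the search-only form still finds the entry
print(display_forms(mukougishi))  # but it is left out of the display
```

The index is built from all forms while the display list is filtered, so a user who types a rare form still reaches the entry without that form cluttering the result.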
The types of kanji forms which would receive the [sK] tag include:
- uncommon 混ぜ書き forms. For example in the 向こう岸 entry, the むこう岸 and 向こうぎし forms, while valid and occurring "in the wild", have been given this tag as they are quite rare.
- forms containing uncommon variant kanji such as 異体字 and 旧字体. For example, in the 倭寇/わこう entry, the 倭冦 form uses the rare 冦 variant and has received the tag. Typically a 3% threshold would be used for splitting 旧字体 forms into [oK] and [sK]. (Note that if the forms also occur in reference dictionaries, they will typically be given an [rK] tag instead and remain visible in entry displays.)
- forms containing incorrect kanji, e.g. as a result of a 変換ミス during entry.
- uncommon irregular okurigana forms.
The criteria for assigning the [sK] tag will vary according to the circumstances. In the cases of uncommon variant kanji and irregular okurigana forms an n-gram-based threshold of about 5% of occurrences would typically apply. For 混ぜ書き forms, the threshold may be as high as 20%. For 変換ミス cases, there would be no particular threshold unless there was a specific reason to keep the form visible.
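The criteria above can be summarised as a small lookup table. The following Python sketch is only one reading of them: the category names, the sk_candidate function and the keep_visible flag are hypothetical, and the thresholds are described above as approximate rather than fixed rules.

```python
# Approximate share-of-occurrence thresholds taken from the paragraph above.
SK_THRESHOLDS = {
    "variant_kanji": 0.05,
    "irregular_okurigana": 0.05,
    "mazegaki": 0.20,  # "may be as high as 20%"
}


def sk_candidate(kind, share=None, keep_visible=False):
    """Suggest whether a kanji form could be tagged [sK].

    kind         -- one of the SK_THRESHOLDS keys, or "conversion_error"
    share        -- the form's share of the entry's n-gram counts (0.0 to 1.0)
    keep_visible -- True if there is a specific reason to keep the form shown,
                    e.g. it appears in a major reference dictionary
    """
    if keep_visible:
        return False
    if kind == "conversion_error":
        return True  # no particular threshold applies to 変換ミス forms
    return share < SK_THRESHOLDS[kind]


print(sk_candidate("mazegaki", share=0.15))
print(sk_candidate("variant_kanji", share=0.08))
print(sk_candidate("conversion_error"))
```

The result is a suggestion only; the final choice between [sK], [rK] and the other tags remains an editorial decision.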
There is some discussion of the [sK] tag in the associated GitHub issue.
[io] Form Containing Irregular Okurigana Usage
The [io] tag is associated with kanji forms and is currently defined as "irregular okurigana usage". It is currently used in about 900 entries.
The purpose of the tag is to inform users that the okurigana form is not that which is generally used. The tag is only used on forms that are visible, i.e. do not have an [sK] tag.
An example is the 向こう岸 (opposite bank) entry, which includes the 向う岸 and 向岸 forms, both of which have [io] tags. As they have low n-gram counts (3.8% and 1.9% respectively), they could be made [sK]; however, as 向う岸 is in 広辞苑, it may be preferable to keep it visible.
Reading Information Fields
[gikun] Form With Reading Based on the Meaning
The [gikun] (義訓) tag is used with readings that are based on the meaning of the term and not the readings of the kanji form. Approximately 130 entries have readings with this tag.
Typical examples of terms where the reading has the [gikun] tag are 今日/きょう, 田舎/いなか, 明日/あした and 煙草/タバコ.
Note that the tag only applies to the kanji in the term and not to the parts in kana, such as okurigana or the inflecting parts of verbs and adjectives.
[ik] Form Containing Irregular Kana Usage
The [ik] tag is usually associated with kanji and reading forms and is currently defined as a "term containing irregular kana usage". It is currently used in about 460 entries, mostly in the reading fields; however, several of these are in the process of being edited to replace the tags with [sK] or [sk]. The tag should probably be redefined as "form containing irregular kana usage".
The usual purpose of the tag is to inform users that the form contains a kana sequence that either does not match the accepted readings of the matching kanji or, in the case of a 外来語, does not represent a correct transliteration. Examples include:
- the お待ち遠様/お待ちどおさま entry, where the form お待ちどうさま is included with the [ik] tag. The どう reading is irregular for the 遠 kanji. N-gram count data reveals that the お待ちどうさま form and the matching おまちどうさま reading amount to over 65% of the usage of the term, despite them not being in most references.
- the アサインメント (assignment) entry, which includes the incorrect, albeit common, アサイメント form.
[sk] Kana Form Recommended as Search-Only
The [sk] tag is associated with kana forms and applies to readings, loanwords in katakana, etc. It is currently used in about 1,300 entries.
The purpose of the tag is to indicate to applications using the database that the form should be used as a lookup key for the entry, but should not be included in a regular entry display due to its relative rarity when compared with the other forms.
The types of kana forms which would receive the [sk] tag include:
- rare or irregular readings of kanji forms, typically those which make up less than 5% of the counts of the regular readings. (More common cases would typically be tagged as [ik] or [rk].)
- rare or irregular versions of terms typically written only in hiragana.
- irregular transcriptions of loanwords. For example, "archive" is usually transcribed as アーカイブ, which accounts for about 99% of the usage; however, it is occasionally written アーカイヴ, and this latter form has been tagged as [sk].
[rk] Rarely-used Kana Form
The [rk] tag is associated with kana forms, particularly readings of kanji forms, and is currently defined as "rarely-used kana form". It is currently used in about 50 entries.
The purpose of the tag is to inform users that the form is rarely used in comparison with other kana forms, but is being kept visible because it occurs in major references such as 国語辞典. It would typically be added to forms that occur with frequencies less than 10% of those of the more common forms.
An example is the entry for 客車 (passenger car), which is usually read きゃくしゃ. The かくしゃ reading is rare, but is being kept visible with this tag as it is in at least one reference.
[ok] Outdated or Obsolete Kana Form
The [ok] tag is used to indicate that a kana form, usually the reading of a kanji or kanji compound, is now considered out-of-date or obsolete. Approximately 800 entries have one or more kana forms with this tag.
The usual reason for attaching the tag to a kana form is that one or more 国語辞典 flag it as old kana usage. For example, the term 分身 (other self; alter ego) is usually read as ぶんしん. 大辞林 adds 〔古くは「ふんじん」とも〕, and thus the ふんじん reading is included with an [ok] tag.