JMdict: Next Generation
- 1 The Next Generation of JMdict
- 1.1 Introduction
- 1.2 Element Changes
- 1.3 Part-of-Speech Separation
- 1.4 Additional Attribute Values
- 1.5 Additional Entity Values
The Next Generation of JMdict
Introduction
This page has been set up to record proposed changes to the JMdict microstructure, i.e. the way the information in the dictionary is recorded and laid out. The file as distributed is in XML format and the structure is defined in the JMdict DTD (document type definition). The current DTD can be viewed here, and a sample of an entry here.
The changes are likely to involve:
- additional or changed XML elements. These are the data items or groups of items that carry the information. For example, the kanji forms of a Japanese term are located in "keb" elements within a "k_ele" element.
- additional or changed attributes. These attributes and their values provide information about the element, for example the "gloss" element uses the "xml:lang" attribute to carry the language code.
- additional or changed entity values. These are standardized codes covering such things as part-of-speech, dialects, etc.
Element Changes
Entry-wide Information Elements
It is proposed to introduce, between the reading (<r_ele>) and sense (<sense>) elements, an information element for carrying relevant information about the lexical item as a whole. At present such information can only be recorded about senses. Thus the top level of the DTD would change from:
- <!ELEMENT entry (ent_seq, k_ele*, r_ele+, sense+)>
to:
- <!ELEMENT entry (ent_seq, k_ele*, r_ele+, info*, sense+)>
There could be zero, one or more <info> elements. The contents would be unstructured text, and an attribute (inf_type) would be used to indicate the type of information, e.g. literal translation, derivation, etc. The DTD description would be:
- <!ELEMENT info (#PCDATA)>
- <!ATTLIST info inf_type CDATA #IMPLIED>
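As a rough illustration of how an application might read the proposed <info> elements, here is a minimal Python sketch using the standard library's ElementTree. The sample entry, its sequence number, the inf_type values "lit" and "deriv" and the info texts are all invented for the example; the proposal does not yet fix any of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry following the proposed content model
# (ent_seq, k_ele*, r_ele+, info*, sense+). The sequence number, the
# inf_type values and all text content are placeholders, not real data.
SAMPLE = """
<entry>
  <ent_seq>9999999</ent_seq>
  <r_ele><reb>れい</reb></r_ele>
  <info inf_type="lit">placeholder literal translation</info>
  <info inf_type="deriv">placeholder derivation note</info>
  <info>placeholder note without a type</info>
  <sense><gloss>example</gloss></sense>
</entry>
"""

def entry_info(entry):
    """Return (inf_type, text) pairs for the entry-wide <info> elements."""
    # inf_type is declared #IMPLIED, so it may be absent; get() then
    # returns None rather than raising.
    return [(el.get("inf_type"), el.text or "") for el in entry.findall("info")]

entry = ET.fromstring(SAMPLE)
for inf_type, text in entry_info(entry):
    print(inf_type, "->", text)
```

Because the content is unstructured text, the sketch returns it untouched and leaves any interpretation of inf_type to the caller.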
Entry-wide Language Source Elements
It is proposed to move the current <lsource> element from within the <sense> element to become entry wide. The <lsource> element would retain its current attributes (xml:lang, ls_type, ls_wasei). As an example of this, the current アンジョ entry would simply see <lsource xml:lang="por">anjo</lsource> move from the first (and only) sense to be entry wide.
Implicit in this change is that entries, such as パン, which record loanwords from several source languages, will need to be split into an entry for each source language term.
(An earlier suggestion that the <dial> elements would also become entry wide as an attribute of <lsource> has been withdrawn.)
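The move described above could be sketched as a small conversion step. The following Python example is only an illustration under assumptions: the proposal does not say where the entry-wide <lsource> would sit, so the sketch places it immediately before the first <sense>; the helper name hoist_lsource is hypothetical; and the gloss text is a placeholder rather than the real entry content. Entries whose senses name more than one source language are rejected, since they would need to be split first.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Abridged current layout of the アンジョ entry: <lsource> inside <sense>.
# The gloss is a placeholder, not the actual dictionary text.
CURRENT = """
<entry>
  <r_ele><reb>アンジョ</reb></r_ele>
  <sense>
    <lsource xml:lang="por">anjo</lsource>
    <gloss>placeholder gloss</gloss>
  </sense>
</entry>
"""

def hoist_lsource(entry):
    """Move <lsource> elements from the senses to entry level.

    Assumption: entry-wide <lsource> elements go just before the first
    <sense>. Raises ValueError when several source languages are present,
    because such an entry has to be split by an editor first.
    """
    senses = entry.findall("sense")
    found = [(sense, ls) for sense in senses for ls in sense.findall("lsource")]
    if len({ls.get(XML_LANG) for _, ls in found}) > 1:
        raise ValueError("several source languages: split the entry first")
    position = list(entry).index(senses[0]) if found else 0
    seen = set()
    for sense, ls in found:
        sense.remove(ls)
        key = (ls.get(XML_LANG), ls.text)
        if key in seen:
            continue  # same source term repeated in another sense
        seen.add(key)
        entry.insert(position, ls)
        position += 1
    return entry

entry = hoist_lsource(ET.fromstring(CURRENT))
print(ET.tostring(entry, encoding="unicode"))
```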
Entry-wide Inflection Pattern Elements
It is proposed to include an entry-wide <infl> element containing information about conjugation or inflection patterns of the entry. This element would typically only be used for entries which comprise or end with a verb or adjective, and would indicate the appropriate inflections for tense, mood, aspect, etc. It would supplement and partially replace the present system where such information is embedded in the part-of-speech coding at the sense level (v1, v5m, adj-i, etc.). The format of the element has yet to be decided.
Pitch Accent Elements
It is proposed to provide for pitch accent information to be included with each reading of a Japanese term. This will be an additional element associated with each reading, and the proposed change to the DTD would be from:
- <!ELEMENT r_ele (reb, re_nokanji?, re_restr*, re_inf*, re_pri*)>
to:
- <!ELEMENT r_ele (reb, re_pa*, re_nokanji?, re_restr*, re_inf*, re_pri*)>
- <!ELEMENT re_pa (#PCDATA)>
There could be zero, one or several <re_pa> elements per reading. The actual format of the content of <re_pa> has yet to be decided; however, there should be the potential to support multiple systems for describing pitch accent information. A possible approach would be to have an attribute such as:
- <!ATTLIST re_pa pa_type CDATA #IMPLIED>
An example for entry 1584660 (明日/あした) might be "<re_pa pa_type="am">3</re_pa>" with the あした reading indicating an accent on the 3rd mora.
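To show how the pa_type attribute could let several notation systems coexist, here is a minimal Python sketch that reads the <re_pa> elements of one reading. It reuses the example above; treating "am" as "number of the accented mora" is an assumption drawn from that example, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Reading element built from the example in the text (entry 1584660).
# pa_type="am" is the sample value from the proposal; its meaning is
# assumed here, since the content format has not been decided.
SAMPLE = """
<r_ele>
  <reb>あした</reb>
  <re_pa pa_type="am">3</re_pa>
</r_ele>
"""

def pitch_accents(r_ele):
    """Return (reading, pa_type, value) for each <re_pa> of one reading."""
    reading = r_ele.findtext("reb")
    return [
        (reading, pa.get("pa_type"), (pa.text or "").strip())
        for pa in r_ele.findall("re_pa")
    ]

def by_system(r_ele, pa_type):
    """Keep only the values recorded under one notation system."""
    return [value for _, kind, value in pitch_accents(r_ele) if kind == pa_type]

r_ele = ET.fromstring(SAMPLE)
print(pitch_accents(r_ele))
print(by_system(r_ele, "am"))
```

A consumer that understands only one system can filter on pa_type and ignore the rest, which is the main point of carrying the system name as an attribute.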
The Wikipedia page on Japanese pitch accent contains some useful information.
At present the <xref> element within <sense> simply states a target surface form and, if specified, a sense number, e.g. "<xref>スライド・1</xref>". It is proposed to expand this by including the target entry sequence number and sense number as attributes, and also to allow for clearer identification of the preferred surface forms used in apps, etc. Examples:
- <xref type="see" seq="1073760" sno="1">スライド</xref>
- <xref type="see" seq="1375820" xr="なるほど">なるほど</xref>
- <xref type="see" seq="1585480" sno="2" xk="傀儡" xr="くぐつ">傀儡(くぐつ)</xref>
The attributes would be:
- type. Either "see" or "ant".
- seq. The sequence number of the target entry.
- sno. The sense within the target entry to which the cross-reference refers. If absent it will refer to the whole entry.
- xk. The kanji surface form in the target entry to be associated with the cross-reference. The default will be the first form in the kanji field of the target entry; however, it can be set during the creation or editing of the entry.
- xr. The reading surface form in the target entry to be associated with the cross-reference. Would only be used if there was no kanji field or if a specific reading is the target.
The text portion would be retained in a modified form as this makes it easier to generate legacy versions such as EDICT. The "・" (nakaguro) character would no longer be used to separate the parts of the target surface form as this character is used within some entry terms.
In addition an optional "dict" attribute would be available to indicate a cross-reference to a related dictionary, e.g.
- <xref type="see" dict="jmnedict" seq="5524869">朝日新聞</xref>
The current <ant> element would be removed and instead "<xref type="ant" ......>" would be used.
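The attribute list above maps naturally onto a small record type. The following Python sketch parses the proposed <xref> form into a dataclass; the class and field names are hypothetical, and treating seq as mandatory is an assumption based on the examples, all of which carry it.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

@dataclass
class XRef:
    kind: str                  # "see" or "ant"
    seq: int                   # sequence number of the target entry
    sense_no: Optional[int]    # None means the whole entry
    kanji: Optional[str]       # xk attribute, if given
    reading: Optional[str]     # xr attribute, if given
    dictionary: Optional[str]  # dict attribute, e.g. "jmnedict"
    text: str                  # legacy display text

def parse_xref(xml_text):
    """Parse one proposed-format <xref> element into an XRef record."""
    el = ET.fromstring(xml_text)
    kind = el.get("type")
    if kind not in ("see", "ant"):
        raise ValueError(f"unexpected xref type: {kind!r}")
    seq = el.get("seq")
    if seq is None:
        # Assumption: every new-style xref names its target entry.
        raise ValueError("xref has no seq attribute")
    sno = el.get("sno")
    return XRef(
        kind=kind,
        seq=int(seq),
        sense_no=int(sno) if sno is not None else None,
        kanji=el.get("xk"),
        reading=el.get("xr"),
        dictionary=el.get("dict"),
        text=el.text or "",
    )

print(parse_xref('<xref type="see" seq="1585480" sno="2" xk="傀儡" xr="くぐつ">傀儡(くぐつ)</xref>'))
print(parse_xref('<xref type="see" dict="jmnedict" seq="5524869">朝日新聞</xref>'))
```

With the target identified by seq and sno, the text content is needed only for display and for generating legacy formats, so the sketch keeps it verbatim and does not try to split it.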
Part-of-Speech Separation
At present the <pos> element within the <sense> element records both actual parts of speech, e.g. "n", "v5s", "adj-i", etc., as well as supplementary information that is not actually a POS, e.g. "adj-no", and general information which is not usually regarded as a POS at all, e.g. "exp", "int", etc.
It was proposed that an additional element <pos_sup> be introduced to record the information which is not an actual POS. After some discussion that proposal has been withdrawn. A better approach may be to simply document which <pos> values are actual parts of speech and which are supplementary information.
Additional Attribute Values
The main use of XML attributes in JMdict is to identify different types of <gloss> elements using the "g_type" attribute. At present the values used are "lit", "fig" and "expl". It is proposed to add the "descr" value to indicate a gloss which is a description of the Japanese term rather than a translation or an explanation of the meaning.
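A consumer of the file might group glosses by g_type so that descriptions can be displayed differently from translations. The Python sketch below is one way to do that; the sample sense and its gloss texts are placeholders, and rejecting unknown g_type values is a design choice of the sketch rather than anything the proposal requires.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The three current values plus the proposed "descr".
KNOWN_G_TYPES = {"lit", "fig", "expl", "descr"}

# Placeholder sense; the gloss texts are invented for the example.
SAMPLE = """
<sense>
  <gloss>placeholder translation</gloss>
  <gloss g_type="lit">placeholder literal gloss</gloss>
  <gloss g_type="descr">placeholder description of the term</gloss>
</sense>
"""

def glosses_by_type(sense):
    """Group gloss texts by g_type; plain glosses are filed under None."""
    groups = defaultdict(list)
    for gloss in sense.findall("gloss"):
        g_type = gloss.get("g_type")
        if g_type is not None and g_type not in KNOWN_G_TYPES:
            raise ValueError(f"unknown g_type: {g_type!r}")
        groups[g_type].append(gloss.text or "")
    return dict(groups)

print(glosses_by_type(ET.fromstring(SAMPLE)))
```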
Additional Entity Values
JMdict uses an extensive set of standard entity values for such things as part-of-speech tags, dialect names, fields, etc. It is proposed to add a number of additional values. Some which have been added recently are:
- Christn - term associated with Christianity, as with the current "Buddh" and "Shinto" values
- net-sl - Internet slang
- dated - dated term
- hist - historical term
- litf - literary or formal term
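For readers unfamiliar with how such values reach an application, the following Python sketch declares the values listed above as entities in an internal DTD subset and lets the XML parser expand them. The expansion texts are copied from the list above and may differ from the wording in the distributed DTD; the helper name internal_subset and the use of <misc> to carry "hist" and "net-sl" are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Entity names and descriptions taken from the list above.
NEW_ENTITIES = {
    "Christn": "term associated with Christianity",
    "net-sl": "Internet slang",
    "dated": "dated term",
    "hist": "historical term",
    "litf": "literary or formal term",
}

def internal_subset(entities):
    """Build a DOCTYPE with one <!ENTITY> declaration per value."""
    decls = "\n".join(
        f'<!ENTITY {name} "{text}">' for name, text in entities.items()
    )
    return f"<!DOCTYPE JMdict [\n{decls}\n]>"

# Minimal document that uses two of the entities.
DOC = internal_subset(NEW_ENTITIES) + """
<JMdict>
  <entry>
    <sense><misc>&hist;</misc><misc>&net-sl;</misc></sense>
  </entry>
</JMdict>
"""

root = ET.fromstring(DOC)
# The parser replaces each entity reference with its declared text.
print([m.text for m in root.iter("misc")])
```

Since the parser hands over the expanded text rather than the code, an application that needs the short code has to map the description back to the entity name, for example by inverting NEW_ENTITIES.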