JMdict: Next Generation

From EDRDG Wiki

The Next Generation of JMdict

Introduction

This page has been set up to record proposed changes to the JMdict microstructure, i.e. the way the information in the dictionary is recorded and laid out. The file as distributed is in XML format and the structure is defined in the JMdict DTD (document type definition). The current DTD can be viewed here, and a sample of an entry here.

The changes are likely to involve:

  • additional or changed XML elements. These are the data items or groups of items that carry the information. For example, the kanji forms of a Japanese term are located in "keb" elements within a "k_ele" element.
  • additional or changed attributes. These attributes and their values provide information about the element; for example, the "gloss" element uses the "xml:lang" attribute to carry the language code.
  • additional or changed entity values. These are standardized codes covering such things as part-of-speech, dialects, etc.

Element Changes

Entry-wide Information Elements

It is proposed to introduce between the reading (<r_ele>) and sense (<sense>) elements an information element for carrying relevant information about the lexical item as a whole. At present such information can only be recorded about senses. Thus the top level of the DTD would change from:

  • <!ELEMENT entry (ent_seq, k_ele*, r_ele+, sense+)>

to

  • <!ELEMENT entry (ent_seq, k_ele*, r_ele+, info*, sense+)>

There could be zero, one or more <info> elements. The contents would be unstructured text, and an attribute (inf_type) would be used to indicate the type of information, e.g. literal translation, derivation, etc. The DTD description would be:

  • <!ELEMENT info (#PCDATA)>
  • <!ATTLIST info inf_type CDATA #IMPLIED>
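As a rough illustration of how a consumer might read the proposed element, the following Python sketch parses a hand-written entry fragment and groups the <info> text by inf_type. The entry content, the sequence number, the "lit" type value and the helper name collect_info are all assumptions made for the example; the proposal does not yet define the set of inf_type values.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written fragment following the proposed content model
# (ent_seq, k_ele*, r_ele+, info*, sense+). The sequence number, the gloss
# and the inf_type value "lit" are illustrative assumptions, not JMdict data.
SAMPLE = """
<entry>
  <ent_seq>9999999</ent_seq>
  <k_ele><keb>猫舌</keb></k_ele>
  <r_ele><reb>ねこじた</reb></r_ele>
  <info inf_type="lit">cat tongue</info>
  <sense><gloss>being unable to eat or drink hot things</gloss></sense>
</entry>
"""

def collect_info(entry):
    """Group the text of entry-level <info> elements by inf_type."""
    grouped = defaultdict(list)
    for info in entry.findall("info"):
        # inf_type is #IMPLIED, so it may be absent; those go under None.
        grouped[info.get("inf_type")].append(info.text or "")
    return dict(grouped)

entry = ET.fromstring(SAMPLE)
info_by_type = collect_info(entry)
print(info_by_type)
```

Because findall("info") only looks at direct children of <entry>, sense-level elements are not picked up by this helper.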

Entry-wide Language Source Elements

It is proposed to combine the current <lsource> and <dial> elements and move them from within the <sense> element to the entry level.

The <lsource> element would retain its current attributes (xml:lang, ls_type, ls_wasei), and the information in the current <dial> elements would become the value of a new "ls_dial" attribute within the <lsource> element.

Examples:
- the current アンジョ entry would simply see <lsource xml:lang="por">anjo</lsource> move from the first (and only) sense to the entry level.
- the current アホ野郎 entry would see the dialect recorded as <lsource ls_dial="ksb"/> at the entry level instead of <dial>&ksb;</dial> within the first sense.

Implicit in this change is that entries, such as パン, which record loanwords from several source languages, will need to be split into an entry for each source language term.
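A conversion along these lines could be sketched in Python as follows. This is a minimal sketch, not a definitive migration tool: the helper name hoist_lsource is hypothetical, the entry is abbreviated with an illustrative gloss, and placing the hoisted element immediately before the first <sense> is an assumption, since the proposal does not state where the entry-wide element would sit in the DTD.

```python
import xml.etree.ElementTree as ET

# Abbreviated form of the アンジョ entry in the current layout. Only the
# elements relevant to the example are kept; the gloss is illustrative.
CURRENT = """
<entry>
  <r_ele><reb>アンジョ</reb></r_ele>
  <sense>
    <lsource xml:lang="por">anjo</lsource>
    <gloss>angel</gloss>
  </sense>
</entry>
"""

def hoist_lsource(entry):
    """Move every sense-level <lsource> to entry level.

    The hoisted elements are placed just before the first <sense>;
    that position is an assumption, not part of the proposal.
    """
    senses = entry.findall("sense")
    moved = []
    for sense in senses:
        for ls in sense.findall("lsource"):
            sense.remove(ls)
            moved.append(ls)
    insert_at = list(entry).index(senses[0])
    for offset, ls in enumerate(moved):
        entry.insert(insert_at + offset, ls)
    return entry

entry = hoist_lsource(ET.fromstring(CURRENT))
child_tags = [child.tag for child in entry]
print(child_tags)
```

The sketch does not handle <dial> elements or entries such as パン that record several source languages; those would need the mapping to "ls_dial" and the entry split described above before hoisting.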

Entry-wide Inflection Pattern Elements

It is proposed to include an entry-wide <infl> element containing information about conjugation or inflection patterns of the entry. This element would typically only be used for entries which comprise or end with a verb or adjective, and would indicate the appropriate inflections for tense, mood, aspect, etc. It would replace the present system where such information is embedded in the part-of-speech coding at the sense level (v1, v5m, adj-i, etc.).

Pitch Accent Elements

It is proposed to provide for pitch accent information to be included with each reading of a Japanese term. This will be an additional element associated with each reading, and the proposed change to the DTD would be from:

  • <!ELEMENT r_ele (reb, re_nokanji?, re_restr*, re_inf*, re_pri*)>

to

  • <!ELEMENT r_ele (reb, re_pa*, re_nokanji?, re_restr*, re_inf*, re_pri*)>
  • <!ELEMENT re_pa (#PCDATA)>

There could be zero, one or several <re_pa> elements per reading. The actual format of the content of the <re_pa> element has yet to be decided; however, there should be the potential to support multiple systems for describing pitch accent information. A possible approach would be to have an attribute such as:

  • <!ATTLIST re_pa pa_type CDATA #IMPLIED>

An example for entry 1584660 (明日/あした) might be "<re_pa pa_type="hm">3</re_pa>" on the あした reading, indicating a higher pitch on the 3rd mora.
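To show how such markup might be consumed, the Python sketch below reads the <re_pa> elements of a single reading and returns them with their pa_type. The fragment reuses the example values above; the helper name pitch_accents is hypothetical, and no interpretation of the "hm" system is attempted because the proposal does not define it.

```python
import xml.etree.ElementTree as ET

# Reading fragment for entry 1584660 using the proposed <re_pa> element.
# pa_type="hm" and the value 3 come from the example in the text; what
# "hm" stands for is not defined by the proposal.
SAMPLE = """
<r_ele>
  <reb>あした</reb>
  <re_pa pa_type="hm">3</re_pa>
</r_ele>
"""

def pitch_accents(r_ele):
    """Return (reading, [(pa_type, value), ...]) for one <r_ele>."""
    reading = r_ele.findtext("reb")
    accents = [
        (pa.get("pa_type"), (pa.text or "").strip())
        for pa in r_ele.findall("re_pa")
    ]
    return reading, accents

reading, accents = pitch_accents(ET.fromstring(SAMPLE))
print(reading, accents)
```

Keeping the value as a string rather than converting it to an integer leaves room for notation systems whose content is not a simple mora number.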

Cross-References

At present the <xref> element within <sense> simply states a target surface form and, if specified, a sense number, e.g. "<xref>スライド・1</xref>". It is proposed to expand this by including the target entry sequence number and sense number as attributes, e.g.

  • <xref type="see" seq="1073760" sno="1">スライド・1</xref>

The text portion would remain the same as this makes it easier to generate legacy versions such as EDICT.

The current <ant> element would be removed and instead "<xref type="ant" ......>" would be used.
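One benefit of the proposed attributes is that a cross-reference can be resolved by direct lookup rather than by matching surface forms. The Python sketch below illustrates this under several assumptions: the referring entry, its sequence number 9999999 and all glosses are placeholders, the helper name resolve_xref is hypothetical, and "sno" is taken to be 1-based to match the "・1" suffix in the text portion.

```python
import xml.etree.ElementTree as ET

# Two abbreviated entries: the target (seq 1073760, from the example in the
# text) and a hypothetical referring entry. Sequence number 9999999 and all
# glosses are placeholders made up for illustration.
SAMPLE = """
<JMdict>
  <entry>
    <ent_seq>1073760</ent_seq>
    <r_ele><reb>スライド</reb></r_ele>
    <sense><gloss>first sense (placeholder)</gloss></sense>
    <sense><gloss>second sense (placeholder)</gloss></sense>
  </entry>
  <entry>
    <ent_seq>9999999</ent_seq>
    <r_ele><reb>れい</reb></r_ele>
    <sense>
      <xref type="see" seq="1073760" sno="1">スライド・1</xref>
      <gloss>referring sense (placeholder)</gloss>
    </sense>
  </entry>
</JMdict>
"""

def resolve_xref(xref, entries_by_seq):
    """Return the target <sense>, or the target <entry> if no sno is given."""
    target = entries_by_seq[xref.get("seq")]
    sno = xref.get("sno")
    if sno is None:
        return target
    return target.findall("sense")[int(sno) - 1]  # sno assumed 1-based

root = ET.fromstring(SAMPLE)
entries_by_seq = {e.findtext("ent_seq"): e for e in root.findall("entry")}
xref = root.find(".//xref")
target_sense = resolve_xref(xref, entries_by_seq)
print(xref.get("type"), target_sense.findtext("gloss"))
```

The same lookup would serve antonym references, with the "type" attribute distinguishing "see" from "ant".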

Part-of-Speech Separation

At present the <pos> element within the <sense> element records both actual parts of speech, e.g. "n", "v5s", "adj-i", etc., as well as supplementary information that is not actually a POS, e.g. "adj-no", and general information which is not usually regarded as a POS at all, e.g. "exp", "int", etc.

It was proposed that an additional element <pos_sup> be introduced to record the information which is not an actual POS. After some discussion that proposal has been withdrawn. A better approach may be to simply document which <pos> values are actually parts of speech and which are supplementary information.

Additional Attribute Values

The main use of XML attributes in JMdict is to identify different types of <gloss> elements using the "g_type" attribute. At present the values used are "lit", "fig" and "expl". It is proposed to add the "descr" value to indicate a gloss which is a description of the Japanese term rather than a translation or an explanation of the meaning.

Additional Entity Values

JMdict uses an extensive set of standard entity values for such things as part-of-speech tags, dialect names, fields, etc. It is proposed to add a number of additional values:

  • Christn - term associated with Christianity, as with the current "Buddh" and "Shinto" values
  • TBC