Difference between revisions of "JMdict: Next Generation"

From EDRDG Wiki
Revision as of 12:17, 26 July 2019

The Next Generation of JMdict

Introduction

This page has been set up to record proposed changes to the JMdict microstructure, i.e. the way the information in the dictionary is recorded and laid out. The file as distributed is in XML format, and its structure is defined in the JMdict DTD (document type definition). The current DTD can be viewed at https://www.edrdg.org/jmdict/jmdict_dtd_h.html, and a sample entry at https://www.edrdg.org/jmdict/jmdict_sample.html.

The changes are likely to involve:

  • additional or changed XML elements. These are the data items or groups of items that carry the information. For example, the kanji forms of a Japanese term are located in "keb" elements within a "k_ele" element.
  • additional or changed attributes. These attributes and their values provide information about the element; for example, the "gloss" element uses the "xml:lang" attribute to carry the language code.
  • additional or changed entity values. These are standardized codes covering such things as part-of-speech, dialects, etc.
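
To make the element and attribute terminology concrete, here is a minimal sketch in Python that reads the "keb" elements and the "xml:lang" attribute from a made-up entry. The entry content and the function names are invented for illustration, and the fallback to "eng" when "xml:lang" is absent is an assumption about the DTD default rather than something stated on this page.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical entry; the sequence number and glosses are invented.
ENTRY_XML = """
<entry>
  <ent_seq>9999999</ent_seq>
  <k_ele><keb>例</keb></k_ele>
  <r_ele><reb>れい</reb></r_ele>
  <sense>
    <gloss>example</gloss>
    <gloss xml:lang="ger">Beispiel</gloss>
  </sense>
</entry>
"""

def kanji_forms(entry):
    """Return the text of every <keb> found inside a <k_ele>."""
    return [keb.text for keb in entry.findall("k_ele/keb")]

def glosses_with_language(entry):
    """Return (language code, gloss text) pairs for all senses."""
    # Assumption: a gloss without xml:lang is treated as English ("eng").
    return [(g.get(XML_LANG, "eng"), g.text) for g in entry.findall("sense/gloss")]

entry = ET.fromstring(ENTRY_XML)
print(kanji_forms(entry))
print(glosses_with_language(entry))
```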

Element Changes

Entry-wide Information Elements

It is proposed to introduce between the reading (<r_ele>) and sense (<sense>) elements an information element for carrying relevant information about the lexical item as a whole. At present such information can only be recorded about senses. Thus the top level of the DTD would change from:

  • <!ELEMENT entry (ent_seq, k_ele*, r_ele+, sense+)>

to

  • <!ELEMENT entry (ent_seq, k_ele*, r_ele+, info*, sense+)>

There could be zero, one or more <info> elements. The contents would be unstructured text, and an attribute (inf_type) would be used to indicate the type of information, e.g. literal translation, derivation, etc. The DTD description would be:

  • <!ELEMENT info (#PCDATA)>
  • <!ATTLIST info inf_type CDATA #IMPLIED>
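
As a rough illustration of how a consumer might read the proposed <info> elements, the following Python sketch parses a hypothetical entry. The sequence number, the note text, the "lit" value of inf_type and the function name are all invented here; the proposal does not yet fix a list of inf_type values.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry using the proposed <info> element, placed between
# the reading and sense elements as in the proposed content model.
ENTRY_XML = """
<entry>
  <ent_seq>9999999</ent_seq>
  <k_ele><keb>猫舌</keb></k_ele>
  <r_ele><reb>ねこじた</reb></r_ele>
  <info inf_type="lit">cat tongue</info>
  <info>example note without a type</info>
  <sense><gloss>dislike of very hot food or drink</gloss></sense>
</entry>
"""

def entry_info(entry):
    """Return (inf_type, text) pairs for the entry-level <info> elements."""
    # inf_type is #IMPLIED in the proposed DTD, so it may be absent (None).
    return [(info.get("inf_type"), info.text or "") for info in entry.findall("info")]

entry = ET.fromstring(ENTRY_XML)
for inf_type, text in entry_info(entry):
    print(inf_type, text)
```

Because the content is unstructured text, the sketch returns it unchanged and leaves any interpretation to the caller.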

Pitch Accent Elements

It is proposed to provide for pitch accent information to be included with each reading of a Japanese term. This will be an additional element associated with each reading, and the proposed change to the DTD would be from:

  • <!ELEMENT r_ele (reb, re_nokanji?, re_restr*, re_inf*, re_pri*)>

to

  • <!ELEMENT r_ele (reb, re_pa*, re_nokanji?, re_restr*, re_inf*, re_pri*)>
  • <!ELEMENT re_pa (#PCDATA)>

There could be zero, one or several <re_pa> elements per reading. The actual format of the content of the <re_pa> has yet to be decided.
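
Since the content format of <re_pa> is undecided, a reader can only treat it as opaque text for now. The Python sketch below does that for a hypothetical reading element; the values "0" and "2" are placeholders chosen for illustration, not a proposed notation, and the function name is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical reading element following the proposed content model:
# reb, re_pa*, re_nokanji?, re_restr*, re_inf*, re_pri*
R_ELE_XML = """
<r_ele>
  <reb>はし</reb>
  <re_pa>0</re_pa>
  <re_pa>2</re_pa>
  <re_pri>ichi1</re_pri>
</r_ele>
"""

def pitch_accents(r_ele):
    """Return the raw text of each <re_pa> element, in document order."""
    # The text is returned uninterpreted because its format is not yet defined.
    return [pa.text or "" for pa in r_ele.findall("re_pa")]

r_ele = ET.fromstring(R_ELE_XML)
print(r_ele.findtext("reb"), pitch_accents(r_ele))
```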

Part-of-Speech Separation

At present the <pos> element within the <sense> element records both actual parts of speech, e.g. "n", "v5s", "adj-i", etc., as well as supplementary information that is not actually a POS, e.g. "adj-no", and general information which is not a POS at all, e.g. "exp", "int", etc. It is proposed that an additional element <pos_sup> be introduced to record the information which is not an actual POS. The definition of the <sense> element would change from:

  • <!ELEMENT sense (stagk*, stagr*, pos*, xref*, ant*, field*, misc*, s_inf*, lsource*, dial*, gloss*)>

to

  • <!ELEMENT sense (stagk*, stagr*, pos*, pos_sup*, xref*, ant*, field*, misc*, s_inf*, lsource*, dial*, gloss*)>
  • <!ELEMENT pos_sup (#PCDATA)>
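
One way an existing <sense> might be migrated is sketched below in Python. It assumes that the supplementary set consists only of the three tags named above ("adj-no", "exp", "int"); the real list would be longer and is not settled here. The sample sense uses the short tag codes as plain text, and the function name is invented.

```python
import xml.etree.ElementTree as ET

# Assumption: only the tags the text above names as not being actual POS.
SUPPLEMENTARY_TAGS = {"adj-no", "exp", "int"}

# Hypothetical sense in the current layout, with short codes as plain text.
SENSE_XML = """
<sense>
  <pos>n</pos>
  <pos>adj-no</pos>
  <misc>uk</misc>
  <gloss>example gloss</gloss>
</sense>
"""

def split_pos(sense):
    """Move supplementary tags from <pos> to <pos_sup>, editing in place."""
    pos_elems = sense.findall("pos")
    if not pos_elems:
        return sense
    # Position of the first <pos>; stagk/stagr children may precede it.
    start = list(sense).index(pos_elems[0])
    kept = [p for p in pos_elems if (p.text or "").strip() not in SUPPLEMENTARY_TAGS]
    moved = [p for p in pos_elems if (p.text or "").strip() in SUPPLEMENTARY_TAGS]
    for p in pos_elems:
        sense.remove(p)
    for p in moved:
        p.tag = "pos_sup"
    # Re-insert so that every <pos> precedes every <pos_sup>,
    # as the proposed content model requires.
    for offset, elem in enumerate(kept + moved):
        sense.insert(start + offset, elem)
    return sense

sense = ET.fromstring(SENSE_XML)
result = [(child.tag, child.text) for child in split_pos(sense)]
print(result)
```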

Additional Attribute Values

The main use of XML attributes in JMdict is to identify different types of <gloss> elements using the "g_type" attribute. At present the values used are "lit", "fig" and "expl". It is proposed to add a "descr" value to indicate a gloss which is a description of the Japanese term rather than a translation or an explanation of the meaning.
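
For illustration, the Python sketch below groups the glosses of a hypothetical sense by their "g_type" value, including the proposed "descr". The gloss texts and the function name are invented; a gloss with no "g_type" is grouped under None.

```python
import xml.etree.ElementTree as ET

# Hypothetical sense; the gloss texts are placeholders.
SENSE_XML = """
<sense>
  <gloss>example translation</gloss>
  <gloss g_type="lit">example literal rendering</gloss>
  <gloss g_type="descr">example description of the term</gloss>
</sense>
"""

def glosses_by_type(sense):
    """Map each g_type value (None when absent) to its list of gloss texts."""
    grouped = {}
    for gloss in sense.findall("gloss"):
        grouped.setdefault(gloss.get("g_type"), []).append(gloss.text or "")
    return grouped

typed_sense = ET.fromstring(SENSE_XML)
grouped = glosses_by_type(typed_sense)
print(grouped.get("descr", []))
```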

Additional Entity Values

JMdict uses an extensive set of standard entity values for such things as part-of-speech tags, dialect names, fields, etc. It is proposed to add a number of additional values:

  • Christn - term associated with Christianity, as with the current "Buddh" and "Shinto" values
  • TBC