Sentence-Dictionary Linking

From EDRDG Wiki
Revision as of 01:54, 25 February 2013 by JimBreen

To enable dictionary systems, apps, etc. to use the Japanese-English sentences from the Tanaka Corpus/Tatoeba as examples, a set of word-level indices has been compiled for each sentence (at present about 150,000 sentences have indices). These indices are maintained within the Tatoeba system, which has a special GUI for this purpose, and are periodically downloaded for use with dictionary systems. The indices are particularly associated with the JMdict/EDICT2 dictionary files, but may also be used elsewhere.

Index Format

The indices for a sentence consist of a line of text with space-delimited index elements for each word in the sentence. The following is an example:

Sentence: その家はかなりぼろ屋になっている。

Indices: 其の{その} 家(いえ)[01] は 可也{かなり} ぼろ屋[01]~ になる[01]{になっている}

The format of the index elements is as follows:

  • the usual headword as it appears in the dictionary. Even if the word is usually written in kana, the kanji form must be used if it is available. This field is mandatory; however, it may be omitted for proper names not found in the dictionary.
  • the reading of the word, in round parentheses (as in 家(いえ) in the example). This is optional; however, it must be used if there are several different dictionary entries with the same headword.
  • a sense number. This is used when the word has multiple senses in the JMdict/EDICT2 file, and indicates which sense applies in the sentence. It is a two-digit number in square brackets. The field is optional.
  • the form in which the word appears in the sentence. This may differ from the indexing word, e.g. if it is an inflected verb or adjective, or if the word is usually written in a different way. This field is in curly brackets. It is not mandatory, but should be included where appropriate.
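
The element format described here can be illustrated with a small parser. The following Python sketch is not part of any official tooling; it assumes the fields always appear in the order seen in the example (headword, reading, sense number, sentence form), and the names IndexElement, ELEMENT_RE and parse_indices are hypothetical. Anything left after the four fields, such as the trailing "~" in the example, is kept uninterpreted in a separate attribute.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed field order, taken from the example line:
#   headword (reading) [sense] {form}  + any unparsed trailing marker
ELEMENT_RE = re.compile(
    r"^(?P<headword>[^(\[{~]*)"      # dictionary headword (may be empty)
    r"(?:\((?P<reading>[^)]*)\))?"   # optional reading in round parentheses
    r"(?:\[(?P<sense>\d{2})\])?"     # optional two-digit sense number
    r"(?:\{(?P<form>[^}]*)\})?"      # optional form as written in the sentence
    r"(?P<trailing>.*)$"             # anything else, left uninterpreted
)


@dataclass
class IndexElement:
    headword: Optional[str]
    reading: Optional[str]
    sense: Optional[int]
    form: Optional[str]
    trailing: str

    @property
    def surface(self) -> str:
        """Text as it appears in the sentence: the {form} if given, else the headword."""
        return self.form if self.form is not None else (self.headword or "")


def parse_indices(line: str) -> List[IndexElement]:
    """Split a space-delimited index line and parse each element."""
    elements = []
    for token in line.split():
        m = ELEMENT_RE.match(token)  # always matches: every group is optional
        sense = m.group("sense")
        elements.append(IndexElement(
            headword=m.group("headword") or None,
            reading=m.group("reading"),
            sense=int(sense) if sense is not None else None,
            form=m.group("form"),
            trailing=m.group("trailing"),
        ))
    return elements


# The example index line from this page.
EXAMPLE = "其の{その} 家(いえ)[01] は 可也{かなり} ぼろ屋[01]~ になる[01]{になっている}"

if __name__ == "__main__":
    for element in parse_indices(EXAMPLE):
        print(element)
```

Joining the surface property of each parsed element gives back the example sentence without its final punctuation, which is a convenient sanity check when processing downloaded index lines.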