JMdict/EDICT Editorial Policy and Guidelines
These guidelines are intended for people preparing new entries or amendments for the JMdict/EDICT files. Typically these entries or amendments will be made via the JMdictDB on-line database system.
Before proposing a new entry or an amendment, you should:
- familiarize yourself with the style of the dictionary, particularly the way the English meanings are typically worded;
- make very sure it is not already an entry. An amazing number of "new" entries turn out to be in the dictionary already, or variants of existing entries. If it is a variant, add it to the existing entry. Check such things as:
- common variants of writing 外来語, e.g. using either ー or イ for extending vowels, having a ー at the end (コンピューター/コンピュータ), etc.;
- common okurigana variants, e.g. 生花/生け花;
- modern and old kanji, e.g. 合気道/合氣道
- check you have written it correctly. Has it the correct kanji? Is the reading correct, with the vowel length right, ず/づ issues resolved, etc.?
- verify the source. There are excellent online dictionaries available, e.g. the Sanseido dictionaries at the Goo site, and the various collections at the Kotobank site. The Eijiro dictionary at the ALC site is also useful. If the word or phrase can't be found in a dictionary, WWW references to where it is used may suffice, but the meaning and context have to be clear. Dictionary and other reference information must be included in the "Reference" section in the form. Include the precise URL - just "weblio" or "wiki" is no use at all to the editors. Note that if the page you are referencing is not about the proposed entry, include an extract from the reference text to help the editor(s) establish the validity of the proposed entry.
- verify that the word or phrase is common enough to include in the dictionary. Page counts for Google or Yahoo are useful for this purpose. In general unless a word or phrase has more than about 50 hits on the WWW, it is not worth submitting.
- decide whether it is really worth having as an entry. Some expressions are so obvious that it just clutters the dictionary to include them. (See the section below.)
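Part of the "is it already an entry?" check can be done mechanically before searching. The following is a minimal sketch only: the function name `lookup_keys` is hypothetical, it is not part of JMdictDB, and it covers just the trailing ー case mentioned above.

```python
import re

LONG_VOWEL = "\u30fc"  # the katakana long-vowel mark (ー)
KATAKANA_WORD = re.compile(r"[\u30a1-\u30fa\u30fc]+")

def lookup_keys(word: str) -> set[str]:
    """Return the spellings worth searching for before proposing `word`."""
    keys = {word}
    if word.endswith(LONG_VOWEL):
        # コンピューター -> also search for コンピュータ
        keys.add(word[:-1])
    elif KATAKANA_WORD.fullmatch(word):
        # コンピュータ -> also search for コンピューター
        keys.add(word + LONG_VOWEL)
    return keys
```

Searching for every key in the returned set reduces the chance of submitting a variant of an existing entry as a "new" one. Okurigana and old/new kanji variants still need a manual check.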
Dictionary Entry Fields
The Kanji section of the entry form contains the form of the Japanese word/phrase which contains kanji, special characters or letters from non-Japanese scripts (e.g. ＭＰ３プレーヤー). The word/phrase should be written in full-width characters (e.g. it is not MP3プレーヤー).
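Converting half-width letters and digits to their full-width equivalents is a fixed code-point shift, so it is easy to do before pasting a headword into the form. A minimal sketch (the helper name `to_fullwidth` is hypothetical):

```python
def to_fullwidth(text: str) -> str:
    """Convert printable half-width ASCII characters to full-width forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x21 <= code <= 0x7E:
            # U+0021..U+007E map onto U+FF01..U+FF5E at a constant offset.
            out.append(chr(code + 0xFEE0))
        else:
            out.append(ch)  # kana, kanji, etc. are left untouched
    return "".join(out)

print(to_fullwidth("MP3プレーヤー"))
```

This is only appropriate for the Kanji section; English text in the Meanings section stays in ordinary half-width characters.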
There may be more than one version of the word or phrase in this section. The usual reasons for having more than one version (also known as "surface forms" or "orthographical variants") are:
- alternative kanji in the word, e.g. 合気道 and 合氣道
- variations in okurigana, e.g., 生け花 and 生花
- part of a word being written either in kanji or kana, e.g., 言い付ける and 言いつける
Where there are multiple forms of a word, enter them with the most commonly used form first, and then order them in decreasing frequency of use. In general irregular or incorrect forms, e.g. those tagged iK, io or ik, should be placed to the rear of the surface form list, even if they are commonly used on WWW pages.
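The ordering rule can be expressed as a two-part sort key: irregular forms go last, and within each group the forms are ordered by decreasing frequency. This is a sketch under assumptions: the tuple layout and the frequency counts are hypothetical inputs that a contributor would gather themselves.

```python
IRREGULAR_TAGS = {"iK", "io", "ik"}

def order_forms(forms):
    """Order (text, frequency, tags) tuples for the surface form list.

    Forms carrying an irregular tag sort after all other forms, even
    when their frequency count is higher.
    """
    return sorted(
        forms,
        key=lambda f: (bool(IRREGULAR_TAGS & set(f[2])), -f[1]),
    )
```

For example, a form tagged iK with 900 hits would still be listed after untagged forms with 500 and 300 hits.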
Synonyms should not be included here. Instead they should be entered as separate dictionary entries, and a cross-reference inserted to them.
Some other points to note:
- in the case of na-adjectives (形容動詞), the な is NOT included in the entry (some Japanese dictionaries include it.) Use a part-of-speech of "adj-na".
- as most adverbs are derived from either regular adjectives (く form) or na-adjectives (に), there is no need to have an entry unless the adverb is not apparent from the adjective.
- for verbs formed from adding する to a noun, do not include the する in the headword - instead use the part-of-speech of "vs". The exception to this is the group of single-kanji-plus-する verbs such as 愛する. For these include the complete verb and use the "vs-s" part-of-speech.
- for adverbs that are indicated by と, e.g. まざまざと, do not include the と, instead note the part-of-speech as "adv-to".
- for adjectives that use たる (and と in the adverbial form), e.g. 依然たる, 依然と, omit the たる and と and use "adj-t" as the part-of-speech.
- for the -さ (-ness) and -く (adverb) inflections of adjectives, only include them if the meaning is not obvious from the gloss of the adjective itself.
A set of tags, e.g. iK or oK, can be applied to the words in this section. These should be used sparingly.
In the Reading section, enter either:
- the reading(s) of the word/phrase in the Kanji section, or
- the word itself if it is written only in kana, such as a 外来語 or a word/phrase written only in hiragana.
Readings associated with kanji should normally be in hiragana; the main exceptions being:
- Chinese or Korean words and names, which are often transliterated using katakana;
- the names of biological species which should be entered in both katakana and hiragana (if there is also a kanji form.)
- older loanwords such as 硝子 (ガラス: glass) and 加里 (カリ: potassium). Included in this are some country names such as 加奈陀 (カナダ), 英吉利 (イギリス) and 亜米利加 (アメリカ).
More than one reading can be entered where alternatives are possible. This can occur when
- a kanji has alternative readings;
- where there are different transliterations of 外来語, e.g., ダイヤモンド and ダイアモンド;
- where a species name is being recorded; in these cases both hiragana and katakana forms should be entered. The katakana form must have "[nokanji]" after it to indicate that it is used without the kanji form, and a "[uk]" should be included in the Meanings field. Place the hiragana form first (client software such as WWWJDIC will display the katakana form first.)
- where a katakana form is commonly used and is identical to one of the readings (e.g. 仏陀-ぶっだ-ブッダ). Here also place "[nokanji]" after the katakana version and place it at the end of the readings.
Where alternative readings are restricted to particular variants of the kanji form, specify this using the [restr=KKK] pattern after the reading. As in the Kanji section, place the more common reading(s) first.
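For species names, the katakana reading can be derived from the hiragana one, since the two kana blocks are laid out in parallel in Unicode. A minimal sketch, assuming readings are separated by ";" as in the examples in this document; the helper names and the sample reading くま are illustrative only:

```python
def hira_to_kata(text: str) -> str:
    """Map hiragana (U+3041..U+3096) to the corresponding katakana."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

def species_readings(hiragana: str) -> str:
    """Hiragana form first, then the katakana form marked [nokanji]."""
    return f"{hiragana};{hira_to_kata(hiragana)}[nokanji]"

print(species_readings("くま"))
```

Remember that a "[uk]" tag is still needed in the Meanings field; this sketch only builds the reading list.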
外来語 (in katakana) are entered in this section. Do not enter them in the kanji section. Where a 外来語 is a transliteration of several source words, include versions with and without a separating "middle dot", e.g. "アームレスチェア;アームレス・チェア". Note that the JIS middle-dot must be used - there are other Unicode middle-dots which are not accepted.
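Look-alike dots are hard to spot by eye, so a small check can help. In this sketch the JIS middle dot is taken to be U+30FB (KATAKANA MIDDLE DOT); the list of look-alikes is an assumption about which characters contributors might type by mistake, not a list taken from the database software.

```python
JIS_MIDDLE_DOT = "\u30fb"  # KATAKANA MIDDLE DOT
LOOKALIKES = {
    "\u00b7": "MIDDLE DOT",
    "\u2022": "BULLET",
    "\u2027": "HYPHENATION POINT",
    "\uff65": "HALFWIDTH KATAKANA MIDDLE DOT",
}

def find_bad_dots(reading: str) -> list[tuple[int, str]]:
    """Return (position, character name) for each look-alike dot found."""
    return [(i, LOOKALIKES[ch]) for i, ch in enumerate(reading) if ch in LOOKALIKES]

def dot_variants(parts: list[str]) -> str:
    """Build the 'without dot;with dot' pair of readings from the word parts."""
    return "".join(parts) + ";" + JIS_MIDDLE_DOT.join(parts)

print(dot_variants(["アームレス", "チェア"]))
```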
If a 外来語 (e.g. ベースボール) means the same as a native Japanese word (e.g. 野球), do not include the 外来語 form as a reading of the kanji. Instead create a separate entry and create cross-references between them. Similarly if two kana-only words have the same meaning, do not place them in the same entry unless they are related, e.g. spelling or pronunciation variants.
If the kanji part contains katakana (e.g. 一眼レフ), use katakana in the Reading as well for the matching portion (いちがんレフ).
A set of tags, e.g. ik or ok, can be applied to the words in this section. These should be used sparingly.
The Meanings section of the entry form is divided into senses, i.e. distinct meanings. These are indicated by a sense number: [1], [2], etc. Each sense can have a number of part of speech tags (POS), e.g. [n], [adj-i] and miscellaneous tags, e.g. [abbr] and [col].
The meanings consist of one or more short translations or explanations of the Japanese word or phrase.
- do not copy translations, especially longer ones, directly from other dictionaries. For simple terms there may not be much in the way of alternatives, but for longer explanations use your own words, reword things, etc. Significant copying carries a risk of charges of plagiarism or copyright violation.
- where the Japanese has more than one distinct meaning, break the section into senses.
- make each translation a separate item, i.e. place a ";" between them. This makes reverse look-up and exact match on the English possible. Some examples:
- abbreviations: "three letter acronym; TLA" not "three letter acronym (TLA)"
- conjunctions: "rice field; rice paddy" not "rice field or paddy"
- where different forms of English use different terms, include all major variants (e.g. both "snow pea" and "mange tout" or "tap" and "faucet".)
- do not use capital letters unless referring to a proper name (person, place, etc.) Japanese theatrical forms should be given as "noh" and "kabuki"; not "Noh", "Kabuki", etc.
- do not precede the meaning with the articles "a", "an" or "the" unless it is absolutely necessary to make the meaning clear.
- when putting numbers into translations be consistent and concise. In general:
- if the numbers are in the context of a formula, quantity, measurement, etc. use figures (e.g. 1.5 kilograms);
- if the numbers are in something more descriptive or narrative, in general use words for numbers up to ten (e.g. three kings, five flowers), and figures for numbers over ten (e.g. 147 angels). In some cases, such as the 三十三所 entry, "thirty-three temples" looks more natural than "33 temples".
- avoid mixing figures and words, even if it means relaxing the advice above. Writing "eat five to twenty raisins" or "eat 5 to 20 raisins" is fine, but "eat five to 20 raisins" looks unnatural.
- make the translations as international as possible. For example, use "university" rather than "college" when referring to tertiary education, as outside the US the word "college" has much wider usage.
- include both "British" and "American" spellings. For short meanings it is better to repeat the meaning with the alternative spelling, however it is also acceptable to just put the alternative at the end in parentheses, e.g. "full colour (color)". Do not use patterns such as "colo(u)r" as they can't be searched for successfully.
- when using "e.g." to expand on the meaning of a word by giving examples, or when using "i.e." to qualify the meaning of a word, place the expansion in parentheses after the initial translation. For example say "hand game (e.g. rock, paper, scissors)", not "hand game, e.g. rock, paper, scissors". Also, do not include a comma after e.g. or i.e.
- it is best not to use "etc" with a one-item list. In such cases, "e.g." is preferable.
- as with the use of "e.g." and "i.e." above, it is OK to add a few words of extra information in parentheses after the translation. The situations where this is done include:
- providing some context for the term;
- short disambiguations;
- short explanations for technical words or words where the meaning might not be clear to a literate user;
- scientific names of species (but only when it is following a common name). This is explained more fully below.
- provide useful explanations where appropriate. "type of card game" is not very useful - in such a case explain briefly what the card game entails.
- never create an English meaning purely based on the translation of the meanings of the kanji making up a word. Sometimes it will be correct, but there are many cases where the result would be quite wrong. (魂柱 does not mean "spirit pillar").
- when entering the scientific name of a plant, animal, etc. put it in brackets after the first common English name, e.g. "spectacled bear (Tremarctos ornatus)". Note that the first word of the scientific name will have a capital letter. (See the note on "Names of biological species" below.)
- put any context in brackets, e.g.: "consulting (the oracle)" not "consulting the oracle".
- when indicating a field or domain for an entry, e.g., "comp" or "ling", state it using the [fld=xxxx] pattern. The full list of field tags is here. For example:
- [fld=comp] floating-point
- when entering the name of a species of animal, plant, etc. do not use the "zool", "bot" field tags, as this should be obvious. Those tags are really to establish the context of a technical term.
- short explanatory notes can be included as part of a sense. Use the pattern [note="this is a note"]. These should be kept short, and only used when it is necessary to include some information that can't go in a gloss. In general it is best to word the glosses so that further explanation is not needed. (These explanatory notes are not carried through to the legacy EDICT format of the dictionary, so it is permissible to have Japanese text in them.)
- where the English meaning is an obscure technical term, add a short explanation in lay terms after it in parentheses. Do not add such explanations where the English meaning should be clear to a literate user (this is not an English dictionary.)
- it is sometimes useful to indicate the literal meaning of an idiomatic expression, etc. In this case:
- place "[lit]" at the front of the gloss;
- place this gloss last, after the real translation(s).
- note that the "[lit]" tag should not be used for such things as literal translations of the kanji in a jukugo.
- if a gloss has a figurative meaning, this can be indicated by placing "[fig]" in front of it.
- on occasions the usual translation may be a bit opaque, and a more complete explanation would be helpful. In this case add a more explanatory gloss with "[expl]" in front of it. Keep these to a minimum (it's a dictionary; not an encyclopedia.) The "[expl]" tag is usually not used if it is the only gloss in the sense.
(At present the "expl", "lit" and "fig" tags are only used in the database - they are not yet exported to JMdict or EDICT.)
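Several of the wording rules above are mechanical enough to check automatically. The following is a rough sketch of such a check, not a tool used by the editors: the function name `lint_meaning` and the regular expressions are assumptions, and the output is only a list of warnings for a human to review.

```python
import re

def lint_meaning(meaning: str) -> list[str]:
    """Return warnings for one sense's semicolon-separated glosses."""
    warnings = []
    for gloss in (g.strip() for g in meaning.split(";")):
        # Leading articles are discouraged unless needed for clarity.
        if re.match(r"(a|an|the)\s", gloss, re.IGNORECASE):
            warnings.append(f"leading article: {gloss!r}")
        # Patterns like colo(u)r cannot be searched for.
        if re.search(r"\w\(\w+\)\w", gloss):
            warnings.append(f"unsearchable spelling pattern: {gloss!r}")
        # No comma after e.g. or i.e.
        if re.search(r"\b(e\.g\.|i\.e\.),", gloss):
            warnings.append(f"comma after e.g./i.e.: {gloss!r}")
        # "three letter acronym (TLA)" should be two separate glosses.
        if re.search(r"\S \([A-Z]{2,}\)$", gloss):
            warnings.append(f"abbreviation should be its own gloss: {gloss!r}")
    return warnings

print(lint_meaning("the three letter acronym (TLA)"))
```

A gloss such as "full colour (color)" or "spectacled bear (Tremarctos ornatus)" passes these checks, since both follow the patterns recommended above.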
Which Reference Is Best?
On occasions references (see the list of dictionaries, etc. below) will differ as to the meanings of entries, and which senses are more important than others. Here are some suggestions for handling this:
- our goal is to reflect modern Japanese, so precedence should be given to sources that indicate up-to-date usage;
- the major Japanese-English dictionaries tend to be more up-to-date and focussed in their translations than the 国語辞典;
- 広辞苑 lists its meanings in historical order, so use its material with caution;
- in general 大辞林's meanings (especially in recent editions) appear to be more topical than those in 大辞泉 or 日国;
- if a term only appears in 大辞泉 or 日国, consider tagging it "obsc" or "arch" as appropriate, unless it gets a reasonable number of WWW hits;
- WWW pages can give confirmation of modern contexts, although quite a few pages may have to be scanned. Sometimes looking at the associated images can give a quick indication of which sense is dominant.
Part-Of-Speech (POS) Issues
- where a term can be used in multiple roles, e.g. as a noun, adjective, adverb, etc., the part-of-speech tags should usually be ordered with the most common role(s) first.
- many nouns in Japanese can also be, or act as adjectives (e.g. tagged as adj-na, adj-no, or adj-f in JMdict.) These terms should generally have "n" as the first part-of-speech tag and be given a noun meaning. Exceptions can be made when the adjective usage is obviously much more common, e.g. with 複雑, or when it's difficult to translate the term as a noun in English, e.g. スポーツ万能. The approach taken by major Japanese-English dictionaries can be a guide, as can the n-gram frequency counts.
- in general the form of the meaning should agree with the first part-of-speech tag for the sense. If the Japanese word is marked as a noun, don't make the translation a verb (e.g. to xxxx) or an adjective.
- do not list verb translations for nouns that can also be used as verbs (i.e. [n,vs]). See the 料理 entry, which has: "cooking; cookery; cuisine", not "cook".
- if the verb sense is not easily derived from the noun form, include a second sense with a POS of "vs" in which the meaning will be "to ...".
- if the POS of an entry is "vs" alone, the meaning will be given as a verb (such entries are rare.)
- when entering a verb, use the infinitive in English (to run, to jump, etc.)
- for adjectives, the English entry should be just the adjective, not the adjective and copula:
- "lucky" not "be lucky" or "is lucky"
- for entries marked "adj-no" or "adj-na", do not include "adj-f" as well, as the dropping of the の and な particles is quite common.
- there is a range of archaic POS tags available, e.g. the ones associated with the 二段 and 四段 verb types (v2* and v4*). Most modern verb equivalents have an archaic verb equivalent, i.e. most verbs that are "v5k" could also be marked as "v4k", and most "adj-na" entries could also be flagged as "adj-nari". For most words such extra tags are quite redundant. The old verb, etc. POS tags should only be used for archaic words which never use a modern conjugation, e.g. 崇まふ.
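The agreement between the first part-of-speech tag and the form of the glosses can be checked with a simple heuristic. This sketch assumes that verb tags start with "v" and that verb glosses start with "to "; both are simplifications, and the function name is hypothetical.

```python
def gloss_matches_pos(pos_tags: list[str], glosses: list[str]) -> bool:
    """True if every gloss has the form implied by the first POS tag."""
    first_is_verb = pos_tags[0].startswith("v")
    # Verb senses use the infinitive ("to ..."); other senses should not.
    return all(g.startswith("to ") == first_is_verb for g in glosses)
```

With the 料理 example above, `gloss_matches_pos(["n", "vs"], ["cooking", "cookery", "cuisine"])` is true, while a gloss list of `["to cook"]` under the same tags would be flagged.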
If the word or term comes from another language, mark this at the beginning of the sense(s) to which it applies. The format is [lsrc=lng:], where lng is the three-letter code from the ISO 639-2:1998 "Codes for the representation of names of languages" standard, e.g.:
- アルバイト [n,vs][lsrc=ger:Arbeit] part-time job
- アールデコ [n] [lsrc=fre:"art déco"] art deco
Don't do this for (i) common Sino-Japanese vocabulary, (ii) loan-words from English where the source word is among the translations, or (iii) words/terms which are translations from other languages. If the word or term in the source language is identical to the translation, don't repeat it in the [lsrc=...] field. Note that where a loan-word from English was originally from another language, e.g. ベランダー/verandah, the usual practice is not to indicate a source language.
Non-English source languages are usually indicated in the major 国語辞典 such as Daijirin and Daijisen, and also in 外来語 dictionaries such as the Gakken カタカナ 新語辞典. In cases of disagreement or doubt, e.g. where a term may have come from either English or French, omit any source language marking.
Source words in languages that use a non-Latin script should be given in Latin transcription. Diacritical marks can be used. For the following languages, use these transcription systems:
- Chinese: Pinyin (with tonal marks)
- Russian: BGN/PCGN
- Korean: Revised Romanization of Korean (not Yale or McCune–Reischauer)
- Sanskrit: IAST
The language markings apply both to loanwords (外来語), as with the examples above, and to transliterations (音写), typically the Buddhist terms taken from Sanskrit, which are not usually regarded as loanwords.
Note that where ISO 639 discriminates between historical forms of a language, e.g. "grc" for Classical Greek and "gre" for Modern Greek, the modern tag is to be used as the discrimination cannot easily be applied at the word level.
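The tag format can be assembled with a small helper. This is a sketch under assumptions: quoting multi-word source terms is inferred from the "art déco" example above, the historical-to-modern mapping contains only the grc/gre pair mentioned here, and the function name is hypothetical.

```python
# Historical language codes folded into their modern counterparts.
MODERN_CODE = {"grc": "gre"}

def lsrc_tag(lang: str, word: str = "") -> str:
    """Build a language-source tag such as [lsrc=ger:Arbeit]."""
    if len(lang) != 3 or not (lang.isalpha() and lang.islower()):
        raise ValueError(f"expected a three-letter ISO 639-2 code, got {lang!r}")
    lang = MODERN_CODE.get(lang, lang)
    if " " in word:
        word = f'"{word}"'  # multi-word source terms are quoted
    return f"[lsrc={lang}:{word}]"

print(lsrc_tag("ger", "Arbeit"))
print(lsrc_tag("fre", "art déco"))
```

Leaving `word` empty gives a tag with the language only, which suits the case where the source word is identical to the translation.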
Cross-references can be made to other dictionary entries where this enhances the value of the entry to the typical dictionary user. Examples of such useful cross-references are:
- where one entry is an abbreviation of another, e.g. 学割 and 学生割引 (see below).
- where the words are commonly associated or contrasted, e.g. 先輩/後輩, 税別/税込み, etc.
- where there is a derivational relationship between words that it is useful to highlight, e.g. between かっけー and 格好いい, or between オケる and 空オケ.
At present two classes of cross-reference are supported: a general "see" and an "ant" for antonyms.
Specify the cross-reference using the pattern [see=言葉] or [ant=何等] (see the detailed instructions). Where the reference is to a particular headword/reading combination, use the format: kanji・reading, e.g., [see=金本位・かねほんい]. Where the target word has a kanji form, that form should be used. For targets that are a particular sense of the target word use the format [see=漢字[2]].
Please note that the "ant" (antonym) tag should only be used for genuine opposites. Words such as "short" and "tall" are antonyms; "short person" and "tall person" are not - use the regular "[see=...]" form for these. (For more information, see the excellent Wikipedia article on this.)
Avoid adding cross-references to words which simply mean the same (or opposite), as it adds a lot of clutter to the entries without necessarily being helpful to users. There are related systems such as the Japanese WordNet which specifically provide details of large numbers of synonyms. Some systems such as WWWJDIC link to the Japanese WordNet as part of the entry display.
Many Japanese terms are abbreviations of longer terms, for example 学割 is an abbreviation of 学生割引. When creating an entry for such an abbreviation:
- add the tag "[abbr]" to indicate it is an abbreviation;
- add a cross-reference to the full form (add an entry for the full form if necessary.) For example "[see=学生割引]".
Where useful, a cross-reference back from the full form to the abbreviation may also be added.
Romanized forms of Japanese words may be used within meanings in the following situations:
- words such as "karate", "samurai" or "kimono" which have become part of the English lexicon. These would typically be the first meaning or gloss of the sense;
- Japanese proper nouns such as Tokyo and Meiji;
- romanized forms of Japanese terms which are in reasonably common use in particular contexts, e.g. "wasei eigo". These would not usually be the first meaning or gloss for the entry, but would follow more explanatory meaning(s).
The Hepburn romanization system, in particular the revised (aka modified) version, will be used. That page can be taken as a guide, with the key points being:
- where appropriate long vowels will be indicated using macrons (not circumflexes). Thus the era name 養老 should be written as "Yōrō"; not "Yourou", "Yooroo" or "Yoro".
- where a Japanese term or name is commonly used in English, such as "tofu" and "judo", macrons would typically not be included on the long vowels. It may be appropriate to include the version with macrons in parentheses at the end of the gloss, e.g. "somen (sōmen)". Terms that are not regularly used in English should use macrons, e.g. "man'yōgana".
- where ambiguities may occur, e.g. in words such as ほんやく or しんいち, apostrophes should be used to make the underlying kana forms clear, e.g. "hon'yaku" and "shin'ichi".
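The macron and apostrophe rules can be illustrated with a toy converter. This is a deliberately incomplete sketch: the kana table covers only the characters needed for the examples above, digraphs such as きゃ are not handled, and forming macrons by string replacement ignores morpheme boundaries, so the output still needs checking by eye.

```python
# Partial table: just enough kana for the examples in this section.
KANA = {
    "い": "i", "う": "u", "か": "ka", "が": "ga", "く": "ku",
    "し": "shi", "ち": "chi", "な": "na", "ほ": "ho", "ま": "ma",
    "や": "ya", "よ": "yo", "ろ": "ro",
}
VOWELS_AND_Y = set("aiueoy")
MACRONS = {"ou": "ō", "oo": "ō", "uu": "ū"}

def romanize(kana: str) -> str:
    """Convert a hiragana string to modified Hepburn (toy version)."""
    out = []
    for i, ch in enumerate(kana):
        if ch == "ん":
            following = KANA.get(kana[i + 1], "") if i + 1 < len(kana) else ""
            # An apostrophe separates ん from a following vowel or y.
            out.append("n'" if following[:1] in VOWELS_AND_Y else "n")
        else:
            out.append(KANA[ch])  # KeyError for kana outside the toy table
    text = "".join(out)
    for plain, long_vowel in MACRONS.items():
        text = text.replace(plain, long_vowel)
    return text

for word in ("ほんやく", "しんいち", "ようろう", "まんようがな"):
    print(word, romanize(word))
```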
Old and Rarely Used Terms
Several miscellaneous tags are available for indicating that terms are no longer in current use or are rarely used. They are:
- "arch" (archaism). This is typically used to indicate that the term was primarily used during or before the Edo period.
- "obs" (obsolete). This is typically used for terms that were in use in the Meiji and early Showa periods, but are no longer in general use, e.g. they have been supplanted by another term.
- "obsc" (obscure). This is used to indicate that a term, although in current use, is rarely encountered. A term that is included in one or more 国語辞典 but is not in Japanese-English dictionaries and has low occurrence levels in n-gram corpora would be a candidate for this tag. It is also particularly appropriate to add it to a term if there are other more common terms with the same meaning.
- "hist" (historical). This is used to indicate a current term that refers to a concept in the past, e.g. an art-form common in the 18th century.
- "dated". This is used to indicate an old term that is still used but sounds old fashioned and is possibly inappropriate in modern contexts.
Numbers with Units and Symbols
In general where a number is followed by a unit or symbol, the following spacing rules should be followed:
- a space should be used between numbers and associated units. Please use "100 km"; not "100km".
- where a number is followed by a symbol, do not include a space. Examples of this include "15°C" and "9%". Note that 5 cents would be "5c" as the "c" is treated as a symbol.
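These spacing rules can be applied with two regular expressions. The sketch below is an assumption-laden example: the unit list is a short illustrative sample rather than a complete one, and restricting the first rule to known units avoids touching forms such as "2am".

```python
import re

UNITS = ["km", "kg", "cm", "mm", "ml", "m", "g", "l"]  # illustrative sample only
UNIT_RE = re.compile(r"(\d)\s*(" + "|".join(UNITS) + r")\b")
SYMBOL_RE = re.compile(r"(\d)\s+(\u00b0C|%|c\b)")

def fix_spacing(text: str) -> str:
    """Put one space before units and no space before symbols."""
    text = UNIT_RE.sub(r"\1 \2", text)   # "100km" -> "100 km"
    return SYMBOL_RE.sub(r"\1\2", text)  # "9 %" -> "9%"

print(fix_spacing("100km"))
print(fix_spacing("15 \u00b0C"))
```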
Date and Time Formats
For the sake of consistency, the same format should be used when recording specific dates. The preferred formats are:
- March 17 (where the year is not included)
- March 17, 2019 (where the year is included)
For the dates of individual people, e.g. in the named-entity dictionary, use the YYYY.MM.DD format for the sake of brevity, e.g. "Yukio Mishima (1925.1.14-1970.11.25)".
Similarly, for specific times of the day, use the "2am" and "12:30pm" styles, both to be consistent and to use the minimum amount of space.
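The preferred date and time styles are straightforward to produce programmatically. A minimal sketch using only the standard library; the function names are hypothetical, and month names are spelled out in a tuple so the result does not depend on the system locale.

```python
from datetime import date, time

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

def format_date(d: date, with_year: bool = True) -> str:
    """'March 17' or 'March 17, 2019'."""
    text = f"{MONTHS[d.month - 1]} {d.day}"
    return f"{text}, {d.year}" if with_year else text

def format_lifespan(born: date, died: date) -> str:
    """'1925.1.14-1970.11.25' (YYYY.MM.DD without zero padding)."""
    return "-".join(f"{d.year}.{d.month}.{d.day}" for d in (born, died))

def format_time(t: time) -> str:
    """'2am' or '12:30pm'."""
    hour = t.hour % 12 or 12
    suffix = "am" if t.hour < 12 else "pm"
    minutes = f":{t.minute:02d}" if t.minute else ""
    return f"{hour}{minutes}{suffix}"

print(format_date(date(2019, 3, 17)))
print(format_lifespan(date(1925, 1, 14), date(1970, 11, 25)))
print(format_time(time(12, 30)))
```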
Capital letters should generally be confined to proper nouns, e.g. specific countries, places, people, products, etc. Astronomical objects such as the Sun, Saturn, etc. will have capitals, but moonlight and sunshine will not.
The References section is where you indicate the sources for the entry or amendment. It helps establish its validity, enables editors to check out the accuracy, e.g. of the translation from a 国語辞典, and leaves a record for other people to know where the entry and translation came from.
- for proposed new entries supporting reference information MUST be provided. Proposals without any such information may be summarily rejected by an editor;
- for amendments to existing entries, straightforward suggestions such as spelling changes or rewording of translations need not have references, but more substantial changes must be accompanied by references and/or a case for the change in the Comments field.
The best references are to other dictionaries, and the more the better. Sometimes just the name of the dictionary will do, where the proposed entry is already an entry in the reference, however if the entry in the dictionary is readily visible online it is better to include the URL. Editors and regular contributors have developed a set of abbreviations and mnemonics for some of the popular sources:
- koj: Kôjien, 広辞苑 - a major medium-sized 国語辞典.
- daijr: Daijirin, 大辞林 - another major medium-sized 国語辞典.
- daijs: Daijisen, 大辞泉 - another major medium-sized 国語辞典.
- nikk: Nikkoku 日国/日本国語大辞典 - a major multi-volume 国語辞典.
- GG5: Kenkyusha 新和英大辞典第５版 - major Japanese-English dictionary (translators often refer to this as the "Green Goddess", hence the "GG".)
- ＫＯＤ追加語彙: addenda to the GG5, available via the Kenkyusha online dictionary site
- ルミナス: Luminous ルミナス和英辞典 - medium Kenkyusha JE dictionary
- GJD: 日本語大事典 (The Great Japanese Dictionary) - medium-sized 国語辞典 with brief English glosses
- 新和英中辞典: medium Kenkyusha JE dictionary
- リーダーズ+プラス: medium-sized Kenkyusha English-Japanese dictionary
- 新英和大辞典: large Kenkyusha English-Japanese dictionary
- 新英和中辞典: medium-sized Kenkyusha English-Japanese dictionary
- JWN: Japanese WordNet
- LSD: Life Sciences Dictionary - major biomedical terminology dictionary
- カタカナ新語辞典 (Gakken): a useful dictionary of loanwords
- Unidic: morpheme dictionary from the National Institute for Japanese Language and Linguistics (NINJAL)
- eij or alc: Eijiro, 英辞郎 - large word/phrase collection, available online at the ALC site. In general this resource is not suitable as the sole reference for a proposed term (see the comment below).
- 実用日本語表現辞典, which is often used by the Weblio aggregator. This site is useful for helping understand expressions, etc. but should not be used as a sole reference for a proposed entry.
Some of the above references are available via aggregator or reference WWW sites such as Goo, Weblio, Yahoo, etc. In such cases please make sure the reference URL is to the specific term on the site, and add the name of the actual dictionary being used for the reference (大辞林, 日国, etc.)
If the references include online resources such as a dictionary entry or a Wikipedia article, quote the relevant URL. Please note that a Japanese Wikipedia article by itself is not necessarily a good source for a dictionary entry. Some articles are simply translations from an English page and not evidence that a term is in use in Japanese. Sometimes an article only covers one aspect of a term's usage, and there are other senses which need to be covered. It is best to check the term in other sources and state that in the References section.
If the sources for the entry are other WWW-based documents, quote the URLs of at least one (preferably several), and use the Comments field to state your case for it being included.
As noted above, the Eijiro glossary should not be the sole source of references for a proposed entry, although it may be used as a supplementary reference for confirming meanings. This is because the glossary is a collection of Japanese-English pairs which have apparently been collected from translations. In a Japan Times article Daniel Morales described it as "a smorgasbord of reibun and definitions, some of which err on the side of slang, often delighting the expat community. For example, the entry for nyūbō (乳房, breasts) has no fewer than 51 English options, including the ever-so-mature “funbags.” And kyūryōbi (給料日, payday) lists “when the eagle flies” (an American tribute to governmental pay), among other more colorful renditions."
Use the Comments field to enter any additional information you think will help the editors when they assess the entry or amendment. These comments are kept with the entry as a record of the discussions. The Comments field will also be used by editors when providing feedback.
While it is not mandatory, it is best if you include your name. Editors get to know who are regular contributors of amendments and new entries, and it is easier to establish some rapport if the contributor is identified. Also, having an email address enables editors to contact a contributor directly if there is a question they wish to raise. Note that email addresses cannot be seen by people browsing the database; they are only visible to editors who have logged into the system.
There is no requirement for people submitting new entries or amendments to identify themselves. It is preferred, however, that people making regular contributions provide some identification, either their name or a pen-name, as it will add to the sense of community among the participants, and also enable the editors to take into account the quality of previous contribution(s) when examining a proposal.
Although the database supporting the dictionary uses Unicode coding and can contain any character from that set, the distributed forms of the dictionary are more constrained, in particular:
- the (legacy) EDICT format can only contain characters in the JIS X 0208 set. This includes 6,356 kanji, alphanumerics and the Greek and Russian alphabets, but does not include Latin alphabet characters with diacritics, such as é and ö.
- the EDICT2 format used by WWWJDIC and some other applications can contain characters from both JIS X 0208 and JIS X 0212. As well as containing an additional 5,801 kanji, JIS X 0212 adds a range of other characters including Latin alphabet characters with diacritics.
The JMdict database is in Unicode and thus can contain any valid Unicode characters.
Care needs to be taken with the inclusion in the database of characters outside the JIS X 0208 and JIS X 0212 codesets as this has implications for the EDICT and EDICT2 versions of the data. In particular:
- any non-JIS208/212 character(s) will be removed. This means that if such characters are used, e.g. some hangul in a note, then the romanized version should be included as well.
- for EDICT (but not EDICT2) alphabetics with diacritics will be replaced as appropriate, e.g. ö will be changed to oe.
- if a kanji or reading part of an entry contains non-JIS characters, then the part will be removed entirely. JIS X 0212 kanji are retained in EDICT2, but in EDICT the entire kanji part is removed.
Kanji which lie outside the JIS X 0208 and JIS X 0212 codesets, e.g. the additional kanji in JIS X 0213, can be included in the database and will be in the JMdict distributions, however, they will not be propagated into the EDICT/EDICT2 distributions.
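A quick way to see whether some text will survive into the EDICT and EDICT2 distributions is to try encoding it with Python's Japanese codecs. This is an approximation, not the conversion the dictionary build actually uses: it assumes `iso2022_jp` as a stand-in for JIS X 0208 and `euc_jp` as a stand-in for JIS X 0208 plus JIS X 0212, and both codecs also accept a few characters (such as plain ASCII) from other sets.

```python
def edict_levels(text: str) -> dict[str, bool]:
    """Report whether `text` fits the EDICT and EDICT2 character sets."""
    def encodable(codec: str) -> bool:
        try:
            text.encode(codec)
            return True
        except UnicodeEncodeError:
            return False

    return {
        "EDICT": encodable("iso2022_jp"),  # approx. JIS X 0208 only
        "EDICT2": encodable("euc_jp"),     # approx. JIS X 0208 + JIS X 0212
    }

print(edict_levels("合気道"))
print(edict_levels("art déco"))
```

Text that fails both checks, such as hangul, would be dropped from both legacy formats, which is why a romanized version should accompany it.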
Merging Entries/Two-out-of-three Rule
On occasions, two or more entries may be merged when there are grounds for assuming they are variants of each other. The basic principle that is applied is a "two-out-of-three" rule (first described in a paper in 2004). For the candidate entries, if at least two out of the (a) kanji-headword, (b) reading and (c) meaning fields are the same, the entries may be merged. Otherwise they must be separate entries. It is often not a simple decision, as there may be kanji-headwords which only apply to some of the readings.
Where the entries have multiple kanji parts or readings, this rule really applies only to the major/common forms. Mergers should not be carried out on the basis of a rare or archaic kanji form or reading. Common sense must apply.
Two entries with no kanji could be merged if they have the same meaning and the kana forms are related, e.g. are variants of each other, such as ダイアモンド and ダイヤモンド.
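The basic two-out-of-three test can be written down directly. The sketch below is a simplification: the dictionary layout is hypothetical, and comparing fields for exact equality stands in for the editorial judgement of whether the main forms and meanings are really "the same". It does not cover the kana-only case described above.

```python
FIELDS = ("kanji", "reading", "meaning")

def may_merge(a: dict, b: dict) -> bool:
    """True if at least two of kanji, reading and meaning agree."""
    shared = sum(a[field] == b[field] for field in FIELDS)
    return shared >= 2

old_form = {"kanji": "合氣道", "reading": "あいきどう", "meaning": "aikido"}
new_form = {"kanji": "合気道", "reading": "あいきどう", "meaning": "aikido"}
print(may_merge(old_form, new_form))
```

By the same test, 野球 and ベースボール share only their meaning, so they remain separate entries linked by cross-references.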
Is it worth including?
An important issue is whether a possible entry is worth including. This question primarily arises with expressions such as XXXのYYY/XXXがYYY/etc. or compound nouns/multi-word expressions. Clearly we want to include entries that are useful and relevant, but we don't want to clutter the dictionary with things that are obvious. It is inevitably a value judgement and often leads to some debate between editors before a proposed entry is accepted or rejected. All dictionaries have to deal with this issue. It is worth reading the Wiktionary Criteria for inclusion as it discusses many of the issues in considerable detail. The following is a list of criteria being used by the editors to assess whether a proposed entry should be included. Generally, a proposed entry needs to pass one or more of these criteria.
- is its meaning not obvious from the component parts? Note that many words/expressions have additional senses or nuances that cannot be deduced from the constituent parts. (The former entry "僧になる" was removed because it failed this test, as well as the others.)
- is it not what someone reasonably proficient in Japanese would come up with when trying to express the English meaning in Japanese? (For example, 未収入金 is a reasonably common Japanese compound noun meaning "accounts receivable", but it is not necessarily what would be the result of translating "accounts receivable" into Japanese from scratch.)
- is it already in one or more dictionaries? (Other dictionaries have had to address this issue, and if their editors have decided it is worth including, that is a good signal. Note that inclusion in Eijiro alone is not a good indication, as its coverage is vast and rather indiscriminate.)
- does it have a reading which is not obvious from the constituent kanji? (Some expressions use unusual or irregular readings, often because they are based on archaic forms.)
- is it very, very common, with squillions of hits in WWW pages, etc.? (This is a rather weak test, and is mainly used with idiomatic expressions.)
Many loanwords (外来語) in Japanese have multiple surface forms which reflect such things as alternative mappings from the source language, variant vowel lengths, etc. Examples include ダイヤモンド/ダイアモンド, コンピュータ/コンピューター and ヴァイオリン/バイオリン. In general, all variants that are in regular use should be included, ranked in order of use (an n-gram corpus can be used to determine this). Rarely-used variants can be omitted, or included with an "ik" (incorrect kana) tag.
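As a rough illustration of ranking variants by corpus counts, the sketch below sorts the forms by frequency and flags those that are far rarer than the most common one. The function name, the 1% cut-off and the counts are all invented for the example; they are not editorial policy, and the flagged forms are merely candidates for omission or an "ik" tag.

```python
from typing import Dict, List, Tuple


def rank_variants(counts: Dict[str, int],
                  rare_ratio: float = 0.01) -> List[Tuple[str, bool]]:
    """Order variants by count; flag those far rarer than the top form.

    rare_ratio is an assumed cut-off, not a rule from these guidelines.
    """
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if not ordered:
        return []
    top_count = ordered[0][1]
    return [(form, count < top_count * rare_ratio) for form, count in ordered]


# Placeholder forms with invented n-gram counts.
for form, is_rare in rank_variants({"form A": 400, "form B": 1000, "form C": 3}):
    print(form, "(rare)" if is_rare else "")
```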
In general, the dictionary is not the place for recording extended text passages, but there is scope for including short, pithy passages which are recognized as useful in Japanese. Tests that will be used by editors when assessing such passages for inclusion include whether they are clearly in common use in Japanese, and/or are included in one or more of the major 国語辞典.
With regard to quotations and proverbs, the following guidelines are suggested for the use of the tags:
- [quote] - used for entries that are passages from some text, either originally in Japanese or a translation from another language. Typically a ([note="..."]) note would be included to indicate the source/author.
- [proverb] - used for entries which consist of a proverb, maxim, aphorism, pithy saying, etc. The popular Japanese ことわざ would also be tagged with this. Note that 四字熟語 have their own [yoji] tag and do not also get marked with the [proverb] tag.
Some entries consist of a term or passage based on or derived from part of a historical text. These should not be marked as [quote] unless they are an actual translation. Where appropriate a note can be included indicating the original text, e.g. "deriv. from 史記 passage".
In general, the JMdict/EDICT dictionary is not intended to include proper names as these are included in the companion ENAMDICT/JMnedict dictionary. It is common, however, for small numbers of high-profile proper names to be included in general dictionaries, and this is the case with JMdict. Proper names included in JMdict are primarily place names, with emphasis on the names of significant places within Japan, and on the Japanese names of countries and major cities. (The proper names in JMdict will be in ENAMDICT/JMnedict as well.)
The proper names considered appropriate for inclusion are:
- Japanese prefectures
- major Japanese cities, in particular, the designated cities and the capitals of prefectures
- Japanese regions (近畿, 北陸, 東北, etc.)
- major Japanese geographical features, e.g. 本州, 北海道, 富士山, 能登半島, 琵琶湖, etc.
- the former provinces in Japan
- other countries and their capital cities and other significant cities
- major geographical features (continents, oceans, major seas, lakes, mountain ranges, etc.)
- states and provinces of English-speaking countries and their capital cities
- provinces of China, major Chinese cities, and major cities in Korea
- deities and other major religious figures of Japanese religions and other significant religions, in particular, the Abrahamic faiths
- significant religious texts, Japanese works of literature, and reference books such as dictionaries
- a select number of extremely important historical, scientific, literary, musical, etc. figures known worldwide (Gandhi, Einstein, Darwin, Confucius, Hitler, Shakespeare, Beethoven, etc.)
- ministries, government departments and major organizational units, especially in Japan.
The above covers most of the proper names in JMdict. Some other names have been included, e.g. major newspapers, and there is discussion as to whether they can be retained under a "grandfather" principle, or confined to ENAMDICT/JMnedict.
The tags such as "place", "work", "person", etc. which are used to classify named-entities in the JMnedict database may also be used for proper names in JMdict; however, they should only be used when the nature of the entry is not clear from the gloss itself. For example "バルセロナ (n) Barcelona (Spain)" does not need the addition of the "place" tag.
As with other transcriptions of Japanese terms, the modified Hepburn system will be used. In most cases macrons will be used for long vowels, the only exceptions being cities such as Tokyo, Osaka and Kobe which are commonly used in English without macrons.
Names of biological species
The rules we are using for biological species are:
- Whenever possible, both the common name and the scientific name (using binomial nomenclature) of a species should be provided. The preferred format is: common_name (scientific_name), e.g. European magpie (Pica pica). If the common name is unknown, the preferred format is: scientific_name (description), e.g. Mola mola (a species of sunfish).
- Common names should be written in dictionary form. This means that only proper nouns and proper adjectives should be capitalized, even for officially standardized common names. e.g. "American kestrel", not "American Kestrel".
- Generic names (and names of higher taxa) are always capitalized; specific epithets are never capitalized. e.g. "Tyrannosaurus rex", not "tyrannosaurus rex" or "Tyrannosaurus Rex"
- Where applicable, subspecific taxonomic categories should be written out fully using ICZN or ICBN rules.
- For animal subspecies, this consists of merely writing the subspecific epithet. For example, the cinnamon bear, a subspecies of American black bear, should be submitted as: "cinnamon bear (Ursus americanus cinnamomum)"
- For plant subspecies, the abbreviation "subsp." should be used before the subspecific epithet. For example, occluded bindweed, a subspecies of hedge bindweed, should be submitted as: "occluded bindweed (Calystegia sepium subsp. erratica)"
- For varieties, the abbreviation "var." must be used.
- For forms, "f." must be used.
- Cultivar epithets should be capitalized and placed in single quotes. (e.g. Taxus baccata 'Variegata')
- Do not submit the author name. e.g., raspberry (Rubus idaeus), not raspberry (Rubus idaeus L.) (The "L." stands for Linnaeus.)
- Whenever possible, junior synonyms should not be submitted. Submit only the single scientific name currently accepted as the senior synonym. Wikipedia and The Encyclopedia of Life are good resources for finding the most up-to-date classifications.
- Submissions should include the Japanese name in kanji, hiragana, and--in the vast majority of cases--katakana. Biological names are very often written in katakana, and thus a (uk) tag is usually warranted. Nevertheless, the katakana reading should always be placed after the hiragana reading. For example, 銭形海豹 [ぜにがたあざらし,ゼニガタアザラシ] (n) (uk) harbor seal (Phoca vitulina)/harbour seal/common seal
- where the katakana name is a transcription of an English name, e.g. ブルシャーク, also include the form with the components separated by a middle-dot, e.g. ブル・シャーク.
- Names of higher taxa should include the headword written entirely in kanji, even though it may be only rarely used in practice. Reading restrictions will be used where appropriate. For example, セリ科,芹科 [セリか(セリ科),せりか(芹科)] (n) Apiaceae (parsley family of plants)/Umbelliferae
- When unsure of a kanji headword, it is often easy to determine based on the English translation or the appearance of the species. For example, the white-cheeked pintail (Anas bahamensis) is known as ホオジロオナガガモ in Japanese. This word does not appear in any Japanese dictionary, but it is rather obviously written as 頬白尾長鴨. Include a kanji headword whenever it can be determined in this manner, but never guess.
Note that in Japanese a genus is always denoted by the use of 属/ぞく, as in:
(n) Lespedeza (genus comprising the bush clovers)
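The formatting rules for species glosses can be summarized with a small helper. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and it only handles capitalization, the rank abbreviations and the "common_name (scientific_name)" layout; it does not validate names or check synonymy.

```python
from typing import Optional

# Abbreviations taken from the rules above.
RANK_ABBREVIATIONS = {"subspecies": "subsp.", "variety": "var.", "form": "f."}


def scientific_name(genus: str, species: str, infra: Optional[str] = None,
                    rank: str = "subspecies", plant: bool = False) -> str:
    """Build a name with a capitalized genus and lower-case epithets."""
    parts = [genus.capitalize(), species.lower()]
    if infra:
        # Animal subspecies are a bare third epithet; plant subspecies,
        # and all varieties and forms, take an abbreviation.
        if plant or rank != "subspecies":
            parts.append(RANK_ABBREVIATIONS[rank])
        parts.append(infra.lower())
    return " ".join(parts)


def species_gloss(sci: str, common: Optional[str] = None,
                  description: Optional[str] = None) -> str:
    """Format 'common (scientific)', or 'scientific (description)' if no common name."""
    if common:
        return f"{common} ({sci})"
    if description:
        return f"{sci} ({description})"
    return sci


print(species_gloss(scientific_name("Pica", "pica"), common="European magpie"))
print(scientific_name("Ursus", "americanus", "cinnamomum"))
print(scientific_name("Calystegia", "sepium", "erratica", plant=True))
print(species_gloss("Mola mola", description="a species of sunfish"))
```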
As in any language, there are words and terms in Japanese which need to be used with care and sensitivity, as they may be blunt, cause offence in some contexts, etc. In JMdict there is a "sens" tag which may be associated with one or more senses of an entry to indicate that the term should be used with a degree of caution. Determining which terms should be regarded as sensitive is quite difficult. In general the major Japanese-English and English-Japanese dictionaries do not attempt to indicate them, probably because they are usually compiled for Japanese users who do not need to be told this.
A useful reference is a list of problem terms (放送問題用語) based on a 1983 publication by NHK. That list, for example, includes virtually every term which includes 盲/めくら (blindness), so for 盲窓/めくら窓, it advises that "外見だけの窓" be used instead. Some of the prohibitions seem extreme; for example, 医者 is on the list, with the advice that 医師 or お医者さん be used instead, although foreign learners of Japanese are usually taught 医者 without any qualification. Note that the list is over 30 years old, and there are reports that it is not being followed completely now. The list is categorized according to whether terms are banned (×), have some reservations (△) or are uncertain (？), and the "×" tag is applied to 122 terms.
While there can be no hard and fast rules, it is suggested that people submitting or amending entries apply the following guidelines when considering whether the entry should include a "sens" tag.
- if the term is already tagged as "derog" (derogatory) or "vulg" (vulgar), there is no need for any additional "sens" tag. In fact, it is preferred that where appropriate "derog" or "vulg" tags be used;
- inclusion on the NHK list referenced above, particularly if it has an "×" tag, may indicate the need for a "sens" tag; however, it needs to be assessed on a case-by-case basis. The list, for example, says that 新平民 should not be used, but since it is an archaism there is no need to state it is sensitive. The list includes 板前 (chef) and recommends 板前さん be used instead, but it is clear from word-frequencies that 板前 alone is much more widely used;
- where appropriate consider a note indicating preferred alternatives, e.g. for 医者, a note "pref. 医師, お医者さん" may be appropriate.