Main Page: Difference between revisions
Latest revision as of 06:52, 28 April 2023
Electronic Dictionary Research and Development Group
Welcome to the Wiki of the Electronic Dictionary Research and Development Group. The Wiki has been developed as a repository of information and documentation about the Group's work and projects.
User Accounts
Sorry, but we no longer provide user accounts. We were hit by link spammers, which led us to disable self-creation of accounts, and it is all too much of a distraction.
If you have any edits you would like to suggest, email Jim Breen (jimbreen-at-gmail.com) with the details.
The JMdict/EDICT Project
This project is to build and maintain a freely-usable general Japanese electronic dictionary database.
History
The project began in 1991 with the EDICT Japanese-English text file in a simple format. In 1999 this was expanded into the XML-format JMdict file with a more complex format allowing for much better treatment of Japanese words and expressions. From 1999 the data was maintained by Jim Breen in a mark-up system from which the JMdict file, in both English and multiple-language editions, the EDICT file, and the extended EDICT2 file were generated. Public input into the project was mainly via WWW forms incorporated in the WWWJDIC server, and new editions of the files were generated daily.
In July 2010 maintenance of the JMdict data moved to an online database, from which the daily distributions are prepared. In September 2014 the maintenance of the JMnedict named-entity data was moved to that database too.
Documentation and Links
Some useful links are:
- the main documentation of the JMdict/EDICT dictionary files
- some help with getting started on putting in new entries or editing existing ones.
- the Editorial Process for handling proposed new entries and amendments
- the Editorial Policy and guidelines for the JMdict/EDICT files
- the Editorial Board for JMdict/EDICT
- the JMdict Issues forum, where matters such as structure, format, policies, tags, and other issues concerning dictionary content can be raised and discussed (currently hosted on GitHub).
- the JMdictDB Issues site for reporting problems and making feature requests concerning the JMdictDB web pages and software.
- the mailing list for project discussion. (That page should have a link for asking to join. Alternatively, email Jim Breen and ask to be added.)
- the licence statement for use of the projects' files. This licence also applies to the contents of this Wiki.
- lists of packages and servers using the JMdict/EDICT files
- an Entries Under Development page, where people can place incomplete words and phrases for later filling out to become full entries. (Note that this is rather inactive and needs cleaning up.)
Current Version & Downloads
The project's master database is continuously being updated and new versions of the files are generated daily. The date of generation is included in the header of the files.
The files are currently distributed via the EDRDG FTP server (formerly at Monash University), which also provides an rsync service. The main files available are:
- JMdict.gz - the full JMdict file, including English, German, French, Russian, Spanish, Hungarian, Slovenian and Dutch glosses;
- JMdict_e.gz - the JMdict file with only English glosses;
- JMdict_b.gz - the basic JMdict file with only English glosses. This file omits several thousand proper name entries from JMnedict;
- JMdict_e_examp.gz - the above JMdict file with example sentence pairs from the Tanaka Corpus;
- edict.gz - the "traditional" EDICT file;
- edict2.gz - the extended EDICT2 file.
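As a rough illustration of working with these gzip-compressed distributions, the following is a minimal Python sketch, not an official tool. The base URL is the one used by the download links on this page. The helper names (`file_url`, `decompress`, `fetch`) are hypothetical, and the default UTF-8 encoding is an assumption that should be checked against each file's documentation, since not every file necessarily uses the same encoding.

```python
import gzip
from urllib.request import urlopen

# Base URL used by the download links on this page.
BASE_URL = "http://ftp.edrdg.org/pub/Nihongo/"


def file_url(name: str) -> str:
    """Return the download URL for one of the distributed files."""
    return BASE_URL + name


def decompress(data: bytes, encoding: str = "utf-8") -> str:
    """Decompress gzip bytes and decode them as text.

    The default encoding is an assumption; pass another one if the
    file's documentation says so.
    """
    return gzip.decompress(data).decode(encoding)


def fetch(name: str, encoding: str = "utf-8") -> str:
    """Download and decompress one file (requires network access)."""
    with urlopen(file_url(name)) as response:
        return decompress(response.read(), encoding)
```

For example, `fetch("JMdict_e.gz")` would return the English-only JMdict file as one string. Because the files are regenerated daily, a local copy is best refreshed on a schedule (or via the rsync service) rather than downloaded on every use.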
JMdictDB Database
The maintenance of the JMdict/EDICT and JMnedict/ENAMDICT dictionary files is now handled by the online JMdict Database (JMdictDB) system developed by Stuart McGraw, and operational since June 2010. For more information see:
- an overview of the database;
- Stuart's summary page;
- the quick overview to editing entries;
- the full help file for editing entries.
- a page showing the current entry counts for the two dictionaries (updated daily).
- project code at GitLab.
The Tanaka Corpus
This project is to maintain and extend the Tanaka Corpus, a large collection of parallel Japanese/English sentence pairs.
The Corpus is now maintained within the Tatoeba Project. This project has extended the file to include many other languages, and many sentences are available in three or more languages. The project WWW site has extensive facilities for searching and editing the sentences, and has an active community of people entering and editing sentences.
An important aspect of the Tanaka Corpus and its ongoing maintenance and expansion is its use as a source of examples in dictionary systems such as WWWJDIC, Denshi Jisho, etc. This is achieved via a set of indices attached to each sentence pair. There is a detailed description of this process.
The KANJIDIC Project
The KANJIDIC Project has compiled files of comprehensive information on kanji used in Japanese text processing. The files cover the kanji in three Japanese standards:
- JIS X 0208-1998, which includes 6,355 kanji.
- JIS X 0212-1990, which includes an extra 5,801 kanji.
- JIS X 0213-2012, which extends JIS X 0208, overlaps with some of JIS X 0212, and adds further kanji.
The COMPDIC Project
The COMPDIC project involved the compilation of a glossary of terms used in the computing and telecommunications industries. The file was in the "EDICT" format. See the brief documentation.
In 2008 the entries in the COMPDIC file were included in the JMdict/EDICT file. While it is no longer maintained as a separate file, an extract of the entries relating to computing and telecommunications is still generated.
The ENAMDICT/JMnedict Project
The JMnedict/ENAMDICT files contain about 740,000 proper names in Japanese, covering place-names, surnames, given names, company names, names of artistic and literary works, product names, etc. There is a basic documentation page.
- JMnedict (the Japanese-Multilingual named entity dictionary) is in XML format and is in Unicode/UTF-8 coding. (download)
- ENAMDICT is in a variant of the EDICT format, with part-of-speech and other tags omitted and replaced with some special tags to indicate the type of proper name. (download)
The information in the files is held in the same database as the JMdict/EDICT information. To use the online edit system follow this link and select "jmnedict" from the drop-down Corpus menu.
Several thousand common entries from JMnedict are also included in the JMdict distribution.
The KRADFILE/RADKFILE Project
This project provides a decomposition of kanji into a number of visual elements or radicals to support software that provides a lookup service using kanji components. These elements can be seen in the WWWJDIC server, the Jisho.org server, and Ben Bullock's SLJFAQ page.
There is an information page about the data files. The files can be downloaded using the links on that page.
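The idea of component-based lookup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sample decompositions are invented for the example rather than taken from the project's data files, and the actual file format is described on the information page, not here. The sketch inverts a kanji-to-components mapping into a component-to-kanji index, then intersects the candidate sets for the components a user selects.

```python
from collections import defaultdict


def build_component_index(decomposition):
    """Invert a kanji -> components mapping into component -> set of kanji."""
    index = defaultdict(set)
    for kanji, components in decomposition.items():
        for component in components:
            index[component].add(kanji)
    return index


def lookup(index, components):
    """Return the kanji that contain every one of the given components."""
    candidate_sets = [index.get(c, set()) for c in components]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)


# Illustrative sample data only, not copied from the project's files.
SAMPLE = {
    "明": ["日", "月"],
    "時": ["日", "寸", "土"],
    "朝": ["十", "日", "月"],
}

index = build_component_index(SAMPLE)
matches = lookup(index, ["日", "月"])  # kanji containing both components
```

Each extra component the user selects can only narrow the result set, which is why this style of lookup works well interactively.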
The WWWJDIC Dictionary Server
WWWJDIC is a dictionary WWW server first developed by Jim Breen in 1998. Its (rather clunky) name came about because it is based on code and techniques developed in the earlier JDIC (DOS) and XJDIC (Unix/X11) applications.
The home site of the server is here, and there are several mirror sites which are updated daily from the home site. The server has links at the dictionary entry level to other sites and to the JMdict database for editing entries.
The main documentation is the WWWJDIC User's Guide.
A number of elements in the server's display can be configured by users, and the interface language can be set to Japanese (as part of the WWWJDIC in Japanese project).
Wishlist
This is a set of wishlist items for the various projects. Feel free to add suggestions.
There is also an old wishlist page. Some of the items in this section have been copied from it.
Mailing List
There is a mailing list for people engaged in the EDRDG projects.
How Can I Help?
From time to time people ask how they can best contribute to the projects. There are many ways of assisting, the main ones being:
- adding to and enhancing the main (EDICT/JMdict) dictionary file. This is best done by using the Search and New Entry pages of the JMdictDB system.
- adding extra Japanese-English sentence pairs to the collection based on the Tanaka Corpus. This is done by adding them to the Tatoeba Project as linked sentence pairs, then contacting Jim Breen to have them indexed.
- assisting with the translation of the WWWJDIC interface into other languages. At present the priority is to make it fully available in Japanese. See the WWWJDIC in Japanese page.
- working through the lists of words Paul Blay has placed on the Talk:Tanaka_Corpus page, which could become new dictionary entries.
- joining and participating in the mailing list for people engaged in the EDRDG projects.