
Sentence example indices - projected change



This is a heads-up for people working with WWW systems, apps,
etc. which use the Tatoeba/Tanaka Corpus Japanese-English sentence
pairs via the associated dictionary indices. See:
http://www.edrdg.org/wiki/index.php/Tanaka_Corpus  and
http://www.edrdg.org/wiki/index.php/Sentence-Dictionary_Linking

As the sentence->dictionary indexing is done on the "surface form"
of the word/term, there has always been an issue where two or more
dictionary entries have the same form. This is about to become more
noticeable, as the JMdict editors have agreed that each loanword entry
should be restricted to a single source-language term. This means that
the ラップ entry will be split into three according to the source term
(wrap, rap, lap). Others to be split include サン, ホース, スカル, etc.

To handle this situation, I propose to add an additional field when
needed to the indexing information in the examples file. This will
consist of the JMdict sequence number of the entry.  For a
reference to スカル/scull the index would be スカル#1067710 and
for スカル/skull it would be スカル#2841243.
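
For anyone adapting a parser, here is a minimal sketch in Python of how
a consumer might split such an index reference. The names IndexRef and
parse_index_ref are hypothetical, and the sketch only handles the
headword#sequence form shown above; any other annotations in the index
field (see the wiki pages) would need to be stripped separately.

```python
from typing import NamedTuple, Optional


class IndexRef(NamedTuple):
    """One index reference: a headword plus an optional JMdict sequence number."""
    headword: str
    seq: Optional[int]


def parse_index_ref(token: str) -> IndexRef:
    """Split 'headword#seq' into its parts (hypothetical helper).

    A token without a '#', or with a non-numeric suffix, is returned
    unchanged with seq=None, so existing index data keeps working.
    """
    headword, sep, seq_text = token.partition("#")
    if sep and seq_text.isascii() and seq_text.isdigit():
        return IndexRef(headword, int(seq_text))
    return IndexRef(token, None)


scull = parse_index_ref("スカル#1067710")
plain = parse_index_ref("スカル")
```

Since the field is only added when needed, treating a missing sequence
number as "match on surface form, as before" seems the safest default.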

Sites and apps using the JMdict file will have the sequence numbers
in each entry. Those using EDICT will need to use the version I use
in WWWJDIC which has the sequence number included. Contact me
if you want access to that version.
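
As a rough illustration for JMdict users, the sketch below builds two
lookups from the XML, one keyed by the ent_seq value and one by surface
form, and resolves a reference against them. The SAMPLE data is a
hand-trimmed stand-in for real entries (only the sequence numbers and
the scull/skull distinction come from this message), and build_indexes
and resolve are hypothetical names, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed stand-in for two JMdict entries.
SAMPLE = """<JMdict>
<entry><ent_seq>1067710</ent_seq><r_ele><reb>スカル</reb></r_ele>
<sense><gloss>scull</gloss></sense></entry>
<entry><ent_seq>2841243</ent_seq><r_ele><reb>スカル</reb></r_ele>
<sense><gloss>skull</gloss></sense></entry>
</JMdict>"""


def build_indexes(xml_text):
    """Return (by_seq, by_form) lookups over the entries in xml_text."""
    by_seq, by_form = {}, {}
    for entry in ET.fromstring(xml_text).iter("entry"):
        by_seq[int(entry.findtext("ent_seq"))] = entry
        # Index every kanji (keb) and reading (reb) form of the entry.
        for form in entry.findall("k_ele/keb") + entry.findall("r_ele/reb"):
            by_form.setdefault(form.text, []).append(entry)
    return by_seq, by_form


def resolve(token, by_seq, by_form):
    """Resolve 'form' or 'form#seq' to a list of matching entries."""
    form, sep, seq_text = token.partition("#")
    if sep and seq_text.isdigit():
        entry = by_seq.get(int(seq_text))
        return [entry] if entry is not None else []
    # No sequence number: fall back to the ambiguous surface-form match.
    return by_form.get(form, [])


by_seq, by_form = build_indexes(SAMPLE)
glosses = [e.findtext("sense/gloss") for e in resolve("スカル#2841243", by_seq, by_form)]
```

With the sequence number present the reference resolves to one entry;
without it, both スカル entries come back, which is the ambiguity the
new field is meant to remove.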

Comments, feedback, etc. welcome.

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/