Re: [edict-jmdict] Errors and Remarks on JMDict



Hi Marc,

Welcome to the list.

You need to do something about your character encoding settings. The
header says it is ISO-2022-JP, but the Japanese content has been garbled.

[frenchquacky ([edict-jmdict] Errors and Remarks on JMDict) writes:]
>> I'm currently developing software based on JMDict, which involves
>> structuring the data into a database.
>> During automatic processing of the JMDict XML file, I found some
>> errors I'd like to share with you. Some are easy to correct, others
>> are a little bit more difficult.
>> 
>> *********Simple errors:

OK. All fixed.

>> *********complex problem
>> **Frequency problem
>> The frequency-of-use codes on the reb elements are not correct, I think.
>> They are just the concatenation of those on the corresponding keb
>> elements. In a lot of cases the result is simply nonsense.
>> For example, in entry 1169870 the news2 frequency appears twice.
>> Not a real problem though.

Just one of those, and I have fixed it.

>> But having both nf35 and nf40 has no clear meaning. In some cases
>> it's nf35, in others it is nf40, but which one?
>> You then have to look at the keb, which is not really
>> convenient, I think.

It's done that way for a reason. Often I have to generate a set of
kanji/reading pairs, selecting on the basis of things like frequency.
Where there are possibly multiple kanji words and readings, I need the
codes to align them.
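
As a rough illustration of that alignment (a sketch only, not the code I
actually use), the pairs can be recovered by intersecting the ke_pri codes
of each keb with the re_pri codes of each reb. The entry below is made up,
and the name aligned_pairs is hypothetical; only the element names come
from the JMdict DTD.

```python
import xml.etree.ElementTree as ET

# Made-up entry for illustration; the ent_seq and the words are placeholders.
SAMPLE_ENTRY = """
<entry>
  <ent_seq>9999999</ent_seq>
  <k_ele><keb>KANJI-A</keb><ke_pri>news2</ke_pri><ke_pri>nf35</ke_pri></k_ele>
  <k_ele><keb>KANJI-B</keb><ke_pri>news2</ke_pri><ke_pri>nf40</ke_pri></k_ele>
  <r_ele>
    <reb>READING</reb>
    <re_pri>news2</re_pri><re_pri>nf35</re_pri><re_pri>nf40</re_pri>
  </r_ele>
</entry>
"""


def aligned_pairs(entry):
    """Return (keb, reb, shared_codes) for each pair sharing a priority code."""
    readings = []
    for r_ele in entry.findall("r_ele"):
        codes = {p.text for p in r_ele.findall("re_pri")}
        # re_restr, when present, limits the reading to the listed kebs.
        restr = {x.text for x in r_ele.findall("re_restr")}
        readings.append((r_ele.findtext("reb"), codes, restr))

    pairs = []
    for k_ele in entry.findall("k_ele"):
        keb = k_ele.findtext("keb")
        k_codes = {p.text for p in k_ele.findall("ke_pri")}
        for reb, r_codes, restr in readings:
            if restr and keb not in restr:
                continue
            shared = k_codes & r_codes
            if shared:
                pairs.append((keb, reb, sorted(shared)))
    return pairs


entry = ET.fromstring(SAMPLE_ENTRY)
for keb, reb, codes in aligned_pairs(entry):
    print(keb, reb, codes)
```

With the sample above, nf35 attaches to the KANJI-A/READING pair and nf40
to the KANJI-B/READING pair, which is why the reb carries both codes.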

>> **Xref and antonym problem
>> I was able to build a list of 21 entries where the xref element
>> points to more than one possible target. Because the keb element is
>> not the identifier of the entry, there is an ambiguity.

Yes, at the moment it is largely visual. Ideally the xref should be 
unambiguous, e.g. include the ent-seq in the xref. This can be addressed
once we move to a new database for maintaining the file. At present
it would be far too messy to try and change.
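
For anyone wanting to reproduce such a list, a check along the following
lines should work. It is a minimal sketch under two assumptions: that each
xref or ant holds a bare keb or reb string, and that the whole file fits in
memory. The sample data and the name ambiguous_xrefs are invented for
illustration.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Invented miniature file; the ent_seq values and words are placeholders.
SAMPLE = """
<JMdict>
  <entry><ent_seq>1</ent_seq>
    <k_ele><keb>AB</keb></k_ele><r_ele><reb>ab</reb></r_ele>
    <sense><gloss>first</gloss></sense></entry>
  <entry><ent_seq>2</ent_seq>
    <k_ele><keb>AB</keb></k_ele><r_ele><reb>ba</reb></r_ele>
    <sense><gloss>second</gloss></sense></entry>
  <entry><ent_seq>3</ent_seq>
    <k_ele><keb>CD</keb></k_ele><r_ele><reb>cd</reb></r_ele>
    <sense><xref>AB</xref><ant>CD</ant><gloss>third</gloss></sense></entry>
</JMdict>
"""


def ambiguous_xrefs(root):
    """List (ent_seq, tag, word, candidate ent_seqs) for ambiguous references."""
    # Map every keb and reb string to the set of entries that contain it.
    index = defaultdict(set)
    for entry in root.iter("entry"):
        seq = entry.findtext("ent_seq")
        for tag in ("keb", "reb"):
            for el in entry.iter(tag):
                index[el.text].add(seq)

    # A reference is ambiguous when its text matches more than one entry.
    found = []
    for entry in root.iter("entry"):
        seq = entry.findtext("ent_seq")
        for tag in ("xref", "ant"):
            for ref in entry.iter(tag):
                targets = index.get(ref.text, set())
                if len(targets) > 1:
                    found.append((seq, tag, ref.text, sorted(targets)))
    return found


root = ET.fromstring(SAMPLE)
for item in ambiguous_xrefs(root):
    print(item)
```

Here the xref in entry 3 matches both entry 1 and entry 2, so it is
reported, while the ant matches a single entry and is not.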

>> Here is the list of these entries. To solve this problem, the
>> cross-reference system should be changed to point to the entry
>> sequence number rather than to a word.
>> 
>> 1310600
>> 1348350
>> 1571600
>> 1605845
>> 1637250
>> 2019870
>> 2059750
>> 2065770
>> 2073110
>> 2082140
>> 2083610
>> 2083950
>> 2084240
>> 2085270
>> 2087310
>> 2088280
>> 2113530
>> 2113910
>> 2120540
>> 2121770
>> 2127650

I'll look over the list and see if any can be improved now. A long-term
fix should be considered.

Cheers

Jim

-- 
Jim Breen                                http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology,               Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia                  Fax: +61 3 9905 5146
(Monash Provider No. 00008C)                ジム・ブリーン@モナシュ大学