
Re: [edict-jmdict] JMdict generation



Hi Jim,

Good news.
So what are the remaining dependencies on your old utilities at Monash?

JL

On Tue, 13 Mar 2012 15:24:46 +1100, Jim Breen wrote:
I have just made some long overdue changes to the way the
JMdict versions are generated for distribution.

When the database went live in mid-2010, I simply
made it front-end my existing system by:
- downloading the database in JMdict format
- converting it from UTF-8 to EUC-JP
- pushing it through a utility which converted it into my
old internal text format.

From there the original utilities take over and generate
the EDICT and EDICT2 versions (in EUC-JP) and the
JMdict versions (English only, and multiple languages,
both in UTF-8). This means that the JMdict versions
had gone into EUC-JP, then back to UTF-8. The downside
of this is that any characters which are not valid in
EUC-JP, i.e. anything not in ASCII, JIS X 0208 or JIS X 0212, got
zapped.
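
To illustrate the character loss described above, here is a minimal
Python sketch. It is not the actual conversion utilities, which are not
shown in this message; it assumes Python's built-in "euc_jp" codec as a
stand-in for the UTF-8 to EUC-JP step, and the function names are
hypothetical.

```python
def unencodable_chars(text, encoding="euc_jp"):
    """Return the distinct characters in text that the codec cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


def round_trip(text, encoding="euc_jp"):
    """Simulate UTF-8 -> EUC-JP -> UTF-8, silently dropping what does not fit."""
    return text.encode(encoding, errors="ignore").decode(encoding)


if __name__ == "__main__":
    sample = "ダル दाल"
    print(unencodable_chars(sample))  # the Devanagari code points
    print(repr(round_trip(sample)))   # katakana and the space survive
```

Running a check like unencodable_chars over an entry before editing is
one way to see which characters would not survive an EUC-JP conversion.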

From tomorrow, the JMdict versions are being done as follows:
- JMdict_e, i.e. the one with only English glosses, will be exactly
as it comes from the database, with nothing changed.
- JMdict, i.e. the one with multi-lingual glosses, will still go via
my old text format (because that's where the non-English glosses
get added), but will stay in UTF-8 throughout.

The only change of note in JMdict_e is the way wasei tags work
when they are either partial, or a mix of English and other languages,
or both. From now on they will be done properly, as in the database.
Until now they were simplified and incomplete.

With JMdict, the partly-broken wasei tags are still there. I'll either
fix the utility, or (better) reprogram the inclusion of the other
languages so that they can be included directly into the XML.

These changes mean there is much more freedom in using the full
range of Unicode characters. However, care still needs to be taken,
as they will not necessarily be included in the EDICT/EDICT2 versions.

I have added some comments on this matter to the Editorial Policy:
http://www.edrdg.org/wiki/index.php/Editorial_policy#Character_Codes

[All this has happened because Nils did some work on the ダル/ダール
entry, and added "[lsrc=hin:"दाल dāl",lsrc=urd:"دال dāl"]". I went to
remove the Devanagari, etc., then decided it would be better to handle
them properly rather than simply removing them.]

Cheers

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

