Re: [edict-jmdict] POS, etc. entities in JMdict

Subject: Re: [edict-jmdict] POS, etc. entities in JMdict

From: Olivier Binda <olivier.binda@**********>

Date: Tue, 03 Dec 2013 13:04:53 +0100

On 12/03/2013 07:30 AM, J Greely wrote:

> On Dec 2, 2013, at 9:24 PM, Jim Breen <jimbreen@*********> wrote:
>> Any comments or suggestions?
>
> I'd hate to lose the information from the less-common ones,
> but yes, the first thing I did when I started parsing JMdict
> was add a hook to un-expand entities in all the fields where
> they're used. The standard behavior in XML parsers is quite
> annoying in this case.

Indeed, that is the first thing I had to do too.
Having the same data without the annoying expansion would greatly simplify things for developers and apps... and help save the planet.

(You first have to parse JMdict to get the entities, optionally sorting them into the field/dial/pos/misc categories (I do),
build a map to unexpand them, and then parse JMdict again for the real content.)

It took me at least a day to find the best way to do that in Java with an XmlPullParser
(and that was my second try: at first, I wrote my own Kanjidic/JMdict parsers by hand to do just that).
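
The two passes described above can be sketched roughly as follows. This is only an illustration in Python with the standard library, not the Java XmlPullParser code mentioned here; the sample document, the entity texts in it, and the names build_unexpand_map and unexpand are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Tiny stand-in for JMdict (illustrative entities, not the full DTD).
SAMPLE = """<!DOCTYPE JMdict [
<!ENTITY n "noun (common) (futsuumeishi)">
<!ENTITY uk "word usually written using kana alone">
]>
<JMdict>
<entry>
<sense>
<pos>&n;</pos>
<misc>&uk;</misc>
<gloss>example</gloss>
</sense>
</entry>
</JMdict>"""

# Matches declarations of the form <!ENTITY name "expansion">.
ENTITY_RE = re.compile(r'<!ENTITY\s+(\S+)\s+"([^"]*)"\s*>')


def build_unexpand_map(xml_text):
    """Pass 1: scan the entity declarations and map expansion -> short code."""
    return {expansion: name for name, expansion in ENTITY_RE.findall(xml_text)}


def unexpand(xml_text, tags=("pos", "misc", "field", "dial")):
    """Pass 2: parse normally, then put the short codes back in the given tags."""
    reverse = build_unexpand_map(xml_text)
    root = ET.fromstring(xml_text)  # the parser expands &n; into its full text
    for elem in root.iter():
        if elem.tag in tags and elem.text in reverse:
            elem.text = reverse[elem.text]
    return root


root = unexpand(SAMPLE)
print(root.find(".//pos").text, root.find(".//misc").text)
```

The sketch keeps the whole file in memory for brevity; for the real JMdict a streaming parser would be a more reasonable choice.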

I don't mind how they are coded as long as it makes sense and they are easy to parse, for example

<pos>n</pos>

or

<pos type="n"/>
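
Either form would be trivial to read. A small Python sketch that accepts both encodings (pos_code is a hypothetical helper name, and neither form is the current JMdict format):

```python
import xml.etree.ElementTree as ET


def pos_code(elem):
    """Return the short code from either proposed encoding of <pos>."""
    # Attribute form: <pos type="n"/>; text form: <pos>n</pos>.
    return elem.get("type") or (elem.text or "").strip()


print(pos_code(ET.fromstring("<pos>n</pos>")))
print(pos_code(ET.fromstring('<pos type="n"/>')))
```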

I love the pos/misc/field/dial in JMdict and have a use for most of them.
I would just love to have more meaningful, useful metadata.

To me, the best metadata tags are the ones that can be understood and used by a program, like
stagk, stagr, re_nokanji?, re_restr, re_pri...

Tags that can only be understood by humans, like
re_inf or s_inf,
are great for people who are learning Japanese but not so great for developers because, apart from displaying them on screen, what can you do with them programmatically? There could be anything in those tags!
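
As an illustration of what "usable by a program" means, here is a rough Python sketch that uses re_restr and re_nokanji to work out which readings go with which kanji forms. The entry below is a made-up placeholder (K1, K2, r1... are not real words), and reading_pairs is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up entry: r2 is restricted to K2, r3 is flagged as not a true kanji reading.
ENTRY = """<entry>
<k_ele><keb>K1</keb></k_ele>
<k_ele><keb>K2</keb></k_ele>
<r_ele><reb>r1</reb></r_ele>
<r_ele><reb>r2</reb><re_restr>K2</re_restr></r_ele>
<r_ele><reb>r3</reb><re_nokanji/></r_ele>
</entry>"""


def reading_pairs(entry):
    """Return the (kanji form, reading) pairs allowed by re_restr / re_nokanji."""
    kebs = [k.text for k in entry.findall("k_ele/keb")]
    pairs = []
    for r_ele in entry.findall("r_ele"):
        reb = r_ele.findtext("reb")
        if r_ele.find("re_nokanji") is not None:
            continue  # reading is not a true reading of any kanji form
        restr = [x.text for x in r_ele.findall("re_restr")]
        # No re_restr means the reading applies to every kanji form.
        for keb in (restr or kebs):
            pairs.append((keb, reb))
    return pairs


print(reading_pairs(ET.fromstring(ENTRY)))
```

Nothing comparable can be done with free-text notes such as re_inf or s_inf, beyond showing them as they are.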

Also, I would love having more languages in JMdict...
The second thing I do when parsing Kanjidic and JMdict is add Russian/German/Spanish meanings....

(I would love to add Chinese... there would be something like a billion people who could then benefit from JMdict/Kanjidic... but well... haven't )

Keep up the good work.
JMdict and Kanjidic rock !

Olivier

> Would it make sense to add a top-level element containing
> an array of all the tags and their expansions? They could
> even have an xml:lang attribute for eventual localization.
>
> -j
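
A purely hypothetical sketch of how such a top-level tag table could be consumed, in Python. The element and attribute names (tags, tag, code, category, desc), the French text, and the load_tags helper are all invented for illustration; nothing like this exists in JMdict as of this thread.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical layout: one <tag> per code, one <desc> per language.
SAMPLE = """<JMdict>
<tags>
<tag code="n" category="pos">
<desc xml:lang="en">noun (common) (futsuumeishi)</desc>
<desc xml:lang="fr">nom commun</desc>
</tag>
</tags>
</JMdict>"""


def load_tags(root, lang="en"):
    """Build a code -> description table for one language."""
    table = {}
    for tag in root.findall("tags/tag"):
        for desc in tag.findall("desc"):
            if desc.get(XML_LANG) == lang:
                table[tag.get("code")] = desc.text
    return table


root = ET.fromstring(SAMPLE)
print(load_tags(root))
print(load_tags(root, lang="fr"))
```

With a table like this, entries could carry only the short codes, and a single pass over the file would be enough.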