Re: [edict-jmdict] Yomigata for Edict and Nedict
On 13/08/07, olivier.binda <olivier.binda@wanadoo.fr> wrote:
> Well... Here is an improved batch of the Yomigata for JMedic.xml.
> It contains 103964 regular yomigata and 6818 irregular Yomigata
> (computed in 4 minutes).
> http://www.lakedaemon.org/fileadmin/Nihongo/JMdict-Yomigata-2007-8-12.zip
I have grabbed that and looked at it a bit. Comments below.
> And here is my first batch of Yomigata for JMnedict.xml
> It contains 641725 regular Yomigata and 116886 irregular Yomigata.
> (computed in around 28 minutes).
> http://www.lakedaemon.org/fileadmin/Nihongo/JMnedict-Yomigata2007-8-12.zip
I'll hold off on that for now. JMnedict is simply generated from the
EDICT-style ENAMDICT file, which can't incorporate this form of data yet.
I'd like to raise the question of the most appropriate way of presenting
the kanji/reading alignments. Olivier's file has:
脱窒 だっちつ だっ|ちつ
which is OK to a point, but it also does things like:
払い込む はらいこむ はら|い|こ|む
AFAICT, the purpose of the exercise is to inform people of the components
of the *pronunciation* of a word/phrase associated with each kanji, and to
enable apps to build furigana/ruby displays. Thus I would think the
appropriate content of some JMdict elements might be something like:
<kanjyomi>脱,だっ</kanjyomi>
<kanjyomi>窒,ちつ</kanjyomi>
or even:
<kanjyomi kanji="脱" yomi="だっ"/>
...
and
<kanjyomi>払,はら</kanjyomi>
<kanjyomi>込,こ</kanjyomi>
etc.
This is more along the lines Paul Blay suggested. It seems more useful to
me than spitting out all the kana, regardless of whether it is okurigana,
inflectional endings, etc.
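To make the difference concrete, here is a rough Python sketch of how the
per-kanji pairs could be pulled out of the pipe-delimited alignments. It
assumes one segment per headword character, which is what the two samples
above show, and it treats only the basic CJK Unified Ideographs block as
kanji. The function names are made up for illustration; the kanjyomi
element is only the proposal above, not anything in the JMdict DTD.

```python
def is_kanji(ch):
    """Rough test: basic CJK Unified Ideographs block only (an assumption)."""
    return "\u4e00" <= ch <= "\u9fff"


def kanji_pairs(headword, alignment):
    """Pair each kanji in headword with its reading segment.

    Assumes the alignment has exactly one '|'-separated segment per
    headword character, as in 払い込む / はら|い|こ|む.
    """
    segments = alignment.split("|")
    if len(segments) != len(headword):
        raise ValueError("expected one segment per headword character")
    # Keep only the kanji; okurigana and other kana are dropped.
    return [(ch, seg) for ch, seg in zip(headword, segments) if is_kanji(ch)]


def to_elements(pairs):
    """Render pairs in the proposed (hypothetical) kanjyomi element form."""
    return ["<kanjyomi>{},{}</kanjyomi>".format(k, y) for k, y in pairs]


if __name__ == "__main__":
    for line in to_elements(kanji_pairs("払い込む", "はら|い|こ|む")):
        print(line)
```

Entries where the alignment failed (the ones carrying * markers) would need
separate handling; this sketch does not attempt it.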
> For example, I noticed that :
> a) most entries with western chars use Japanese Ascii, yet, there are
> some that use (non-japanese) ASCII.
> This is inconsistent and maybee this should be fixed.
>
> For example : In JMNedict, the entry
> LGフィリップスLCD
> エルジーフィリップスエルシーディー
> エル|ジー|フ|ィ|リ|ッ|プ|ス|*|*|*エルシーディー 21
>
> uses Japanese ascii for LG and ascii for LCD.
I have changed that to LCD. Thanks.
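Other entries of this kind could be found mechanically. A minimal Python
sketch, assuming "Japanese ASCII" means the full-width Latin letters
(U+FF21-U+FF3A and U+FF41-U+FF5A) and that only letters matter; the
function name is hypothetical:

```python
import string


def mixes_letter_widths(text):
    """Return True if text contains both ASCII and full-width Latin letters."""
    has_ascii = any(c in string.ascii_letters for c in text)
    has_fullwidth = any(
        "\uff21" <= c <= "\uff3a" or "\uff41" <= c <= "\uff5a" for c in text
    )
    return has_ascii and has_fullwidth
```

Running this over the headwords would list the candidates; digits and
punctuation would need their own ranges.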
> b) there are weird stuff in the entries
> For example, in JMNedict :
> 西鉄大牟田線(1.44) にしてつおおむたせん
> 砂防工事専用軌道(0.61) さぼうこうじせんようきどう
> 奥羽本線(秋田新幹線(1.44)) おううほんせん
> 田沢湖線(秋田新幹線(1.44)) たざわこせん
> 羽場久(さんずい+尾)子 はばくみこ
I have fixed those too.
> b) there might be mispelling
>
> せんろく鼻
> せんくろはな
> せ|ん|*|*くろ|はな
Yes. Fixed.
> あづま総合運動公園 あずまそうごううんどうこうえん
> あ|*ず|ま|そう|ごう|うん|どう|こう|えん
Also an error. The (correct) あづまそうごう.... version is there as
well.
> d) there are abbreviations...(my routine fails there...though it tries
> it's best)
>
> 赤城久呂保高原ゴルフ場 あかぎくろほこうげんごるふじょう あか|ぎ|く|ろ
> |ほ|こう|げん|ご|る|ふ|じょう 0
I have since changed all the ごるふじょう to ゴルフじょう.
> This entry is a success.Yet when humans abbreviates things, there is
> not much to do :
>
> 赤城久呂保高原ゴルフ場 ごるふ *|*|*|*|*|ご|*|*|る|ふ|* 08
This was clearly an error. I have deleted it.
> Now, to the matter of the copyrights of those :
>
> > At present I am adding hacks for major contributions to the comments
Oops. "acks" not "hacks".
> in the
> > DTD. This can probably move to a more appropriate place, e.g. a
> contributions
> > section of a WWW page, later.
> > > 3) now, I don't know if this is asking for much, but...
> > > as for copyright...
> > > I'll gladly let any open-source/free project use these Furigana
> for free.
> > >
> > > But, if those Furigana were to be used by a commercial product,
> > > it would be nice if Jim Rose and I could get a reasonable fee
> > > (because we spent a few hard and tough days hacking into mindbogingly
> > > weird functions to get these available for everyone).
> > > In that case, I can be contacted at Olivier.Binda@...
> >
> > Sorry. I can't accept in such restrictions to be placed on material in
> > JMdict.
>
> And for that, well...let's say that I keep the copyright
> (mine ! my precious....)
Actually, I think it is doubtful whether you could sustain a claim of
copyright over a dataset generated by running someone else's data through
a program you wrote. You may well be able to patent the program, if you
have a lot of time and money to spare.
> but, If I can do that considering the fact that I used Kanjidic2.xml,
> jmdict.xml and Jmnedict.xml, let's say that the yomigata that I have
> posted till now are licensed (why go dual when you can go quad and
> even better penta ^-^) under the following licences (pick the one you
> like best) :
> 1) The JMdict and JMnedict license.
> 2) The GPL License (I like you but less than the BSD guys, cause you
> are contagious dear (and plagued by awfull underaged GPL freaks ands
> zealots)).
> 3) The BSD license (I love you guys)
> 4) The MIT License (I love you too, don't be jealous of BSD, dear)
> 5) The CDDL (Gretings to the OpenSolaris guys)
Anything that goes into JMdict has to be covered by the JMdict licence
alone. I'd need your agreement to that before it could be included. Sorry,
but having been caught out in that area before, where you need X's
permission to use one bit and Y's permission to use something else, I
don't want to make it more complicated.
Cheers
Jim
--
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/