Re: Wishlist item (Re: [edict-jmdict] "P" Markers - Google as corpus?
[Jim Rose (Re: Wishlist item (Re: [edict-jmdict] "P" Markers - Google as corpus?) writes:]
>> I was a very poor kid when I was trying to put myself through
>> college, and having free tools based on EDICT and KANJIDIC to learn
>> Japanese really helped out. So it feels good to contribute. But in
>> JMdict / EDICT / ENAMDICT / KANJIDIC / radkfile / Tanaka Corpus
>> (TC), technological know-how things such as a code base haven't
>> really been a part of the public distribution - nor do I feel they
>> necessarily should be. If people were expected to contribute their
>> technological know-how to the public domain, I wonder that anyone
>> would work very hard to develop new tricks. Socialized societies
>> tend to wipe out innovation. While I agree that the sharing of data
>> in the public sphere has probably inspired many developers and
>> innovations, I'm not convinced that code sharing would work as well
>> in the long run.
I don't want to distract us into a discussion of Free Software. I
haven't been the greatest sharer of software myself. The only major slab
of code I released was xjdic (*). I've not released the wwwjdic code,
partly because it's a permanent work-in-progress, and partly because it
completely lacks documentation, especially on how to install and run it.
The wwwjdic code is based on xjdic, of course.
Releasing xjdic as GPL-ed software was interesting. It was used to seed
other packages like MacJDic, Gjiten, Kiten, etc. It's also been a pain
to keep it up-to-date, and forks have happened with the Debian and SuSE
"maintainers" changing it to fit their environments, odd RPMs turning up,
etc. I've had a "2.5" ready to roll for nearly a year, but have never
got around to documenting it, etc. I really should move it into a
properly CVSed SourceForge environment, and turn it loose, but even that
takes a lot of time.
>> And another question that needs to be answered then is this. Is
>> kanji/reading alignment "enhanced data", or just a display technology
>> for plain old normal data?
I think it's enhanced data.
>> If it involved contributing alignment data, and not code technology,
>> I would feel a lot more warm and fuzzy about the idea.
Great.
>> My last work on the euphony/idiom data can parse EDICT from about
>> one year ago. Then I turned my efforts into automatically updating
>> all of the underlying file structure of Ice from fresh copies of
>> EDICT, and TC - which I explained some time ago was not easy - and
>> included automatic EDICT yomigana parsing, and at the time of the
>> accident, I hadn't finished that project. Then before picking up on
>> that project again, I started trying to automate the updating of any
>> file depending on radkfile data, and before finishing that, I got a
>> modern version of GD installed on the only surviving Mac, dropped the
>> radkfile project to address the huge pile of SOD submissions that
>> needed to be edited and turned into animations - which is where I am
>> today.
>>
>> Back when I completed the last update of Ice's underlying EDICT and
>> TC, I branched off a second data set independent from the EDICT set
>> to attempt to parse ENAMDICT and then abandoned it. That project
>> will take a very long time to complete - perhaps as many hours as the
>> data needed to parse EDICT even though the EDICT data is already
>> inside the ENAMDICT data. Suffice it to say, my data cannot parse
>> ENAMDICT at this time, and to parse a new version of EDICT would
>> probably require an additional hour or so of enhancements to the
>> readings file.
The trouble is having too many sub-projects on the run simultaneously.
Jim
--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology, Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学