
Re: Wishlist item (Re: [edict-jmdict] "P" Markers - Google as corpus?

To: edict-jmdict@***************
Subject: Re: Wishlist item (Re: [edict-jmdict] "P" Markers - Google as corpus?
From: Jim Breen <Jim.Breen@**********************>
Date: Thu, 25 Jan 2007 12:28:31 +1100 (EST)

[Jim Rose (Re: Wishlist item (Re: [edict-jmdict] "P" Markers - Google as corpus?) writes:]
>> I was a very poor kid when I was trying to put myself through  
>> college, and having free tools based on EDICT and KANJIDIC to learn  
>> Japanese really helped out.  So it feels good to contribute.  But in  
>> JMdict / EDICT / ENAMDICT / KANJIDIC / radkfile / Tanaka Corpus  
>> (TC),  technological know-how things such as a code base haven't  
>> really been a part of the public distribution - nor do I feel they  
>> necessarily should be.  If people were expected to contribute their  
>> technological know-how to the public domain, I wonder that anyone  
>> would work very hard to develop new tricks.  Socialized societies  
>> tend to wipe out innovation.  While I agree that the sharing of data  
>> in the public sphere has probably inspired many developers and  
>> innovations, I'm not convinced that code sharing would work as well  
>> in the long run.

I don't want to distract us into a discussion of Free Software. I
haven't been the greatest sharer of software myself. The only major slab
of code I released was xjdic (*). I've not released the wwwjdic code,
partly because it's a permanent work-in-progress, and partly because it
completely lacks documentation, especially on how to install and run it.
The wwwjdic code is based on xjdic, of course.

Releasing xjdic as GPL-ed software was interesting. It was used to seed
other packages like MacJDic, Gjiten, Kiten, etc. It's also been a pain
to keep it up-to-date, and forks have happened with the Debian and SuSE
"maintainers" changing it to fit their environments, odd RPMs turning up,
etc. I've had a "2.5" ready to roll for nearly a year, but have never
got around to documenting it, etc. I really should move it into a
properly CVSed SourceForge environment, and turn it loose, but even that
takes a lot of time.
 
>> And another question that needs to be answered then is this.  Is  
>> kanji/reading alignment "enhanced data", or just a display technology  
>> for plain old normal data?

I think it's enhanced data.

>> If it involved contributing alignment data, and not code technology,  
>> I would feel a lot more warm and fuzzy about the idea.

Great.

>> My last work on the euphony/ idiom data can parse EDICT from about  
>> one year ago.  Then I turned my efforts into automatically updating  
>> all of the underlying file structure of Ice from fresh copies of  
>> EDICT, and TC - which I explained some time ago was not easy - and  
>> included automatic EDICT yomigana parsing, and at the time of the  
>> accident, I hadn't finished that project.  Then before picking up on  
>> that project again, I started trying to automate the updating of any  
>> file depending on radkfile data, and before finishing that, I got a  
>> modern version of GD installed on the only surviving Mac, dropped the  
>> radkfile project to address the huge pile of SOD submissions that  
>> needed to be edited and turned into animations - which is where I am  
>> today.
>> 
>> Back when I completed the last update of Ice's underlying EDICT and  
>> TC, I branched off a second data set independent from the EDICT set  
>> to attempt to parse ENAMDICT and then abandoned it.  That project  
>> will take a very long time to complete - perhaps as many hours as the  
>> data needed to parse EDICT even though the EDICT data is already  
>> inside the ENAMDICT data.  Suffice it to say, my data cannot parse  
>> ENAMDICT at this time, and to parse a new version of EDICT would  
>> probably require an additional hour or so of enhancements to the  
>> readings file.

The trouble is having too many sub-projects on the run simultaneously.

Jim

-- 
Jim Breen                                http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology,               Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia                  Fax: +61 3 9905 5146
(Monash Provider No. 00008C)                ジム・ブリーン@モナシュ大学
