Re: Opinions on a new project based on KANJIDIC (and EDICT)



Yes, I definitely wanted to release it under a CC license. The acknowledgement goes without saying; in fact, my preferred way of putting this out would be for you to host it on the edrdg.org website, if you have no objections, that is.

I know about the updating issue; it is in fact a pain in the butt for me as a non-programmer. I'm only concerned about the fields containing the definitions, though. The way I would update to a newer version of the definitions is by taking EDICT, stripping out a set of pre-defined entries with regular expressions (to match my template), and then comparing the result to my database with an SQL join. The last time I did this, it returned around 100 entries that I had to change manually, but note that I hadn't updated in months, so ~100 differences over a couple of months wasn't really an issue (remember, I only use ~12,000+ entries, not 200,000+). After setting up the website, I was intending to do this manual update once a month; it would take me maybe 30 minutes for those few entries that differ. Note that this updating issue only exists with my particular custom format, because I renamed the PoS abbreviations to their full names and made some other format changes. If somebody imports the data straight from EDICT without any custom modifications, updating is a non-issue and can be done fully automatically.
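For illustration, the regex-plus-join comparison described above could look roughly like the following. This is only a minimal sketch, not the procedure I actually run: the table names `my_entries` and `edict_new`, the column names, the simplified EDICT line pattern and the "; " gloss separator are all assumptions made up for the example, and a real custom template (full PoS names and so on) would need its own normalization step.

```python
import re
import sqlite3

# Simplified (assumed) EDICT line shape: HEADWORD [READING] /gloss/gloss/
EDICT_LINE = re.compile(r"^(\S+)(?:\s+\[([^\]]+)\])?\s+/(.*)/\s*$")


def parse_edict_line(line):
    """Return (headword, reading, definition), or None if the line does not match."""
    m = EDICT_LINE.match(line)
    if m is None:
        return None
    headword, reading, glosses = m.groups()
    # Kana-only entries have no [READING] part; reuse the headword.
    # Joining glosses with "; " is a hypothetical template choice.
    return headword, reading or headword, glosses.replace("/", "; ")


def find_changed_entries(conn, edict_lines):
    """List entries whose definition in EDICT differs from the local copy.

    Assumes a hypothetical local table my_entries(headword, reading, definition).
    """
    conn.execute("DROP TABLE IF EXISTS edict_new")
    conn.execute(
        "CREATE TEMP TABLE edict_new (headword TEXT, reading TEXT, definition TEXT)"
    )
    rows = [r for r in map(parse_edict_line, edict_lines) if r is not None]
    conn.executemany("INSERT INTO edict_new VALUES (?, ?, ?)", rows)
    # The join keeps only entries present in both tables; the WHERE clause
    # narrows that down to the ones that need a manual look.
    return conn.execute(
        """
        SELECT m.headword, m.definition, e.definition
        FROM my_entries AS m
        JOIN edict_new AS e
          ON e.headword = m.headword AND e.reading = m.reading
        WHERE e.definition <> m.definition
        ORDER BY m.headword
        """
    ).fetchall()
```

Each returned row holds the headword, the local definition and the new EDICT definition, which is the short list that then has to be reviewed by hand.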

As for the other updates: as I've written, I've stripped a lot of stuff out of KANJIDIC, such as codepoints which I find useless and whatnot. The strokes, radicals, regular onyomi, regular kunyomi, etc., are pretty much complete and correct and therefore need no updates. So really the only things that can still be expanded on are the onyomi, kunyomi, nanori and definition fields. However, as I've said, I've included a lot of obscure readings from the KanKen tests and Kanjigen for all of the 6355 kanji (+445 others), so any extra readings that might get added to KANJIDIC will be super-obscure and not really of interest to me personally or to my database. I do understand this to be an issue if one wants to document _ALL_ the readings and have a complete database, even though the chances of somebody needing _that_ reading are close to zero: if somebody is working with such obscure readings, chances are they use and prefer a Japanese-Japanese dictionary in the first place. If anything, the readings in KANJIDIC should be updated using the data from my database (there are also instances in KANJIDIC where the okurigana separator is in the wrong position). And since I've got separate columns for the regular readings, which are 100% correct and need no updating, the emphasis is once again on the obscure readings, which are not a big priority.

So the only thing that's left is the definitions field, at the end of which I've also appended the "common meanings" whenever such a common meaning was missing from the main entry. I've also stripped duplicate definitions of the type "American style, separator, British style," since I really had no need for them myself, so syncing this field with KANJIDIC will be a pain. However, since I've added the "common meanings" (for joyo kanji), the extra meanings aren't really that big of a priority (for me), except, again, if one wants to be thorough and have _ALL_ the meanings.

Yes, I know it would have been nice if I had kept you updated daily on all the changes and corrections, but truthfully, I did so much that I lost track of it all, and it would have been way too much of a hassle. When I began work on this, I only wanted the correct strokes and readings; I never thought I'd go to this length with it... So I figured I'd complete my project first and then share it with you (and others), hoping it wouldn't be too much of a pain to synchronize everything.