RE: [edict-jmdict] duplicate elements in jmdict



> Very nice. I have done these sorts of files at various times over the
> years. It's amazing how duplicates and near duplicates sneak in.
> 
> When I was building the first JMdict-style file from the original EDICT
> I had to combine a heap of entries. I did a semi-automatic merge where
> the kanji parts were the same and there was an overlap of words in the
> English part (bag-of-words). I then edited the result by hand. Speeded
> things up.

My "gee-whiz-look-what-you-can-do-with-jmdict-in-a-database"
posts aren't meant to sound like I am going where no jmdict person
has ever gone before :-) .  I know you (general/plural) can do all this
stuff with Unix text tools, some scripts, some code, etc.  And I know that
you (Jim) probably have a lot of such code already built into libraries,
can use it very efficiently, and have probably done all these things
many times before.

What I am hoping to sell is that a lot of these things are easier and
faster with a database than in code, especially for one-off questions
or more complex questions.  The price to be paid is learning SQL, but
that is not as big a problem as book and courseware publishers would
have one believe.  (As is the case with learning hiragana/katakana, I
suppose.)  I am hoping that you (Jim) and you (jmdict editors) will find
direct access to the database a very useful resource, in addition to
the routine stuff done via a web interface or ad hoc methods on text
files.
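
As a rough illustration of the kind of one-off question meant here, the
sketch below finds kanji/reading pairs that occur in more than one entry.
Everything in it is an assumption made for the example: the single
`entry` table, its column names, the sample rows, and the use of SQLite
from Python.  The actual jmdict database layout is not described in this
message and will certainly be more involved.

```python
import sqlite3

# Hypothetical, much-simplified schema: one row per (entry id, kanji, reading).
# This is NOT the real jmdict database layout, only a stand-in for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER, kanji TEXT, reading TEXT)")
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [
        (1, "袋", "ふくろ"),
        (2, "鞄", "かばん"),
        (3, "袋", "ふくろ"),  # made-up duplicate of entry 1
    ],
)


def find_duplicates(conn):
    """Return (kanji, reading, count) for each pair that appears more than once."""
    sql = """
        SELECT kanji, reading, COUNT(*) AS n
        FROM entry
        GROUP BY kanji, reading
        HAVING COUNT(*) > 1
        ORDER BY kanji, reading
    """
    return conn.execute(sql).fetchall()


duplicates = find_duplicates(conn)
print(duplicates)
```

The whole question fits in one GROUP BY / HAVING query, and changing the
question (say, grouping on reading alone) means editing one line rather
than rewriting a script.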