
Re: Wishlist item (Re: [edict-jmdict] "P" Markers - Google as corpus?




On Jan 23, 2007, at 8:30 PM, Stuart McGraw wrote:
No problem here about not wanting to release the code.  That's
absolutely your right.  (Although I disagree that giving other
people access to it would reduce Kanji Cafe's popularity, but
that's a philosophical discussion way out of scope for this
list :-)


Yeah, I'm just funny about code.  Code feels more like a child I've raised, whereas data feels more like animals I've collected.


I notice that in Kanji Cafe you go further than what I was asking
about (which was determining the kanji-reading mapping for
the known readings in each jmdict entry).  Kanji Cafe also seems
to be choosing *the* right reading (of the several possible) for
jmdict word occurrences in example sentences.


No it doesn't.  The readings are derived from Jim Breen's morphological analysis of the Tanaka Corpus with the ChaSen tool.  That analysis is full of errors, which have only been partially corrected by Jim, Paul, myself, and many others.  In fact I hope to invest a chunk of time correcting those errors, using normal everyday use of Ice Mocha as the venue for error discovery...  if only I were locked in a room with a steady supply of Coca-Cola, fruit shakes, and barbecued steak... and no other obligations to anyone or anything else.



That is obviously
much harder.  Did you hand-pick the readings?  Picking them
algorithmically (heuristically)?  If the latter how accurate do
you feel it is?


I was able to install ChaSen on a G4 Macintosh running OS X 10.4.  I haven't yet used it to create new whiz-bang tools, but it comes with a Perl module too.
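
On the simpler question raised earlier in the thread (mapping each kanji portion of a jmdict entry to its share of a known reading), here is a minimal sketch of one common approach.  It is not the Kanji Cafe code, which is not released, and it does not use ChaSen.  The function names `is_kana` and `align` are made up for illustration.  It assumes the reading is written in the same kana script as any kana in the written form, and it leaves a run of several kanji as one unit, since splitting inside a run would need a table of per-kanji readings.

```python
import re


def is_kana(ch):
    # Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) blocks.
    return "\u3040" <= ch <= "\u30ff"


def align(written, reading):
    """Pair each kanji run or kana run of `written` with its part of `reading`.

    Returns a list of (surface, reading) tuples, or None when the reading
    cannot be laid over the written form.
    """
    # Split the written form into alternating kana and non-kana runs.
    segments = []
    for ch in written:
        kana = is_kana(ch)
        if segments and segments[-1][1] == kana:
            segments[-1][0] += ch
        else:
            segments.append([ch, kana])

    # Kana must appear literally in the reading; each non-kana run
    # becomes a capture group that absorbs whatever lies between.
    pattern = "".join(
        re.escape(text) if kana else "(.+?)" for text, kana in segments
    )
    match = re.fullmatch(pattern, reading)
    if match is None:
        return None

    groups = iter(match.groups())
    return [(text, text if kana else next(groups)) for text, kana in segments]


if __name__ == "__main__":
    print(align("食べ物", "たべもの"))
```

Non-greedy groups simply take the first split that fits, so a word whose okurigana also occurs inside a kanji reading can be split wrongly.  Anything like that would still need hand checking.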