On Aug 17, 2007, at 3:02 AM, Paul Blay wrote:
> Indeed, for intra-JMdict xrefs. I regard the Tanaka corpus as a
> freestanding project. There are people out there who use it
> and who'd be rather upset if the current indices were replaced by
> codes that only applied to JMdict. (Supplemented is probably OK.)
>
> I could probably live with a setup where a supplemented alternative
> version is generated on an infrequent basis. Hmm ... might be fiddly
> but doable.
Let's not forget the complications introduced when you try to improve the glosses, and a number isn't of much use to the human intervener who will be correcting by hand. How often have you run into two glosses, parsed by ChaSen way back when, that would better serve the user as a single, longer entry in JMDICT/EDICT? Machine logic can't discover that at the moment... though it could if you wanted to pair up every adjacent B gloss and see whether the combination had an entry.
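To make the pairing idea concrete, here is a rough Python sketch. It is only an illustration under assumptions of mine: the function name, the toy headword set and the sample tokens are all hypothetical, and it assumes each B line has already been split into bare headwords (readings, sense numbers and surface forms stripped) and that the dictionary headwords are loaded into a set.

```python
def find_mergeable_pairs(b_tokens, headwords):
    """Return (index, left, right, merged) for each pair of adjacent
    B-line tokens whose concatenation is itself a dictionary headword."""
    hits = []
    for i in range(len(b_tokens) - 1):
        merged = b_tokens[i] + b_tokens[i + 1]
        if merged in headwords:
            hits.append((i, b_tokens[i], b_tokens[i + 1], merged))
    return hits


# Hypothetical toy data, not taken from the real corpus or dictionary.
headwords = {"日本", "語", "日本語", "を", "話す"}
b_tokens = ["日本", "語", "を", "話す"]

for hit in find_mergeable_pairs(b_tokens, headwords):
    print(hit)
```

Anything this turns up is only a candidate for a person to look at. It checks pairs alone, so an entry that spans three or more glosses would slip through, and a pair that happens to spell a headword is not necessarily the word the sentence means.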
Also, if you're not careful, you could easily start introducing the wrong pseudo gloss number into the B line, because you're not starting out with the number but have to discover it. Many a time you find the wrong word when you're trying to find a B line gloss's counterpart in the dictionary with machine logic alone.
And I suspect the level of work "still to be done" on the TC is quite vast. There are always words in the A line that simply do not appear on the B line, and that could take years to flesh out.
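For what it's worth, a first pass at spotting those gaps could be automated along these lines. Again a hedged sketch with made-up names and data: it assumes you can pull the surface form of each B-line word, in sentence order, and it simply reports the stretches of the A line that no B word accounts for.

```python
def uncovered_spans(a_text, b_surface_forms):
    """Return the pieces of the A line that no B-line surface form covers,
    matching the forms left to right in the order given."""
    gaps = []
    cursor = 0
    for form in b_surface_forms:
        pos = a_text.find(form, cursor)
        if pos == -1:
            continue  # form not found after the cursor; skip it
        if pos > cursor:
            gaps.append(a_text[cursor:pos])
        cursor = pos + len(form)
    if cursor < len(a_text):
        gaps.append(a_text[cursor:])
    return gaps


# Hypothetical example: the particles are missing from the B line.
a_line = "私は日本語を話す"
b_forms = ["私", "日本語", "話す"]
print(uncovered_spans(a_line, b_forms))
```

That only tells you where something is missing, not which dictionary entry belongs there, so the hand work is still needed.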