
Re: [edict-jmdict] Re: Regarding the ENT_SEQ field in JMDICT




On Aug 17, 2007, at 8:33 AM, Paul Blay wrote:

> Many a time you find
> the wrong word when you're trying to find a B line gloss's corollary
> in the dictionary with machine logic alone.

I'm not with you. Do you mean when generating a _new_ B line from
a sentence?


I just mean that there are ambiguous entries: different words with identical "spelling". Those are hard to match automatically.
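
To make the ambiguity concrete, here is a minimal sketch in Python. The mini-dictionary, the placeholder ids and the function names are all made up for illustration; they are not real JMdict sequence numbers or any existing tool's API. The only point is that a lookup keyed on the written form alone can return more than one entry.

```python
from collections import defaultdict

# Hypothetical mini-dictionary: (placeholder id, headword, reading, gloss).
# The ids are invented for this example, not real ENT_SEQ values.
ENTRIES = [
    ("id-1", "辛い", "からい", "spicy"),
    ("id-2", "辛い", "つらい", "painful"),
    ("id-3", "学校", "がっこう", "school"),
]

def build_index(entries):
    """Group entries by written form, ignoring the reading."""
    index = defaultdict(list)
    for entry_id, headword, reading, gloss in entries:
        index[headword].append((entry_id, reading, gloss))
    return index

def lookup(index, headword):
    """Return every candidate entry; more than one means the match is ambiguous."""
    return index.get(headword, [])

index = build_index(ENTRIES)
candidates = lookup(index, "辛い")
is_ambiguous = len(candidates) > 1
```

With only the written form to go on, the program cannot tell which of the two 辛い entries a sentence intends; it needs the reading or some other unique key.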


> And I suspect the level of work "still to be done" on the TC is quite
> vast. There are always words in the A line that simply do not appear
> on the B line, and that could take years to flesh out.

As I mentioned in one recent post the number of sentences in TC in the
A line that do not appear in the B line is currently 11,040 from
154,726 (excluding words that are _intentionally_ not appearing).
That's a lot of work but it is far from impossible. I've no way to tell
how large that figure was exactly when I started working on the file
but I believe it was probably over 50% (80,000+ at the time).


Not sure what you mean by "number of sentences in A line".  If you mean "number of sentences with words which do not appear in the B line", then this is what I'm writing about.  But then how would you know?  Chasen fails to find many words.  You can check for kanji in A that are not in B, but that still under-represents the words spelled out in kana which have not been reduced to B line entries.  So how can you say with confidence that the number is 11,040?  I'm working with an old version of the TC, but I suspect that the problem is much more pervasive than 11,040.
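
A rough sketch of the kanji check I have in mind, in Python. The function name and the sample A/B lines are invented for illustration, the B line syntax is simplified, and the character range only covers the basic CJK ideograph block, so treat it as an assumption-laden sketch rather than a real checker.

```python
import re

# Basic CJK Unified Ideographs block only (an assumption; rarer kanji
# in extension blocks are not matched).
KANJI = re.compile(r"[\u4e00-\u9fff]")

def uncovered_kanji(a_sentence, b_line):
    """Return the kanji that occur in the A sentence but nowhere in the B line."""
    return set(KANJI.findall(a_sentence)) - set(KANJI.findall(b_line))

# Invented example: 学校 is in the sentence but has no B line entry.
missing = uncovered_kanji("彼は学校へ行った。", "彼 は 行く{行った}")

# Invented example: りんご is written in kana and has no B line entry,
# yet the check reports nothing because there is no kanji to compare.
missed_kana = uncovered_kanji("かれはりんごをたべた。", "彼 は 食べる{たべた}")
```

The first call flags 学 and 校. The second returns an empty set even though りんご is not indexed, which is exactly the under-counting I mean.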


That is not to say there are not mistakes in the collection,
because there darn well are, but when I no longer have a backlog
of records to index the TC will be a lot more mature than it is now.


I'm thinking TC could become the core of a much larger corpus too.