On Aug 17, 2007, at 8:33 AM, Paul Blay wrote:
> Many a time you find
> the wrong word when you're trying to find a B line gloss's corollary
> in the dictionary with machine logic alone.
I'm not with you. Do you mean when generating a _new_ B line from
a sentence?
Just saying that there are ambiguous entries... different words, identical "spelling". Hard to match them automatically.
> And I suspect the level of work "still to be done" on the TC is quite
> vast. There are always words in the A line that simply do not appear
> on the B line, and that could take years to flesh out.
As I mentioned in one recent post the number of sentences in TC in the
A line that do not appear in the B line is currently 11,040 from
154,726 (excluding words that are _intentionally_ not appearing).
That's a lot of work but it is far from impossible. I've no way to tell
how large that figure was exactly when I started working on the file
but I believe it was probably over 50% (80,000+ at the time).
Not sure what you mean by "number of sentences in A line". If you mean "number of sentences with words which do not appear in the B line", then this is what I'm writing about. But then how would you know? Chasen fails to find many words. You can check for kanji in A that are not in B, but that still under-represents the words spelled out in kana which have not been reduced to B line entries. So how do you say with confidence that the number is 11,040? I'm working from an old version of the TC, but I suspect that the problem is much more pervasive than 11,040.
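To make the kanji check concrete, here is a rough Python sketch. It is only an illustration under assumptions: the function name is made up, and it assumes an A line looks like "A: <Japanese>\t<English>" and a B line looks like "B: <space-separated headwords>", which may not match the file exactly.

```python
import re

# Basic CJK Unified Ideographs block only; marks such as the
# iteration character (U+3005) are deliberately ignored in this sketch.
KANJI = re.compile(r"[\u4e00-\u9fff]")

def kanji_missing_from_b(a_line: str, b_line: str) -> set[str]:
    """Return kanji that occur in the A-line sentence but nowhere in the B line.

    Assumed (hypothetical) layout:
      A: <Japanese sentence>\t<English translation>
      B: <space-separated headwords with optional annotations>
    """
    a_text = a_line.removeprefix("A: ").split("\t")[0]  # drop the English half
    b_text = b_line.removeprefix("B: ")
    return set(KANJI.findall(a_text)) - set(KANJI.findall(b_text))
```

A sentence whose unindexed word is written in kana returns an empty set, so it looks fully indexed to this check. That is the under-counting I mean.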
That is not to say there are not mistakes in the collection,
because there darn well are, but when I no longer have a backlog
of records to index the TC will be a lot more mature than it is now.
I'm thinking TC could become the core of a much larger corpus too.