
Re: [edict-jmdict] Re: Regarding the ENT_SEQ field in JMDICT



> On Aug 17, 2007, at 8:33 AM, Paul Blay wrote:
>
> > > Many a time you find
> > > the wrong word when you're trying to find a B line gloss's corollary
> > > in the dictionary with machine logic alone.
>
> > I'm not with you. Do you mean when generating a _new_ B line from
> > a sentence?
>
> Just saying that there are ambiguous entries... different words,
> identical "spelling".  Hard to match them automatically.

There are no known ambiguous entries from the B line to Edict.
Every three or four months I go through a somewhat fiddly process
to ensure that this is still the case (and update/correct where
required).

> > As I mentioned in one recent post the number of sentences in TC in the
> > A line that do not appear in the B line is currently 11,040 from
> > 154,726 (excluding words that are _intentionally_ not appearing).
> > That's a lot of work but it is far from impossible. I've no way to tell
> > how large that figure was exactly when I started working on the file
> > but I believe it was probably over 50% (80,000+ at the time).
>
> Not sure what you mean by "number of sentences in A line".
> If you mean "number of sentences with words which do not appear in the
> B line", then this is what I'm writing about.

That is what I meant, as I posted in a follow-up correction.

> But then how would you know?

Simple.  Every character in the A line is accounted for by
a) an actual match to a keyword
or
b) a match to an intentionally excluded word/symbol.
Anything left over means that sentence isn't finished.

It works as follows:
A Line:
私はひげをそっているときに顔を切った。
English:
I cut myself while shaving.
B Line:
私(わたし) は を 時(とき){とき} に 顔 を 切る{切った}
Check-off line (x's mark characters that are accounted for):
xxひげxそっているxxxxxxxxx
NoIndex Line:
 。 x

Sentences whose 'Check-off line' consists entirely of x's are
'finished' (barring mistakes).  A mistake in this case means
'indexed to the wrong word', not 'un-indexed'.
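
For anyone who wants to try the idea, here is a rough Python sketch of
the check-off step.  It is not the actual script used on the file: the
names (TOKEN, coverage, check_off, is_finished) are made up for
illustration, it only understands the B line token shapes seen in the
example above (headword, optional (reading), optional {form as it
appears in the sentence}), and it assumes the B line words occur in the
same order as in the A line.  The default no-index set of just "。" is
also an assumption.

```python
import re

# Hypothetical token shape: headword, optional (reading), optional {surface form}.
TOKEN = re.compile(r"^([^(){}]+)(?:\(([^)]*)\))?(?:\{([^}]*)\})?$")


def coverage(a_line, b_line, no_index=("。",)):
    """Return one flag per A line character: True if it is accounted for."""
    # Characters on the no-index list count as accounted for from the start.
    covered = [ch in no_index for ch in a_line]
    pos = 0
    for token in b_line.split():
        match = TOKEN.match(token)
        if match is None:
            continue  # token shape not handled by this sketch
        headword, _reading, surface = match.groups()
        # The {...} part, when present, is the form used in the sentence.
        form = surface or headword
        start = a_line.find(form, pos)
        if start == -1:
            continue  # form not found after the previous match
        for i in range(start, start + len(form)):
            covered[i] = True
        pos = start + len(form)
    return covered


def check_off(a_line, b_line, no_index=("。",)):
    """Build the check-off line: x for accounted characters, others as-is."""
    flags = coverage(a_line, b_line, no_index)
    return "".join("x" if done else ch for ch, done in zip(a_line, flags))


def is_finished(a_line, b_line, no_index=("。",)):
    """A sentence is finished when every character is accounted for."""
    return all(coverage(a_line, b_line, no_index))
```

Run on the example sentence and B line above, check_off() gives back
the same check-off line, with ひげ and そっている left over, so
is_finished() reports the sentence as not finished.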

> Chasen fails to find many words.

I don't use Chasen.

> You can check for kanji in A that are not in B, but that still
> under-represents the words spelled out in kana which have not been
> reduced to B line entries.  So how do you say with confidence that the
> number is 11,040?  I'm working an old version of the TC, but I suspect
> that the problem is much more pervasive than 11,040.

Nope, there are exactly 11,034 in the file I am currently working on.
There will be a few more in the version currently online until I next
update.