Re: [edict-jmdict] Re: Regarding the ENT_SEQ field in JMDICT



> > > You could use a method of checking to see if a longer, more precise
> > > gloss is possible to create from the shorter ones.
> >
> > If you've got one, I'll take it. ;-)
>
> Goal is to parse this word out of a sentence:
>
> 得になる
>
> (exp) to do (a person) good; to bring profit
>
> But instead you currently parse:
>  得{とく}
> (adj-na,n,vs) profit; gain; interest
>
> に
> (prt) indicates such things as location of person or thing, location of short-term action, etc.
>
> なる(為る)
> (v5r) to change; to be of use; to reach to

Er, no.

> OR
>
> なる(生る)
> (v5r) to bear fruit

No.

> OR
>
> なる(成る)
> (v5r) (uk) to become

And no.

Actually it (there was only one such sentence) was indexed to

>  得{とく}
> (adj-na,n,vs) profit; gain; interest

and

になる (suf) (1) becomes; will become; (2) (with "o" and masu-stem of
verb) grammatical form creating an honorific verbal expression; (P)

The other ones are not actual possibilities for various reasons.

> So on first pass, you join adjacent B line glosses and check for JMDICT entries:
>
> 得に
>
> but you strike out.  But because ni is kana, you take a chance and add the next parsed word
>
> 得になる
> (exp) to do (a person) good; to bring profit
>
> Your program now erases three glosses from the B line and adds one new one.
>
> Extending string theory could be used, and a genetic algorithm could
> use readings instead of kanji or vice versa for each test mutant.  Some
> arbitrary number of bins could be set as max bin size.
>
> Of course this will require that you dedicate a computer for perhaps a
> few days to see how many it can dig out, but once it's done it's done, and
> you currently have no method of knowing how many TC lines are affected by
> this.

Like I said, I've done some work along these lines so I do have a rough
idea how things go.  There are no doubt more to be found, but most need
kana changing to kanji, or inflections fiddling with, before they match.
I doubt that as many as 1,000 sentences would be affected, but I'm pretty
sure at least a dozen would be.  Exactly where between those two it
falls I don't know.
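
For anyone who wants to try it, the first-pass merge proposed above
could be sketched roughly as follows.  This is only a minimal Python
sketch under some assumptions: the sentence is already split into its
indexed words, the dictionary headwords sit in a plain set, and the
names merge_adjacent, headwords and max_span are made up for the
illustration.  It does nothing about inflections or kana/kanji
differences, which is where most of the real misses are.

```python
def merge_adjacent(tokens, headwords, max_span=4):
    """Greedily replace runs of adjacent tokens with a longer headword.

    tokens    -- surface strings as currently indexed for one sentence
    headwords -- set of dictionary headwords (hypothetical stand-in
                 for the JMdict/EDICT keys)
    max_span  -- most adjacent tokens to try joining (assumed limit)
    """
    result = []
    i = 0
    while i < len(tokens):
        merged_span = 0
        # Try the longest span first so the most specific entry wins.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in headwords:
                result.append(candidate)
                merged_span = span
                break
        if merged_span:
            i += merged_span
        else:
            # No longer entry starts here; keep the token as it was.
            result.append(tokens[i])
            i += 1
    return result


headwords = {"得", "に", "なる", "になる", "得になる"}
print(merge_adjacent(["得", "に", "なる"], headwords))
```

Trying the longest span first skips the "strike out on 得に, then
extend" step, since 得になる is tested before 得に.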

> > > Don't think that is something you can do perfectly by hand,
> >
> > Probably not, although I also don't think it's something you can
> > do perfectly by computer. I have done some rough work in that
> > area but I think it's reached the point where the gains possible
> > are not worth the time/difficulty involved for computer assisted
> > processes.
>
> Not perfectly because some words will have inflections.  You write
> a short set of instructions and walk away from it.  Difficulty?

Not that short, and I've only got the one computer so I don't really
want to walk away from it for that long.  I'm also not a programmer,
although I can fake one to people who don't know the difference.
If someone _else_ wants to run through and come back with a list of
edict entries that have matches in the examples but have not been
indexed I'd be happy to check them.
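
The sort of list I mean could come from something as simple as the
following.  Again this is just a sketch in Python with assumed data
layouts: examples is taken to be a list of (sentence, indexed words)
pairs, and unindexed_matches and min_len are names invented here.  It
only finds verbatim substring matches, so inflected or differently
written forms will slip through.

```python
def unindexed_matches(headwords, examples, min_len=3):
    """Return (headword, sentence) pairs where the headword occurs
    verbatim in the sentence but is not among its indexed words.

    headwords -- iterable of dictionary headwords
    examples  -- list of (sentence, indexed_words) pairs; this layout
                 is an assumption, not the real file format
    min_len   -- skip headwords shorter than this (assumed cutoff to
                 avoid a flood of one- and two-character hits)
    """
    long_headwords = sorted(hw for hw in set(headwords) if len(hw) >= min_len)
    hits = []
    for sentence, indexed in examples:
        indexed_set = set(indexed)
        for hw in long_headwords:
            if hw in sentence and hw not in indexed_set:
                hits.append((hw, sentence))
    return hits


examples = [("それは君の得になる。", ["其れ", "は", "君", "の", "得", "になる"])]
for hw, sentence in unindexed_matches(["得", "になる", "得になる"], examples):
    print(hw, sentence)
```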

> > and I would also think that the longer a compound is, the more
> > likely it is to appear in a sentence in its dictionary/JMDICT form.
>
> I'm not sure of your logic there. I could argue, for example, that
> the longer a compound is the more likely it is that it will have a
> kana/kanji/particle difference that will cause it not to match
> a near equivalent in an example sentence. I suspect, though,
> that most long Edict entries will just not exist in Example sentences.
> There seems to be a rough rule "longer = more obscure". Although
> some proverbs and sayings break that.
>
> No Paul, longer implies NOUN... compounds...long ones, tend to be
> names of things, and not verbs.

Are you guessing or did you check?

Of the top 100 longest Edict entries (that include some kanji)
47 are sentences.*  Of course even if they are noun compounds
that doesn't mean that a) the kanji/kana-ization will match or
b) they will actually exist in an example sentence.

> Therefore the longer a word is, the more likely it is not to have
> inflected forms, and thus be the same as its dictionary version.

* Only including one version from multiple headwords, not checked
closely.
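
For anyone who wants to check the count, pulling out the longest
entries that include some kanji can be sketched like this in Python.
The function name longest_with_kanji is made up, the kanji test only
covers the basic CJK Unified Ideographs block, and the sketch does not
collapse multiple headwords of the same entry into one.  Deciding
which of the results are sentences still has to be done by eye.

```python
import re

# Basic CJK Unified Ideographs block only; rarer kanji in the
# extension blocks are not covered by this assumption.
KANJI = re.compile(r"[\u4e00-\u9fff]")


def longest_with_kanji(headwords, n=100):
    """Return the n longest distinct headwords containing a kanji."""
    candidates = {hw for hw in headwords if KANJI.search(hw)}
    # Longest first; ties broken by the string itself for stable output.
    return sorted(candidates, key=lambda hw: (-len(hw), hw))[:n]


sample = ["得になる", "になる", "得", "猿も木から落ちる"]
print(longest_with_kanji(sample, n=2))
```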