
Re: [edict-jmdict] Re: Regarding the ENT_SEQ field in JMDICT




On Aug 17, 2007, at 12:36 PM, Paul Blay wrote:

> You could use a method of checking to see if a longer, more precise
> gloss is possible to create from the shorter ones.

If you've got one, I'll take it. ;-)


Goal is to parse this word out of a sentence:

得になる 
(exp) to do (a person) good; to bring profit

But instead you currently parse:
 得{とく}
(adj-na,n,vs) profit; gain; interest

に
(prt) indicates such things as location of person or thing, location of short-term action, etc.

なる(為る)
(v5r) to change; to be of use; to reach to

OR

なる(生る)
(v5r) to bear fruit

OR

なる(成る)
(v5r) (uk) to become


So on the first pass, you join adjacent B line glosses and check for JMDICT entries:

得に

but you strike out.  But because ni is kana, you take a chance and add the next parsed word:

得になる 
(exp) to do (a person) good; to bring profit


Your program now erases three glosses from the B line and adds one new one.
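
Roughly, the pass described above could be sketched like this. It is only a sketch under assumptions: the JMDICT dict is a tiny hypothetical stand-in for a real headword lookup, the names is_kana and merge_tokens are made up for illustration, and max_len is an arbitrary cap on how many parsed words get joined.

```python
# Hypothetical mini-dictionary standing in for a real JMDICT headword lookup.
JMDICT = {
    "得になる": "(exp) to do (a person) good; to bring profit",
    "得": "(adj-na,n,vs) profit; gain; interest",
    "に": "(prt) indicates such things as location of person or thing",
    "なる": "(v5r) (uk) to become",
}


def is_kana(text):
    """True if text is non-empty and every character is hiragana or katakana."""
    return bool(text) and all("\u3040" <= ch <= "\u30ff" for ch in text)


def merge_tokens(tokens, dictionary, max_len=4):
    """Greedily replace runs of adjacent parsed words with a longer entry.

    max_len is an assumed cap on how many parsed words may be joined.
    """
    out = []
    i = 0
    while i < len(tokens):
        best = 1  # how many tokens to consume; 1 keeps the word as parsed
        candidate = tokens[i]
        for n in range(2, max_len + 1):
            if i + n > len(tokens):
                break
            added = tokens[i + n - 1]
            candidate += added
            if candidate in dictionary:
                best = n  # longest match so far
            elif not is_kana(added):
                # After a miss, only "take a chance" on a longer join
                # when the word just added is kana (like ni).
                break
        out.append("".join(tokens[i:i + best]))
        i += best
    return out


print(merge_tokens(["得", "に", "なる"], JMDICT))
```

Here 得に misses, but since に is kana the loop tries 得になる, which hits, so the three parsed words collapse into one. Inflected forms are not handled at all in this sketch.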

Extending string theory could be used, and a genetic algorithm could use readings instead of kanji, or vice versa, for each test mutant.  Some arbitrary number could be set as the max bin size.

Of course this will require that you dedicate a computer for perhaps a few days to see how many it can dig out, but once it's done, it's done, and you currently have no method of knowing how many TC lines are affected by this.




> Don't think that is something you can do perfectly by hand,

Probably not, although I also don't think it's something you can
do perfectly by computer. I have done some rough work in that
area but I think it's reached the point where the gains possible
are not worth the time/difficulty involved for computer assisted
processes.

Not perfectly, because some words will have inflections.  But you write a short set of instructions and walk away from it.  Where is the difficulty?


> and I would also think that the longer a compound is, the more
> likely it is to appear in a sentence in its dictionary/JMDICT form.

I'm not sure of your logic there. I could argue, for example, that
the longer a compound is the more likely it is that it will have a
kana/kanji/particle difference that will cause it not to match
a near equivalent in an example sentence. I suspect, though,
that most long Edict entries will just not exist in Example sentences.
There seems to be a rough rule "longer = more obscure". Although
some proverbs and sayings break that.

No Paul, longer implies NOUN. Compounds, the long ones especially, tend to be names of things, and not verbs.  Therefore the longer a word is, the more likely it is to have no inflected forms, and thus to appear in a sentence in the same form as its dictionary version.