Aligning German, Dutch, etc. dictionaries.
One of my many must-do items is to set up a system
to properly align the entries from the dictionaries that
get rolled into JMdict at build time (Wadoku, Warandict, etc.).
At present the alignment process is clunky and labour-intensive, and
doesn't happen very often.
I'm working up a utility that hopefully will lead to the alignment being
done automatically. It's looking good and might even be
operational in a week or so.
One of the by-products of this is that it reports when an entry
in Wadoku or Warandict, both of which allow multiple kanji forms
in an entry, matches on more than one JMdict entry. In some
cases this is because of an incorrect merge in the other dictionary,
e.g. Wadoku has 追い手 and 追い風 in the one entry, but in other cases
it's because JMdict has potential merges which we haven't picked up.
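The multi-match report described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual utility: the entry ids (J1, W1, etc.), the function names, and the dict-of-kanji-forms layout are all hypothetical, and the sample data is made up around the 追い手/追い風 example rather than taken from the real files.

```python
from collections import defaultdict


def build_kanji_index(jmdict_entries):
    """Map each kanji form to the set of JMdict entry ids that contain it."""
    index = defaultdict(set)
    for entry_id, kanji_forms in jmdict_entries.items():
        for form in kanji_forms:
            index[form].add(entry_id)
    return index


def find_multi_matches(other_entries, kanji_index):
    """Return {other_id: sorted JMdict ids} for entries of the other
    dictionary whose kanji forms hit more than one JMdict entry."""
    report = {}
    for other_id, kanji_forms in other_entries.items():
        matched = set()
        for form in kanji_forms:
            # .get() avoids adding empty keys to the defaultdict
            matched |= kanji_index.get(form, set())
        if len(matched) > 1:
            report[other_id] = sorted(matched)
    return report


# Hypothetical sample data: ids and groupings are invented for illustration.
jmdict = {"J1": ["追い手"], "J2": ["追い風"], "J3": ["風"]}
wadoku = {"W1": ["追い手", "追い風"], "W2": ["風"]}

report = find_multi_matches(wadoku, build_kanji_index(jmdict))
print(report)
```

A reported entry such as W1 only flags a candidate. Whether the fault is an incorrect merge in the other dictionary or a missed merge in JMdict still has to be decided by an editor.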
If you look at the corrections I've been doing over the last couple of days
you'll see quite a lot of merges. These have been coming from comparing
Wadoku with JMdict. As I'm only 5% of the way through Wadoku, I suspect
there are a lot more to come.
Another issue is aligning senses. I think this would be a very nice challenge
in NLP. I'm sure there are some techniques that could be deployed
to do a good first-cut alignment across languages, and I'll take it up
with some NLP academics who I know are always interested in finding
student projects with practical applications.
Cheers
Jim
--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University