
Re: [edict-jmdict] Updates added to JMdict/EDICT



If an all-encompassing solution is the goal, we would probably need to
look at the Unicode data, or use a program that does.
Characters composed of (Roman?) base letters and diacritics have a
field called "Canonical decomposition":

The ồ in your example:

U+006F LATIN SMALL LETTER O + U+0302 COMBINING CIRCUMFLEX ACCENT +
U+0300 COMBINING GRAVE ACCENT

I expect there is already some piece of software that does this,
though; it is not an uncommon problem. I once wrote a simple
substitution function (covering only Dutch vowels and such) that did
this, because we wanted to use people's first names as part of
auto-generated login credentials. People get slightly annoyed if they
have to enter diacritics for a website login, and most don't even know
how unless the character is a key on their keyboard.

~ Jeroen

2009/2/22 Darren Cook <darren@dcook.org>:
> This also came up when I tried to give Jim a recent Wikipedia interwiki
> list. There is a surprising amount of non-ASCII in English Wikipedia.
> Does someone have a ready-made algorithm (*) for stripping all
> diacritics that handles all of Unicode? We're not just talking French
> and German here, and not just talking vowels. E.g.
>  http://en.wikipedia.org/wiki/Vietnamese_%C4%91%E1%BB%93ng
>
> Darren