
Re: [edict-jmdict] Updates added to JMdict/EDICT



If an all-encompassing solution is the goal, we would probably need to
look at the Unicode data, or use a program that does.
Characters composed of (Roman?) base letters and diacritics have a
field called "Canonical decomposition":

The ồ in your example:

U+006F LATIN SMALL LETTER O + U+0302 COMBINING CIRCUMFLEX ACCENT +
U+0300 COMBINING GRAVE ACCENT

I expect there is already some piece of software that does this,
though; it is not an uncommon problem. I once wrote a simple
substitution function (covering only Dutch vowels and such) that did
this, because we wanted to use people's first names as part of
auto-generated login credentials. People get slightly annoyed if they
have to enter diacritics for a website login, and most don't even know
how unless the character is a key on their keyboard.

~ Jeroen

2009/2/22 Darren Cook <darren@dcook.org>:
> This also came up when I tried to give Jim a recent Wikipedia interwiki
> list. There is a surprising amount of non-ASCII in English Wikipedia.
> Does someone have a ready-made algorithm (*) for stripping all
> diacritics that handles all of Unicode? We're not just talking French
> and German here, and not just talking vowels. E.g.
>  http://en.wikipedia.org/wiki/Vietnamese_%C4%91%E1%BB%93ng
>
> Darren