Re: [edict-jmdict] Diacritics in lsrc



On 6 July 2010 16:30, Jeroen Hoek <mail@jeroenhoek.nl> wrote:

> Yeah, that should use the same Unicode data as a source.

The Unicode data changes from one version to the next, so if the exact
results matter to you, check which version of the data your tools use.
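
As a rough illustration of checking those details, here is a minimal
sketch in Python (chosen only for illustration; the helper name
unicode_details is made up for this example). It reports which version
of the Unicode data the interpreter was built with, and the
decomposition that data records for a given character:

```python
import unicodedata

def unicode_details(ch):
    """Return (Unicode data version, decomposition mapping) for one character."""
    return unicodedata.unidata_version, unicodedata.decomposition(ch)

version, mapping = unicode_details("\u304c")  # HIRAGANA LETTER GA
print(version)  # depends on the Python build
print(mapping)  # decomposition as space-separated hex code points
```

An empty mapping means the character has no decomposition in that
version of the data.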

On 6 July 2010 16:29, Jim Breen <jimbreen@gmail.com> wrote:

> I see the old Suns have Perl 5.6.1. If I get nowhere with the Python
> upgrade, I'll see if I can stir the Perl along.

I am very sorry, but the thing I posted has not even a tiny hope of
working on Perl 5.6. You would need to investigate some legacy modules
in that case.

On 6 July 2010 16:38, Jeroen Hoek <mail@jeroenhoek.nl> wrote:
> On 6 July 2010 09:29, Jim Breen <jimbreen@gmail.com> wrote:
>> Interesting, and a bit of a shock to see it also whips the にごり
>> marks off kana as well. (It left the dots in the tops of "i" though  8-)})
>
> Logical if you think about it; nigori are diacritic-ish.

That sample program is just blanket-applying the Unicode decomposition
of each character to strip out anything which can be stripped out.
There is lots of support for pattern matching to select or exclude
sets of characters if you need it.
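
To make the idea concrete, here is a minimal sketch in Python of the
same general technique. It is not the sample program from earlier in
the thread, and the names strip_marks and KANA_VOICING_MARKS are
invented for this example. It decomposes the text, drops combining
marks, and recomposes what is left; the optional keep set stands in for
the "select or exclude" step, here sparing the kana voicing marks:

```python
import unicodedata

# Combining dakuten and handakuten (assumed to be the marks worth keeping).
KANA_VOICING_MARKS = frozenset({"\u3099", "\u309a"})

def strip_marks(text, keep=frozenset()):
    """Decompose text, drop combining marks not in `keep`, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed
        # General categories Mn, Mc and Me are the combining marks.
        if ch in keep or not unicodedata.category(ch).startswith("M")
    )
    return unicodedata.normalize("NFC", kept)

print(strip_marks("caf\u00e9 \u304c\u304e"))                           # blanket stripping
print(strip_marks("caf\u00e9 \u304c\u304e", keep=KANA_VOICING_MARKS))  # kana left alone
```

The dot on "i" survives either way because a plain "i" has no
decomposition, so there is nothing to strip.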