
Re: Choosing a database backend



--- In edict-jmdict@yahoogroups.com, Jim Breen <Jim.Breen@...> wrote:
>
> That's a key point. However in my fiddlings with MySQL I just used
> its equivalent of "unsigned char" and it handled Japanese fine.

Some things to consider before jumping on the MySQL wagon:

From http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html:

"RFC 3629 describes encoding sequences that take from one to four
bytes. Currently, MySQL support for UTF-8 does not include four-byte
sequences. [...]"

This should not affect the jmdict data, I think, but it does affect
kanjidic. Then again, it's not necessary to store the actual UTF-8
representation when a Unicode codepoint will do. On the other hand,
it's handy to be able to store the correct UTF-8 representation and
not have to convert between UTF-8 and codepoints all the time in
the code.
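
To make the codepoint/UTF-8 trade-off concrete, here is a minimal
sketch of the conversion the application code would have to do if only
codepoints were stored. Python and the helper names are my own choice
for illustration; nothing here comes from kanjidic or MySQL. U+20B9F
is used only as an example of a character outside the Basic
Multilingual Plane.

```python
def codepoint_to_utf8(cp):
    """Return the UTF-8 byte sequence for a Unicode codepoint."""
    return chr(cp).encode("utf-8")


def utf8_to_codepoint(raw):
    """Return the codepoint encoded by a UTF-8 sequence of one character."""
    text = raw.decode("utf-8")
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return ord(text)


# Hypothetical example value: a CJK character outside the BMP.
example = 0x20B9F
encoded = codepoint_to_utf8(example)
print(len(encoded), hex(utf8_to_codepoint(encoded)))
```

Storing the integer sidesteps the four-byte limit, at the cost of
calling something like this on every read and write.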

A quick test tells me there are about 300 kanji in kanjidic which
have a 4-byte UTF-8 representation, and which are thus not
representable in a utf8 field in MySQL.
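
A check along these lines can be sketched in a few lines of Python.
This is not the test I actually ran, just one way to do it; the
function name and the sample string are made up, and reading the real
kanjidic file is left out. It relies on the fact that UTF-8 needs four
bytes exactly for codepoints U+10000 and above.

```python
def four_byte_chars(text):
    """Return the set of distinct characters that need 4 bytes in UTF-8."""
    # Codepoints U+10000 and above are the ones encoded as four bytes.
    return {ch for ch in text if ord(ch) >= 0x10000}


# Hypothetical sample, not real kanjidic data: two BMP kanji and one
# supplementary-plane character that appears twice.
sample = "\u4e9c\u611b\U00020B9F\U00020B9F"
found = four_byte_chars(sample)
print(len(found))
```

Running the same function over the decoded contents of the kanjidic
file would give the count of affected kanji.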

Treating the data as raw bits is of course possible, but then
all string handling at the database level is lost.
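
As an analogy for what gets lost, here is a small Python sketch
comparing a decoded string with its raw bytes. It only illustrates the
general byte-versus-character problem; it does not show MySQL's actual
behaviour, and the helper names are invented for the example.

```python
def lengths(text):
    """Return (number of characters, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


def is_valid_utf8(raw):
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "\u6f22\u5b57"          # two kanji, three bytes each in UTF-8
raw = word.encode("utf-8")

print(lengths(word))           # character count and byte count differ
print(is_valid_utf8(raw[:4]))  # a byte-level cut can split a character
```

With raw bytes, length, substring and comparison operations all work
on bytes rather than characters, so that logic has to move into the
application.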

-- 
David,