
RE: [edict-jmdict] Tighter rules for reading fields



Jim Breen wrote:
2009/2/28 Stuart McGraw <smcg4191@frii.com>:
> I just ran my parser on the current jmdict.xml file and it
> reports the following...
>
> The complaints about reb with '?' are because they are
> character 301C (WAVE DASH) rather than FF5E (FULLWIDTH TILDE)
> you gave above. (301C doesn't seem to have a representation
> in JIS which this email is in, hence the '?').

I'm not sure I can do a lot about that one. It's a known
round-trip problem between JIS and Unicode. See:
http://en.wikipedia.org/wiki/Unicode#Mapping_to_legacy_character_sets

Thanks for the reference.  I think I have a (still somewhat
fuzzy) understanding of the issue, and I can see that you want
to maintain JIS X 0208 compatibility in the reb/keb elements
to the maximum extent possible, and that using U+FF5E would
break that.  I was just confused because I saw a U+FF5E in
your email.
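
For anyone else following along, here is a minimal sketch of the
round-trip problem as I understand it.  It uses Python's built-in
"shift_jis" and "cp932" codecs purely as an illustration of two
different mapping tables; it is not the encoding path of this list
or of the jmdict build, and the helper names are made up.

```python
import unicodedata

WAVE_DASH = "\u301c"
FULLWIDTH_TILDE = "\uff5e"


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def can_encode(ch, codec):
    """Return True if the given codec can encode the character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# The same JIS X 0208 character (Shift_JIS bytes 0x81 0x60) comes out
# as a different Unicode code point depending on the mapping table.
raw = b"\x81\x60"
via_jis = raw.decode("shift_jis")  # JIS-style table
via_ms = raw.decode("cp932")       # Microsoft-style table

print(describe(via_jis))  # U+301C WAVE DASH
print(describe(via_ms))   # U+FF5E FULLWIDTH TILDE

# Text that contains U+FF5E cannot go back out through the JIS-style table.
print(can_encode(WAVE_DASH, "shift_jis"))        # True
print(can_encode(FULLWIDTH_TILDE, "shift_jis"))  # False
```

So a file that sticks to U+301C stays convertible with the JIS-style
table, which I take to be the point of keeping it in reb/keb.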

The only other thing I wondered about is whether it would make
sense to allow a small set of punctuation characters in the
reb -- in case jmdict gets more phrases or expressions.  (I
noticed that one of the previously reported warnings was for an
entry containing a comma, though it's gone now.)
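
To make the suggestion concrete, something along these lines is what
I have in mind.  This is only a sketch: the kana ranges and the
punctuation whitelist below are my own assumptions for illustration,
not the rules Jim posted and not what my parser currently does.

```python
import re

# Assumed kana ranges: hiragana, katakana, and the prolonged sound mark.
KANA = "\u3041-\u3096\u30a1-\u30fa\u30fc"
# Hypothetical punctuation whitelist: ideographic comma, middle dot, wave dash.
ALLOWED_PUNCT = "\u3001\u30fb\u301c"

REB_CHAR = re.compile(f"[{KANA}{ALLOWED_PUNCT}]")


def bad_reb_chars(text):
    """Return the characters in a reb string that are outside the allowed set."""
    return [ch for ch in text if not REB_CHAR.fullmatch(ch)]


print(bad_reb_chars("げんぜん"))    # []
print(bad_reb_chars("\uff5eたい"))  # ['～'] (U+FF5E is not in the whitelist)
```

An empty list means the reading passes; anything else can be
reported as a warning together with the entry's seq number.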

[...]
I think all those should be OK now. Please check them tomorrow
when the next version goes out.

Seq 1262730: Conflicting pri value 'nf41' in reading げんぜん, kanji 厳然
Seq 1274190: Conflicting pri value 'nf17' in reading こうぜん, kanji 公然
Seq 1376200: Conflicting pri value 'nf32' in reading せいぜん, kanji 整然
Seq 1475790: Conflicting pri value 'nf22' in reading ばくぜん, kanji 漠然
Seq 2405880: lsource has attribute(s)  but no text
 Not sure if the above is an error or intended.
Seq 2415870: keb text '?' not kanji.