RE: [edict-jmdict] uk tag question
Jim Breen wrote:
> [Stuart McGraw ([edict-jmdict] uk tag question) writes:]
>[...]
> I didn't know there were any "uk" tags in the re_inf. I just checked, and
> could only find one - in the new 樺太柳葉魚 entry. I moved it to the
> <misc> with all the others.
>
> Can you tell me where you saw some of the others?
Sorry, my mistake. There was just one. I was mistakenly
looking at 'uK' tags in r_inf, but I guess that raises the same
question for them. There are 9 uK tags in rinf and 5 in misc.
misc: 1225700,1812570,2077340,2082710,2123440
rinf: 2113750,2114610,2114630,2115990,2118810,
2119780,2121430,2121440,2128660
>[...]
> >> Also, do entry sequence numbers >9000000 indicate anything
> >> special about those entries?
>
> As Jean-Luc remarked, they are from the JIS212-containing "edicth" file.
> I have kept these apart for legacy software reasons - a lot of software
> out there was written to use EDICT in EUC or Shift_JIS, and hasn't catered
> for JIS212 characters (Shift_JIS can't encode them; EUC can but it's a
> 3-byte code and needs special handling).
>
> When we get the database firing, these entries can be rolled into the
> main file, although there needs to be a way of generating a subset
> that involves just JIS208 characters.
The 9x entries are in jmdict, so they are in the database now. So should
they not be exported to your master file when it's generated from the
database?
I was guessing that the seq numbers are probably relatively immutable,
since jmdict users may rely on them to identify the "same" words across
different versions of the jmdict file. So the 9x seq numbers could continue
to identify jis212 words as now. Can one distinguish between jis208 and
jis212 characters based on unicode code point (other than by using a
character lookup table)?
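
For what it's worth, here is a rough sketch of one way to make that distinction without maintaining a table by hand. As far as I know there is no simple code point range test, since the JIS208 and JIS212 kanji are interleaved within the same Unicode CJK block. The sketch leans on the EUC encoding mentioned above: it assumes Python's built-in "euc_jp" codec, which encodes JIS212 characters as 3 bytes starting with 0x8F. The codec still uses a lookup table internally, and the function names are made up for illustration.

```python
from typing import Optional


def jis_charset(ch: str) -> Optional[str]:
    """Guess which JIS set a single character belongs to via its EUC-JP form.

    Returns None when the character cannot be encoded in EUC-JP at all.
    """
    if len(ch) != 1:
        raise ValueError("expected a single character")
    try:
        encoded = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return None
    lead = encoded[0]
    if lead < 0x80:
        return "ASCII"            # single-byte ASCII range
    if lead == 0x8E:
        return "JIS X 0201 kana"  # SS2 prefix: half-width katakana
    if lead == 0x8F:
        return "JIS X 0212"       # SS3 prefix: 3-byte JIS212 code
    return "JIS X 0208"           # plain 2-byte code


def needs_jisx0212(text: str) -> bool:
    """True if any character in text falls in JIS212 (hypothetical helper)."""
    return any(jis_charset(ch) == "JIS X 0212" for ch in text)
```

Something like needs_jisx0212() applied to the kanji and reading text of each entry might be enough to generate the JIS208-only subset at export time, independent of the seq number range.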