
Re: [edict-jmdict] Changing entities to attribute values



On Jan 12, 2010, at 11:05 PM, Glenn Maynard wrote:
> At least that's my impression, because every XML parser I've used
> makes it very difficult to read text data and get unexpanded
> entities.

Yes. The hardest part of writing my scripts to import JMdict
and JMnedict into SQLite was getting the original entity codes
back out, rather than their expanded text. Only two things made
it possible: a Perl library that could dump the complete list of
entities defined in the file, and the fact that an entity is
always the entire content of an element from a known set.
That reduced the problem to a simple hash lookup.

(Of course, I stored the expanded definitions in a separate
table; I want them, just not 700,000 times.)
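
A rough sketch of that reverse lookup, in Python rather than the Perl
I actually used. The sample document, the regex for the entity
declarations, and the names entity_tables and codes_in are all made up
for illustration; it assumes every entity is declared in the internal
DTD subset with a double-quoted value and is the entire text of a
<pos> or <misc> element.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-entity sample; the real files declare many more.
SAMPLE = """<!DOCTYPE JMdict [
<!ENTITY prt "particle">
<!ENTITY col "colloquialism">
]>
<JMdict><entry><sense><pos>&prt;</pos><misc>&col;</misc></sense></entry></JMdict>
"""

# Assumes declarations of the form <!ENTITY name "value">.
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\S+)\s+"([^"]*)"\s*>')


def entity_tables(xml_text):
    """Return (code -> expansion, expansion -> code) dictionaries."""
    code_to_text = dict(ENTITY_DECL.findall(xml_text))
    text_to_code = {text: code for code, text in code_to_text.items()}
    return code_to_text, text_to_code


def codes_in(xml_text, tags=("pos", "misc")):
    """Map the parser's expanded text back to the short entity codes."""
    _, text_to_code = entity_tables(xml_text)
    root = ET.fromstring(xml_text)  # the parser expands &prt; and &col; here
    return [(el.tag, text_to_code[el.text])
            for el in root.iter() if el.tag in tags]
```

The code -> expansion dictionary is what would go into the separate
definitions table, stored once; the rows for each entry then only need
the short code.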

> I think the cleanest approach (which I didn't think of before) is:
>   <misc type="col"/>

You'd also want to put in a table of definitions, along the lines
of:
    <code><c_key>prt</c_key><c_value>particle</c_value></code>

That would allow you to include non-English expansions for the
codes.
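
To make that concrete, here is a minimal sketch of how a reader might
consume such a table. Only <code>, <c_key>, <c_value> and
<misc type="col"/> come from this thread; the <codes> wrapper, the
xml:lang attribute on <c_value>, the French sample value, and the
function names are assumptions for illustration, not a proposed format.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample document.
SAMPLE = """<JMdict>
<codes>
<code><c_key>prt</c_key><c_value>particle</c_value><c_value xml:lang="fr">particule</c_value></code>
<code><c_key>col</c_key><c_value>colloquialism</c_value></code>
</codes>
<entry><sense><misc type="col"/></sense></entry>
</JMdict>
"""


def load_codes(root):
    """Build {code: {language: expansion}} from the definitions table."""
    table = {}
    for code in root.iter("code"):
        key = code.findtext("c_key")
        for value in code.findall("c_value"):
            # Assumption: a <c_value> without xml:lang is English.
            lang = value.get(XML_LANG, "en")
            table.setdefault(key, {})[lang] = value.text
    return table


def expand(table, key, lang="en"):
    """Look up a code, falling back to English if the language is missing."""
    names = table[key]
    return names.get(lang, names["en"])


root = ET.fromstring(SAMPLE)
table = load_codes(root)
for misc in root.iter("misc"):
    print(misc.get("type"), "=", expand(table, misc.get("type"), "fr"))
```

With this layout the parser never has to expand anything: the code is
read straight from the attribute, and the expansion is an ordinary
dictionary lookup in whatever language is available.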

-j