Re: [edict-jmdict] pos/misc entities
On Thu, Jun 12, 2008 at 12:23:14PM +1000, Jim Breen wrote:
> 2008/6/11 gfxmaynard <g_yhoo@zewt.org>:
I guess Yahoo forgot my name. I liked the old list a lot better,
I ended up entering six captchas to sign up for this one (some of
which were in Arabic, I think).
> > I keep finding entities to be a pain for things like <pos>. Most
> > parsers really want to expand them, and make me jump through hoops to get the
> > actual entity, which is what I need for anything programmatic. Is
> > this really what entities are for? It seems like nodes would be
> > better: <pos>&v5u;</pos> becomes <vt/> or <pos><vt/></pos>.
>
> I think you mean <pos>&v5u;</pos> becomes <v5u/>?
Yeah (changed my example halfway through the mail), or <pos>...</pos>;
I think separating them would still be useful, and it'd keep the <sense>
spec from bloating.
> Hmmm. I list the entities in alphabetical order for easier scanning. I
> could group them by application in the DTD and docs. Would that help?
> They are unique - no POS entity is used in a <misc>, etc.
There are many benefits to having it not be an entity:
- XPath can't select on entity names (they're expanded by the time it
sees the data, I think), so you have to match the complete, expanded
text. With elements you can do "//entry[sense/pos/n]" instead of the
awkward "//entry[sense/pos/text()='noun (common) (futsuumeishi)']" to
select all entries with one or more noun senses.
- XSLT can't substitute based on entity names (same thing)
- expat makes you jump through hoops to not expand entities (I still
haven't got it to work, and it means hacking up the Python bindings,
which have no notion of receiving entities)
- specifying exactly what's a pos and what's a misc puts it in the
DTD where it probably belongs--specified precisely as part of the
data type like the rest of the spec, instead of ad hoc in the
documentation.
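
To make the first point concrete, here is a rough sketch using Python's xml.etree.ElementTree. The two tiny documents are made up for illustration (they are not the real JMdict file), and the element form <n/> is just the proposal from this thread, not something JMdict provides. It shows that the entity name is gone by the time the application sees the data, while an element can be matched by name.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the current style: the POS is an entity reference.
ENTITY_STYLE = """<!DOCTYPE JMdict [
<!ENTITY n "noun (common) (futsuumeishi)">
]>
<JMdict>
<entry><sense><pos>&n;</pos><gloss>cooking</gloss></sense></entry>
</JMdict>"""

# Hypothetical miniature of the proposed style: the POS is an empty element.
ELEMENT_STYLE = """<JMdict>
<entry><sense><pos><n/></pos><gloss>cooking</gloss></sense></entry>
</JMdict>"""

# With entities, the parser hands back only the expanded text; the name "n"
# is not available on the resulting tree.
entity_root = ET.fromstring(ENTITY_STYLE)
entity_pos_text = entity_root.find("entry/sense/pos").text

# With elements, the POS code is the tag name, so it can be matched directly.
element_root = ET.fromstring(ELEMENT_STYLE)
noun_entries = [
    entry
    for entry in element_root.iter("entry")
    if entry.find("sense/pos/n") is not None
]

print(entity_pos_text)
print(len(noun_entries))
```

ElementTree only supports a subset of XPath, which is why the filter is a list comprehension rather than a path predicate; a full XPath engine could use "//entry[sense/pos/n]" directly.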
I think entities are meant for data that you normally want to see
expanded, like &amp;. With JMdict, I've almost never wanted to
receive this data expanded (even if I'm displaying them formatted
somehow, I'm probably not using the particular expansions in the DTD).
A lot of people use this data. Am I the only person having trouble
with this?
> As for making the POS a node, I wonder. If I were starting again, I'd probably
> make them attributes:
>
> <gloss pos="n">cooking</gloss>
> <gloss pos="vs">to cook</gloss>
(Did you mean <sense pos="n"><gloss>?)
>
> Actually I could convert to that relatively easily, as only the XML
> generator would need to change.
An element can't repeat an attribute name, so a sense with several POS
values would need something like a delimited list packed into one
attribute. It's better to let XML do the parsing:
  <sense>
    <pos><adv/><adv-to/><vs/></pos>
    <misc><on-mim/></misc>
    <gloss>easily</gloss>
    <gloss>readily</gloss>
    <gloss>quickly</gloss>
  </sense>
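
As a sketch of what "let XML do the parsing" buys you, this is roughly how the structure above could be read with Python's xml.etree.ElementTree. The helper name child_tags and the attribute-based variant at the end are made up for comparison; neither is part of JMdict.

```python
import xml.etree.ElementTree as ET

# The proposed element-based sense from the example above.
SENSE_XML = """<sense>
  <pos><adv/><adv-to/><vs/></pos>
  <misc><on-mim/></misc>
  <gloss>easily</gloss>
  <gloss>readily</gloss>
  <gloss>quickly</gloss>
</sense>"""


def child_tags(sense, container):
    """Return the tag names of the children of <container>, in document order."""
    node = sense.find(container)
    return [] if node is None else [child.tag for child in node]


sense = ET.fromstring(SENSE_XML)
pos_tags = child_tags(sense, "pos")    # the parser has already split the codes
misc_tags = child_tags(sense, "misc")
glosses = [gloss.text for gloss in sense.findall("gloss")]

# Hypothetical attribute form for comparison: the application has to know the
# delimiter and split the value itself.
attr_sense = ET.fromstring('<sense pos="adv adv-to vs"/>')
attr_pos_tags = attr_sense.get("pos").split()

print(pos_tags, misc_tags, glosses)
```

Both forms end up with the same list of codes, but only the element form lets the DTD state which codes are allowed where.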
--
Glenn Maynard