RE: [edict-jmdict] database schema
Jim Breen wrote:
> [Stuart McGraw ([edict-jmdict] database schema) writes:]
[...]
> >> Combining keywords
> >> ------------------
> >> Different types of keywords used for the same element
> >> could be combined. For example "sense" uses separate
> >> child tables to keep lists of the "pos", "misc", and
> >> "field" keywords associated with the sense. By putting
> >> the pos, misc and field keywords in the same table
> >> (with non-overlapping pk's of course) a sense element
> >> would need only one table to represent all three types
> >> of keywords instead of three tables as now.
>
> Does this mean that the combined field would need to be parsed
> in some way?
No, it would just mean that when you query the database
for the pos, misc, and field data associated with a sense,
you would get back a single set of data containing all three
types. (As opposed to doing three queries, one for each
type of information.) But each item would be a separate
row in the resultset, no parsing required.
However, I think having separate tables is better. If these
tables are combined, the corresponding kw* tables also need
to be combined. But applications will need to distinguish
between the different types of keywords. For example, to
populate a part-of-speech dropdown box of a form, an application
will have to retrieve only the <pos> keywords. A combined
keyword table would need a "type" column added to support
this. But then those types would have to be defined in another
table. So I think in the end, just having separate <pos>,
<misc> and <field> tables is probably cleaner.
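To make the combined-table alternative concrete, here is a rough sketch
using Python's sqlite3 module. The table and column names (kwtype, kw,
sense_kw, abbr) and the sample keyword rows are made up for illustration
and are not the actual schema. It shows both points above: one query
returns all three keyword types as separate rows, and a "type" column
(itself defined in yet another table) is needed to pull out only the
<pos> keywords.

```python
import sqlite3

# Hypothetical combined-keyword schema; all names here are assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE kwtype (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE kw (id INTEGER PRIMARY KEY,
                     type INTEGER REFERENCES kwtype(id),
                     abbr TEXT);
    CREATE TABLE sense_kw (sense INTEGER, kw INTEGER REFERENCES kw(id));
    INSERT INTO kwtype VALUES (1, 'pos'), (2, 'misc'), (3, 'field');
    -- Non-overlapping primary keys across the three keyword types.
    INSERT INTO kw VALUES (1, 1, 'n'), (2, 1, 'vs'),
                          (101, 2, 'uk'), (201, 3, 'comp');
    INSERT INTO sense_kw VALUES (1, 1), (1, 101), (1, 201);
""")

def sense_keywords(sense):
    """One query for all keyword types; each keyword is its own row."""
    return con.execute(
        "SELECT t.name, k.abbr FROM sense_kw s "
        "JOIN kw k ON k.id = s.kw "
        "JOIN kwtype t ON t.id = k.type "
        "WHERE s.sense = ? ORDER BY k.id", (sense,)).fetchall()

def keywords_of_type(type_name):
    """What a part-of-speech dropdown would need: one type only."""
    rows = con.execute(
        "SELECT k.abbr FROM kw k JOIN kwtype t ON t.id = k.type "
        "WHERE t.name = ? ORDER BY k.id", (type_name,)).fetchall()
    return [r[0] for r in rows]

print(sense_keywords(1))
print(keywords_of_type("pos"))
```

With separate <pos>, <misc> and <field> tables the second function would
be a plain select on one table, with no kwtype table or join needed.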
I think all three of the alternatives I mentioned are
problematic. I mentioned them mainly because they occurred
to me as possibilities when thinking about a schema, and I
thought they might occur to others too.
> >> restr, stagk, stagr inversion
> >> -----------------------------
> >> In jmdict, the default assumption is that all combinations
> >> of kanji and readings are valid. If this assumption is not
> >> true, the restr tag info defines the subset of the full K x R
> >> cross product that is valid. Since the absence of restr
> >> means all are valid (rather than none, which would be more
> >> consistent) a separate re_nokanji flag is used to indicate
> >> the none condition. In the database, the restr table has
> >> an inverted meaning. It identifies the K x R subset that
> >> is invalid rather than valid. This also eliminates the need
> >> for a separate nokanji flag. The same applies analogously
> >> to the stagk and stagr tables.
>
> Any suggestion on how to improve those restrictions would be
> welcomed. They could just be narrative, however I want to drop
> the non-complying glosses when generating the EDICT version.
They probably shouldn't be narrative if anything other than a
human being will ever need to use the information they contain.
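As an illustration of the inversion described in the quoted text above,
here is a minimal Python sketch. The function names and the placeholder
kanji/reading labels (K1, R1, ...) are hypothetical, and this is not the
actual import code. It turns jmdict-style restr data (the valid subset,
plus a nokanji flag) into the set of invalid K x R pairs that the
database table would hold, and then recovers the valid pairs from it.

```python
from itertools import product

def invalid_pairs(kanji, readings, restr, nokanji=()):
    """Convert jmdict-style restrictions to the inverted (invalid) form.

    restr maps a reading to the kanji it is restricted to; a reading
    absent from restr is valid with every kanji. A reading listed in
    nokanji is valid with no kanji at all.
    """
    invalid = set()
    for r in readings:
        if r in nokanji:
            allowed = set()              # the "none" condition
        elif r in restr:
            allowed = set(restr[r])      # explicit valid subset
        else:
            allowed = set(kanji)         # default: all combinations
        invalid.update((k, r) for k in kanji if k not in allowed)
    return invalid

def valid_pairs(kanji, readings, invalid):
    """Full K x R cross product minus the stored invalid pairs."""
    return [(k, r) for k, r in product(kanji, readings)
            if (k, r) not in invalid]

kanji = ["K1", "K2"]
readings = ["R1", "R2", "R3"]
inv = invalid_pairs(kanji, readings, restr={"R2": ["K1"]}, nokanji={"R3"})
print(sorted(inv))
print(valid_pairs(kanji, readings, inv))
```

Note that in the inverted form no separate flag is needed: a nokanji
reading is simply one that is paired with every kanji in the invalid set.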
> Similarly the "no_kanji" tag is to suppress the kanji when making an
> EDICT entry for, say, a plant which has the katakana version as well.
> It has almost no role in the XML version.