
RE: [edict-jmdict] database schema



Jim Breen wrote:
> [Stuart McGraw ([edict-jmdict] database schema) writes:]
[...]
> >>     Combining keywords
> >>     ------------------
> >>     Different types of keywords used for the same element
> >>     could be combined.  For example "sense" uses separate
> >>     child tables to keep lists of the "pos", "misc", and
> >>     "field" keywords associated with the sense.  By putting
> >>     the pos, misc and field keywords in the same table
> >>     (with non-overlapping pk's of course) a sense element
> >>     would need only one table to represent all three types
> >>     of keywords instead of three tables as now.
> 
> Does this mean that the combined field would need to be parsed
> in some way?

No, it would just mean that when you query the database 
for the pos, misc, and field data associated with a sense, 
you would get back a single set of data containing all three 
types.  (As opposed to doing three queries, one for each 
type of information.)  But each item would be a separate 
row in the resultset, no parsing required.
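
To illustrate the single-query behavior, here is a minimal sketch 
using Python's sqlite3 module.  The table names (kw, sense_kw), the 
column names, the id ranges, and the sample keywords are all 
hypothetical, chosen only for this example; they are not the actual 
schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical combined keyword table.  The pk ranges do not overlap:
    -- 1-99 for pos, 100-199 for misc, 200 and up for field.
    CREATE TABLE kw (id INTEGER PRIMARY KEY, kw TEXT);
    -- One child table links a sense to all three kinds of keyword.
    CREATE TABLE sense_kw (sense INTEGER, kw INTEGER REFERENCES kw(id));
    INSERT INTO kw VALUES (1, 'n'), (2, 'vs'), (101, 'col'), (201, 'comp');
    INSERT INTO sense_kw VALUES (1, 1), (1, 2), (1, 101), (1, 201);
""")

def sense_keywords(conn, sense_id):
    """Return every keyword attached to a sense, one row per keyword."""
    rows = conn.execute(
        "SELECT k.kw FROM sense_kw s JOIN kw k ON k.id = s.kw "
        "WHERE s.sense = ? ORDER BY k.id",
        (sense_id,),
    )
    return [row[0] for row in rows]

keywords = sense_keywords(conn, 1)
```

One query returns the pos, misc, and field keywords together, and 
each keyword arrives as its own row, so nothing has to be split or 
parsed by the application.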

However, I think having separate tables is better.  If these
tables are combined, the corresponding kw* tables also need
to be combined.  But applications will need to distinguish 
between the different types of keywords.  For example, to 
populate a part-of-speech dropdown box on a form, an application 
will have to retrieve only the <pos> keywords.  A combined 
keyword table would need a "type" column added to support 
this.  But then those types would have to be defined in yet 
another table.  So I think in the end, just having separate 
<pos>, <misc> and <field> tables is probably cleaner.
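
The dropdown case can be sketched as follows, comparing the two 
designs side by side.  This is only an illustration: the names 
kwpos, kw, and kwtype, their columns, and the sample rows are 
assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Design A (hypothetical): a separate keyword table per type.
    CREATE TABLE kwpos (id INTEGER PRIMARY KEY, kw TEXT);
    INSERT INTO kwpos VALUES (1, 'n'), (2, 'vs');

    -- Design B (hypothetical): one combined table, which then needs a
    -- "type" column plus another table that defines the types.
    CREATE TABLE kwtype (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE kw (
        id   INTEGER PRIMARY KEY,
        type INTEGER REFERENCES kwtype(id),
        kw   TEXT
    );
    INSERT INTO kwtype VALUES (1, 'pos'), (2, 'misc'), (3, 'field');
    INSERT INTO kw VALUES (1, 1, 'n'), (2, 1, 'vs'),
                          (101, 2, 'col'), (201, 3, 'comp');
""")

# Design A: the dropdown is filled by reading one small table.
pos_separate = [r[0] for r in conn.execute(
    "SELECT kw FROM kwpos ORDER BY id")]

# Design B: the same list needs a join and a filter on the type.
pos_combined = [r[0] for r in conn.execute(
    "SELECT k.kw FROM kw k JOIN kwtype t ON t.id = k.type "
    "WHERE t.name = 'pos' ORDER BY k.id")]
```

Both queries produce the same list, but the combined design carries 
an extra column and an extra table just to recover a distinction 
that the separate tables give for free.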

I think all three of the alternatives I mentioned are 
problematic.  I mentioned them mainly because they occurred 
to me as possibilities when thinking about a schema, and I 
thought they might occur to others too.

> >> restr, stagk, stagr inversion
> >> -----------------------------
> >> In jmdict, the default assumption is that all combinations
> >> of kanji and readings are valid.  If this assumption is not 
> >> true, the restr tag info defines a subset of the full K x R 
> >> cross product that are valid.  Since the absence of restr 
> >> means all are valid (rather than none, which would be more 
> >> consistent) a separate re_nokanji flag is used to indicate 
> >> the none condition.  In the database, the restr table has
> >> an inverted meaning. It identifies the K x R subset that 
> >> is invalid rather than valid.  This also eliminates the need
> >> for a separate nokanji flag.  The same applies analogously 
> >> to the stagk and stagr tables.
> 
> Any suggestion on how to improve those restrictions would be
> welcomed. They could just be narrative, however I want to drop
> the non-complying glosses when generating the EDICT version.

They probably shouldn't be narrative if anything other than a 
human being will ever need to use the information they contain.
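
As an example of machine use, the inverted restr table described 
in the quoted passage above lets a program compute the valid 
kanji/reading pairs as the full K x R cross product minus the rows 
in restr.  This is a minimal sketch; the table names kanj and rdng, 
all column names, and the placeholder texts K1, R1, etc. are 
assumptions for illustration, not the actual schema or real 
dictionary data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kanj (entr INTEGER, kanj INTEGER, txt TEXT);
    CREATE TABLE rdng (entr INTEGER, rdng INTEGER, txt TEXT);
    -- Inverted meaning: each row names a (reading, kanji) pair that
    -- is NOT valid for the entry.
    CREATE TABLE restr (entr INTEGER, rdng INTEGER, kanj INTEGER);

    INSERT INTO kanj VALUES (1, 1, 'K1'), (1, 2, 'K2');
    INSERT INTO rdng VALUES (1, 1, 'R1'), (1, 2, 'R2'), (1, 3, 'R3');

    -- R2 applies only to K1, so the pair (R2, K2) is excluded.
    INSERT INTO restr VALUES (1, 2, 2);
    -- R3 goes with no kanji at all: exclude it for every kanji.
    -- No separate nokanji flag is needed.
    INSERT INTO restr VALUES (1, 3, 1), (1, 3, 2);
""")

def valid_pairs(conn, entr):
    """Return the K x R cross product minus the pairs listed in restr."""
    return conn.execute(
        """
        SELECT k.txt, r.txt
        FROM kanj k JOIN rdng r ON r.entr = k.entr
        WHERE k.entr = ?
          AND NOT EXISTS (
              SELECT 1 FROM restr x
              WHERE x.entr = k.entr AND x.kanj = k.kanj AND x.rdng = r.rdng)
        ORDER BY k.kanj, r.rdng
        """,
        (entr,),
    ).fetchall()

pairs = valid_pairs(conn, 1)
```

With an empty restr table the query returns the whole cross 
product, which matches the jmdict default that all combinations 
are valid.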

> Similarly the "no_kanji" tag is to suppress the kanji when making an
> EDICT entry for, say, a plant which has the katakana version as well.
> It has almost no role in the XML version.