[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
RE: [edict-jmdict] database schema
Philip Ronan wrote:
> For what it's worth, here's the SQL file I've been using to import
> JMdict. I'm no expert on these things, so I suggest you use this for
> reference only :-)
>
> Just a few points:
>
> 1. Priority values are calculated based on a lookup table and
> averaged together into an integer value (ke_priority) when the data
> is loaded. I figured this would allow the results of database queries
> to be sorted more quickly.
Interesting idea. Did you try doing the calculation on the fly
and see performance problems? The downside is the need to maintain
that number when the k_pri values change. Are you thinking of using
a trigger for this?
> 2. It's better to postpone the "CREATE INDEX" statements until after
> the database has been filled, otherwise it will take longer to load
> the data. When creating indexes for text fields, you might want to
> restrict the number of characters used to form the index. For
> example, "CREATE INDEX kana_txt ON kana (txt(10));" will create an
> index based on the first 10 characters of items in the txt column.
> The default size of this field is 2048 characters (up to 6 KB of
> Unicode text), so the indexes could end up being rather large.
I knew about not creating (or dropping) indexes before a bulk
load but didn't do it out of laziness. I didn't know about the
text column index size until this morning when I was looking in
the manual to verify my answer to Ben Bullock's email. Thanks.
> 3. A table called tmpref is used to resolve cross-references
> resulting from re_restr, ke_restr and xref elements. This picked out
> over 40 errors in the JMdict file the other day, so foreign key
> constraints would certainly help matters. Incidentally, these cross-
> references can be resolved much faster if indexes are added to the
> r_ele and k_ele tables first.
The first loader program I wrote (years ago) did that but when
I wrote the load_jmdict script I posted it seemed easier just
to keep the stuff in memory, and resolve them after the program
finishes loading the main entries.
Also, as a general observation, not a comment on what you said
about the indexes....
while a loader program will be used frequently while evaluating
and developing the database, after the database is in operation,
the loader script will be only of historical interest. (There
are better ways of backing up the database than relying on re-
loading from a assorted text and xml files.) So advanced load
optimizations will probably have a short lifetime.
> 4. The CHARSET statements are added to each table because I'm working
> on a shared server where I don't have permission to set global data
> for the whole database. (At least, I don't *think* I do...). I've had
> no problems with Unicode on this server (MySQL version 4.1.11).
>
> 5. I have to admit I'm rather confused by all the kw text fields in
> Stuart's SQL file. This is probably a dumb question, but if these all
> contain keywords, then shouldn't they be in a single table?
I don't think it's dumb because that is exactly what I did
the last time. But it turned out to have few advantages
and some significant disadvantages.
Plusses and minuses...
+ The visual noise when listing tables is reduced.
+ Fewer "things" to maintain in the schema.
+ Easier what kw table to use in a join (since there
are fewer or only one)
o You would think that it might reduce the number of joins
but see below.
- (the most important)
The different kinds of keywords really are different things,
they are not arbitrary text tags that can be applied to any
elements (even though the practicalities of an xml representation
looses the distinction). One cannot legitimately apply an "adj-na"
tag to a kanji element, or an "oK" tag to a sense but a single
"entities" table allows that.
- I don't see any distinction between "entities" (which you have
all in one table) and other keywords like the language and
dialect (which you have in separate tables). The difference
between them seems to be only in a choice Jim made for their
representation in xml. Had he defined entities for the language
and dialect keywords, would they belong in the entities table?
What would have changed about the information they convey?
I have wondered (as I mentioned in my original posting)
whether it would be worthwhile combining the keywords
for the same elements. For example combining the misc,
pos, and field, keywords since they all belong to "sense".
This would not be to reduce the number of keyword tables
(because you can only do this in a few places you would
only eliminate a few tables) but to allow combining the
"pos", "misc", and "fld" tables into one. Since these
tables are often joined with the sense table it would
mean one join instead of up to three. But....
I find the most frequent need to join these tables is
when doing searches. The search query obviously has to
contain every table that holds one of the data you are
searching for. But how often does one do a search that
involves all three of a "pos", a "misc" tag, and a "field"?
And one doesn't even do searches by manually writing a
query very often; there will be application(s) like Pawel's
web page that will take search criteria you enter in a form,
build the right sql. and run it. So the number of tables
involved is not very important (unless it has performance
ramifications but that seems unlikely, and needs to be
determined empirically, not by guessing).
One other point... I have found that one seldom needs
to use the kw tables in joins. It is far easier to have
applications read the kw tables at startup, and do the
keyword lookup themselves. So queries they run will
not need to have joins to any of the keyword tables.
As for people, looking at things with Mysql's query
browser and such, the easiest thing to do is to define
a view For example, one would define
CREATE VIEW pos_kw AS
SELECT p.sens, w.kw FROM pos p JOIN kwpos w ON p.kw=w.id;
Now you just use "pos_kw" instead of "pos" when looking
at part-of-speech data for an entry, on in joins, etc, and you
are presented with the keyword string instead of an obscure
number, without having to explicitly join the kw table.
All-in-all, as I see it, I think that having separate keyword
tables offers more pluses than minuses and that most of
the minuses turn out to be not so important. If seeing a
lot of tables in the tables list is annoying, we could name
them prefixed with "z" so that that they would sit innocuously
at the bottom of the list.