[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [edict-jmdict] database schema

To: <edict-jmdict@***************>
Subject: RE: [edict-jmdict] database schema
From: "Stuart McGraw" <smcg4191@********>
Date: Thu, 21 Sep 2006 19:16:31 -0600
Importance: Normal

Philip Ronan wrote:
> For what it's worth, here's the SQL file I've been using to import  
> JMdict. I'm no expert on these things, so I suggest you use this for  
> reference only :-)
> 
> Just a few points:
> 
> 1. Priority values are calculated based on a lookup table and  
> averaged together into an integer value (ke_priority) when the data  
> is loaded. I figured this would allow the results of database queries  
> to be sorted more quickly.

Interesting idea.  Did you try doing the calculation on the fly
and see performance problems?  The downside is the need to maintain 
that number when the k_pri values change.  Are you thinking of using 
a trigger for this?  

> 2. It's better to postpone the "CREATE INDEX" statements until after  
> the database has been filled, otherwise it will take longer to load  
> the data. When creating indexes for text fields, you might want to  
> restrict the number of characters used to form the index. For  
> example, "CREATE INDEX kana_txt ON kana (txt(10));" will create an  
> index based on the first 10 characters of items in the txt column.  
> The default size of this field is 2048 characters (up to 6 KB of  
> Unicode text), so the indexes could end up being rather large.

I knew about not creating (or dropping) indexes before a bulk
load but didn't do it out of laziness.  I didn't know about the 
text column index size until this morning when I was looking in 
the manual to verify my answer to Ben Bullock's email.  Thanks.

> 3. A table called tmpref is used to resolve cross-references  
> resulting from re_restr, ke_restr and xref elements. This picked out  
> over 40 errors in the JMdict file the other day, so foreign key  
> constraints would certainly help matters. Incidentally, these cross- 
> references can be resolved much faster if indexes are added to the  
> r_ele and k_ele tables first.

The first loader program I wrote (years ago) did that but when 
I wrote the load_jmdict script I posted it seemed easier just 
to keep the stuff in memory, and resolve them after the program 
finishes loading the main entries.

Also, as a general observation, not a comment on what you said
about the indexes....
while a loader program will be used frequently while evaluating
and developing the database, after the database is in operation, 
the loader script will be only of historical interest.  (There 
are better ways of backing up the database than relying on re-
loading from a assorted text and xml files.)  So advanced load 
optimizations will probably have a short lifetime.

> 4. The CHARSET statements are added to each table because I'm working  
> on a shared server where I don't have permission to set global data  
> for the whole database. (At least, I don't *think* I do...). I've had  
> no problems with Unicode on this server (MySQL version 4.1.11).
> 
> 5. I have to admit I'm rather confused by all the kw text fields in  
> Stuart's SQL file. This is probably a dumb question, but if these all  
> contain keywords, then shouldn't they be in a single table?

I don't think it's dumb because that is exactly what I did 
the last time.  But it turned out to have few advantages 
and some significant disadvantages.  
Plusses and minuses...

+ The visual noise when listing tables is reduced.

+ Fewer "things" to maintain in the schema.

+ Easier what kw table to use in a join (since there 
   are fewer or only one)

o You would think that it might reduce the number of joins 
  but see below.

- (the most important)
  The different kinds of keywords really are different things, 
  they are not arbitrary text tags that can be applied to any 
  elements (even though the practicalities of an xml representation
  looses the distinction).  One cannot legitimately apply an "adj-na" 
  tag to a kanji element, or an "oK" tag to a sense but a single 
  "entities" table allows that.

- I don't see any distinction between "entities" (which you have 
  all in one table) and other keywords like the language and 
  dialect (which you have in separate tables).   The difference
  between them seems to be only in a choice Jim made for their
  representation in xml.  Had he defined entities for the language
  and dialect keywords, would they belong in the entities table?
  What would have changed about the information they convey?

I have wondered (as I mentioned in my original posting)
whether it would be worthwhile combining the keywords 
for the same elements.  For example combining the misc, 
pos, and field, keywords since they all belong to "sense".   
This would not be to reduce the number of keyword tables 
(because you can only do this in a few places you would 
only eliminate a few tables) but to allow combining the 
"pos", "misc", and "fld" tables into one.  Since these 
tables are often joined with the sense table it would 
mean one join instead of up to three.  But....

I find the most frequent need to join these tables is 
when doing searches.  The search query obviously has to 
contain every table that holds one of the data you are 
searching for.  But how often does one do a search that 
involves all three of a "pos", a "misc" tag, and a "field"?  
And one doesn't even do searches by manually writing a 
query very often; there will be application(s) like Pawel's 
web page that will take search criteria you enter in a form, 
build the right sql. and run it. So the number of tables 
involved is not very important (unless it has performance 
ramifications but that seems unlikely, and needs to be 
determined empirically, not by guessing).

One other point... I have found that one seldom needs
to use the kw tables in joins.  It is far easier to have 
applications read the kw tables at startup, and do the 
keyword lookup themselves.  So queries they run will 
not need to have joins to any of the keyword tables.

As for people, looking at things with Mysql's query 
browser and such, the easiest thing to do is to define 
a view   For example, one would define

  CREATE VIEW pos_kw AS 
    SELECT p.sens, w.kw FROM pos p JOIN kwpos w ON p.kw=w.id;

Now you just use "pos_kw" instead of "pos" when looking 
at part-of-speech data for an entry, on in joins, etc, and you 
are presented with the keyword string instead of an obscure
number, without having to explicitly join the kw table.

All-in-all, as I see it, I think that having separate keyword 
tables offers more pluses than minuses and that most of 
the minuses turn out to be not so important.  If seeing a 
lot of tables in the tables list is annoying, we could name 
them prefixed with "z" so that that they would sit innocuously 
at the bottom of the list.

Follow-Ups:
- Re: [edict-jmdict] database schema
  - From: Philip Ronan <mail@************************>

References:
- Re: [edict-jmdict] database schema
  - From: Philip Ronan <mail@************************>

Prev by Date: Re: [edict-jmdict] database schema
Next by Date: RE: [edict-jmdict] database schema
Previous by thread: Re: [edict-jmdict] database schema
Next by thread: Re: [edict-jmdict] database schema
Index(es):
- Date
- Thread