[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [edict-jmdict] database schema



Ben Bullock wrote:
> Thanks for posting your example, it was educational, but it's a little
> hard to understand. I wonder if you could add some examples of how
> actual data fits into the database?

Here is an example of the database tables and rows that represent 
a specific entry and how they are related....

You may also want to refer to the schema diagram "schema.png"
that was in the zipfile I posted.

If you use a variable width font to view this, it will look like crap.

Here is the xml for entry 136533, 親父【おやじ】

<ent_seq>1365330</ent_seq>
    <k_ele>
        <keb>親父</keb>
        <ke_pri>news2</ke_pri>
        <ke_pri>nf25</ke_pri>
    </k_ele>
    <k_ele>
        <keb>親爺</keb>
    </k_ele>
    <k_ele>
        <keb>老爺</keb>
    </k_ele>
    <r_ele>
        <reb>おやじ</reb>
        <re_inf>&gikun;</re_inf>
        <re_pri>news2</re_pri>
        <re_pri>nf25</re_pri>
    </r_ele>
    <r_ele>
        <reb>ろうや</reb>
        <re_restr>老爺</re_restr>
    </r_ele>
    <r_ele>
        <reb>しんぷ</reb>
    </r_ele>
    <sense>
        <pos>&n;</pos>
        <gloss>one's father</gloss>
        <gloss>old man</gloss>
        <gloss>one's boss</gloss>
        <gloss g_lang="de">Vater</gloss>
        <gloss g_lang="de">(als Bez. von Männern für ihren Vater verwendet)</gloss>
    </sense>
    <sense>
        <gloss g_lang="de">alter Mann</gloss>
    </sense>
    <sense>
        <gloss g_lang="de">Boss</gloss>
        <gloss g_lang="de">Chef</gloss>
        <gloss g_lang="de">Meister</gloss>
        <gloss g_lang="de">(Bez. von Untergebenen für ihren Chef)</gloss>
        <gloss g_lang="fr">son père</gloss>
        <gloss g_lang="fr">vieil homme</gloss>
        <gloss g_lang="fr">son patron</gloss>
        <gloss g_lang="fr">(JF2)</gloss>
    </sense>
</entry>


In the database and entry is represented by a row in the "entr" table.

mysql> select * from entr where seq=1365330;
+-------+-----+---------+--------+
| id    | src | seq     | note   |
+-------+-----+---------+--------+
| 35524 |   1 | 1365330 | 684937 |
+-------+-----+---------+--------+

id:	Primary key, was auto-assigned when this row was loaded,  
	  Its only purpose is to uniquely identify this record.
src:	Is a reference to a row in the kwsrc table.  The value
	  1 is used for jmdict entries, 2 jnedict entries, 3 for 
	  entries from the examples files, etc.
seq:	The jmdict sequence number.
note:   Holds aribrarty text about this entry.  For convenience
	  and illustrations, the load_jmdict.py script puts the 
	  entry's jmdict file line number in here.

Here are all the rows currently in the kwsrc table.  All the kw*
table have the same structure show I will show onlr releveant 
rows from them from now on.

Table kwsrc:
+----+----------+--------------------------------------+
| id | kw       | descr                                |
+----+----------+--------------------------------------+
|  1 | jmdict   | Entry from the JMdict file           |
|  2 | jmnedict | Entry from the JMnedict (names) file |
|  3 | examples | Entry from the Examples_s file       |
+----+----------+--------------------------------------+

Each entry has an arbitrary number of reading, kanji and sense
items, each represented by rows in the tables "kana", "kanj",
"sens" respectively.  

Entries also have and arbitrary number of dialects and languages
associated with them in tables "dial" and "lang".
Finally they have anarbitrary number of audit records in the 
"audit" table.  This entry doesn't have any so I will make 
some up a dialect entry:

mysql> select * from dial where entr=57761;
+--------+----+
|   entr | kw |    # Faked results
+--------+----+
|  35524 |  2 |
+--------+----+
2 corresponds to the the entry for Kansai-ben in kwdial:

mysql> select * from kwdial where id=2;
+----+-----+------------+
| id | kw  | descr      |
+----+-----+------------+
|  2 | ksb | Kansai-ben |
+----+-----+------------+

Here are the rows in the "kana" table which holds the readings.
mysql> select * from kana where entr=35524;
+-------+-------+-----+--------+
| id    | entr  | ord | txt    |
+-------+-------+-----+--------+
| 36988 | 35524 |  10 | おやじ      |
| 36989 | 35524 |  20 | ろうや     |
| 36990 | 35524 |  30 | しんぷ       |
+-------+-------+-----+--------+
"entr" refers back (is a foreign key) to the row in entr to 
which this reading belongs.
"ord" is set of numbers, increasing within the readings
for each entry, that the order the readings were encountered 
the the xml.  Recall that relational tables are by definition
unorderd, and the only reliable way to order them is using
an ORDER BY clause in a query.  "ord" provides something to
order by, to retrieve the records in a controlled order.


The rinf table holds lists of r_inf values.  Here is 
what we have for this entry...
mysql> select * from rinf where kana in (36988,36989,36990);
+-------+----+
| kana  | kw |
+-------+----+
| 36988 |  2 |
+-------+----+

r_inf kw 2 is:
mysql> select * from kwrinf where id=2;
+----+-------+-------------------------+
| id | kw    | descr                   |
+----+-------+-------------------------+
|  2 | gikun | gikun (meaning) reading |
+----+-------+-------------------------+

r_pri values are represented in the "rfreq" table
mysql> select * from rfreq where kana in (36988,36989,36990);
+-------+----+-------+
| kana  | kw | value |
+-------+----+-------+
| 36988 |  5 |    25 |
+-------+----+-------+

This is similar to the other tables containing keyword
lists except "rfreq" and "kfreq" contain and extra column.
If we look up kw 5 in the kwfreq table we see:

mysql> select * from kwfreq where id=5;
+----+----+-------+
| id | kw | descr |
+----+----+-------+
|  5 | nf | NULL  |
+----+----+-------+

This and the value 25 in the rfreq row tell us that this 
reading has a "nf25" tag in the xml.

Note that in the rfreq results, there was no row for "news2".
"news2" is redundent information (it is present when the nf
value is beteween 25 and 48.  If you really need to see "news2"
in a query (instead of deriving it in your application)
you can do something like:

mysql> select concat("news",1+(value>24)) from rfreq where kw=5 and kana in (36988,36989,36990);
+-----------------------------+
| concat("news",1+(value>24)) |
+-----------------------------+
| news2                       |
+-----------------------------+

Now for the kanji:
mysql> select * from kanj where entr=35524;
+-------+-------+-----+------+
| id    | entr  | ord | txt  |
+-------+-------+-----+------+
| 23995 | 35524 |  10 | 親父    |
| 23996 | 35524 |  20 | 親爺    |
| 23997 | 35524 |  30 | 老爺    |
+-------+-------+-----+------+

There are kinf and kfreq tables that work exactly like
rinf and rfreq so I won'y repeat it hear.  But note that 
though table "rinf"s kw column refers to table "rinf" and 
table "kinf"s kw column refers to table "kwinf", the kw 
columns of both table "rfreq" and "kfreq" refer to that 
same table "kwfreq".

This entry has an restr element in jmdict.  This info is
in the database "restr" table:

mysql> select * from restr where kana in (36988,36989,36990);
+-------+-------+
| kana  | kanj  |
+-------+-------+
| 36989 | 23995 |
| 36989 | 23996 |
+-------+-------+
(exactly the same result set should be obtained from 
"select * from restr where kanj in (23995,23996,23997)")
We have two rows because the database stores the *invalid*
pairs of kanji and readings rather than the *valid* ones
as in the xml.  I will defer the rational for this to
another post.

Now lets look at the sense and gloss entries:

mysql> select * from sens where entr=35524;
+-------+-------+-----+------+
| id    | entr  | ord | note |
+-------+-------+-----+------+
| 43160 | 35524 |  10 |      |
| 43161 | 35524 |  20 |      |
| 43162 | 35524 |  30 |      |
+-------+-------+-----+------+
The structure similar to the kana and kanj tables.

There are three child tables that hold part-of-speech ("pos")
miscellaneous ("misc") and application field ("fld") infomation.
This entry has only pos data.

mysql> select * from pos where sens in (43160,43161,43162);
+-------+----+
| sens  | kw |
+-------+----+
| 43160 | 17 |
+-------+----+

kw 17 in the kwpos table is:
+----+----+------------------------------+
| id | kw | descr                        |
+----+----+------------------------------+
| 17 | n  | noun (common) (futsuumeishi) |
+----+----+------------------------------+

This entry has no misc or field data.
The "gloss" table is also a child table of "sens"

mysql> select * from gloss where sens in (43160,43161,43162);
+--------+-------+-----+------+----------------------------------------------------+------+
| id     | sens  | ord | lang | txt                                                | note |
+--------+-------+-----+------+----------------------------------------------------+------+
| 212898 | 43160 |  10 |    1 | one's father                                       | NULL |
| 212899 | 43160 |  20 |    1 | old man                                            | NULL |
| 212900 | 43160 |  30 |    1 | one's boss                                         | NULL |
| 212901 | 43160 |  40 |    4 | Vater                                              | NULL |
| 212902 | 43160 |  50 |    4 | (als Bez. von Mテ、nnern fテシr ihren Vater verwendet)   | NULL |
]                                                                                         |
| 212903 | 43161 |  10 |    4 | alter Mann                                         | NULL |
|                                                                                         |
| 212904 | 43162 |  10 |    4 | Boss                                               | NULL |
| 212905 | 43162 |  20 |    4 | Chef                                               | NULL |
| 212906 | 43162 |  30 |    4 | Meister                                            | NULL |
| 212907 | 43162 |  40 |    4 | (Bez. von Untergebenen fテシr ihren Chef)             | NULL |
| 212908 | 43162 |  50 |    8 | son pティre                                           | NULL |
| 212909 | 43162 |  60 |    8 | vieil homme                                        | NULL |
| 212910 | 43162 |  70 |    8 | son patron                                         | NULL |
| 212911 | 43162 |  80 |    8 | (JF2)                                              | NULL |
+--------+-------+-----+------+----------------------------------------------------+------+
I manually edited the about output to show the glosses'
correspondence with each of the three senses more clearly.

Nothing new here.  Column lang references table "kwlang"
like entr.lang does.  The lone gloss in sense 43161 has a 
lang value of 4:

mysql> select * from kwlang where id=4;
+----+----+------------------+
| id | kw | descr            |
+----+----+------------------+
|  4 | de | German (Deutsch) |
+----+----+------------------+

Each entry can have a list of audit entries in the "audit" table.
This entry doesn't have any so I will make up a 
pretend one:

mysql> select * from audit where entr=35524;
+------+-------+-----+---------------------+---------------+------+
| id   | entr  | typ | dt                  | who           | note |
+------+-------+-----+---------------------+---------------+------+
| 1355 | 35524 |   1 | 2003-12-03 00:00:00 | JMdict loader | NULL |
+------+-------+-----+---------------------+---------------+------+

Like all the other keyword references, we can lookup typ=1:

mysql> select * from kwsrc where id=1;
+----+----+-------+
| id | kw | descr |
+----+----+-------+
|  1 | a  | Add   |
+----+----+-------+

Finally there are xrefs and antonyms:
An entry with an xref is seq 1004620 (こにちわ).

mysql> select s.id from entr e join sens s on s.entr=e.id where e.seq=1004620;
+-----+
| id  |
+-----+
| 611 |
+-----+

mysql> select * from xref where sens=611;
+------+-------+-----+--------+
| sens | xref  | typ | note   |
+------+-------+-----+--------+
|  611 | 33882 |   3 | JMdict |
+------+-------+-----+--------+

Column xref is the sens.id of the target of the cross-reference.
We will see will see what entry it belongs to...

mysql> select e.* from entr e join sens s on s.entr=e.id where s.id=33882;
+-------+-----+---------+--------+
| id    | src | seq     | note   |
+-------+-----+---------+--------+
| 28017 |   1 | 1289400 | 520530 |
+-------+-----+---------+--------+

Using the showentr.py script to look it up we see:

seq: 1289400, lang=(), dial=(), id=28017

今日は/ichi1
こんにちは

  1. [int][uk]
     en: hello
     en: good day (daytime greeting)
     nl: doei
     fr: bonjour (dans la journee)
     ru: добрый день
     ru: здравствуйте (дневное приветствие)
     de: Guten Tag!
     [3 xrefs to this sense.]


Pheww!  This took a lot longer to write than I thought it would!


--------
As an aside, if anyone is still doubful about the advanatages 
of having jmdict in a database, it took only a couple minutes 
to write the query below that finds all entries that have: a 
pos value, a kfreq value, a kanji/kana restriction, and a rinf 
value.  How long would that take with only jmdict.xml to work 
from, I wonder?

select e.* from ((((((entr e 
  join sens s on s.entr=e.id)
  join pos p on p.sens=s.id)
  join kanj j on j.entr=e.id)
  join kfreq f on f.kanj=j.id)
  join restr r on r.kanj=j.id)
  join kana k on k.entr=e.id)