Modifying JMdictDB tags

This document describes two alternative ways to change the tags recognized by the JMdictDB system:

Local changes made at a particular site (section 1).
Changes made to the JMdictDB software for distribution to all users (section 2).

1. Local changes to the tags

During normal operation of the JMdictDB system the tags known to the system (accepted when submitting edits, used in XML output, etc.) are stored in a set of database tables with names prefixed with "kw" and collectively referred to as the "kw*" (keyword) tables or "tag tables". These tables are loaded when the JMdictDB system is installed and are updated from time to time via the normal JMdictDB upgrade process.

A site can make localized custom changes to these tags by changing the data in the tables. Changes can be made by direct SQL manipulation of the kw* tables' contents.

Not all tags can be changed simply by changing the values in the database tables. References to some tags are hardwired in the JMdictDB code and that code must also be changed when the tag’s 'id' or 'kw' value is changed or the entire row is deleted. See Appendix A for a list of those tags.

1.1. Contents of the kw* tables

There are seven database tables that contain editable tag information.

Table    XML element                        Applies to   Use
kwdial   <dial>                             sense        dialect
kwginf                                      gloss        gloss type
kwfld    <field>                            sense        field of application
kwkinf   <ke_inf>                           kanji        kanji information
kwmisc   <misc> (<name_type> in JMnedict)   sense        miscellaneous information
kwpos    <pos>                              sense        part of speech
kwrinf   <re_inf>                           reading      reading information

Please see the comments in the XML DTD for the XML element shown for more details on the purpose and use of each tag.

There are a number of other kw* tables that contain tag or similar information but these may require coordinated code changes when their contents are changed: kwcinf, kwfreq, kwginf, kwgrp, kwlang, kwsrc, kwsrct, kwstat, kwxref. Changing the contents of these tables is not covered in this document.

Each kw* table has three or four fields:

id -- A numeric id number. Must be unique for each row in the table. When choosing id numbers for new tags you should start with some large value, say 500, to avoid conflicts with new tags introduced in JMdictDB software upgrades.
kw -- The tag name; a short abbreviation that must also be unique for each row in the table. If the tag will occur as an entity in an XML DTD, this will also be the entity name unless overridden in the "ents" field.
descr -- A longer (but one line) description of the tag. If the tag will occur as an entity in an XML DTD, this will also be the entity value unless overridden in the "ents" field.
ents -- A JSON string that controls how and in which XML files the tag will appear as an entity. This field is present only in the kwdial, kwfld, kwkinf, kwmisc, kwpos and kwrinf tables. The contents of this field are described in Appendix C.

1.2. Making the changes

Always test your changes on a copy of your production database before applying them to the production database. You’ll want to verify that you can edit and enter new entries that use the tag. And you’ll want to confirm that XML (including the DTD entity list) is properly generated.

Always make a backup of your production database before applying your changes to it.

To make a copy of your production database see section 4.2.1 of the Operations Guide.

Appendix B describes the SQL statements needed to make various kinds of changes to the tags. Use that as a guide to write the SQL statements you need to make the changes you want.

It is recommended that you put the statements you will use into a file which you can then run against any database. The file should start with the two lines:

\set ON_ERROR_STOP
BEGIN;

and end with:

COMMIT;

Your SQL statements go between those. You can then apply the changes to a database with:

$ psql -d my_test_database -f my_script_file.sql

where "my_test_database" is the database to apply them to, and "my_script_file.sql" is the file you created with the SQL statements to effect the change.

If there is a problem with any of the statements, the script file will abort without applying any of the changes; you can correct it and run it again.
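
As an illustration of the script layout described above, here is a minimal Python sketch that wraps a list of SQL statements in the recommended opening and closing lines. The helper name wrap_sql_script and the sample tag (id 500, "xxb") are hypothetical and are not part of JMdictDB; the output would be saved to a file such as my_script_file.sql and run with psql as shown above.

```python
def wrap_sql_script(statements):
    """Return psql script text that applies all of the statements or none."""
    lines = ["\\set ON_ERROR_STOP", "BEGIN;"]   # stop on first error, open transaction
    lines.extend(statements)                    # your own statements go in the middle
    lines.append("COMMIT;")                     # only reached if nothing failed
    return "\n".join(lines) + "\n"

# Hypothetical local change: add a dialect tag using a high local id number.
script = wrap_sql_script([
    "INSERT INTO kwdial VALUES(500,'xxb','Example dialect',NULL);",
])
print(script, end="")
```

Because every statement runs inside a single transaction, a failure part way through leaves the database as it was before the script started.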

1.3. Conflicts

The JMdictDB project tracks the tags used in the JMdict and JMnedict XML files distributed by edrdg.org and updates the JMdictDB software to match when they change. Consequently it is possible that a JMdictDB database update could fail to apply successfully because a tag id number or name conflicts with a locally added tag.

In this case you will need to change the locally modified tag to avoid the conflict. The SQL for changing a tag name is given in Appendix B. A conflict between id numbers is unlikely if you followed the advice in section 1.1 and Appendix B to start local id numbers at some high value like 500.

However if an id number conflict does occur, you can change the local id number using a SQL statement like:

UPDATE <table> SET id=<new-id-number> WHERE id=<old-id-number>;

The change will be propagated down to the entries that use the tag so that any that used the old tag number will now use the new one; i.e. they will appear unchanged and continue to have the tag applied.

1.4. Migrating local tag changes to the JMdictDB project

The steps needed to implement tag changes for distribution by the JMdictDB project are described below in section 2 but two core steps are to update the tags in the kw*.csv files and to write a SQL script to update other existent databases.

tools/kwcmp.py is a tool that can help with these two steps by comparing your modified database tag tables with the (presumed) unmodified CSV files and generating two output files:

A Unix diff file that can be applied by the Unix "patch" utility to update the CSV files to match the database tag tables. This can expedite the CSV file updates required below.
```
$ tools/kwcmp.py -fdiff jmdict > update.patch
```
A SQL script that will apply the changes you made to your database tag tables to a different unmodified database. This script can be used as the core of the db/updates/ script required below.
```
$ tools/kwcmp.py -fsqlx jmdict > update.sql
```

For more information on kwcmp.py see Appendix D.

2. JMdictDB tag changes for distribution

This section describes the procedure for implementing a "global" change to the JMdictDB tags — that is, a change that will be incorporated in the JMdictDB software and distributed to all JMdictDB users (both of them. :-) This is in contrast to a local tag change that can be made in the database tables at a particular site and that affects only that site.

There are two places that need changes:

the kw*.csv file(s) that define the tag values (step 4 below),
a SQL database update script to propagate the changes to any existent databases (step 5 below).

2.1. Procedure

The procedure will generally be the same as for any development activity, as described in the Development Guide, section 5, "JMdictDB development process". Here we will supplement three of those steps with additional information specific to changing the tags.

5.4, Update the code
Most tag changes will not require modifications to any JMdictDB Python code, only to the database files.
5.5, Update the database files
See 2.2, “Update the appropriate csv file(s)” below.
5.8, Write an update script to update existent databases
See 2.3, “Write a SQL script to update existent databases” below.

2.2. Update the appropriate csv file(s)

The tag changes need to be incorporated in the jmdictdb/data/kw*.csv files. This can be done manually by editing the files and making the necessary changes, or by extracting csv files from a database that has had its kw* tables updated locally.

2.2.1. Manually edit the kw*.csv files

The kw*.csv files in jmdictdb/data/ are the canonical source of static tag information in JMdictDB.

If you are adding a new tag, add a line defining the new tag to the appropriate .csv file in jmdictdb/data/. Each line consists of either three or four tab-separated fields:

id -- The next sequential number.
kw -- A short abbreviation for the tag.
descr -- A longer (but one line) description of the tag.  If the
  text contains any double-quote characters (") the entire descr
  text should be enclosed in double-quotes and each embedded
  double-quote should be doubled (i.e., each " changed to "").
ents -- Modifications to the tag for use in an XML file.
  This field is present only in the files kwdial.csv, kwfld.csv,
  kwkinf.csv, kwmisc.csv, kwpos.csv and kwrinf.csv.

Not all tags can be changed simply by changing the values in the database tables and csv files. References to some tags are hardwired in the JMdictDB code and that code must also be changed when the tag’s 'id' or 'kw' value is changed or the entire row is deleted. See Appendix A for a list of those tags.

For more details on the "ents" value, see Appendix C.

Be careful that your editor is set to preserve tabs (including trailing ones) when saving rather than replacing them with spaces.
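
The tab separation and double-quote rule for "descr" can be illustrated with Python's standard csv module. This is only a sketch of the line format described above, not how JMdictDB itself reads or writes these files. The function name format_kw_row and the tag values are hypothetical, and only a three-field line (no "ents" field) is shown.

```python
import csv
import io

def format_kw_row(tag_id, kw, descr):
    """Format one tab-separated line for a kw*.csv file (no "ents" field)."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter="\t",             # fields are separated by tabs
        quoting=csv.QUOTE_MINIMAL,  # quote a field only when it needs it
        lineterminator="\n",
    )
    writer.writerow([tag_id, kw, descr])
    return buf.getvalue()

# Hypothetical tag whose description contains double-quote characters.
row = format_kw_row(500, "xmpl", 'so-called "example" term')
print(repr(row))   # '500\txmpl\t"so-called ""example"" term"\n'
```

A description without double-quotes is written as-is; one that contains them is enclosed in double-quotes with each embedded double-quote doubled.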

2.2.2. Update the kw*.csv files from a database

If a database already exists in which the tag changes have been applied as local changes to the kw* tables (see section 1), you can use the kwcmp.py tool to extract those tables directly to csv files in the jmdictdb/data/ directory. This is often the case when tag changes are made at edrdg.org and it is desired to incorporate them in the JMdictDB distribution:

tools/kwcmp.py -f csv

By default it will read the "jmdict" database and output the csv files to jmdictdb/data/. The defaults can be overridden with options; use --help for details. Since jmdictdb/data/ is controlled by Git, 'git diff' will show the changes:

$ cd jmdictdb/data/
$ git diff *.csv

There is more information about kwcmp.py in Appendix D.

2.3. Write a SQL script to update existent databases

Generally, tag changes will involve writing a database SQL script that will update existing databases with the tag changes and set a new database version number. Full details of this process are described in the Development Guide, section 5, "JMdictDB development process". Here, we will give an overview addressing tag-specific changes.

The csv files contain the canonical tag definitions and are loaded into a jmdictdb database when it is first created but changes made to the csv files afterwards are not automatically propagated to existing databases — that is done by the script you write here.

Generally, making changes to tags is similar to the procedure described in section 1.2, using the SQL statements described in Appendix B. The difference is that the SQL script produced will follow some additional conventions described below and it will be packaged for distribution with the rest of the JMdictDB code.

After the database has been updated the tag changes will automatically appear in the Search and Help pages.

If a database already exists in which the tag changes have been applied as local changes (see section 1), you can use the kwcmp.py tool to extract those changes in the form of a SQL script that can be used as the core of the database update script. See Appendix D.

Appendix A — Tags to avoid changes to

The following table lists the tag tables and specific tags that you should avoid changing because the existing values of those tags are referenced from within the JMdictDB code and those references would also need to be updated. The recommendation against changes applies only to row deletion and the 'id' and 'kw' fields; the 'descr' and 'ents' fields (if present) generally may be changed even for the tags listed below.

Tables not listed may have any of their tags modified as needed.

Please note that if a tag is deleted from a table the tag will also be silently deleted from any entries that use that tag. Entries in an XML file that use the tag will also have the tag dropped, and a warning message will be generated, when the file is loaded into a database.

There may be other tags in use in the JMdictDB code that have not been found and documented yet.

Table	Tags
kwfreq	(all)
kwginf	equ, lit
kwlang	eng [*1]
kwmisc	male, fem, uk
kwrinf	nanori
kwpos	n
kwstat	(all)
kwsrc	(all)
kwsrct	(all)
kwxref	see, ant [*2]


Appendix B — SQL to add, change or delete tags

This section provides SQL statements that can be composed in a script file to implement tag changes locally, or in a database update script when implementing the changes as part of a JMdictDB software update.

Note that in SQL statements, case is not significant; upper-case is used here just as a matter of convention.

To view the current contents of any tag table, run:

psql <database-name>

and then enter (replacing "<table>" with the actual kw* table name):

SELECT * FROM <table> ORDER BY id;

In the SQL statements below callouts indicate lines containing parameters in angle brackets that need to be replaced with actual values:

1	<table> -- Name of the kw table to alter, e.g. "kwdial".
2	<lnktable> -- Name of the table that applies tags to entries. This is the same as <table> but without the "kw" prefix. For example, if <table> is "kwdial", then <lnktable> will be "dial".
3	<id> -- Id number of the tag to be altered. For a new tag in the JMdictDB software this will generally be the next highest unused number. For local changes, starting at a larger number, for example 500, is advised to avoid conflicts with new tags introduced in the JMdictDB software from time to time. Use "SELECT * FROM <table>;" to see all the current values.
4	<tag-name> -- New name the tag is to be given. Unless overridden in the <ents> field, this will also be used as the entity name in the JMdict XML DTD.
5	<description> -- Description for the tag. Unless overridden in the <ents> field, this will also be used as the entity value in the JMdict XML DTD.
6	<ents> -- A JSON string that controls how and in which XML files the tag will appear. See Appendix C for more information.

To leave any of these fields empty (NULL), use the word NULL without any surrounding quotes.

Regarding the "ents" field, the short version is: if "ents" is empty (NULL) the tag will appear as an entity in the JMdict XML but not in the JMnedict XML. If you want something different then it’s time to read Appendix C.

If there are any single quote characters (') in any of the fields, they should be doubled. For example, to set the "descr" field of a tag to "'taru' adjective", use:

UPDATE kwpos SET descr='''taru'' adjective' ...

The outer single-quotes are required SQL syntax; the single quotes around "taru" were doubled.
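
If you generate your SQL statements from a program, the NULL and single-quote rules above can be captured in a small helper. The following Python sketch is illustrative only. The name sql_literal and the id value 500 are hypothetical, and it assumes PostgreSQL's default standard_conforming_strings setting, under which backslashes need no special treatment.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal for use in a script file."""
    if value is None:
        return "NULL"                    # bare word, no surrounding quotes
    if isinstance(value, int):
        return str(value)                # id numbers are written unquoted
    # Strings: double each embedded single quote, then add the outer quotes.
    return "'" + str(value).replace("'", "''") + "'"

# Hypothetical update of a tag with id 500 in the kwpos table.
stmt = "UPDATE kwpos SET descr=%s WHERE id=%s;" % (
    sql_literal("'taru' adjective"), sql_literal(500))
print(stmt)   # UPDATE kwpos SET descr='''taru'' adjective' WHERE id=500;
```

This kind of quoting is meant for producing script files; code that talks to the database directly would normally pass values as query parameters instead.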

To add a new tag:

INSERT INTO <table> VALUES(<id>,'<tag-name>','<descr>','<ents>');   (1)(3)(4)(5)(6)

To delete an existing tag:

Note that the statement to delete the tag will fail if there are any entries that use the tag, including Rejected or Deleted entries. If that is the case, run this first:

DELETE FROM <lnktable> WHERE kw=<id>;                               (2)(3)

Then delete the tag:

DELETE FROM <table> WHERE id=<id>;                                  (1)(3)
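
The relationship between <table> and <lnktable> in the two statements above can be sketched in Python as follows. The helper name delete_tag_sql and the id value 500 are hypothetical; the sketch only builds the statement text and does not touch a database.

```python
def delete_tag_sql(table, tag_id):
    """Return the two statements needed to delete a tag that is in use."""
    if not table.startswith("kw"):
        raise ValueError("expected a kw* table name, got %r" % table)
    lnktable = table[2:]          # e.g. "kwdial" -> "dial"
    tag_id = int(tag_id)          # id numbers are plain integers
    return [
        # First remove the tag from every entry that uses it...
        "DELETE FROM %s WHERE kw=%d;" % (lnktable, tag_id),
        # ...then remove the tag definition itself.
        "DELETE FROM %s WHERE id=%d;" % (table, tag_id),
    ]

for stmt in delete_tag_sql("kwdial", 500):
    print(stmt)
```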

To change the name, descr or ents value of an existing tag:

UPDATE <table> SET kw='<tag-name>' WHERE id=<id>;                  (1)(4)(3)
UPDATE <table> SET descr='<description>' WHERE id=<id>;            (1)(5)(3)
UPDATE <table> SET ents='<ents>' WHERE id=<id>;                    (1)(6)(3)

If you are changing more than one field you can combine them in a single statement, for example:

UPDATE <table> SET kw='<tag-name>',descr='<description>' WHERE id=<id>;  (1)(4)(5)(3)

Appendix C — The "ents" field

The "ents" column of the kw* tables and CSV files that have one (kwdial, kwfld, kwkinf, kwmisc, kwpos, kwrinf) contains information about how and in which XML files the tag will appear as an XML entity item.

The contents of this field are either NULL or a JSON string.

If the "ents" value is empty (NULL), then the tag will be represented as an entity in JMdict XML with an entity name that is the same as the tag’s "kw" value and an entity value that is the same as the tag’s "descr" value. For JMnedict XML, the tag will be neither recognized when parsing XML nor output when generating XML.

If the "ents" value is not empty, then it must be a JSON string representing an object. The object must contain items with the keys "jmdict", "jmnedict" or both. The "jmdict" item applies when processing JMdict XML and if absent the effect is the same as if the "ents" value was empty (the tag will be treated as an entity in the XML). The "jmnedict" item works similarly for JMnedict XML (if absent the tag will not be treated as an entity in the XML.)

The value of each item should be 0, 1 or another object. If 0, the tag is neither recognized nor produced in the XML. If 1, the tag is recognized and produced as an entity in the XML. Note that because tags are produced by default for JMdict and not for JMnedict, an "ents" value of {"jmdict":1} is effectively a no-op, as is {"jmnedict":0}.

If the value is another object, it must have the keys "e", "v" or both. If there is an "e" key, that item’s value will be used for the entity name in the XML rather than the tag’s "kw" value. If there is a "v" key, that item’s value will be used for the entity’s value in the XML rather than the tag’s "descr" value.

Some examples from the kwmisc table:

 id | kw  |     descr     | ents
----+-----+---------------+------
  5 | col | colloquialism |

This has no "ents" value and thus the entity &col; will be recognized and produced in the JMdict XML but not in the JMnedict XML.

 id  |   kw    |       descr       |      ents
-----+---------+-------------------+-----------------
 181 | surname | family or surname | {"jmnedict":1}

The &surname; entity will be recognized and produced in the JMnedict XML. It will also be recognized and produced in the JMdict XML by default since it is not specifically excluded with "jmdict":0.

 id |  kw  |            descr             | ents
----+------+------------------------------+------
 15 | male | male term, language, or name |
   {"jmdict":{"v":"male term or language"},
    "jmnedict":{"e":"masc", "v":"male given name or forename"}}

  [Note: above is line-wrapped for this document; would be a
   single line in the kwmisc.csv file.]

In this case the tag description "male term, language, or name" is used within JMdictDB but in the JMdict XML, the value of the &male; entity will be "male term or language". In the JMnedict XML the entity name will be &masc; and its value will be "male given name or forename".
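
The rules in this appendix can be summarized in a short Python sketch. It is an illustration of the behavior described here, not the code JMdictDB actually uses. The function name xml_entity and the DEFAULTS table are hypothetical; the three sample rows are the kwmisc examples shown above.

```python
import json

# Default treatment when "ents" is NULL or has no item for the XML type:
# tags are entities in JMdict XML (1) but not in JMnedict XML (0).
DEFAULTS = {"jmdict": 1, "jmnedict": 0}

def xml_entity(kw, descr, ents, xml_type):
    """Return (entity name, entity value) for a tag, or None if the tag
    is neither recognized nor produced in that type of XML."""
    spec = DEFAULTS[xml_type]
    if ents:                                  # None models a NULL "ents" value
        spec = json.loads(ents).get(xml_type, spec)
    if spec == 0:
        return None
    if spec == 1:
        return (kw, descr)
    # Otherwise an object: "e" overrides the name, "v" overrides the value.
    return (spec.get("e", kw), spec.get("v", descr))

male_ents = ('{"jmdict":{"v":"male term or language"},'
             ' "jmnedict":{"e":"masc", "v":"male given name or forename"}}')

print(xml_entity("col", "colloquialism", None, "jmnedict"))
print(xml_entity("surname", "family or surname", '{"jmnedict":1}', "jmdict"))
print(xml_entity("male", "male term, language, or name", male_ents, "jmnedict"))
```

The first call prints None, the second prints the unchanged "kw" and "descr" values, and the third prints the overridden name and value for JMnedict.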

Appendix D — Useful tools

This section describes programs (currently just one) in the tools/ directory that may be useful when making changes to tags. Run the program(s) with the --help option for full usage details.

tools/kwcmp.py

Compares CSV files to tag tables in a database and generates various kinds of scripts that will bring one into conformance with the other. Specifically it can:

Create a SQL script that will make the database tables match the CSV files.
Create a SQL script that will make database tables that were created from the CSV files match the given database tables. (Useful as the basis for a db/updates/ script.)
Create a diff file that can be applied (with the Unix "patch" utility) to the CSV files to make them match the database tables. (Useful for updating the CSV files after tag updates have been applied to a development database.)
Rewrite the CSV files from scratch from the database tables.

The tool can be useful when a set of tag changes has been applied locally to a database (see section 1) and one wishes to incorporate them into the JMdictDB code (see section 2):

$ tools/kwcmp.py -fdiff dev-database >kw-changes.diff
$ patch -p1 <kw-changes.diff
$ tools/kwcmp.py -fsqlx dev-database >kw-changes.sql
[Write the db/updates/ script per section 2.3 above and use
kw-changes.sql as the body of the script.]