
Re: [edict-jmdict] JMdict download format



On 09/24/2015 04:36 PM, Jim Breen jimbreen@gmail.com [edict-jmdict] wrote:
> On 25 September 2015 at 04:25, Stuart McGraw smcg6347@outlook.com [edict-jmdict] wrote:
>[...] 
> I shudder to think what a CSV would look like, and I really have
> to wonder how it would be an improvement over the XML.

The main improvement is that csv (or sql) could be loaded into 
any database directly, without parsing xml.  Parsing the xml 
typically means either 1) using the jmdict loader in the JMdictDB 
software, which is limited to Postgresql, 2) writing one's own xml 
parsing code, a non-trivial task, or 3) using a generic xml-to-sql 
converter, which typically results in a very poor schema.
  
The tradeoff made by loading from csv or sql is that the database 
schema loaded into is predetermined.  The tables and their columns 
all must exist and be compatible with what the csv/sql was generated 
from, which is of course the schema defined by JMdictDB.
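
As a rough illustration of that tradeoff, here is a minimal Python 
sketch that loads csv rows into a table that must already exist and 
refuses to load if the csv columns don't match.  The table name 
`reading`, its columns, and the sample rows are made up for the 
example and are not the JMdictDB schema; SQLite is used only because 
it ships with Python.

```python
import csv
import io
import sqlite3

# Hypothetical table for illustration only; NOT the real JMdictDB schema.
DDL = """
CREATE TABLE reading (
    entry_id INTEGER NOT NULL,
    ord      INTEGER NOT NULL,
    txt      VARCHAR(255) NOT NULL,
    PRIMARY KEY (entry_id, ord)
)
"""

# Stand-in for a distributed csv file: a header row followed by data rows.
CSV_DATA = "entry_id,ord,txt\n1,1,たべる\n2,1,のむ\n2,2,のみ\n"


def load_csv(conn, table, csv_file):
    """Load csv rows into an existing table; return the number of rows loaded.

    Raises ValueError if the table is missing or its columns differ from
    the csv header. The table name is assumed to come from trusted code.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    # PRAGMA table_info is SQLite-specific; other databases expose column
    # metadata differently (e.g. through information_schema).
    table_cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if not table_cols:
        raise ValueError(f"table {table!r} does not exist")
    if header != table_cols:
        raise ValueError(
            f"csv columns {header} do not match table columns {table_cols}"
        )
    rows = list(reader)
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)  # the table has to be created before any data is loaded
    print(load_csv(conn, "reading", io.StringIO(CSV_DATA)), "rows loaded")
```

No xml parsing is involved, but the load only works because the 
table was created beforehand with exactly the columns the csv expects.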

>> Such a format will be tied to a specific (version of a) database
>> schema. Of course that schema is publicly available but would
>> need to be advertised along with the .sql formatted files. It is
>> also quite Postgresql-specific and if .sql (or .csv) files were
>> published, it would be desirable to publish a version of the
>> schema with the postgresql-specific bits elided.
> 
> That really goes to the heart of it. If an "sql" file differs between
> Postgresql, MySQL, etc. then I don't think we should consider making
> one available, as we'd be heading into all the issues that are
> avoided by just publishing in XML.

The distributed sql (or csv) file containing the dictionary data 
would be loadable, unchanged, into any common database.  But the 
database would need to have all the tables pre-created, with the 
expected columns and datatypes for those columns before loading 
the data.  The JMdictDB script that creates those tables is what I 
meant about being very Postgresql-specific.  One would need to do 
a lot of editing to use it to create the tables in a MySQL database 
for example.

But suppose all one wanted was to create tables suitable for loading 
sql or csv jmdict data into, with no intention of using the database 
with the rest of the JMdictDB code (which is what requires all 
the Postgresql bells and whistles).  In that case a more generic sql 
script for creating the schema could be written using only standard 
sql, and it could be used on most databases with few or no changes.
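
To give an idea of what such a generic script might look like, here 
is a small Python sketch that creates three tables using only plain 
column types and key constraints.  The tables `entry`, `reading` and 
`gloss` and all their columns are hypothetical and far simpler than 
the real JMdictDB schema; it is run against SQLite here only because 
that needs no server, and I have not tried it on other databases.

```python
import sqlite3

# Hypothetical, simplified tables; the real JMdictDB schema has many more
# tables and columns. Only widely supported SQL is used (INTEGER, VARCHAR,
# PRIMARY KEY, FOREIGN KEY); no sequences, arrays, or custom types.
PORTABLE_DDL = [
    """CREATE TABLE entry (
        id  INTEGER NOT NULL PRIMARY KEY,
        seq INTEGER NOT NULL
    )""",
    """CREATE TABLE reading (
        entry_id INTEGER NOT NULL,
        ord      INTEGER NOT NULL,
        txt      VARCHAR(255) NOT NULL,
        PRIMARY KEY (entry_id, ord),
        FOREIGN KEY (entry_id) REFERENCES entry (id)
    )""",
    """CREATE TABLE gloss (
        entry_id INTEGER NOT NULL,
        ord      INTEGER NOT NULL,
        lang     VARCHAR(3) NOT NULL,
        txt      VARCHAR(2000) NOT NULL,
        PRIMARY KEY (entry_id, ord),
        FOREIGN KEY (entry_id) REFERENCES entry (id)
    )""",
]


def create_schema(conn):
    """Create the tables one statement at a time.

    Only DB-API calls (cursor, execute, commit) are used, so in principle
    the same function could be handed a connection from another driver.
    """
    cur = conn.cursor()
    for stmt in PORTABLE_DDL:
        cur.execute(stmt)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ):
        print(name)
```

Parent tables are created before the tables that reference them, so 
the statements can be run in the order given.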

Maybe someone (other than me :-) would be interested in doing that.
But then there would be little point unless the jmdict data were 
going to be distributed in sql or csv format.