
RE: [edict-jmdict] bad entries in submissions



Jim Breen wrote:
> [Stuart McGraw (RE: [edict-jmdict] bad entries in submissions) writes:]
> >> What I wanted to do before presenting these
[...]
> >>   - Write an importer for the examples and jmnedict files.
> >>       Need this because it may influence db schema.
> 
> I don't see why. I see them both as having quite different structures
> and purposes, and I can't see any lexicographic reasons for having them
> in the one schema. I'd suggest putting them to one side.

But the structures are nearly identical, aren't they?  They all represent
"pieces" of Japanese, and they are all represented in the same way:
entries consisting of readings, kanji, and meanings (which in turn consist
mostly of a set of glosses).  Currently the entries in jmdict consist of
words, sub-words (e.g. suffixes), expressions, and names (place and person).
jmnedict holds more names, and the examples file holds sentences.  But they
are all "pieces" of Japanese and represented in fundamentally the same
way.
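
To make the "same shape" point concrete, here is a rough sketch, not the
actual schema.  The class names, the "corpus" field, and the sample data
are all made up for illustration; the only thing taken from the description
above is that an entry has kanji, readings, and meanings made of glosses.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    glosses: list[str]


@dataclass
class Entry:
    corpus: str                 # hypothetical tag: 'jmdict', 'jmnedict' or 'examples'
    kanji: list[str]
    readings: list[str]
    senses: list[Sense] = field(default_factory=list)


# Illustrative data only: a word, a name, and a sentence in the same structure.
pieces = [
    Entry("jmdict", ["食べる"], ["たべる"], [Sense(["to eat"])]),
    Entry("jmnedict", ["東京"], ["とうきょう"], [Sense(["Tokyo"])]),
    Entry("examples", ["私は寿司を食べる"], ["わたしはすしをたべる"],
          [Sense(["I eat sushi."])]),
]


def all_glosses(entry: Entry) -> list[str]:
    """Flatten the glosses of every sense; works unchanged for all three kinds."""
    return [g for sense in entry.senses for g in sense.glosses]


for piece in pieces:
    print(piece.corpus, piece.kanji[0], all_glosses(piece))
```

The same function handles all three kinds of entry without caring which
file it came from, which is the sense in which I call the structures
nearly identical.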

"different purposes" is irrelevant in my view.  At my last job the main
corporate database contained diverse information such as inventory
data, purchase and sales orders, bank account information, employee
data,...  Clearly information like "how many gallons of epoxy are in 
the warehouse?" has a very different purpose than "how much money 
do we owe XYZ Corp.?", or "what did John Smith work on this week?".  
The important thing is that they are all *related*.  The epoxy is in the
warehouse because of a purchase order that resulted in money owed
to XYZ Corp, etc.

In the same way, entries in JMdict are related to sentences in the
example sentences file.  The example sentences are (among other
things) examples of (the use of) entries in jmdict.  The entries in
jmdict contain detailed information about words in the example
sentences.  Actualizing these relationships in the database offers 
the same benefits that you get for actualizing them among the parts
of jmdict.  You get error checking and guaranteed data consistency.
In my attempts to load the examples file I found a lot of apparent
inconsistencies (I haven't gone through them in detail yet, so I'm not
sure how many are real and how many are not).
Answering questions like "how many examples are there for word X?",
"what words don't have examples?", "which pos/senses for word Y
are not represented by examples?", and other more complicated questions
becomes trivial and requires no coding.
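
As a rough illustration of what I mean (this is a toy sketch, not my real
schema: the table and column names, the "corpus" tag, and the sample rows
are assumptions made up for this message), here it is with SQLite through
Python's sqlite3 module.  The foreign keys give the error checking, and
each question is a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE entry (
    id     INTEGER PRIMARY KEY,
    corpus TEXT NOT NULL,          -- hypothetical tag: 'jmdict' or 'examples'
    text   TEXT NOT NULL
);
CREATE TABLE example_use (         -- "this sentence uses this word"
    example_id INTEGER NOT NULL REFERENCES entry(id),
    word_id    INTEGER NOT NULL REFERENCES entry(id),
    PRIMARY KEY (example_id, word_id)
);
""")
conn.executemany("INSERT INTO entry VALUES (?, ?, ?)", [
    (1, "jmdict", "食べる"),
    (2, "jmdict", "飲む"),
    (3, "jmdict", "寿司"),
    (10, "examples", "私は寿司を食べる"),
    (11, "examples", "彼は毎日寿司を食べる"),
])
conn.executemany("INSERT INTO example_use VALUES (?, ?)",
                 [(10, 1), (10, 3), (11, 1), (11, 3)])


def examples_for(word):
    """How many examples are there for word X?"""
    row = conn.execute(
        """SELECT COUNT(*) FROM example_use u
           JOIN entry w ON w.id = u.word_id
           WHERE w.text = ?""", (word,)).fetchone()
    return row[0]


def words_without_examples():
    """What words don't have examples?"""
    rows = conn.execute(
        """SELECT w.text FROM entry w
           LEFT JOIN example_use u ON u.word_id = w.id
           WHERE w.corpus = 'jmdict' AND u.word_id IS NULL
           ORDER BY w.id""")
    return [r[0] for r in rows]


# An example that points at a word that does not exist is refused outright.
try:
    conn.execute("INSERT INTO example_use VALUES (10, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(examples_for("食べる"), words_without_examples(), rejected)
```

The last part is the kind of inconsistency that slips through silently in
separate text files but cannot be loaded at all once the relationship is
declared in the database.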

And as a practical matter I would think that having easy access to 
example sentences might be helpful to the approvers of changes, 
and maybe even be leveraged to encourage people to submit example 
sentences when submitting new entries or updates. 

> >>   - Write an exporter to jmdict xml.
> >>       To verify there is no information loss.
> 
> By "jmdict xml" do you mean a database reflecting the jmdict structure?

No.

> For me the XML version is simply an export format, and the database
> itself is the core.

Yes, that was my view too.  I just meant creating (a script to create)
a jmdict xml export file from the data in the database to confirm that 
all the information needed to do that really and truly is in the database.
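
The check I have in mind is a round trip: import, export, import again,
and compare.  A minimal sketch follows, using a small subset of the jmdict
element names and a list of dicts standing in for the database; the sample
entry and its sequence number are made up, and the real file has many more
elements than this handles.

```python
import xml.etree.ElementTree as ET

# Made-up sample entry using a small subset of jmdict elements.
SAMPLE = """<JMdict>
<entry>
<ent_seq>1000001</ent_seq>
<k_ele><keb>食べる</keb></k_ele>
<r_ele><reb>たべる</reb></r_ele>
<sense><gloss>to eat</gloss></sense>
</entry>
</JMdict>"""


def load(xml_text):
    """Importer: XML text -> list of plain records (stand-in for the database)."""
    records = []
    for e in ET.fromstring(xml_text).iter("entry"):
        records.append({
            "seq": e.findtext("ent_seq"),
            "kanji": [k.text for k in e.iter("keb")],
            "readings": [r.text for r in e.iter("reb")],
            "senses": [[g.text for g in s.iter("gloss")]
                       for s in e.iter("sense")],
        })
    return records


def export(records):
    """Exporter: records -> XML text, built only from what was stored."""
    root = ET.Element("JMdict")
    for rec in records:
        e = ET.SubElement(root, "entry")
        ET.SubElement(e, "ent_seq").text = rec["seq"]
        for k in rec["kanji"]:
            ET.SubElement(ET.SubElement(e, "k_ele"), "keb").text = k
        for r in rec["readings"]:
            ET.SubElement(ET.SubElement(e, "r_ele"), "reb").text = r
        for glosses in rec["senses"]:
            s = ET.SubElement(e, "sense")
            for g in glosses:
                ET.SubElement(s, "gloss").text = g
    return ET.tostring(root, encoding="unicode")


original = load(SAMPLE)
round_tripped = load(export(original))
assert round_tripped == original  # fails if the stored data lost anything it models
```

This only proves there is no loss for the elements the importer knows
about; anything it ignores has to be caught by comparing against the
original file.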

> [...]
> All good points. I realise there may seem to be a degree of selfishness
> in what I have asked/suggested. I certainly do have a motive for all
> this, because I want to get away from being the sole editor of
> EDICT/JMdict, etc. but as long as it's a text file it really can't be
> shared around. I *could* just walk away from it, but I fear that that 
> could degrade the work that I and many others have put into it over the
> years.

Understood.  I never meant to imply any selfishness on your part.
But I think there is a point of view in psychology that all behavior is
basically selfish on some level, even altruistic behavior.  It's certainly
true in my case.  Every time I load jmdict et al. into my database I have to
spend hours or days resolving inconsistencies.  So I have a very
selfish interest in contributing to this project!  :-)

> >> So I guess what I'm saying in a long winded way is, if people
> >> are willing to do development work, the recipients of that work also
> >> have a duty to invest time reviewing that work, making decisions
> >> about what they want, and giving feedback, and not just waiting
> >> for a finished solution (if that was ever a hope).
> >> 
> >> Just some of my thoughts.  I am quite happy to see there
> >> is still interest in this.
> 
> I'm certainly eager to give feedback where I can. I commented quite a
> bit on Pawel's prototype. I couldn't comment as much on yours because
> it was mainly in the form of schemas which I couldn't drive. If you can get
> to the stage of using the update forms, etc. I'll be very involved.

Yes, also understood.  I tried to package up the schema stuff 
to make it easily installable, and provide some command line tools
to get at and play with the data.  But it was still a long conceptual 
distance from the terms I think you are thinking in and a fairly
big PITA to install for people not used to doing this frequently.

Someone (Ronan?) previously offered to host a development 
environment for the jmdict project IIRC.  Maybe it's time to take 
him up on the offer?  It would be easier for you and other interested 
people to "see" the database via a web UI, but my current ISP is not
very amenable to that kind of thing.  (They adopted the current
American practice of having reasonable rates until you go over
some quota, for which they then charge absurdly high rates.  
Banks here do the same thing.)