Re: [edict-jmdict] Closer integration of examples with new jmdict system.



On 15 July 2010 04:49, Paul Blay <blay.paul@googlemail.com> wrote:
> > On Jul 14, 2010, at 7:17 AM, Paul Blay wrote:
> > > Jim and I both (separately) do our own set of checks, but the
> > > process isn't open to other editors and there are things that could /
> > > should be checked that don't get done (often) due to the large
> > > amount of manual interaction required.
> >
> > Paul what is the status of the B line situation?
>
> Mostly up to date. I check it on Thursday, and Jim usually catches
> a few more things needed to fix on Saturday (those are the two days
> the wwwjdic.csv file is updated).

Yes. To document this for people who are not deep into the Tatoeba
project:

An extract of the JE pairs in the database is made on Saturdays (French
time), and the file is placed in an FTP directory. It's a tab-separated
file with lines like this (TAB standing for a tab character):

4983TAB1579TABあら、申し訳ございません。TABOh, I'm sorry.TABあら 申し訳ございません~
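
For anyone wanting to experiment with the file, a line like the one above
could be split up along these lines. This is only a sketch, not the actual
cron job: the field names are hypothetical, and the meaning of the two
leading numbers (presumably sentence IDs) is an assumption.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    # Field names are illustrative; the two numeric fields are assumed
    # (not confirmed) to be sentence IDs.
    first_id: int
    second_id: int
    japanese: str
    english: str
    indices: str


def parse_line(line: str) -> Pair:
    """Split one tab-separated line of the export into its five fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    first_id, second_id, japanese, english, indices = fields
    return Pair(int(first_id), int(second_id), japanese, english, indices)


# The sample line from above, with real tab characters.
sample = "4983\t1579\tあら、申し訳ございません。\tOh, I'm sorry.\tあら 申し訳ございません~"
pair = parse_line(sample)
print(pair.japanese)         # the Japanese sentence
print(pair.indices.split())  # the individual index elements
```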

On Sunday morning (Australian time) I run a cron job which fetches the file
and:

(a) turns it into the A/B format used by wwwjdic (with a bit of fiddling
to handle Unicode sequences for which there are no JIS codes).

(b) does a word-by-word comparison of the Japanese in the A-line and the
indices in the B-line. Any mismatches are checked against a file of
known absentees (mostly proper names), and any not found there are
reported in an error file, which I check over during the week. (*)

(*) These days my checking back with the sentences and indices in
Tatoeba reveals that Paul has usually already fixed them. This week
all I did was fix one index in Tatoeba and add two names to the list.
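
The check in (b) might be approximated as below. This is a guess at the
general idea, not the script that actually runs: the function name, the
punctuation list and the greedy left-to-right matching are all assumptions,
as is the index syntax beyond bare words and the trailing ~ (the optional
(reading), [sense] and {surface form} decorations). The second sentence in
the usage lines is made up purely for illustration.

```python
import re

# Assumed index element syntax: head(reading)[sense]{surface}~, with
# everything after the head optional.
INDEX_RE = re.compile(
    r"^(?P<head>[^(\[{~]+)"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>[^\]]*)\])?"
    r"(?:\{(?P<surface>[^}]*)\})?"
    r"(?P<good>~)?$"
)
# Punctuation and whitespace that never needs an index (assumed list).
PUNCT_RE = re.compile(r"[、。！？「」『』・…\s]+")


def find_mismatches(sentence: str, indices: str, known_absentees: set) -> list:
    """Report index elements missing from the sentence, and sentence text
    covered by no index element and not listed as a known absentee."""
    mismatches = []
    leftovers = []
    pos = 0
    for token in indices.split():
        m = INDEX_RE.match(token)
        if m is None:
            mismatches.append(f"unparseable index: {token}")
            continue
        # Prefer the {surface form} if given, else the headword itself.
        form = m.group("surface") or m.group("head")
        found = sentence.find(form, pos)
        if found < 0:
            mismatches.append(f"index not in sentence: {form}")
            continue
        leftovers.append(sentence[pos:found])
        pos = found + len(form)
    leftovers.append(sentence[pos:])
    for piece in leftovers:
        for word in PUNCT_RE.split(piece):
            if word and word not in known_absentees:
                mismatches.append(f"unindexed text: {word}")
    return mismatches


# The sample pair from above: nothing to report.
print(find_mismatches("あら、申し訳ございません。", "あら 申し訳ございません~", set()))
# A made-up sentence with a proper name that has no index.
print(find_mismatches("太郎は学生です。", "は 学生 です", set()))
print(find_mismatches("太郎は学生です。", "は 学生 です", {"太郎"}))
```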

> Some things that aren't up to date:
>
> * Checking for entries that are in the examples index but do not have
> a 'good example' pair marked out with ~

Difficult to automate. There are 29,827 on the "good example" list
at present, which is not too bad.

> * Checking for entries that now have multiple senses, but didn't when
> the ~ was assigned.

Tough one.

> * Checking for entries that have more than one example marked out
> with ~ (can happen when entries are merged).

I can derive a list of these. There are about 20, e.g. プールサイド
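
Deriving such a list could look roughly like this. It is a sketch only:
the function name is hypothetical, entries are keyed on the bare headword
(an assumption, since merged entries may need reading or sense taken into
account), and the B-lines in the usage lines are invented for illustration.

```python
import re
from collections import Counter


def multiply_marked(b_lines) -> list:
    """Return headwords that carry the ~ marker in more than one B-line."""
    counts = Counter()
    for indices in b_lines:
        for token in indices.split():
            if token.endswith("~"):
                # Drop any (reading), [sense], {surface} and ~ decorations.
                head = re.split(r"[(\[{~]", token, maxsplit=1)[0]
                counts[head] += 1
    return sorted(head for head, n in counts.items() if n > 1)


# Invented B-lines, not taken from the real file.
b_lines = [
    "プールサイド~ で",
    "プールサイド~ に",
    "あら 申し訳ございません~",
]
print(multiply_marked(b_lines))
```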

I'd like to move my processing of the ex-Tatoeba file from Monash
to arakawa in the near future, and would be interested in exploring
ways of making it more open. In truth most of the editing work has
to happen within Tatoeba, but there is scope for cranking up some
checking tools, etc., on arakawa.

HTH

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Treasurer: Hawthorn Rowing Club, Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne