Re: [edict-jmdict] JMdict internationalization effort - let's (finally) do it!



Greetings,

Thanks to Alexandre for (re)opening this topic. It's
something I think is very important, and I'd love to
see some progress with it.

I have interpolated a couple of comments, and added a
longer one at the bottom.

On 4 November 2011 01:51, Alexandre Courbot <gnurou@gmail.com> wrote:
> I come up with this topic about once every year, so since 2011 is
> coming to an end I thought I should bring it back on the table. ;)

Let's hope there is some movement.

> Some may remember (3 years ago already) that, as the writer of a
> software that uses JMdict but is also used by non-English speakers
> (many French people notably), I got lots of requests for more complete
> French translations in the dictionary. This raised my interest as to
> how current translations of JMdict are handled and how it is possible
> to contribute to them. If I remember correctly, Jim is currently
> handling translations through various files of various formats and
> merges them into the JMdict file (the same applies to kanjidic2),
> which makes it hard and inconsistent to maintain.

Quite true. Actually things with the French translations are getting worse,
because the blending in of the French glosses (from Jean-Marc
Desperrier's project) is done using sequence numbers and sense
numbers, and as we delete/merge entries and reorder senses, the
number of failed or screwed-up merges grows. (About 270 glosses
are failing at present.)
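
To make the failure mode concrete, here is a minimal sketch of a merge
keyed on sequence number and sense number. It is not the actual merge
code; the function name, the data layout and the 9999999 entry are
made up for illustration.

```python
def merge_french(entries, french):
    """Attach French glosses by (seq, sense number).

    Returns the glosses that could not be placed. Hypothetical layout:
    entries maps seq -> list of senses, each sense a dict of gloss lists.
    """
    failed = []
    for (seq, sense_no), gloss in french.items():
        senses = entries.get(seq)
        # The entry was deleted/merged, or the sense number no longer exists.
        if senses is None or not (1 <= sense_no <= len(senses)):
            failed.append((seq, sense_no, gloss))
            continue
        senses[sense_no - 1].setdefault("fre", []).append(gloss)
    return failed


entries = {
    1030630: [{"eng": ["elevator", "lift"]}, {"eng": ["elevator (aviation)"]}],
}
french = {
    (1030630, 1): "ascenseur",
    (1030630, 3): "gouvernail de profondeur (aviat)",  # stale sense number
    (9999999, 1): "exemple",  # made-up entry that no longer exists
}
failed = merge_french(entries, french)
```

A check like this only catches glosses that point at nothing. If two
senses have simply been swapped, the gloss still lands somewhere, just
on the wrong sense, and nothing is reported.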

> ......Thanks to the great
> work done by Stuart, we now have a good way to add new entries and
> amend existing ones, but it still does not handle translations in
> languages other than English. At the same time, it is perfectly
> understandable that JMdict, as a project, wants to focus on English
> instead of spreading into as many directions as there are spoken
> languages on the planet.
>
> So at that time I thought it would be nice to have an interface
> similar to what Stuart did directed at translators of the JMdict, so
> that people can collaboratively translate the dictionary in other
> languages, à la Tatoeba. This would allow to move all existing
> translations into a single format (which would probably simplify Jim's
> life), to effectively improve non-English languages coverage, while -
> most important maybe - not getting in the way of the English JMdict
> effort.

I have some thoughts/comments on this, which I'll add later.
The Tatoeba example is useful. That project certainly has enabled
a huge amount of parallel translation of sentences, and does it
with very loose controls, and only after-the-event quality control,
something I'm not sure would work that well in a dictionary.

> Well, it seems like we actually have all the tools we need to do that
> now : meet Transifex (https://www.transifex.net), an online platform
> for the collaborative translation of software projects using
> internationalization libraries like GNU Gettext or Qt .ts format. For
> those who are not familiar, the principle is that a software team
> extracts all the strings it uses into a special format and uploads it
> there so that people can translate it into their language of choice.
> The developers then get the translated strings back and bundle them
> with their software so that it can choose the right language at
> runtime. Transifex's interface is very well designed, with a fast and
> efficient AJAX form for translations, language teams and managers,
> etc.

It's an excellent platform. I'm particularly interested in it from the
perspective of WWWJDIC's interface, which uses a Gettext-like approach
to text strings to drive its English and Japanese versions. I'd love to
see a French version, for example.

> The idea is that, if we can do that with software, why couldn't we do
> the same with JMdict? I have thus written a small Python script that
> extract all the glosses in the JMdict file, associates them to their
> existing translation, adds some context information to make them
> non-ambiguous (entry id/sense number/gloss number) and put them into
> Gettext .po files. Upload that result to Transifex, and voilà :
>
> https://www.transifex.net/projects/p/jmdict-i18n/resource/jlpt5/ (for
> demonstration purposes I only extracted a subset of the JMdict - the
> partial translations also come from the JMdict itself)
>
> Now anybody with a Transifex account can translate individual glosses
> online, or download the whole .po file to do it with his favorite
> translation tool. Since every entry keeps an unambiguous reference to
> the gloss it translates, all the translations can be merged back into
> the JMdict file. I tried by extracting the existing translated glosses
> and merging them back and ended with an identical JMdict.
>
> I think we could use that to
> 1) Use a single, open, standard format for handling JMdict gloss
> translations instead of the various hacks Jim is currently relying on;
> 2) Have a single translation effort that would not interfere with the
> actual JMdict;
> 3) Finally allow all the people who want to see JMdict translated into
> other languages to do it.
>
> ... and the same could be done with kanjidic2, of course.
>
> If the idea suits Jim, I'm willing to finalize my scripts and start
> maintaining the effort on Transifex.
>
> Right now the script acts by creating a translation entry for every
> gloss, and putting the keb & reb of the entry + english gloss in the
> message field, so the translator has a glimpse at both the gloss and
> Japanese word it refers (see this for instance:
> https://www.transifex.net/projects/p/jmdict-i18n/resource/jlpt5/l/fr/view/
> ). This may not be the most suitable solution, since it may not always
> be desirable to have a 1:1 match for every gloss. An alternative would
> be to have one translation entry per sense, with a special character
> to separate the translator's input into several glosses.
>
> So, this is my latest crazy idea to get more non-English stuff into
> JMdict. What do you guys think?
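
For anyone who hasn't seen Gettext files, the one-entry-per-gloss
layout described above might look roughly like the output of the sketch
below. This is not Alexandre's script; the function names, the
"headword: gloss" message layout and the pairing of existing
translations by position are assumptions of mine.

```python
def po_quote(text):
    """Escape a string for use inside a PO double-quoted literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def po_entries(seq, headword, senses):
    """Yield one PO entry per English gloss, keyed by seq/sense/gloss number."""
    for s_no, sense in enumerate(senses, start=1):
        french = sense.get("fre", [])
        for g_no, gloss in enumerate(sense["eng"], start=1):
            # Assumption: existing translations are paired by position.
            translation = french[g_no - 1] if g_no <= len(french) else ""
            yield "\n".join([
                "msgctxt " + po_quote(f"{seq}/{s_no}/{g_no}"),
                "msgid " + po_quote(f"{headword}: {gloss}"),
                "msgstr " + po_quote(translation),
            ])


senses = [
    {"eng": ["elevator", "lift"], "fre": ["ascenseur"]},
    {"eng": ["elevator (aviation)"]},
]
po_text = "\n\n".join(po_entries(1030630, "エレベーター", senses)) + "\n"
print(po_text)
```

The msgctxt line is what carries the reference needed to merge a
translation back, so it inherits whatever stability the sequence and
sense numbers have.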

Several comments.

First, anything that could see JMdict move away from its present
approach of being a JE base file with other languages hacked in later
would be a Good Thing, if not a Great Thing.

Second, a key issue is how it would be seen and handled in the
database. The easy thing would be to replicate the databases, having
a jmdictdb_en, jmdictdb_fr, etc. and squish the glosses together later.
That would result in a lot of the problems with the current approach
just continuing, although it may well simplify the translation process.

The ideal approach would be for the database itself to be truly JM. For
example, we have at present (just looking at the Eng and Fre bits)

<entry>
<ent_seq>1030630</ent_seq>
<r_ele>
<reb>エレベーター</reb>
<re_pri>gai1</re_pri>
<re_pri>ichi1</re_pri>
</r_ele>
<r_ele>
<reb>エレベータ</reb>
</r_ele>
<sense>
<pos>&n;</pos>
<gloss>elevator</gloss>
<gloss>lift</gloss>
<gloss xml:lang="fre">ascenseur</gloss>
</sense>
<sense>
<xref>昇降舵</xref>
<gloss>elevator (aviation)</gloss>
<gloss xml:lang="fre">gouvernail de profondeur (aviat)</gloss>
</sense>
</entry>

[I added that second French sense using Collins Robert...]
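
As a rough illustration of what the language tags already give us, the
sketch below groups the glosses of that entry by language for each
sense, using Python's standard ElementTree. The function name is mine,
and the inline DOCTYPE with a placeholder value for &n; is there only
so the fragment parses on its own. Glosses without an xml:lang
attribute are treated as "eng".

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Placeholder entity declaration so this fragment is parseable by itself.
ENTRY = """<!DOCTYPE entry [<!ENTITY n "noun">]>
<entry>
<ent_seq>1030630</ent_seq>
<r_ele><reb>エレベーター</reb></r_ele>
<sense>
<pos>&n;</pos>
<gloss>elevator</gloss>
<gloss>lift</gloss>
<gloss xml:lang="fre">ascenseur</gloss>
</sense>
<sense>
<xref>昇降舵</xref>
<gloss>elevator (aviation)</gloss>
<gloss xml:lang="fre">gouvernail de profondeur (aviat)</gloss>
</sense>
</entry>"""


def glosses_by_language(entry):
    """Return one dict per sense, mapping language code -> list of glosses."""
    result = []
    for sense in entry.findall("sense"):
        by_lang = defaultdict(list)
        for gloss in sense.findall("gloss"):
            by_lang[gloss.get(XML_LANG, "eng")].append(gloss.text)
        result.append(dict(by_lang))
    return result


senses = glosses_by_language(ET.fromstring(ENTRY))
```

Here the French glosses sit inside the sense they belong to, so no
sequence or sense numbers are needed to find their place.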

To enable this to work, the database interface needs to be able to
handle the extra language aspects. At the very least, non-English glosses
need to have language tags. Perhaps that is enough?

At present the JEL (edit language) for the above is:

[1][n]
  elevator; lift
[2][n]
  elevator (aviation)
  [see=1938460・昇降舵[1]]

It could just have something like: "[l:fre] ascenseur" and
"[l:fre] gouvernail de profondeur (aviat)" added to make it
work.
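
A sketch of how such a tag could be read, purely to show the idea: the
"[l:xxx]" syntax is only my suggestion above, and the regular
expression and function name are invented here, not part of the real
JEL parser.

```python
import re

# Hypothetical tag: "[l:" + three-letter language code + "]" at line start.
LANG_TAG = re.compile(r"^\[l:([a-z]{3})\]\s*(.*)$")


def split_gloss_line(line, default_lang="eng"):
    """Return (language, [glosses]) for one gloss line of a sense."""
    line = line.strip()
    match = LANG_TAG.match(line)
    if match:
        lang, text = match.group(1), match.group(2)
    else:
        # Untagged lines stay English, as at present.
        lang, text = default_lang, line
    return lang, [g.strip() for g in text.split(";") if g.strip()]


for raw in ["  elevator; lift", "[l:fre] ascenseur"]:
    print(split_gloss_line(raw))
```

Untagged lines keep their current meaning, so existing English-only
entries would not need to change.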

Ideally the interface could be made a bit friendlier to
people from non-English backgrounds, for example by
using colours to distinguish languages.

Anyway, I might stop there, and let others join in the discussion.
I'd be very interested in Stuart's views.

Thanks for raising the topic (yet again.)

Cheers

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne