
Re: [edict-jmdict] JMdict internationalization effort - let's (finally) do it!



Hello Jim, everybody, and happy new year 2012!

> As I promised/threatened, I am giving this project some attention in
> the New Year.
> I would *really* like this to go ahead as quickly as possible.

It's nice to hear your thoughts about this. Things have also been
moving on my side, and I'm more and more convinced something positive
will come out of this.

> - concentrate on JMdict. Once French glosses from the project are going out
>   in JMdict, we can turn to kanjidic2 (the way JMdict and kanjidic2 are built
>   at my end are totally different, so I can't get much leverage between them.)

Currently both JMdict and kanjidic2 are open for translation, into
French but also into any other language. Of course you can choose to
limit the amount of data you fetch back at first if that makes things
easier for you.

> - I want to bypass the JMdictDB at this stage, for several reasons:
>   o "divide and conquer" is a good approach for complex projects
>   o I have other language glosses being added to JMdict (German, Dutch,
>     etc.) and I need the build to be similar for all of them
>   o this way is the least extra work for me
>   o Stuart's time is better spent on other JMdictDB matters

My thoughts exactly.

> - I want to take the "raw" French glosses from Alexandre's Transifex
>   project and tip them into my JMdict-build process. To do this I need
>   to get or build a text file like this (e.g. for エレベータ)
> ...
> 1030630 1 ascenseur
> 1030630 2 gouvernail de profondeur (aviat)
> ...
> If Alexandre can create them in that format, it would be fantastic.

This can easily be done.
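
To make the target format concrete, here is a minimal sketch of the export
step. It assumes the glosses are already available as (sequence number,
sense number, text) tuples and that the three fields are separated by a
single space, as they appear in your example; the function name and the
input structure are hypothetical, not part of any existing script.

```python
def format_gloss_lines(glosses):
    """Turn (seq, sense, text) tuples into 'seq sense text' lines.

    Assumption: fields are separated by a single space, as in the
    example file layout quoted above.
    """
    lines = []
    # Sort by entry sequence number, then by sense number.
    for seq, sense, text in sorted(glosses):
        # Collapse stray whitespace and newlines inside the gloss.
        text = " ".join(text.split())
        if text:  # skip empty (untranslated) strings
            lines.append(f"{seq} {sense} {text}")
    return lines


sample = [
    (1030630, 2, "gouvernail de profondeur (aviat)"),
    (1030630, 1, "ascenseur"),
]
print("\n".join(format_gloss_lines(sample)))
```

If the build expects tabs rather than spaces, only the format string needs
to change.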

> A monthly supply of French glosses would be fine, but frankly I could
> just as easily handle a daily one. It's just a cron job.

Same thing here. If I set up a job to generate the list for you, I may
as well run it daily.

> I realise this is not using Alexandre's merging script, but I do the merge
> at a pre-XML stage, then convert to XML later, and it's best to continue
> to do all the languages together. Also it all happens on an elderly Solaris
> system and I think the versions of Perl and Python are rather old.

Actually this upstream approach would make things easier for me too.
The merge script existed because there was no clear plan to integrate
the data back into JMdict and I wanted to use the data for at least my
own projects, so I needed a file in JMdict format. But now that you
are talking about fetching it daily, there is no point in making a
separate release anymore.

> In fact, for kanjidic2 it's not that different. I need a file like:
> ...
> 六 six
> 七 sept
> 八 huit
> 八 radical huit (no. 12)
> 九 neuf
> 十 dix
> 口 bouche
> 日 jour
> 日 soleil
> 日 Japon
> 日 compteur de jours
> 月 lune
> 月 mois
> ...
>
> Can we fly it like that?

Absolutely. In addition, I would like to draw your attention to the
fact that some new languages have been submitted for kanjidic2. In
particular, a single user is about to finish translating all the jouyou
kanji into Italian, and the translation quality seems good:

https://www.transifex.net/projects/p/jmdict-i18n/team/it/

Apart from that, most of the activity has been around the French
language, with 10 members in the team, more or less active. But we
also have some Thai translations coming in.

I will have to rethink how to update the source strings from new
JMdict files once our translations are included, in particular how to
detect regressions (e.g. when a source string has changed and its
translation needs to be checked again), but the scheme you propose
should make things much easier for me too. I will try to come up with
something quickly for you to try.
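
One possible way to detect such regressions, as a minimal sketch only: keep
a fingerprint of the English source string each translation was made
against, and flag the translation when the current source no longer
matches. The function names, the (seq, sense) keys and the English strings
below are made up for illustration; this is not how the Transifex project
stores its data.

```python
import hashlib


def fingerprint(source):
    """Stable fingerprint of an English source string."""
    return hashlib.sha1(source.encode("utf-8")).hexdigest()


def find_stale(translations, current_sources):
    """Return keys whose source string changed or disappeared.

    translations: key -> (fingerprint of source at translation time, text)
    current_sources: key -> English source string in the new JMdict file
    """
    stale = []
    for key, (old_fp, _text) in translations.items():
        source = current_sources.get(key)
        if source is None or fingerprint(source) != old_fp:
            stale.append(key)
    return sorted(stale)


# Illustrative data only: keys are (entry sequence, sense number).
translations = {
    (1030630, 1): (fingerprint("elevator; lift"), "ascenseur"),
    (1030630, 2): (fingerprint("elevator (aviation)"),
                   "gouvernail de profondeur (aviat)"),
}
current_sources = {
    (1030630, 1): "elevator; lift",
    (1030630, 2): "elevator (aircraft control surface)",
}
print(find_stale(translations, current_sources))
```

Flagged entries would then go back to the translators for review instead
of being exported.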

One point I would like to ask you about is whether the other language
sources (i.e. the sources that currently provide you with French,
German and Russian translations for JMdict) are going to be updated as
well. I have already imported them, so the files I generate will include
translated strings you already have. We could address this, and
greatly simplify the translation process, if we:

1) decide that JMdict-i18n is the only source for translations in some
languages (particularly languages that are active, like French), or

2) are more radical and switch entirely to JMdict-i18n for translation.
Divide and conquer, as you said. You would only have one source to
import for all languages and would not need to worry about translations
anymore. This would make sense if none of the languages has received
updates for a while, as it would allow JMdict-i18n to take over and
ease contributions.

I can also write smarter scripts that only output translations that
were not present in the source files or that have changed, but I would
rather say that the translation import process is already more complex
than it should be, and suggest we try to simplify things, especially
since that would also make it easier for everyone to contribute.
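
For reference, the "smarter script" fallback would amount to something like
the sketch below. It assumes both the glosses you already have and the
JMdict-i18n ones can be loaded as dictionaries keyed by (sequence number,
sense number); the function name and the sample data are hypothetical.

```python
def changed_glosses(existing, candidate):
    """Keep only glosses that are new or differ from the existing ones.

    existing: key -> gloss already present in the current source files
    candidate: key -> gloss coming from the translation project
    """
    delta = {}
    for key, text in candidate.items():
        if existing.get(key) != text:
            delta[key] = text
    return delta


# Illustrative data only.
existing = {(1030630, 1): "ascenseur"}
candidate = {
    (1030630, 1): "ascenseur",
    (1030630, 2): "gouvernail de profondeur (aviat)",
}
print(changed_glosses(existing, candidate))
```

It works, but it means keeping two sets of sources in sync, which is the
complexity I would like us to avoid.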

Looking forward to reading your thoughts!
Alex.