
Re: [edict-jmdict] Changes in Tanaka corpus format?




On Jan 28, 2008, at 10:21 PM, Jim Breen wrote:

On 29/01/2008, Stuart McGraw <smcg4191@*****com> wrote:
> Paul Blay wrote:
> > The trigger for having an ID in the first place was to help the
> > maintainer of a multi-lingual example sentence site to keep his
> > content synchronised with the current version of the Tanaka Corpus.
> > (cf. Tatoeba Project).
>
> I'm glad someone was able to convince you and Jim to do this, even
> if I wasn't. :-) Thanks, it is an important improvement to the file's
> usability.

Yes, it was the Tatoeba effort that made it urgent. One of the sad things
is that the Tatoeba people, who are working on translations into French, etc.,
decided to go back to the original Tanaka file instead of the scrubbed
and corrected version, and consequently they have worked on versions of
sentences that we have already deleted or changed. (They wanted a version
in UTF8, and didn't notice that the edited version is also available in UTF8.)

I have now updated the docs page at
http://www.csse.monash.edu.au/~jwb/tanakacorpus.html to include the
#ID=nnnnn and also the ~ tag on the B line.

The format of the current Tanaka file is getting a bit hairy as features
get added. It's close to time to have a clean XML structure for export
and relegate the text-only forms to legacy use. (That doesn't mean it
can't go on being maintained as it is by Paul.)

Taking the sentence in the doc page as a starter:

A: その家はかなりぼろ屋になっている。[TAB]The house is quite run down.#ID=25507
B: 其の{その} 家(いえ)[1] は 可也{かなり} ぼろ屋[1]~ になる[1]{になっている}

How would the following go?

<sentence id="25507">
<s_group xml:lang="ja">
<s_text>その家はかなりぼろ屋になっている。</s_text>
<s_details>
<s_det><s_hw>其の</s_hw><s_rep>その</s_rep></s_det>
<s_det><s_hw>家</s_hw><s_kana>いえ</s_kana><s_sensno>1</s_sensno></s_det>
<s_det><s_hw>は</s_hw></s_det>
<s_det><s_hw>可也</s_hw><s_rep>かなり</s_rep></s_det>
<s_det><s_hw>ぼろ屋</s_hw><s_sensno>1</s_sensno><s_pri/></s_det>
<s_det><s_hw>になる</s_hw><s_sensno>1</s_sensno><s_rep>になっている</s_rep></s_det>
</s_details>
</s_group>
<s_group xml:lang="en">
<s_text>The house is quite run down.</s_text>
</s_group>
</sentence>

That could cater for multiple languages, and much more information about
each element.

It would be a relatively simple task to generate something like the above
from the current file.
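
To give a rough idea of how that conversion might go, here is a minimal Python sketch for a single A:/B: pair. It is only an illustration under assumptions: the function name convert_pair and the token regex are made up for this example, [TAB] in the A line is taken to be a literal tab character, and B-line tokens are assumed to follow the order headword, (reading), [sense], {surface form}, ~. The sketch writes xml:lang="ja" (the ISO 639-1 code for Japanese) and puts the surface form inside <s_rep>.

```python
import re
import xml.etree.ElementTree as ET

# One B-line token: headword, then optional (reading), [sense], {surface form}, ~
# (assumed ordering, based on the example sentence only)
TOKEN_RE = re.compile(
    r"^(?P<hw>[^(\[{~]+)"
    r"(?:\((?P<kana>[^)]*)\))?"
    r"(?:\[(?P<sensno>\d+)\])?"
    r"(?:\{(?P<rep>[^}]*)\})?"
    r"(?P<pri>~)?$"
)

def convert_pair(a_line, b_line):
    """Build a <sentence> element from one A:/B: line pair (hypothetical helper)."""
    # A line: Japanese text, a tab, English text, then an optional #ID=nnnnn
    japanese, rest = a_line[len("A: "):].rstrip("\n").split("\t", 1)
    english, _, sent_id = rest.partition("#ID=")

    sentence = ET.Element("sentence", {"id": sent_id.strip()})
    ja_group = ET.SubElement(sentence, "s_group", {"xml:lang": "ja"})
    ET.SubElement(ja_group, "s_text").text = japanese
    details = ET.SubElement(ja_group, "s_details")

    # B line: whitespace-separated tokens, one <s_det> per token
    for token in b_line[len("B: "):].split():
        m = TOKEN_RE.match(token)
        if m is None:
            raise ValueError(f"unrecognised B-line token: {token!r}")
        det = ET.SubElement(details, "s_det")
        ET.SubElement(det, "s_hw").text = m.group("hw")
        for name in ("kana", "sensno", "rep"):
            if m.group(name) is not None:
                ET.SubElement(det, "s_" + name).text = m.group(name)
        if m.group("pri"):
            ET.SubElement(det, "s_pri")  # the ~ tag becomes an empty element

    en_group = ET.SubElement(sentence, "s_group", {"xml:lang": "en"})
    ET.SubElement(en_group, "s_text").text = english
    return sentence

a = "A: その家はかなりぼろ屋になっている。\tThe house is quite run down.#ID=25507"
b = "B: 其の{その} 家(いえ)[1] は 可也{かなり} ぼろ屋[1]~ になる[1]{になっている}"
print(ET.tostring(convert_pair(a, b), encoding="unicode"))
```

A real converter would also need to loop over the whole file and decide what to do with pairs that lack an ID or contain tokens the pattern does not cover; the sketch simply raises an error on the latter.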

Interested in comments/suggestions/brickbats...


Seems like the same thing could be accomplished with greater simplicity by adding a "C:" line, but that's just me.