On Jan 28, 2008, at 10:21 PM, Jim Breen wrote:

On 29/01/2008, Stuart McGraw <smcg4191@*****com> wrote:
> Paul Blay wrote:
> > The trigger for having an ID in the first place was to help the
> > maintainer of a multi-lingual example sentence site to keep his
> > content synchronised with the current version of the Tanaka Corpus.
> > (c.f. Tatoeba Project).
>
> I'm glad someone was able to convince you and Jim to do this, even
> if I wasn't. :-) Thanks, it is an important improvement to the file's
> usability.

Yes, it was the Tatoeba effort that made it urgent. One of the sad things is that the Tatoeba people, who are working on translations into French, etc., decided to go back to the original Tanaka file instead of the scrubbed and corrected version, and consequently they have worked on versions of sentences that we have already deleted or changed. (They wanted a version in UTF8, and didn't notice that the edited version is also available in UTF8.)
I have now updated the docs page at
http://www.csse.monash.edu.au/~jwb/tanakacorpus.html to include the #ID=nnnnn and also the ~ tag on the B line.
The format of the current Tanaka file is getting a bit hairy as features get added. It's close to time to have a clean XML structure for export and relegate the text-only forms to legacy use. (That doesn't mean it can't go on being maintained as it is by Paul.)
Taking the sentence in the doc page as a starter:
A: その家はかなりぼろ屋になっている。[TAB]The house is quite run down.#ID=25507
B: 其の{その} 家(いえ)[1] は 可也{かなり} ぼろ屋[1]~ になる[1]{になっている}
How would the following go?
<sentence id="25507">
  <s_group xml:lang="ja">
    <s_text>その家はかなりぼろ屋になっている。</s_text>
    <s_details>
      <s_det><s_hw>其の</s_hw><s_rep>その</s_rep></s_det>
      <s_det><s_hw>家</s_hw><s_kana>いえ</s_kana><s_sensno>1</s_sensno></s_det>
      <s_det><s_hw>は</s_hw></s_det>
      <s_det><s_hw>可也</s_hw><s_rep>かなり</s_rep></s_det>
      <s_det><s_hw>ぼろ屋</s_hw><s_sensno>1</s_sensno><s_pri/></s_det>
      <s_det><s_hw>になる</s_hw><s_sensno>1</s_sensno><s_rep>になっている</s_rep></s_det>
    </s_details>
  </s_group>
  <s_group xml:lang="en">
    <s_text>The house is quite run down.</s_text>
  </s_group>
</sentence>
That could cater for multiple languages, and much more information about each element.
It would be a relatively simple task to generate something like the above from the current file.
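As a rough illustration of that conversion, here is a minimal Python sketch that turns one A:/B: pair into the proposed structure. It is only a sketch under some assumptions: the function name sentence_to_xml and the token regex are hypothetical, the "[TAB]" in the doc page is taken to stand for a literal tab character, each B: token is assumed to follow the order headword, (reading), [sense number], {surface form}, ~ as in the example, and the Japanese group is labelled with the standard language code "ja". Real data may contain cases this does not handle.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# One B: token, e.g. 家(いえ)[1] or になる[1]{になっている} or ぼろ屋[1]~
# (assumed field order; hypothetical pattern, not taken from the docs page)
TOKEN_RE = re.compile(
    r"(?P<hw>[^(\[{~]+)"           # headword
    r"(?:\((?P<kana>[^)]*)\))?"    # optional (reading)
    r"(?:\[(?P<sense>\d+)\])?"     # optional [sense number]
    r"(?:\{(?P<rep>[^}]*)\})?"     # optional {form used in the sentence}
    r"(?P<pri>~)?"                 # optional ~ tag
)


def sentence_to_xml(a_line, b_line):
    """Build a <sentence> element from one A: line and its B: line."""
    if not (a_line.startswith("A: ") and b_line.startswith("B: ")):
        raise ValueError("expected an A: line followed by a B: line")

    # Split off the trailing #ID=nnnnn, then the tab between the two texts.
    text, sep, sent_id = a_line[3:].rstrip("\n").rpartition("#ID=")
    if not sep:
        raise ValueError("A: line has no #ID= field")
    japanese, _, english = text.partition("\t")

    sentence = ET.Element("sentence", id=sent_id)
    ja = ET.SubElement(sentence, "s_group", {XML_LANG: "ja"})
    ET.SubElement(ja, "s_text").text = japanese
    details = ET.SubElement(ja, "s_details")

    for token in b_line[3:].split():
        m = TOKEN_RE.fullmatch(token)
        if m is None:
            raise ValueError(f"unrecognised B: token: {token!r}")
        det = ET.SubElement(details, "s_det")
        ET.SubElement(det, "s_hw").text = m.group("hw")
        if m.group("kana"):
            ET.SubElement(det, "s_kana").text = m.group("kana")
        if m.group("sense"):
            ET.SubElement(det, "s_sensno").text = m.group("sense")
        if m.group("rep"):
            ET.SubElement(det, "s_rep").text = m.group("rep")
        if m.group("pri"):
            ET.SubElement(det, "s_pri")

    en = ET.SubElement(sentence, "s_group", {XML_LANG: "en"})
    ET.SubElement(en, "s_text").text = english
    return sentence


a = "A: その家はかなりぼろ屋になっている。\tThe house is quite run down.#ID=25507"
b = "B: 其の{その} 家(いえ)[1] は 可也{かなり} ぼろ屋[1]~ になる[1]{になっている}"
print(ET.tostring(sentence_to_xml(a, b), encoding="unicode"))
```

Adding further languages would then amount to appending more s_group elements to the same sentence element.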
Interested in comments/suggestions/brickbats...
Seems like the same thing could be accomplished with greater simplicity by adding a "C:" line, but that's just me.