
RE: [edict-jmdict] jmdict language tags (was: Enhancements to the English translations)



Jim Breen wrote:
> We had quite an extensive discussion about language tags last year. I'm
> not sure a firm consensus was reached, although I was leaning towards
> a situation rather like your proposal below.
> 
> On 20/04/07, Stuart McGraw <smcg4191@frii.com> wrote:
> > This is a proposal for consolidating and unifying the use of
> >  language tags with JMdict.
> >
> >  There are three places in JMdict xml where the name of a language
> >  is used:
> >   1. entry <lang> element -- Language from which the Japanese word was derived.
> >   2. gloss "g_lang" attribute -- Language in which the gloss is written
> >   3. within gloss text -- Language and word from which the Japanese word was derived.
> >
> >  It seems to me that the third use is problematic in light of the
> >  current discussion about making the glosses more amenable to
> >  use by NLP systems.
> >
> >  It seems to me that the information in both #1 and number #3
> >  can be combined and stored using #2 (and one additional attribute.).
> 
> I see #2 as being quite separate from the other two. It only exists when JMdict
> goes into languages other than English, and is only related to the gloss itself,
> whereas #1 & #3 relate to the Japanese.

Yes, I agree.  I have a bad habit of trying to over-economise 
on data elements and conflate things that shouldn't be conflated.
As you point out, they are clearly two different things.

[...]
> I think the indication of the source-language of the Japanese
> word needs to be at the sense level, and not necessarily part of any
> gloss. A variant of what you suggest above would be:
> 
>  	<entry>
>  	<ent_seq>1019420</ent_seq>
>  	...
>  	<reb>アルバイト</reb>
>  	...
>       <sense>
>  	<lsource xml:lang="de">arbeit</lsource>
>  	<gloss>part-time job</gloss>
>  	<gloss>side job</gloss>
>  	[...more glosses in various languages...]
> 
> This would effectively drop <lang> entirely; replacing it with
> <lsource> at the sense level.

This seems like a good way to do it to me.
Are there any senses that have two different sources 
in the same sense?  There aren't any in the jmdict now, and 
it seems improbable.
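
A rough way to check that question against a file using the proposed
sense-level layout is sketched below. It assumes <lsource> sits directly
under <sense> and carries an xml:lang attribute, as in the example quoted
above; the sample entry and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample in the proposed layout (not copied from JMdict).
SAMPLE = """<entry>
<ent_seq>1019420</ent_seq>
<sense>
<lsource xml:lang="de">arbeit</lsource>
<gloss>part-time job</gloss>
<gloss>side job</gloss>
</sense>
</entry>"""


def senses_with_multiple_sources(entry):
    """Return (sense number, languages) for senses with more than one source language."""
    hits = []
    for index, sense in enumerate(entry.iter("sense"), start=1):
        langs = {ls.get(XML_LANG, "") for ls in sense.findall("lsource")}
        if len(langs) > 1:
            hits.append((index, sorted(langs)))
    return hits


entry = ET.fromstring(SAMPLE)
multi = senses_with_multiple_sources(entry)
```

Run over every entry, an empty result would support the guess that
no sense currently needs two sources.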

There are two glosses (I think that is all) that have some 
extra information about the source word:
     1638670 <gloss>cloth band worn around hair (ru: Katyusha - name of character in Tolstoy novel)</gloss>
     1925200 <gloss>having a fairy-tale atmosphere (de: Maerchen plus -tic)</gloss>
"name of character in Tolstoy novel" is meta-information 
about the source word, not part of the source word, yes?  
Where would this extra information go?  Possibly in an 
<lsource> attribute?
I'm not sure whether the second case is meta-information 
or not.

One thing I noticed while looking at the embedded source 
language info is that many of the source language words are 
capitalized even though they are not proper nouns (judging 
from the English glosses).  Capitalization never bothered me 
much in case-insensitive MS Access (except the *#%#@ 'uK' 
and 'uk' tags! :-) but now that I am on case-sensitive PostgreSQL, 
I notice it a lot more.

[...]
> >  1926100
> >    <gloss g_lang="ai" source></gloss>
> >    <gloss>tufted puffin</gloss>
> 
> Or better:  <lsource xml:lang="ai"/>
> which I think is the XML way of having an "empty" entity.

Yes, better (although XML treats the self-closed form and the
explicitly closed form as identical, AFAIK).
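
A quick check of that equivalence, using Python's standard ElementTree
parser, might look like the following. This is only a sketch; the
summarize() helper is a made-up name, and it compares just the parts of
the element a dictionary client is likely to read.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

self_closed = ET.fromstring('<lsource xml:lang="ai"/>')
explicit = ET.fromstring('<lsource xml:lang="ai"></lsource>')


def summarize(elem):
    """Return tag, language, text content and child count of an element."""
    return (elem.tag, elem.get(XML_LANG), elem.text or "", len(elem))


# Both spellings parse to the same tag, attribute, (empty) text and children.
same = summarize(self_closed) == summarize(explicit)
```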

> BTW, the reason there are some empty ones is that while
> カンタービレ *may* have come to Japanese directly from Italian,
> the English gloss is also "cantabile", and I betcha it came
> via English. I simply didn't want to clutter the entry with
> essentially duplicated words. I guess I can live with something
> like:
> 
> <lsource xml:lang="it">cantabile</lsource>
> <gloss>cantabile</gloss>
> 
> but a smart dictionary client could  do things  like notice that
> the <lsource> entity was the same as one of the glosses, and
> suppress it.

I too think that's better.  I usually find it's easier for clients to 
make explicit information implicit, than to go the other way.
Certainly in the database it should be explicit.
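
As a sketch of how a client could turn the explicit form back into the
implicit one: the function below drops a source word when it matches one
of the glosses, but keeps the language code. The function name and the
(lang, word) pair representation are assumptions for illustration, and
keeping the language while hiding the word is only one reading of
"suppress".

```python
def displayable_sources(lsources, glosses):
    """Hide source words that merely repeat a gloss.

    lsources: list of (lang, word) pairs, word may be None for an empty <lsource>.
    glosses:  list of gloss strings for the same sense.
    Returns a list of (lang, word) pairs with duplicated words set to None.
    """
    seen = {g.strip().casefold() for g in glosses}
    kept = []
    for lang, word in lsources:
        duplicate = bool(word) and word.strip().casefold() in seen
        kept.append((lang, None if duplicate else word))
    return kept


# The cantabile case: the word repeats the gloss, so only the language remains.
cantabile = displayable_sources([("it", "cantabile")], ["cantabile"])
# A source word that differs from every gloss is left alone.
arbeit = displayable_sources([("de", "arbeit")], ["part-time job", "side job"])
```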

[...]
> >  All listings below are from the 2007-04-10 JMdict.
> >  ============================================================
> >  Entries with a source <lang> tag, but without a gloss giving
> >  a specific source word in the format used for other such glosses.
> 
> Most of these I can handle and expand, but I am at a bit of a loss
> as to what to put for 併音 or チゲ.

Since your <lsource> element can be empty, not providing a word
remains an option in these two cases, yes?

[...]
> What I, or someone, needs to do is to consolidate the info
> in the <lang> tag with whatever is currently embedded in the glosses.

I visually scanned a few hundred of them and didn't see any 
discrepancies between the gloss info and the <lang> info.

The consolidation would be this?:

1. Entries with a <lang> element, but without parsable source
language info in one or more glosses (the first list in my
previous email):  manually identify a source word from the 
gloss text or external sources if possible, and generate the
<lsource>/revised <gloss> elements by hand.

All the entries on that list have only a single sense, so if one
didn't want to manually identify source words for each, one 
could generate empty <lsource> tags automatically based on
the <lang> element.

2. Entries without a <lang> element, but with parsable language
source info in one or more glosses (the second list in my previous
email): machine-extract the source language/word from the gloss
and machine-generate <lsource> and revised <gloss> elements.

3. Entries with both a <lang> element and parsable language
source info in one or more glosses: as above, but check for
and report (for manual verification) any <lang> elements not
matched by one of the glosses.  I would guess that no check is
needed for a gloss source language that has no corresponding 
<lang> tag -- as in #2 above, just generate an <lsource> tag 
based on the gloss.
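
The three cases could be sketched roughly as follows. This is not the
attached script; consolidate() and its inputs are made-up names, with the
<lang> codes given as a set and the gloss-derived sources as (lang, word)
pairs that some earlier parsing step is assumed to have produced.

```python
def consolidate(entry_langs, gloss_sources):
    """Decide which <lsource> elements to generate for one entry.

    entry_langs:   set of language codes from the entry's <lang> elements.
    gloss_sources: list of (lang, word) pairs parsed out of the glosses.
    Returns (lsources, needs_review), where needs_review lists <lang> codes
    that no gloss accounts for.
    """
    if entry_langs and not gloss_sources:
        # Case 1: only <lang> is known, so emit empty <lsource> elements
        # that can be filled in by hand later.
        return [(lang, None) for lang in sorted(entry_langs)], []
    # Cases 2 and 3: the gloss info drives the generated <lsource> elements.
    gloss_langs = {lang for lang, _ in gloss_sources}
    # Case 3 only: report <lang> codes missing from the glosses.
    needs_review = sorted(set(entry_langs) - gloss_langs)
    return list(gloss_sources), needs_review


case1 = consolidate({"ai"}, [])
case2 = consolidate(set(), [("de", "Arbeit")])
case3 = consolidate({"de", "fr"}, [("de", "Arbeit")])
```

In the last call "fr" comes back in needs_review, which corresponds to
the manual-verification report described in case 3.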

I wrote a Perl script (attached) last night to do most of the 
above.  It doesn't actually copy the full jmdict file.  Instead, 
for each entry with a gloss containing anything like a language 
tag (regex "[^a-z][a-z][a-z]:"), it attempts to parse the gloss 
and reports either an error message if it can't, or the generated 
<lsource> and <gloss> elements (along with the original <gloss> 
for comparison).  This is just to evaluate the pattern matching 
code, since what needs to be converted is Jim's master file, 
not the JMdict file (I think).
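
For anyone without the attachment, the idea can be sketched in Python.
This is not a port of the Perl script: only the candidate regex is taken
from the description above, while the SOURCE pattern (a trailing
"(xx: word)" group, as in the two glosses quoted earlier) and the
split_gloss() name are assumptions.

```python
import re

# Candidate test from the message: something that looks like a language tag.
CANDIDATE = re.compile(r"[^a-z][a-z][a-z]:")

# Assumed layout: gloss text followed by a trailing "(xx: source word)".
SOURCE = re.compile(
    r"^(?P<gloss>.*?)\s*\((?P<lang>[a-z]{2}):\s*(?P<word>[^()]*)\)\s*$"
)


def split_gloss(text):
    """Split a gloss into (revised gloss, language, source word).

    Returns None when the gloss has nothing resembling a language tag,
    and raises ValueError when it has one that cannot be parsed.
    """
    if not CANDIDATE.search(text):
        return None
    match = SOURCE.match(text)
    if match is None:
        raise ValueError("unparsable source info: " + text)
    return match.group("gloss"), match.group("lang"), match.group("word").strip()


parsed = split_gloss("having a fairy-tale atmosphere (de: Maerchen plus -tic)")
plain = split_gloss("tufted puffin")
```

Glosses that raise ValueError are the ones that would need to be
reported for manual attention.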

It's something of a hack, but you can run it with
the name of a jmdict xml file as its argument, and it will write
the transformed glosses to stdout, and the error messages
about entries it couldn't figure out (none on the 2007-04-15
jmdict_e file) to stderr.

Finally (if you made it this far) I want to ask about the <dial>
element and whether there are (or may arise) complications with
it.  I know nothing about dialects in Japanese but wonder
if there might be readings or senses that are dialect specific,
raising the same issues with <dial> that we are discussing 
with <lang>.




Attachment: bin5BGK9Hcbxb.bin
Description: application/ygp-stripped