RE: parsing submission data (was: [edict-jmdict] xrefs in WWWJDIC)
Jim Breen wrote:
> On 03/06/07, Stuart McGraw <smcg4191@frii.com> wrote:
> > Jim Breen wrote:
> > > Actually [[s_lang="en: the source word"]] would probably be better, just to
> > > emphasise to the user that the language code is needed.
> > >
> > > "from" may well be better than "s_lang", but it might confuse with "trans"
> > > and "lit". (Many people get those confused.)
> >
> > The last time this came up,
> > http://tech.groups.yahoo.com/group/edict-jmdict/message/1503,
> > http://tech.groups.yahoo.com/group/edict-jmdict/message/1524
> > you were considering making trans a sense.lsource
> > attribute, e.g.
> > <lsource lang="en" translit="soapland">
> > rather than a tagged gloss
>
> Well, I had __ls_type="wasei"__ in there, as "soapland"
> isn't meaningful English...
Yes, sorry, I realized that but forgot to indicate it
in the example.
> > <gloss lang="en" translit>soapland</gloss>
> > or gloss-like element
> > <translit lang="en">soapland</translit>
> > (BTW, I think the last two are informationally equivalent
> > and would have identical representations in the database.)
>
> That's still my thinking. I don't want a real gloss to contain broken
> English or non-English. The 和製英語 from which a loanword is transliterated is of
> interest, but it must be seen as information relating to a sense, not a
> translation of the Japanese word itself.
>
> > If you did go with the first, then it would come down to
> > teaching people:
> > 1. Use [from....] to specify the foreign language word
> > or pseudo-word a Japanese word was derived from.
> > 1a. Use [from ... trans*] when that word is not a real
> > word in the source language.
>
> Yes.
>
> > 2. Use a gloss to provide the meaning of the Japanese
> > word in English (or the specified gloss language).
> > May or may not be the same as the word given in
> > [from:...] (but will never be the same as [from...trans].)
>
> Yes, but there's no need for a [from="..."] if the 外来語 is from a
> real English/etc. word or phrase.
I realize the vast majority of gairaigo currently has no
lsource information, and it would probably be hard to add.
But, at least theoretically, would it be desirable to have
that information represented explicitly, on the grounds
that while a human can tell that ラジオ ("radio") was
derived from the English word, a computer would have
a much harder time deducing the same?
> > 3. Use [lit] for a gloss that is an unusually word-for-word
> > translation (but still a legitimate word/whatever in the
> > gloss language) [This is not expressed very clearly
> > but you get the idea I hope.]
>
> Mostly. Take the recent entry: 鬼の居ぬ間に洗濯. It contains:
> "(lit: refreshing oneself while the ogre is gone)." As a gloss
> that is uselessly literal, and should not be regarded as a gloss at all,
> but more some information about the background/structure of the Japanese
> phrase (most occurrences of "lit:" are for expressions like that.)
Isn't it the same case with the "explanatory" glosses
that you and Francis Bond proposed in the "Semi-automatic
Refinement..." paper? An explanation is not really a gloss
in the sense of a translation either (even less so than a "lit:"
gloss, it seems to me, although it is valid English).
> > I suspect that documenting and teaching people this will
> > be easier than teaching them some other things, like when
> > a gloss goes in an existing sense and when it is a new sense.
>
> I think so too.
>
> For a learnable syntax, how about:
>
> [from="fr: avec"] or just [from=it:] in the case of フェットチーネ.
> and
> [wasei="soaplady"] or [wasei="de: gebroken Deutsch"]
> and
> [lit="refreshing oneself while the ogre is gone"]
If you did a search for "ogre", would you expect
the 鬼の居ぬ間に洗濯 entry to appear in the results?
I am currently wrestling with a problem trying to use
the "[...]" syntax at the end of a gloss for gloss tags ("lit",
"expl", etc.), so adopting your suggested syntax (if it also
includes "expl") will simplify things a lot.
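To make the suggested syntax concrete, here is a minimal sketch of how such bracketed tags might be pulled out of a submitted gloss line. It is only an illustration under assumptions: the function name parse_tags, the dict layout, and the inclusion of "expl" as a tag name are hypothetical, and the rule that a leading two- or three-letter code plus colon is a language code is a guess at the intended convention.

```python
import re

# Hypothetical tag names; "expl" is included only because the message asks for it.
TAG_NAMES = ("from", "wasei", "lit", "expl")

# Matches [name="value"] or [name=value]; the quotes are optional.
TAG_RE = re.compile(r'\[(%s)=(?:"([^"\]]*)"|([^\]]*))\]' % "|".join(TAG_NAMES))

# Assumed convention: a leading 2-3 letter code and a colon, e.g. "fr: avec" or "it:".
LANG_RE = re.compile(r'^([a-z]{2,3}):\s*(.*)$')


def parse_tags(text):
    """Return (remaining_text, tags); tags is a list of dicts, one per tag."""
    tags = []
    for m in TAG_RE.finditer(text):
        name = m.group(1)
        value = m.group(2) if m.group(2) is not None else m.group(3)
        value = value.strip()
        lang = None
        # Only source-word tags carry a language prefix in this sketch.
        if name in ("from", "wasei"):
            lm = LANG_RE.match(value)
            if lm:
                lang, value = lm.group(1), lm.group(2)
        tags.append({"type": name, "lang": lang, "text": value})
    # Whatever is left after removing the tags is treated as the plain gloss.
    remaining = " ".join(TAG_RE.sub(" ", text).split())
    return remaining, tags


if __name__ == "__main__":
    print(parse_tags('with [from="fr: avec"]'))
    print(parse_tags('[lit="refreshing oneself while the ogre is gone"]'))
```

Because every tag has the same [name=value] shape, adding "expl" costs one more entry in TAG_NAMES rather than a new parsing rule.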
> In fact the first two will end up as the same XML entity, but with
> different attributes,
Drifting off the topic of syntax for an input language,
and into data representation...
Taking entry 2164440, ブラフマン, as an example
(and twisting it a little so I can illustrate the point),
I don't see any difference in the information provided
by any of the following snippets of XML:
<lsource lang="sa">Brahman</lsource>
<gloss>Brahma</gloss>
<expl>Hinduism, the ultimate reality of the universe</expl>
<lit>Pretend there is a literal translation</lit>
and
<lsource lang="sa">Brahman</lsource>
<gloss>Brahma</gloss>
<gloss type="expl">Hinduism, the ultimate reality of the universe</gloss>
<gloss type="lit">Pretend there is a literal translation</gloss>
or even
<gloss type="lsource" lang="sa">Brahman</gloss>
<gloss>Brahma</gloss>
<gloss type="expl">Hinduism, the ultimate reality of the universe</gloss>
<gloss type="lit">Pretend there is a literal translation</gloss>
Any of these forms can be converted to another
without loss of information. I have no opinion on
which is the best for the XML file, and agree with
you that in the input language your suggestions will
encourage submitters to enter the right data, but
in the database, the third form wins, I think, because:
- It is simpler (only one gloss table needed instead
of separate gloss, lsource, lit, and expl tables).
- Simpler or more powerful searches: e.g. a search
for "hinduism" will find this entry without needing
to search through multiple tables.
- Conversely, can search only "pure glosses"
by adding a "not" expression to the WHERE
clause to exclude expl, lit, and lsource glosses.
- Can easily generate any of the above XML forms
from it.
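As a rough check on the claim that the three XML forms carry the same information, the following sketch generates each form from one list of typed rows and reads it back. The (type, lang, text) row layout, the wrapping <sense> element, and the function names are assumptions made for illustration only, not a description of the real database or DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows, as they might come out of a single typed table.
ROWS = [
    ("lsource", "sa", "Brahman"),
    ("gloss", None, "Brahma"),
    ("expl", None, "Hinduism, the ultimate reality of the universe"),
    ("lit", None, "Pretend there is a literal translation"),
]


def to_xml(rows, form):
    """Build a <sense> element from (type, lang, text) rows.

    form=1: one element name per type (<lsource>, <gloss>, <expl>, <lit>)
    form=2: <lsource> stays separate; expl/lit become <gloss type="...">
    form=3: everything is a <gloss>, with type="..." on non-plain rows
    """
    sense = ET.Element("sense")
    for kind, lang, text in rows:
        if kind == "gloss":
            tag, attrs = "gloss", {}
        elif form == 1 or (form == 2 and kind == "lsource"):
            tag, attrs = kind, {}
        else:
            tag, attrs = "gloss", {"type": kind}
        if lang is not None:
            attrs["lang"] = lang
        ET.SubElement(sense, tag, attrs).text = text
    return sense


def to_rows(sense):
    """Recover the rows from any of the three forms."""
    # The type is the explicit attribute if present, else the element name.
    return [(el.get("type", el.tag), el.get("lang"), el.text) for el in sense]


if __name__ == "__main__":
    for form in (1, 2, 3):
        sense = to_xml(ROWS, form)
        print(ET.tostring(sense, encoding="unicode"))
        assert to_rows(sense) == ROWS  # nothing is lost in either direction
```

The round trip works because the element name and the type attribute are just two spellings of the same field.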
This doesn't imply that an lsource waseieigo word
is a gloss or translation; rather, that using the
name "gloss" for the database table is misleading.
It would perhaps give a more accurate impression
if it were renamed "foreign_text", with each row having a
"type" field that indicates what kind of text (gloss, explanation,
literal, lsource, waseieigo) is in that row.
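A minimal sketch of what such a table and the two kinds of search could look like, using SQLite from Python. Only the table name "foreign_text" and the "type" field come from the text above; the other column names (entry, lang, txt), the helper function search, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE foreign_text (
        id    INTEGER PRIMARY KEY,
        entry INTEGER NOT NULL,  -- entry sequence number
        type  TEXT NOT NULL,     -- gloss, expl, lit, lsource, waseieigo
        lang  TEXT,              -- language code, where one applies
        txt   TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO foreign_text (entry, type, lang, txt) VALUES (?, ?, ?, ?)",
    [
        (2164440, "lsource", "sa", "Brahman"),
        (2164440, "gloss", None, "Brahma"),
        (2164440, "expl", None, "Hinduism, the ultimate reality of the universe"),
        (2164440, "lit", None, "Pretend there is a literal translation"),
    ],
)


def search(conn, word, pure_only=False):
    """Return entry numbers whose text contains word.

    With pure_only=True, a NOT IN clause restricts the search to plain glosses.
    """
    sql = "SELECT DISTINCT entry FROM foreign_text WHERE txt LIKE ?"
    if pure_only:
        sql += " AND type NOT IN ('expl', 'lit', 'lsource', 'waseieigo')"
    return [row[0] for row in conn.execute(sql, ("%" + word + "%",))]


if __name__ == "__main__":
    print(search(conn, "hinduism"))                  # searches every kind of text
    print(search(conn, "hinduism", pure_only=True))  # searches plain glosses only
```

Both searches touch a single table; the only difference between them is the extra condition on "type".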