Re: [edict-jmdict] xrefs in WWWJDIC
On 25/05/07, Stuart McGraw <smcg4191@frii.com> wrote:
Jim Breen wrote:
> [pos="n,vs"]
> [sense] bending
> indentation
> [sense] refraction
> [sense] inflection
Or how about a syntax more familiar to wwwjdic/edict users?:
(1) [n,vs] bending; indentation
(2) refraction
(3) inflection
Hmmm. I don't like the idea of locking ";" in to only meaning a gloss-
separator, but I guess it could be made escapable, e.g. use \; to put
";" within a gloss.
Starting a gloss with "([0-9])" may be safe (it doesn't happen at present),
but I'd hate to rule it out. Maybe [1] would be better.
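A minimal sketch of the escaping idea, assuming a sense line of the form "[1] gloss; gloss" in which "\;" stands for a literal ";" inside a gloss. The function names split_glosses and parse_sense_line are hypothetical, and nothing here is existing wwwjdic code:

```python
import re

# Optional leading sense number written as [1], [2], ...
SENSE_NUM = re.compile(r'^\s*\[(\d+)\]\s*')


def split_glosses(text):
    """Split text into glosses on every ';' that is not escaped as '\\;'."""
    # The negative lookbehind skips any ';' directly preceded by a backslash.
    parts = re.split(r'(?<!\\);', text)
    # Turn '\;' back into ';', trim whitespace, and drop empty pieces.
    return [p.replace('\\;', ';').strip() for p in parts if p.strip()]


def parse_sense_line(line):
    """Return (sense_number_or_None, list_of_glosses) for one sense line."""
    m = SENSE_NUM.match(line)
    number = int(m.group(1)) if m else None
    rest = line[m.end():] if m else line
    return number, split_glosses(rest)


print(parse_sense_line("[1] bending; indentation"))
print(split_glosses(r"either\; or; neither"))
```

This sketch does not handle an escaped backslash directly before a ";", so a real parser would need a fuller escaping rule.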
I don't think a pure edict format is good because of ambiguities
but it might be possible to come up with something very similar
that is unambiguously parsable and is not too rigid.
If no glosses can start with "[" then brackets could be used
to identify sense tags, as above. The misc, pos, and field
tags are all unique across all three groups so could be freely
intermixed, with or without their own sets of brackets:
[n,vs][col][comp]
[n,vs,col,comp]
[col,vs][comp][n]
are all unambiguous and would free the submitter from needing
to remember that "pos tags go before misc tags". Of course, this
would need a commitment to continue to maintain such uniqueness
among the tags, which may not be desirable. The misc and pos
tags seem unlikely to grow much, but the field tags could.
Yes, at present the PoS, misc, etc. tags don't overlap, but I doubt
that could, or should, be maintained forever. I rather like the
[name="values"] approach.
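To illustrate why the freely intermixed bracket forms are unambiguous only while the tag sets stay disjoint, here is a rough sketch. The three tag sets are tiny illustrative subsets (an assumption, not the real lists), and classify_tags is a hypothetical name:

```python
import re

# Illustrative subsets only; the real PoS, misc and field lists are much longer.
POS_TAGS = {"n", "vs", "exp"}
MISC_TAGS = {"col"}
FIELD_TAGS = {"comp"}

# The lookup below is only unambiguous while the three sets do not overlap.
assert not (POS_TAGS & MISC_TAGS or POS_TAGS & FIELD_TAGS or MISC_TAGS & FIELD_TAGS)

BRACKET_GROUP = re.compile(r'\[([^\]]*)\]')


def classify_tags(text):
    """Sort every tag found in [..] groups into pos, misc, field or unknown."""
    result = {"pos": [], "misc": [], "field": [], "unknown": []}
    for group in BRACKET_GROUP.findall(text):
        for tag in group.split(","):
            tag = tag.strip()
            if not tag:
                continue
            if tag in POS_TAGS:
                result["pos"].append(tag)
            elif tag in MISC_TAGS:
                result["misc"].append(tag)
            elif tag in FIELD_TAGS:
                result["field"].append(tag)
            else:
                result["unknown"].append(tag)
    # Sort so that the order in which tags were typed does not matter.
    return {kind: sorted(tags) for kind, tags in result.items()}


print(classify_tags("[col,vs][comp][n]"))
```

If a tag ever appeared in two sets, the assertion near the top would fail, which is exactly the maintenance commitment in question.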
Some syntax is needed for stagr, stagk info. Maybe some
Japanese text and the word "only" inside brackets is enough.
How about [restr="かなことば"] or [restr="漢字言葉"] ?
One ambiguity in edict is between s_info and gloss:
Is "parenthesized text" in
(1) (parenthesized text) gloss
part of the gloss as in
[1000610] "いい年をして 【いいとしをして】 (exp) (in spite of) being old enough to know better"
or a sense.s_info comment as in:
[1565020] "吮癰舐痔 【せんようしじ】 (exp) (Chinese four-character phrase) brown-nosing;...
So that would need working out.
いい年をして wouldn't change, as the "(in spite of)" is part of the gloss.
For the 吮癰舐痔 entry, you could have:
[pos="exp",sensinf="Chinese four-character phrase"] brown-nosing;.....
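A sketch of how such a [name="value",...] prefix might be parsed, assuming values are always double-quoted and contain neither '"' nor ']'. The function name parse_attributed_sense is hypothetical, and the attribute names are taken as typed rather than checked against any list:

```python
import re

# Leading [ ... ] block followed by the gloss text.
HEADER = re.compile(r'^\s*\[(.*?)\]\s*(.*)$', re.S)
# name="value" pairs; quoting lets a value such as "n,vs" contain commas.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_attributed_sense(line):
    """Return (attributes, gloss_text) for one sense line."""
    m = HEADER.match(line)
    if not m:
        # No bracketed prefix: the whole line is gloss text.
        return {}, line.strip()
    attrs = dict(ATTR.findall(m.group(1)))
    return attrs, m.group(2)


print(parse_attributed_sense(
    '[pos="exp",sensinf="Chinese four-character phrase"] brown-nosing'))
print(parse_attributed_sense(
    '(in spite of) being old enough to know better'))
```

Under this scheme a parenthesized phrase at the start of a line is left in the gloss, and only text inside the brackets is treated as sense information.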
<sens_inf> is a bit of a hack. I wanted a field where text containing
Japanese could go, but which would not carry through to the EDICT
form (which can't have Japanese in the "English" region).
> This is a bit like Wikimedia's style. It may well be better than having a myriad
> of boxes.
I wondered about a parsed text approach because it is already
done to some degree on the current wwwjdic new/amend form
(e.g. sense numbers to distinguish different senses), and I can't
think of a good way to provide separate boxes without having, as
you say, a myriad of them, most of which will never be used,
or some fancy javascript/ajax or similar. The latter could do
things like hiding boxes or providing a new box (e.g. kanji) when
the existing box was filled out. But that approach would probably
take me about 3 years to implement. :-)
Pavel's Ruby-on-Rails prototype had all those hidden fields that
opened up when clicked on. Effective, but a bit scary.
I'm a bit torn on this. I suspect having PoSs, etc. as parseable
thingos inside a text box is likely to be the fastest way to get
a working system. OTOH, when I let users type in their own PoS codes
from a supplied list, I got all sorts of garbage. Only the drop-down
PoS lists fixed that. Still, if the input is parsed and the user gets an
immediate response with errors highlighted, it may well work.
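As a rough illustration of the "immediate response with errors highlighted" idea, the sketch below checks typed PoS codes against a known list and marks the offending text. KNOWN_POS is an illustrative subset, and both function names are hypothetical:

```python
# Illustrative subset only; a real form would load the full PoS list.
KNOWN_POS = {"n", "vs", "exp"}


def check_pos_codes(value):
    """Return (code, start_offset) for each comma-separated code not in KNOWN_POS."""
    errors = []
    offset = 0
    for piece in value.split(","):
        code = piece.strip()
        if code not in KNOWN_POS:
            # Skip leading spaces so the offset points at the code itself.
            start = offset + (len(piece) - len(piece.lstrip()))
            errors.append((code, start))
        offset += len(piece) + 1  # +1 for the comma that was split away
    return errors


def highlight(value, errors):
    """Return the input plus a second line with '^' under each bad code."""
    marks = [" "] * len(value)
    for code, start in errors:
        for i in range(start, start + len(code)):
            marks[i] = "^"
    return value + "\n" + "".join(marks).rstrip()


typed = "n, verb,vs"
print(highlight(typed, check_pos_codes(typed)))
```

A web form would presumably mark up the bad codes in HTML rather than with carets, but the check itself would be the same.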
So, do we proceed with a 3-text-box model? I'd be prepared to give it a go.
Jim
--
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/