RE: [edict-jmdict] xrefs in WWWJDIC

To: <edict-jmdict@***************>
Subject: RE: [edict-jmdict] xrefs in WWWJDIC
From: "Stuart McGraw" <smcg4191@********>
Date: Sat, 26 May 2007 16:13:35 -0600
Importance: Normal
Comments in-line, apologies for the length...

Jim Breen wrote:
> On 25/05/07, Stuart McGraw <smcg4191@frii.com> wrote:
> > Jim Breen wrote:
> >  > [pos="n,vs"]
> >  > [sense] bending
> >  indentation
> >  > [sense] refraction
> >  > [sense] inflection
> >
> >  Or how about a syntax more familiar to wwwjdic/edict users?:
> >
> >  (1) [n,vs] bending; indentation
> >  (2) refraction
> >  (3) inflection
> 
> Hmmm. I don't like the idea of locking ";" in to only meaning a gloss-
> separator, but I guess it could be made escapable, e.g. use \; to put
> ";" within a gloss.

Would a different character be better?  '\n' is also available, i.e.
glosses could start on a new line, but I admit a personal bias
against cash-register-receipt text.  I thought ";" might be good 
because of its familiarity from its current use in edict format.
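
A minimal sketch of the escaping idea, assuming ";" separates glosses and "\;" puts a literal ";" inside a gloss. The function name split_glosses and its parameters are hypothetical, chosen only for illustration:

```python
def split_glosses(text, sep=";", esc="\\"):
    """Split text on unescaped separators; an escaped separator stays in the gloss."""
    glosses, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == esc:
            # Keep the character after the escape literally (e.g. "\;" -> ";").
            # A trailing escape with nothing after it is kept as-is.
            current.append(next(chars, esc))
        elif ch == sep:
            glosses.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    glosses.append("".join(current).strip())
    # Drop empty glosses produced by a trailing or doubled separator.
    return [g for g in glosses if g]


print(split_glosses(r"bending; indentation"))
print(split_glosses(r"wait\; then go; stop"))
```

The same loop would work for any other single-character separator, so the choice between ";" and something else does not change the parsing effort much.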

> Starting a gloss with "([0-9])" may be safe (it doesn't happen at present),
> but I'd hate to rule it out. Maybe [1] would be better.

"[1]" is ok too.  To me, "1. something" seems a little
more natural than "[1] something" but since we are proposing
brackets for other things, their use here also is not too bad.
Could also use parens, "(1)" which might be better than both 
given it's used in wwwjdic currently. 

If each sense is required to be on a single line, then
a gloss starting with a number is not a problem if starting 
a sense with a sense number is mandatory.  

1. 1. This gloss starts with a number but there is no confusion with the sense number.

If one allows defaulting the first sense number in order 
to simplify the most common case of only one sense, then
yes, one can't tell if "1." in the text is supposed to be the sense
number or part of the gloss, and one would need to prohibit
(or demand the escaping of) glosses that start with "1.", at 
least in the first gloss (or use some other disambiguation 
method).  Same is true for "[1]" and "(1)" but I 
suppose the probability of either of those strings occurring
at the front of a gloss is smaller.

It is also a problem if glosses in a sense can be distributed 
over multiple lines (desirable I think), at least if there is no 
continuation line concept.  But one could adopt a simple 
and natural continuation line convention, leading white space:

1. this is gloss-1 of sense-1; gloss-2 of sense-1;
 gloss-3 of sense-1
2. gloss-1 of sense-2;
this line is a syntax error

I think that looks good, is natural and intuitive, has a syntax 
that is easy to explain, and allows glosses with leading text
that look like a sense number.
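
Here is a rough sketch of how that convention might be parsed, assuming a sense number is mandatory, a line with leading white space continues the previous sense, and ";" separates glosses. The names SENSE_START and parse_senses are hypothetical, and this is not the yacc grammar itself:

```python
import re

# A sense starts in column one with "<number>. "; anything else in column one is an error.
SENSE_START = re.compile(r"^(\d+)\.\s+(.*)$")


def parse_senses(text):
    """Return a list of (sense_number, [glosses]); raise ValueError on a bad line."""
    senses = []  # each item: [number, accumulated gloss text]
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        m = SENSE_START.match(line)
        if m:
            senses.append([int(m.group(1)), m.group(2)])
        elif line[0] in " \t" and senses:
            # Leading white space marks a continuation of the previous sense,
            # even if the text that follows looks like a sense number.
            senses[-1][1] += " " + line.strip()
        else:
            raise ValueError(
                f"line {lineno}: expected a sense number or leading white space")
    return [(num, [g.strip() for g in body.split(";") if g.strip()])
            for num, body in senses]


example = (
    "1. this is gloss-1 of sense-1; gloss-2 of sense-1;\n"
    " gloss-3 of sense-1\n"
    "2. gloss-1 of sense-2;\n"
)
print(parse_senses(example))
```

Appending the line "this line is a syntax error" (no leading white space, no sense number) to the example makes parse_senses raise ValueError, which is the behaviour intended above.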

> >  I don't think a pure edict format is good because of ambiguities
> >  but it might be possible to come up with something very similar
> >  that is unambiguously parsable and is not too rigid.
> >
> >  If no glosses can start with "[" then brackets could be used
> >  to identify sense tags, as above.  The misc, pos, and field
> >  tags are all unique across all three groups so could be freely
> >  intermixed, with or without their own sets of brackets:
> >
> >  [n,vs][col][comp]
> >     [n,vs,col,comp]
> >     [col,vs][comp][n]
> >
> >  are all unambiguous and would free the submitter from needing
> >  to remember, "pos tags go before misc tags"..  Of course this
> >  would need a commitment to continue to maintain such uniqueness
> >  among the tags which may not be desirable.  The misc and pos
> >  tags seem unlikely to grow much but the field tags could.
> 
> Yes, at present the PoS, misc, etc. tags don't overlap, but I doubt
> that could, or should, be maintained forever. I rather like the
> [name="values"] approach.

OK, if they can overlap, that will complicate things.
Such is life :-)
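
To make the dependence on uniqueness concrete, here is a small sketch of the bare-tag approach from the quoted text. The tag sets are small illustrative subsets, not the full lists, and the names TAG_GROUPS and classify_tags are hypothetical. A tag that belonged to two groups would be rejected as ambiguous, which is exactly the complication overlapping tags would bring:

```python
import re

# Small illustrative subsets only; the real tag lists are much longer.
TAG_GROUPS = {
    "pos": {"n", "vs", "exp", "adj-na"},
    "misc": {"col", "uk", "arch"},
    "fld": {"comp", "math", "med"},
}


def classify_tags(text):
    """Sort tags from leading [..] groups into categories; return (tags, rest)."""
    found = {group: [] for group in TAG_GROUPS}
    pos = 0
    while True:
        m = re.match(r"\s*\[([^\]]*)\]", text[pos:])
        if not m:
            break
        for tag in (t.strip() for t in m.group(1).split(",")):
            owners = [g for g, tags in TAG_GROUPS.items() if tag in tags]
            if len(owners) != 1:
                # Zero owners: unknown tag. Two or more: the groups overlap.
                raise ValueError(f"unknown or ambiguous tag: {tag!r}")
            found[owners[0]].append(tag)
        pos += m.end()
    return found, text[pos:].strip()


for spelling in ("[n,vs][col][comp] bending",
                 "[n,vs,col,comp] bending",
                 "[col,vs][comp][n] bending"):
    print(classify_tags(spelling))
```

All three spellings yield the same tags per category (in a different order within "pos" for the third), so the submitter's ordering does not matter as long as the sets stay disjoint.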

> >  Some syntax is needed for stagr, stagk info.  Maybe some
> >  Japanese text and the word "only" inside brackets is enough.
> 
> How about [restr="かなことば"] or [restr="漢字言葉"] ?
> 
> >  One ambiguity in edict is between s_info and gloss:
> >  Is "parenthesized text" in
> >
> >  (1) (parenthesized text) gloss
> >
> >  part of the gloss as in
> >
> >  [1000610] "いい年をして 【いいとしをして】 (exp) (in spite of) being old enough to know better"
> >
> >  or a sense.s_info comment as in:
> >
> >  [1565020] "吮癰舐痔 【せんようしじ】 (exp) (Chinese four-character phrase) brown-nosing;...
> >
> >  So that would need working out.
> 
> いい年をして wouldn't change, as the "(in spite of)" is part of the gloss.
> For the 吮癰舐痔 entry, you could have:
> 
> [pos="exp",sensinf="Chinese four-character phrase"] brown-nosing;.....

Two of the key points that make wiki markup usable, it seems to me, are 
 * Simple
 * Mnemonic
Thus things like * for bulleted text or n.m for outline numbered text.

The quotes, "=" characters and attribute-like meta-keywords (e.g. 
"pos") give that string an xml'ish feel, at least to me, and seem contrary 
to the simple/mnemonic concept, although they do make it easy for a 
machine to parse.  That's why I was looking for something that would 
appear natural and leverage, if possible, many users' existing 
knowledge of the edict format.  (The obvious danger there is confusion 
from a syntax that is close but not identical.)

Another possibility might be to make use of "\n"s,

1. gloss

or

1. exp  / vs / ichi1
{sense info}
gloss;gloss;...

I.e. gloss follows sense number only if there is no additional
sense information; otherwise it goes on a separate line.
misc, pos, etc. info is given in a specific order, with "/" separating 
groups of each type.  "{}" are used here for distinguishing 
sens_inf text.

I guess I am still groping for something that an entry can 
be serialized to and from with no information loss, but which 
is also natural for a person to understand and easy to write.
I think the [pos="...",...] syntax meets the former criterion 
but I am not so sure about the latter.
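
As a rough illustration of the "no information loss" criterion for the [name="value",...] syntax, the sketch below parses a sense line and writes it back out. It assumes values contain no double quotes and that ";" separates glosses. The names HEADER, ATTR, parse_sense and format_sense are hypothetical:

```python
import re

# Optional leading [name="value",...] header, then the gloss text.
HEADER = re.compile(r'\s*\[((?:\w+="[^"]*",?\s*)*)\]\s*(.*)$')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_sense(line):
    """Parse '[name="value",...] gloss; gloss' into (attrs dict, gloss list)."""
    m = HEADER.match(line)
    if m is None:
        attrs, body = {}, line  # no header: the whole line is gloss text
    else:
        attrs, body = dict(ATTR.findall(m.group(1))), m.group(2)
    glosses = [g.strip() for g in body.split(";") if g.strip()]
    return attrs, glosses


def format_sense(attrs, glosses):
    """Inverse of parse_sense for lines written in the canonical spacing."""
    header = ",".join(f'{name}="{value}"' for name, value in attrs.items())
    prefix = f"[{header}] " if attrs else ""
    return prefix + "; ".join(glosses)


line = '[pos="exp",sensinf="Chinese four-character phrase"] brown-nosing'
attrs, glosses = parse_sense(line)
print(attrs, glosses)
print(format_sense(attrs, glosses) == line)
```

A line such as "(in spite of) being old enough to know better" has no bracketed header, so the parenthesized text stays part of the gloss, which removes the s_info/gloss ambiguity discussed earlier.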

However, I keep in mind that you're the one who will have 
to live with it (at least until it can be changed) so if you 
are sure you want to go that way, that is the way I will 
build it.

> <sens_inf> is a bit of a hack. I wanted a field where text containing
> Japanese could go, but which would not carry through to the EDICT
> form (which can't have Japanese in the "English" region).

Maybe it has morphed from your original intent but it doesn't 
seem like a hack to me.  Not that I have a lot, but virtually all 
my dictionaries have entries with per-sense non-translational 
information.  If there wasn't a sens_inf field in JMdict, I would 
have added one to the database anyway.

> >  > This is a bit like Wikimedia's style. It may well be better than having a myriad
> >  > of boxes.
> >
> >  I wondered about a parsed text approach because it is already
> >  done to some degree on the current wwwjdic new/amend form
> >  (e.g. sense numbers to distinguish different senses), and I can't
> >  think of a good way to provide separate boxes without having, as
> >  you say, a myriad of them, most of which will never be used,
> >  or some fancy javascript/ajax or similar.  The latter could do
> >  things like hiding boxes or providing a new box (e.g. kanji) when
> >  the existing box was filled out.  But that approach would probably
> >  take me about 3 years to implement. :-)
> 
> Pavel's Ruby-on-Rails prototype had all those hidden fields that
> opened up when clicked on. Effective, but a bit scary.

An aspect of that prototype I did not like, and which
is probably influencing my thoughts now, is that it mirrored
the jmdict xml too closely.   This is fine for you and other
core editors who are intimately familiar with the jmdict format 
but seems to me to be too obscure for non-techies who just 
want to submit a new wwwjdic entry.  Even for potential 
editors/approvers I think there are people on this list with the 
requisite Japanese knowledge but not of the jmdict structure.
Obviously it is not difficult to learn but the fewer barriers,
the better.

> I'm a bit torn on this. I suspect having PoSs, etc. as parseable
> thingos inside a text box is likely to be the fastest way to get
> a working system. OTOH when I let users type in their own PoS codes
> from a supplied list, I got all sorts of garbage. Only the drop-down
> PoS lists fixed that. Still, if the input is parsed and user gets an
> immediate response with errors highlighted, it may well work.

Yes, I think that is key.  Submissions will be parsed and 
checked for syntactic validity before they get to the database
and the submitter told in detail what is wrong.  Another key 
is avoiding frustration during entry by making the syntax 
rules and keywords easily accessible: links to popup help 
near the entry box, maybe in a sidebar, etc.  

> So, do we proceed with a 3-text-box model? I'd be prepared to give it a go.

Well, I already started writing something along the lines 
above.  The parsing is based on a simple yacc grammar 
so hopefully it will be relatively easy to change, provided 
the changes continue to result in a context-free, one-token-
lookahead grammar.

I hope not to spend too much time on this because even if 
suboptimal, it can be fine tuned or changed later.  I remain 
more concerned about the issues of diff'ing hierarchical data 
in entries, and the nitty-gritty details of the approval process 
mechanics which are still a little vague in my mind.  But we 
need a way to submit modifications first.