
Re: [edict-jmdict] RE: parsing submission data



G'day all,

Back home now, and a huge amount of catching-up to do...

On 12/07/07, Stuart McGraw <smcg4191@frii.com> wrote:

Hope you {are having / have had} a very enjoyable trip, Jim.

Great. Wait till you see the pictures (about 1500 of them - my wife has
a new camera). I managed to survive two weeks without even looking at email.

 On June 20, 2007, Jim Breen wrote:

 > - for setting up the tagging on kana, we could have something like
 > [rinf=ik,restr=KANJI] or [rinf=ik,restr="KANJI1,KANJI2"]

 In other places in the language (e.g. note="...", lsrc="...", etc.),
 quoted strings are treated as single lexical tokens.  It seemed
 best to maintain that behaviour consistently throughout
 the language.  It is also simpler to do it that way because if not,
 then either one has to add yet another state to the lexer (i.e.
 make it behave differently when parsing a quoted string following
 a "restr=" or "xref=" tag than in other places), or do 2 levels of
 parsing: use a second parser to reparse the quoted string when
 it occurs in a restr or xref value.
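
 A minimal sketch of the single-token approach described above, written in
 Python purely for illustration. The token names, the regular expression and
 the tokenize function are assumptions, not the lexer actually used on the
 site. The point it shows is that a quoted string comes out as one token, so
 the comma inside "KANJI1,KANJI2" never reaches the tag-level grammar.

```python
import re

# Hypothetical token pattern (an assumption, not the real lexer's rules).
TOKEN_RE = re.compile(r'''
    "(?P<string>[^"]*)"          # a quoted string is one token, whatever it contains
  | (?P<punct>[\[\]=,;])         # single-character punctuation
  | (?P<word>[^\s\[\]=,;"]+)     # a bare word such as a tag name or value
''', re.VERBOSE)


def tokenize(text):
    """Return a list of (kind, value) pairs; whitespace between tokens is skipped."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup          # name of the alternative that matched
        tokens.append((kind, match.group(kind)))
    return tokens


if __name__ == "__main__":
    for token in tokenize('[rinf=ik,restr="KANJI1,KANJI2"]'):
        print(token)
```

 With one rule for quoted strings everywhere, no extra lexer state is needed
 after "restr=" or "xref="; the cost is that a list inside the quotes would
 have to be split in a second pass.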

 > -I wonder about having semicolons inside the [....] structures

 Without the quotes (which were awkward as I mentioned above)
 commas were a problem because in "adv,restr=str1,str2,adj" for
 example, is "str2" part of the restr list, or an independent tag
 like "adj"?  One could look at "str2" and disambiguate depending
 on whether it contained Japanese characters (since all restr and
 xref values will have Japanese text values and none of the other
 tags do) but using something other than a comma seemed simpler:
 "id,restr=str1;str2,adj" is not ambiguous.  Semicolons seemed best
 for this since that is what is used to separate the kanji and kana items
 in the kanji and reading sections so there is some degree of
 consistency in that choice.
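
 To illustrate why the semicolon removes the ambiguity, here is a small
 Python sketch. The function name parse_tags and the (name, values) result
 shape are hypothetical and are not taken from the actual parser. Commas
 separate tags only, and semicolons separate the values of a single tag, so
 neither split needs to inspect the text for Japanese characters.

```python
def parse_tags(body):
    """Split a bracket body such as 'id,restr=str1;str2,adj' into (name, values) pairs.

    Assumed convention: ',' separates tags, ';' separates values within one tag.
    """
    tags = []
    for item in body.split(","):
        name, sep, value = item.partition("=")
        # A tag without '=' (e.g. 'adj') carries no values.
        values = [v.strip() for v in value.split(";")] if sep else []
        tags.append((name.strip(), values))
    return tags


if __name__ == "__main__":
    print(parse_tags("id,restr=str1;str2,adj"))
```

 Applied to the comma-only form "adv,restr=str1,str2,adj", the same function
 would report "str2" as a separate tag, which is exactly the ambiguity
 described above.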

Yes, that's quite clear. Semicolons make a lot of sense there.

 > - I guess there could be an implied [1] at the start of the senses box. An empty
 > box could even be preloaded with it.

 Already done :-)  Before I left, I changed the database on Arakawa
 to add an Edit button to each displayed entry.  The Edit and New Entry
 (http://www.edrdg.org/~smg/cgi-bin/edform.pl) pages were changed
 to use the input language parser and I did what you suggest -- the
 Sense box is preloaded with "[1][n]".

 The Help links on those pages don't work yet but the language is the
 same as described in the original test page:
   http://www.edrdg.org/~smg/cgi-bin/jbparser.pl
 and the pages work -- you can add and edit entries using them (although
 I didn't have time to give them any real testing).  I'll start working on this
 again in a few days (just got home yesterday and need a few days to
 unwind :-)

I know the feeling. I need to get my circadian rhythm sorted out too. Waking at
3am is not great.

Jim
--
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/