Re: [edict-jmdict] RE: parsing submission data
G'day all,
Back home now, and a huge amount of catching-up to do...
On 12/07/07, Stuart McGraw <smcg4191@frii.com> wrote:
Hope you {are having / have had} a very enjoyable trip, Jim.
Great. Wait till you see the pictures (about 1500 of them - my wife has
a new camera). I managed to survive two weeks without even looking at email.
On June 20, 2007, Jim Breen wrote:
> - for setting up the tagging on kana, we could have something like
> [rinf=ik,restr=KANJI] or [rinf=ik,restr="KANJI1,KANJI2"]
In other places in the language (e.g. note="...", lsrc="...", etc.),
quoted strings are treated as single lexical tokens. It seemed
more consistent to maintain that behaviour throughout
the language. It is also simpler to do it that way because if not,
then either one has to add yet another state to the lexer (i.e.
make it behave differently when parsing a quoted string following
a "restr=" or "xref=" tag than in other places), or do 2 levels of
parsing: use a second parser to reparse the quoted string when
it occurs in a restr or xref value.
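
To illustrate the single-token treatment of quoted strings, here is a minimal
sketch in Python. It is not the parser behind the edform.pl or jbparser.pl
pages; the token names (QUOTED, WORD, PUNCT) and the exact character classes
are assumptions made for the example. The point is only that the lexer needs
no extra state: a quoted string comes out as one token regardless of which tag
precedes it.

```python
import re

# Hypothetical token patterns; the real input language may differ.
TOKEN_RE = re.compile(r'''
    (?P<QUOTED>"[^"]*")         # a quoted string is one token, whatever it contains
  | (?P<WORD>[^\s\[\]=,;"]+)    # a bare word such as rinf, ik, or a kanji string
  | (?P<PUNCT>[\[\]=,;])        # single-character punctuation
  | (?P<WS>\s+)                 # whitespace, skipped below
''', re.VERBOSE)

def tokenize(text):
    """Split text into (kind, value) pairs using one lexer state."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at position %d" % pos)
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

With this sketch, the commas inside note="a,b" stay inside the QUOTED token,
so a later stage that wanted a list from a quoted restr value would have to
parse that string a second time, which is the two-level parsing described
above.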
> - I wonder about having semicolons inside the [....] structures
Without the quotes (which were awkward as I mentioned above)
commas were a problem because in "adv,restr=str1,str2,adj" for
example, is "str2" part of the restr list, or an independent tag
like "adj"? One could look at "str2" and disambiguate depending
on whether it contained Japanese characters (since all restr and
xref values will have Japanese text values and none of the other
tags do) but using something other than a comma seemed simpler:
"id,restr=str1;str2,adj" is not ambiguous. Semicolons seemed best
for this since that is what is used to separate the kanji and kana items
in the kanji and reading sections so there is some degree of
consistency in that choice.
Yes, that's quite clear. Semicolons make a lot of sense there.
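
The comma/semicolon split discussed above can be sketched in a few lines of
Python. This is only an illustration, not the code used on the edit pages: the
function name parse_tags is hypothetical, and the sketch ignores quoted
strings and the surrounding brackets. Commas separate tags, and semicolons
separate the values of a single tag, so no lookahead or character-set test is
needed.

```python
def parse_tags(body):
    """Parse the inside of a [...] tag list into (name, values) pairs.

    Assumes commas only ever separate tags and semicolons only ever
    separate the values of one tag (no quoted strings are handled).
    """
    tags = []
    for item in body.split(","):
        name, sep, value = item.strip().partition("=")
        if sep:
            # Tag with values, e.g. restr=str1;str2
            tags.append((name, [v.strip() for v in value.split(";")]))
        else:
            # Plain tag with no value, e.g. adj
            tags.append((name, []))
    return tags
```

For "adv,restr=str1;str2,adj" this yields three tags, with str1 and str2 both
attached to restr. Feeding it the comma-only form "adv,restr=str1,str2,adj"
yields four tags, with str2 treated as a tag of its own, which is the
ambiguity the semicolons remove.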
> - I guess there could be an implied [1] at the start of the senses box. An empty
> box could even be preloaded with it.
Already done :-) Before I left, I changed the database on Arakawa
to add an Edit button to each displayed entry. The Edit and New Entry
(http://www.edrdg.org/~smg/cgi-bin/edform.pl) pages were changed
to use the input language parser and I did what you suggest -- the
Sense box is preloaded with "[1][n]".
The Help links on those pages don't work yet but the language is the
same as described in the original test page:
http://www.edrdg.org/~smg/cgi-bin/jbparser.pl
and the pages work -- you can add and edit entries using them (although
I didn't have time to give them any real testing). I'll start working on this
again in a few days (just got home yesterday and need a few days to
unwind :-)
I know the feeling. I need to get my circadian rhythm sorted out too. Waking at
3am is not great.
Jim
--
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/