
Re: [edict-jmdict] Re: code to parse/format edict format



On 24/05/07, wmaton <wmaton@yahoo.com> wrote:
--- In edict-jmdict@yahoogroups.com, "Stuart McGraw" <smcg4191@...> wrote:
 > Does anyone have or know where I can get
 > Perl (or other language) code that will parse
 > or format edict-style text?

 There's a C language example in Jeffrey Friedl's lookup program, which
 serves as the backend of the J-E server I run.  You can find it at any
 Nihongo mirror near you, or in my unofficial patch collection page at:

 http://www.wfms.org/lookup/

 I do remember that there was a list version someplace, but that was
 eons ago.

 Also, Jim Breen's xjdict (c'mon Jim, you know it still works....!)
 is another example in C.

The xjdic tarball is on the Monash ftp site. Like WWWJDIC, it doesn't do much
meaningful parsing of the EDICT entry; it just tweaks a few things to make the
output look better, e.g. converting "/" to ";". The indexer does parse the
entry, but it only looks for text strings or whitespace.
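
As a rough illustration of that kind of cosmetic tidy-up (not xjdic's or
WWWJDIC's actual code), here is a minimal Python sketch. The function name
display_form is made up for this example, and it assumes the glosses are
everything after the first "/" on the line.

```python
def display_form(line):
    """Turn 'HEAD /gloss1/gloss2/' into 'HEAD gloss1; gloss2'."""
    # Everything before the first slash is the headword/reading part.
    head, sep, body = line.partition("/")
    if not sep:
        # No slash at all: nothing to tidy, return the line unchanged.
        return line.strip()
    # Split the rest on "/" and drop the empty piece after the final slash.
    glosses = [g.strip() for g in body.split("/") if g.strip()]
    return head.strip() + " " + "; ".join(glosses)


if __name__ == "__main__":
    print(display_form("KKK [rrr] /(n) gloss one/gloss two/"))
```

Note that nothing here interprets the tags in parentheses; they pass through
as ordinary text, which matches the "not much meaningful parsing" description.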

Probably the most useful parser I have is the one that converts EDICT-format
entries into my internal database. It takes something like:

KKK1;KKK2 [rrr1;rrr2] /(n,vs) (obsc) (see KKK3) blah/

And turns it into:

#E nnnnnnn
KKK1
KKK2
#R
rrr1
rrr2
#AU 2007-05-06 Entry created
#M
#G n vs
#X KKK3
#MI obsc
blah

I have put the (totally undocumented) source of this at:
http://www.csse.monash.edu.au/~jwb/buildzip.zip

That .zip also has a file called "tags", which is used to
sort out which kind of code each of n, vs, obsc, etc. is.
It's not tested yet with the new "#SL" markup, which
generates the <lsource...> elements.
Undocumented "features" of the utility are:
- it only does tags, xrefs, etc. properly (more or less) for
the first sense. I have to fix things by hand later for multi-sense
conversions;
- for multiple senses, it needs the later senses to be flagged by a "#".
When converting a batch I just use an editor to change:

.... /(1) sense 1/(2) sense 2/

to

.... /sense 1/#sense 2/
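
The editor step above could also be scripted. The following Python sketch is
one possible way to do it, not part of the utility itself. The function name
flag_later_senses is made up, and it assumes each sense number sits directly
after a slash, as in the example; a line where a tag such as (n) comes
between the slash and the (1) would pass through unchanged.

```python
import re

# A slash followed immediately by a sense number such as "(2)".
SENSE_NUM_RE = re.compile(r"/\((\d+)\)\s*")


def flag_later_senses(line):
    """Drop '(1)' and replace '(2)', '(3)', ... with a '#' flag."""
    def repl(match):
        # Sense 1 loses its number; every later sense gets "#".
        return "/" if match.group(1) == "1" else "/#"
    return SENSE_NUM_RE.sub(repl, line)


if __name__ == "__main__":
    print(flag_later_senses("KKK [rrr] /(1) sense 1/(2) sense 2/"))
```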

I'm sure there are other things I've overlooked.

Hope this helps

Jim

--
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/