
Re: Wishlist item (Re: [edict-jmdict] "P" Markers - Google as corpus?)



Dear Olivier et al,

I think this discussion ended some time ago with me caving in to share the parsed yomigana data, but not caving in to share the code itself - so a complete set of yomigana for all of EDICT is not a problem. However, what I can share now is probably not useful. I haven't run a parse of EDICT in over a year (some time last year my wife managed to eject my main computer - my life, really - out of a Jeep Liberty during a horrific wreck, wiping out my productivity for some time). I am, however, slowly getting to the point where I'll soon be back to parsing EDICT on a daily basis (mostly because I would like to see Ice Mocha users be able to contribute in a meaningful way to the enhancement of the Tanaka Corpus that Paul Blay and Prof. Breen are constantly improving). At that point, perhaps we should agree on a format for sharing yomigana data back to EDICT/JMDICT.

For reasons mentioned before, I store yomigana data in forms like "z1.44.z15.13". The internal format used now assumes a 4-to-1 font display ratio, because there isn't a single instance in EDICT of a kanji reading exceeding 4 kana when it's in a compound with another kanji (the same is not true for kanji outside compounds, but those don't have the same geometric restrictions on yomigana display): 48px words to 12px yomigana, or 36px to 9px, etc. But you could also store it with just one zero for "bins" where the symbol is kana, and the actual kana where the symbol is a kanji, i.e.:

お祖父さん おじいさん 0.じ.い.0.0

So in this case I picked a euphonic yomigana... because 祖 by itself contributes い and not じ, and 父 by itself contributes neither い nor じ. The reading じい really only applies to the two kanji together. In this case Ice Mocha will divide the reading evenly between 祖 and 父, but there are many idiomatic cases in EDICT where you have, say, two kana to divide among three kanji. So another benefit of using multiple zeroes to hold places is that you can reposition the kana in a fairer, more evenly distributed fashion. Unlike plain word-to-reading mappings, yomigana also have to address spatial relationships.
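
To make that concrete, here is a rough Python sketch of how the two storage forms above could be unpacked. It is illustrative only - the function names are made up and this is not Ice Mocha's internal code:

    # "0.じ.い.0.0" form: one dot-separated bin per symbol of the word;
    # "0" means that symbol contributes no yomigana of its own.
    def pair_bins(word, bins):
        cells = bins.split(".")
        assert len(cells) == len(word), "one bin per symbol"
        return [(ch, "" if b == "0" else b) for ch, b in zip(word, cells)]

    print(pair_bins("お祖父さん", "0.じ.い.0.0"))
    # [('お', ''), ('祖', 'じ'), ('父', 'い'), ('さ', ''), ('ん', '')]

    # "z1.44.z15.13" form: a compressed row of 12px cells, four cells per
    # 48px symbol; "zN" means N blank cells, a bare number is a kana index.
    def expand_z(packed):
        cells = []
        for item in packed.split("."):
            if item.startswith("z"):
                cells.extend([None] * int(item[1:]))   # blank 12px cells
            else:
                cells.append(int(item))                # kana number
        return cells

    print(expand_z("z1.44.z15.13"))
    # 18 cells: [None, 44, None, None, ..., None, 13]

Every group of four cells then sits above one 48px symbol, and dividing a zN run by 4 gives the number of full-width blanks when real text is used instead of images, as described in the quoted post further down.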

I think in Ice Mocha's development notes I mentioned that my first attempt at building a yomigana parser was very complex... perhaps like your AI tail recursion, since "AI" seems to indicate complexity? Extending string theory is simple and much more powerful - and from my limited knowledge of computer science, it would be classified as a "genetic algorithm". Nevertheless, idioms and euphony require enhanced kanji reading data not available from KANJIDIC... so, string theory or no, EDICT cannot be completely processed without human intervention, at least in the development of supporting files.
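
If it helps to see the extend-and-prune idea in code, here is a rough Python sketch. The reading table is a toy stand-in for the enhanced KANJIDIC-plus-euphonic-variants data, and the code is mine, not Ice Mocha's:

    # Toy reading table: per-kanji readings, including a euphonic variant
    # and the "linguistic zero" (empty string) used for idiomatic compounds.
    READINGS = {
        "祖": ["そ", "じ", "じい"],
        "父": ["ちち", "とう", "い", ""],
    }

    def parse(word, reading):
        """Grow candidate reading strings one symbol at a time, pruning any
        that can no longer be a prefix of the target reading."""
        candidates = [("", [])]                # (string so far, bins so far)
        for ch in word:
            options = READINGS.get(ch, [ch])   # kana contribute themselves
            survivors = []
            for built, bins in candidates:
                for r in options:
                    grown = built + r
                    if reading.startswith(grown):      # prune the impossible
                        survivors.append((grown, bins + [r]))
            candidates = survivors
        return [bins for built, bins in candidates if built == reading]

    print(parse("お祖父さん", "おじいさん"))
    # [['お', 'じ', 'い', 'さ', 'ん'], ['お', 'じい', '', 'さ', 'ん']]

With properly curated reading data the goal is that exactly one candidate survives to the end; the two survivors above are exactly the kind of euphonic/idiomatic ambiguity that human-built supporting files resolve.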

I also applied the algorithm to ENAMDICT... talk about insufficient euphonic and idiomatic data! It would take half a man year of enhancing the supporting files to parse ENAMDICT. EDICT is straightforward in comparison. My euphonic collection thus far doesn't even come close to being sufficient.

Anyway, if the interest in having yomigana data wrapped up into JMDICT/EDICT is real, we should have some discussion about how to store it.

Jim


On Jul 31, 2007, at 12:03 PM, olivier.binda wrote:

Greetings everyone.

I was browsing through the old posts and happened to stumble on this
most interesting one (quoted below).

I might be able to help a little with this problem:
for the Japanese/English/German-French dictionary I'm making,
I have coded a function that
takes a hiragana string S1 and a hiragana+kanji string S2 and
1) tells whether the strings S1 and S2 can be paired
2) if they can be paired, outputs the yomigana of the kanji in string S2

And well, when I read Jim Rose's post, I recognized the way I had
implemented my function.
So in the end, it took three and a half years to rediscover extending
string theory. (Btw, it is a much better name than "Awful but Fun Function
that does AI tail Recursion", a.k.a. AFFAIR, the one that I chose ^_^)...

It works quite well: I tested it on the JMdict words that have
French/English/German translations and that DON'T have the (JF2) tags
(don't ask me why! ^_^) and...

out of 14676 JMdict entries (containing 1624 kana-only entries), it
managed to produce 18047 pairs, with only
1880 failed pairing attempts (including 1038 kanji combinations that
couldn't be paired).

So basically, its success rate is around 95%.
And this is only the first version of this function (it took me around 5
days to implement and debug in TeX... but then, TeX is a hell of a
language to code in ^_^).

I intend to upgrade this function, and version 2 will be able to
pair most strings of the type "kana + kanji with a kanjidic2
reading + 1 kanji with an irregular reading" (like お姉さん and 御兄さん,
Oneesan or Aniisan) that failed with the version 1 function.

But when trying to pair 今日 with きょう, it will still be impossible
to split the kana into two furigana strings...

Of course, some words should be handled by humans (like my senpai, Jim
Rose, did).

Well... let's get back to the point.
I'm willing to give away either the code and/or the data that I found,
because:
_it's not so hard to code anyway.
_done by one, useful to all.
_I don't have the courage of Jim Rose: I don't want to have to handle
the words that escape the scrutiny of my function
(besides, spending too much time typing hurts my back).
So... let's do 95% of the work automatically and let volunteers do the
remaining 5% progressively.
_I was going to suggest merging furigana information into JMdict anyway.

Let's think about the students (including me ^_^) and the young ones:

when considering a Japanese word, it is useful to know:
how it can be written with kanji
or
how it is written in kana
(this is how JMdict does things)

But

it is more useful to know:
how it can be written with kanji
and
how that way of writing the word with kanji can be written with kana
and
how the kana relate to the kanji (furigana)
(this is how I dream JMdict would do things... with grouped
kana (furigana) and kanji readings):
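
Something like this, purely illustrative (not an actual JMdict format),
sketched here as a Python structure:

    # One hypothetical way to group an entry's furigana: pair each kanji
    # or kana segment of the headword with the kana that read it.
    entry = {
        "keb": "お祖父さん",    # kanji element
        "reb": "おじいさん",    # reading element
        "furigana": [("お", ""), ("祖父", "じい"), ("さん", "")],
    }
    # The reading can be rebuilt by taking each segment's furigana when
    # present, or the segment itself when it is plain kana, and ruby text
    # can be rendered straight from the pairs.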

Olivier Binda

--- In edict-jmdict@yahoogroups.com, Jim Rose <jim@...> wrote:
>
> Stuart,
>
> So what happens is that you have a word. The word might be kana,
> might be all kanji, might be kana and kanji. You can parse the
> reading for the kanji from kanjidic, but you will have two large
> categories which will not parse.
>
> Idioms and words exhibiting euphony. You've already discussed the
> former. You can't parse those without human intervention - at least
> not the first time you encounter it as you are slogging through
> EDICT. Euphony can be handled, but you have to build a bigger,
> better data source... starting from kanjidic, but growing the reading
> data to include relevant euphonic variations. You don't do this by
> hand for each one, but you do have to do it by hand for each
> previously un-encountered euphonic kanji reading. This takes several
> man months, by the way, because there are so many possible euphonic
> twists in EDICT. One other trick Ice uses is to handle idioms as if
> they were euphony... by using the equivalent of the linguistic
> "zero". The reading of " ", or nothing at all... null... void...
> contributes no sound at all... and then assigning one of the kanji
> the idiomatic pronunciation if necessary.
>
> But the plain old parsing isn't trivial either. There are multiple
> ways to approach it. One of those ways is superior to all others:
>
> If you poke around Ice Mocha on KanjiCafe, you'll discover that I
> discovered the parsing algorithm while drinking an Ice Mocha... hence
> the name of the application. I call the technique "extending string
> theory". Its a bit like darwinian evolution, natural selection that
> is, and a kind of genetic algorithm. I've never discussed it
> anywhere, because I've made a bet with my myself that nobody else
> would figure it out for 5 years. Its been 3.
>
> Basically you are taking your enhanced data source of possible
> readings and in tandem with other bells and whistles, extending a
> test string. As each string grows, you test it to see if it can
> "survive" as a possible reading for the word. At each step you
> eliminate the impossible variants from your array. Then grow the
> survivors one more reading. It's trivial to show that eventually only
> one evolving string can survive all the way to the end. The process
> is fairly fast provided you've built a very robust data source which
> includes EDICT's many euphonic variants of kanji phonemes, and all
> the while you keep track of which "bins" the readings are in vis-a-
> vis the order of the symbols used to write the word - something you
> were starting to think about. The last time I used this algorithm to
> parse EDICT, I think it finished in a few minutes - but it was after
> countless hours of enhancing kanjidic's data - hence HARD - and it
> was anything but a pleasure. More like an addiction to seeing how
> much of my life I could waste proving to myself that I could do it.
>
> Japanese is expressed on your computer using fixed width fonts, and
> there are no multi-kanji words in EDICT with readings exceeding four
> kana symbols... so I use 48 px fonts for words, and 12 px fonts for
> furigana / yomigana. To center those in horizontal display when
> using the graphical imaging proxy mode, or top of the cell justify
> them in vertical, I keep track of their position using a system of 4
> spaces per kanji. Then I keep a record of the blank spaces using a
> compressing short-hand. For example I store the position of the
> yomigana using say z1.44.z15.13. This would tell ice to display kana
> 44 shifted one space to the right (if generating a yomigana image),
> then have 15 blank 12 px spaces, and display kana 13. Because the
> font is 12 px, every 4 kana or blank spaces would be above another 48
> px symbol in the word. If I wasn't in proxy mode, and using real
> Japanese text, I would convert a number like "z8" by division by 4
> into two blank spaces. So you see, storing this information comes
> after determining how to parse it.
>
> All this is probably more than you wanted to know?
>
> So you were right, it is possible to do... and I do in fact have code
> which will do it. But my code relies on many internal features of
> Ice Mocha... for example, the proxy server ordering of kanji which
> allows me to address most Japanese symbols by an internal KanjiCafe
> number. You achieved some of the rudimentary starting thoughts
> necessary to duplicate what I have accomplished. You understood that
> arrays are needed. You understood that you would have to prune
> impossible matches or you would get bogged down in impossible amounts
> of computation. You should also know that every time EDICT grows you
> may have to add more euphonic variants to your data that were not yet
> seen in older versions of EDICT in order to get a proper parse.
>
> Would I share my code and data set? Honestly, I don't think so. As
> I've been building a set of stroke order diagrams which are now 1,415
> in number, I've come to realize the personal sacrifice I've made in
> terms of time - countless man hours to give the world a set of SODs
> bigger and better than Halpern's, which are really only available
> through WWWJDIC. I'm making those available on the Internet to other
> web sites because a very few people have really helped out with the
> project. There have been volunteers and collaboration - but not from
> other web application developers as far as I can tell, and not until
> I spent several years figuring out the necessary software to allow
> the collaboration.
>
> But the Yomigana data is a powerful feature of Ice Mocha that is not
> yet seen on the Internet... not even on WWWJDIC... It's something I
> take great pride in and have worked extremely long, unpaid hours on. My only
> reward has been a tiny bit of Internet fame. Give my code away, and
> unlike the SODs and animated SODs, there is no visible attribution.
> People would stop seeing Ice Mocha as a valuable place to go as
> everyone cloned its features, and nobody would shop at Rolomail.com
> anymore... forcing me to go back into the job market and ending all
> future contributions to the world of Japanese on the Internet as I
> labored mindlessly for someone else's profit.
>
> Jim
>
> On Jan 22, 2007, at 11:07 PM, Stuart McGraw wrote:
>
> > Jim Rose wrote:
> > When you say "mapping kanji to reading strings", that is basically
> > what Ice Mocha does when it calculates the yomigana / furigana for
> > each EDICT word using the EDICT reading. Not easy.
> > Very hard stuff. But I pipe in only because this has already
> > been accomplished with EDICT several years ago.
> > Cool! What I was writing about was just the storage of the kanji-
> > reading
> > map information in a database. The "rest of the story" (which
> > you've done)
> > is generating that information. It occurred to me too that one
> > might be able to
> > do it automatically by substituting readings from a kanjidic-like
> > source, and
> > pruning impossible matches until one could show exactly one set of
> > reading-
> > kanji correspondences that results in the same reading as given for
> > the word.
> > But I would happily give up the pleasure of doing that if the code
> > was already
> > written. :-)
> >
> > Is your code available? I think it would be very useful.
>