
Re: Wishlist item (Re: [edict-jmdict] "P" Markers - Google as corpus?)



Dear Olivier et al,

I think this discussion ended some time ago with me caving in to share the parsed yomigana data, but not caving in to share the code itself - so a complete set of yomigana for all of EDICT is not a problem. However, what I can share now is probably not useful. I haven't run a parse of EDICT in over a year (some time last year my wife managed to eject my main computer - my life, really - out of a Jeep Liberty during a horrific wreck, wiping out my productivity for some time). I am, however, slowly getting to the point where I'll soon be back to parsing EDICT on a daily basis (mostly because I would like to see Ice Mocha users be able to contribute in a meaningful way to the enhancement of the Tanaka Corpus that Paul Blay and Prof. Breen are constantly improving). At that point, perhaps we should agree on a format for sharing yomigana data back to EDICT/JMDICT.

For reasons mentioned before, I store yomigana data in forms like "z1.44.z15.13". The internal format used now assumes a 4-to-1 font display ratio, because there isn't a single instance in EDICT of a kanji reading exceeding 4 kana when it's in a compound with another kanji (the same is not true for kanji outside compounds, but those don't have the same geometric restrictions on yomigana display): 48px words to 12px yomigana, or 36px to 9px, etc. But you could also store it with just one zero for "bins" where the symbol is kana, and the actual kana where the symbol is a kanji, i.e.:

お祖父さん おじいさん 0.じ.い.0.0

So in this case I picked a euphonic yomigana... because 祖 by itself contributes い and not じ, and 父 by itself contributes neither い nor じ. The reading じい really only applies to the two kanji together. In this case Ice Mocha will divide the reading evenly between 祖 and 父, but there are many idiomatic cases in EDICT where you have, say, two kana to divide among three kanji. So another benefit of using multiple zeroes to hold places is that you can reposition the kana in a fairer, more evenly distributed fashion. Unlike plain word-to-reading mappings, yomigana also have to address spatial relationships.
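
To make that concrete, here is a rough Python sketch of how the two storage forms above could be unpacked. It is illustrative only - the function names are made up and this is not Ice Mocha's internal code:

    # "0.じ.い.0.0" form: one dot-separated bin per symbol of the word;
    # "0" means that symbol contributes no yomigana of its own.
    def pair_bins(word, bins):
        cells = bins.split(".")
        assert len(cells) == len(word), "one bin per symbol"
        return [(ch, "" if b == "0" else b) for ch, b in zip(word, cells)]

    print(pair_bins("お祖父さん", "0.じ.い.0.0"))
    # [('お', ''), ('祖', 'じ'), ('父', 'い'), ('さ', ''), ('ん', '')]

    # "z1.44.z15.13" form: a compressed row of 12px cells, four cells per
    # 48px symbol; "zN" means N blank cells, a bare number is a kana index.
    def expand_z(packed):
        cells = []
        for item in packed.split("."):
            if item.startswith("z"):
                cells.extend([None] * int(item[1:]))   # blank 12px cells
            else:
                cells.append(int(item))                # kana number
        return cells

    print(expand_z("z1.44.z15.13"))
    # 18 cells: [None, 44, None, None, ..., None, 13]

Every group of four cells then sits above one 48px symbol, and dividing a zN run by 4 gives the number of full-width blanks when real text is used instead of images, as described in the quoted post further down.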

I think in Ice Mocha's development notes I mentioned that my first attempt at building a yomigana parser was very complex... perhaps like your AI tail recursion, since "AI" seems to indicate complexity? Extending string theory is simple and much more powerful - and from my limited knowledge of computer science, it would be classified as a "genetic algorithm". Nevertheless, idioms and euphony require enhanced kanji reading data not available from KANJIDIC... so, string theory or no, EDICT cannot be completely processed without human intervention, at least in the development of supporting files.
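
If it helps to see the extend-and-prune idea in code, here is a rough Python sketch. The reading table is a toy stand-in for the enhanced KANJIDIC-plus-euphonic-variants data, and the code is mine, not Ice Mocha's:

    # Toy reading table: per-kanji readings, including a euphonic variant
    # and the "linguistic zero" (empty string) used for idiomatic compounds.
    READINGS = {
        "祖": ["そ", "じ", "じい"],
        "父": ["ちち", "とう", "い", ""],
    }

    def parse(word, reading):
        """Grow candidate reading strings one symbol at a time, pruning any
        that can no longer be a prefix of the target reading."""
        candidates = [("", [])]                # (string so far, bins so far)
        for ch in word:
            options = READINGS.get(ch, [ch])   # kana contribute themselves
            survivors = []
            for built, bins in candidates:
                for r in options:
                    grown = built + r
                    if reading.startswith(grown):      # prune the impossible
                        survivors.append((grown, bins + [r]))
            candidates = survivors
        return [bins for built, bins in candidates if built == reading]

    print(parse("お祖父さん", "おじいさん"))
    # [['お', 'じ', 'い', 'さ', 'ん'], ['お', 'じい', '', 'さ', 'ん']]

With properly curated reading data the goal is that exactly one candidate survives to the end; the two survivors above are exactly the kind of euphonic/idiomatic ambiguity that human-built supporting files resolve.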

I also applied the algorithm to ENAMDICT... talk about insufficient euphonic and idiomatic data! It would take half a man year of enhancing the supporting files to parse ENAMDICT. EDICT is straightforward in comparison. My euphonic collection thus far doesn't even come close to being sufficient.

Anyway, if the interest in having yomigana data wrapped up into JMDICT/EDICT is real, we should have some discussion about how to store it.

Jim


On Jul 31, 2007, at 12:03 PM, olivier.binda wrote:

Greetings everyone.

I was browsing through the old posts and happened to stumble on this
most interesting one (quoted below).

I might be able to help a little with this problem:
for the Japanese/English/German-French dictionary I'm making,
I have coded a function that
takes a hiragana string S1 and a hiragana+kanji string S2 and
1) tells whether the strings S1 and S2 can be paired
2) if they can be paired, outputs the yomigana of the kanji in string S2

And well, when I read Jim Rose's post, I recognized the way I had
implemented my function.
So in the end, it took three and a half years to rediscover extending
string theory. (Btw, it is a much better name than "Awful but Fun Function
that does AI tail Recursion", a.k.a. AFFAIR, the one that I chose ^_^)...

It works quite well: I tested it on the JMdict words that have
French/English/German translations and that DON'T have the (JF2) tags
(don't ask me why! ^_^) and...

out of 14676 JMdict entries (containing 1624 kana-only entries), it
managed to produce 18047 pairs, with only
1880 failed pairing attempts (including 1038 kanji combinations that
couldn't be paired).

So basically, its success rate is around 95%.
And this is only the first version of this function (it took me around 5
days to implement and debug in TeX... but then, TeX is a hell of a
language to code in ^_^).

I intend to upgrade this function, and version 2 will be able to
pair most strings of the type "kana + kanji with a kanjidic2
reading + 1 kanji with an irregular reading" (like お姉さん and 御兄さん,
Oneesan or Aniisan) that failed with the version 1 function.

But when trying to pair 今日 with きょう, it will still be impossible
to split the kana into two furigana strings...

Of course, some words should be handled by humans (like my senpai, Jim
Rose, did).

Well... let's get back to the point.
I'm willing to give away either the code and/or the data that I found,
because:
_it's not so hard to code anyway.
_done by one, useful to all.
_I don't have the courage of Jim Rose: I don't want to have to handle
the words that escape the scrutiny of my function
(besides, spending too much time typing hurts my back).
So... let's do 95% of the work automatically and let volunteers do the
remaining 5% progressively.
_I was going to suggest merging furigana information into JMdict anyway.

Let's think about the students (including me ^_^) and the young ones:

when considering a Japanese word, it is useful to know:
how it can be written with kanji
or
how it is written in kana
(this is how JMdict does things)

But

it is more useful to know:
how it can be written with kanji
and
how that way of writing the word with kanji can be written with kana
and
how the kana relate to the kanji (furigana)
(this is how I dream JMdict would do things... with grouped
kana (furigana) and kanji readings):
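
Something like this, purely illustrative (not an actual JMdict format),
sketched here as a Python structure:

    # One hypothetical way to group an entry's furigana: pair each kanji
    # or kana segment of the headword with the kana that read it.
    entry = {
        "keb": "お祖父さん",    # kanji element
        "reb": "おじいさん",    # reading element
        "furigana": [("お", ""), ("祖父", "じい"), ("さん", "")],
    }
    # The reading can be rebuilt by taking each segment's furigana when
    # present, or the segment itself when it is plain kana, and ruby text
    # can be rendered straight from the pairs.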

Olivier Binda

--- In edict-jmdict@yahoogroups.com, Jim Rose <jim@...> wrote:
>
> Stuart,
>
> So what happens is that you have a word. The word might be kana,
> might be all kanji, might be kana and kanji. You can parse the
> reading for the kanji from kanjidic, but you will have two large
> categories which will not parse.
>
> Idioms and words exhibiting euphony. You've already discussed the
> former. You can't parse those without human intervention - at least
> not the first time you encounter it as you are slogging through
> EDICT. Euphony can be handled, but you have to build a bigger,
> better data source... starting from kanjidic, but growing the reading
> data to include relevant euphonic variations. You don't do this by
> hand for each one, but you do have to do it by hand for each
> previously un-encountered euphonic kanji reading. This takes several
> man months, by the way, because there are so many possible euphonic
> twists in EDICT. One other trick Ice uses is to handle idioms as if
> they were euphony... by using the equivalent of the linguistic
> "zero". The reading of " ", or nothing at all... null... void...
> contributes no sound at all... and then assigning one of the kanji
> the idiomatic pronunciation if necessary.
>
> But the plain old parsing isn't trivial either. There are multiple
> ways to approach it. One of those ways is superior to all others:
>
> If you poke around Ice Mocha on KanjiCafe, you'll discover that I
> discovered the parsing algorithm while drinking an Ice Mocha... hence
> the name of the application. I call the technique "extending string
> theory". Its a bit like darwinian evolution, natural selection that
> is, and a kind of genetic algorithm. I've never discussed it
> anywhere, because I've made a bet with my myself that nobody else
> would figure it out for 5 years. Its been 3.
>
> Basically you are taking your enhanced data source of possible
> readings and in tandem with other bells and whistles, extending a
> test string. As each string grows, you test it to see if it can
> "survive" as a possible reading for the word. At each step you
> eliminate the impossible variants from your array. Then grow the
> survivors one more reading. It's trivial to show that eventually only
> one evolving string can survive all the way to the end. The process
> is fairly fast provided you've built a very robust data source which
> includes EDICT's many euphonic variants of kanji phonemes, and all
> the while you keep track of which "bins" the readings are in vis-a-
> vis the order of the symbols used to write the word - something you
> were starting to think about. The last time I used this algorithm to
> parse EDICT, I think it finished in a few minutes - but it was after
> countless hours of enhancing kanjidic's data - hence HARD - and it
> was anything but a pleasure. More like an addiction to seeing how
> much of my life I could waste proving to myself that I could do it.
>
> Japanese is expressed on your computer using fixed width fonts, and
> there are no multi-kanji words in EDICT with readings exceeding four
> kana symbols... so I use 48 px fonts for words, and 12 px fonts for
> furigana / yomigana. To center those in horizontal display when
> using the graphical imaging proxy mode, or top of the cell justify
> them in vertical, I keep track of their position using a system of 4
> spaces per kanji. Then I keep a record of the blank spaces using a
> compressing short-hand. For example I store the position of the
> yomigana using say z1.44.z15.13. This would tell ice to display kana
> 44 shifted one space to the right (if generating a yomigana image),
> then have 15 blank 12 px spaces, and display kana 13. Because the
> font is 12 px, every 4 kana or blank spaces would be above another 48
> px symbol in the word. If I wasn't in proxy mode, and using real
> Japanese text, I would convert a number like "z8" by division by 4
> into two blank spaces. So you see, storing this information comes
> after determining how to parse it.
>
> All this is probably more than you wanted to know?
>
> So you were right, it is possible to do... and I do in fact have code
> which will do it. But my code relies on many internal features of
> Ice Mocha... for example, the proxy server ordering of kanji which
> allows me to address most Japanese symbols by an internal KanjiCafe
> number. You achieved some of the rudimentary starting thoughts
> necessary to duplicate what I have accomplished. You understood that
> arrays are needed. You understood that you would have to prune
> impossible matches or you would get bogged down in impossible amounts
> of computation. You should also know that every time EDICT grows you
> may have to add more euphonic variants to your data that were not yet
> seen in older versions of EDICT in order to get a proper parse.
>
> Would I share my code and data set? Honestly, I don't think so. As
> I've been building a set of stroke order diagrams which are now 1,415
> in number, I've come to realize the personal sacrifice I've made in
> terms of time - countless man hours to give the world a set of SODs
> bigger and better than Halpern's, which are really only available
> through WWWJDIC. I'm making those available on the Internet to other
> web sites because a very few people have really helped out with the
> project. There have been volunteers and collaboration - but not from
> other web application developers as far as I can tell, and not until
> I spent several years figuring out the necessary software to allow
> the collaboration.
>
> But the Yomigana data is a powerful feature of Ice Mocha that is not
> yet seen on the Internet... not even on WWWJDIC... It's something I
> take great pride in and have worked extremely long, unpaid hours on. My only
> reward has been a tiny bit of Internet fame. Give my code away, and
> unlike the SODs and animated SODs, there is no visible attribution.
> People would stop seeing Ice Mocha as a valuable place to go as
> everyone cloned its features, and nobody would shop at Rolomail.com
> anymore... forcing me to go back into the job market and ending all
> future contributions to the world of Japanese on the Internet as I
> labored mindlessly for someone else's profit.
>
> Jim
>
> On Jan 22, 2007, at 11:07 PM, Stuart McGraw wrote:
>
> > Jim Rose wrote:
> > When you say "mapping kanji to reading strings", that is basically
> > what Ice Mocha does when it calculates the yomigana / furigana for
> > each EDICT word using the EDICT reading. Not easy.
> > Very hard stuff. But I pipe in only because this has already
> > been accomplished with EDICT several years ago.
> > Cool! What I was writing about was just the storage of the kanji-
> > reading
> > map information in a database. The "rest of the story" (which
> > you've done)
> > is generating that information. It occurred to me too that one
> > might be able to
> > do it automatically by substituting readings from a kanjidic-like
> > source, and
> > pruning impossible matches until one could show exactly one set of
> > reading-
> > kanji correspondences that results in the same reading as given for
> > the word.
> > But I would happily give up the pleasure of doing that if the code
> > was already
> > written. :-)
> >
> > Is your code available? I think it would be very useful.
>