
Re: pos, misc and the parsing of japanese...

To: edict-jmdict@***************
Subject: Re: pos, misc and the parsing of japanese...
From: "spartan_entertainment" <olivier.binda@**********>
Date: Thu, 25 Jul 2013 08:22:04 -0000

Thanks, yes, it really helps.


--- In edict-jmdict@yahoogroups.com, Jim Breen <jimbreen@...> wrote:
>
> Hi Olivier,
> 
> Bienvenue
> 
> On 25 July 2013 15:25, Olivier Binda <olivier.binda@...> wrote:
> 
> > https://play.google.com/store/apps/details?id=org.lakedaemon.japanese.dictionary
> >
> > and an enhanced  japanese text reader (that I'm currently rewriting from
> > scratch, the unpublished version 2 is already much better)
> >
> > https://play.google.com/store/apps/details?id=japanese.toolkit
> 
> I must look at them when I get a chance.
> 


To test the Japanese reader, wait for version 2, which will hopefully be published soon; version 1 is a bit lacking.




> > 1) First, I noticed that what the jmdict DTD says about MISC is
> > inconsistent with its usage: the DTD says that MISC behaves
> > like POS
> > and that information will usually apply to several senses. So, you would
> > expect that the same misc tags wouldn't be repeated.
> > Yet, in the entry for suru, there are lots of consecutive senses that
> > hold the "uk" MISC tag and it looks like it's the same in other entries.
> >
> > What the DTD says about misc tag isn't valid anymore, right ?
> 
> Correct. I must change it. We can make the POS propagate because it will
> get over-ridden by another, but there is no way to override a "uk".
> 

ok, this makes sense.
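
The distinction Jim describes (POS carries forward to later senses until another POS overrides it, while misc tags such as "uk" must be repeated on every sense they apply to) can be sketched as follows. This is a minimal illustration, not the JMdict project's own code: the sample entry, its glosses, and the `resolve_senses` helper are invented for the example, and plain text stands in for the DTD entities (`&uk;` and so on) so that the snippet parses without the DTD.

```python
import xml.etree.ElementTree as ET

# Toy entry in a JMdict-like shape (invented data, not a real JMdict entry).
# Real JMdict uses entities such as &vs-i; and &uk; inside <pos> and <misc>.
SAMPLE = """
<entry>
  <sense><pos>vs-i</pos><misc>uk</misc><gloss>toy sense 1</gloss></sense>
  <sense><misc>uk</misc><gloss>toy sense 2</gloss></sense>
  <sense><gloss>toy sense 3</gloss></sense>
  <sense><pos>suf</pos><gloss>toy sense 4</gloss></sense>
</entry>
"""

def resolve_senses(entry):
    """Return one dict per sense: POS is inherited, misc is taken as written."""
    resolved = []
    current_pos = []  # last explicit POS list seen in this entry
    for sense in entry.findall("sense"):
        own_pos = [p.text for p in sense.findall("pos")]
        if own_pos:
            # An explicit POS overrides whatever was carried forward.
            current_pos = own_pos
        resolved.append({
            "pos": list(current_pos),
            # No carry-over for misc: there is no tag that could cancel "uk".
            "misc": [m.text for m in sense.findall("misc")],
            "gloss": [g.text for g in sense.findall("gloss")],
        })
    return resolved

senses = resolve_senses(ET.fromstring(SAMPLE))
for s in senses:
    print(s)
```

In this sketch the third sense inherits "vs-i" but carries no "uk", which is why consecutive senses in a real entry have to repeat the misc tag.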



> > 2) I'm looking for a mapping between  pos tags output by mecab and tags
> > used by jmdict, does one such map exist ?
> > (I kinda remember reading that mecab had been used to tag some jmdict
> > entries at some point in time)
> 
> Well, MeCab doesn't have POS tags. It just uses the ones in whatever morpheme
> lexicon you give it: UniDic, IPADIC, etc. And they all are different.
> 
> Also, those lexicons use very fine-grained POS tags meant to operate at the
> morpheme level. Some are really weird, and probably only exist because they
> help the machine-learning in MeCab (HMM/CRF) to work. Trying to map the
> many POS tags in Unidic (I've never counted them) to the 70 or so in JMdict
> would be a BIG task, especially as a lot of the ones we have, such as "exp",
> don't exist in Unidic, and we bunch them up, as in "n" and "vs", whereas in
> Unidic they are in a single hierarchical tag.
> 
> > I have been building one, but as I'm no expert on japanese, there is
> > bound to be a lot of mistakes in there and, if possible, I would like to
> > use a better/more accurate one.
> 
> Well, good luck with that. I don't know what you are doing, but I'm guessing
> it's to do with the text reader you are developing. 

indeed 

> I have given some thought to
> redoing WWWJDIC's Text Glossing function, using MeCab/Unidic parsing.
> I'd probably use a greedy longest-match algorithm on the morpheme stream
> from MeCab.

At some point, I'll probably have to implement such a thing.
I'm doing single tokens for now.
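
For reference, the greedy longest-match idea Jim mentions could look roughly like this. It is only a sketch under assumptions: the morphemes are given as a plain list of surface strings, the lexicon is a set of headwords, and `greedy_longest_match` and `max_span` are hypothetical names. A real glosser would also have to compare dictionary forms so that inflected words match.

```python
def greedy_longest_match(morphemes, lexicon, max_span=6):
    """Group consecutive morphemes into the longest strings found in lexicon.

    Returns a list of (chunk, found_in_lexicon) pairs.
    """
    result = []
    i = 0
    while i < len(morphemes):
        match_len = 1  # fall back to the single morpheme
        upper = min(max_span, len(morphemes) - i)
        # Try the longest span first and stop at the first hit.
        for span in range(upper, 1, -1):
            if "".join(morphemes[i:i + span]) in lexicon:
                match_len = span
                break
        chunk = "".join(morphemes[i:i + match_len])
        result.append((chunk, chunk in lexicon))
        i += match_len
    return result

# Toy morpheme stream and lexicon (illustrative data only).
morphemes = ["日本", "語", "を", "勉強", "する"]
lexicon = {"日本", "語", "日本語", "を", "勉強", "勉強する"}
chunks = greedy_longest_match(morphemes, lexicon)
print(chunks)
```

The point of going greedy is that a morpheme-level split such as 日本 + 語 gets glued back into the headword 日本語 whenever the dictionary has the longer form.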


> 
> Would you REALLY try and get MeCab/Unidic running under Android?
> 

I already did. 

Version 1 of my text reader used a port of MeCab (NDK/C++), but that is close to useless for a mobile app, as it needed something like 90 MB of RAM to hold the MeCab dictionary.

My dictionary and version 2 of my text reader use the Kuromoji Japanese tokenizer of the Lucene project (more or less a Java clone of MeCab that uses much less memory).
It brings the requirement down to around 10 MB, much better...

My problem is mapping the UniDic entries to JMdict entries/senses.

I try to do that by matching, on the MeCab/UniDic side, a token with its
surface form
dict form
reading
MeCab POS


to senses on the JMdict side that have
keb
reb
JMdict POS
uk, stagk, stagr, as well as sense restr to keb/reb


It works quite well (enough to allow someone with moderate japanese skills like me to read a book)  
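
A rough sketch of that matching, to make the idea concrete. It is not the code of my app: the `Token`, `Sense` and `Entry` classes, the `candidate_senses` function, the toy entry and above all the `POS_MAP` table are hypothetical and incomplete, and the reading is assumed to come back in katakana.

```python
from dataclasses import dataclass, field

@dataclass
class Token:                      # what the tokenizer reports
    surface: str
    dict_form: str
    reading: str                  # reading of the surface form, in katakana
    pos: str

@dataclass
class Sense:                      # one JMdict sense, POS already resolved
    gloss: str
    pos: set
    misc: set = field(default_factory=set)
    stagk: set = field(default_factory=set)
    stagr: set = field(default_factory=set)

@dataclass
class Entry:
    kebs: list
    rebs: list
    senses: list

def kata_to_hira(text):
    """Shift katakana code points down to hiragana; leave other characters alone."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)

def candidate_senses(token, entry, pos_map):
    """Return the senses of entry compatible with token ('uk' first for kana tokens)."""
    form = token.dict_form
    allowed_pos = pos_map.get(token.pos, set())   # empty set = no POS filtering
    if form in entry.kebs:
        in_kana = False
        # The surface reading is only comparable to a reb for uninflected tokens.
        reading = kata_to_hira(token.reading) if token.surface == form else None
        if reading is not None and reading not in entry.rebs:
            return []
    elif kata_to_hira(form) in entry.rebs:
        in_kana = True
        reading = kata_to_hira(form)
    else:
        return []
    kept = []
    for sense in entry.senses:
        if not in_kana and sense.stagk and form not in sense.stagk:
            continue              # sense restricted to another kanji form
        if sense.stagr and reading is not None and reading not in sense.stagr:
            continue              # sense restricted to another reading
        if allowed_pos and not (sense.pos & allowed_pos):
            continue              # POS families do not overlap
        kept.append(sense)
    if in_kana and entry.kebs:
        # Stable sort: senses flagged "usually kana" come first.
        kept.sort(key=lambda s: "uk" not in s.misc)
    return kept

# Hypothetical, incomplete tokenizer-POS -> JMdict-POS table.
POS_MAP = {"動詞": {"v1", "v5r", "vs-i"}, "名詞": {"n", "vs"}}
entry = Entry(
    kebs=["為る"],
    rebs=["する"],
    senses=[
        Sense("toy sense, normally in kanji", {"vs-i"}),
        Sense("toy sense, usually kana", {"vs-i"}, misc={"uk"}),
        Sense("toy noun sense", {"n"}),
    ],
)
kana_token = Token(surface="し", dict_form="する", reading="シ", pos="動詞")
matches = candidate_senses(kana_token, entry, POS_MAP)
print([s.gloss for s in matches])
```

The weak point is the POS table: every tokenizer dictionary has its own tag set, so the table has to be rebuilt for each one.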

Of course, I would just love to get English/French/German/Portuguese/Spanish/Russian/Dutch meanings directly for UniDic/IPADIC entries,

but I don't know if such resources exist... so I'm correlating UniDic with JMdict.


> > 3) Given the kanjidic structure, it looks like the meanings of kanji
> > were meant to be grouped with readings.
> 
> Well, grouped, yes. Sometimes those groupings are associated with
> particular readings, and sometimes they aren't.
> 
> > Will it be done at some point in the (near) future ? are there plans for
> > that ?
> 
> It was always my intention to let it happen some time, which is why the
> kanjidic2 DTD was structured that way. But it would be a HUGE job. I
> doubt I'll ever have time to do it.
> 

Too bad, I would really love to have that. 
Any chance it could be done by a crowd rather than by you alone?



> > 4) The sets of words of jmdict and of the dictionaries shipped with
> > mecab (unidic,...) are quite different, which makes tokenizing for
> > jmdict less useful/efficient.
> 
> Why are you doing it? They are very different dictionaries with quite
> different purposes.
> 
> > Is there somewhere a version of the dictionaries shipped with mecab that
> > has been extended to hold all jmdict entries and trained against a big
> > corpus ?
> 
> Not really. At one stage I experimented with adding terms from JMdict into
> NAIST-JDIC and seeing if MeCab could be used to dig new words out
> of text by looking at what didn't get detected. It didn't work well enough to
> take any further. I didn't attempt to train it - just used the common
> weightings according to POS.

Ah well, that's good to know.

Maybe one day I'll find the time (and the brains) to try training it this way, to see if it makes a difference... low odds though.

> Anyway, UniDic operates at the strict morpheme level.
> It doesn't have 日本語 as it treats it (correctly) as 日本 + 語.
> 
> > 5) The uK tag appears inside RE_INF markup but the uk tag appears inside
> > MISC markup. Isn't that weird?
> 
> Not really. They are quite different. "uk" is associated with the
> sense, and uK with the Japanese surface forms.
> 
> > Aren't there senses that are only valid for Kanji ?
> 
> Not really sure what you mean. In almost all cases the senses are
> "valid" for both the kanji and kana.
> 
> Hope this helps.
> 
> Cheers
> 
> Jim
> 
> -- 
> Jim Breen
> Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
>

