Re: [edict-jmdict] A few questions

To: edict-jmdict@***************
Subject: Re: [edict-jmdict] A few questions
From: Olivier Binda <olivier.binda@**********>
Date: Tue, 22 Oct 2013 08:16:03 +0200

On 10/22/2013 05:43 AM, Jim Breen wrote:
>
> Responding first to the morphological analyzer stuff.
>
>
> On 21 October 2013 16:37, Olivier Binda <olivier.binda@wanadoo.fr> wrote:
>
> > Jim: I'd avoid IPADIC. Its owners (NAIST) are advising that Unidic
> > be used instead.
> >
> > I had the same kind of advice from the author of the J.DepP library
> > (a C++ Japanese dependency parser).
> > He told me to consider using Juman.
> >
> > I have no idea how the different Japanese dictionaries compare in
> > quality/accuracy (IPADIC, Juman, NAIST, Unidic, ...)
>
> Unidic is way ahead of IPADIC/NAIST-DIC. A key issue is the
> delineation of morphemes. The older dictionaries were a bit woolly
> about this: IPADIC treats 日本語 as a single morpheme whereas Unidic
> (correctly) treats it as two. I don't know much about the current
> state of Juman. It was the first Japanese morphological analyzer, and
> still has a following, especially with 京大 graduates. I heard recently
> that there was a new edition that was very good, but I haven't
> followed up on it. One thing Juman used to do was treat long katakana
> strings as single morphemes. I just tried out their on-line demo site
> with クーリングパイプ and it returned a single string. MeCab/Unidic
> says "クーリング + パイプ".
>
OK, this clears up all the confusion I had about those files.

... and I really want to use unidic if I can.

I looked at the unidic files, though, and they are huge (I'm developing
phone and tablet apps for Japanese :/).

The dictionary file (lex.csv) takes 180MB. As there are a lot of empty
fields and redundancy in there, it can be compressed into a quite
compact and efficient format (that's what the kuromoji guy did), so
maybe it can be reduced to 20-30MB.
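
To illustrate the kind of redundancy meant here, below is a minimal sketch that stores each distinct feature tuple once and keeps only a small integer per entry. It is not Kuromoji's actual format; the sample rows, their ids and costs, and the function name `intern_features` are made up for the example.

```python
import csv
import io

# Made-up rows in a MeCab-like layout: surface, left id, right id, cost, features...
# The ids and costs are invented; "*" marks an empty feature field.
SAMPLE = """\
日本,10,10,5000,名詞,固有名詞,*,*
語,20,20,6000,名詞,普通名詞,*,*
パイプ,20,20,7000,名詞,普通名詞,*,*
"""

def intern_features(csv_text):
    """Return (feature_table, entries), storing each feature tuple only once."""
    table = []    # unique feature tuples, indexed by feature id
    ids = {}      # feature tuple -> feature id
    entries = []  # (surface, left_id, right_id, cost, feature_id)
    for row in csv.reader(io.StringIO(csv_text)):
        surface, left, right, cost, *features = row
        key = tuple(features)
        fid = ids.setdefault(key, len(table))
        if fid == len(table):  # key was new, so register its tuple
            table.append(key)
        entries.append((surface, int(left), int(right), int(cost), fid))
    return table, entries

table, entries = intern_features(SAMPLE)
print(len(entries), "entries share", len(table), "feature tuples")
```

On a real lexicon the saving depends on how many entries share identical feature columns, which would have to be measured.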


My main concern is that where ipadic has around 1300 POS tags, unidic
seems to have about 6000.
On the bright side, it's far more accurate; on the downside, the
connection matrix that you have to keep in RAM for the parser to be
fast is about 20 times bigger:

6000 * 6000 * 2 bytes comes to about 72MB
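
As a quick check of that arithmetic, here is a small sketch assuming a dense square matrix of 2-byte costs and the rounded tag counts quoted above (1300 and 6000 are approximations, not exact figures from the dictionaries):

```python
def dense_matrix_bytes(num_ids: int, bytes_per_cost: int = 2) -> int:
    """Size in bytes of a dense num_ids x num_ids connection cost matrix."""
    return num_ids * num_ids * bytes_per_cost

ipadic = dense_matrix_bytes(1300)  # rounded count from the message
unidic = dense_matrix_bytes(6000)  # rounded count from the message

print(f"ipadic: {ipadic / 1e6:.1f} MB")   # ipadic: 3.4 MB
print(f"unidic: {unidic / 1e6:.1f} MB")   # unidic: 72.0 MB
print(f"ratio:  {unidic / ipadic:.1f}x")  # ratio:  21.3x
```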

It's still possible to write an Android app with that kind of memory
requirement, but the startup time is going to suffer (you have to load
the matrix into RAM at start).

A few months ago, I tried to find ways to compress the (sparse) matrix
in RAM while still keeping lookups fast.
I managed to reduce the ipadic connection cost matrix from 4.3MB to 1.4MB.
I'll have to experiment with the unidic matrix; it might be worth it.
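
One generic way to shrink such a matrix while keeping constant-time lookups is to store each distinct row once and keep a small index from left id to row. The sketch below is only an illustration of that idea on toy data; it is not the scheme referred to above, and the class name `DedupCostMatrix` is hypothetical.

```python
from array import array

class DedupCostMatrix:
    """Connection costs with identical rows stored only once."""

    def __init__(self, dense_rows):
        # left id -> index of its unique row (assumes fewer than 65536 unique rows)
        self.row_index = array("H")
        # each unique row is a compact array of signed 16-bit costs
        self.unique_rows = []
        seen = {}
        for row in dense_rows:
            key = tuple(row)
            if key not in seen:
                seen[key] = len(self.unique_rows)
                self.unique_rows.append(array("h", row))
            self.row_index.append(seen[key])

    def cost(self, left_id, right_id):
        """Two array reads: pick the shared row, then the column."""
        return self.unique_rows[self.row_index[left_id]][right_id]

    def nbytes(self):
        """Bytes used by the cost data plus the row index."""
        rows = sum(r.itemsize * len(r) for r in self.unique_rows)
        return rows + self.row_index.itemsize * len(self.row_index)

# Toy 4x3 matrix in which three of the four rows are identical.
dense = [[0, 5, -3], [0, 5, -3], [7, 1, 2], [0, 5, -3]]
m = DedupCostMatrix(dense)
print(m.cost(3, 1), len(m.unique_rows), m.nbytes())
```

How much this saves depends entirely on how many rows (or columns) of the real matrix are duplicates, so it would need measuring on the actual data.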

Caches could also be implemented. Trading some complexity and lookup
time against RAM usage could improve things.
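
As a rough sketch of the caching idea, the example below keeps only a few recently used rows in memory and reads the others on demand. The in-memory `BytesIO` blob stands in for a matrix file on disk, and the names `load_row` and `cost`, the toy size and the row layout are all assumptions made for the illustration.

```python
import io
import struct
from functools import lru_cache

NUM_IDS = 4  # toy matrix size

# Fake "on-disk" matrix: row-major, signed 16-bit little-endian costs.
_costs = [[left * 10 + right for right in range(NUM_IDS)] for left in range(NUM_IDS)]
_blob = io.BytesIO(b"".join(struct.pack(f"<{NUM_IDS}h", *row) for row in _costs))

@lru_cache(maxsize=2)  # keep at most two rows in RAM
def load_row(left_id: int) -> tuple:
    """Read one row of costs from the backing store."""
    _blob.seek(left_id * NUM_IDS * 2)
    return struct.unpack(f"<{NUM_IDS}h", _blob.read(NUM_IDS * 2))

def cost(left_id: int, right_id: int) -> int:
    return load_row(left_id)[right_id]

print(cost(2, 3))             # first access reads the row
print(cost(2, 1))             # same row, served from the cache
print(load_row.cache_info())  # hit/miss counters
```

The cache size is the knob here: a larger `maxsize` uses more RAM but avoids more reads.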

So, for now, I'll keep working with ipadic but still put a little
time now and then into porting my stuff to unidic.

> > I have been using ipadic because the Kuromoji parser that ships with
> > Lucene comes with it out of the box and has much lower memory
> > requirements than mecab (everything fits in under 10MB, whereas the
> > mecab files take about 90MB)
>
> Kuromoji has been getting a lot of use in smartphone/tablet apps, which is
> understandable, but for serious use the smaller morpheme lexicons are
> quite a compromise.
>
> > Using juman or unidic with the kuromoji parser should definitely be
> > possible, but it would require me to adapt the kuromoji parser to
> > those files (there is heavy optimizing in there to reduce memory
> > consumption and redundancy). I haven't seriously looked into it or
> > put time into it yet.
> >
> > Would you happen to know which of the two (juman, unidic) is the
> > best, most accurate, and best maintained?
>
> Unidic is in active development. I can't comment on the state of
> Juman's lexicon. It all depends on what you want. Experiment a bit
> with them - you can try them online.
> Juman: http://reed.kuee.kyoto-u.ac.jp/nl-resource/cgi-bin/juman.cgi
> MeCab/Unidic: http://www.edrdg.org/~jwb/mecabdemo.html
>
> ChaSen/IPADIC: http://www.edrdg.org/~jwb/chasendemo.html
>
Nice links!

Olivier

> Jim
>
> -- 
> Jim Breen
> Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
>
>
