
Re: [edict-jmdict] Examples - some stats



On 06/02/2008, Paul Blay <blay.paul@googlemail.com> wrote:

> As Jim's just updated the dictionary I've just re-checked the examples
>  indexing against them so here are the latest statistics.
>
>  151,750 Example sentence records, of which
>  1,936 have been added after the original Tanaka Corpus records.
>  27,115 Edict entries supported by example sentences.
>  0 records have only been partially indexed.
>
>  That looks like a very reasonable 5 sentences per Edict entry,
>  but, of course, it doesn't work like that. ;-)

Indeed it doesn't.

The 27,115 is quite an improvement over the ~23k that I had linked at the
beginning. I think we were a bit over 25k when Paul took over the
maintenance.

Not long after I linked the Tanaka Corpus to WWWJDIC, I wrote a paper
about it for a workshop:
http://www.csse.monash.edu.au/~jwb/papillon/dicexamples.html

I included some stats there, based on the 23,000 words. It would be
interesting to rerun the generation and see if the rankings have
changed much. I expect the particles, which I initially excluded,
will be the winners.
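
For anyone who wants to try such a rerun, here is a rough sketch of the
counting step in Python. It is only an illustration: the (sentence id,
headword) pairs and the rank_headwords name are made up for the example,
and the real examples index would need its own parser. It also shows why
the flat average from Paul's figures says little about any single entry.

```python
from collections import Counter

# Naive average from the figures quoted above: records per supported entry.
print(round(151_750 / 27_115, 1))  # 5.6

# Hypothetical (sentence_id, headword) pairs standing in for the real index.
links = [
    (1, "は"), (1, "犬"),
    (2, "は"), (2, "猫"),
    (3, "は"), (3, "犬"),
    (4, "を"),
]

def rank_headwords(pairs):
    """Count distinct sentences per headword, most-linked first."""
    # set() drops repeats of the same word within the same sentence.
    counts = Counter(word for _, word in set(pairs))
    return counts.most_common()

ranking = rank_headwords(links)

# Mean links per headword in the toy data; one particle dominates it.
mean_per_entry = sum(n for _, n in ranking) / len(ranking)

for word, n in ranking:
    print(word, n)
print("mean:", mean_per_entry)
```

In the toy data the particle は takes half of the links that the other
three words share between them, which is the sort of skew I would expect
in the real rankings once particles are included.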

Cheers

Jim

-- 
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/