Re: [edict-jmdict] Examples - some stats
On 06/02/2008, Paul Blay <blay.paul@googlemail.com> wrote:
> As Jim's just updated the dictionary I've just re-checked the examples
> indexing against them so here are the latest statistics.
>
> 151,750 Example sentence records, of which
> 1,936 have been added after the original Tanaka Corpus records.
> 27,115 Edict entries supported by example sentences.
> 0 records have only been partially indexed.
>
> That looks like a very reasonable 5 sentences per Edict entry,
> but, of course, it doesn't work like that. ;-)
Indeed it doesn't.
The 27,115 is quite an improvement over the ~23k that I had linked at the
beginning. I think we were a bit over 25k when Paul took over the
maintenance.
Not long after I linked the Tanaka Corpus to WWWJDIC, I wrote a paper
about it for a workshop
(http://www.csse.monash.edu.au/~jwb/papillon/dicexamples.html).
I included some stats there, based on the 23,000 words. It would be
interesting to rerun the generation and see if the rankings have
changed much. I expect the particles, which I initially excluded,
will be the winners.
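
A rough sketch of why the per-entry average misleads, and of what a ranking rerun might look like. The naive ratio uses only the two figures quoted above. The toy index, its headwords, and all variable names are invented for illustration and are not taken from the actual example file or indexing tools.

```python
from collections import Counter
from statistics import mean, median

# Figures quoted in the message above.
records = 151_750
supported_entries = 27_115
naive_ratio = records / supported_entries  # roughly 5.6 records per entry

# Hypothetical toy index: each sentence lists the headwords it is indexed to.
# These sentences and headwords are made up for illustration only.
toy_index = [
    ["は", "行く"],
    ["は", "食べる"],
    ["は", "行く"],
    ["は", "本"],
    ["は", "猫"],
]

# Count how many sentences support each headword (once per sentence).
per_entry = Counter(word for sentence in toy_index for word in set(sentence))
counts = sorted(per_entry.values(), reverse=True)

toy_mean = mean(counts)            # pulled up by the one very common word
toy_median = median(counts)        # what a typical headword actually gets
ranking = per_entry.most_common()  # headwords ordered by supporting sentences
```

In this toy data a single particle is linked from every sentence, so the mean is twice the median. Note also that each sentence can support several entries, so dividing records by entries is not really a per-entry count in the first place.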
Cheers
Jim
--
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/