Re: [edict-jmdict] A few questions

Subject: Re: [edict-jmdict] A few questions

From: Olivier Binda <olivier.binda@**********>

Date: Mon, 21 Oct 2013 07:37:04 +0200

Hello and many thanks for the fast and accurate answers

On 10/21/2013 03:19 AM, Jim Breen wrote:

Greetings

On 20 October 2013 23:26, Olivier Binda <olivier.binda@**********> wrote:

> A) Regarding the variant tag in the kanjidic2.xml file.
> How does one differentiate between
> 1) a cross-reference to another kanji, usually regarded as a variant (ucs ?)
> 2) an alternative indexing code

It's an attribute, as explained in comments in the DTD.

E.g.

<variant var_type="jis208">82-45</variant>
<variant var_type="nelson_c">5324</variant>

Generally if the "var_type" is a JIS/UCS codepoint it's
pointing to a variant, and if it's anything else it's an
alternative index. In the examples above the more common
鯵 doesn't have an Old Nelson code because Nelson indexed the
older form (鰺).
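
A minimal sketch of this rule of thumb in Python, for illustration only. It assumes that the codepoint-style "var_type" values are jis208, jis212, jis213 and ucs (check the DTD comments for the authoritative list); the set name CODEPOINT_TYPES, the function name classify_variants and the trimmed-down <character> sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: these var_type values are codepoints, i.e. pointers to another kanji.
# Every other var_type is treated as an alternative index code.
CODEPOINT_TYPES = {"jis208", "jis212", "jis213", "ucs"}

# Trimmed-down, hand-written sample built from the two <variant> lines above.
SAMPLE = """<character>
  <literal>鯵</literal>
  <misc>
    <variant var_type="jis208">82-45</variant>
    <variant var_type="nelson_c">5324</variant>
  </misc>
</character>"""


def classify_variants(character):
    """Split the <variant> elements of one <character> into
    (cross_refs, alt_indexes), each a list of (var_type, value) pairs."""
    cross_refs, alt_indexes = [], []
    for variant in character.iter("variant"):
        pair = (variant.get("var_type"), (variant.text or "").strip())
        if pair[0] in CODEPOINT_TYPES:
            cross_refs.append(pair)
        else:
            alt_indexes.append(pair)
    return cross_refs, alt_indexes


cross_refs, alt_indexes = classify_variants(ET.fromstring(SAMPLE))
print("cross-references:", cross_refs)
print("alternative indexes:", alt_indexes)
```

Since the variant codes are described as a bit vague, a rule like this is only a first approximation.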

I see, this helps !

That said, those variant codes are a bit vague, and could do with
tidying, but who has the time?

If something can be done programmatically to fix/improve this, I may be able to help, but if it requires expertise in kanji/Japanese I wouldn't be able to, as I don't have it. :/

> B) Why doesn't the jmdict entry for できる (dekiru) have an entry like this one :
> <sense>
> <pos>&suf;</pos>
> <pos>&vs-i;</pos>
> <misc>&uk;</misc>
> <gloss>verbalizing suffix (applies to nouns noted in this dictionary with the part of speech "vs")</gloss>
> </sense>

Simple answer: no-one has proposed it as an extra sense. It would have
to say "potential" in there somewhere.

I see. I'll try to propose a fitting sense
(again, I'm quite good at technical things, but not that good at artistic/stylistic/literary decisions, so this will have to be edited by someone more competent)

> is it because kekkon-dekiru is considered to be a potential form of suru and not the different verb dekiru ?

Not really; more a matter that the entries have been edited by different
people at different times.

> C) I see on the wwwjdic page that there are dictionary files for Swedish, Italian... and lots of other languages.
> Is it possible to freely use those dictionaries

Each has its own permissions. You need to follow the documentation trails.

>.... and (if yes) is there a place where I could download those files ?

You'll have to follow the documentation. Some projects, such as Wadoku
and Warandic have downloads. I think the Italian project has too. Others
such as the Hungarian and Swedish files were one-offs.

I see. This helps. I'll look into those directions.

> D) I see that a lot of pos tags have been recently added to jmdict. Mostly regarding archaic v2 verbs.
>
> As I'm trying to map the ipadic part of speech to jmdict pos.

I'd avoid IPADIC. Its owners (NAIST) are advising that Unidic be used instead.

I had the same kind of advice from the author of the j.depP library (a C++ Japanese dependency parser).
He told me to consider using juman.

I have no idea how the different Japanese dictionaries compare in quality/accuracy (ipadic, juman, naist, unidic, ...)

I have been using ipadic because the Kuromoji parser that ships with Lucene comes with it out of the box and has much lower memory requirements than mecab (everything fits in under 10MB, whereas the mecab files take about 90MB)

Using juman or unidic with the Kuromoji parser should definitely be possible, but it would require me to adapt the Kuromoji parser to those files (there is heavy optimizing in there to reduce memory consumption and redundancy). I haven't seriously looked into it or spent time on it yet.

Would you happen to know which one of the two (juman, unidic) is the best/most accurate/best maintained ?

> Could you point me to a few Japanese text files where those new jmdict pos tags are likely to be in use,
> so that I could test and improve those mappings ?

I certainly can't. Try some of the old texts at the University of
Virginia, or the ones Michael Watson has linked from his page:
http://www.meijigakuin.ac.jp/~watson/

There is a version of Unidic specifically for classical Japanese. I have
not looked at it and don't know what its POS tags are.

Didn't know that. I'll look into those links.

Good luck

Jim

Again, thanks for your time and the (very helpful) answers.

I had some more questions :

I might be able to help by contributing grouped readings and meanings to kanjidic, but I don't know what would be acceptable (regarding the license/copyright)

1) Is it acceptable if I look into a published kanji dictionary to help me decide which on/kun reading should be associated with which English meaning ?
Note that grouping readings with meanings only requires shuffling glosses around in the kanjidic2 file, inside the right rmgroup tags (I wouldn't write senses/glosses, nor copy/paste stuff from copyrighted works)

2) Is it acceptable to use algorithms/NLP techniques/jmdict to help group meanings with readings ?

3) Grouping readings and meanings would probably take a lot of time...
What if, at first, only the English meanings (and the French ones) are grouped ?
Or if the English meanings were slowly but steadily grouped with readings ? Say first 1%, then 2%, then 3%...

Would those updates be published daily (which would allow other people to edit/contribute), or would you want all English meanings to be grouped before they are pushed to a public release ?

4) What if I built an HTML5 page, backed by a database and the right functions, that would allow many people to group meanings with readings in an efficient way ?
Would it help ?
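
To make the "shuffling" in 1) concrete, here is a rough Python sketch of what I have in mind. It assumes the kanjidic2 layout where <reading> and <meaning> elements sit inside <rmgroup> under <reading_meaning>. The sample entry, the GROUPING table and the regroup function are hypothetical and hand-made for illustration; they are not taken from any dictionary file.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, hand-written sample: a single rmgroup mixing two on readings.
SAMPLE = """<reading_meaning>
  <rmgroup>
    <reading r_type="ja_on">ガク</reading>
    <reading r_type="ja_on">ラク</reading>
    <meaning>music</meaning>
    <meaning>comfort</meaning>
    <meaning>ease</meaning>
  </rmgroup>
</reading_meaning>"""

# Hypothetical grouping decision: reading -> meanings that belong with it.
GROUPING = {"ガク": ["music"], "ラク": ["comfort", "ease"]}


def regroup(reading_meaning, grouping):
    """Replace the first <rmgroup> with one <rmgroup> per reading.

    Existing <reading> and <meaning> elements are only moved;
    no new reading or gloss text is created.
    """
    old = reading_meaning.find("rmgroup")
    readings = {r.text: r for r in old.findall("reading")}
    meanings = {m.text: m for m in old.findall("meaning")}
    position = list(reading_meaning).index(old)
    reading_meaning.remove(old)
    for offset, (reading, glosses) in enumerate(grouping.items()):
        group = ET.Element("rmgroup")
        group.append(readings[reading])
        for gloss in glosses:
            group.append(meanings[gloss])
        reading_meaning.insert(position + offset, group)
    return reading_meaning


root = regroup(ET.fromstring(SAMPLE), GROUPING)
for group in root.findall("rmgroup"):
    print([r.text for r in group.findall("reading")],
          [m.text for m in group.findall("meaning")])
```

The grouping table is the part that needs human judgement (or the help discussed in 2) and 4)); the XML manipulation itself is mechanical.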

Best regards,
Olivier

--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
