
Re: [edict-jmdict] Domains

To: edict-jmdict@***************
Subject: Re: [edict-jmdict] Domains
From: "Jim Breen" <jimbreen@*********>
Date: Mon, 4 Jun 2007 18:33:45 +1000

On 02/06/07, Jeroen Hoek <mail@jeroenhoek.nl> wrote:

Can entries be tagged on the gloss-level with this approach? There are a
lot of words that have varying meanings depending on the field you find
them in. (e.g. 集合 has "set (math)" as one meaning).


At present that tagging is at the sense level, which I think is
appropriate. Multiple glosses within a sense really indicate different
ways of translating the Japanese, but the field indication applies to
the Japanese; not to one of the glosses.

(My apologies if the following sounds too much like a software engineer
thinking out loud)

You could start with a small set of toplevel categories and label
entries with those in time. All entries would be tagged as "/" (the root
node) at first. If the design allows for sub-categories to be created,
the tagging process could become somewhat intuitive. For example, if
there is a main category "Religion" with no sub-categories, one could
label 菩薩 (Bodhisattva) with "Religion/Buddhism".

For this to work the editor would have to show "Religion/Buddhism" as an
available tag after that. The domain-tree would grow dynamically this
way. If by accident two similar categories are created, they could
easily be merged of course.
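
A minimal sketch of the growing domain tree described above, assuming domains are stored as slash-separated paths and untagged entries sit at the root "/". The class and method names (DomainTree, tag, available, merge) are hypothetical and are not part of any existing JMdict tooling.

```python
ROOT = "/"


class DomainTree:
    """Hypothetical registry of slash-separated domain paths."""

    def __init__(self):
        self.paths = {ROOT}   # every domain path an editor could offer
        self.tags = {}        # entry -> domain path

    def tag(self, entry, path=ROOT):
        # Register the path and all of its ancestors, so a newly used
        # sub-category becomes available as a tag afterwards.
        parts = [p for p in path.split("/") if p]
        for depth in range(1, len(parts) + 1):
            self.paths.add("/".join(parts[:depth]))
        self.tags[entry] = "/".join(parts) or ROOT

    def available(self):
        return sorted(self.paths)

    def merge(self, duplicate, target):
        # Fold an accidental duplicate (and its sub-categories) into target.
        def remap(path):
            if path == duplicate or path.startswith(duplicate + "/"):
                return target + path[len(duplicate):]
            return path

        self.paths = {remap(p) for p in self.paths}
        self.tags = {e: remap(p) for e, p in self.tags.items()}


tree = DomainTree()
tree.tag("菩薩", "Religion/Buddhism")
tree.tag("仏", "Religion/Buddism")        # deliberately misspelt duplicate
tree.merge("Religion/Buddism", "Religion/Buddhism")
print(tree.available())
```

Storing full paths rather than parent pointers keeps merging a simple string rewrite, at the cost of repeating the parent name in every sub-category.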

The problem with this approach is that while it would allow a dictionary
interface to clearly display tagged glosses as being used in a certain
domain, it does not address the problem of how to prevent the user from
being overwhelmed with all sorts of specific terms when searching for
words. You could display these domains in a different colour or with
indenting, but wouldn't both the common コンピュータ and the more specific
文字符号系 fall into the same "Technology/Computing" domain?

One solution could be to say that コンピュータ is a "root-level word" in
the category "Technology/Computing" and should thus always be displayed
when searching for words starting with, for example, コン. 文字符号系
could be considered too specific for "root-level", but should show up at
least from "level 1" ("Technology"):

コンピュータ ("Technology/Computing", 0)
文字符号系 ("Technology/Computing", 1)

Ideally all matching words would show in a search, but tagged glosses
would be indented/coloured to indicate a specific sense. One nice thing
about this approach is that it isn't necessary to tag words unless they
are getting in the way of user-friendliness. A large set of very
specific terms for body parts could be tagged with ("Medicine/Human
body", 2) and not get in the way of the average user. Likewise, if one
wants to, it would be possible to tag 足 ("Medicine/Human body", 0),
which wouldn't change the way it is displayed at all; it only adds a
tag.
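
The (domain, level) pairs above could be read in more than one way. The sketch below assumes one reading: level 0 is always prominent, and level N is prominent only once the user has selected at least the first N components of the entry's domain path. The Entry type, the prominent and search functions, and the sample word 足根骨 are illustrative assumptions, not part of the proposal itself.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    word: str
    domain: str = "/"
    level: int = 0


ENTRIES = [
    Entry("コンピュータ", "Technology/Computing", 0),
    Entry("文字符号系", "Technology/Computing", 1),
    Entry("足", "Medicine/Human body", 0),
    Entry("足根骨", "Medicine/Human body", 2),   # illustrative specialist term
]


def prominent(entry, selected=""):
    """True if the entry should be shown without indenting or greying."""
    if entry.level == 0:
        return True
    required = entry.domain.split("/")[:entry.level]
    chosen = selected.split("/") if selected else []
    return chosen[:len(required)] == required


def search(entries, prefix, selected=""):
    # Every match is returned; the flag tells the interface which matches
    # to display normally and which to indent, colour, or hide.
    return [(e.word, prominent(e, selected))
            for e in entries if e.word.startswith(prefix)]


print(search(ENTRIES, "足"))
print(search(ENTRIES, "足", selected="Medicine/Human body"))
```

Returning a flag instead of filtering keeps the "all matching words show" behaviour and leaves the choice between hiding and indenting to the display layer.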


All useful and valid comments. One of my frustrations is that there is
no standard and accepted classification system for dictionary entries.
There have been several attempts, but nothing concrete seems to emerge.
The WordNet synsets are there, and used, but are quite limited.  Their
great advantage is that a "bag of words" attack could be done on JMdict
entries with a fair probability of getting alignment with the relevant
synset.
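
As a rough illustration of such a "bag of words" attack, the sketch below scores the English glosses of one sense against candidate synset definitions by word overlap (Jaccard similarity). The synset identifiers and definitions are made-up toy data, not real WordNet records, and the glosses for 集合 are assumed for the example; a real run would read both sides from the actual files.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "to", "in", "or", "and", "that", "is"}


def bag(text):
    """Lower-cased set of words in text, minus a few stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS}


def overlap(glosses, definition):
    """Jaccard similarity between the sense's glosses and a definition."""
    a, b = bag(" ".join(glosses)), bag(definition)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_synset(glosses, synsets):
    """Return the id of the candidate whose definition overlaps most."""
    return max(synsets, key=lambda name: overlap(glosses, synsets[name]))


# Toy candidates (hypothetical ids and wording).
CANDIDATES = {
    "set.math": "an abstract collection of numbers or symbols in mathematics",
    "set.tv": "an electronic device that receives television signals",
}

print(best_synset(["set (math)", "collection"], CANDIDATES))
```

Plain overlap misses pairs such as "math" and "mathematics", so stemming or a synonym list would probably be needed before the scores could be trusted.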

Yes, a hierarchy: Science->Chemistry->Organic Chemistry would be great.
Designing it would be fun - populating the database would be hell. In some
areas we can use existing dictionaries. For example I have a 科学 dictionary on
file which has field tags, e.g.:
nickel carbonyl ニッケルカルボニル [機械,化学]

The huge NTT 日本語の語彙特性 collection (which costs more than I've been
prepared to pay) has (I think) very detailed PoS and field tagging. It is
copyright, but I guess it could be "used" to assist our populating, provided
we didn't use the same system and didn't blatantly pinch their data.

One of my later-this-year tasks is incorporating the COMPDIC data into the
main JMdict/EDICT. I want to do this in a way that a useful subset can
be extracted.
Tagging for this purpose is a bit problematic. Take the word モデリング, which
in COMPDIC is "modelling/modeling". That's a very appropriate entry for
COMPDIC. It is also an entry in JMdict/EDICT, of course, because it has very
wide application. In the merge I can add a field tag {comp}, but that can't be
taken to mean that モデリング is *only* applicable to the computing domain.
OTOH, ユーザーアカウント *is* pretty much related to that domain alone.

If it wasn't for the desire to be able to generate focussed subsets, I'd
restrict the {comp} tag to just things in the computing domain. Any suggestions
of how to handle this would be good.
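
One possible way to reconcile the two uses, offered only as a sketch: record with each field tag whether the sense is confined to that domain. Subset extraction would then look at the tag alone, while display would show the tag only when the sense is domain-only. The Sense type, the boolean flag, and the function names are assumptions and do not reflect the current JMdict/EDICT format.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    word: str
    gloss: str
    # Maps a field tag to True when the sense is confined to that domain,
    # or False when the word is merely relevant to it.
    fields: dict = field(default_factory=dict)


ENTRIES = [
    Sense("モデリング", "modelling/modeling", {"comp": False}),
    Sense("ユーザーアカウント", "user account", {"comp": True}),
    Sense("足", "foot; leg"),
]


def subset(entries, tag):
    """Words for a focussed subset: any sense carrying the tag."""
    return [s.word for s in entries if tag in s.fields]


def displayed_tags(sense):
    """Tags to show to a reader: only the domain-only ones."""
    return [t for t, only in sense.fields.items() if only]


print(subset(ENTRIES, "comp"))
print(displayed_tags(ENTRIES[0]), displayed_tags(ENTRIES[1]))
```

With this split, モデリング still lands in the computing subset without being presented as a computing-only term.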

Cheers

Jim
--
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/
