
Re: [edict-jmdict] Typo Prevention

To: edict-jmdict@***************
Subject: Re: [edict-jmdict] Typo Prevention
From: Jim Rose <jim@*************>
Date: Tue, 14 Aug 2007 20:23:49 -0400


On Aug 14, 2007, at 6:29 PM, Jim Breen wrote:

On 15/08/07, Jim Rose <jim@kanjicafe.com> wrote:
> I just want to elaborate two disjoined thoughts I wrote about before
> that speak directly to the project:
>
> 1) Storing readings and yomigana really is redundant. If you have
> yomigana, no matter the parsing quality, the reading is implied.

This is true, but (correct me if I am misinterpreting here) if for the
entry for 発表 you only include はっ and ぴょう with some sort of
delimiters or wrappers, you lose easy access to the complete string
はっぴょう. To me as a lexicographer, はっぴょう is far more important
as the kana representation of the word 発表 than はっ and ぴょう are as
the 読み仮名 of 発 and 表 within that word. 読み仮名 and 振り仮名 are
teaching aids (overused IMNSHO) and in no way integral to a dictionary.

Yeah, I was thinking about that too, and these are the questions that came to my mind:

1) Is there a technique in most programming languages, or in SQL, which can be used to ignore delimiters in string searches (so that に.ほん.ご.じ.てん would be searched as if it were にほんごじてん)?

And

2) Is JMDICT the actual file that people will search, or, because of its XML complexity, is it more likely that the end user is going to import it into MySQL or some other DB system (after simplifying it through some XML filter), at which point non-delimited versions of the string could be built into the end-user DB (or generated when JMDICT is exported to EDICT format)?

If the answer to #1 is yes, then replacing the reading in JMDICT with delimited yomigana is viable, though perhaps not preferable.

If the answer to #2 is that end users import it into a database, the case is a little stronger on file-size grounds.
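
To make the two questions concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the column names, and the use of "." as the delimiter are assumptions for illustration only, not anything JMDICT defines. The first query ignores delimiters at search time with SQL's REPLACE function (question 1); the second searches a non-delimited column built once at import time (question 2).

```python
import sqlite3

DELIM = "."  # assumed delimiter between per-kanji readings


def strip_delimiters(segmented: str, delim: str = DELIM) -> str:
    """Collapse a delimited yomigana string into the plain reading."""
    return segmented.replace(delim, "")


# Hypothetical end-user table: segmented yomigana plus a derived plain reading.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (kanji TEXT, yomigana TEXT, reading TEXT)")

rows = [("日本語辞典", "に.ほん.ご.じ.てん"), ("発表", "はっ.ぴょう")]
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [(kanji, yomi, strip_delimiters(yomi)) for kanji, yomi in rows],
)

# Question 1: ignore the delimiters at query time.
by_replace = conn.execute(
    "SELECT kanji FROM entry WHERE REPLACE(yomigana, '.', '') = ?",
    ("にほんごじてん",),
).fetchall()

# Question 2: search the non-delimited column built during import.
by_column = conn.execute(
    "SELECT kanji FROM entry WHERE reading = ?",
    ("はっぴょう",),
).fetchall()

print(by_replace, by_column)
```

The query-time REPLACE keeps only one copy of the data but generally cannot use an ordinary index on the yomigana column, while the derived column costs some space in the end-user DB and makes the lookup a plain equality match.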

> 2) It would be a nice feature such that when you went to save a new
> entry into EDICT that a yomigana parser verified that it could parse
> your entry. That would be a simple way to catch many typos. Unless
> say you already checked off "irregular", if it couldn't parse it
> might ask you "are you sure there are no typos because the reading
> seems irregular"... And then maybe you catch something before it
> gets into the system.

Once we are "online" I envisage having bots that come into play at various times and do all sorts of validation/housekeeping functions. A 読みがな parser could well be among such a bot collection.
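
As a rough illustration of what such a check might look like, the sketch below tries to split a reading into per-character pieces using a tiny hand-written reading table and warns when no split exists. Everything here is hypothetical: the READINGS table, the function names, and the wording of the warning. A real parser would need a full kanji reading table and rules for sound changes rather than listing variants such as はっ and ぴょう by hand.

```python
from typing import Dict, List, Optional

# Toy reading table (assumption); sound-changed variants are listed explicitly.
READINGS: Dict[str, List[str]] = {
    "発": ["はつ", "はっ", "ほつ"],
    "表": ["ひょう", "ぴょう", "おもて"],
    "日": ["に", "にち", "ひ"],
    "本": ["ほん", "もと"],
    "語": ["ご"],
}


def parse_yomigana(word: str, reading: str) -> Optional[List[str]]:
    """Return one reading segment per character of word, or None if no parse."""
    if not word:
        return [] if not reading else None
    head, rest = word[0], word[1:]
    # Characters missing from the table (e.g. kana) must match themselves.
    for candidate in READINGS.get(head, [head]):
        if reading.startswith(candidate):
            tail = parse_yomigana(rest, reading[len(candidate):])
            if tail is not None:
                return [candidate] + tail
    return None


def check_entry(word: str, reading: str, irregular: bool = False) -> Optional[str]:
    """Return a warning for the submitter, or None if the entry looks fine."""
    if irregular or parse_yomigana(word, reading) is not None:
        return None
    return (
        f"Could not parse {reading} against {word}. Are you sure there are "
        "no typos? The reading seems irregular."
    )


print(parse_yomigana("発表", "はっぴょう"))
print(check_entry("発表", "はっぴゅう"))
```

An entry already flagged as irregular skips the check, which mirrors the "irregular" checkbox suggested in the quoted message.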


Gets more exciting here every day then.

