
Re: [edict-jmdict] Typo Prevention




On Aug 14, 2007, at 6:29 PM, Jim Breen wrote:

On 15/08/07, Jim Rose <jim@kanjicafe.com> wrote:
> I just want to elaborate two disjoined thoughts I wrote about before
> that speak directly to the project:
>
> 1) Storing readings and yomigana really is redundant. If you have
> yomigana, no matter the parsing quality, the reading is implied.

This is true, but (correct me if I am misinterpreting here) if for the
entry for 発表 you only include はっ and ぴょう with some sort of
delimiters or wrappers, you lose easy access to the complete string
はっぴょう. To me as a lexicographer, はっぴょう is far more important as the kana representation of the word 発表 than はっ and ぴょう are as the 読み仮名 of 発 and 表 within that word. 読み仮名 and 振り仮名 are teaching aids (overused IMNSHO) and in no way integral to a dictionary.

Yeah, I was thinking about that too, and two questions came to mind:

1) Is there a technique in most programming languages, or in SQL, that can be used to ignore delimiters in string searches (so that に.ほん.ご.じ.てん would be searched as if it were にほんごじてん)?

And

2) Is JMDICT the actual file that people will search, or, because of its XML complexity, is it more likely that the end user will import it into MySQL or some other DB system (after simplifying it through some XML filter), at which point non-delimited versions of the string could be built into the end-user DB (or generated when JMDICT is exported to EDICT format)?

If the answer to #1 is yes, then replacing the reading in JMDICT with delimited yomigana is viable, though perhaps not preferable.

If the answer to #2 is that most users import the file into their own DB, the case is a little stronger on file-size grounds.
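
To make both questions concrete, here is a minimal sketch in Python using the standard sqlite3 module. It assumes "." as the delimiter (as in the に.ほん.ご.じ.てん example) and a made-up table layout; the table name, column names, and function names are illustrative only and are not part of JMDICT or EDICT. It shows the two options: stripping delimiters at query time with SQL's replace(), and storing a non-delimited copy at import time.

```python
import sqlite3

DELIMITER = "."  # assumed delimiter, taken from the に.ほん.ご.じ.てん example


def strip_delimiters(yomigana: str, delimiter: str = DELIMITER) -> str:
    """Return the plain reading implied by a delimited yomigana string."""
    return yomigana.replace(delimiter, "")


# Hypothetical end-user database: one row per entry.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (kanji TEXT, yomigana TEXT, reading TEXT)")

sample = [("日本語辞典", "に.ほん.ご.じ.てん"), ("発表", "はっ.ぴょう")]
# Question 2: build the non-delimited reading once, at import time.
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [(kanji, yomi, strip_delimiters(yomi)) for kanji, yomi in sample],
)


def find_by_stored_reading(reading: str) -> list:
    """Search the precomputed, non-delimited column."""
    cur = conn.execute("SELECT kanji FROM entry WHERE reading = ?", (reading,))
    return [row[0] for row in cur]


def find_ignoring_delimiters(reading: str) -> list:
    """Question 1: ignore delimiters at query time with SQL replace()."""
    cur = conn.execute(
        "SELECT kanji FROM entry WHERE replace(yomigana, ?, '') = ?",
        (DELIMITER, reading),
    )
    return [row[0] for row in cur]
```

The query-time version cannot use an ordinary index on the yomigana column, so for a large dictionary the precomputed column is probably the more practical of the two.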



> 2) It would be a nice feature such that when you went to save a new
> entry into EDICT that a yomigana parser verified that it could parse
> your entry. That would be a simple way to catch many typos. Unless
> say you already checked off "irregular", if it couldn't parse it
> might ask you "are you sure there are no typos because the reading
> seems irregular"... And then maybe you catch something before it
> gets into the system.

Once we are "online" I envisage having bots that come into play at various times and do all sorts of validation/housekeeping functions. A 読みがな parser could well be among such a bot collection.
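
As a rough illustration of the kind of check such a parser might perform, here is a sketch in Python. The per-kanji reading table is a tiny hand-made sample, and the names KANJI_READINGS, parse_yomigana and needs_review are hypothetical; a real tool would need a full kanji reading table plus rules for sound changes and irregular readings. The idea is simply to try to split the reading across the characters of the headword and to flag the entry if no split works.

```python
# Hypothetical sample table: kanji -> readings it may take inside a compound.
# A real table would be far larger and derived from a kanji dictionary.
KANJI_READINGS = {
    "発": ["はつ", "はっ", "ぱつ", "ほつ"],
    "表": ["ひょう", "ぴょう", "おもて"],
    "日": ["にち", "に", "ひ", "び"],
    "本": ["ほん", "ぽん", "ぼん", "もと"],
    "語": ["ご"],
}


def parse_yomigana(word, reading):
    """Split `reading` across the characters of `word` by backtracking.

    Returns a list with one reading per character, or None when no
    segmentation is consistent with the table.
    """
    if not word:
        # Success only if the reading has been used up as well.
        return [] if not reading else None
    head, rest = word[0], word[1:]
    # Characters missing from the table (e.g. kana) must match themselves.
    candidates = KANJI_READINGS.get(head, [head])
    for candidate in candidates:
        if reading.startswith(candidate):
            tail = parse_yomigana(rest, reading[len(candidate):])
            if tail is not None:
                return [candidate] + tail
    return None


def needs_review(word, reading, irregular=False):
    """True when the entry should prompt an "are you sure?" question."""
    return not irregular and parse_yomigana(word, reading) is None
```

With this sample table, parse_yomigana("発表", "はっぴょう") yields the split はっ / ぴょう, while a mistyped reading such as はっぴゅう cannot be split and would be flagged unless the entry had already been marked irregular.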

Gets more exciting here every day then.