Re: [edict-jmdict] Typo Prevention
On Aug 14, 2007, at 6:29 PM, Jim Breen wrote:
On 15/08/07, Jim Rose <jim@kanjicafe.com> wrote:
> I just want to elaborate two disjoined thoughts I wrote about before
> that speak directly to the project:
>
> 1) Storing readings and yomigana really is redundant. If you have
> yomigana, no matter the parsing quality, the reading is implied.
This is true, but (correct me if I am misinterpreting here) if in the
entry for 発表 you only include はっ and ぴょう with some sort of
delimiters or wrappers, you lose easy access to the complete string
はっぴょう. To me as a lexicographer, はっぴょう is far more important as
the kana representation of the word 発表 than はっ and ぴょう are as the
読み仮名 of 発 and 表 within that word. 読み仮名 and 振り仮名 are
teaching aids (overused IMNSHO) and in no way integral to a
dictionary.
Yeah, I was thinking about that too, and the questions which came to
my mind:
1) Is there a technique in most programming languages, or in SQL,
which can be used to ignore delimiters in string searches (so
that に.ほん.ご.じ.てん would be searched as if it were
にほんごじてん)?
And
2) Is JMDICT the actual file that people will search, or, because of
its XML complexity, is it more likely that the end user is
going to import it into MySQL or some other DB system (after
decomplexing it through some XML filter), at which time non-delimited
versions of the string could be built into the end-user DB (or when
JMDICT is exported to EDICT format)?
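For question 1, one common approach is simply to strip the delimiter from both sides before comparing. A minimal Python sketch, assuming "." is the delimiter as in the example above (the function names are made up for illustration):

```python
DELIMITER = "."  # assumed delimiter, as in に.ほん.ご.じ.てん


def strip_delimiters(reading: str, delimiter: str = DELIMITER) -> str:
    """Return the reading with every delimiter character removed."""
    return reading.replace(delimiter, "")


def matches(query: str, delimited_reading: str) -> bool:
    """True if the query occurs in the reading, ignoring delimiters on both sides."""
    return strip_delimiters(query) in strip_delimiters(delimited_reading)
```

With this, a search for にほんごじてん finds an entry stored as に.ほん.ご.じ.てん, at the cost of normalizing each candidate string at search time.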
If the answer is yes to #1, then replacing the reading in JMDICT
is viable, though perhaps not preferable.
If the answer to #2 is that end users import the file into their own
DB, the case is a little bit stronger from a file size consideration.
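To illustrate the import scenario in #2, here is a rough sketch using Python's built-in sqlite3 module rather than MySQL. The table layout and column names are hypothetical; the point is only that the non-delimited reading can be derived once at import time with SQL's replace() function, so it need not be stored in the source file:

```python
import sqlite3


def build_db(entries):
    """Load (kanji, delimited yomigana) pairs into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: "reading" is derived from "yomigana" at import time.
    conn.execute("CREATE TABLE entry (kanji TEXT, yomigana TEXT, reading TEXT)")
    conn.executemany(
        "INSERT INTO entry (kanji, yomigana, reading) "
        "VALUES (?, ?, replace(?, '.', ''))",
        [(kanji, yomi, yomi) for kanji, yomi in entries],
    )
    return conn


def lookup(conn, reading):
    """Return the kanji forms whose non-delimited reading equals the query."""
    rows = conn.execute("SELECT kanji FROM entry WHERE reading = ?", (reading,))
    return [row[0] for row in rows]
```

For example, after build_db([("発表", "はっ.ぴょう")]), lookup(conn, "はっぴょう") finds 発表 even though only the delimited form was supplied.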
> 2) It would be a nice feature such that when you went to save a new
> entry into EDICT that a yomigana parser verified that it could parse
> your entry. That would be a simple way to catch many typos. Unless
> say you already checked off "irregular", if it couldn't parse it
> might ask you "are you sure there are no typos because the reading
> seems irregular"... And then maybe you catch something before it
> gets into the system.
Once we are "online" I envisage having bots that come into play at
various times and do all sorts of validation/housekeeping functions.
A 読みがな parser could well be among such a bot collection.
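As a very rough idea of what such a check might look like, here is a Python sketch that tries to split a reading across the kanji of a word and returns None when it cannot, which is the point where a bot or entry form could ask "are you sure there are no typos?". The reading table is a tiny hypothetical sample with sound-changed variants (はっ, ぴょう) listed by hand; a real parser would need a full per-kanji reading table and would also have to handle okurigana:

```python
# Hypothetical sample table: kanji -> readings it may take inside a word.
KNOWN_READINGS = {
    "発": ["はつ", "はっ", "ほつ"],
    "表": ["ひょう", "ぴょう", "おもて"],
}


def parse_yomigana(word, reading, table=KNOWN_READINGS):
    """Return one per-kanji split of the reading, or None if no split exists."""
    if not word:
        # Success only if the reading has been fully consumed as well.
        return [] if not reading else None
    head, rest = word[0], word[1:]
    for candidate in table.get(head, []):
        if reading.startswith(candidate):
            tail = parse_yomigana(rest, reading[len(candidate):], table)
            if tail is not None:
                return [candidate] + tail
    return None
```

Here parse_yomigana("発表", "はっぴょう") yields the split はっ + ぴょう, while a typo such as はっぴよう (large よ) fails to parse and would trigger the warning.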
Gets more exciting here every day then.