Re: [edict-jmdict] Determining Language via string analysis
[Igor Skochinsky (Re: [edict-jmdict] Determining Language via string analysis) writes:]
>Mathieu Tozer asked:
>> m> I'm wondering if anyone knew of any existing tools out there that might be able to tell me,
>> m> given the string of a word or a bunch of words, which language it is in?
>> Here's a tool, except it's not free: http://www.basistech.com/language-identification/
Igor beat me to it. That's an Industrial Strength tool, such as is used
by Google, Yahoo et al. Such tools usually have a collection of
short strings (known in the trade as "n-grams") with known
occurrence and frequency patterns in various languages. The text
is classified by comparing its own n-grams against those patterns.
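
As a rough illustration of the n-gram idea (not how any commercial
tool actually works), here is a minimal Python sketch. It assumes
character trigrams and cosine similarity, and the three sample
sentences are made-up toy training data; a real profile would be
built from a much larger corpus per language.

```python
from collections import Counter
import math

def trigrams(text):
    """Count overlapping 3-character substrings of the lower-cased text."""
    padded = f"  {text.lower()} "  # pad so word boundaries become n-grams too
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Toy training text (an assumption for this sketch, far too small for real use).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "the cat sat on the mat with the other animals",
    "french": "le renard brun saute par-dessus le chien paresseux et puis "
              "le chat est assis sur le tapis avec les autres animaux",
    "german": "der schnelle braune fuchs springt über den faulen hund und "
              "dann sitzt die katze auf der matte mit den anderen tieren",
}
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}

def guess_language(text):
    """Return the language whose trigram profile is most similar to the text."""
    probe = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(probe, PROFILES[lang]))

print(guess_language("le chien et le chat"))
```

With single words or very short strings the trigram counts are
sparse, so the guess becomes much less reliable.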
If your range of languages is restricted, you may be able to use
simple tests. For example, only Japanese is likely to have much
kana, if any; hangul is used only in Korean; and many Western European
languages can be reliably detected by the types and frequencies
of their diacritic characters.
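
Those simple tests might look something like the Python sketch
below. The Unicode ranges are the standard Hiragana/Katakana and
Hangul blocks; the diacritic table and the function names are
hypothetical and deliberately tiny, just to show the shape of the
test, and would need tuning for whichever languages you care about.

```python
def script_hint(text):
    """Guess Japanese or Korean from the scripts present, else None."""
    # Hiragana U+3040-U+309F and Katakana U+30A0-U+30FF are adjacent blocks.
    if any("\u3040" <= ch <= "\u30ff" for ch in text):
        return "japanese"
    # Hangul syllables U+AC00-U+D7A3, plus Hangul Jamo U+1100-U+11FF.
    if any("\uac00" <= ch <= "\ud7a3" or "\u1100" <= ch <= "\u11ff"
           for ch in text):
        return "korean"
    return None

# Hypothetical, incomplete table of fairly distinctive characters.
DIACRITIC_HINTS = {
    "german": set("äöüß"),
    "french": set("çœêèàù"),
    "spanish": set("ñ¿¡"),
    "portuguese": set("ãõ"),
}

def diacritic_hint(text):
    """Return the language whose marker characters occur most often, else None."""
    lowered = text.lower()
    counts = {lang: sum(ch in marks for ch in lowered)
              for lang, marks in DIACRITIC_HINTS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

print(script_hint("ジム・ブリーン"), diacritic_hint("straße"))
```

Kana is checked before anything else because Japanese text usually
mixes kana with kanji, and the kanji alone would not distinguish it
from Chinese.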
Jim
--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology, Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学