Language detection module

AR

Is there any module or script that can detect the language of a text with
100% accuracy? For example English, German, French, ... (European languages,
or at least English).
 
Ben Morrow

AR said:
Does exist any module/script that can 100% detect text language..
for example English, German, French, ... (European languages, at least
English...)

100%? No. What language is this string: "hotel"?

Ben
 
Martin Quensel

J.B. Moreno said:
Start by adding all words from all the dictionaries in the world in a file.
Then using statistics you get the most likely one.

or why not just?

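#!/usr/bin/perl
use strict;
use warnings;

# Gives the same answer for every input, so it never names a wrong language.
print "String is in any known language or some constructed language such ",
      "as Esperanto, Volapuk, Glosa, Loglan, or even Klingon.\n";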


Now that would almost certainly cover 95% of all the languages (I missed
adding the Tolkien languages, but I leave that as a programming exercise).
But I'm not sure if it's 100% future proof. The "any known language"
could be interpreted as "known" to the person running the program.

Best Regards
Martin Quensel
 
J.B. Moreno

Martin Quensel said:
Start by adding all words from all the dictionaries in the world in a
file. Then using statistics you get the most likely one.

The phrases "100%" and "most likely one" aren't equivalent.

And look up the James Nicoll quote on the purity of the English
language.
 
Anno Siegel

Ben Morrow said:
100%? No. What language is this string: "hotel"?

Well, one-word-samples are hard, and 100% is unattainable.

Entirely off topic, I have recently heard of an approach to text
classification (with an eye to language recognition) that I found
interesting.

Use a Ziv-Lempel-like method to compress your sample. Then concatenate
it with texts of similar lengths taken from known languages and compress
again. If the compression rate is similar or better than that of the
original text, the appended text is similar to the original one. If
the compression deteriorates, the texts are dissimilar.

The source (some idle chat on IRC, sorry) said that this works for
rather small samples of fewer than a hundred words. I have always been
meaning to play with it, but haven't got around to it.

Anno
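
The compression idea above can be tried out in a few lines. What follows is a minimal Python sketch, not a tested classifier. It uses zlib (a Lempel-Ziv based compressor) and a simplified variant of the scheme described: instead of comparing compression rates, it measures how many extra compressed bytes the sample costs when it is appended to a reference text in each language, and picks the language where that cost is smallest. The `REFERENCES` texts and the function names are made up for illustration; real use would need far larger reference samples.

```python
import zlib

# Hypothetical, tiny reference texts (assumption: real use needs much more text).
REFERENCES = {
    "english": ("the weather is nice today and we are going to the market to buy "
                "some bread and cheese because there is nothing left in the house "
                "and the children are hungry"),
    "german": ("das wetter ist heute schön und wir gehen auf den markt um brot und "
               "käse zu kaufen weil nichts mehr im haus ist und die kinder hunger haben"),
    "french": ("il fait beau aujourd'hui et nous allons au marché pour acheter du "
               "pain et du fromage parce qu'il n'y a plus rien à la maison et les "
               "enfants ont faim"),
}


def compressed_size(text):
    """Number of bytes zlib needs for the UTF-8 encoded text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(reference, sample):
    """Extra compressed bytes caused by appending the sample to the reference."""
    return compressed_size(reference + " " + sample) - compressed_size(reference)


def guess_language(sample, references):
    """Return the language whose reference text absorbs the sample most cheaply."""
    return min(references, key=lambda lang: extra_bytes(references[lang], sample))


if __name__ == "__main__":
    print(guess_language("we are going to the house because the children are hungry",
                         REFERENCES))
```

The smaller the extra cost, the more letter sequences the sample shares with the reference text. With references this short the result is only a toy demonstration, and a one-word sample such as "hotel" will still be a coin toss.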
 
Eric Wilhelm

The phrases "100%" and "most likely one" aren't equivalent

This is true, but in the real world, something which gives a 99.9%
probability is about as good as we are going to get. No sense in
refusing to use a circle simply because it is impossible to make a
perfect one.

IMO, 99.9% might be a low estimate even if the program takes a naive
approach. If the dictionaries include "adopted" phrases (e.g. Latin
expressions which are often cited in English, etc.) and some kind of
best-fit spell check is used, you might push the probabilities into
99.99%. Now feed some works of literature from each language into a
phrase-counter and use phrases as well, and you might find that a text of
100 words or more can be predicted correctly 99.9999% of the time.

If that isn't good enough (missing 1 of 10^6), you're going to be working
on the thing for so long that half of the languages in use at its
conception are out of use before you reach the prototype.

--Eric
 
Malcolm Dew-Jones

Ben Morrow wrote:

: > Does exist any module/script that can 100% detect text language..
: > for example English, German, French, ... (European languages, at least
: > English...)

: 100%? No. What language is this string: "hotel"?

I can say with 100% certainty that that is an English word.
 
Malcolm Dew-Jones

J.B. Moreno wrote:

: > J.B. Moreno wrote:
: > >
: > >>
: > >>>Does exist any module/script that can 100% detect text language..
: > >>>for example English, German, French, ... (European languages, at least
: > >>>English...)
: > >>
: > >>100%? No. What language is this string: "hotel"?
: > >
: > > Swahili?
: >
: > Start by adding all words from all the dictionaries in the world in a
: > file. Then using statistics you get the most likely one.

: The phrases "100%" and "most likely one" aren't equivalent.

: And look up the James Nicoll quote on the purity of the english
: language.

Every language is 100% pure all the time - they are moving targets defined
by their own use.
 
Malcolm Dew-Jones

Anno Siegel wrote:
: >
: > > Does exist any module/script that can 100% detect text language..
: > > for example English, German, French, ... (European languages, at least
: > > English...)
: >
: > 100%? No. What language is this string: "hotel"?

: Well, one-word-samples are hard, and 100% is unattainable.

: Entirely off topic, I have recently heard of an approach to text
: classification (with an eye to language recognition) that I found
: interesting.

: Use a Ziv-Lempel-like method to compress your sample. Then concatenate
: it with texts of similar lengths taken from known languages and compress
: again. If the compression rate is similar or better than that of the
: original text, the appended text is similar to the original one. If
: the compression deteriorates, the texts are dissimilar.

: The source (some idle chat on IRC, sorry) said that this works for
: rather small samples of fewer than a hundred words. I have always been
: meaning to play with it, but haven't got around.

: Anno

Sounds reasonable, basically it would be testing for similarity of letter
sequences.

I might also suggest using a Bayesian filter such as ifile or similar.
They try to file each message into the correct one of many folders. (I've
never used ifile, just read of it.)

You would provide samples in the languages you anticipate and then let the
filter categorize each document.

$0.02
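
To make the Bayesian filter suggestion concrete, here is a minimal Python sketch of a naive Bayes classifier over character trigrams. It is not ifile's actual implementation and it makes several assumptions: equal prior probability for every language, add-one smoothing for unseen trigrams, and a hypothetical class name (`NaiveBayesLanguageFilter`) with toy training sentences chosen only for illustration.

```python
import math
from collections import Counter


def trigrams(text):
    """Split lowercased text into overlapping three-character sequences."""
    padded = "  " + text.lower() + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class NaiveBayesLanguageFilter:
    """Toy naive Bayes classifier; one 'folder' per language."""

    def __init__(self):
        self.counts = {}         # language -> Counter of trigram frequencies
        self.vocabulary = set()  # every trigram seen during training

    def train(self, language, text):
        grams = trigrams(text)
        self.counts.setdefault(language, Counter()).update(grams)
        self.vocabulary.update(grams)

    def classify(self, text):
        grams = trigrams(text)
        slots = len(self.vocabulary) + 1  # one extra slot for unseen trigrams
        best_language, best_score = None, float("-inf")
        for language, counter in self.counts.items():
            total = sum(counter.values())
            # Sum of log probabilities with add-one smoothing; equal priors assumed.
            score = sum(math.log((counter[g] + 1) / (total + slots)) for g in grams)
            if score > best_score:
                best_language, best_score = language, score
        return best_language


if __name__ == "__main__":
    nb = NaiveBayesLanguageFilter()
    # Hypothetical training samples; a real filter needs far more text.
    nb.train("english", "the weather is nice today and we are going to the market "
                        "because the children are hungry")
    nb.train("german", "das wetter ist heute schön und wir gehen auf den markt "
                       "weil die kinder hunger haben")
    print(nb.classify("the children are going to the market"))
```

As described above, you supply samples in the languages you anticipate (the `train` calls) and then let the filter put each new document into the most probable category.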
 
Joe Smith

Malcolm said:
Ben Morrow wrote:

: > Does exist any module/script that can 100% detect text language..
: > for example English, German, French, ... (European languages, at least
: > English...)

: 100%? No. What language is this string: "hotel"?

I can say with 100% certainty that that is an english word.

Taxi!
 
Andreas Marcel Riechert

Entirely off topic, I have recently heard of an approach to text
classification (with an eye to language recognition) that I found
interesting.

Use a Ziv-Lempel-like method to compress your sample. Then concatenate
it with texts of similar lengths taken from known languages and compress
again. If the compression rate is similar or better than that of the
original text, the appended text is similar to the original one. If
the compression deteriorates, the texts are dissimilar.

The source (some idle chat on IRC, sorry) said that this works for
rather small samples of fewer than a hundred words. I have always been
meaning to play with it, but haven't got around.


Probably:
Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto
"Language Trees and Zipping". in: Physical Review Letters, January 28, 2002.
(http://link.aps.org/abstract/PRL/v88/e048702)

http://arxiv.org/abs/cond-mat/0108530 (The Paper)
http://arxiv.org/abs/cond-mat/0202383 (One very critical answer)

Cheers,

Andreas
 
Anno Siegel

Malcolm Dew-Jones said:
Anno Siegel wrote:
: >
: > > Does exist any module/script that can 100% detect text language..
: > > for example English, German, French, ... (European languages, at least
: > > English...)
: >
: > 100%? No. What language is this string: "hotel"?

: Well, one-word-samples are hard, and 100% is unattainable.

: Entirely off topic, I have recently heard of an approach to text
: classification (with an eye to language recognition) that I found
: interesting.

: Use a Ziv-Lempel-like method to compress your sample. Then concatenate
: it with texts of similar lengths taken from known languages and compress
: again. If the compression rate is similar or better than that of the
: original text, the appended text is similar to the original one. If
: the compression deteriorates, the texts are dissimilar.

: The source (some idle chat on IRC, sorry) said that this works for
: rather small samples of fewer than a hundred words. I have always been
: meaning to play with it, but haven't got around.

: Anno

Sounds reasonable, basically it would be testing for similarity of letter
sequences.

That's the idea. Trouble is, it would take quite some research on how
co-compressibility actually varies with text samples to tune the parameters
you need to make decisions. That's what's stopping me from "playing"
with it; I'll leave that for someone with a diploma in statistics (or
the need for one).

I might also suggest using a Bayesian filter such as ifile or similar.

Yes, it came up as an alternative in a discussion of Bayesian spam filters.

They try to file each message into the correct one of many folders. (I've
never used ifile, just read of it.)

You would provide samples in the languages you anticipate and then let the
filter categorize each document.

All these treat the problem of language identification as a case of
general text classification. Specific methods may apply, such as testing
for frequent key words in each language. For some (inflecting) languages,
an analysis of word endings may be highly distinctive. And so on, since
we're off topic. If these fail, one might decide not to decide, or fall
back on more expensive text classification, as described above.

Anno
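
The key word test, together with the option to "decide not to decide", might look roughly like the following Python sketch. It covers only the frequent key word part, not the word ending analysis. The word lists, the thresholds (`min_hits`, `min_margin`) and the function names are all assumptions picked for illustration, not tuned values.

```python
import re
from collections import Counter

# Hypothetical lists of very frequent function words per language.
KEY_WORDS = {
    "english": {"the", "and", "of", "is", "to", "in", "that", "it"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "french": {"le", "la", "les", "et", "est", "un", "une", "des"},
}


def keyword_hits(text):
    """Count, per language, how many words of the text are known key words."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    hits = Counter()
    for language, key_words in KEY_WORDS.items():
        hits[language] = sum(1 for word in words if word in key_words)
    return hits


def identify(text, min_hits=2, min_margin=2):
    """Return a language name, or None when the evidence is too thin."""
    ranked = keyword_hits(text).most_common()
    (best, best_hits), (_, second_hits) = ranked[0], ranked[1]
    if best_hits < min_hits or best_hits - second_hits < min_margin:
        return None  # decide not to decide; a caller could fall back to a costlier method
    return best


if __name__ == "__main__":
    print(identify("the cat is in the garden and it is happy"))
    print(identify("hotel"))
```

Returning `None` for a sample like "hotel" is the point of the design: a cheap test answers only when it is reasonably sure, and everything else can be handed to one of the more expensive classifiers.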
 
Anno Siegel

Andreas Marcel Riechert said:
Probably:
Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto
"Language Trees and Zipping". in: Physical Review Letters, January 28, 2002.
(http://link.aps.org/abstract/PRL/v88/e048702)

http://arxiv.org/abs/cond-mat/0108530 (The Paper)
http://arxiv.org/abs/cond-mat/0202383 (One very critical answer)

I believe those names were mentioned, thanks for the reference.

I haven't read the papers yet, but I notice that part of the reply is
about the article being off topic in Physical Review Letters. Some
things won't change, no matter what the medium...

Anno
 
Tad McClellan

Anno Siegel said:
Now reverse.


What is going on here?

I've heard of "scalar context".

I've heard of "list context".

What is this "silly context" that seems to have taken over this thread?
 
Jürgen Exner

Malcolm said:
I can say with 100% certainty that that is an english word.

If you had said "It is a word of the English language", then I would
have concurred.
However, "an English word"? No.

jue
 
