Language detection module

AR · Jan 21, 2004

Does any module/script exist that can detect the language of a text with
100% accuracy? For example English, German, French, ... (European languages,
at least English...)

Ben Morrow · Jan 21, 2004

AR said:
Does exist any module/script that can 100% detect text language..
for example English, German, French, ... (European languages, at least
English...)

100%? No. What language is this string: "hotel"?

Ben

J.B. Moreno · Jan 21, 2004

Ben Morrow said:
100%? No. What language is this string: "hotel"?

Swahili?

Alan J. Flavell · Jan 21, 2004

100%? No. What language is this string: "hotel"?

Yeah, ask a German speaker what language this is: "Gift".

Tad McClellan · Jan 21, 2004

Ben Morrow said:
100%? No. What language is this string: "hotel"?

Military? (The letter "H"?)

Martin Quensel · Jan 21, 2004

J.B. Moreno said:
Swahili?

Start by putting all the words from all the dictionaries in the world into
a file. Then, using statistics, you get the most likely language.

Or why not just:

#!/usr/bin/perl
use strict;
use warnings;

# Vacuously true for any input.
print "String is in any known language or some constructed language such ",
      "as Esperanto, Volapuk, Glosa, Loglan, or even Klingon.\n";

Now that would almost certainly cover 95% of all the languages (I missed
adding the Tolkien languages, but I leave that as a programming exercise).
But I'm not sure if it's 100% future proof. The "any known language"
could be interpreted as "known" to the person running the program.

Best Regards
Martin Quensel

J.B. Moreno · Jan 22, 2004

Martin Quensel said:
Start by adding all words from all the dictionaries in the world in a
file. Then using statistics you get the most likely one.

The phrases "100%" and "most likely one" aren't equivalent.

And look up the James Nicoll quote on the purity of the English
language.

Anno Siegel · Jan 22, 2004

Ben Morrow said:
100%? No. What language is this string: "hotel"?

Well, one-word-samples are hard, and 100% is unattainable.

Entirely off topic, I have recently heard of an approach to text
classification (with an eye to language recognition) that I found
interesting.

Use a Ziv-Lempel-like method to compress your sample. Then concatenate
it with texts of similar lengths taken from known languages and compress
again. If the compression rate is similar or better than that of the
original text, the appended text is similar to the original one. If
the compression deteriorates, the texts are dissimilar.

The source (some idle chat on IRC, sorry) said that this works for
rather small samples of fewer than a hundred words. I have always been
meaning to play with it, but haven't got around to it.
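
A minimal Python sketch of the idea, for anyone who wants to play with it. It is a variation under stated assumptions rather than the exact procedure described above: instead of comparing compression rates, it measures how many extra compressed bytes the sample costs when appended to each reference text, and picks the language where that cost is smallest. The names `REFERENCES`, `extra_bytes`, and `guess_language`, and the toy reference texts, are made up for illustration; real use would need much longer reference samples and some tuning.

```python
import zlib

# Toy reference texts (illustrative only; far too short for real use).
REFERENCES = {
    "english": (
        "The weather was cold this morning, so we stayed inside and read "
        "the newspaper. After lunch the children went out to play in the "
        "garden with their friends."
    ),
    "german": (
        "Das Wetter war heute Morgen kalt, also blieben wir drinnen und "
        "lasen die Zeitung. Nach dem Mittagessen gingen die Kinder mit "
        "ihren Freunden in den Garten spielen."
    ),
}


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib (LZ77-based) compression of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(reference: str, sample: str) -> int:
    """Extra compressed bytes needed once sample is appended to reference."""
    return compressed_size(reference + " " + sample) - compressed_size(reference)


def guess_language(sample: str, references: dict) -> str:
    """Pick the reference language under which the sample is cheapest to encode."""
    return min(references, key=lambda lang: extra_bytes(references[lang], sample))


if __name__ == "__main__":
    print(guess_language("After lunch the children went out to play", REFERENCES))
```

The smaller the extra cost, the more letter sequences the sample shares with the reference. With references this short the result is only meaningful for samples that overlap them heavily.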

Anno

Eric Wilhelm · Jan 22, 2004

The phrases "100%" and "most likely one" aren't equivalent

This is true, but in the real world, something which gives a 99.9%
probability is about as good as we are going to get. No sense in
refusing to use a circle simply because it is impossible to make a
perfect one.

IMO, 99.9% might be a low estimate even if the program takes a naive
approach. If the dictionaries include "adopted" phrases (e.g. Latin
expressions which are often cited in English, etc.) and some kind of
best-fit spell check is used, you might push the probability up to
99.99%. Now feed some works of literature from each language into a
phrase-counter and use phrases as well, and you might find that a text of
100 words or more can be predicted correctly 99.9999% of the time.

If that isn't good enough (missing 1 of 10^6), you're going to be working
on the thing for so long that half of the languages in use at its
conception are out of use before you reach the prototype.

--Eric

Malcolm Dew-Jones · Jan 22, 2004

Ben Morrow wrote:

: > Does exist any module/script that can 100% detect text language..
: > for example English, German, French, ... (European languages, at least
: > English...)

: 100%? No. What language is this string: "hotel"?

I can say with 100% certainty that that is an English word.

Malcolm Dew-Jones · Jan 22, 2004

J.B. Moreno wrote:

: > J.B. Moreno wrote:
: > >
: > >>
: > >>>Does exist any module/script that can 100% detect text language..
: > >>>for example English, German, French, ... (European languages, at least
: > >>>English...)
: > >>
: > >>100%? No. What language is this string: "hotel"?
: > >
: > > Swahili?
: >
: > Start by adding all words from all the dictionaries in the world in a
: > file. Then using statistics you get the most likely one.

: The phrases "100%" and "most likely one" aren't equivalent.

: And look up the James Nicoll quote on the purity of the english
: language.

Every language is 100% pure all the time - they are moving targets defined
by their own use.

Malcolm Dew-Jones · Jan 22, 2004

Anno Siegel wrote:
: >
: > > Does exist any module/script that can 100% detect text language..
: > > for example English, German, French, ... (European languages, at least
: > > English...)
: >
: > 100%? No. What language is this string: "hotel"?

: Well, one-word-samples are hard, and 100% is unattainable.

: Entirely off topic, I have recently heard of an approach to text
: classification (with an eye to language recognition) that I found
: interesting.

: Use a Ziv-Lempel-like method to compress your sample. Then concatenate
: it with texts of similar lengths taken from known languages and compress
: again. If the compression rate is similar or better than that of the
: original text, the appended text is similar to the original one. If
: the compression deteriorates, the texts are dissimilar.

: The source (some idle chat on IRC, sorry) said that this works for
: rather small samples of fewer than a hundred words. I have always been
: meaning to play with it, but haven't got around.

: Anno

Sounds reasonable; basically it would be testing for similarity of letter
sequences.

I might also suggest using a Bayesian filter such as ifile or similar.
They try to file each message into the correct one of many folders. (I've
never used ifile, just read of it.)

You would provide samples in the languages you anticipate and then let the
filter categorize each document.
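
As a rough illustration of this suggestion (not of ifile itself, whose internals are not described here), a naive Bayes classifier over character trigrams might look like the following Python sketch. The class name, the choice of trigrams as features, the add-one smoothing, the uniform prior over languages, and the toy training sentences are all assumptions made for brevity.

```python
import math
from collections import Counter


def trigrams(text: str) -> list:
    """Overlapping three-character windows, padded so word edges count."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class NaiveBayesLanguageGuesser:
    """Hypothetical helper: one trigram frequency table per language."""

    def __init__(self):
        self.counts = {}  # language name -> Counter of trigrams

    def train(self, language: str, text: str) -> None:
        self.counts.setdefault(language, Counter()).update(trigrams(text))

    def scores(self, text: str) -> dict:
        """Log-likelihood of text under each language (uniform prior assumed)."""
        vocabulary = set().union(*self.counts.values())
        v = len(vocabulary)
        result = {}
        for language, counter in self.counts.items():
            total = sum(counter.values())
            # Add-one smoothing keeps unseen trigrams from zeroing the product.
            result[language] = sum(
                math.log((counter[gram] + 1) / (total + v))
                for gram in trigrams(text)
            )
        return result

    def classify(self, text: str) -> str:
        scores = self.scores(text)
        return max(scores, key=scores.get)


# Toy training data (illustrative only).
guesser = NaiveBayesLanguageGuesser()
guesser.train("english", "the cat sat on the mat and the dog ran to the house")
guesser.train("german", "der hund und die katze sind in dem haus und das ist gut")

if __name__ == "__main__":
    print(guesser.classify("the dog and the cat"))
```

In practice you would train on far larger samples for each language you anticipate, exactly as described above.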

$0.02

Joe Smith · Jan 22, 2004

Malcolm said:
Ben Morrow wrote:

: > Does exist any module/script that can 100% detect text language..
: > for example English, German, French, ... (European languages, at least
: > English...)

: 100%? No. What language is this string: "hotel"?

I can say with 100% certainty that that is an english word.

Taxi!

Andreas Marcel Riechert · Jan 22, 2004

Entirely off topic, I have recently heard of an approach to text
classification (with an eye to language recognition) that I found
interesting.

Use a Ziv-Lempel-like method to compress your sample. Then concatenate
it with texts of similar lengths taken from known languages and compress
again. If the compression rate is similar or better than that of the
original text, the appended text is similar to the original one. If
the compression deteriorates, the texts are dissimilar.

The source (some idle chat on IRC, sorry) said that this works for
rather small samples of fewer than a hundred words. I have always been
meaning to play with it, but haven't got around.

Probably:
Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto
"Language Trees and Zipping". in: Physical Review Letters, January 28, 2002.
(http://link.aps.org/abstract/PRL/v88/e048702)

http://arxiv.org/abs/cond-mat/0108530 (The Paper)
http://arxiv.org/abs/cond-mat/0202383 (One very critical answer)

Cheers,

Andreas

John W. Krahn · Jan 22, 2004

Joe said:
Taxi!

Beer!

John

Anno Siegel · Jan 22, 2004

Malcolm Dew-Jones said:
Anno Siegel wrote:
: >
: > > Does exist any module/script that can 100% detect text language..
: > > for example English, German, French, ... (European languages, at least
: > > English...)
: >
: > 100%? No. What language is this string: "hotel"?

: Well, one-word-samples are hard, and 100% is unattainable.

: Entirely off topic, I have recently heard of an approach to text
: classification (with an eye to language recognition) that I found
: interesting.

: Use a Ziv-Lempel-like method to compress your sample. Then concatenate
: it with texts of similar lengths taken from known languages and compress
: again. If the compression rate is similar or better than that of the
: original text, the appended text is similar to the original one. If
: the compression deteriorates, the texts are dissimilar.

: The source (some idle chat on IRC, sorry) said that this works for
: rather small samples of fewer than a hundred words. I have always been
: meaning to play with it, but haven't got around.

: Anno

Sounds reasonable, basically it would be testing for similarity of letter
sequences.

That's the idea. Trouble is, it would take a fair amount of research into
how co-compressibility actually varies across text samples to tune the
parameters you need for making decisions. That's what's stopping me from
"playing" with it; I'll leave that for someone with a diploma in statistics
(or the need for one).

I might also suggest using a bayesian filter such as ifile or similar.

Yes, it came up as an alternative in a discussion of Bayesian spam filters.

They try to file each message into the correct one (of many) folder. (I've
nevr used ifile, just read of it.)

You would provide samples in the languages you anticipate and then let the
filter categorize each document.

All these treat the problem of language identification as a case of
general text classification. Specific methods may apply, such as testing
for frequent key words in each language. For some (inflecting) languages,
an analysis of word endings may be highly distinctive. And so on, since
we're off topic. If these fail, one might decide not to decide, or fall
back on more expensive text classification, as above.
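
The key-word test could be sketched in Python roughly as follows. The tiny word lists, the `min_hits` threshold, and the function name are illustrative assumptions, not a vetted resource; returning `None` stands in for "deciding not to decide", at which point a caller could fall back on one of the general classifiers.

```python
import re

# Hypothetical, deliberately tiny lists of frequent function words.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "that", "it", "in"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "french": {"le", "la", "les", "et", "est", "de", "un", "que"},
}


def guess_by_stopwords(text: str, min_hits: int = 2):
    """Return the language with the most key-word hits, or None if unclear."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = {
        language: sum(word in keywords for word in words)
        for language, keywords in STOPWORDS.items()
    }
    ranked = sorted(hits.items(), key=lambda item: item[1], reverse=True)
    (best, best_hits), (_, second_hits) = ranked[0], ranked[1]
    # Too little evidence, or a tie between the top two: refuse to decide.
    if best_hits < min_hits or best_hits == second_hits:
        return None
    return best


if __name__ == "__main__":
    print(guess_by_stopwords("The cat is in the garden and it is happy"))
    print(guess_by_stopwords("hotel"))
```

A one-word sample such as "hotel" matches no key word in any list, so the function declines to answer, which matches the point made earlier in the thread about one-word samples.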

Anno

Anno Siegel · Jan 22, 2004

Andreas Marcel Riechert said:
Probably:
Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto
"Language Trees and Zipping". in: Physical Review Letters, January 28, 2002.
(http://link.aps.org/abstract/PRL/v88/e048702)

http://arxiv.org/abs/cond-mat/0108530 (The Paper)
http://arxiv.org/abs/cond-mat/0202383 (One very critical answer)

I believe those names were mentioned, thanks for the reference.

I haven't read the papers yet, but I notice that part of the reply is
about the article being off topic in Physical Review Letters. Some
things won't change, no matter what the medium...

Anno

Anno Siegel · Jan 22, 2004

John W. Krahn said:
Beer!

Now reverse.

Anno.

Tad McClellan · Jan 23, 2004

Anno Siegel said:
Now reverse.

What is going on here?

I've heard of "scalar context".

I've heard of "list context".

What is this "silly context" that seems to have taken over this thread?

Jürgen Exner · Jan 23, 2004

Malcolm said:
I can say with 100% certainty that that is an english word.

If you had said "It is a word of the English language", then I would
have concurred.
However, "an English word"? No.

jue
