Java library to determine whether given text is in English?

anonym · Mar 14, 2008

Hi,

I am looking for an available java function or library that takes a
sentence or a text as an input and outputs whether the text is in
English or not.

Thank you.

Roedy Green · Mar 14, 2008

anonym said:
I am looking for an available java function or library that takes a
sentence or a text as an input and outputs whether the text is in
English or not.

A simple test would look for some common English words such as "is",
"an", and "the". You could cook up a similar list for other languages and
pick the best match.
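
A minimal Java sketch of the common-word idea above. The class name, the method names, and the tiny word lists are all made up for illustration; real lists would need to be much longer, and some words ("die", "in") occur in more than one language.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not an existing library class.
final class StopWordGuesser {
    // Illustrative word lists only; they are assumptions, not a vetted resource.
    private static final Map<String, Set<String>> COMMON_WORDS = new LinkedHashMap<>();
    static {
        COMMON_WORDS.put("English", new HashSet<>(Arrays.asList(
                "the", "is", "an", "and", "of", "to", "in", "a")));
        COMMON_WORDS.put("German", new HashSet<>(Arrays.asList(
                "der", "die", "das", "und", "ist", "ein", "nicht", "zu")));
        COMMON_WORDS.put("French", new HashSet<>(Arrays.asList(
                "le", "la", "les", "et", "est", "un", "une", "de")));
    }

    /** Returns the language whose list matches the most tokens, or "unknown" if nothing matches. */
    static String guess(String text) {
        // Split on any run of characters that are not Unicode letters.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : COMMON_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** True if English has strictly more hits than any other listed language. */
    static boolean looksEnglish(String text) {
        return "English".equals(guess(text));
    }
}
```

Calling StopWordGuesser.looksEnglish(someText) gives a yes/no answer, but with so few words per list it will only be trustworthy on reasonably long input.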

Arne Vajhøj · Mar 16, 2008

anonym said:
I am looking for an available java function or library that takes a
sentence or a text as an input and outputs whether the text is in
English or not.

You can look at monograph or digraph frequencies and make
a guess based on those.

I did some experiments a long time ago.

See the C snippet below for some ideas.

Arne

=====================================================

// Context presumably set up earlier in the program (not shown):
// f[c] counts single-byte character c, ff[a*256+b] counts the digraph "ab",
// l is used as a size for the 1% thresholds, indicator[] holds a score per language.

// monograph RIO analysis
if((f['r']+f['R'])>(f['i']+f['I'])) {
    indicator[DK]++;
    indicator[FR]--;
}
if((f['O']+f['o'])>(f['R']+f['r'])) {
    indicator[UK]++;
    indicator[ES]++;
    indicator[DK]--;
}
if((f['I']+f['i'])>(f['O']+f['o'])) {
    indicator[DE]++;
    indicator[UK]--;
    indicator[ES]--;
}
// characteristic digraph analysis
if((ff['t'*256+'h']+ff['T'*256+'H']+ff['T'*256+'h'])>0.01*l) {
    indicator[UK]++;
    indicator[DK]--;
    indicator[FR]--;
    indicator[DE]--;
    indicator[ES]--;
}
if((ff['c'*256+'h']+ff['C'*256+'H']+ff['C'*256+'h'])>0.01*l) {
    indicator[DE]++;
    indicator[DK]--;
    indicator[FR]--;
    indicator[ES]--;
}
if((ff['o'*256+'u']+ff['O'*256+'U']+ff['O'*256+'u'])>0.01*l) {
    indicator[UK]++;
    indicator[FR]++;
    indicator[DE]--;
    indicator[DK]--;
    indicator[ES]--;
}
if((ff['n'*256+'t']+ff['N'*256+'T']+ff['N'*256+'t'])>0.01*l) {
    indicator[FR]++;
    indicator[UK]--;
    indicator[DE]--;
    indicator[ES]--;
}
if((ff['u'*256+'e']+ff['U'*256+'E']+ff['U'*256+'e'])>0.01*l) {
    indicator[ES]++;
    indicator[DK]--;
    indicator[UK]--;
    indicator[FR]--;
    indicator[DE]--;
}
if((ff['l'*256+'a']+ff['L'*256+'A']+ff['L'*256+'a'])>0.01*l) {
    indicator[ES]++;
    indicator[DK]--;
    indicator[FR]--;
    indicator[DE]--;
}
// rarely used characters analysis
if((f['j']+f['J'])>0.01*l) {
    indicator[DE]--;
}
if((f['k']+f['K'])>0.01*l) {
    indicator[DK]++;
    indicator[FR]--;
    indicator[ES]--;
}
if((f['w']+f['W'])>0.01*l) {
    indicator[UK]++;
    indicator[DE]++;
    indicator[FR]--;
    indicator[ES]--;
}
if((f['y']+f['Y'])>0.01*l) {
    indicator[UK]++;
    indicator[FR]--;
    indicator[DE]--;
}
// special characters analysis
if((f[UCHAR('Æ')]+f[UCHAR('Ø')]+f[UCHAR('Å')]+
    f[UCHAR('æ')]+f[UCHAR('ø')]+f[UCHAR('å')])>0) { // Danish letters
    indicator[DK]++;
    indicator[UK]--;
    indicator[FR]--;
    indicator[DE]--;
    indicator[ES]--;
}
if((f[UCHAR('Ä')]+f[UCHAR('Ö')]+f[UCHAR('Ü')]+
    f[UCHAR('ä')]+f[UCHAR('ö')]+f[UCHAR('ü')])>0) { // German umlaut
    indicator[DE]++;
    indicator[DK]--;
    indicator[UK]--;
    indicator[FR]--;
    indicator[ES]--;
}
if((f[UCHAR('É')]+f[UCHAR('Í')]+f[UCHAR('Ó')]+
    f[UCHAR('é')]+f[UCHAR('í')]+f[UCHAR('ó')])>0) { // acute accent
    indicator[FR]++;
    indicator[ES]++;
    indicator[DK]--;
    indicator[UK]--;
    indicator[DE]--;
}
if((f[UCHAR('Ñ')]+f[UCHAR('ñ')])>0) { // Spanish n with tilde
    indicator[ES]++;
    indicator[DK]--;
    indicator[UK]--;
    indicator[FR]--;
    indicator[DE]--;
}
if((f[UCHAR('Ç')]+f[UCHAR('ç')])>0) { // French c with cedilla
    indicator[FR]++;
    indicator[DK]--;
    indicator[UK]--;
    indicator[DE]--;
    indicator[ES]--;
}
if((f[UCHAR('ß')])>0) { // German sharp s
    indicator[DE]++;
    indicator[FR]--;
    indicator[DK]--;
    indicator[UK]--;
    indicator[ES]--;
}
if((f[UCHAR('À')]+f[UCHAR('È')]+f[UCHAR('Ò')]+
    f[UCHAR('à')]+f[UCHAR('è')]+f[UCHAR('ò')])>0) { // grave accent
    indicator[FR]++;
    indicator[DK]--;
    indicator[UK]--;
    indicator[DE]--;
}
if((f[UCHAR('Ê')]+f[UCHAR('Î')]+f[UCHAR('Ô')]+
    f[UCHAR('ê')]+f[UCHAR('î')]+f[UCHAR('ô')])>0) { // circumflex
    indicator[FR]++;
    indicator[DK]--;
    indicator[UK]--;
    indicator[DE]--;
}

Roedy Green · Mar 17, 2008

Arne said:
if((ff['t'*256+'h']+ff['T'*256+'H']+ff['T'*256+'h'])>0.01*l) {

Is this supposed to work with Unicode too, or only with an 8-bit
encoding? Is l the length of the string in chars?

Roedy Green · Mar 17, 2008

Arne said:
if((f[UCHAR('Ä')]+f[UCHAR('Ö')]+f[UCHAR('Ü')]+
f[UCHAR('ä')]+f[UCHAR('ö')]+f[UCHAR('ü')])>0) { // german umlaut

What does your UCHAR function do?

Roger Lindsjö · Mar 17, 2008

Arne said:
You can look at monograph or digraph frequencies and make
a guess based on those.

For English, see here:
http://www.cs.chalmers.se/Cs/Grundutb/Kurser/krypto/en_stat.html or, if
you have a large sample of text, you can build your own tables.

Then build a table of the text you want to test and match it to the
"best" language using a chi-square test, for example. I used something
similar in an exercise many years ago for finding the probable language in a
cryptography class.

The test gets more accurate if you have lots of text. Very short texts
cannot be tested reliably with these simple tests.

"I like my dog" is made of just Swedish words, although the meaning in
Swedish is gibberish.
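
A rough Java sketch of the chi-square matching described above. The English table holds approximate, commonly cited letter percentages and should be treated as illustrative; the class and method names are invented for this example. To pick a "best" language you would compute the statistic against one table per language and take the lowest value.

```java
import java.util.Locale;

// Hypothetical helper, not an existing library class.
final class ChiSquareLanguage {
    // Approximate percentages for a..z in English text (illustrative values only).
    static final double[] ENGLISH_PERCENT = {
        8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97, 0.15, 0.77, 4.03, 2.41,
        6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07
    };

    /**
     * Chi-square statistic of the text's a..z counts against an expected
     * percentage table with 26 entries. Lower means a closer fit.
     */
    static double chiSquare(String text, double[] expectedPercent) {
        int[] observed = new int[26];
        int total = 0;
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                observed[c - 'a']++;
                total++;
            }
        }
        if (total == 0) {
            return Double.POSITIVE_INFINITY; // nothing to compare
        }
        double chi = 0.0;
        for (int i = 0; i < 26; i++) {
            double expected = total * expectedPercent[i] / 100.0;
            double diff = observed[i] - expected;
            chi += diff * diff / expected;
        }
        return chi;
    }
}
```

As noted above, the statistic is only meaningful on longer texts; a three-word sentence can fit several tables about equally well.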

Jeff Higgins · Mar 17, 2008

Roger said:
For English, see here:
http://www.cs.chalmers.se/Cs/Grundutb/Kurser/krypto/en_stat.html or, if you
have a large sample of text, you can build your own tables.

Thanks to the above posters for the interesting ideas.

I would like to find a link to Google Corporation's similar
list taken from a sample of 1.252 X 10^100 email and usenet spam posts.

Arne Vajhøj · Mar 18, 2008

Roedy said:
if((ff['t'*256+'h']+ff['T'*256+'H']+ff['T'*256+'h'])>0.01*l) {

Is this supposed to work with Unicode too, or only with an 8-bit
encoding? Is l the length of the string in chars?

Nope. As written it is C/C++, and it assumes a single-byte
character set (ISO-8859-1). But the idea could easily
be extended to Unicode.

Arne
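
One way the digraph counting might be carried over to Unicode text in Java, sketched under assumptions: the class and method names are invented, and the 1% threshold mirrors the C fragment on the guess that l there is the text length, which the thread does not confirm. A map keyed by the two-character string replaces the 65536-entry ff array.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not an existing library class.
final class DigraphCounter {
    /** Counts adjacent letter pairs after lower-casing, keyed by the two-character string. */
    static Map<String, Integer> countDigraphs(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        // Works on UTF-16 chars; characters outside the BMP are not handled specially.
        for (int i = 0; i + 1 < s.length(); i++) {
            char a = s.charAt(i);
            char b = s.charAt(i + 1);
            if (Character.isLetter(a) && Character.isLetter(b)) {
                counts.merge("" + a + b, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Analogue of the "th" check: true if "th" occurs more often than 1% of the text length. */
    static boolean thIsFrequent(String text) {
        int th = countDigraphs(text).getOrDefault("th", 0);
        return th > 0.01 * text.length();
    }
}
```

Lower-casing first removes the need for the separate 'T'/'H' case combinations that the C version spells out.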

Arne Vajhøj · Mar 18, 2008

Roedy said:
if((f[UCHAR('Ä')]+f[UCHAR('Ö')]+f[UCHAR('Ü')]+
f[UCHAR('ä')]+f[UCHAR('ö')]+f[UCHAR('ü')])>0) { // german umlaut

What does your UCHAR function do?

It is a typedef for unsigned char.

Signed chars are a curse.

Arne
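
The same pitfall exists in Java, where byte is always signed. A small sketch, assuming the text is encoded as ISO-8859-1 as in the C fragment (the class name is made up): masking with 0xFF plays the role of the unsigned char typedef when indexing the frequency array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an existing library class.
final class ByteHistogram {
    /** Counts ISO-8859-1 byte values; for example, index 0xE6 holds the count of 'æ'. */
    static int[] count(String text) {
        // Characters that do not exist in ISO-8859-1 are replaced with '?' by getBytes.
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        int[] f = new int[256];
        for (byte b : bytes) {
            // Java bytes range from -128 to 127; mask to get an index in 0..255.
            f[b & 0xFF]++;
        }
        return f;
    }
}
```

Without the mask, any byte above 0x7F would become a negative index and throw ArrayIndexOutOfBoundsException.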
