Java Library - to determine whether given text is in English?

Discussion in 'Java' started by anonym, Mar 14, 2008.

  1. anonym

    anonym Guest

    Hi,

    I am looking for an available Java function or library that takes a
    sentence or a text as an input and outputs whether the text is in
    English or not.

    Thank you.
    anonym, Mar 14, 2008
    #1

  2. anonym

    Roedy Green Guest

    On Fri, 14 Mar 2008 11:53:59 -0700 (PDT), anonym
    wrote, quoted or indirectly quoted someone who said:

    > I am looking for an available java function or library that takes a
    >sentence or a text as an input and outputs whether the text is in
    >English or not.


    A simple test would look for some common English words such as "is",
    "an", "the". You could cook up a similar list for other languages and
    pick the best match.
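
    A rough Java sketch of the word-list idea above. The class name, the
    tiny word lists and the tie-breaking rule are all made up for
    illustration; real lists would need to be much longer.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class StopWordGuess {
    // Tiny illustrative word lists (assumption); real use needs longer, curated lists.
    private static final Map<String, Set<String>> WORDS = new LinkedHashMap<>();
    static {
        WORDS.put("en", Set.of("is", "an", "the", "of", "and", "to", "in", "that"));
        WORDS.put("de", Set.of("ist", "ein", "der", "die", "das", "und", "zu", "nicht"));
        WORDS.put("fr", Set.of("est", "un", "le", "la", "les", "et", "de", "que"));
    }

    /** Returns the language code with the most word-list hits, or "unknown" if nothing matches. */
    static String guess(String text) {
        // Split on anything that is not a Unicode letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) { // ties keep the earlier language
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("The cat is in the garden and wants to eat"));
        System.out.println(guess("Der Hund ist nicht in dem Haus und die Katze ist da"));
    }
}
```

    For the original yes/no question, "en".equals(StopWordGuess.guess(text))
    would do. Counting raw hits favours longer lists, so lists of similar
    size per language are probably a good idea.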
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Mar 14, 2008
    #2

  3. anonym

    Arne Vajhøj Guest

    Re: Java Library - to determine whether given text is in English?

    anonym wrote:
    > I am looking for an available java function or library that takes a
    > sentence or a text as an input and outputs whether the text is in
    > English or not.


    You can look at monograph or digraph frequencies and make
    a guess based on those.

    I did some experiments a long time ago.

    See the C snippet below for some ideas.

    Arne

    =====================================================

    // f[c] counts single characters, ff[a*256+b] counts digraphs "ab",
    // indicator[X] is the score per language; l is presumably the text length
    // monograph RIO analysis
    if((f['r']+f['R'])>(f['i']+f['I'])) {
        indicator[DK]++;
        indicator[FR]--;
    }
    if((f['O']+f['o'])>(f['R']+f['r'])) {
        indicator[UK]++;
        indicator[ES]++;
        indicator[DK]--;
    }
    if((f['I']+f['i'])>(f['O']+f['o'])) {
        indicator[DE]++;
        indicator[UK]--;
        indicator[ES]--;
    }
    // characteristic digraph analysis
    if((ff['t'*256+'h']+ff['T'*256+'H']+ff['T'*256+'h'])>0.01*l) {
        indicator[UK]++;
        indicator[DK]--;
        indicator[FR]--;
        indicator[DE]--;
        indicator[ES]--;
    }
    if((ff['c'*256+'h']+ff['C'*256+'H']+ff['C'*256+'h'])>0.01*l) {
        indicator[DE]++;
        indicator[DK]--;
        indicator[FR]--;
        indicator[ES]--;
    }
    if((ff['o'*256+'u']+ff['O'*256+'U']+ff['O'*256+'u'])>0.01*l) {
        indicator[UK]++;
        indicator[FR]++;
        indicator[DE]--;
        indicator[DK]--;
        indicator[ES]--;
    }
    if((ff['n'*256+'t']+ff['N'*256+'T']+ff['N'*256+'t'])>0.01*l) {
        indicator[FR]++;
        indicator[UK]--;
        indicator[DE]--;
        indicator[ES]--;
    }
    if((ff['u'*256+'e']+ff['U'*256+'E']+ff['U'*256+'e'])>0.01*l) {
        indicator[ES]++;
        indicator[DK]--;
        indicator[UK]--;
        indicator[FR]--;
        indicator[DE]--;
    }
    if((ff['l'*256+'a']+ff['L'*256+'A']+ff['L'*256+'a'])>0.01*l) {
        indicator[ES]++;
        indicator[DK]--;
        indicator[FR]--;
        indicator[DE]--;
    }
    // unused characters analysis
    if((f['j']+f['J'])>0.01*l) {
        indicator[DE]--;
    }
    if((f['k']+f['K'])>0.01*l) {
        indicator[DK]++;
        indicator[FR]--;
        indicator[ES]--;
    }
    if((f['w']+f['W'])>0.01*l) {
        indicator[UK]++;
        indicator[DE]++;
        indicator[FR]--;
        indicator[ES]--;
    }
    if((f['y']+f['Y'])>0.01*l) {
        indicator[UK]++;
        indicator[FR]--;
        indicator[DE]--;
    }
    // special characters analysis
    if((f[UCHAR('Æ')]+f[UCHAR('Ø')]+f[UCHAR('Å')]+
            f[UCHAR('æ')]+f[UCHAR('ø')]+f[UCHAR('å')])>0) { // Danish
        indicator[DK]++;
        indicator[UK]--;
        indicator[FR]--;
        indicator[DE]--;
        indicator[ES]--;
    }
    if((f[UCHAR('Ä')]+f[UCHAR('Ö')]+f[UCHAR('Ü')]+
            f[UCHAR('ä')]+f[UCHAR('ö')]+f[UCHAR('ü')])>0) { // German umlaut
        indicator[DE]++;
        indicator[DK]--;
        indicator[UK]--;
        indicator[FR]--;
        indicator[ES]--;
    }
    if((f[UCHAR('É')]+f[UCHAR('Í')]+f[UCHAR('Ó')]+
            f[UCHAR('é')]+f[UCHAR('í')]+f[UCHAR('ó')])>0) { // Romance acute accent
        indicator[FR]++;
        indicator[ES]++;
        indicator[DK]--;
        indicator[UK]--;
        indicator[DE]--;
    }
    if((f[UCHAR('Ñ')]+f[UCHAR('ñ')])>0) { // Spanish n tilde
        indicator[ES]++;
        indicator[DK]--;
        indicator[UK]--;
        indicator[FR]--;
        indicator[DE]--;
    }
    if((f[UCHAR('Ç')]+f[UCHAR('ç')])>0) { // French c cedilla
        indicator[FR]++;
        indicator[DK]--;
        indicator[UK]--;
        indicator[DE]--;
        indicator[ES]--;
    }
    if((f[UCHAR('ß')])>0) { // German double s
        indicator[DE]++;
        indicator[FR]--;
        indicator[DK]--;
        indicator[UK]--;
        indicator[ES]--;
    }
    if((f[UCHAR('À')]+f[UCHAR('È')]+f[UCHAR('Ò')]+
            f[UCHAR('à')]+f[UCHAR('è')]+f[UCHAR('ò')])>0) { // Romance grave accent
        indicator[FR]++;
        indicator[DK]--;
        indicator[UK]--;
        indicator[DE]--;
    }
    if((f[UCHAR('Ê')]+f[UCHAR('Î')]+f[UCHAR('Ô')]+
            f[UCHAR('ê')]+f[UCHAR('î')]+f[UCHAR('ô')])>0) { // Romance circumflex
        indicator[FR]++;
        indicator[DK]--;
        indicator[UK]--;
        indicator[DE]--;
    }
    Arne Vajhøj, Mar 16, 2008
    #3
  4. anonym

    Roedy Green Guest

    On Sun, 16 Mar 2008 19:16:20 -0400, Arne Vajhøj
    wrote, quoted or indirectly quoted someone who said:

    > if((ff['t'*256+'h']+ff['T'*256+'H']+ff['T'*256+'h'])>0.01*l) {


    Is this supposed to work with Unicode too, or only with an 8-bit
    encoding? Is l the length of the string in chars?
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Mar 17, 2008
    #4
  5. anonym

    Roedy Green Guest

    On Sun, 16 Mar 2008 19:16:20 -0400, Arne Vajhøj
    wrote, quoted or indirectly quoted someone who said:

    > if((f[UCHAR('Ä')]+f[UCHAR('Ö')]+f[UCHAR('Ü')]+
    > f[UCHAR('ä')]+f[UCHAR('ö')]+f[UCHAR('ü')])>0) { // german umlaut


    What does your UCHAR function do?
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Mar 17, 2008
    #5
  6. Re: Java Library - to determine whether given text is in English?

    Arne Vajhøj wrote:
    > anonym wrote:
    >> I am looking for an available java function or library that takes a
    >> sentence or a text as an input and outputs whether the text is in
    >> English or not.

    >
    > You can look at monograph or digraph frequencies and make
    > a guess based on those.


    For English, see here:
    http://www.cs.chalmers.se/Cs/Grundutb/Kurser/krypto/en_stat.html or, if
    you have a large sample of text, you can build your own tables.

    Then build a table of the text you want to test and match it to the
    "best" language using, for example, a chi-square test. I used something
    similar in an exercise many years ago for finding the probable language
    in a cryptography class.

    The test gets more accurate if you have lots of text. Very short texts
    cannot be tested reliably with these simple tests.

    "I like my dog" is made of just Swedish words, although the meaning in
    Swedish is gibberish.
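
    The table matching described above could look roughly like this in
    Java. Treat it as a sketch under assumptions: the class and method
    names are invented, the English table holds commonly cited approximate
    letter frequencies (not the numbers from the linked page), only the
    letters a..z are counted, and tableFrom uses add-one smoothing so that
    no expected value is zero.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class ChiSquareGuess {
    // Commonly cited approximate English letter frequencies for a..z
    // (assumption: NOT taken from the page linked in the post).
    static final double[] ENGLISH = {
        0.08167, 0.01492, 0.02782, 0.04253, 0.12702, 0.02228, 0.02015, 0.06094,
        0.06966, 0.00153, 0.00772, 0.04025, 0.02406, 0.06749, 0.07507, 0.01929,
        0.00095, 0.05987, 0.06327, 0.09056, 0.02758, 0.00978, 0.02360, 0.00150,
        0.01974, 0.00074
    };

    /** Builds relative a..z frequencies from a sample, with add-one smoothing. */
    static double[] tableFrom(String sample) {
        double[] table = new double[26];
        java.util.Arrays.fill(table, 1.0);
        double total = 26.0;
        for (char c : sample.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                table[c - 'a']++;
                total++;
            }
        }
        for (int i = 0; i < 26; i++) {
            table[i] /= total;
        }
        return table;
    }

    /** Pearson chi-square statistic of the text's a..z counts against a table; lower is closer. */
    static double chiSquare(String text, double[] expected) {
        int[] observed = new int[26];
        int n = 0;
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                observed[c - 'a']++;
                n++;
            }
        }
        double chi = 0.0;
        for (int i = 0; i < 26; i++) {
            double e = expected[i] * n;
            if (e > 0) {
                double d = observed[i] - e;
                chi += d * d / e;
            }
        }
        return chi;
    }

    /** Returns the key of the table with the smallest statistic, or null for an empty map. */
    static String bestMatch(String text, Map<String, double[]> tables) {
        String best = null;
        double bestChi = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : tables.entrySet()) {
            double chi = chiSquare(text, entry.getValue());
            if (chi < bestChi) {
                bestChi = chi;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, double[]> tables = new LinkedHashMap<>();
        tables.put("en", ENGLISH);
        // Hypothetical, far too small sample; a real table needs a large corpus.
        tables.put("de", tableFrom("Der schnelle braune Fuchs springt ueber den faulen Hund"));
        String text = "This is a simple sentence written in plain English";
        System.out.println(bestMatch(text, tables) + " " + chiSquare(text, ENGLISH));
    }
}
```

    As said above, the statistic is only meaningful for longer texts; on a
    few words the result is close to noise.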

    --
    Roger Lindsjö
    Roger Lindsjö, Mar 17, 2008
    #6
  7. anonym

    Jeff Higgins Guest

    Roger Lindsjö wrote:
    > Arne Vajhøj wrote:
    >> anonym wrote:
    >>> I am looking for an available java function or library that takes a
    >>> sentence or a text as an input and outputs whether the text is in
    >>> English or not.

    >>
    >> You can look at monograph or digraph frequencies and make
    >> a guess based on those.

    >
    > For english see here:
    > http://www.cs.chalmers.se/Cs/Grundutb/Kurser/krypto/en_stat.html or if you
    > have a large sample of text you can build your own tables.
    >

    Thanks to the above posters for the interesting ideas.

    I would like to find a link to Google Corporation's similar
    list taken from a sample of 1.252 X 10^100 email and usenet spam posts.
    Jeff Higgins, Mar 17, 2008
    #7
  8. anonym

    Arne Vajhøj Guest

    Roedy Green wrote:
    > On Sun, 16 Mar 2008 19:16:20 -0400, Arne Vajhøj
    > wrote, quoted or indirectly quoted someone who said:
    >> if((ff['t'*256+'h']+ff['T'*256+'H']+ff['T'*256+'h'])>0.01*l) {

    >
    > Is this supposed to work with Unicode too, or only with an 8-bit
    > encoding? is l the length of the string in chars?


    Nope. As written it is C/C++, and it assumes a single-byte
    character set (ISO-8859-1). But the idea could easily
    be extended to Unicode.
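
    One way the digraph counting might be carried over to Unicode in Java
    is to replace the 256*256 array with a map keyed by the letter pair.
    This is only a sketch: the class and method names are invented, and
    it assumes that l in the snippet means the text length, which the
    thread does not confirm.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class DigraphCounter {
    /** Counts adjacent letter pairs, case-insensitively, keyed by the two-letter string. */
    static Map<String, Integer> countDigraphs(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String lower = text.toLowerCase(Locale.ROOT);
        int prev = -1; // previous code point if it was a letter, otherwise -1
        for (int i = 0; i < lower.length(); ) {
            int cp = lower.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                if (prev != -1) {
                    String key = new StringBuilder()
                            .appendCodePoint(prev)
                            .appendCodePoint(cp)
                            .toString();
                    counts.merge(key, 1, Integer::sum);
                }
                prev = cp;
            } else {
                prev = -1; // spaces, digits and punctuation break a digraph
            }
        }
        return counts;
    }

    /** Same shape as the "> 0.01*l" rules: is the digraph count above 1% of the length? */
    static boolean aboveOnePercent(Map<String, Integer> counts, String digraph, int length) {
        return counts.getOrDefault(digraph, 0) > 0.01 * length;
    }

    public static void main(String[] args) {
        String text = "The thing that they thought";
        Map<String, Integer> counts = countDigraphs(text);
        int length = text.codePointCount(0, text.length());
        System.out.println(counts.getOrDefault("th", 0) + " " + aboveOnePercent(counts, "th", length));
    }
}
```

    Lower-casing first means "th", "TH" and "Th" all land on the same key,
    so the three-way sums from the C version are not needed.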

    Arne
    Arne Vajhøj, Mar 18, 2008
    #8
  9. anonym

    Arne Vajhøj Guest

    Roedy Green wrote:
    > On Sun, 16 Mar 2008 19:16:20 -0400, Arne Vajhøj
    > wrote, quoted or indirectly quoted someone who said:
    >> if((f[UCHAR('Ä')]+f[UCHAR('Ö')]+f[UCHAR('Ü')]+
    >> f[UCHAR('ä')]+f[UCHAR('ö')]+f[UCHAR('ü')])>0) { // german umlaut

    >
    > what does your UCHAR function do?


    It is a typedef for unsigned char.

    Signed chars are a curse.
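
    The same trap exists in Java if the text is handled as bytes, because
    byte is signed there. A small sketch with made-up names, assuming the
    input is ISO-8859-1 as in the C code: masking with 0xFF plays the role
    of the unsigned char cast.

```java
import java.nio.charset.StandardCharsets;

final class ByteHistogram {
    /** Counts byte values 0..255; the mask turns Java's signed byte into an unsigned index. */
    static int[] histogram(byte[] data) {
        int[] f = new int[256];
        for (byte b : data) {
            f[b & 0xFF]++; // without the mask, bytes >= 0x80 would be negative indexes
        }
        return f;
    }

    public static void main(String[] args) {
        byte[] data = "\u00c6ble og \u00f8l".getBytes(StandardCharsets.ISO_8859_1);
        int[] f = histogram(data);
        System.out.println(f[0xC6] + " " + f[0xF8]);
    }
}
```

    With a Java String and char this does not come up, since char is an
    unsigned 16-bit type.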

    Arne
    Arne Vajhøj, Mar 18, 2008
    #9