Language detection module

Discussion in 'Perl Misc' started by AR, Jan 21, 2004.

  1. AR Guest

    Is there any module/script that can detect the language of a text with
    100% accuracy? For example English, German, French, ... (European
    languages, at least English...)
     
    AR, Jan 21, 2004
    #1

  2. Ben Morrow Guest

    AR <> wrote:
    > Does exist any module/script that can 100% detect text language..
    > for example English, German, French, ... (European languages, at least
    > English...)


    100%? No. What language is this string: "hotel"?

    Ben

    --
    Joy and Woe are woven fine,
    A Clothing for the Soul divine      William Blake
    Under every grief and pine          'Auguries of Innocence'
    Runs a joy with silken twine.
     
    Ben Morrow, Jan 21, 2004
    #2

  3. J.B. Moreno Guest

    Ben Morrow <> wrote:

    > AR <> wrote:
    > > Does exist any module/script that can 100% detect text language..
    > > for example English, German, French, ... (European languages, at least
    > > English...)

    >
    > 100%? No. What language is this string: "hotel"?


    Swahili?

    --
    JBM
    "Everything is futile." -- Marvin of Borg
     
    J.B. Moreno, Jan 21, 2004
    #3
  4. On Wed, 21 Jan 2004, Ben Morrow wrote:

    > 100%? No. What language is this string: "hotel"?


    Yeah, ask a German speaker what language this is: "Gift".
     
    Alan J. Flavell, Jan 21, 2004
    #4
  5. Ben Morrow <> wrote:
    >
    > AR <> wrote:
    >> Does exist any module/script that can 100% detect text language..
    >> for example English, German, French, ... (European languages, at least
    >> English...)

    >
    > 100%? No. What language is this string: "hotel"?



    Military? (the letter "H") ?


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Jan 21, 2004
    #5
  6. J.B. Moreno wrote:
    > Ben Morrow <> wrote:
    >
    >
    >>AR <> wrote:
    >>
    >>>Does exist any module/script that can 100% detect text language..
    >>>for example English, German, French, ... (European languages, at least
    >>>English...)

    >>
    >>100%? No. What language is this string: "hotel"?

    >
    >
    > Swahili?

    Start by adding all words from all the dictionaries in the world in a file.
    Then using statistics you get the most likely one.
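A minimal sketch of the dictionary idea above, in Python rather than Perl. The tiny `WORDLISTS` and the function names are hypothetical; a real attempt would load full dictionaries. It counts how many tokens each language's word list recognises and reports every language tied for the top count:

```python
import re
from collections import Counter

# Hypothetical, hand-made word lists; a real run would load full dictionaries.
WORDLISTS = {
    "english": {"the", "hotel", "is", "next", "to", "station", "and", "taxi", "gift"},
    "german": {"das", "hotel", "ist", "neben", "dem", "bahnhof", "und", "taxi", "gift"},
    "french": {"le", "la", "est", "de", "gare", "et", "taxi"},
}


def tokenize(text):
    """Lower-case the text and return its runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def score(text):
    """Count, per language, how many tokens appear in that language's list."""
    hits = Counter()
    for token in tokenize(text):
        for language, words in WORDLISTS.items():
            if token in words:
                hits[language] += 1
    return hits


def most_likely(text):
    """Return every language tied for the highest hit count (may be empty)."""
    hits = score(text)
    if not hits:
        return []
    best = max(hits.values())
    return sorted(lang for lang, n in hits.items() if n == best)


print(most_likely("Das Hotel ist neben dem Bahnhof"))
print(most_likely("hotel"))
```

With these lists a full German sentence gets a single answer, while "hotel" on its own comes back as a tie. That is the ambiguity the earlier replies point at: more text narrows the guess, but a shared word cannot be resolved by lookup alone.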

    Or why not just this?

    #!/usr/bin/perl
    use strict;
    use warnings;

    print "String is in any known language or some constructed language such ",
          "as Esperanto, Volapuk, Glosa, Loglan, or even Klingon.\n";


    Now that would almost certainly cover 95% of all the languages (I missed
    adding the Tolkien languages, but I leave that as a programming exercise).
    But I'm not sure if it's 100% future proof. The "any known language"
    could be interpreted as "known" to the person running the program.

    Best Regards
    Martin Quensel
     
    Martin Quensel, Jan 21, 2004
    #6
  7. J.B. Moreno Guest

    Martin Quensel <> wrote:

    > J.B. Moreno wrote:
    > > Ben Morrow <> wrote:
    > >
    > >>AR <> wrote:
    > >>
    > >>>Does exist any module/script that can 100% detect text language..
    > >>>for example English, German, French, ... (European languages, at least
    > >>>English...)
    > >>
    > >>100%? No. What language is this string: "hotel"?

    > >
    > > Swahili?

    >
    > Start by adding all words from all the dictionaries in the world in a
    > file. Then using statistics you get the most likely one.


    The phrases "100%" and "most likely one" aren't equivalent.

    And look up the James Nicoll quote on the purity of the English
    language.

    --
    JBM
    "Everything is futile." -- Marvin of Borg
     
    J.B. Moreno, Jan 22, 2004
    #7
  8. Anno Siegel Guest

    Ben Morrow <> wrote in comp.lang.perl.misc:
    >
    > AR <> wrote:
    > > Does exist any module/script that can 100% detect text language..
    > > for example English, German, French, ... (European languages, at least
    > > English...)

    >
    > 100%? No. What language is this string: "hotel"?


    Well, one-word-samples are hard, and 100% is unattainable.

    Entirely off topic, I have recently heard of an approach to text
    classification (with an eye to language recognition) that I found
    interesting.

    Use a Ziv-Lempel-like method to compress your sample. Then concatenate
    it with texts of similar lengths taken from known languages and compress
    again. If the compression rate is similar or better than that of the
    original text, the appended text is similar to the original one. If
    the compression deteriorates, the texts are dissimilar.

    The source (some idle chat on IRC, sorry) said that this works for
    rather small samples of fewer than a hundred words. I have always been
    meaning to play with it, but haven't got around.
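A rough sketch of the compression idea, in Python with the standard `zlib` module (a Lempel-Ziv based compressor). It uses one common variant of the method: append the sample to a reference text in each known language and measure how many extra compressed bytes the sample costs. The reference texts here are tiny made-up placeholders and the function names are illustrative, so treat this as an experiment to tune, not a tested classifier:

```python
import zlib

# Placeholder reference texts (assumptions); real use needs much larger samples.
REFERENCES = {
    "english": (
        "The quick brown fox jumps over the lazy dog. She said that there was "
        "nothing more to be done about it. We have been waiting for the train "
        "since this morning, and it is still not here."
    ),
    "german": (
        "Der schnelle braune Fuchs springt über den faulen Hund. Sie sagte, "
        "dass man nichts mehr dagegen tun könne. Wir warten seit heute Morgen "
        "auf den Zug, und er ist immer noch nicht da."
    ),
    "french": (
        "Le rapide renard brun saute par-dessus le chien paresseux. Elle a dit "
        "qu'il n'y avait plus rien à faire. Nous attendons le train depuis ce "
        "matin, et il n'est toujours pas là."
    ),
}


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(reference, sample):
    """Extra compressed bytes needed when sample is appended to reference."""
    return compressed_size(reference + " " + sample) - compressed_size(reference)


def guess_language(sample):
    """Pick the reference language whose text makes the sample cheapest to add."""
    return min(REFERENCES, key=lambda lang: extra_bytes(REFERENCES[lang], sample))


if __name__ == "__main__":
    for text in ("It is still not here", "Er ist immer noch nicht da"):
        costs = {lang: extra_bytes(ref, text) for lang, ref in REFERENCES.items()}
        print(text, "->", guess_language(text), costs)
```

The smaller the extra cost, the more letter sequences the sample shares with that reference. How big the gap must be before the answer can be trusted is exactly the parameter that would need tuning on real data.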

    Anno
     
    Anno Siegel, Jan 22, 2004
    #8
  9. Eric Wilhelm Guest

    On Thu, 22 Jan 2004 00:35:12 -0600, J.B. Moreno wrote:

    >> Start by adding all words from all the dictionaries in the world in a
    >> file. Then using statistics you get the most likely one.

    >
    > The phrases "100%" and "most likely one" aren't equivalent


    This is true, but in the real world, something which gives a 99.9%
    probability is about as good as we are going to get. No sense in
    refusing to use a circle simply because it is impossible to make a
    perfect one.

    IMO, 99.9% might be a low estimate even if the program takes a naive
    approach. If the dictionaries include "adopted" phrases (e.g. Latin
    expressions which are often cited in English, etc.) and some kind of
    best-fit spell check is used, you might push the probabilities into
    99.99%. Now feed some works of literature from each language into a
    phrase-counter and use phrases as well, and you might find that a text of
    100 words or more can be predicted correctly 99.9999% of the time.

    If that isn't good enough (missing 1 of 10^6), you're going to be working
    on the thing for so long that half of the languages in use at its
    conception are out of use before you reach the prototype.

    --Eric
     
    Eric Wilhelm, Jan 22, 2004
    #9
  10. Ben Morrow () wrote:

    : AR <> wrote:
    : > Does exist any module/script that can 100% detect text language..
    : > for example English, German, French, ... (European languages, at least
    : > English...)

    : 100%? No. What language is this string: "hotel"?

    I can say with 100% certainty that that is an English word.
     
    Malcolm Dew-Jones, Jan 22, 2004
    #10
  11. J.B. Moreno () wrote:
    : Martin Quensel <> wrote:

    : > J.B. Moreno wrote:
    : > > Ben Morrow <> wrote:
    : > >
    : > >>AR <> wrote:
    : > >>
    : > >>>Does exist any module/script that can 100% detect text language..
    : > >>>for example English, German, French, ... (European languages, at least
    : > >>>English...)
    : > >>
    : > >>100%? No. What language is this string: "hotel"?
    : > >
    : > > Swahili?
    : >
    : > Start by adding all words from all the dictionaries in the world in a
    : > file. Then using statistics you get the most likely one.

    : The phrases "100%" and "most likely one" aren't equivalent.

    : And look up the James Nicoll quote on the purity of the english
    : language.

    Every language is 100% pure all the time - they are moving targets defined
    by their own use.
     
    Malcolm Dew-Jones, Jan 22, 2004
    #11
  12. Anno Siegel (-berlin.de) wrote:
    : Ben Morrow <> wrote in comp.lang.perl.misc:
    : >
    : > AR <> wrote:
    : > > Does exist any module/script that can 100% detect text language..
    : > > for example English, German, French, ... (European languages, at least
    : > > English...)
    : >
    : > 100%? No. What language is this string: "hotel"?

    : Well, one-word-samples are hard, and 100% is unattainable.

    : Entirely off topic, I have recently heard of an approach to text
    : classification (with an eye to language recognition) that I found
    : interesting.

    : Use a Ziv-Lempel-like method to compress your sample. Then concatenate
    : it with texts of similar lengths taken from known languages and compress
    : again. If the compression rate is similar or better than that of the
    : original text, the appended text is similar to the original one. If
    : the compression deteriorates, the texts are dissimilar.

    : The source (some idle chat on IRC, sorry) said that this works for
    : rather small samples of fewer than a hundred words. I have always been
    : meaning to play with it, but haven't got around.

    : Anno

    Sounds reasonable; basically it would be testing for similarity of letter
    sequences.

    I might also suggest using a Bayesian filter such as ifile or similar.
    They try to file each message into the correct one of many folders. (I've
    never used ifile, just read about it.)

    You would provide samples in the languages you anticipate and then let the
    filter categorize each document.
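A minimal sketch of that Bayesian approach, written in Python. This is not ifile's interface; it is a plain naive Bayes classifier over character trigrams with add-one smoothing and equal priors, and the class name and the small training samples are made up for illustration:

```python
import math
from collections import Counter

# Made-up training samples (assumptions); supply real text per expected language.
SAMPLES = {
    "english": "We have been waiting for the train since this morning, "
               "and it is still not here. There was nothing more to be done.",
    "german": "Wir warten seit heute Morgen auf den Zug, und er ist immer "
              "noch nicht da. Man konnte nichts mehr dagegen tun.",
    "french": "Nous attendons le train depuis ce matin, et il n'est toujours "
              "pas là. Il n'y avait plus rien à faire.",
}


def trigrams(text):
    """Return overlapping three-character slices of the padded, lower-cased text."""
    padded = "  " + text.lower() + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class NaiveBayesLanguageFilter:
    def __init__(self):
        self.counts = {}    # language -> Counter of trigram frequencies
        self.vocab = set()  # every trigram seen in any training text

    def train(self, language, text):
        grams = trigrams(text)
        self.counts.setdefault(language, Counter()).update(grams)
        self.vocab.update(grams)

    def log_scores(self, text):
        """Log-likelihood of text under each language, with add-one smoothing."""
        size = len(self.vocab) + 1  # one extra bucket for unseen trigrams
        scores = {}
        for language, counter in self.counts.items():
            total = sum(counter.values())
            scores[language] = sum(
                math.log((counter[g] + 1) / (total + size)) for g in trigrams(text)
            )
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


bayes = NaiveBayesLanguageFilter()
for language, sample in SAMPLES.items():
    bayes.train(language, sample)

print(bayes.classify("waiting for the train"))
```

As with a mail filter, the result is only as good as the samples: the classifier always picks one of the trained languages, so text in an unanticipated language is silently misfiled.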

    $0.02
     
    Malcolm Dew-Jones, Jan 22, 2004
    #12
  13. Joe Smith Guest

    Malcolm Dew-Jones wrote:

    > Ben Morrow () wrote:
    >
    > : AR <> wrote:
    > : > Does exist any module/script that can 100% detect text language..
    > : > for example English, German, French, ... (European languages, at least
    > : > English...)
    >
    > : 100%? No. What language is this string: "hotel"?
    >
    > I can say with 100% certainty that that is an english word.


    Taxi!
     
    Joe Smith, Jan 22, 2004
    #13
  14. -berlin.de (Anno Siegel) writes:

    > Entirely off topic, I have recently heard of an approach to text
    > classification (with an eye to language recognition) that I found
    > interesting.
    >
    > Use a Ziv-Lempel-like method to compress your sample. Then concatenate
    > it with texts of similar lengths taken from known languages and compress
    > again. If the compression rate is similar or better than that of the
    > original text, the appended text is similar to the original one. If
    > the compression deteriorates, the texts are dissimilar.
    >
    > The source (some idle chat on IRC, sorry) said that this works for
    > rather small samples of fewer than a hundred words. I have always been
    > meaning to play with it, but haven't got around.



    Probably:
    Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto
    "Language Trees and Zipping". in: Physical Review Letters, January 28, 2002.
    (http://link.aps.org/abstract/PRL/v88/e048702)

    http://arxiv.org/abs/cond-mat/0108530 (The Paper)
    http://arxiv.org/abs/cond-mat/0202383 (One very critical answer)

    Cheers,

    Andreas
     
    Andreas Marcel Riechert, Jan 22, 2004
    #14
  15. Joe Smith wrote:
    >
    > Malcolm Dew-Jones wrote:
    >
    > > Ben Morrow () wrote:
    > >
    > > : AR <> wrote:
    > > : > Does exist any module/script that can 100% detect text language..
    > > : > for example English, German, French, ... (European languages, at least
    > > : > English...)
    > >
    > > : 100%? No. What language is this string: "hotel"?
    > >
    > > I can say with 100% certainty that that is an english word.

    >
    > Taxi!


    Beer!


    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Jan 22, 2004
    #15
  16. Anno Siegel Guest

    Malcolm Dew-Jones <> wrote in comp.lang.perl.misc:
    > Anno Siegel (-berlin.de) wrote:
    > : Ben Morrow <> wrote in comp.lang.perl.misc:
    > : >
    > : > AR <> wrote:
    > : > > Does exist any module/script that can 100% detect text language..
    > : > > for example English, German, French, ... (European languages, at least
    > : > > English...)
    > : >
    > : > 100%? No. What language is this string: "hotel"?
    >
    > : Well, one-word-samples are hard, and 100% is unattainable.
    >
    > : Entirely off topic, I have recently heard of an approach to text
    > : classification (with an eye to language recognition) that I found
    > : interesting.
    >
    > : Use a Ziv-Lempel-like method to compress your sample. Then concatenate
    > : it with texts of similar lengths taken from known languages and compress
    > : again. If the compression rate is similar or better than that of the
    > : original text, the appended text is similar to the original one. If
    > : the compression deteriorates, the texts are dissimilar.
    >
    > : The source (some idle chat on IRC, sorry) said that this works for
    > : rather small samples of fewer than a hundred words. I have always been
    > : meaning to play with it, but haven't got around.
    >
    > : Anno
    >
    > Sounds reasonable, basically it would be testing for similarity of letter
    > sequences.


    That's the idea. Trouble is, it would take quite some research into how
    co-compressibility actually varies with text samples to tune the parameters
    you need to make decisions. That's what's stopping me from "playing"
    with it; I'll leave that for someone with a diploma in statistics (or
    the need for one).

    > I might also suggest using a bayesian filter such as ifile or similar.


    Yes, it came up as an alternative in a discussion of Bayesian spam filters.

    > They try to file each message into the correct one (of many) folder. (I've
    > nevr used ifile, just read of it.)
    >
    > You would provide samples in the languages you anticipate and then let the
    > filter categorize each document.


    All these treat the problem of language identification as a case of
    general text classification. Specific methods may apply, such as testing
    for frequent key words in each language. For some (inflecting) languages,
    an analysis of word endings may be highly distinctive. And so on, since
    we're off topic. If these fail, one might decide not to decide, or fall
    back on more expensive text classification, a la above.
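The keyword and word-ending tests, together with "deciding not to decide", might look roughly like this in Python. The keyword sets, the endings and the `min_margin` threshold are illustrative guesses, not tuned values:

```python
import re

# Illustrative guesses only; real lists would come from frequency counts.
KEYWORDS = {
    "english": {"the", "and", "of", "is", "to", "that"},
    "german": {"der", "die", "das", "und", "ist", "nicht"},
    "french": {"le", "la", "les", "et", "est", "que", "de"},
}
ENDINGS = {
    "english": ("ing", "ness", "ly"),
    "german": ("ung", "lich", "keit", "isch"),
    "french": ("eux", "aient", "ique"),
}


def score(text):
    """Per language: one point per keyword hit, else one per typical ending."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {language: 0 for language in KEYWORDS}
    for token in tokens:
        for language in KEYWORDS:
            if token in KEYWORDS[language]:
                scores[language] += 1
            elif token.endswith(ENDINGS[language]):
                scores[language] += 1
    return scores


def identify(text, min_margin=2):
    """Return the leading language, or None if its lead is below min_margin."""
    scores = score(text)
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if scores[best] - scores[runner_up] < min_margin:
        return None  # undecided: caller may fall back to a costlier classifier
    return best


print(identify("Die Zeitung ist nicht freundlich"))
print(identify("hotel"))
```

Returning `None` for short or ambiguous input such as "hotel" leaves room to hand the text to one of the heavier classification methods instead of forcing a guess.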

    Anno
     
    Anno Siegel, Jan 22, 2004
    #16
  17. Anno Siegel Guest

    Andreas Marcel Riechert <> wrote in comp.lang.perl.misc:
    > -berlin.de (Anno Siegel) writes:
    >
    > > Entirely off topic, I have recently heard of an approach to text
    > > classification (with an eye to language recognition) that I found
    > > interesting.
    > >
    > > Use a Ziv-Lempel-like method to compress your sample. Then concatenate
    > > it with texts of similar lengths taken from known languages and compress
    > > again. If the compression rate is similar or better than that of the
    > > original text, the appended text is similar to the original one. If
    > > the compression deteriorates, the texts are dissimilar.
    > >
    > > The source (some idle chat on IRC, sorry) said that this works for
    > > rather small samples of fewer than a hundred words. I have always been
    > > meaning to play with it, but haven't got around.

    >
    >
    > Probably:
    > Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto
    > "Language Trees and Zipping". in: Physical Review Letters, January 28, 2002.
    > (http://link.aps.org/abstract/PRL/v88/e048702)
    >
    > http://arxiv.org/abs/cond-mat/0108530 (The Paper)
    > http://arxiv.org/abs/cond-mat/0202383 (One very critical answer)


    I believe those names were mentioned, thanks for the reference.

    I haven't read the papers yet, but I notice that part of the reply is
    about the article being off topic in Physical Review Letters. Some
    things won't change, no matter what the medium...

    Anno
     
    Anno Siegel, Jan 22, 2004
    #17
  18. Anno Siegel Guest

    John W. Krahn <> wrote in comp.lang.perl.misc:
    > Joe Smith wrote:
    > > Malcolm Dew-Jones wrote:
    > > > Ben Morrow () wrote:
    > > >
    > > > > "hotel"

    > >
    > > Taxi!

    >
    > Beer!


    Now reverse.

    Anno.
     
    Anno Siegel, Jan 22, 2004
    #18
  19. Anno Siegel <-berlin.de> wrote:
    > John W. Krahn <> wrote in comp.lang.perl.misc:
    >> Joe Smith wrote:
    >> > Malcolm Dew-Jones wrote:
    >> > > Ben Morrow () wrote:
    >> > >
    >> > > > "hotel"
    >> >
    >> > Taxi!

    >>
    >> Beer!

    >
    > Now reverse.



    What is going on here?

    I've heard of "scalar context".

    I've heard of "list context".

    What is this "silly context" that seems to have taken over this thread?


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Jan 23, 2004
    #19
  20. Malcolm Dew-Jones wrote:
    > Ben Morrow () wrote:
    >
    >> AR <> wrote:
    >>> Does exist any module/script that can 100% detect text language..
    >>> for example English, German, French, ... (European languages, at
    >>> least English...)

    >
    >> 100%? No. What language is this string: "hotel"?

    >
    > I can say with 100% certainty that that is an english word.


    If you had said "It is a word of the English language", then I would
    have concurred.
    However, "an English word"? No.

    jue
     
    Jürgen Exner, Jan 23, 2004
    #20