identify the language of a web page

Discussion in 'Javascript' started by usgog@yahoo.com, Apr 11, 2008.

  1. Guest

    Suppose I need to classify 10000 web pages based on their languages.
    What should I look for to determine the language of each web page? Any
    advice is welcome.
    , Apr 11, 2008
    #1

  2. GArlington Guest

    On Apr 11, 6:29 am, "" <> wrote:
    > Suppose I need to classify 10000 web pages based on their languages.
    > What should I look for to determine the language of each web page? Any
    > advice is welcome.


    You can look at the character encoding declared in the header, but if
    (as many do) the page uses UTF-8, the encoding tells you nothing about
    the language. UTF-8 also allows multiple languages on one page; which
    one would you want to pick?
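
    A rough sketch of the encoding check, for what it is worth. The function names and the lookup table are made up for illustration, and the table is deliberately incomplete: it lists only a few legacy encodings that are tied closely to one language. UTF-8 and anything else unknown yields null, which is exactly the limitation described above.

```javascript
// Hypothetical helper: pull the charset out of a Content-Type header value.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^\s;"]+)/i.exec(contentType || "");
  return match ? match[1].toLowerCase() : null;
}

// Illustrative, incomplete table of legacy encodings that suggest one language.
const LANGUAGE_BY_CHARSET = {
  "shift_jis": "ja",
  "euc-jp": "ja",
  "iso-2022-jp": "ja",
  "euc-kr": "ko",
  "gb2312": "zh",
  "big5": "zh",
  "koi8-r": "ru",
  "windows-1253": "el",
  "windows-1255": "he",
};

// Returns a likely language code, or null when the charset gives no hint
// (missing charset, UTF-8, or any encoding shared by many languages).
function guessLanguageFromCharset(contentType) {
  const charset = charsetFromContentType(contentType);
  if (charset === null) return null;
  return Object.prototype.hasOwnProperty.call(LANGUAGE_BY_CHARSET, charset)
    ? LANGUAGE_BY_CHARSET[charset]
    : null;
}
```

    For example, guessLanguageFromCharset("text/html; charset=Shift_JIS") gives "ja", while a UTF-8 page gives null and has to be classified some other way.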
    GArlington, Apr 11, 2008
    #2

  3. pr Guest

    wrote:
    > Suppose I need to classify 10000 web pages based on their languages.
    > What should I look for to determine the language of each web page? Any
    > advice is welcome.


    You could search the content for a common word in a given language that
    is used in neither HTML nor script: " the " (including the spaces) would
    be, I guess, a reasonable choice to identify English, although there's
    no guarantee some bright spark hasn't named a script variable 'the', or
    used the word in a comment.
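
    The common-word idea can be sketched roughly as follows. This is a minimal illustration, not a tested classifier: the word lists are tiny and chosen by hand, the function names are hypothetical, and the tag stripping uses regular expressions rather than a real HTML parser. Removing script and style blocks first avoids the problem of a variable or comment containing "the".

```javascript
// Remove script/style blocks, comments, and tags so only visible text remains.
function visibleText(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    .replace(/<!--[\s\S]*?-->/g, " ")
    .replace(/<[^>]+>/g, " ");
}

// Assumed, hand-picked function words; a real list would be much longer.
const STOPWORDS = {
  en: ["the", "and", "of", "is"],
  de: ["der", "und", "das", "ist"],
  fr: ["le", "et", "les", "est"],
  es: ["el", "y", "los", "una"],
};

// Returns the language whose common words occur most often, or null if none occur.
function guessLanguageByStopwords(html) {
  const words = visibleText(html)
    .toLowerCase()
    .split(/[^\p{L}]+/u)
    .filter(Boolean);
  let best = null;
  let bestCount = 0;
  for (const [lang, list] of Object.entries(STOPWORDS)) {
    const set = new Set(list);
    const count = words.filter((w) => set.has(w)).length;
    if (count > bestCount) {
      best = lang;
      bestCount = count;
    }
  }
  return best;
}
```

    Counting several words per language rather than a single " the " makes one stray match much less likely to decide the result.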

    Are you sure this is a JavaScript question?
    pr, Apr 11, 2008
    #3
  4. Joost Diepenmaat, Apr 11, 2008
    #4
  5. In comp.lang.javascript message <cda0f617-c9a0-4389-b79e-a02ad24852a6@k10g2000prm.googlegroups.com>,
    Thu, 10 Apr 2008 22:29:01,
    "" <> posted:
    >Suppose I need to classify 10000 web pages based on their languages.
    >What should I look for to determine the language of each web page? Any
    >advice is welcome.


    Consider <URL:http://www.merlyn.demon.co.uk/zel-82px.htm> and siblings.

    --
    (c) John Stockton, nr London, UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
    Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
    Proper <= 4-line sig. separator as above, a line exactly "-- " (SonOfRFC1036)
    Do not Mail News to me. Before a reply, quote with ">" or "> " (SonOfRFC1036)
    Dr J R Stockton, Apr 11, 2008
    #5
  6. VK Guest

    On Apr 12, 2:35 am, Dr J R Stockton <> wrote:
    > In comp.lang.javascript message <cda0f617-c9a0-4389-b79e-a02ad24852a6@k10g2000prm.googlegroups.com>,
    > Thu, 10 Apr 2008 22:29:01,
    > "" <> posted:
    >
    > >Suppose I need to classify 10000 web pages based on their languages.
    > >What should I look for to determine the language of each web page? Any
    > >advice is welcome.

    >
    > Consider <URL:http://www.merlyn.demon.co.uk/zel-82px.htm> and siblings.


    <OT>
    More for ciwah, so OT, but still:
    is there a language code for a multilanguage document, like lang="multi"
    or something?
    </OT>
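
    On the lang attribute itself, here is a small sketch of how the declared values might be collected from a page, which also shows when a page declares more than one language. It assumes the page declares lang (or xml:lang) at all and declares it correctly, which many pages do not. The function name is hypothetical and the regular expression is a shortcut, not a real HTML parser.

```javascript
// Collect the distinct lang / xml:lang values declared on any element.
// Attributes such as hreflang are not matched, because the pattern requires
// whitespace immediately before "lang" or "xml:lang".
function declaredLanguages(html) {
  const found = new Set();
  const pattern = /<[a-z][^>]*?\s(?:xml:)?lang\s*=\s*["']?([a-z0-9-]+)/gi;
  let m;
  while ((m = pattern.exec(html)) !== null) {
    found.add(m[1].toLowerCase());
  }
  return [...found];
}
```

    A result with more than one entry marks a multilanguage page; an empty result means the page declares nothing and the text itself has to be examined.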
    VK, Apr 13, 2008
    #6
  7. Joost Diepenmaat, Apr 13, 2008
    #7
