how to detect the character encoding in a web page ?

Discussion in 'Python' started by iMath, Dec 24, 2012.

  1. iMath

    iMath Guest

    how to detect the character encoding in a web page ?
    such as this page

    http://python.org/
    iMath, Dec 24, 2012
    #1

  2. Re: how to detect the character encoding in a web page ?

    On Mon, Dec 24, 2012 at 11:34 AM, iMath <> wrote:
    > how to detect the character encoding in a web page ?
    > such as this page
    >
    > http://python.org/


    You read part-way into the page, where you find this:

    <meta http-equiv="content-type" content="text/html; charset=utf-8" />

    That tells you that the character set is UTF-8.

    ChrisA
    Chris Angelico, Dec 24, 2012
    #2

  3. iMath

    Hans Mulder Guest

    On 24/12/12 01:34:47, iMath wrote:
    > how to detect the character encoding in a web page ?


    That depends on the site: different sites indicate
    their encoding differently.

    > such as this page: http://python.org/


    If you download that page and look at the HTML code, you'll find a line:

    <meta http-equiv="content-type" content="text/html; charset=utf-8" />

    So it's encoded as utf-8.

    Other sites declare their charset in the Content-Type HTTP header line.
    And then there are sites relying on the default. And sites that get
    it wrong, and send data in a different encoding from what they declare.
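
For the HTTP header case, a minimal sketch using only the standard library might look like this. The function name `charset_from_content_type` is made up for illustration; it only parses a header value you already have and does not fetch anything.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the name and strips any quotes.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

The response object returned by `urllib.request.urlopen()` exposes its headers as a message object of the same family, so `response.headers.get_content_charset()` should give the same answer for a live page.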


    Welcome to the real world,

    -- HansM
    Hans Mulder, Dec 24, 2012
    #3
  4. iMath

    iMath Guest

    On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
    > how to detect the character encoding in a web page ?
    >
    > such as this page
    >
    >
    >
    > http://python.org/


    But how do I get Python to do it for me?

    For example, this page:

    http://python.org/

    How do I detect the character encoding of this web page with Python?
    iMath, Dec 24, 2012
    #4
  7. iMath

    Kurt Mueller Guest

    Am 24.12.2012 um 04:03 schrieb iMath:
    > but how to let python do it for you ?
    > such as these 2 pages
    > http://python.org/
    > http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
    > how to detect the character encoding in these 2 pages by python ?



    If you have the html code, let
    chardetect.py
    do an educated guess for you.

    http://pypi.python.org/pypi/chardet

    Example:
    $ wget -q -O - http://python.org/ | chardetect.py
    stdin: ISO-8859-2 with confidence 0.803579722043
    $

    $ wget -q -O - 'http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx' | chardetect.py
    stdin: utf-8 with confidence 0.87625
    $
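
The same guess can be made from inside Python by calling `chardet.detect()` on the raw bytes. This is only a sketch: `guess_encoding` and its `default` parameter are hypothetical names, and the import guard is an assumption added so the snippet still runs where chardet is not installed.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None


def guess_encoding(raw, default="utf-8"):
    """Return (encoding, confidence) for raw bytes, based on chardet's guess.

    Falls back to (default, 0.0) when chardet is missing or has no guess.
    """
    if chardet is None:
        return default, 0.0
    result = chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys
    if not result["encoding"]:
        return default, 0.0
    return result["encoding"], result["confidence"]


encoding, confidence = guess_encoding(b"<html><body>hello</body></html>")
print(encoding, confidence)
```

As the ISO-8859-2 result for python.org above shows, the guess can be wrong, so the confidence value is worth checking before trusting it.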


    Grüessli
    --
    Kurt Mueller, Dec 24, 2012
    #7
  8. iMath

    Kwpolska Guest

    Re: how to detect the character encoding in a web page ?

    On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
    <> wrote:
    > $ wget -q -O - http://python.org/ | chardetect.py
    > stdin: ISO-8859-2 with confidence 0.803579722043
    > $


    And it sucks, because it uses magic, and not reading the HTML tags.
    The RIGHT thing to do for websites is detect the meta charset
    definition, which is

    <meta http-equiv="content-type" content="text/html; charset=utf-8">

    or

    <meta charset="utf-8">

    The second one for HTML5 websites, and both may require case
    conversion and the useless ` /` at the end. But if somebody is using
    HTML5, you are pretty much guaranteed to get UTF-8.

    In today's world, the proper assumption to make is "UTF-8 or GTFO".
    Because nobody in the right mind would use something else today.
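
A rough sketch of looking for either meta form directly in the undecoded bytes could be written with a regular expression. The names `META_CHARSET_RE` and `charset_from_meta` and the 4096-byte scan limit are assumptions for illustration; a regex like this can be fooled by unusual markup, so treat it as a first approximation rather than a real HTML parser.

```python
import re

# Matches both <meta charset="utf-8"> and
# <meta http-equiv="content-type" content="text/html; charset=utf-8">,
# in any letter case and with or without the trailing " /".
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def charset_from_meta(raw_html, scan_limit=4096):
    """Return the charset named in a <meta> tag near the top of the page, or None."""
    match = META_CHARSET_RE.search(raw_html[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


page = b'<meta http-equiv="content-type" content="text/html; charset=utf-8" />'
print(charset_from_meta(page))
```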

    --
    Kwpolska <http://kwpolska.tk>
    stop html mail | always bottom-post
    www.asciiribbon.org | www.netmeister.org/news/learn2quote.html
    GPG KEY: 5EAAEA16
    Kwpolska, Dec 24, 2012
    #8
  9. Re: how to detect the character encoding in a web page ?

    On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:

    > On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
    > <> wrote:
    >> $ wget -q -O - http://python.org/ | chardetect.py stdin: ISO-8859-2
    >> with confidence 0.803579722043 $

    >
    > And it sucks, because it uses magic, and not reading the HTML tags. The
    > RIGHT thing to do for websites is detect the meta charset definition,
    > which is
    >
    > <meta http-equiv="content-type" content="text/html; charset=utf-8">
    >
    > or
    >
    > <meta charset="utf-8">
    >
    > The second one for HTML5 websites, and both may require case conversion
    > and the useless ` /` at the end. But if somebody is using HTML5, you
    > are pretty much guaranteed to get UTF-8.
    >
    > In today's world, the proper assumption to make is "UTF-8 or GTFO".
    > Because nobody in the right mind would use something else today.


    Alas, there are many, many, many, MANY websites that are created by
    people who are *not* in their right mind. To say nothing of 15 year old
    websites that use a legacy encoding. And to support those, you may need
    to guess the encoding, and for that, chardetect.py is the solution.


    --
    Steven
    Steven D'Aprano, Dec 24, 2012
    #9
  10. iMath

    Roy Smith Guest

    Re: how to detect the character encoding in a web page ?

    In article <rn%Bs.693798$4>,
    Alister <> wrote:

    > Indeed due to the poor quality of most websites it is not possible to be
    > 100% accurate for all sites.
    >
    > personally I would start by checking the doc type & then the meta data as
    > these should be quick & correct, I then use chardetect only if these
    > fail to provide any result.


    I agree that checking the metadata is the right thing to do. But, I
    wouldn't go so far as to assume it will always be correct. There's a
    lot of crap out there with perfectly formed metadata which just happens
    to be wrong.

    Although it pains me greatly to quote Ronald Reagan as a source of
    wisdom, I have to admit he got it right with "Trust, but verify". It's
    the only way to survive in the unicode world. Write defensive code.
    Wrap try blocks around calls that might raise exceptions if the external
    data is borked w/r/t what the metadata claims it should be.
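
One way to read "trust, but verify" in code: try the declared encoding first, and fall back when the bytes do not actually decode. The function name and the fallback list (`utf-8`, then `cp1252`) are assumptions chosen for illustration, not something the thread prescribes.

```python
def decode_defensively(raw, declared=None, fallbacks=("utf-8", "cp1252")):
    """Decode raw bytes, trying the declared encoding first.

    Returns a (text, encoding_used) tuple.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: the declared name is not a codec Python knows.
            continue
    # Last resort: latin-1 maps every byte to a character, so it cannot fail.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_defensively("Grüessli".encode("utf-8"), declared="ascii")
print(text, used)
```

A successful decode does not prove the encoding is right (latin-1 accepts anything), so this catches only the cases where the metadata is plainly impossible.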
    Roy Smith, Dec 24, 2012
    #10
  11. On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
    > how to detect the character encoding in a web page ?
    >
    > such as this page
    >
    >
    >
    > http://python.org/


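    First install chardet, then:


    import urllib2  # Python 2; in Python 3 use urllib.request
    import chardet

    # Fetch the page HTML ("line" holds the URL)
    html_1 = urllib2.urlopen(line, timeout=120).read()
    # Let chardet guess the encoding of the raw bytes
    mychar = chardet.detect(html_1)
    bianma = mychar['encoding']  # "bianma" = encoding
    if bianma and bianma.lower() == 'utf-8':
        # Already UTF-8: use the bytes as they are
        html = html_1
    else:
        # Otherwise assume GB2312 and re-encode as UTF-8
        html = html_1.decode('gb2312', 'ignore').encode('utf-8')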
    python培训, Dec 28, 2012
    #11
  12. iMath

    iMath Guest

    On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
    > how to detect the character encoding in a web page ?
    >
    > such as this page
    >
    >
    >
    > http://python.org/


    Up to now, chardet may be the only way to let Python do it automatically.
    iMath, Jan 7, 2013
    #12
  13. Re: how to detect the character encoding in a web page ?

    In article <>,
    Roy Smith <> wrote:
    >In article <rn%Bs.693798$4>,
    > Alister <> wrote:
    >
    >> Indeed due to the poor quality of most websites it is not possible to be
    >> 100% accurate for all sites.
    >>
    >> personally I would start by checking the doc type & then the meta data as
    >> these should be quick & correct, I then use chardetect only if these
    >> fail to provide any result.

    >
    >I agree that checking the metadata is the right thing to do. But, I
    >wouldn't go so far as to assume it will always be correct. There's a
    >lot of crap out there with perfectly formed metadata which just happens
    >to be wrong.
    >
    >Although it pains me greatly to quote Ronald Reagan as a source of
    >wisdom, I have to admit he got it right with "Trust, but verify". It's


    Not surprisingly, as an actor, Reagan was as good as his script.
    This one he got from Stalin.

    >the only way to survive in the unicode world. Write defensive code.
    >Wrap try blocks around calls that might raise exceptions if the external
    >data is borked w/r/t what the metadata claims it should be.


    The way to go, of course.

    Groetjes Albert
    --
    Albert van der Horst, UTRECHT,THE NETHERLANDS
    Economic growth -- being exponential -- ultimately falters.
    albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
    Albert van der Horst, Jan 14, 2013
    #13
  14. iMath

    iMath Guest

    On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
    > how to detect the character encoding in a web page ?
    >
    > such as this page
    >
    >
    >
    > http://python.org/


    I found that PyQt's QTextStream can detect the character encoding of a web page very accurately,
    even for this bad page

    chardet and Beautiful Soup failed, but QTextStream got the right result.
    iMath, Jun 5, 2013
    #14
  15. iMath

    iMath Guest

    On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
    > how to detect the character encoding in a web page ?
    >
    > such as this page
    >
    >
    >
    > http://python.org/


    I found that PyQt's QTextStream can detect the character encoding of a web page very accurately,
    even for this bad page:
    http://www.qnwz.cn/html/yinlegushihui/magazine/2013/0524/425731.html
    chardet and Beautiful Soup failed, but QTextStream got the right result.

    Here is my code:

    from PyQt4.QtCore import *
    from PyQt4.QtGui import *
    from PyQt4.QtNetwork import *
    import sys

    def slotSourceDownloaded(reply):
        # Print the redirect target if a Location header was sent, else the reply URL
        redirctLocation = reply.header(QNetworkRequest.LocationHeader)
        redirctLocationUrl = reply.url() if not redirctLocation else redirctLocation
        print(redirctLocationUrl)

        if reply.error() != QNetworkReply.NoError:
            print('11111111', reply.errorString())
            return

        # Decode the reply with QTextStream's default codec
        content = QTextStream(reply).readAll()
        if content == '':
            print('---------', 'cannot find any resource !')
            return

        print(content)

        reply.deleteLater()
        qApp.quit()


    if __name__ == '__main__':
        app = QCoreApplication(sys.argv)
        manager = QNetworkAccessManager()
        url = input('input url :')
        request = QNetworkRequest(QUrl.fromEncoded(QUrl.fromUserInput(url).toEncoded()))
        request.setRawHeader("User-Agent", 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17 SE 2.X MetaSr 1.0')
        # Connect the slot before sending the request
        manager.finished.connect(slotSourceDownloaded)
        manager.get(request)
        sys.exit(app.exec_())
    iMath, Jun 5, 2013
    #15
  16. iMath

    iMath Guest

    On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
    > how to detect the character encoding in a web page ?
    >
    > such as this page
    >
    >
    >
    > http://python.org/


    By the way, we cannot read the character encoding from the meta data programmatically without knowing the character encoding in advance!
    iMath, Jun 5, 2013
    #16
  17. Re: how to detect the character encoding in a web page ?

    On Thu, Jun 6, 2013 at 1:14 AM, iMath <> wrote:
    > On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
    >> how to detect the character encoding in a web page ?
    >>
    >> such as this page
    >>
    >>
    >>
    >> http://python.org/

    >
    > by the way, we cannot get the character encoding programmatically from the meta data without knowing the character encoding ahead!


    The rules for web pages are (massively oversimplified):

    1) HTTP header
    2) ASCII-compatible encoding and meta tag

    The HTTP header is completely out of band. This is the best way to
    transmit encoding information. Otherwise, you assume 7-bit ASCII and
    start parsing. Once you find a meta tag, you stop parsing and go back
    to the top, decoding in the new way. "ASCII-compatible" covers a huge
    number of encodings, so it's not actually much of a problem to do
    this.
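
The two rules above can be sketched with the standard library's HTML parser. Everything named here (`_MetaCharsetFinder`, `detect_declared_encoding`, the 4096-byte limit) is an assumption for illustration, and like the rules themselves it is heavily simplified: the header charset is passed in by the caller, and the page is decoded as ASCII with replacement characters just long enough to find a meta tag.

```python
from html.parser import HTMLParser


class _MetaCharsetFinder(HTMLParser):
    """Records the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 form: <meta charset="...">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: content="text/html; charset=..."
            content = (attrs.get("content") or "").lower()
            if "charset=" in content:
                self.charset = content.split("charset=", 1)[1].split(";")[0].strip()


def detect_declared_encoding(raw, header_charset=None):
    """Return the declared encoding name, or None if nothing is declared."""
    # Rule 1: the out-of-band HTTP header wins.
    if header_charset:
        return header_charset.lower()
    # Rule 2: assume an ASCII-compatible encoding and look for a meta tag.
    finder = _MetaCharsetFinder()
    finder.feed(raw[:4096].decode("ascii", errors="replace"))
    return finder.charset


print(detect_declared_encoding(b'<html><head><meta charset="GBK"></head>'))
```

Once a name comes back, the caller would go back to the start of `raw` and decode the whole page with it.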

    ChrisA
    Chris Angelico, Jun 5, 2013
    #17
  18. iMath

    Nobody Guest

    Re: how to detect the character encoding in a web page ?

    On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:

    > The HTTP header is completely out of band. This is the best way to
    > transmit encoding information. Otherwise, you assume 7-bit ASCII and start
    > parsing. Once you find a meta tag, you stop parsing and go back to the
    > top, decoding in the new way.


    Provided that the meta tag indicates an ASCII-compatible encoding, and you
    haven't encountered any decode errors due to 8-bit characters, then
    there's no need to go back to the top.

    > "ASCII-compatible" covers a huge number of
    > encodings, so it's not actually much of a problem to do this.


    With slight modifications, you can also handle some
    almost-ASCII-compatible encodings such as shift-JIS.

    Personally, I'd start by assuming ISO-8859-1, keep track of which bytes
    have actually been seen, and only re-start parsing from the top if the
    encoding change actually affects the interpretation of any of those bytes.
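
A simplified sketch of that check: rather than tracking individual byte values, it decodes the bytes consumed so far under both encodings and compares the results. The function name `needs_reparse` and its parameters are hypothetical.

```python
def needs_reparse(consumed, provisional="iso-8859-1", declared="utf-8"):
    """Return True if the bytes parsed so far read differently under `declared`."""
    try:
        return consumed.decode(provisional) != consumed.decode(declared)
    except (UnicodeDecodeError, LookupError):
        # Undecodable bytes or an unknown codec name: be conservative and restart.
        return True


# Pure ASCII so far: switching to UTF-8 changes nothing, so keep going.
print(needs_reparse(b"<html><head>"))
# A non-ASCII byte sequence was already seen: restart from the top.
print(needs_reparse("<title>caf\u00e9</title>".encode("utf-8")))
```

Note that a prefix cut in the middle of a multi-byte character also reports True, which errs on the side of re-parsing.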

    And if the encoding isn't even remotely ASCII-compatible, you aren't going
    to be able to recognise the meta tag in the first place. But I don't think
    I've ever seen a web page encoded in UTF-16 or EBCDIC.

    Tools like chardet are meant for the situation where either no encoding is
    specified or the specified encoding can't be trusted (which is rather
    common; why else would web browsers have a menu to allow the user to
    select the encoding?).
    Nobody, Jun 6, 2013
    #18
  19. Re: how to detect the character encoding in a web page ?

    On Thu, Jun 6, 2013 at 4:22 PM, Nobody <> wrote:
    > On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:
    >
    >> The HTTP header is completely out of band. This is the best way to
    >> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
    >> parsing. Once you find a meta tag, you stop parsing and go back to the
    >> top, decoding in the new way.

    >
    > Provided that the meta tag indicates an ASCII-compatible encoding, and you
    > haven't encountered any decode errors due to 8-bit characters, then
    > there's no need to go back to the top.


    Technically and conceptually, you go back to the start and re-parse.
    Sure, you might optimize that if you can, but not every parser will,
    hence it's advisable to put the content-type as early as possible.

    >> "ASCII-compatible" covers a huge number of
    >> encodings, so it's not actually much of a problem to do this.

    >
    > With slight modifications, you can also handle some
    > almost-ASCII-compatible encodings such as shift-JIS.
    >
    > Personally, I'd start by assuming ISO-8859-1, keep track of which bytes
    > have actually been seen, and only re-start parsing from the top if the
    > encoding change actually affects the interpretation of any of those bytes.


    Hrm, it'd be equally valid to guess UTF-8. But as long as you're
    prepared to re-parse after finding the content-type, that's just a
    choice of optimization and has no real impact.

    ChrisA
    Chris Angelico, Jun 6, 2013
    #19
  20. iMath

    iMath Guest

    On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
    > how to detect the character encoding in a web page ?
    >
    > such as this page
    >
    >
    >
    > http://python.org/


    Finally, I found that by combining PyQt's QTextStream and QTextCodec with chardet, we can decode a web page more reliably,
    even for this bad page
    http://www.qnwz.cn/html/yinlegushihui/magazine/2013/0524/425731.html

    this script
    http://www.flvxz.com/getFlv.php?url=aHR0cDojI3d3dy41Ni5jb20vdTk1L3ZfT1RFM05UYzBNakEuaHRtbA==

    and this page, which has no charset declaration in its source code
    http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx


    from PyQt4.QtCore import *
    from PyQt4.QtGui import *
    from PyQt4.QtNetwork import *
    import sys
    import chardet

    def slotSourceDownloaded(reply):
        redirctLocation = reply.header(QNetworkRequest.LocationHeader)
        redirctLocationUrl = reply.url() if not redirctLocation else redirctLocation
        #print(redirctLocationUrl, reply.header(QNetworkRequest.ContentTypeHeader))

        if reply.error() != QNetworkReply.NoError:
            print('11111111', reply.errorString())
            return

        pageCode = reply.readAll()
        # chardet's guess serves only as the fallback codec
        charCodecInfo = chardet.detect(pageCode.data())

        textStream = QTextStream(pageCode)
        # codecForHtml() checks the BOM and the content-type meta header,
        # and returns the fallback codec when it finds neither
        codec = QTextCodec.codecForHtml(pageCode, QTextCodec.codecForName(charCodecInfo['encoding']))
        textStream.setCodec(codec)
        content = textStream.readAll()
        print(content)

        if content == '':
            print('---------', 'cannot find any resource !')
            return

        reply.deleteLater()
        qApp.quit()


    if __name__ == '__main__':
        app = QCoreApplication(sys.argv)
        manager = QNetworkAccessManager()
        url = input('input url :')
        request = QNetworkRequest(QUrl.fromEncoded(QUrl.fromUserInput(url).toEncoded()))
        request.setRawHeader("User-Agent", 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17 SE 2.X MetaSr 1.0')
        # Connect the slot before sending the request
        manager.finished.connect(slotSourceDownloaded)
        manager.get(request)
        sys.exit(app.exec_())
    iMath, Jun 9, 2013
    #20