character encoding conversion

Discussion in 'Python' started by Dylan, Dec 12, 2004.

  1. Dylan

    Dylan Guest

    Here's what I'm trying to do:

    - scrape some html content from various sources

    The issue I'm running into:

    - some of the sources have incorrectly encoded characters... for
    example, cp1252 curly quotes that were likely the result of the author
    copying and pasting content from Word

    I've searched and read for many hours, but have not found a solution
    for handling the case where the page author does not use the character
    encoding that they have specified.

    Things I have tried include encode()/decode(), and replacement lookup
    tables (i.e. something like
    http://groups-beta.google.com/group..._doneTitle=Back to Search&&d#11991de6ced3406b
    ). However, I am still unable to convert the characters to something
    meaningful. The lookup table failed because all of the improperly
    encoded characters came back as ? rather than in their original
    encoding.

    I'm using urllib and htmllib to open, read, and parse the html
    fragments, Python 2.3 on OS X 10.3

    Any ideas or pointers would be greatly appreciated.

    -Dylan Schiemann
    http://www.dylanschiemann.com/
     
    Dylan, Dec 12, 2004
    #1

  2. Dylan wrote:
    > Things I have tried include encode()/decode()


    This should work. If you somehow manage to guess the encoding,
    e.g. guess it as cp1252, then

    htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")

    will give you a file that contains only ASCII characters, and
    character references for everything else.
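
As a minimal sketch of the one-liner above, written in Python 3 syntax (an assumption; the thread targets Python 2.3, where `htmlstring` is a byte string). The sample bytes are made up for illustration:

```python
# Hypothetical sample: cp1252 bytes with curly quotes (0x93, 0x94)
# and an e-acute (0xE9), as might be pasted from a word processor.
raw = b"\x93caf\xe9\x94"

text = raw.decode("cp1252")                              # bytes -> text
ascii_only = text.encode("ascii", "xmlcharrefreplace")   # text -> ASCII bytes

print(ascii_only)  # b'&#8220;caf&#233;&#8221;'
```

Every non-ASCII character ends up as a numeric character reference, so the result can be embedded in HTML regardless of the page's declared encoding.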

    Now, how should you guess the encoding? Here is a strategy:
    1. use the encoding that was sent through the HTTP header. Be
    absolutely certain to not ignore this encoding.
    2. use the encoding in the XML declaration (if any).
    3. use the encoding in the http-equiv meta element (if any)
    4. use UTF-8
    5. use Latin-1, and check that there are no characters in the
    range(128,160)
    6. use cp1252
    7. use Latin-1

    Try steps 1 to 6 in order and check whether you manage to decode
    the input. Notice that in step 5 decoding will always succeed;
    consider it a failure if you get any control characters (from
    range(128, 160)), and then try Latin-1 again in step 7.

    When you find the first encoding that decodes correctly, encode
    the result with ascii and xmlcharrefreplace, and you won't need to
    worry about the encoding anymore.
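
The seven steps above could be sketched roughly as follows. This is Python 3 (an assumption; the thread targets Python 2.3), and `guess_decode`, `has_c1_controls` and the `declared` parameter are hypothetical names introduced only for illustration:

```python
def has_c1_controls(text):
    """Return True if text contains code points in range(128, 160)."""
    return any(128 <= ord(ch) < 160 for ch in text)


def guess_decode(data, declared=()):
    """Decode bytes with the first candidate encoding that works.

    `declared` holds the encoding names from the HTTP header, the XML
    declaration and the meta element, in that order (None is skipped).
    Returns a (text, encoding_used) tuple.
    """
    candidates = [enc for enc in declared if enc] + ["utf-8"]  # steps 1-4
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except (UnicodeError, LookupError):
            continue  # wrong or unknown encoding: try the next one

    # Step 5: Latin-1 always decodes, so reject it if controls show up.
    text = data.decode("latin-1")
    if not has_c1_controls(text):
        return text, "latin-1"
    try:
        return data.decode("cp1252"), "cp1252"  # step 6
    except UnicodeError:
        return text, "latin-1"                  # step 7
```

One caveat of following the steps literally: a page that declares Latin-1 but contains cp1252 bytes already succeeds at step 1, so the control-character check would probably have to be applied to declared encodings as well.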

    Regards,
    Martin
     
    Martin v. Löwis, Dec 12, 2004
    #2

  3. Martin v. Löwis wrote:
    > Now, how should you guess the encoding? Here is a strategy:
    > - snip -

    I have a similar problem, with characters like äöüAÖÜß and so on. I am
    extracting some content out of webpages, and they deliver whatever,
    sometimes not even giving any encoding information in the header. But
    your solution sounds quite good, i just do not know if
    - it works with the characters i mentioned
    - what encoding do you have in the end
    - and how exactly are you doing all this? All with somestring.decode()
    or... Can you please give an example for these 7 steps?
    Thanx in advance for the help
    Chris
     
    Christian Ergh, Dec 12, 2004
    #3
  4. Christian Ergh wrote:
    > - it works with the characters i mentioned


    It does.

    > - what encoding do you have in the end


    US-ASCII

    > - and how exactly are you doing all this? All with somestring.decode()
    > or... Can you please give an example for these 7 steps?


    I could, but I don't have the time - just try to come up with some
    code, and I'll try to comment on it.

    Regards,
    Martin
     
    Martin v. Löwis, Dec 12, 2004
    #4
  5. Martin v. Löwis wrote:
    > Now, how should you guess the encoding? Here is a strategy:
    > - snip -


    Something like this?
    Chris

    import urllib2

    url = 'www.someurl.com'
    f = urllib2.urlopen(url)
    data = f.read()
    # if it is not in the pagecode, how do i get the encoding of the page?
    pageencoding = ???
    xmlencoding = 'whatever i parsed out of the file'
    htmlmetaencoding = 'whatever i parsed out of the metatag'
    f.close()
    try:
        data = data.decode(pageencoding)
    except:
        try:
            data = data.decode(xmlencoding)
        except:
            try:
                data = data.decode(htmlmetaencoding)
            except:
                try:
                    data = data.encode('UTF-8')
                except:
                    flag = true
                    for char in data:
                        if 127 < ord(char) < 128:
                            flag = false
                    if flag:
                        try:
                            data = data.encode('latin-1')
                        except:
                            pass
                    try:
                        data = data.encode('cp1252')
                    except:
                        pass
                    try:
                        data = data.encode('latin-1')
                    except:
                        pass:
    data = data.encode("ascii", "xmlcharrefreplace")
     
    Christian Ergh, Dec 13, 2004
    #5
  6. Christian Ergh wrote:
    > flag = true
    > for char in data:
    >     if 127 < ord(char) < 128:
    >         flag = false
    > if flag:
    >     try:
    >         data = data.encode('latin-1')
    >     except:
    >         pass


    A little OT, but (assuming I got your indentation right[1]) this kind of
    loop is exactly what the else clause of a for-loop is for:

    for char in data:
        if 127 < ord(char) < 128:
            break
    else:
        try:
            data = data.encode('latin-1')
        except:
            pass

    Only saves you one line of code, but you don't have to keep track of a
    'flag' variable. Generally, I find that when I want to set a 'flag'
    variable, I can usually do it with a for/else instead.
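
A self-contained sketch of the for/else idiom, in Python 3 (an assumption; the thread uses Python 2). The function name `decode_latin1_if_clean` is hypothetical, and the test uses the intended range(128, 160) rather than the `127 < ... < 128` typo from the quoted code:

```python
def decode_latin1_if_clean(data):
    """Decode bytes as Latin-1 unless a byte in range(128, 160) is present."""
    for byte in data:            # iterating over bytes yields ints in Python 3
        if 128 <= byte < 160:
            break                # suspicious byte found: the else is skipped
    else:
        # Runs only when the loop finished without hitting break.
        return data.decode("latin-1")
    return None


print(decode_latin1_if_clean(b"caf\xe9"))   # café
print(decode_latin1_if_clean(b"\x93hi"))    # None
```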

    Steve

    [1] Messed up indentation happens in a lot of clients if you have tabs
    in your code. If you can replace tabs with spaces before posting, this
    usually solves the problem.
     
    Steven Bethard, Dec 13, 2004
    #6
  7. Peter Otten

    Peter Otten Guest

    Steven Bethard wrote:

    > - snip -
    > for char in data:
    >     if 127 < ord(char) < 128:
    >         break
    > - snip -


    Even more off-topic:

    >>> for char in data:
    ...     if 127 < ord(char) < 128:
    ...         break
    ...
    >>> print char
    127.5

    :)

    Peter
     
    Peter Otten, Dec 13, 2004
    #7
  8. Peter Otten wrote:
    > - snip -
    > >>> for char in data:
    > ...     if 127 < ord(char) < 128:
    > ...         break
    > ...
    > >>> print char
    > 127.5

    Well yes, that happens when you do a quick hack and don't review it;
    the 128 has to be 160, of course...
     
    Christian Ergh, Dec 13, 2004
    #8
  9. Once more, indentation should be correct now, and the 128 is gone too. So,
    something like this?
    Chris

    import urllib2

    url = 'http://www.someurl.com'
    f = urllib2.urlopen(url)
    data = f.read()
    # if it is not in the pagecode, how do i get the encoding of the page?
    pageencoding = '???'
    xmlencoding = 'whatever i parsed out of the file'
    htmlmetaencoding = 'whatever i parsed out of the metatag'
    f.close()
    # decode() turns the raw bytes into unicode; LookupError covers
    # encoding names that Python does not know
    try:
        data = data.decode(pageencoding)
    except (UnicodeError, LookupError):
        try:
            data = data.decode(xmlencoding)
        except (UnicodeError, LookupError):
            try:
                data = data.decode(htmlmetaencoding)
            except (UnicodeError, LookupError):
                try:
                    data = data.decode('UTF-8')
                except UnicodeError:
                    # latin-1 never fails, so only accept it when there
                    # are no bytes in range(128, 160)
                    flag = True
                    for char in data:
                        if 127 < ord(char) < 160:
                            flag = False
                    if flag:
                        data = data.decode('latin-1')
                    else:
                        try:
                            data = data.decode('cp1252')
                        except UnicodeError:
                            data = data.decode('latin-1')
    data = data.encode("ascii", "xmlcharrefreplace")
     
    Christian Ergh, Dec 13, 2004
    #9
  10. Max M

    Max M Guest

    Christian Ergh wrote:

    A simple way to try out different encodings in a given order:

    # -*- coding: latin-1 -*-

    def get_encoded(st, encodings):
        "Returns an encoding that doesn't fail"
        for encoding in encodings:
            try:
                st_encoded = st.decode(encoding)
                return st_encoded, encoding
            except UnicodeError:
                pass


    st = 'Test characters æøå ÆØÅ'
    encodings = ['utf-8', 'latin-1', 'ascii', ]
    print get_encoded(st, encodings)

    (u'Test characters \xe6\xf8\xe5 \xc6\xd8\xc5', 'latin-1')

    --

    hilsen/regards Max M, Denmark

    http://www.mxm.dk/
    IT's Mad Science
     
    Max M, Dec 13, 2004
    #10
  11. - snip -
    > def get_encoded(st, encodings):
    >     "Returns an encoding that doesn't fail"
    >     for encoding in encodings:
    >         try:
    >             st_encoded = st.decode(encoding)
    >             return st_encoded, encoding
    >         except UnicodeError:
    >             pass

    -snip-
    This works fine, but after this you have three possible encodings (or
    even more; looking at the data on the net you'll see a lot of
    encodings...), and what we need is just one for all.
    Chris
     
    Christian Ergh, Dec 13, 2004
    #11
  12. Dylan wrote:
    > Here's what I'm trying to do:
    >
    > - scrape some html content from various sources
    >
    > The issue I'm running into:
    >
    > - some of the sources have incorrectly encoded characters... for
    > example, cp1252 curly quotes that were likely the result of the author
    > copying and pasting content from Word
    >

    Finally: for me this works. It all lives inside my own class, and the
    module has a logger, so for reuse you would need to adapt this stuff...
    I am updating a PostgreSQL database, in case someone wonders about the
    __setattr__, and my class inherits from SQLObject.

    def doDecode(self, st):
        "Returns an encoding that doesn't fail"
        for encoding in encodings:
            try:
                stEncoded = st.decode(encoding)
                return stEncoded
            except UnicodeError:
                pass

    def setAttribute(self, name, data):
        import HTMLFilter
        data = self.doDecode(data)
        try:
            data = data.encode('ascii', "xmlcharrefreplace")
        except:
            log.warn('new method did not fit')

        try:
            if '&#' in data:
                data = HTMLFilter.HTMLDecode(data)
        except UnicodeDecodeError:
            log.debug('HTML decoding failed!!!')

        try:
            data = data.encode('utf-8')
        except:
            log.warn('new utf 8 method did not fit')

        try:
            self.__setattr__(name, data)
        except:
            log.debug('1. try failed: ')
            log.warning(type(data))
            log.debug(data)
            log.warning('Some unicode error while updating')
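
The round trip in `setAttribute` (character references, then back to text, then UTF-8) can be illustrated with a short Python 3 sketch. `html.unescape` from the standard library is used here as a stand-in for the poster's own `HTMLFilter.HTMLDecode`, whose behavior is not shown in the thread, and the sample string is made up:

```python
import html

text = "Gr\u00f6\u00dfe \u201cquoted\u201d"   # hypothetical decoded input

# Step 1: ASCII bytes with numeric character references.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

# Step 2: turn the references back into characters.
restored = html.unescape(ascii_bytes.decode("ascii"))

# Step 3: store as UTF-8.
utf8_bytes = restored.encode("utf-8")

print(ascii_bytes)                          # b'Gr&#246;&#223;e &#8220;quoted&#8221;'
print(utf8_bytes == text.encode("utf-8"))   # True
```

In this sketch the detour through character references gives the same bytes as encoding the decoded text to UTF-8 directly. Note that `html.unescape` also expands named entities such as `&amp;`, which may or may not be wanted.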
     
    Christian Ergh, Dec 13, 2004
    #12
  13. Forgot a part... You need the encoding list:

    encodings = [
        'utf-8',
        'latin-1',
        'ascii',
        'cp1252',
    ]

     
    Christian Ergh, Dec 13, 2004
    #13
  14. Christian Ergh wrote:
    > Once more, indentation should be correct now, and the 128 is gone too. So,
    > something like this?


    Yes, something like this. The tricky part is, of course, the
    fragments which you didn't implement.

    Also, it might be possible to do this in a for loop, e.g.

    for encoding in (pageencoding, xmlencoding, htmlmetaencoding,
                     "UTF-8", "Latin-1-no-controls", "cp1252", "Latin-1"):
        try:
            data = data.decode(encoding)
            break
        except UnicodeError:
            pass

    You then just need to add the Latin-1-no-controls codec, or you need
    to special-case this in the loop.
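
One way the "Latin-1-no-controls" codec might be added is through `codecs.register`. The following is a sketch in Python 3 (an assumption; the thread targets Python 2.3). The codec does not exist in the standard library, and the helper names `_decode_latin1_no_controls` and `_search` are hypothetical:

```python
import codecs


def _decode_latin1_no_controls(data, errors="strict"):
    # Sketch only: the errors argument is ignored, decoding is always strict.
    raw = bytes(data)
    for pos, byte in enumerate(raw):
        if 128 <= byte < 160:
            raise UnicodeDecodeError(
                "latin-1-no-controls", raw, pos, pos + 1,
                "byte in range(128, 160)")
    return raw.decode("latin-1"), len(raw)


def _search(name):
    # Codec names arrive normalized (lowercase; hyphens may have become
    # underscores, depending on the Python version), so accept both.
    if name.replace("-", "_") != "latin_1_no_controls":
        return None
    return codecs.CodecInfo(
        encode=codecs.lookup("latin-1").encode,
        decode=_decode_latin1_no_controls,
        name="latin-1-no-controls",
    )


codecs.register(_search)

print(b"caf\xe9".decode("latin-1-no-controls"))   # café
```

With the codec registered, bytes in range(128, 160) raise `UnicodeDecodeError`, so a plain loop over encoding names falls through to cp1252 without any special-casing.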

    > # if it is not in the pagecode, how do i get the encoding of the page?
    > pageencoding = '???'


    You need to remember the HTTP connection that you got the HTML file
    from. The webserver may have sent a Content-Type header.
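
As a rough Python 3 sketch (an assumption; in the Python 2 `urllib2` of the thread the headers come from `f.info()`), the charset can be read from the response headers. In Python 3, `urllib.request.urlopen(url).headers` is a subclass of `email.message.Message`, so a plain `Message` serves as an offline stand-in here; `charset_from_headers` is a hypothetical helper name:

```python
from email.message import Message


def charset_from_headers(headers):
    """Return the charset parameter of Content-Type, or None if absent."""
    return headers.get_content_charset()


# Offline stand-in for the headers of a real HTTP response.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

print(charset_from_headers(headers))   # iso-8859-1
```

A result of `None` means the server sent no charset, and the next steps of the strategy apply.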

    > xmlencoding = 'whatever i parsed out of the file'
    > htmlmetaencoding = 'whatever i parsed out of the metatag'


    Depending on the library you use, these aren't that trivial, either.
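
Without a parsing library, a crude approximation is to search the first bytes of the page with regular expressions. This Python 3 sketch is an assumption, not something from the thread: the patterns and the name `sniff_declared_encodings` are hypothetical, and real-world markup will defeat them in places (comments, unusual quoting, declarations past the size limit):

```python
import re

# Encoding name inside an XML declaration at the very start of the data.
XML_DECL_RE = re.compile(
    rb'^\s*<\?xml[^>]*encoding=["\']([A-Za-z0-9._:-]+)["\']')

# charset=... inside a meta element (http-equiv style or <meta charset=...>).
META_RE = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)


def sniff_declared_encodings(data, limit=2048):
    """Return (xml_encoding, meta_encoding); either may be None."""
    head = data[:limit]
    xml = XML_DECL_RE.search(head)
    meta = META_RE.search(head)
    return (
        xml.group(1).decode("ascii") if xml else None,
        meta.group(1).decode("ascii") if meta else None,
    )
```

The two results correspond to steps 2 and 3 of the strategy and would be tried in that order.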

    Regards,
    Martin
     
    Martin v. Löwis, Dec 13, 2004
    #14
  15. Max M wrote:
    > A smiple way to try out different encodings in a given order:


    The loop is fine - although ('UTF-8', 'Latin-1', 'ASCII') is
    somewhat redundant. The 'ASCII' case is never considered, since
    Latin-1 effectively works as a catch-all encoding (as all byte
    sequences can be considered Latin-1 - whether they are meaningful
    data is a different question).
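
The catch-all behavior is easy to check. In this Python 3 sketch (an assumption; the thread uses Python 2), `first_success` is a hypothetical helper that reports which encoding in a list decodes the data first:

```python
def first_success(data, encodings):
    """Return the first encoding name that decodes data, or None."""
    for enc in encodings:
        try:
            data.decode(enc)
            return enc
        except UnicodeError:
            pass
    return None


all_bytes = bytes(range(256))   # every possible byte value once

# Latin-1 accepts everything, so anything listed after it is never reached.
print(first_success(all_bytes, ["utf-8", "latin-1", "ascii"]))   # latin-1

# Stricter encodings only get a chance when they come before Latin-1.
print(first_success(b"\x93hi\x94", ["utf-8", "cp1252", "latin-1"]))   # cp1252
```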

    Regards,
    Martin
     
    Martin v. Löwis, Dec 13, 2004
    #15