Ascii Encoding Error with UTF-8 encoder

Discussion in 'Python' started by Mike Currie, Jun 27, 2006.

  1. Mike Currie

    Mike Currie Guest

    Can anyone explain why I'm getting an ascii encoding error when I'm trying
    to write out using a UTF-8 encoder?

    Thanks

    Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> filterMap = {}
    >>> for i in range(0,255):
    ...     filterMap[chr(i)] = chr(i)
    ...
    >>> filterMap[chr(9)] = chr(136)
    >>> filterMap[chr(10)] = chr(133)
    >>> filterMap[chr(136)] = chr(9)
    >>> filterMap[chr(133)] = chr(10)
    >>> line = '''this has
    ... tabs and line
    ... breaks'''
    >>> filteredLine = ''.join([ filterMap[a] for a in line])
    >>> import codecs
    >>> f = codecs.open('foo.txt', 'wU', 'utf-8')
    >>> print filteredLine
    thisêhasêàtabsêandêlineàbreaks
    >>> f.write(filteredLine)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "C:\Python24\lib\codecs.py", line 501, in write
        return self.writer.write(data)
      File "C:\Python24\lib\codecs.py", line 178, in write
        data, consumed = self.encode(object, self.errors)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
     
    Mike Currie, Jun 27, 2006
    #1

  2. Robert Kern

    Robert Kern Guest

    Mike Currie wrote:
    > Can anyone explain why I'm getting an ascii encoding error when I'm trying
    > to write out using a UTF-8 encoder?


    Please read the Python Unicode HOWTO.

    http://www.amk.ca/python/howto/unicode

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
     
    Robert Kern, Jun 27, 2006
    #2

  3. John Machin

    John Machin Guest

    On 28/06/2006 7:46 AM, Mike Currie wrote:
    > Can anyone explain why I'm getting an ascii encoding error when I'm trying
    > to write out using a UTF-8 encoder?
    >
    > >>> f = codecs.open('foo.txt', 'wU', 'utf-8')
    > >>> print filteredLine
    > thisêhasêàtabsêandêlineàbreaks
    > >>> f.write(filteredLine)
    > Traceback (most recent call last):
    >   File "<stdin>", line 1, in ?
    >   File "C:\Python24\lib\codecs.py", line 501, in write
    >     return self.writer.write(data)
    >   File "C:\Python24\lib\codecs.py", line 178, in write
    >     data, consumed = self.encode(object, self.errors)
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
    >


    Your fundamental problem is that you are trying to encode an 8-bit
    string as UTF-8. Only Unicode can be encoded, so the codec first tries
    to convert your string to Unicode using the default encoding (ascii),
    which fails on the byte 0x88.

    Get this into your head:
    You encode Unicode as ascii, latin1, cp1252, utf8, glagolitic, whatever
    into an 8-bit string.
    You decode an 8-bit string from whatever into Unicode.
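
As a minimal sketch of that rule, here is the same direction of travel in Python 3. That choice is an assumption on the editor's part: the thread uses Python 2.4, where `str` was the 8-bit type and `unicode` the text type. In Python 3 the two types are `bytes` and `str`, and the implicit ASCII step no longer exists, so the same mistake shows up as a missing method rather than a confusing decode error. The sample text is made up for illustration.

```python
# Python 3 sketch: text (str) is encoded to bytes; bytes are decoded to text.
text = "caf\u00e9"                      # Unicode text, illustrative only

utf8_bytes = text.encode("utf-8")       # str -> bytes
latin1_bytes = text.encode("latin-1")   # same text, different byte form

assert utf8_bytes == b"caf\xc3\xa9"
assert latin1_bytes == b"caf\xe9"

# Decoding reverses the step, provided you name the encoding that was used.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text

# bytes has no encode() method in Python 3, so "encoding an 8-bit string"
# is rejected outright instead of triggering a hidden ASCII decode.
bytes_can_encode = hasattr(b"\x88", "encode")
print(bytes_can_encode)
```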

    Here is a run-down on your problem, using just the encode/decode methods
    instead of codecs for illustration purposes:

    (1) Equivalent to what you did.
    |>> '\x88'.encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0: ordinal not in range(128)

    (2) Same thing, explicitly trying to decode your 8-bit string as ASCII.
    |>> '\x88'.decode('ascii').encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0: ordinal not in range(128)

    (3) Encoding Unicode as UTF-8 works, as expected.
    |>> u'\x88'.encode('utf-8')
    '\xc2\x88'

    (4) But you need to know what your 8-bit data is supposed to be encoded
    in, before you start.
    |>> '\x88'.decode('cp1252').encode('utf-8')
    '\xcb\x86'
    |>> '\x88'.decode('latin1').encode('utf-8')
    '\xc2\x88'
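
Point (4) can be sketched in Python 3 (again an assumption; the session above is Python 2.4). The same byte decodes to two different code points depending on which source encoding you claim, so the UTF-8 output differs too.

```python
# Python 3 sketch: one byte, two candidate source encodings, two results.
raw = b"\x88"

as_cp1252 = raw.decode("cp1252")     # U+02C6, a circumflex modifier letter
as_latin1 = raw.decode("latin-1")    # U+0088, a C1 control character

print(hex(ord(as_cp1252)), as_cp1252.encode("utf-8"))
print(hex(ord(as_latin1)), as_latin1.encode("utf-8"))

# Decoding as ASCII fails; this is the step Python 2 performed implicitly
# in the original traceback.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

Nothing in the byte itself says which reading is right; that knowledge has to come from wherever the data originated.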

    I am rather puzzled as to what you are trying to achieve. You appear to
    believe that you possess one or more 8-bit strings, encoded in latin1,
    which contain the C0 controls \x09 (HT) and \x0a (LF) AND the
    corresponding C1 controls \x88 (HTS) and \x85 (NEL). You want to change
    LF to NEL, and NEL to LF and similarly with the other pair. Then you
    want to write the result, encoded in UTF-8, to a file. The purpose
    behind that baroque/byzantine capering would be .... what?
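
For reference, the swap described above might look like this in Python 3. This is a hedged sketch, not the original poster's code: `swap_controls` is a hypothetical helper name, the sample line is a guess at the original input (with tabs reconstructed from the printed output), and the file is written to a temporary directory rather than `foo.txt` in the working directory.

```python
import os
import tempfile

# Swap HT <-> U+0088 and LF <-> U+0085. translate() is a single pass, so
# each character is mapped exactly once and the pairs really are exchanged.
SWAP = str.maketrans({
    "\t": "\x88", "\x88": "\t",
    "\n": "\x85", "\x85": "\n",
})

def swap_controls(text):
    """Hypothetical helper: applying it twice returns the original text."""
    return text.translate(SWAP)

line = "this\thas\t\ntabs\tand\tline\nbreaks"   # assumed sample input
filtered = swap_controls(line)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.txt")
    # Text in, UTF-8 bytes out; newline="" disables line-ending translation.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(filtered)
    with open(path, "rb") as f:
        raw = f.read()
```

One caveat when reading such a file back: `str.splitlines()` treats U+0085 as a line boundary, so splitting rows with `text.split("\n")` is the safer choice here.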
     
    John Machin, Jun 28, 2006
    #3
  4. Mike Currie

    Mike Currie Guest

    Thanks for the thorough explanation.

    What I am doing is converting data for processing that will be
    tab-delimited (for columns) and newline-delimited (for rows). Some of
    the data contains tabs and newlines, so I have to convert them to
    something else to keep the file intact.

    Not my idea, I've been left with the implementation however.
     
    Mike Currie, Jun 28, 2006
    #4
  5. John Machin

    John Machin Guest

    On 28/06/2006 9:44 AM, Mike Currie wrote:
    >
    > What I am doing is converting data for processing that will be tab (for
    > columns) and newline (for row) delimited. Some of the data contains tabs
    > and newlines so, I have to convert them to something else so the file
    > integrity is good.
    >
    > Not my idea, I've been left with the implementation however.
    >


    Do you *need* UTF-8? Or is that only there to hide away the \x88 and
    \x85? Apart from tab and linefeed, what (if any) other characters are
    there in the data that are not printable ASCII characters?

    In any case, if you have 8-bit string data, the CSV file format would
    appear to meet the requirement: it preserves your data by "quoting"
    delimiters and newlines that appear in the actual data. The Python csv
    module is included in every Python distribution since 2.3.
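
A minimal sketch of the csv approach, written for Python 3 (an assumption; the thread predates it) and using made-up rows. It writes tab-delimited output to an in-memory buffer and reads it back, showing that embedded tabs and newlines survive without any character substitution.

```python
import csv
import io

# Illustrative data: one field contains a tab, another a newline.
rows = [
    ["id", "comment"],
    ["1", "has a\ttab"],
    ["2", "spans\ntwo lines"],
]

# With the default minimal quoting, the writer wraps any field containing
# the delimiter or a line break in double quotes.
buf = io.StringIO(newline="")
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
encoded = buf.getvalue()

# Reading it back restores the embedded tab and newline unchanged.
reader = csv.reader(io.StringIO(encoded, newline=""), delimiter="\t")
restored = list(reader)
```

The trade-off is that the consumer must also parse the file as CSV; a naive split on tabs and newlines would still break on the quoted fields.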

    Cheers,
    John
     
    John Machin, Jun 28, 2006
    #5
  6. Serge Orlov

    Serge Orlov Guest

    On 6/27/06, Mike Currie wrote:
    > Thanks for the thorough explanation.
    >
    > What I am doing is converting data for processing that will be tab (for
    > columns) and newline (for row) delimited. Some of the data contains tabs
    > and newlines so, I have to convert them to something else so the file
    > integrity is good.


    Usually this is done by escaping: translate tab -> \t, newline -> \n,
    backslash -> \\.
    Python strings already have a method that does it in one line:
    >>> s=chr(9)+chr(10)+chr(92)
    >>> print s.encode("string_escape")
    \t\n\\

    When you're ready to convert it back, call decode("string_escape").
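
The `string_escape` codec exists only in Python 2. As a hedged sketch for Python 3, the closest built-in is `unicode_escape`, which works on `str` and produces ASCII bytes; this is a substitute technique, not the one shown in the session above.

```python
# Python 3 sketch: "unicode_escape" stands in for Python 2's "string_escape".
s = chr(9) + chr(10) + chr(92)      # tab, newline, backslash

# Encode to escaped ASCII bytes, then view them as text for writing out.
escaped = s.encode("unicode_escape").decode("ascii")
print(escaped)

# Reverse the step: back to bytes, then decode the escape sequences.
restored = escaped.encode("ascii").decode("unicode_escape")
```

Note that `unicode_escape` also escapes every non-ASCII character (for example, é becomes `\xe9`), so the escaped form is pure ASCII.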


    > Not my idea, I've been left with the implementation however.


    The idea is actually not bad as long as you know how to cope with unicode.
     
    Serge Orlov, Jun 28, 2006
    #6