Python UTF-8 and codecs

Discussion in 'Python' started by Mike Currie, Jun 27, 2006.

  1. Mike Currie

    Mike Currie Guest

    I'm trying to write out files that have utf-8 characters 0x85 and 0x88 in
    them. Every configuration I try I get a UnicodeError: ascii codec can't
    decode byte 0x85 in position 255: ordinal not in range(128)

    I've tried using the codecs.open('foo.txt', 'rU', 'utf-8', errors='strict')
    and that doesn't work and I've also tried wrapping the file in a utf8_writer
    using codecs.lookup('utf8')

    Any clues?

    Thanks
    Mike
    Mike Currie, Jun 27, 2006
    #1

  2. Dennis Benzinger

    Mike Currie wrote:
    > I'm trying to write out files that have utf-8 characters 0x85 and 0x88 in
    > them. Every configuration I try I get a UnicodeError: ascii codec can't
    > decode byte 0x85 in position 255: ordinal not in range(128)
    >
    > I've tried using the codecs.open('foo.txt', 'rU', 'utf-8', errors='strict')
    > and that doesn't work
    > [...]


    You want to write to a file, but you used the 'rU' mode. This should be
    'wU'. I don't know whether this is the only reason it doesn't work. Could
    you show more of your code?


    Bye,
    Dennis
    Dennis Benzinger, Jun 27, 2006
    #2

  3. Serge Orlov

    Serge Orlov Guest

    On 6/27/06, Mike Currie wrote:
    > I'm trying to write out files that have utf-8 characters 0x85 and 0x88 in
    > them. Every configuration I try I get a UnicodeError: ascii codec can't
    > decode byte 0x85 in position 255: ordinal not in range(128)
    >
    > I've tried using the codecs.open('foo.txt', 'rU', 'utf-8', errors='strict')
    > and that doesn't work and I've also tried wrapping the file in a utf8_writer
    > using codecs.lookup('utf8')
    >
    > Any clues?


    Use unicode strings for non-ascii characters. The following program "works":

    import codecs

    c1 = unichr(0x85)
    f = codecs.open('foo.txt', 'wU', 'utf-8')
    f.write(c1)
    f.close()

    But unichr(0x85) is a control character; are you sure you want it?
    What is the encoding of your data?
    Serge Orlov, Jun 27, 2006
    #3
  4. Mike Currie

    Mike Currie Guest

    I did make a mistake; it should have been 'wU'.

    The starting data is ASCII.

    What I'm doing is data processing on files with newline and tab characters
    inside quoted fields. The idea is to convert all the newline and tab
    characters to 0x85 and 0x88 respectively, then process the files. Finally,
    right before importing them into a database, convert them back to newlines
    and tabs, thus preserving the field values.

    Will python not handle the control characters correctly?


    Mike Currie, Jun 27, 2006
    #4
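
The substitution scheme described in post #4 can be sketched roughly as follows. This is only an illustration under stated assumptions: it targets Python 3 (the thread itself uses Python 2.4), and the helper names `protect` and `restore` are hypothetical, not taken from the thread. The placeholder characters U+0085 and U+0088 are the ones named in the post.

```python
# Python 3 sketch; the thread itself uses Python 2.4.
# Placeholders from the post: U+0085 stands in for newline, U+0088 for tab.
ENCODE = str.maketrans({"\n": "\x85", "\t": "\x88"})
DECODE = str.maketrans({"\x85": "\n", "\x88": "\t"})


def protect(text):
    """Replace newlines and tabs with the placeholder characters (hypothetical helper)."""
    return text.translate(ENCODE)


def restore(text):
    """Turn the placeholders back into newlines and tabs (hypothetical helper)."""
    return text.translate(DECODE)


field = "this\thas\ntabs"
protected = protect(field)
round_tripped = restore(protected)
```

Working on text strings rather than byte strings keeps the placeholders as single characters until the data is encoded for output.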
  5. Mike Currie

    Mike Currie Guest

    Okay,

    Here is a sample of what I'm doing:


    Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> filterMap = {}
    >>> for i in range(0,255):
    ...     filterMap[chr(i)] = chr(i)
    ...
    >>> filterMap[chr(9)] = chr(136)
    >>> filterMap[chr(10)] = chr(133)
    >>> filterMap[chr(136)] = chr(9)
    >>> filterMap[chr(133)] = chr(10)
    >>> line = '''this has
    ... tabs and line
    ... breaks'''
    >>> filteredLine = ''.join([ filterMap[a] for a in line])
    >>> import codecs
    >>> f = codecs.open('foo.txt', 'wU', 'utf-8')
    >>> print filteredLine
    thisêhasêàtabsêandêlineàbreaks
    >>> f.write(filteredLine)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "C:\Python24\lib\codecs.py", line 501, in write
        return self.writer.write(data)
      File "C:\Python24\lib\codecs.py", line 178, in write
        data, consumed = self.encode(object, self.errors)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
    >>>


    Mike Currie, Jun 27, 2006
    #5
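
The traceback in post #5 is consistent with a byte string being handed to a writer that expects text: Python 2 first decodes such a string with the ASCII codec, and byte 0x88 is outside that range. The sketch below reproduces that decode step explicitly. It assumes Python 3 (not the Python 2.4 used in the thread), where the conversion must be written out by hand, and the sample bytes are made up to match the reported position.

```python
# Python 3 sketch of the decode step that Python 2 performed implicitly.
raw = b"this\x88has"            # byte string with 0x88 at index 4

try:
    raw.decode("ascii")         # what the implicit conversion attempted
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start       # index of the offending byte

text = "this\x88has"            # the same content as a text string
encoded = text.encode("utf-8")  # U+0088 becomes the two bytes C2 88
```

Starting from a text string avoids the ASCII decode entirely, which is the point of the advice to use unicode strings for non-ASCII characters.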
  6. Serge Orlov

    Serge Orlov Guest

    On 6/27/06, Mike Currie wrote:
    > Okay,
    >
    > Here is a sample of what I'm doing:
    >
    > Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> filterMap = {}
    > >>> for i in range(0,255):
    > ...     filterMap[chr(i)] = chr(i)
    > ...
    > >>> filterMap[chr(9)] = chr(136)
    > >>> filterMap[chr(10)] = chr(133)
    > >>> filterMap[chr(136)] = chr(9)
    > >>> filterMap[chr(133)] = chr(10)


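    This part is incorrect; it should be:

    filterMap = {}
    # Only bytes 0-127 are valid ASCII, so map just those to themselves.
    for i in range(0,128):
        filterMap[chr(i)] = chr(i)

    # Use unicode objects for the two non-ASCII placeholder characters.
    filterMap[chr(9)] = unichr(136)
    filterMap[chr(10)] = unichr(133)
    filterMap[unichr(136)] = chr(9)
    filterMap[unichr(133)] = chr(10)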
    Serge Orlov, Jun 27, 2006
    #6
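
Once the placeholders are text rather than bytes, as in the corrected map in post #6, writing them out as UTF-8 should succeed. The following is a minimal sketch of that write and a read-back check, assuming Python 3 and a throwaway temporary file; the sample string and the file handling are illustrative, not from the thread.

```python
import os
import tempfile

text = "this\x88has\x85tabs"    # placeholders already substituted

# Hypothetical scratch file; any writable path would do.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # newline="" turns off newline translation on both write and read.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    with open(path, "rb") as f:
        raw = f.read()          # the bytes actually stored on disk
    with open(path, "r", encoding="utf-8", newline="") as f:
        round_tripped = f.read()
finally:
    os.remove(path)
```

Each placeholder occupies two bytes on disk (C2 88 and C2 85), so any tool that later processes the file needs to read it as UTF-8 rather than as single-byte data.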
  7. Mike Currie

    Mike Currie Guest

    Well, not really. It doesn't affect the result. I still get the error
    message. Did you get a different result?


    Mike Currie, Jun 28, 2006
    #7
  8. Serge Orlov

    Serge Orlov Guest

    On 6/27/06, Mike Currie wrote:
    > Well, not really. It doesn't affect the result. I still get the error
    > message. Did you get a different result?


    Yes, the program successfully wrote the text file. Without magic abilities
    to read your screen, my guess is that you now get an exception in the
    print statement. That happens because you use the legacy Windows console (I
    use the unicode-capable console of Lightning Compiler
    <http://www.python.org/pypi/Lightning%20Compiler> to run snippets of
    code). You can change the console, comment out the print statement, or
    change your program to print the unicode representation: print
    repr(filteredLine)
    Serge Orlov, Jun 28, 2006
    #8
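
The suggestion in post #8 to print an escaped representation instead of the raw string can be sketched like this. The example assumes Python 3, where the built-in `ascii()` plays roughly the role that `repr()` of a unicode string played in Python 2; the sample string is illustrative.

```python
filtered = "this\x88has\x85tabs"

# ascii() escapes every non-ASCII character, so the result is safe to print
# on a console whose encoding cannot represent U+0085 or U+0088.
printable = ascii(filtered)
print(printable)
```

The escaped form is only for inspection; the file should still be written from the original text string.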