How do I encode and decode this data to write to a file?

Discussion in 'Python' started by cl@isbd.net, Apr 29, 2013.

  1. Guest

    I am debugging some code that creates a static HTML gallery from a
    directory hierarchy full of images. It's this package:-
    https://pypi.python.org/pypi/Gallery2.py/2.0


    It's basically working and does pretty much what I want so I'm happy to
    put some effort into it and fix things.

    The problem I'm currently chasing is that it can't cope with directory
    names that have accented characters in them, it fails when it tries to
    write the HTML that creates the page with the thumbnails on.

    The code that's failing is:-

    raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
    file = open(raw, "w")
    file.write("".join(html).encode('utf-8'))
    file.close()

    The variable html is a list containing the lines of HTML to write to the
    file. It fails when it contains accented characters (an é in this
    case). Here's the traceback:-

    Traceback (most recent call last):
    File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 41, in run self._recurse()
    File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 272, in _recurse os.path.walk(self.props["sourcedir"], self.processDir, None)
    File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg)
    File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 238, in walk func(arg, top, names)
    File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 263, in processDir self.createGallery()
    File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 215, in createGallery self.picturemanager.createPictureHTMLs(self.footer)
    File "/usr/local/lib/python2.7/dist-packages/gallery/picturemanager.py", line 84, in createPictureHTMLs curPic.createPictureHTML(self.galleryDirectory, self.getStylesheet(), self.fullsize, footer)
    File "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361, in createPictureHTML file.write("".join(html).encode('utf-8')) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 783: ordinal not in range(128)



    If I understand correctly the encode() is saying that it can't
    understand the data in the html because there's a character 0xc3 in it.
    I *think* this means that the é is encoded in UTF-8 already in the
    incoming data stream (should be as my system is wholly UTF-8 as far as I
    know and I created the directory name).

    So how do I change the code so I don't get the error? Do I just
    decode() the data first and then encode() it?

    --
    Chris Green
    , Apr 29, 2013
    #1

  2. Andrew Berg Guest

    On 2013.04.29 04:47, wrote:
    > If I understand correctly the encode() is saying that it can't
    > understand the data in the html because there's a character 0xc3 in it.
    > I *think* this means that the é is encoded in UTF-8 already in the
    > incoming data stream (should be as my system is wholly UTF-8 as far as I
    > know and I created the directory name).

    You can verify that your filesystem is set to use UTF-8 with sys.getfilesystemencoding(). If it returns 'ascii', then your locale settings
    are incorrect.

    --
    CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
    Andrew Berg, Apr 29, 2013
    #2

  3. Peter Otten Guest

    wrote:

    > I am debugging some code that creates a static HTML gallery from a
    > directory hierarchy full of images. It's this package:-
    > https://pypi.python.org/pypi/Gallery2.py/2.0
    >
    >
    > It's basically working and does pretty much what I want so I'm happy to
    > put some effort into it and fix things.
    >
    > The problem I'm currently chasing is that it can't cope with directory
    > names that have accented characters in them, it fails when it tries to
    > write the HTML that creates the page with the thumbnails on.
    >
    > The code that's failing is:-
    >
    > raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
    > file = open(raw, "w")
    > file.write("".join(html).encode('utf-8'))
    > file.close()
    >
    > The variable html is a list containing the lines of HTML to write to the
    > file. It fails when it contains accented characters (an é in this
    > case). Here's the traceback:-
    >
    > Traceback (most recent call last):
    > File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line
    > 41, in run self._recurse() File
    > "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 272,
    > in _recurse os.path.walk(self.props["sourcedir"], self.processDir, None)
    > File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name,
    > func, arg) File "/usr/lib/python2.7/posixpath.py", line 246, in walk
    > walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 246,
    > in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py",
    > line 238, in walk func(arg, top, names) File
    > "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 263,
    > in processDir self.createGallery() File
    > "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 215,
    > in createGallery self.picturemanager.createPictureHTMLs(self.footer)
    > File "/usr/local/lib/python2.7/dist-packages/gallery/picturemanager.py",
    > line 84, in createPictureHTMLs
    > curPic.createPictureHTML(self.galleryDirectory, self.getStylesheet(),
    > self.fullsize, footer) File
    > "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361,
    > in createPictureHTML file.write("".join(html).encode('utf-8'))
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
    > 783: ordinal not in range(128)
    >
    >
    >
    > If I understand correctly the encode() is saying that it can't
    > understand the data in the html because there's a character 0xc3 in it.
    > I *think* this means that the é is encoded in UTF-8 already in the
    > incoming data stream (should be as my system is wholly UTF-8 as far as I
    > know and I created the directory name).
    >
    > So how do I change the code so I don't get the error? Do I just
    > decode() the data first and then encode() it?
    >


    Note that you are getting a *UnicodeDecodeError*, not a UnicodeEncodeError.
    Try omitting the encode() step, i.e. instead of

    > file.write("".join(html).encode('utf-8'))


    use

    file.write("".join(html))

    Background (applies to Python 2 only): the str type deals with bytes, not
    code points. The right thing to do is to use .decode(...) to convert from
    str to unicode and .encode(...) to convert from unicode to str. In Python 2
    however the str type has an encode(...) method which is basically equivalent
    to

    class str:
        # imaginary python implementation of python2's str
        ...
        def encode(self, encoding):
            return self.decode("ascii").encode(encoding)

    and is almost never called intentionally.

    PS Python3 has relabeled unicode to str and thus uses unicode by default.
    str was renamed to bytes and the annoying bytes.encode() method is gone.
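
    The implicit step described above can be reproduced on Python 3 with a small emulation. This is only a sketch: py2_style_encode is a hypothetical helper written for illustration, not part of Python or of the gallery package, and it assumes the input is a byte string as it would be in Python 2.

```python
# Sketch only: emulates, on Python 3, what Python 2's str.encode() did.
# py2_style_encode is a hypothetical name, not a real library function.
def py2_style_encode(data: bytes, encoding: str) -> bytes:
    """Implicitly decode as ASCII first, then encode to the target codec."""
    return data.decode("ascii").encode(encoding)


utf8_bytes = "é".encode("utf-8")  # two bytes, the first one is 0xc3

try:
    py2_style_encode(utf8_bytes, "utf-8")
    error = None
except UnicodeDecodeError as exc:
    # The failure comes from the hidden decode("ascii"), not from encode().
    error = exc

print(type(error).__name__)
print(error)
```

    Pure ASCII input passes through unchanged, which is presumably why the original code only failed once an accented directory name appeared.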
    Peter Otten, Apr 29, 2013
    #3
  4. Dave Angel Guest

    On 04/29/2013 05:47 AM, wrote:

    A couple of generic comments: your email program made a mess of the
    traceback by appending each source line to the location information.

    Please mention your Python version & OS. Apparently you're running 2.7
    on Linux or similar.

    > I am debugging some code that creates a static HTML gallery from a
    > directory hierarchy full of images. It's this package:-
    > https://pypi.python.org/pypi/Gallery2.py/2.0
    >
    >
    > It's basically working and does pretty much what I want so I'm happy to
    > put some effort into it and fix things.
    >
    > The problem I'm currently chasing is that it can't cope with directory
    > names that have accented characters in them, it fails when it tries to
    > write the HTML that creates the page with the thumbnails on.
    >
    > The code that's failing is:-
    >
    > raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
    > file = open(raw, "w")
    > file.write("".join(html).encode('utf-8'))


    You can't encode byte data, it's already encoded. So you're forcing the
    Python system to implicitly decode it (using ASCII codec) before letting
    you encode it to utf-8. If you think it's already in utf-8, then omit
    the encode() call there.

    Additionally, you can debug things with some simple print statements, at
    least if you decompose your 3-function line so you can get at the
    intermediate data. Split the line into three parts:

    temp1 = "".join(html)          # temp1 is byte data
    temp2 = temp1.decode()         # temp2 is unicode data (default codec is ASCII)
    temp3 = temp2.encode("utf-8")  # temp3 is byte data again
    file.write(temp3)

    Now, you'll presumably get the error on the second line, so examine the
    bytes around byte 783. Make sure it's really in utf-8, and if it is,
    then skip the decode and the encode. If it's not, then Andrew's advice
    is pertinent.
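
    One way to examine the bytes around a reported position and check whether the data is valid UTF-8 is sketched below. It is written for Python 3, and bytes_around and is_valid_utf8 are hypothetical helper names; the sample data stands in for the real page contents, which are not shown in this thread.

```python
def bytes_around(data: bytes, position: int, width: int = 10) -> bytes:
    """Return the slice of data within `width` bytes of `position`."""
    start = max(0, position - width)
    return data[start:position + width]


def is_valid_utf8(data: bytes) -> bool:
    """Report whether data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Stand-in data; in the real case this would be "".join(html).
sample = "<title>Café</title>".encode("utf-8")
offset = sample.index(b"\xc3")  # position of the first non-ASCII byte

print(repr(bytes_around(sample, offset, 4)))
print(is_valid_utf8(sample))
print(is_valid_utf8("Café".encode("latin-1")))
```

    If the check reports valid UTF-8, the data can be written out as-is; if not, the bytes came from somewhere using a different encoding.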

    I would also look at the variable html. It's a list, but what are the
    types of the elements in it?

    > file.close()
    >
    > The variable html is a list containing the lines of HTML to write to the
    > file. It fails when it contains accented characters (an é in this
    > case). Here's the traceback:-
    >
    > Traceback (most recent call last):
    > File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 41, in run self._recurse()
    > File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 272, in _recurse os.path.walk(self.props["sourcedir"], self.processDir, None)
    > File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg)
    > File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 238, in walk func(arg, top, names)
    > File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 263, in processDir self.createGallery()
    > File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 215, in createGallery self.picturemanager.createPictureHTMLs(self.footer)
    > File "/usr/local/lib/python2.7/dist-packages/gallery/picturemanager.py", line 84, in createPictureHTMLs curPic.createPictureHTML(self.galleryDirectory, self.getStylesheet(), self.fullsize, footer)
    > File "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361, in createPictureHTML file.write("".join(html).encode('utf-8')) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 783: ordinal not in range(128)
    >
    >
    >
    > If I understand correctly the encode() is saying that it can't
    > understand the data in the html because there's a character 0xc3 in it.
    > I *think* this means that the é is encoded in UTF-8 already in the
    > incoming data stream (should be as my system is wholly UTF-8 as far as I
    > know and I created the directory name).
    >
    > So how do I change the code so I don't get the error? Do I just
    > decode() the data first and then encode() it?
    >



    --
    DaveA
    Dave Angel, Apr 29, 2013
    #4
  5. Guest

    Andrew Berg <> wrote:
    > On 2013.04.29 04:47, wrote:
    > > If I understand correctly the encode() is saying that it can't
    > > understand the data in the html because there's a character 0xc3 in it.
    > > I *think* this means that the é is encoded in UTF-8 already in the
    > > incoming data stream (should be as my system is wholly UTF-8 as far as I
    > > know and I created the directory name).

    > You can verify that your filesystem is set to use UTF-8 with sys.getfilesystemencoding().
    > If it returns 'ascii', then your locale settings
    > are incorrect.
    >


    chris$ python
    Python 2.7.3 (default, Sep 26 2012, 21:51:14)
    [GCC 4.7.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.getfilesystemencoding()

    'UTF-8'
    >>>


    So I am set up right for UTF-8.

    --
    Chris Green
    , Apr 29, 2013
    #5
  6. Guest

    Dave Angel <> wrote:
    > On 04/29/2013 05:47 AM, wrote:
    >
    > A couple of generic comments: your email program made a mess of the
    > traceback by appending each source line to the location information.
    >

    What's me email program got to do with it? :) I'm using a dedicated
    newsreader (tin) as I posted via the gmane/usenet interface. The posting
    looks perfectly OK to me when I read it back from usenet.


    > Please mention your Python version & OS. Apparently you're running 2.7
    > on Linux or similar.
    >

    Sorry, yes you're spot on.


    > > I am debugging some code that creates a static HTML gallery from a
    > > directory hierarchy full of images. It's this package:-
    > > https://pypi.python.org/pypi/Gallery2.py/2.0
    > >
    > >
    > > It's basically working and does pretty much what I want so I'm happy to
    > > put some effort into it and fix things.
    > >
    > > The problem I'm currently chasing is that it can't cope with directory
    > > names that have accented characters in them, it fails when it tries to
    > > write the HTML that creates the page with the thumbnails on.
    > >
    > > The code that's failing is:-
    > >
    > > raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
    > > file = open(raw, "w")
    > > file.write("".join(html).encode('utf-8'))

    >
    > You can't encode byte data, it's already encoded. So you're forcing the
    > Python system to implicitly decode it (using ASCII codec) before letting
    > you encode it to utf-8. If you think it's already in utf-8, then omit
    > the encode() call there.
    >

    It's the way the code was as I installed it from pypi. What you say
    makes a lot of sense though, I'll remove the encode().


    > Additionally, you can debug things with some simple print statements, at
    > least if you decompose your 3-function line so you can get at the
    > intermediate data. Split the line into three parts;
    > temp1 = "".join(html) #temp1 is byte data
    > temp2 = temp1.decode() #temp2 is unicode data
    > temp3 = temp2.encode("utf-8") #temp3 is byte data again
    > file.write(temp3)
    >

    OK, thanks for this and all the other advice on this thread.

    --
    Chris Green
    , Apr 29, 2013
    #6
  7. Robert Kern Guest

    On 2013-04-29 13:59, wrote:
    > Dave Angel <> wrote:
    >> On 04/29/2013 05:47 AM, wrote:
    >>
    >> A couple of generic comments: your email program made a mess of the
    >> traceback by appending each source line to the location information.
    >>

    > What's me email program got to do with it? :) I'm using a dedicated
    > newsreader (tin) as I posted via the gmane/usenet interface. The posting
    > looks perfectly OK to me when I read it back from usenet.


    FWIW, I see the same problem Dave sees. I'm using gmane via Thunderbird.

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
    Robert Kern, Apr 29, 2013
    #7
  8. Guest

    Robert Kern <> wrote:
    > On 2013-04-29 13:59, wrote:
    > > Dave Angel <> wrote:
    > >> On 04/29/2013 05:47 AM, wrote:
    > >>
    > >> A couple of generic comments: your email program made a mess of the
    > >> traceback by appending each source line to the location information.
    > >>

    > > What's me email program got to do with it? :) I'm using a dedicated
    > > newsreader (tin) as I posted via the gmane/usenet interface. The posting
    > > looks perfectly OK to me when I read it back from usenet.

    >
    > FWIW, I see the same problem Dave sees. I'm using gmane via Thunderbird.
    >

    How strange. I think it must be something to do with the gmane
    interface between news and mail then.

    --
    Chris Green
    , Apr 29, 2013
    #8
  9. > How strange. I think it must be something to do with the gmane
    > interface between news and mail then.


    Probably. It was borked in Gmail as well...

    Skip
    Skip Montanaro, Apr 29, 2013
    #9
  10. On 4/29/2013 5:47 AM, wrote:

    > case). Here's the traceback:-
    >


    > File "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361,
    > in createPictureHTML file.write("".join(html).encode('utf-8'))
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
    > 783: ordinal not in range(128)

    Generic advice for anyone getting unicode errors:
    unpack the composition producing the error
    so that one can see which operation produced it.

    In this case
    s = "".join(html)
    s = s.encode('utf-8')
    file.write(s)

    This also makes it possible to print intermediate results.
    print(type(s), s) # would have been useful
    Doing so would have immediately shown that in this case the error was
    the encode operation, because s was already bytes.
    For many other posts, the error with the same type of message has been
    the print or write operation, due to output encoding issues, but that was
    not the case here.
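
    The same unpacking idea can be taken a step further: inspect the type of every piece, convert everything to text, and let the file object do the single encode. The following is a sketch for Python 3 under stated assumptions: to_text and write_html are hypothetical helpers, the list contents are made up, and any byte strings are assumed to be UTF-8.

```python
import io
import os
import tempfile


def to_text(piece, encoding="utf-8"):
    """Return text, decoding the piece first if it is a byte string."""
    if isinstance(piece, bytes):
        return piece.decode(encoding)
    return piece


def write_html(path, pieces, encoding="utf-8"):
    """Join the pieces as text and write them with one explicit encoding."""
    text = "".join(to_text(piece, encoding) for piece in pieces)
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)
    return text


# Made-up mixed input: two byte strings and one text string.
html = [b"<p>", "caf\u00e9".encode("utf-8"), "</p>\n"]
print([type(piece).__name__ for piece in html])  # shows which pieces are bytes

path = os.path.join(tempfile.mkdtemp(), "test.html")
written = write_html(path, html)
```

    Decoding at the edges and encoding once on output keeps the implicit ASCII codec out of the picture entirely.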
    Terry Jan Reedy, Apr 29, 2013
    #10
  11. On 5/1/2013 5:20 PM, Tony the Tiger wrote:
    > On Mon, 29 Apr 2013 10:47:46 +0100, cl wrote:
    >
    >> raw = os.path.join(directory, self.getNameNoExtension()) +
    >> ".html"
    >> file = open(raw, "w")
    >> file.write("".join(html).encode('utf-8'))
    >> file.close()

    > This works for me:
    >
    > Python 2.7.3 (default, Aug 1 2012, 05:16:07)
    > [GCC 4.6.3] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    >>>> html='<html><head><title>Blah</title><body>éåäö</body></html>'
    >>>> f=open('test.html', 'w')
    >>>> f.write(''.join(html.decode('utf-8').encode('utf-8')))
    >>>> f.close()

    > Perhaps there are better ways to do it.


    Your .write() line is exactly equivalent to:

    f.write(html)

    Because: if X is a UTF-8 bytestring, then:

    X.decode('utf-8').encode('utf-8') == X

    And if X is a bytestring, then:

    ''.join(X) == X
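
    Both identities are easy to check interactively. The sketch below is for Python 3, where iterating over a bytes object yields integers, so the join identity is shown with a text string instead of a bytestring; the variable names are illustrative only.

```python
# A UTF-8 bytestring survives a decode/encode round trip unchanged.
x = "éåäö".encode("utf-8")
round_trip = x.decode("utf-8").encode("utf-8")
print(round_trip == x)

# Joining the characters of a string with an empty separator rebuilds it.
text = x.decode("utf-8")
print("".join(text) == text)
```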

    --Ned.

    >
    > /Grrr
    Ned Batchelder, May 1, 2013
    #11
  12. On 4/29/2013 5:47 AM, wrote:
    > If I understand correctly the encode() is saying that it can't
    > understand the data in the html because there's a character 0xc3 in it.
    > I *think* this means that the é is encoded in UTF-8 already in the
    > incoming data stream (should be as my system is wholly UTF-8 as far as I
    > know and I created the directory name).
    >
    > So how do I change the code so I don't get the error? Do I just
    > decode() the data first and then encode() it?
    >


    BTW, I did a presentation at PyCon 2012 that many people have found
    helpful: Pragmatic Unicode, or, How Do I Stop the Pain:
    http://nedbatchelder.com/text/unipain.html . It explains the principles
    at work here.

    --Ned.
    Ned Batchelder, May 2, 2013
    #12
