How do I encode and decode this data to write to a file?


cl

I am debugging some code that creates a static HTML gallery from a
directory hierarchy full of images. It's this package:-
https://pypi.python.org/pypi/Gallery2.py/2.0


It's basically working and does pretty much what I want so I'm happy to
put some effort into it and fix things.

The problem I'm currently chasing is that it can't cope with directory
names that have accented characters in them, it fails when it tries to
write the HTML that creates the page with the thumbnails on.

The code that's failing is:-

raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
file = open(raw, "w")
file.write("".join(html).encode('utf-8'))
file.close()

The variable html is a list containing the lines of HTML to write to the
file. It fails when it contains accented characters (an é in this
case). Here's the traceback:-

Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 41, in run self._recurse()
File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 272, in _recurse os.path.walk(self.props["sourcedir"], self.processDir, None)
File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg)
File "/usr/lib/python2.7/posixpath.py", line 246, in walk walk(name, func, arg) File "/usr/lib/python2.7/posixpath.py", line 238, in walk func(arg, top, names)
File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 263, in processDir self.createGallery()
File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 215, in createGallery self.picturemanager.createPictureHTMLs(self.footer)
File "/usr/local/lib/python2.7/dist-packages/gallery/picturemanager.py", line 84, in createPictureHTMLs curPic.createPictureHTML(self.galleryDirectory, self.getStylesheet(), self.fullsize, footer)
File "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361, in createPictureHTML file.write("".join(html).encode('utf-8')) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 783: ordinal not in range(128)



If I understand correctly the encode() is saying that it can't
understand the data in the html because there's a character 0xc3 in it.
I *think* this means that the é is encoded in UTF-8 already in the
incoming data stream (should be as my system is wholly UTF-8 as far as I
know and I created the directory name).

So how do I change the code so I don't get the error? Do I just
decode() the data first and then encode() it?
 
Andrew Berg

cl said:
If I understand correctly the encode() is saying that it can't
understand the data in the html because there's a character 0xc3 in it.
I *think* this means that the é is encoded in UTF-8 already in the
incoming data stream (should be as my system is wholly UTF-8 as far as I
know and I created the directory name).

You can verify that your filesystem is set to use UTF-8 with
sys.getfilesystemencoding(). If it returns 'ascii', then your locale
settings are incorrect.
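
The same check can be run as a small script instead of at the interactive prompt. This is only a sketch: report_encodings is a made-up helper name, and the locale.getpreferredencoding(False) and sys.getdefaultencoding() calls are extra data points that were not part of the suggestion above.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python uses by default on this machine."""
    return {
        # Encoding used to convert file and directory names.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding the locale settings suggest for text data.
        "locale": locale.getpreferredencoding(False),
        # Encoding used for implicit str/unicode conversions in Python 2
        # (always 'utf-8' on Python 3).
        "default": sys.getdefaultencoding(),
    }


encodings = report_encodings()
for name in sorted(encodings):
    print("%s: %s" % (name, encodings[name]))
```

On a Python 2 system a "filesystem" value of UTF-8 next to a "default" value of ascii is normal, and it is the second one that matters for implicit conversions.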
 
Peter Otten

Note that you are getting a *UnicodeDecodeError*, not a UnicodeEncodeError.
Try omitting the encode() step, i.e. instead of

file.write("".join(html).encode('utf-8'))

use

file.write("".join(html))

Background (applies to Python 2 only): the str type deals with bytes, not
code points. The right thing to do is to use .decode(...) to convert from
str to unicode and .encode(...) to convert from unicode to str. In Python 2,
however, the str type has an encode(...) method which is basically equivalent
to

    class str:
        # imaginary python implementation of python2's str
        ...
        def encode(self, encoding):
            return self.decode("ascii").encode(encoding)

and is almost never called intentionally.

PS Python3 has relabeled unicode to str and thus uses unicode by default.
str was renamed to bytes and the annoying bytes.encode() method is gone.
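
The hidden decode step described in the background note can be reproduced on any Python version by spelling it out. A minimal sketch, assuming the byte string holds UTF-8 data; py2_style_str_encode is a made-up name that imitates the behaviour and is not a real library function.

```python
def py2_style_str_encode(raw, encoding):
    """Imitate Python 2's str.encode(): implicit ASCII decode, then encode."""
    return raw.decode("ascii").encode(encoding)


# "café" already encoded as UTF-8; the é becomes the two bytes 0xc3 0xa9.
already_encoded = u"caf\u00e9".encode("utf-8")

try:
    py2_style_str_encode(already_encoded, "utf-8")
    message = None
except UnicodeDecodeError as exc:
    # The hidden decode step fails before any encoding is attempted.
    message = str(exc)

print(message)
```

The message names the ascii codec and byte 0xc3, just like the traceback in the question, even though the code only asked for an encode to UTF-8.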
 
Dave Angel

On 04/29/2013 05:47 AM, (e-mail address removed) wrote:

A couple of generic comments: your email program made a mess of the
traceback by appending each source line to the location information.

Please mention your Python version & OS. Apparently you're running 2.7
on Linux or similar.

cl said:
raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
file = open(raw, "w")
file.write("".join(html).encode('utf-8'))

You can't encode byte data, it's already encoded. So you're forcing the
Python system to implicitly decode it (using ASCII codec) before letting
you encode it to utf-8. If you think it's already in utf-8, then omit
the encode() call there.

Additionally, you can debug things with some simple print statements, at
least if you decompose your 3-function line so you can get at the
intermediate data. Split the line into three parts:

temp1 = "".join(html)          # temp1 is byte data
temp2 = temp1.decode()         # temp2 is unicode data (decode() defaults to ASCII in Python 2)
temp3 = temp2.encode("utf-8")  # temp3 is byte data again
file.write(temp3)

Now, you'll presumably get the error on the second line, so examine the
bytes around byte 783. Make sure it's really in utf-8, and if it is,
then skip the decode and the encode. If it's not, then Andrew's advice
is pertinent.
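
Examining the bytes near the reported position and checking whether they are UTF-8 could look something like this. It is a sketch only: bytes_around and is_valid_utf8 are made-up helper names, and sample is a stand-in for the real "".join(html) data, built so that an é sits at byte 783.

```python
def bytes_around(data, position, width=10):
    """Return the slice of data within `width` bytes of `position`."""
    start = max(0, position - width)
    return data[start:position + width]


def is_valid_utf8(data):
    """Report whether a byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Stand-in for "".join(html): 783 ASCII bytes, then a UTF-8 encoded é.
sample = b"x" * 783 + u"\u00e9".encode("utf-8") + b" tail"

window = bytes_around(sample, 783, width=4)
print(repr(window))

# UTF-8 data can be written as-is; a Latin-1 encoded é is not valid UTF-8.
sample_ok = is_valid_utf8(sample)
latin1_ok = is_valid_utf8(u"\u00e9".encode("latin-1"))
print(sample_ok)
print(latin1_ok)
```

A 0xc3 byte followed by 0xa9 in the window is the UTF-8 form of é, which matches the byte named in the traceback.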

I would also look at the variable html. It's a list, but what are the
types of the elements in it?
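
One way to answer the question about element types, and to write the page safely whatever the mix turns out to be, is to convert every piece to text first and let the file object do the encoding. This is a sketch under assumptions, not the package's own code: to_text and write_html are made-up helpers, the html list here is invented sample data, and the byte pieces are assumed to be UTF-8.

```python
import io
import os
import tempfile


def to_text(piece, encoding="utf-8"):
    """Decode byte strings; pass text through unchanged."""
    if isinstance(piece, bytes):
        return piece.decode(encoding)
    return piece


def write_html(path, pieces, encoding="utf-8"):
    """Join mixed byte/text pieces as text and write them in one encoding."""
    text = u"".join(to_text(piece, encoding) for piece in pieces)
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


# Hypothetical page fragments: text pieces around one UTF-8 byte piece.
html = [u"<h1>", u"caf\u00e9".encode("utf-8"), u"</h1>"]
kinds = sorted(set(type(piece).__name__ for piece in html))
print(kinds)

path = os.path.join(tempfile.mkdtemp(), "page.html")
write_html(path, html)
with io.open(path, "rb") as handle:
    written = handle.read()
print(repr(written))
```

io.open exists in Python 2.6+ and is the built-in open in Python 3, so the same code should run on both.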
 
cl

Andrew Berg said:
You can verify that your filesystem is set to use UTF-8 with sys.getfilesystemencoding().
If it returns 'ascii', then your locale settings
are incorrect.

chris$ python
Python 2.7.3 (default, Sep 26 2012, 21:51:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.

So I am set up right for UTF-8.
 
cl

Dave Angel said:
A couple of generic comments: your email program made a mess of the
traceback by appending each source line to the location information.

What's me email program got to do with it? :) I'm using a dedicated
newsreader (tin) as I posted via the gmane/usenet interface. The posting
looks perfectly OK to me when I read it back from usenet.

Dave Angel said:
Please mention your Python version & OS. Apparently you're running 2.7
on Linux or similar.

Sorry, yes you're spot on.

Dave Angel said:
You can't encode byte data, it's already encoded. So you're forcing the
Python system to implicitly decode it (using ASCII codec) before letting
you encode it to utf-8. If you think it's already in utf-8, then omit
the encode() call there.

It's the way the code was as I installed it from pypi. What you say
makes a lot of sense though, I'll remove the encode().

Dave Angel said:
Additionally, you can debug things with some simple print statements, at
least if you decompose your 3-function line so you can get at the
intermediate data. Split the line into three parts;
temp1 = "".join(html) #temp1 is byte data
temp2 = temp1.decode() #temp2 is unicode data
temp3 = temp2.encode("utf-8") #temp3 is byte data again
file.write(temp3)

OK, thanks for this and all the other advice on this thread.
 
Robert Kern

cl said:
What's me email program got to do with it? :) I'm using a dedicated
newsreader (tin) as I posted via the gmane/usenet interface. The posting
looks perfectly OK to me when I read it back from usenet.

FWIW, I see the same problem Dave sees. I'm using gmane via Thunderbird.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
cl

Robert Kern said:
FWIW, I see the same problem Dave sees. I'm using gmane via Thunderbird.

How strange. I think it must be something to do with the gmane
interface between news and mail then.
 
Skip Montanaro

cl said:
How strange. I think it must be something to do with the gmane
interface between news and mail then.

Probably. It was borked in Gmail as well...

Skip
 
Terry Jan Reedy

cl said:
Here's the traceback:-
File "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361,
in createPictureHTML file.write("".join(html).encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
783: ordinal not in range(128)

Generic advice for anyone getting unicode errors:
unpack the composition producing the error
so that one can see which operation produced it.

In this case
s = "".join(html)
s = s.encode('utf-8')
file.write(s)

This also makes it possible to print intermediate results.
print(type(s), s) # would have been useful
Doing so would have immediately shown that in this case the error was
the encode operation, because s was already bytes.
For many other posts, the error with the same type of message has been
the print or write operation, due to output encoding issues, but that was
not the case here.
 
Ned Batchelder

raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
file = open(raw, "w")
file.write("".join(html).encode('utf-8'))
file.close()

This works for me:

Python 2.7.3 (default, Aug 1 2012, 05:16:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

Perhaps there are better ways to do it.

Your .write() line is exactly equivalent to:

f.write(html)

Because: if X is a UTF-8 bytestring, then:

X.decode('utf-8').encode('utf-8') == X

And if X is a bytestring, then:

''.join(X) == X

--Ned.
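
Both identities are easy to check at the prompt. A small sketch with invented sample data: the join identity is shown on a text string so that it also runs on Python 3, where iterating over a bytes object yields integers rather than one-character strings.

```python
# A UTF-8 byte string: "été" is five bytes once encoded.
x = u"\u00e9t\u00e9".encode("utf-8")

# Decoding and re-encoding with the same codec gives back the same bytes.
round_trip = x.decode("utf-8").encode("utf-8")
print(round_trip == x)

# Joining a string's own characters rebuilds the string.
text = u"<p>\u00e9t\u00e9</p>"
rejoined = u"".join(text)
print(rejoined == text)
```

So when the data is already a single UTF-8 byte string, the decode, encode and join steps cancel out and only the write does any work.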
 
Ned Batchelder

cl said:
If I understand correctly the encode() is saying that it can't
understand the data in the html because there's a character 0xc3 in it.
I *think* this means that the é is encoded in UTF-8 already in the
incoming data stream (should be as my system is wholly UTF-8 as far as I
know and I created the directory name).

So how do I change the code so I don't get the error? Do I just
decode() the data first and then encode() it?

BTW, I did a presentation at PyCon 2012 that many people have found
helpful: Pragmatic Unicode, or, How Do I Stop the Pain:
http://nedbatchelder.com/text/unipain.html . It explains the principles
at work here.

--Ned.
 
