Unicode issue with Python v3.3

nagia.retsina · Apr 11, 2013

Of course, here is what it looks like:

if page.endswith('.html'):
    f = open( "/home/nikos/www/" + page, encoding="utf-8" )
    htmldata = f.read()
    htmldata = htmldata % (quote, music)

    counter = ''' <center>
<a href="mailto:[email protected]"> <img src="/data/images/mail.png"></a>
<table border=2 cellpadding=2 bgcolor=black>
<td><font color=lime>Αριθμός Επισκεπτών</td>
<td><a href="http://superhost.gr/?show=log&page=%s"><font color=yellow> %d </td>
</table><br>
''' % (page, data[0])

    template = htmldata + counter
    print( template )

Nikos · Apr 11, 2013

On Thursday, 11 April 2013 1:45:22 PM UTC+3, Cameron Simpson wrote:

> | Firstly, thank you for taking a look into the code.
> | The doctype is coming from the attempt of the script metrites.py to open and read the 'index.html' file.
> | But I don't know how to try to open it as a byte file instead of a text file.
>
> I think you've got it backwards. It looks like metrites.py has
> opened the file as bytes instead of as text (probably utf8, but
> that remains to be seen). Because it has opened it in binary mode
> you're getting bytes when you read from the file.
>
> Can you show the relevant code that opens the files and reads from
> it, and the print statement that is putting it back out?
>
> You probably need to ensure that metrites.py is opening it as text,
> with the correct encoding. Note that the encoding has nothing to
> do with your _output_. It is the encoding of the data in the file
> you are reading, and that is dictated by the editor used to make
> the file.

Webhost && Weblog

This works in the shell, but doesn't work on my website:

$ cat utf8.txt
υλικό!Πρόκειται γ
$ python3
Python 3.2.3 (default, Oct 19 2012, 20:10:41)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> data = open('utf8.txt').read()
>>> print(data)
υλικό!Πρόκειται γ
>>> print(data.encode('utf-8'))
b'\xcf\x85\xce\xbb\xce\xb9\xce\xba\xcf\x8c!\xce\xa0\xcf\x81\xcf\x8c\xce\xba\xce\xb5\xce\xb9\xcf\x84\xce\xb1\xce\xb9 \xce\xb3\n'

See, the last line is what I'm getting on my website. If I remove the encode('utf-8') part in metrites.py, the webpage will not show anything at all.

Michael Torrie · Apr 12, 2013

> I'm not sure I follow you. How did my topic change?! Is this
> possible?

This is a mailing list/nntp newsgroup. The subject line can be changed
arbitrarily by anyone replying to another message. Normally this is
done to indicate a natural progression of the conversation in a new
direction. In this case, Steven D'Aprano wrote a reply that did not
answer your pleas, but instead made some observations, and so he changed
the subject line to reflect that.

If you read your messages using a threaded message display, this will
make more sense to you. But if you use Gmail's (or Google's) broken
conversation view, then this information about who is responding to whom
does get lost--actually in conversation view a lot of information about
the message flow is lost; it really is unfortunate that this way of
communicating has become so widespread.

> How about the code I posted at pastebin.com? Did anyone by any chance
> have a look at it?
>
> It's only a single thing I am missing for the encoding, and then the
> script will load properly with Python 3.3.

I'm truly sorry, but I simply do not have the time to do so.

nagia.retsina · Apr 12, 2013

Well, can somebody else propose something, please?

I have pasted the whole script and even the relevant snippet that is perhaps causing this encoding confusion in 3.3.

alex23 · Apr 12, 2013

> Well, can somebody else propose something, please?

Pay for a professional.

nagia.retsina · Apr 12, 2013

On Friday, 12 April 2013 8:06:14 AM UTC+3, alex23 wrote:

> Pay for a professional.

Just for a simple encoding problem that will be solved by not even one single line of code?

Don't think so.

nagia.retsina · Apr 12, 2013

Someone HEEEEEEEEEELP MEEEEEEEEE!!

Chris Angelico · Apr 12, 2013

> Someone HEEEEEEEEEELP MEEEEEEEEE!!

ChrisA

nagia.retsina · Apr 12, 2013

On Friday, 12 April 2013 4:14:39 PM UTC+3, Chris Angelico wrote:

> ChrisA

Well, instead of being a smartass it would be nice if you could actually help for once.

Chris Angelico · Apr 12, 2013

On Friday, 12 April 2013 4:14:39 PM UTC+3, Chris Angelico wrote:

> Well, instead of being a smartass it would be nice if you could actually help for once.

Yeah, I'm done with that. Your whining ran through my patience a few
posts ago. But you should feel special; I clipped that just for you.

ChrisA

rusi · Apr 12, 2013

On Friday, 12 April 2013 4:14:39 PM UTC+3, Chris Angelico wrote:

> Well, instead of being a smartass it would be nice if you could actually help for once.

Interesting!
Among the things which you don't seem to know is the meaning of the
word 'once'.

nagia.retsina · Apr 12, 2013

On Friday, 12 April 2013 4:29:51 PM UTC+3, rusi wrote:

> On Friday, 12 April 2013 4:14:39 PM UTC+3, Chris Angelico wrote:
>
> Interesting!
> Among the things which you don't seem to know is the meaning of the
> word 'once'.

The same applies to you too. Stop being smartasses.

Ian Kelly · Apr 12, 2013

On Friday, 12 April 2013 4:29:51 PM UTC+3, rusi wrote:

> The same applies to you too. Stop being smartasses.

Please keep in mind that this is a community of volunteers. Nobody
here is being paid for their time to help you fix your website, and if
you manage to irritate us in the process, we're likely to just walk
away from it.

I looked over the code that you have provided us with, and based on
that I could not see any reason why the html would be in the form of a
bytes instead of a str. Since nobody else here seems to have any
further insight into the problem either, you're just going to have to
find a way to debug the code. If you cannot do that on your own,
then I suggest that you find a contractor who can, hire them, and
grant them the access they need to do a real debugging session.

I would also recommend that in the future you should stop deploying
untested code to your production website. Set up a development
environment for yourself, make the changes there, and only deploy when
you know that everything is working.

Roy Smith · Apr 12, 2013

Ian Kelly said:
> I would also recommend that in the future you should stop deploying
> untested code to your production website. Set up a development
> environment for yourself, make the changes there, and only deploy when
> you know that everything is working.

But that takes all the fun out of it

nagia.retsina · Apr 12, 2013

On Friday, 12 April 2013 9:37:29 PM UTC+3, Ian wrote:

> Please keep in mind that this is a community of volunteers. Nobody
> here is being paid for their time to help you fix your website, and if
> you manage to irritate us in the process, we're likely to just walk
> away from it.
>
> I looked over the code that you have provided us with, and based on
> that I could not see any reason why the html would be in the form of a
> bytes instead of a str. Since nobody else here seems to have any
> further insight into the problem either, you're just going to have to
> find a way to debug the code. If you cannot do that on your own,
> then I suggest that you find a contractor who can, hire them, and
> grant them the access they need to do a real debugging session.
>
> I would also recommend that in the future you should stop deploying
> untested code to your production website. Set up a development
> environment for yourself, make the changes there, and only deploy when
> you know that everything is working.

I agree with what you say, except for the claim that I try to irritate people.
Look at the thread and you will see who irritated whom first.

Cameron Simpson · Apr 13, 2013

| On Thursday, 11 April 2013 1:45:22 PM UTC+3, Cameron Simpson wrote:
| > | The doctype is coming from the attempt of the script metrites.py to open and read the 'index.html' file.
| > | But I don't know how to try to open it as a byte file instead of a text file.

Lele Gaifax showed one way:

    from codecs import open
    with open('index.html', encoding='utf-8') as f:
        content = f.read()

But a plain open() should also do:

    with open('index.html') as f:
        content = f.read()

if you're not taking tight control of the file encoding.

The point here is to get _text_ (i.e. str) data from the file, not bytes.

If the text turns out to be incorrectly decoded (i.e. the file bytes
are read and assembled into the wrong text strings) because
the default encoding is wrong, then you may need to reach for Lele's
more verbose open() example to select the correct encoding.

But first ignore that and get text (str) instead of bytes.
If you're already getting text from the file, something later is
making bytes and handing it to print().
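
A minimal sketch of that difference, assuming a throwaway file (the temporary path and the sample string are made up for illustration and are not taken from metrites.py). The same file yields str when opened in text mode and bytes when opened in binary mode:

```python
import os
import tempfile

# Hypothetical sample content; any non-ASCII text would do.
sample = "υλικό!"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Text mode: read() returns str, decoded with the given encoding.
    with open(path, encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode: read() returns the raw, undecoded bytes.
    with open(path, "rb") as f:
        as_bytes = f.read()

print(type(as_text).__name__, type(as_bytes).__name__)
```

If the value that reaches print() is of the second kind, something between the open() call and the print() call is producing bytes.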

Another approach to try is to use
sys.stdout.write()
instead of
print()

The print() function will take _anything_ and write text of some form.
The write() function will throw an exception if it gets the wrong type of data.

If sys.stdout is opened in binary mode then write() will require
bytes as data; strings will need to be explicitly turned into bytes
via .encode() in order to not raise an exception.

If sys.stdout is open in text mode, write() will require str data.
The sys.stdout file itself will transcribe to bytes for you.

If you take that route, at least you will not have confusion about
str versus bytes.

For an HTML output page I would advocate arranging that sys.stdout
is in text mode; that way you can do the natural thing and .write()
str data and lovely UTF-8 bytes will come out the other end.

If the above test (using .write() instead of print()) shows it to
be in binary mode we can fix that. But you need to find out.
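
The strictness of write() can be seen without touching the real sys.stdout. The sketch below uses in-memory streams as stand-ins for a binary-mode and a text-mode stdout; the names binary_out, text_out and page are hypothetical.

```python
import io

# In-memory stand-ins for a binary-mode and a text-mode stdout.
binary_out = io.BytesIO()
text_out = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")

page = "υλικό!"  # hypothetical page content

# A binary stream rejects str with a TypeError ...
try:
    binary_out.write(page)
    binary_rejects_str = False
except TypeError:
    binary_rejects_str = True
# ... but accepts explicitly encoded bytes.
binary_out.write(page.encode("utf-8"))

# A text stream rejects bytes with a TypeError ...
try:
    text_out.write(page.encode("utf-8"))
    text_rejects_bytes = False
except TypeError:
    text_rejects_bytes = True
# ... and encodes str to UTF-8 itself.
text_out.write(page)
text_out.flush()

utf8_bytes = text_out.buffer.getvalue()
print(binary_rejects_str, text_rejects_bytes, utf8_bytes == binary_out.getvalue())
```

Either way the same UTF-8 bytes come out; the only question is whether you encode them yourself or let the stream do it.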

You will want access to the error messages from the CGI environment;
do you have access to the web server's error_log? You can tail that
in a terminal while you reload the page to see what's going on.

| This works in the shell, but doesn't work on my website:
|
| $ cat utf8.txt
| υλικό!Πρόκειται γ

Ok, so your terminal is using UTF-8 as its output coding. (And so
is your mail posting program, since we see it unmangled on my screen
here.)

| $ python3
| Python 3.2.3 (default, Oct 19 2012, 20:10:41)
| [GCC 4.6.3] on linux2
| Type "help", "copyright", "credits" or "license" for more information.
| >>> data = open('utf8.txt').read()
| >>> print(data)
| υλικό!Πρόκειται γ

Likewise.

However, in an exciting twist, I seem to recall that Python invoked
interactively with a terminal as output will have the default terminal
encoding in place on sys.stdout, producing what you expect. _However_,
for Python invoked in a batch environment where stdout is not a terminal
(such as in the CGI environment producing your web page), that is
_not_ necessarily the case.
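
One way to check this, and one possible remedy, sketched under assumptions: the helper name utf8_text_stream is made up, and whether rewrapping sys.stdout is appropriate depends on how the web server starts the script. The diagnostic goes to stderr so that, under CGI, it would normally land in the server's error log rather than in the page.

```python
import io
import sys

# Report which encoding print() would use and whether stdout is a terminal.
sys.stderr.write("stdout encoding: %r, isatty: %r\n"
                 % (getattr(sys.stdout, "encoding", None), sys.stdout.isatty()))

def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that writing str emits UTF-8 bytes."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8",
                            newline="\n", line_buffering=True)

# In a real script one might do this near the top (an assumption, not
# something taken from metrites.py):
#     sys.stdout = utf8_text_stream(sys.stdout.buffer)
# Demonstrated here on an in-memory buffer instead:
buf = io.BytesIO()
out = utf8_text_stream(buf)
print("υλικό!", file=out)
out.flush()
```

After this, buf holds the UTF-8 bytes of the text plus a newline, regardless of the locale the process was started with.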

| >>> print(data.encode('utf-8'))
| b'\xcf\x85\xce\xbb\xce\xb9\xce\xba\xcf\x8c!\xce\xa0\xcf\x81\xcf\x8c\xce\xba\xce\xb5\xce\xb9\xcf\x84\xce\xb1\xce\xb9 \xce\xb3\n'
|
| See, the last line is what i'am getting on my website.

The above line takes your Unicode text in "data" and transcribes
it to bytes using UTF-8 as the encoding. And print() is then receiving
that bytes object and printing its str() representation as "b'....'".
That str is itself Unicode, and when print passes it to sys.stdout,
_that_ transcribes the Unicode "b'...'" string as bytes to your
terminal, using UTF-8 based on the previous examples above. But
since all those characters are in the ASCII range (code points 0 to
127), the byte sequence will be the same if it uses ASCII or ISO8859-1
or almost anything else.

As you can see, there's a lot of encoding/decoding going on behind
the scenes even in this superficially simple example.
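
The same effect can be reproduced in isolation. This small sketch (the variable names are illustrative) shows that what print() displays for a bytes object is its repr, which is a plain ASCII string:

```python
data = "υλικό!"

encoded = data.encode("utf-8")   # str -> bytes
shown = repr(encoded)            # the text print(encoded) would display

# The repr is ordinary str made only of ASCII characters, so it encodes
# to the same bytes under ASCII and UTF-8.
same_bytes = shown.encode("ascii") == shown.encode("utf-8")

print(shown)
```

That literal b'...' text is what ends up in the web page when the encoded value is handed to print().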

| If i remove
| the encode('utf-8') part in metrites.py, the webpage will not show
| anything at all...

Ah, but data will be being output. The print() function _will_ be
writing "data" out in some form. I suggest you remove the .encode()
and then examine the _source_ text of the web page, not its visible
form.

So: remove .encode(), reload the web page, and "view page source"
(how depends on your browser; it is Ctrl-U in Firefox, or Cmd-U in
Firefox on a Mac).

I think a lot of the issue you have in this thread is that your
page is too complex. Make another page to do the same thing, and
start with nothing. Add stuff to it a single item at a time until
the page behaves incorrectly. Then you will know the exact item of
code that introduced the issue. And then that single item can be
examined in detail for the decode/encode issues.

The other issue in the thread is that people losing patience get
snarky. Respond only to the technical content. If a message is only
snarky, _ignore_ it. People like the last word; let them have it
and you won't get sidetracked into arguments.

Cheers,
