the stupid encoding problem to stdout

  • Thread starter Sérgio Monteiro Basto

Sérgio Monteiro Basto

hi,
cat test.py
#!/usr/bin/env python
#-*- coding: utf-8 -*-
u = u'moçambique'
print u.encode("utf-8")
print u

chmod +x test.py
./test.py
moçambique
moçambique

./test.py > output.txt
Traceback (most recent call last):
  File "./test.py", line 5, in <module>
    print u
UnicodeEncodeError: 'ascii' codec can't encode character
u'\xe7' in position 2: ordinal not in range(128)

This is Python 2.7.
How do I tell Python to send the same thing to stdout and
to the file output.txt?

It doesn't seem logical that the behaviour changes when the
output is sent to a file.

Thanks,
Sérgio M. B.
 
Benjamin Kaplan

2011/6/8 Sérgio Monteiro Basto said:
hi,
cat test.py
#!/usr/bin/env python
#-*- coding: utf-8 -*-
u = u'moçambique'
print u.encode("utf-8")
print u

chmod +x test.py
./test.py
moçambique
moçambique

./test.py > output.txt
Traceback (most recent call last):
 File "./test.py", line 5, in <module>
   print u
UnicodeEncodeError: 'ascii' codec can't encode character
u'\xe7' in position 2: ordinal not in range(128)

This is Python 2.7.
How do I tell Python to send the same thing to stdout and
to the file output.txt?

It doesn't seem logical that the behaviour changes when the
output is sent to a file.

Thanks,
Sérgio M. B.

That's not a terminal vs file thing. It's a "file that declares its
encoding" vs a "file that doesn't declare its encoding" thing. Your
terminal declares that it is UTF-8. So when you print a Unicode string
to your terminal, Python knows that it's supposed to turn it into
UTF-8. When you redirect the output to a file, that file doesn't declare
an encoding. So rather than guess which encoding you want, Python
defaults to the lowest common denominator: ASCII. If you want
something to be a particular encoding, you have to encode it yourself.

You have a couple of choices on how to make it work:
1) Play dumb and always encode as UTF-8. This would look really weird
if someone tried running your program in a terminal with a CP437
encoding (like cmd.exe on at least the US version of Windows), but it
would never crash.
2) Check sys.stdout.encoding. If it's ascii (or None, as it is in
Python 2 when output is redirected), then encode your unicode
string in the unicode-escape encoding, which substitutes the escape
sequence in for all non-ASCII characters.
3) Check to see if sys.stdout.isatty() and have different behavior for
terminals vs files. If you're on a terminal that doesn't declare its
encoding, encoding it as UTF-8 probably won't help. If you're writing
to a file, that might be what you want to do.
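
A rough sketch of the first two options, for illustration only. The names `choose_encoding`, `render` and `FakeStream` are hypothetical helpers (not part of Python), and falling back to UTF-8 when the stream declares nothing is an assumption about what you want in a file:

```python
# -*- coding: utf-8 -*-


def choose_encoding(stream, fallback="utf-8"):
    """Return the encoding a stream declares, or `fallback` if it declares none."""
    declared = getattr(stream, "encoding", None)
    return declared or fallback


def render(text, stream, fallback="utf-8"):
    """Encode `text` to bytes for `stream`.

    Characters the chosen encoding cannot represent become backslash
    escapes instead of raising UnicodeEncodeError.
    """
    return text.encode(choose_encoding(stream, fallback), "backslashreplace")


class FakeStream(object):
    """Hypothetical stand-in for sys.stdout so the sketch runs anywhere."""

    def __init__(self, encoding):
        self.encoding = encoding


word = u"mo\xe7ambique"  # u'moçambique'
as_file = render(word, FakeStream(None))       # no declared encoding: UTF-8
as_ascii = render(word, FakeStream("ascii"))   # ASCII-only stream: escaped
```

In a real script you would pass sys.stdout instead of FakeStream and write the resulting bytes yourself.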
 
Sérgio Monteiro Basto

Benjamin said:
That's not a terminal vs file thing. It's a "file that declares its
encoding" vs a "file that doesn't declare its encoding" thing. Your
terminal declares that it is UTF-8. So when you print a Unicode string
to your terminal, Python knows that it's supposed to turn it into
UTF-8. When you redirect the output to a file, that file doesn't declare
an encoding. So rather than guess which encoding you want, Python
defaults to the lowest common denominator: ASCII. If you want
something to be a particular encoding, you have to encode it yourself.

Exactly the opposite: if Python doesn't know the encoding, it should not try
to decode to ASCII.
You have a couple of choices on how to make it work:
1) Play dumb and always encode as UTF-8. This would look really weird
if someone tried running your program in a terminal with a CP437
encoding (like cmd.exe on at least the US version of Windows), but it
would never crash.

I want Python not to care about the terminal's encoding and to send characters
as they are, whether to a terminal or to a file.
2) Check sys.stdout.encoding. If it's ascii, then encode your unicode
string in the unicode-escape encoding, which substitutes the escape
sequence in for all non-ASCII characters.

How do I change sys.stdout.encoding to always be UTF-8? I'd at least like a
consistent sys.stdout.encoding.
3) Check to see if sys.stdout.isatty() and have different behavior for
terminals vs files. If you're on a terminal that doesn't declare its
encoding, encoding it as UTF-8 probably won't help. If you're writing
to a file, that might be what you want to do.


Thanks,
 
Sérgio Monteiro Basto

Ben said:
In this case your terminal is reporting its encoding to Python, and it's
capable of taking the UTF-8 data that you send to it in both cases.


In this case your shell has no preference for the encoding (since you're
redirecting output to a file).

How do I tell Python that I want it to write UTF-8 to files?
 
Nobody

Exactly the opposite: if Python doesn't know the encoding, it should not try
to decode to ASCII.

What should it decode to, then?

You can't write characters to a stream, only bytes.
I want Python not to care about the terminal's encoding and to send characters
as they are, whether to a terminal or to a file.

You can't write characters to a stream, only bytes.
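
To make that concrete, here is a small illustration (not from the original posts) of the same ten characters turning into different bytes depending on the encoding you pick, and of ASCII having no byte at all for ç:

```python
# -*- coding: utf-8 -*-
text = u"mo\xe7ambique"                  # u'moçambique', 10 characters

utf8_bytes = text.encode("utf-8")        # ç becomes two bytes: 0xC3 0xA7
latin1_bytes = text.encode("latin-1")    # ç becomes one byte: 0xE7

try:
    text.encode("ascii")                 # ASCII cannot represent ç
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Whatever ends up on the stream is one of these byte sequences, so some encoding always has to be chosen.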
 
Terry Reedy

What should it decode to, then?

You can't write characters to a stream, only bytes.


You can't write characters to a stream, only bytes.

Character representations are for people; byte representations are for
computers.
 
Mark Tolonen

How do I change sys.stdout.encoding to always be UTF-8? I'd at least like a
consistent sys.stdout.encoding.

There is an environment variable that can force Python I/O to be a specific
encoding:

PYTHONIOENCODING=utf-8
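
One way to check the effect, sketched under the assumption that sys.executable points at a working interpreter. The helper name `child_stdout_encoding` is made up for this example. It starts a child Python whose stdout is a pipe (not a terminal) and asks it which encoding it chose:

```python
import os
import subprocess
import sys

# Child program: report the encoding Python picked for its own stdout.
CHILD = "import sys; sys.stdout.write(str(sys.stdout.encoding))"


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and stdout piped."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = io_encoding
    out = subprocess.check_output([sys.executable, "-c", CHILD], env=env)
    return out.decode("ascii").strip()


reported = child_stdout_encoding("utf-8")
```

From a shell, the equivalent for the script in this thread would be `PYTHONIOENCODING=utf-8 ./test.py > output.txt`.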

-Mark
 
Sérgio Monteiro Basto

Mark said:
There is an environment variable that can force Python I/O to be a specific
encoding:

PYTHONIOENCODING=utf-8

Excellent, thanks, double thanks.

BTW: this should be set by default on UTF-8 systems like Fedora, Ubuntu, Debian,
Red Hat, and all Linux distributions. For sure I will put this in the startup of my systems.
 
Sérgio Monteiro Basto

Ben said:
Are you advocating that Python should refuse to write characters unless
the encoding is specified? I could sympathise with that, but currently
that's not what Python does; instead it defaults to the ASCII codec.

That could be a solution ;) or a smarter default based on LANG, for example (as
many GNU tools do).
 
Laurent Claessens

On 09/06/2011 04:18, Sérgio Monteiro Basto wrote:
hi,
cat test.py
#!/usr/bin/env python
#-*- coding: utf-8 -*-
u = u'moçambique'
print u.encode("utf-8")
print u

chmod +x test.py
./test.py
moçambique
moçambique


The following tries to encode before printing. If you pass an object that is
already UTF-8 encoded, it just prints it; if not, it encodes it. All the "print"
statements go through MyPrint.write.

#!/usr/bin/env python
#-*- coding: utf-8 -*-

import sys

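class MyPrint(object):
    """Replace sys.stdout with a wrapper that always writes UTF-8 bytes."""

    def __init__(self):
        self.old_stdout = sys.stdout
        sys.stdout = self

    def write(self, text):
        try:
            # unicode objects are encoded to UTF-8 here.
            encoded = text.encode("utf8")
        except UnicodeDecodeError:
            # Python 2 first decodes a non-ASCII byte string as ASCII,
            # which fails; assume such a string is already UTF-8.
            encoded = text
        self.old_stdout.write(encoded)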


MyPrint()

u = u'moçambique'
print u.encode("utf-8")
print u

TEST :

$ ./test.py
moçambique
moçambique

$ ./test.py > test.txt
$ cat test.txt
moçambique
moçambique
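
For comparison, a different technique from the MyPrint class above: the standard library's codecs.getwriter builds a writer that encodes text on the way out. This is only a sketch. The io.BytesIO buffer is an assumption standing in for a redirected stdout so the example can run anywhere:

```python
# -*- coding: utf-8 -*-
import codecs
import io

raw = io.BytesIO()                       # stands in for a redirected stdout
out = codecs.getwriter("utf-8")(raw)     # StreamWriter: text in, UTF-8 bytes out

out.write(u"mo\xe7ambique\n")            # u'moçambique'
written = raw.getvalue()
```

In Python 2.7 the same wrapper is commonly applied to the real stream with `sys.stdout = codecs.getwriter("utf-8")(sys.stdout)`. Unlike MyPrint, it is meant for text rather than already-encoded byte strings.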


By the way, my code will not help with error messages. I think that
errors are printed via sys.stderr.write, so if you want to do
raise Exception("moçambique")
you should think about adding stderr handling to the class MyPrint.


If you know French, I strongly recommend "Comprendre les erreurs
unicode" by Victor Stinner :
http://dl.afpy.org/pycon-fr-09/Comprendre_les_erreurs_unicode.pdf

Have a nice day
Laurent
 
Sérgio Monteiro Basto

Ben said:
But when you explicitly redirect to a file, it's not going to a TTY.
It's going to a file whose encoding isn't known unless you specify it.

OK, after thinking about this: the problem exists because Python wants to be
smart with ttys, which from my point of view is wrong. It should not encode to
UTF-8 just because the tty is in UTF-8. Python should always encode to the same
thing. If the default is ASCII, it should always encode to ASCII.
Yes, it should send ASCII to the tty too: if I send my code to someone on Windows
who uses a tty with cp1000whatever, it shouldn't give decoding errors; it
should send ASCII.
We can change the default to whatever we want, but without such a change of
default, Python should not change its behavior depending on the output.
Yes, I prefer strange output on a different platform to decode errors.
And I have /usr/bin/iconv.

Thanks for your attention, and sorry about my very limited English.
 
Chris Angelico

2011/6/11 Sérgio Monteiro Basto said:
OK, after thinking about this: the problem exists because Python wants to be
smart with ttys

The *anomaly* (not problem) exists because Python has a way of being
told a target encoding. If two parties agree on an encoding, they can
send characters to each other. I had this discussion at work a while
ago; my boss was talking about being "binary-safe" (which really meant
"8-bit safe"), while I was saying that we should support, verify, and
demand properly-formed UTF-8. The main significance is that agreeing
on an encoding means we can change the encoding any time it's
convenient, without having to document that we've changed the data -
because we haven't. I can take the number "twelve thousand three
hundred and forty-five" and render that as a string of decimal digits
as "12345", or as hexadecimal digits as "3039", but I haven't changed
the number. If you know that I'm giving you a string of decimal
digits, and I give you "12345", you will get the same number at the
far side.

Python has agreed with stdout that it will send it characters encoded
in UTF-8. Having made that agreement, Python and stdout can happily
communicate in characters, not bytes. You don't need to explicitly
encode your characters into bytes - and in fact, this would be a very
bad thing to do, because you don't know _what_ encoding stdout is
using. If it's expecting UTF-16, you'll get a whole lot of rubbish if
you send it UTF-8 - but it'll look fine if you send it Unicode.
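
The analogy can be checked directly. A small sketch, not from the original post: the number is the same in decimal and hexadecimal, the text is the same whether carried as UTF-8 or UTF-16, and it only turns to rubbish when the two sides disagree about the encoding:

```python
# -*- coding: utf-8 -*-
number = 12345
same_number = int("12345", 10) == number and int("3039", 16) == number

text = u"mo\xe7ambique"                  # u'moçambique'
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16")

# Both byte strings differ, but decode back to the same characters.
same_text = as_utf8.decode("utf-8") == as_utf16.decode("utf-16") == text

# Sender used UTF-8, receiver assumed UTF-16: no agreement, so rubbish.
mismatch = as_utf8.decode("utf-16", "replace")
```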

Chris Angelico
 
Sérgio Monteiro Basto

Ian said:
If you want your output to behave that way, then all you have to do is
specify that with an explicit encode step.
ok


Sorry, I disagree. If your program is going to fail, it's better that
it fail noisily (with an error) than silently (with no notice that
anything is wrong).

Hi,
OK, a little summary: I got the solution, which is setting the environment
variable PYTHONIOENCODING=utf-8. If that were the default on modern GNU/Linux,
it would have saved me lots of time.
My practical problem is simple: I write a script that I run in a shell
for testing and that logs to a file when used with a configuration.
Everything runs well in a shell and sometimes (later) fails when logging to a
file, with a "UnicodeEncodeError: 'ascii' codec can't encode character
u'\xe7' in position".
So to work in both cases (tty and files), I filled all my code with
string.encode('utf-8') as a workaround, when what I always wanted was to use
PYTHONIOENCODING=utf-8. Everything I have is in UTF-8: the database is in UTF-8,
I code in UTF-8, my OS is in UTF-8. In about 3 years of learning Python
I have lost many, many hours trying to understand this problem.
And see, I can send ASCII and UTF-8 to UTF-8 output and never have problems,
but if I send ASCII and UTF-8 to ASCII files I sometimes get encode errors.
So please consider, at least on Linux, defaulting to UTF-8 (because
we would have fewer problems), or making it clearer that redirecting to a file
is different from writing to a tty and that the problem is in files defaulting
to ASCII. Or make the default of PYTHONIOENCODING based on the LANG environment
variable.

Anyway, many thanks for your time and for helping me out.
I don't know how things work in Python 3. In Python 3, are the defaults UTF-8?

Thanks,
 
Chris Angelico

2011/6/14 Sérgio Monteiro Basto said:
And see, I can send ASCII and UTF-8 to UTF-8 output and never have problems,
but if I send ASCII and UTF-8 to ASCII files I sometimes get encode errors.

If something fits inside 7-bit ASCII, it is by definition valid UTF-8.
This is not a coincidence.
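
A quick way to convince yourself, as an illustrative sketch: take all 128 ASCII byte values and compare how the two codecs treat them. The single byte 0xE7 (ç in Latin-1) is included to show what falls outside that overlap:

```python
# Every 7-bit ASCII byte value, 0x00 through 0x7F.
ascii_bytes = bytes(bytearray(range(128)))

# Decoding as ASCII or as UTF-8 yields the same characters...
same_decoding = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# ...and encoding those characters gives the same bytes back either way.
text = ascii_bytes.decode("ascii")
same_encoding = text.encode("ascii") == text.encode("utf-8") == ascii_bytes

# A lone byte above 0x7F is not valid UTF-8 on its own.
try:
    b"\xe7".decode("utf-8")
    lone_high_byte_ok = True
except UnicodeDecodeError:
    lone_high_byte_ok = False
```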

Those hours you've spent grokking this are not wasted, if you now have
a comprehension of characters vs encodings. More people in the world
need to understand that difference! :)

Chris Angelico
 
