Unicode blues in Python3

nn · Mar 23, 2010

I know that unicode is the way to go in Python 3.1, but it is getting
in my way right now in my Unix scripts. How do I write a chr(253) to a
file?

#nntst2.py
import sys,codecs
mychar=chr(253)
print(sys.stdout.encoding)
print(mychar)

> ./nntst2.py
ISO8859-1
ý

> ./nntst2.py >nnout2
Traceback (most recent call last):
  File "./nntst2.py", line 5, in <module>
    print(mychar)
UnicodeEncodeError: 'ascii' codec can't encode character '\xfd' in
position 0: ordinal not in range(128)

> cat nnout2
ascii

...Oh great!

OK, let's try this:
#nntst3.py
import sys,codecs
mychar=chr(253)
print(sys.stdout.encoding)
print(mychar.encode('latin1'))

> ./nntst3.py
ISO8859-1
b'\xfd'

> ./nntst3.py >nnout3

> cat nnout3
ascii
b'\xfd'

...Eh... not what I want really.

#nntst4.py
import sys,codecs
mychar=chr(253)
print(sys.stdout.encoding)
sys.stdout=codecs.getwriter("latin1")(sys.stdout)
print(mychar)

> ./nntst4.py
ISO8859-1
Traceback (most recent call last):
  File "./nntst4.py", line 6, in <module>
    print(mychar)
  File "Python-3.1.2/Lib/codecs.py", line 356, in write
    self.stream.write(data)
TypeError: must be str, not bytes

...OK, this is not working either.

Is there any way to write a value 253 to standard output?

Rami Chowdhury · Mar 23, 2010

The following code works for me:

$ cat nnout5.py
#!/usr/bin/python3.1

import sys
mychar = chr(253)
sys.stdout.write(mychar)
$ echo $(cat nnout)
ý

Can I ask why you're using print() in the first place, rather than writing
directly to a file? Python 3.x, AFAIK, distinguishes between text and binary
files and will let you specify the encoding you want for strings you write.
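
As a minimal sketch of the text/binary distinction mentioned here: opening a file in text mode with an explicit encoding makes Python encode each string on write, and reopening it in binary mode shows the bytes that were stored. The file name `nnout` and the temporary directory are placeholders for this illustration, not something from the original scripts.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output file, used only for this sketch.
    path = os.path.join(tmp, "nnout")

    # Text mode: the str is encoded with the codec named here.
    with open(path, "w", encoding="latin-1") as f:
        f.write(chr(253))

    # Binary mode: read back the raw bytes that were actually stored.
    with open(path, "rb") as f:
        data = f.read()

print(data)  # b'\xfd'
```

This only covers output that goes to a named file; it does not change what `print()` does when stdout is redirected by the shell.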

Hope that helps,
Rami

nn · Mar 23, 2010

#nntst5.py
import sys
mychar=chr(253)
sys.stdout.write(mychar)

./nntst5.py >nnout5

Traceback (most recent call last):
  File "./nntst5.py", line 4, in <module>
    sys.stdout.write(mychar)
UnicodeEncodeError: 'ascii' codec can't encode character '\xfd' in
position 0: ordinal not in range(128)

So sys.stdout.write() fails in the same way as print().

I use print so I can do tests and debug runs to the screen or pipe it
to some other tool and then configure the production bash script to
write the final output to a file of my choosing.

Gary Herron · Mar 23, 2010

nn said:
I know that unicode is the way to go in Python 3.1, but it is getting
in my way right now in my Unix scripts. How do I write a chr(253) to a
file?

Python 3 makes a distinction between bytes and string (i.e., unicode)
types, and you are still thinking in the Python 2 mode, which does *NOT*
make such a distinction. What you appear to want is to write a
particular byte to a file, so use the bytes type and a file opened in
binary mode:

>>> b = bytes([253])
>>> f = open("abc", 'wb')
>>> f.write(b)
1
>>> f.close()

> od abc -d

0000000 253
0000001

Hope that helps.

Gary Herron

nn · Mar 23, 2010

Actually what I want is to write a particular byte to standard output,
and I want this to work regardless of where that output gets sent to.
I am aware that I could do
open('nnout','w',encoding='latin1').write(mychar) but I am porting a
python2 program and don't want to rewrite everything that uses that
script.

Stefan Behnel · Mar 23, 2010

Are you writing text or binary data to stdout?

Stefan

nn · Mar 23, 2010

latin1 charset text.

Martin v. Loewis · Mar 23, 2010

nn said:
latin1 charset text.

Are you sure about that? If you carefully reconsider, could you come to
the conclusion that you are not writing text at all, but binary data?

If it really is text that you are writing, why do you need to use
U+00FD (LATIN SMALL LETTER Y WITH ACUTE)? To my knowledge, that
character is used quite infrequently in practice. That you are trying
to write it strongly suggests that what you are writing is not
actually text.

Also, your formulation suggests the same:

"Is there any way to write a value 253 to standard output?"

If you were really writing text, you'd ask

"Is there any way to write 'ý' to standard output?"

Regards,
Martin

Steven D'Aprano · Mar 24, 2010

nn said:
Actually what I want is to write a particular byte to standard output,
and I want this to work regardless of where that output gets sent to.

What do you mean "work"?

Do you mean "display a particular glyph" or something else?

In bash:

$ echo -e "\0101" # octal 101 = decimal 65
A
$ echo -e "\0375" # decimal 253
ï¿½

but if I change the terminal encoding, I get this:

$ echo -e "\0375"
Ã½

Or this:

$ echo -e "\0375"
Â²

depending on which encoding I use.
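
The same effect can be reproduced from Python by decoding the single byte 0xFD under different codecs. This is only a sketch: the thread does not say which terminal encodings were used, so `latin-1`, `cp437` and `utf-8` are assumptions chosen because they give three different results. Character names are printed instead of glyphs so that the output is plain ASCII regardless of the terminal.

```python
import unicodedata

raw = bytes([0o375])  # octal 375 == decimal 253 == 0xFD

names = {}
for encoding in ("latin-1", "cp437", "utf-8"):
    # errors="replace" substitutes U+FFFD where the byte is not valid
    ch = raw.decode(encoding, errors="replace")
    names[encoding] = unicodedata.name(ch)
    print(encoding, "U+%04X" % ord(ch), names[encoding])

# latin-1 U+00FD LATIN SMALL LETTER Y WITH ACUTE
# cp437 U+00B2 SUPERSCRIPT TWO
# utf-8 U+FFFD REPLACEMENT CHARACTER
```

The byte on the wire is identical in all three cases; only the interpretation on the receiving side changes.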

I think your question is malformed. You need to work out what behaviour
you actually want, before you can ask for help on how to get it.

nn · Mar 24, 2010

To be more precise, I am writing both text and binary data
together. That is, I am embedding text from another source into a stream
that uses non-ASCII characters as "control" characters. In Python 2 I
was processing it mostly as text containing a few "funny" characters.

nn · Mar 24, 2010

Yes, sorry, it is a bit ambiguous. I don't really care what the glyph is.
The program reading my output reads 8-bit values, expects the binary
value 0xFD as a control character, and lets everything else through as is.

Antoine Pitrou · Mar 24, 2010

print() writes to the text (unicode) layer of sys.stdout.
If you want to access the binary (bytes) layer, you must use
sys.stdout.buffer. So:

sys.stdout.buffer.write(chr(253).encode('latin1'))

or:

sys.stdout.buffer.write(bytes([253]))

See http://docs.python.org/py3k/library/io.html#io.TextIOBase.buffer
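
A slightly fuller sketch of the same idea, for the case where text and a control byte go out together. The helper name `emit` and its parameters are made up for this example; an in-memory `io.BytesIO` stands in for `sys.stdout.buffer` so the result can be inspected.

```python
import io
import sys


def emit(text, control=0xFD, out=None):
    """Write text followed by one raw control byte to a binary stream."""
    if out is None:
        out = sys.stdout.buffer  # binary layer underneath sys.stdout
    out.write(text.encode("latin-1"))
    out.write(bytes([control]))
    out.flush()


# Demonstrate against an in-memory binary stream instead of the real stdout.
demo = io.BytesIO()
emit("abc", out=demo)
print(demo.getvalue())  # b'abc\xfd'
```

If the same script also uses print(), it is probably safest to call sys.stdout.flush() before writing to sys.stdout.buffer, since the text layer keeps its own buffer and the two kinds of output could otherwise appear out of order.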

Michael Torrie · Mar 24, 2010

Steven said:
I think your question is malformed. You need to work out what behaviour
you actually want, before you can ask for help on how to get it.

It may or may not be malformed, but I understand the question. So let
me translate for you: how can he write arbitrary bytes (0x00 through
0xff) to stdout without having them mangled by encodings? It's a very
simple question, really. Looks like Antoine Pitrou has answered this
question quite nicely as well.

nn · Mar 24, 2010

Just what I needed! Now I have full control of the output.

Thanks Antoine. The new io stack is still a bit of a mystery to me.

Thanks everybody else, and sorry for confusing the issue. Latin1 just
happens to be very convenient to manipulate bytes and is what I
thought of initially to handle my mix of textual and non-textual data.
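
The reason Latin-1 is convenient here is that it maps each byte value n to the code point n, so any byte string survives a round trip through str. The sketch below checks that property and then shows one possible way to keep print() working with a fixed encoding, by wrapping a binary stream in `io.TextIOWrapper`. The `io.BytesIO` object is a stand-in for `sys.stdout.buffer`; whether replacing sys.stdout like this suits a given script is an assumption to verify, not a recommendation from the thread.

```python
import io

# Latin-1 round trip: byte value n <-> code point n, for all 256 values.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
assert [ord(c) for c in as_text] == list(range(256))
assert as_text.encode("latin-1") == all_bytes

# A text layer with a fixed encoding on top of a binary stream.
raw = io.BytesIO()  # stand-in for sys.stdout.buffer
out = io.TextIOWrapper(raw, encoding="latin-1", newline="\n")
print(chr(253), file=out)
out.flush()
print(raw.getvalue())  # b'\xfd\n'
```

In a real script the equivalent would be `sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="latin-1")` near the top. Setting the PYTHONIOENCODING environment variable to latin-1 in the calling shell script is another option that leaves the Python code untouched.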

John Nagle · Mar 24, 2010

nn said:
To be more informative I am both writing text and binary data
together. That is I am embedding text from another source into stream
that uses non-ascii characters as "control" characters. In Python2 I
was processing it mostly as text containing a few "funny" characters.

OK. Then you need to be writing arrays of bytes, not strings.
Encoding is your problem. This has nothing to do with Unicode.
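
One way to follow this advice is to assemble the whole output as bytes and only then hand it to the binary stream. The layout below (text chunks separated by the control byte) is purely hypothetical, since the thread does not describe the actual stream format; `build_stream` and `CONTROL` are names invented for this sketch.

```python
CONTROL = 0xFD  # control byte value mentioned in the thread


def build_stream(chunks, encoding="latin-1"):
    """Join text chunks into bytes, separated by the control byte."""
    sep = bytes([CONTROL])
    encoded = [chunk.encode(encoding) for chunk in chunks]
    for part in encoded:
        # Embedded text must not contain the control byte itself.
        if sep in part:
            raise ValueError("text contains the control byte")
    return sep.join(encoded)


payload = build_stream(["first", "second"])
print(payload)  # b'first\xfdsecond'
```

The result can then be written with sys.stdout.buffer.write(payload). The explicit check matters because, under Latin-1, the character 'ý' in the embedded text would encode to the same 0xFD byte that the reader treats as a control character.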

John Nagle
