Unicode blues in Python3

nn

I know that unicode is the way to go in Python 3.1, but it is getting
in my way right now in my Unix scripts. How do I write a chr(253) to a
file?

#nntst2.py
import sys,codecs
mychar=chr(253)
print(sys.stdout.encoding)
print(mychar)
> ./nntst2.py
ISO8859-1
ý

> ./nntst2.py >nnout2
Traceback (most recent call last):
  File "./nntst2.py", line 5, in <module>
    print(mychar)
UnicodeEncodeError: 'ascii' codec can't encode character '\xfd' in position 0: ordinal not in range(128)
cat nnout2
ascii

...Oh great!

OK, let's try this:
#nntst3.py
import sys,codecs
mychar=chr(253)
print(sys.stdout.encoding)
print(mychar.encode('latin1'))
./nntst3.py
ISO8859-1
b'\xfd'

./nntst3.py >nnout3
cat nnout3
ascii
b'\xfd'

...Eh... not what I want really.

#nntst4.py
import sys,codecs
mychar=chr(253)
print(sys.stdout.encoding)
sys.stdout=codecs.getwriter("latin1")(sys.stdout)
print(mychar)
> ./nntst4.py
ISO8859-1
Traceback (most recent call last):
  File "./nntst4.py", line 6, in <module>
    print(mychar)
  File "Python-3.1.2/Lib/codecs.py", line 356, in write
    self.stream.write(data)
TypeError: must be str, not bytes

...OK, this is not working either.

Is there any way to write a value 253 to standard output?
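
A possible repair of the nntst4.py attempt, offered as a sketch rather than a definitive answer: the writer returned by codecs.getwriter() encodes str into bytes, so it needs a binary stream underneath, whereas sys.stdout is a text stream. Wrapping the binary layer instead should avoid the TypeError. The example uses io.BytesIO as a stand-in for sys.stdout.buffer (an assumption, made only so the written bytes can be inspected).

```python
import codecs
import io

# Stand-in for sys.stdout.buffer so the written bytes can be inspected.
fake_stdout_buffer = io.BytesIO()

# The Latin-1 writer turns str into bytes, so it must wrap a binary stream.
writer = codecs.getwriter("latin1")(fake_stdout_buffer)

# print() hands str to the writer, which encodes it before writing.
print(chr(253), file=writer)

written = fake_stdout_buffer.getvalue()
print(written)  # b'\xfd\n'
```

In a real script the corresponding line would presumably be sys.stdout = codecs.getwriter("latin1")(sys.stdout.buffer), after which existing print() calls emit Latin-1 bytes whatever the locale reports.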
 
Rami Chowdhury

I know that unicode is the way to go in Python 3.1, but it is getting
in my way right now in my Unix scripts. How do I write a chr(253) to a
file?

#nntst2.py
import sys,codecs
mychar=chr(253)
print(sys.stdout.encoding)
print(mychar)

The following code works for me:

$ cat nnout5.py
#!/usr/bin/python3.1

import sys
mychar = chr(253)
sys.stdout.write(mychar)
$ echo $(cat nnout)
ý

Can I ask why you're using print() in the first place, rather than writing
directly to a file? Python 3.x, AFAIK, distinguishes between text and binary
files and will let you specify the encoding you want for strings you write.

Hope that helps,
Rami
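
A minimal sketch of the "specify the encoding" suggestion, assuming the output goes to a regular file. The temporary directory and the file name nnout are placeholders chosen for illustration.

```python
import os
import tempfile

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "nnout")

# Text mode with an explicit encoding: the str is encoded on write,
# independently of the locale's default encoding.
with open(path, "w", encoding="latin-1") as f:
    f.write(chr(253))

# Reading the file back in binary mode shows the single byte on disk.
with open(path, "rb") as f:
    raw = f.read()

print(raw)  # b'\xfd'
```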
 
nn

Rami said:
The following code works for me:

$ cat nnout5.py
#!/usr/bin/python3.1

import sys
mychar = chr(253)
sys.stdout.write(mychar)
$ echo $(cat nnout)
ý

Can I ask why you're using print() in the first place, rather than writing
directly to a file? Python 3.x, AFAIK, distinguishes between text and binary
files and will let you specify the encoding you want for strings you write.

Hope that helps,
Rami

#nntst5.py
import sys
mychar=chr(253)
sys.stdout.write(mychar)
./nntst5.py >nnout5
Traceback (most recent call last):
  File "./nntst5.py", line 4, in <module>
    sys.stdout.write(mychar)
UnicodeEncodeError: 'ascii' codec can't encode character '\xfd' in position 0: ordinal not in range(128)

So sys.stdout.write() fails the same way print() does when the output is redirected.

I use print so I can do tests and debug runs to the screen or pipe it
to some other tool and then configure the production bash script to
write the final output to a file of my choosing.
 
Gary Herron

nn said:
I know that unicode is the way to go in Python 3.1, but it is getting
in my way right now in my Unix scripts. How do I write a chr(253) to a
file?

Python 3 makes a distinction between bytes and string (i.e., unicode)
types, and you are still thinking in the Python 2 mode that does *NOT*
make such a distinction. What you appear to want is to write a
particular byte to a file -- so use the bytes type and a file opened in
binary mode:
>>> b=bytes([253])
>>> f = open("abc", 'wb')
>>> f.write(b)
1
>>> f.close()

> od abc -d
0000000 253
0000001


Hope that helps.

Gary Herron
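
To make the str/bytes distinction concrete, here is a small sketch (not part of Gary's session) comparing chr(253) with bytes([253]). Which bytes a str becomes depends entirely on the encoding chosen.

```python
text = chr(253)      # str: one code point, U+00FD
raw = bytes([253])   # bytes: one byte, 0xFD

print(type(text).__name__, type(raw).__name__)  # str bytes

# The same character encodes to different byte sequences.
latin1 = text.encode("latin-1")
utf8 = text.encode("utf-8")
print(latin1, utf8)  # b'\xfd' b'\xc3\xbd'

# Only the Latin-1 encoding happens to match the single byte 0xFD.
print(latin1 == raw, utf8 == raw)  # True False
```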
 
nn

Gary said:
nn said:
I know that unicode is the way to go in Python 3.1, but it is getting
in my way right now in my Unix scripts. How do I write a chr(253) to a
file?

Python 3 makes a distinction between bytes and string (i.e., unicode)
types, and you are still thinking in the Python 2 mode that does *NOT*
make such a distinction. What you appear to want is to write a
particular byte to a file -- so use the bytes type and a file opened in
binary mode:
b=bytes([253])
f = open("abc", 'wb')
f.write(b)
1
f.close()

od abc -d
0000000 253
0000001


Hope that helps.

Gary Herron


#nntst2.py
import sys,codecs
mychar=chr(253)
print(sys.stdout.encoding)
print(mychar)

Traceback (most recent call last):
  File "./nntst2.py", line 5, in <module>
    print(mychar)
UnicodeEncodeError: 'ascii' codec can't encode character '\xfd' in position 0: ordinal not in range(128)


ascii

...Oh great!

OK, let's try this:
#nntst3.py
import sys,codecs
mychar=chr(253)
print(sys.stdout.encoding)
print(mychar.encode('latin1'))


ascii
b'\xfd'

...Eh... not what I want really.

#nntst4.py
import sys,codecs
mychar=chr(253)
print(sys.stdout.encoding)
sys.stdout=codecs.getwriter("latin1")(sys.stdout)
print(mychar)

ISO8859-1
Traceback (most recent call last):
  File "./nntst4.py", line 6, in <module>
    print(mychar)
  File "Python-3.1.2/Lib/codecs.py", line 356, in write
    self.stream.write(data)
TypeError: must be str, not bytes

...OK, this is not working either.

Is there any way to write a value 253 to standard output?

Actually what I want is to write a particular byte to standard output,
and I want this to work regardless of where that output gets sent to.
I am aware that I could do
open('nnout','w',encoding='latin1').write(mychar) but I am porting a
python2 program and don't want to rewrite everything that uses that
script.
 
Stefan Behnel

nn, 23.03.2010 19:46:
Actually what I want is to write a particular byte to standard output,
and I want this to work regardless of where that output gets sent to.
I am aware that I could do
open('nnout','w',encoding='latin1').write(mychar) but I am porting a
python2 program and don't want to rewrite everything that uses that
script.

Are you writing text or binary data to stdout?

Stefan
 
Martin v. Loewis

nn said:
latin1 charset text.

Are you sure about that? If you carefully reconsider, could you come to
the conclusion that you are not writing text at all, but binary data?

If it really were text that you are writing, why would you need to use
U+00FD (LATIN SMALL LETTER Y WITH ACUTE)? To my knowledge, that
character is used very infrequently in practice. That you are trying to
write it strongly suggests that what you are writing is not actually
text.

Also, your formulation suggests the same:

"Is there any way to write a value 253 to standard output?"

If you were really writing text, you'd ask


"Is there any way to write 'ý' to standard output?"

Regards,
Martin
 
Steven D'Aprano

Actually what I want is to write a particular byte to standard output,
and I want this to work regardless of where that output gets sent to.

What do you mean "work"?

Do you mean "display a particular glyph" or something else?

In bash:

$ echo -e "\0101" # octal 101 = decimal 65
A
$ echo -e "\0375" # decimal 253
�

but if I change the terminal encoding, I get this:

$ echo -e "\0375"
ý

Or this:

$ echo -e "\0375"
²

depending on which encoding I use.

I think your question is malformed. You need to work out what behaviour
you actually want, before you can ask for help on how to get it.
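
The effect shown in the bash session can be reproduced in Python by decoding the same byte under different encodings. The three encodings below are guesses at what the terminal might have been set to; the post does not name them.

```python
raw = bytes([0o375])  # octal 375 = decimal 253 = 0xFD

# Assumed encodings, chosen for illustration only.
encodings = ("latin-1", "cp437", "utf-8")

# errors="replace" substitutes U+FFFD where the byte is not valid input.
glyphs = {enc: raw.decode(enc, errors="replace") for enc in encodings}

for enc in encodings:
    # ascii() keeps the output printable on any terminal encoding.
    print(enc, ascii(glyphs[enc]))
# latin-1 '\xfd'
# cp437 '\xb2'
# utf-8 '\ufffd'
```

The byte never changes; only its interpretation as a glyph does.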
 
nn

Martin said:
Are you sure about that? If you carefully reconsider, could you come to
the conclusion that you are not writing text at all, but binary data?

If it really were text that you are writing, why would you need to use
U+00FD (LATIN SMALL LETTER Y WITH ACUTE)? To my knowledge, that
character is used very infrequently in practice. That you are trying to
write it strongly suggests that what you are writing is not actually
text.

Also, your formulation suggests the same:

"Is there any way to write a value 253 to standard output?"

If you were really writing text, you'd ask


"Is there any way to write 'ý' to standard output?"

Regards,
Martin

To be more informative, I am writing both text and binary data
together. That is, I am embedding text from another source into a stream
that uses non-ASCII characters as "control" characters. In Python 2 I
was processing it mostly as text containing a few "funny" characters.
 
nn

Steven said:
What do you mean "work"?

Do you mean "display a particular glyph" or something else?

In bash:

$ echo -e "\0101" # octal 101 = decimal 65
A
$ echo -e "\0375" # decimal 253
�

but if I change the terminal encoding, I get this:

$ echo -e "\0375"
ý

Or this:

$ echo -e "\0375"
²

depending on which encoding I use.

I think your question is malformed. You need to work out what behaviour
you actually want, before you can ask for help on how to get it.

Yes, sorry, it is a bit ambiguous. I don't really care what the glyph is;
the program reading my output reads 8-bit values, expects the binary
value 0xFD as a control character, and lets everything else through as is.
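
A sketch of what such an output stream might look like when built as bytes. The layout (text fields separated by a single 0xFD byte) and the name build_stream are hypothetical; the thread does not describe the real protocol.

```python
CONTROL = bytes([0xFD])  # control byte expected by the downstream reader


def build_stream(chunks):
    """Join text chunks into one byte string, separated by the control byte.

    Hypothetical layout, for illustration only.
    """
    return CONTROL.join(chunk.encode("latin-1") for chunk in chunks)


stream = build_stream(["first field", "second field"])
print(stream)  # b'first field\xfdsecond field'
```

Note that under this assumption any 'ý' in the embedded text would also encode to 0xFD and collide with the control byte, so the text source would need to be checked or escaped.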
 
Antoine Pitrou

Le Tue, 23 Mar 2010 10:33:33 -0700, nn a écrit :
I know that unicode is the way to go in Python 3.1, but it is getting in
my way right now in my Unix scripts. How do I write a chr(253) to a
file?

#nntst2.py
import sys,codecs
mychar=chr(253)
print(sys.stdout.encoding)
print(mychar)

print() writes to the text (unicode) layer of sys.stdout.
If you want to access the binary (bytes) layer, you must use
sys.stdout.buffer. So:

sys.stdout.buffer.write(chr(253).encode('latin1'))

or:

sys.stdout.buffer.write(bytes([253]))

See http://docs.python.org/py3k/library/io.html#io.TextIOBase.buffer
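
One detail worth sketching when text-layer and buffer writes are mixed: the text layer keeps its own pending output, so flushing it before writing to .buffer keeps the bytes in order. The example below uses an io.TextIOWrapper over io.BytesIO as a stand-in for sys.stdout (an assumption, so the result can be inspected).

```python
import io

# Stand-in for sys.stdout: an ASCII text layer over a binary buffer.
raw_sink = io.BytesIO()
out = io.TextIOWrapper(raw_sink, encoding="ascii", newline="\n")

out.write("header\n")           # goes through the text (str) layer
out.flush()                     # push pending text down before raw bytes
out.buffer.write(bytes([253]))  # bypasses the ASCII encoder entirely
out.buffer.flush()

result = raw_sink.getvalue()
print(result)  # b'header\n\xfd'
```

With the real sys.stdout the same pattern would be sys.stdout.flush() followed by sys.stdout.buffer.write(...).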
 
Michael Torrie

Steven said:
I think your question is malformed. You need to work out what behaviour
you actually want, before you can ask for help on how to get it.

It may or may not be malformed, but I understand the question. So let
me translate for you: how can he write arbitrary bytes (0x00 through
0xff) to stdout without having them mangled by encodings? It's a very
simple question, really. Looks like Antoine Pitrou has answered this
question quite nicely as well.
 
nn

Antoine said:
Le Tue, 23 Mar 2010 10:33:33 -0700, nn a écrit :
I know that unicode is the way to go in Python 3.1, but it is getting in
my way right now in my Unix scripts. How do I write a chr(253) to a
file?

#nntst2.py
import sys,codecs
mychar=chr(253)
print(sys.stdout.encoding)
print(mychar)

print() writes to the text (unicode) layer of sys.stdout.
If you want to access the binary (bytes) layer, you must use
sys.stdout.buffer. So:

sys.stdout.buffer.write(chr(253).encode('latin1'))

or:

sys.stdout.buffer.write(bytes([253]))

See http://docs.python.org/py3k/library/io.html#io.TextIOBase.buffer

Just what I needed! Now I have full control of the output.

Thanks Antoine. The new io stack is still a bit of a mystery to me.

Thanks everybody else, and sorry for confusing the issue. Latin1 just
happens to be very convenient for manipulating bytes and is what I
thought of initially to handle my mix of textual and non-textual data.
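
The convenience mentioned here comes from Latin-1 mapping code points 0 through 255 one-to-one onto byte values 0 through 255. A short check of that property:

```python
all_bytes = bytes(range(256))

# Every byte value decodes to the code point with the same number.
as_text = all_bytes.decode("latin-1")
same_numbers = all(ord(ch) == b for ch, b in zip(as_text, all_bytes))

# Encoding again gives back exactly the original bytes.
round_trip = as_text.encode("latin-1")

print(len(as_text), same_numbers, round_trip == all_bytes)  # 256 True True
```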
 
John Nagle

nn said:
To be more informative, I am writing both text and binary data
together. That is, I am embedding text from another source into a stream
that uses non-ASCII characters as "control" characters. In Python 2 I
was processing it mostly as text containing a few "funny" characters.

OK. Then you need to be writing arrays of bytes, not strings.
Encoding is your problem. This has nothing to do with Unicode.

John Nagle
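
As an illustration of working with arrays of bytes, here is a sketch that accumulates output in a bytearray. It also shows that Python 3 refuses to mix str into bytes implicitly, which is where a Python 2 port tends to break. The sample strings are made up.

```python
payload = bytearray()
payload += "text from another source".encode("latin-1")
payload.append(0xFD)  # the control byte, added as a plain integer

try:
    payload += "more text"  # a str, not bytes: rejected
except TypeError as exc:
    mixing_error = exc
    print("refused:", type(exc).__name__)  # refused: TypeError

# Encoding explicitly is what makes the concatenation legal.
payload += "more text".encode("latin-1")
print(bytes(payload))  # b'text from another source\xfdmore text'
```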
 
