python 3.3 repr

Robin Becker

I'm trying to understand what's going on with this simple program

if __name__=='__main__':
    print("repr=%s" % repr(u'\xc1'))
    print("%%r=%r" % u'\xc1')

On my windows XP box this fails miserably if run directly at a terminal

C:\tmp> \Python33\python.exe bang.py
Traceback (most recent call last):
  File "bang.py", line 2, in <module>
    print("repr=%s" % repr(u'\xc1'))
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in position 6: character maps to <undefined>

If I run the program redirected into a file then no error occurs and the
result looks like this

C:\tmp>cat fff
repr='┴'
%r='┴'

and if I run it into a pipe it works as though into a file.

It seems that repr thinks it can render u'\xc1' directly which is a problem
since print then seems to want to convert that to cp437 if directed into a terminal.

I find the idea that print knows what it's printing to a bit dangerous, but it's
the repr behaviour that strikes me as bad.

What is responsible for defining the repr function's 'printable' so that repr
would give me say an Ascii rendering?
-confused-ly yrs-
Robin Becker
 
Ned Batchelder

In Python3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ascii representation, there is the new builtin ascii(), and a corresponding %a format string.
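
A minimal sketch of the difference described above, for Python 3.3 or later. The variable names are only illustrative, and only the ASCII-safe results are printed so that the snippet itself does not trip over a limited console encoding:

```python
s = "\xc1"  # LATIN CAPITAL LETTER A WITH ACUTE

as_repr = repr(s)              # keeps the printable non-ASCII character
as_ascii = ascii(s)            # escapes it, as Python 2's repr() did
via_percent_a = "%a" % s       # %a applies ascii(), as %r applies repr()
via_format = "{!a}".format(s)  # the same conversion in str.format()

# These three are pure ASCII, so they print on any console.
print(as_ascii, via_percent_a, via_format)
```

Here as_repr contains the character itself, while the other three contain the six-character escape \xc1 between quotes.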

--Ned.
 
Robin Becker

On 15/11/2013 11:38, Ned Batchelder wrote:
...........
In Python3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ascii representation, there is the new builtin ascii(), and a corresponding %a format string.

--Ned.

thanks for this, it doesn't make the split across python2 - 3 any easier.
 
Ned Batchelder

On 15/11/2013 11:38, Ned Batchelder wrote:
..........

thanks for this, it doesn't make the split across python2 - 3 any easier.

No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:

try:
    repr = ascii
except NameError:
    pass

and then use repr throughout.

--Ned.
 
Roy Smith

Ned Batchelder said:
In Python3, repr() will return a Unicode string, and will preserve existing
Unicode characters in its arguments. This has been controversial. To get
the Python 2 behavior of a pure-ascii representation, there is the new
builtin ascii(), and a corresponding %a format string.

I'm still stuck on Python 2, and while I can understand the controversy ("It breaks my Python 2 code!"), this seems like the right thing to have done. In Python 2, unicode is an add-on. One of the big design drivers in Python 3 was to make unicode the standard.

The idea behind repr() is to provide a "just plain text" representation of an object. In P2, "just plain text" means ascii, so escaping non-ascii characters makes sense. In P3, "just plain text" means unicode, so escaping non-ascii characters no longer makes sense.

Some of us have been doing this long enough to remember when "just plain text" meant only a single case of the alphabet (and a subset of ascii punctuation). On an ASR-33, your C program would print like:

MAIN() \(
PRINTF("HELLO, ASCII WORLD");
\)

because ASR-33's didn't have curly braces (or lower case).

Having P3's repr() escape non-ascii characters today makes about as much sense as expecting P2's repr() to escape curly braces (and vertical bars, and a few others) because not every terminal can print those.
 
Robin Becker

On 15/11/2013 13:54, Ned Batchelder wrote:
..........
No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:

try:
    repr = ascii
except NameError:
    pass
.....
yes I tried that, but it doesn't affect %r which is inlined in unicodeobject.c,
for me it seems easier to fix windows to use something like a standard encoding
of utf8 ie cp65001, but that's quite hard to do globally. It seems sitecustomize
is too late to set os.environ['PYTHONIOENCODING'], perhaps I can stuff that into
one of the global environment vars and have it work for all python invocations.
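
The dependence on the output encoding can be reproduced without a Windows console. The sketch below is an illustration, not how print() is implemented: render() is a hypothetical helper that wraps an in-memory buffer in a text stream with a chosen encoding. It assumes the redirected output was written as cp1252, a common Windows default for Western locales, which would explain the box-drawing character seen when the file is displayed in a cp437 console.

```python
import io


def render(text, encoding, errors="strict"):
    """Return the bytes print() would write to a stream with this encoding."""
    buf = io.BytesIO()
    stream = io.TextIOWrapper(buf, encoding=encoding, errors=errors, newline="\n")
    print(text, file=stream)
    stream.flush()
    return buf.getvalue()


line = "%%r=%r" % "\xc1"  # repr() keeps the non-ASCII character

# A cp437 stream with strict error handling cannot encode the character.
try:
    render(line, "cp437")
    console_result = "ok"
except UnicodeEncodeError:
    console_result = "UnicodeEncodeError"

# Assumed file case: cp1252 encodes it as the single byte 0xC1 ...
file_bytes = render(line, "cp1252")
# ... which a cp437 console then displays as a box-drawing character.
seen_in_console = file_bytes.decode("cp437")

# With the backslashreplace error handler the same stream never fails.
escaped = render(line, "cp437", errors="backslashreplace")
```

Setting PYTHONIOENCODING in the environment before the interpreter starts (it also accepts an encoding:errors form, for example with backslashreplace as the error handler) changes the encoding and error handling of the standard streams in a similar way.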
 
Serhiy Storchaka

15.11.13 15:54, Ned Batchelder wrote:
No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:

try:
    repr = ascii
except NameError:
    pass

and then use repr throughout.

Or rather

try:
    ascii
except NameError:
    ascii = repr

and then use ascii throughout.
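
A short sketch of how this shim might be used in code that has to run on both Python 2 and Python 3.3+. The describe() helper is hypothetical; the point is that calling ascii() explicitly and formatting with %s sidesteps %r, which the shim cannot change:

```python
try:
    ascii            # built in on Python 3
except NameError:    # Python 2 has no ascii(); its repr() already escapes
    ascii = repr


def describe(value):
    """Hypothetical helper: format a value with ASCII-only escaping."""
    return "value=%s" % ascii(value)


print(describe(u"\xc1"))
```

On Python 2 the result would carry the u prefix of the unicode repr, so the two versions still differ in that detail.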
 
Robin Becker

...........
I'm still stuck on Python 2, and while I can understand the controversy ("It breaks my Python 2 code!"), this seems like the right thing to have done. In Python 2, unicode is an add-on. One of the big design drivers in Python 3 was to make unicode the standard.

The idea behind repr() is to provide a "just plain text" representation of an object. In P2, "just plain text" means ascii, so escaping non-ascii characters makes sense. In P3, "just plain text" means unicode, so escaping non-ascii characters no longer makes sense.

unfortunately the word 'printable' got into the definition of repr; it's clear
that printability is not the same as unicode at least as far as the print
function is concerned. In my opinion it would have been better to leave the old
behaviour as that would have eased the compatibility.

The python gods don't count that sort of thing as important enough so we get the
mess that is the python2/3 split. ReportLab has to do both so it's a real issue;
in addition swapping the str - unicode pair to bytes str doesn't help one's
mental models either :(

Things went wrong when utf8 was not adopted as the standard encoding thus
requiring two string types, it would have been easier to have a len function to
count bytes as before and a glyphlen to count glyphs. Now as I understand it we
have a complicated mess under the hood for unicode objects so they have a
variable representation to approximate an 8 bit representation when suitable etc
etc etc.
Some of us have been doing this long enough to remember when "just plain text" meant only a single case of the alphabet (and a subset of ascii punctuation). On an ASR-33, your C program would print like:

MAIN() \(
PRINTF("HELLO, ASCII WORLD");
\)

because ASR-33's didn't have curly braces (or lower case).

Having P3's repr() escape non-ascii characters today makes about as much sense as expecting P2's repr() to escape curly braces (and vertical bars, and a few others) because not every terminal can print those.
......
I can certainly remember those days, how we cried and laughed when 8 bits became
popular.
 
Joel Goldstick

Some of us have been doing this long enough to remember when "just plain
.....
I can certainly remember those days, how we cried and laughed when 8 bits
became popular.
Really? you cried and laughed over 7 vs. 8 bits? That's lovely (?).
;). That eighth bit sure was less confusing than codepoint
translations

 
Robin Becker

On 15/11/2013 14:40, Serhiy Storchaka wrote:
.......

Or rather

try:
    ascii
except NameError:
    ascii = repr

and then use ascii throughout.

apparently you can import ascii from future_builtins and the print() function is
available as

from __future__ import print_function

nothing fixes all those %r formats to be %a though :(
 
Robin Becker

............
Really? you cried and laughed over 7 vs. 8 bits? That's lovely (?).
;). That eighth bit sure was less confusing than codepoint
translations


no we had 6 bits in 60 bit words as I recall; extracting the nth character
involved division by 6; smart people did tricks with inverted multiplications
etc etc :(
 
Joel Goldstick

...........




no we had 6 bits in 60 bit words as I recall; extracting the nth character
involved division by 6; smart people did tricks with inverted
multiplications etc etc :(
--

Cool, someone here is older than me! I came in with the 8080, and I
remember split octal, but sixes are something I missed out on.
 
Ned Batchelder

Things went wrong when utf8 was not adopted as the standard encoding thus
requiring two string types, it would have been easier to have a len function to
count bytes as before and a glyphlen to count glyphs. Now as I understand it we
have a complicated mess under the hood for unicode objects so they have a
variable representation to approximate an 8 bit representation when suitable etc
etc etc.

Dealing with bytes and Unicode is complicated, and the 2->3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault. However you store Unicode codepoints, they are different than bytes, and it is complex having to deal with both. You can't somehow make the dichotomy go away, you can only choose where you want to think about it.

--Ned.
 
Chris Angelico

..........


unfortunately the word 'printable' got into the definition of repr; it's
clear that printability is not the same as unicode at least as far as the
print function is concerned. In my opinion it would have been better to
leave the old behaviour as that would have eased the compatibility.

"Printable" means many different things in different contexts. In some
contexts, the sequence \x66\x75\x63\x6b is considered unprintable, yet
each of those characters is perfectly displayable in its natural form.
Under IDLE, non-BMP characters can't be displayed (or at least, that's
how it has been; I haven't checked current status on that one). On
Windows, the console runs in codepage 437 by default (again, I may be
wrong here), so anything not representable in that has to be escaped.
My Linux box has its console set to full Unicode, everything working
perfectly, so any non-control character can be printed. As far as
Python's concerned, all of that is outside - something is "printable"
if it's printable within Unicode, and the other hassles are matters of
encoding. (Except the first one. I don't think there's an encoding
"g-rated".)
The python gods don't count that sort of thing as important enough so we get
the mess that is the python2/3 split. ReportLab has to do both so it's a
real issue; in addition swapping the str - unicode pair to bytes str doesn't
help one's mental models either :(

That's fixing, in effect, a long-standing bug - of a sort. The name
"str" needs to be applied to the most normal string type. As of Python
3, that's a Unicode string, which is as it should be. In Python 2, it
was the ASCII/bytes string, which still fit the description of "most
normal string type", but that means that Python 2 programs are
Unicode-unaware by default, which is a flaw. Hence the Py3 fix.
Things went wrong when utf8 was not adopted as the standard encoding thus
requiring two string types, it would have been easier to have a len function
to count bytes as before and a glyphlen to count glyphs. Now as I understand
it we have a complicated mess under the hood for unicode objects so they
have a variable representation to approximate an 8 bit representation when
suitable etc etc etc.

http://unspecified.wordpress.com/20...e-of-language-level-abstract-unicode-strings/

There are languages that do what you describe. It's very VERY easy to
break stuff. What happens when you slice a string?

>>> foo = "asdf"
>>> foo[:2],foo[2:]
('as', 'df')
>>> foo = "q\u1234zy"
>>> foo[:2],foo[2:]
('qሴ', 'zy')

Looks good to me. I split a four-character string, I get two
two-character strings. If that had been done in UTF-8, either I would
need to know "don't split at that boundary, that's between bytes in a
character", or else the indexing and slicing would have to be done by
counting characters from the beginning of the string - an O(n)
operation, rather than an O(1) pointer arithmetic, not to mention that
it'll blow your CPU cache (touching every part of a potentially-long
string) just to find the position.
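
The boundary problem can be seen by comparing a slice of the string with a slice of its UTF-8 encoding. This is only a sketch of the argument above; the variable names are illustrative:

```python
foo = "q\u1234zy"

# Slicing the str works on code points, so the halves are always valid text.
left, right = foo[:2], foo[2:]

# The same text as UTF-8 is six bytes: U+1234 takes three of them.
raw = foo.encode("utf-8")

# Cutting the bytes at index 2 lands in the middle of that character.
try:
    raw[:2].decode("utf-8")
    byte_slice_ok = True
except UnicodeDecodeError:
    byte_slice_ok = False

print(len(foo), len(raw), byte_slice_ok)
```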

The only reliable way to manage things is to work with true Unicode.
You can completely ignore the internal CPython representation; what
matters is that Python (any implementation, as long as it conforms
with version 3.3 or later) lets you index Unicode codepoints out of a
Unicode string, without differentiating between those that happen to
be ASCII, those that fit in a single byte, those that fit in two
bytes, and those that are flagged RTL, because none of those
considerations makes any difference to you.

It takes some getting your head around, but it's worth it - same as
using git instead of a Windows shared drive. (I'm still trying to push
my family to think git.)

ChrisA
 
Robin Becker

On 15/11/2013 15:07, Joel Goldstick wrote:
.........


Cool, someone here is older than me! I came in with the 8080, and I
remember split octal, but sixes are something I missed out on.

The pdp 10/15 had 18 bit words and could be organized as 3*6 or 2*9, pdp 8s had
12 bits I think, then came the IBM 7094 which had 36 bits and finally the
CDC6000 & 7600 machines with 60 bits, some one must have liked 6's
-mumbling-ly yrs-
Robin Becker
 
Roy Smith

The pdp 10/15 had 18 bit words and could be organized as 3*6 or 2*9

I don't know about the 15, but the 10 had 36 bit words (18-bit halfwords). One common character packing was 5 7-bit characters per 36 bit word (with the sign bit left over).

Anybody remember RAD-50? It let you pack three characters into a 16 bit word, so a 6-character filename (plus a 3-character extension) took three words. RT-11 used it, not sure if it showed up anywhere else.
 
Robin Becker

..........
Dealing with bytes and Unicode is complicated, and the 2->3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault. However you store Unicode code points, they are different than bytes, and it is complex having to deal with both. You can't somehow make the dichotomy go away, you can only choose where you want to think about it.

--Ned.
........
I don't think that's what I said; the flexible representation is just an added
complexity that has come about because of the wish to store strings in a compact
way. The requirement for such complexity is the unicode type itself (especially
the storage requirements) which necessitated some remedial action.

There's no point in fighting the change to using unicode. The type wasn't
required for any technical reason as other languages didn't go this route and
are reasonably ok, but there's no doubt the change made things more difficult.
 
Antoon Pardon

Op 15-11-13 16:39, Robin Becker schreef:
.........
.......
I don't think that's what I said; the flexible representation is just an
added complexity ...

No it is not, at least not for python programmers. (It of course is for
the python implementors). The python programmer doesn't have to care
about the flexible representation, just as the python programmer doesn't
have to care about the internal representation of (long) integers. It
is an implementation detail that is mostly ignorable.
 
Chris Angelico

.......
I don't think that's what I said; the flexible representation is just an
added complexity that has come about because of the wish to store strings in
a compact way. The requirement for such complexity is the unicode type
itself (especially the storage requirements) which necessitated some
remedial action.

There's no point in fighting the change to using unicode. The type wasn't
required for any technical reason as other languages didn't go this route
and are reasonably ok, but there's no doubt the change made things more
difficult.

There's no perceptible difference between a 3.2 wide build and the 3.3
flexible representation. (Differences with narrow builds are bugs, and
have now been fixed.) As far as your script's concerned, Python 3.3
always stores strings in UTF-32, four bytes per character. It just
happens to be way more efficient on memory, most of the time.
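
A rough way to observe this on CPython 3.3 or later. The exact byte counts are an implementation detail and will vary between versions and platforms, so the sketch only compares them:

```python
import sys

n = 1000
ascii_text = "a" * n             # every code point fits in one byte
bmp_text = "\u1234" * n          # every code point needs two bytes
astral_text = "\U00012345" * n   # every code point needs four bytes

# All three behave identically from the script's point of view.
lengths = [len(t) for t in (ascii_text, bmp_text, astral_text)]

# Only the memory footprint differs (CPython-specific numbers).
sizes = [sys.getsizeof(t) for t in (ascii_text, bmp_text, astral_text)]

print(lengths, sizes)
```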

Other languages _have_ gone for at least some sort of Unicode support.
Unfortunately quite a few have done a half-way job and use UTF-16 as
their internal representation. That means there's no difference
between U+0012, U+0123, and U+1234, but U+12345 suddenly gets handled
differently. ECMAScript actually specifies the perverse behaviour of
treating codepoints >U+FFFF as two elements in a string, because it's
just too costly to change.
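
The UTF-16 split can be illustrated from Python by counting 16-bit code units in the encoded form. This is a sketch of the behaviour described above, not of any particular language's implementation:

```python
def utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp = "\u1234"          # inside the Basic Multilingual Plane
astral = "\U00012345"   # outside it, needs a surrogate pair in UTF-16

# Python 3.3+ sees one code point in both cases.
code_points = (len(bmp), len(astral))

# A UTF-16 based string type would report 1 and 2 elements respectively.
code_units = (utf16_units(bmp), utf16_units(astral))

print(code_points, code_units)
```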

There are a small number of languages that guarantee correct Unicode
handling. I believe bash scripts get this right (though I haven't
tested; string manipulation in bash isn't nearly as rich as a proper
text parsing language, so I don't dig into it much); Pike is a very
Python-like language, and PEP 393 made Python even more Pike-like,
because Pike's string has been variable width for as long as I've
known it. A handful of other languages also guarantee UTF-32
semantics. All of them are really easy to work with; instead of
writing your code and then going "Oh, I wonder what'll happen if I
give this thing weird characters?", you just write your code, safe in
the knowledge that there is no such thing as a "weird character"
(except for a few in the ASCII set... you may find that code breaks if
given a newline in the middle of something, or maybe the slash
confuses you).

Definitely don't fight the change to Unicode, because it's not a
change at all... it's just fixing what was buggy. You already had a
difference between bytes and characters, you just thought you could
ignore it.

ChrisA
 
Gene Heskett

Cool, someone here is older than me! I came in with the 8080, and I
remember split octal, but sixes are something I missed out on.

Ok, if you are feeling old & decrepit, how's this for a birthday: 10/04/34,
I came into micro computers about RCA 1802 time. Wrote a program for the
1802 without an assembler, for tape editing in '78 at KRCR-TV in Redding
CA, that was still in use in '94, but never really wrote assembly code
until the 6809 was out in the Radio Shack Color Computers. os9 on the
coco's was the best teacher about the unix way of doing things there ever
was. So I tell folks these days that I am 39, with 40 years experience at
being 39. ;-)


Cheers, Gene
 
