Python 3.2 has some deadly infection

Steven D'Aprano

........
I'm fairly casual, and I did understand that the rant was a bit over the
top; for fairly practical reasons I have always regarded the std streams
as allowing binary data, and always objected to having to open files in
python with a 't' or 'b' mode to cope with line ending issues.

Isn't it a bit old fashioned to think everything is connected to a
console?

The whole concept of stdin and stdout is based on the idea of having a
console to read from and write to. Otherwise, what would be the point?
Classic Mac (pre OS X) had no command line interface, and nothing
even remotely like stdin and stdout. But once you have a console, stdin,
stdout, and stderr become useful. And once you have them, then you can
extend the concept using redirection and pipes. But fundamentally, stdin
and stdout are about consoles.

I think the idea that we only give meaning to binary data using
encodings is a bit limiting. A zip or gif file has structure, but I
don't think it's reasonable to regard such a file as having an encoding
in the python unicode sense.

In the Unicode sense? Of course not, that would be silly.

The concept of encodings is bigger than just text, and in that sense zip
compression is an encoding which encodes non-random data into a different
format which generally takes up less space.
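
To make the distinction concrete, here is a minimal sketch that puts the two senses of "encoding" side by side: a text encoding (str to bytes) and zlib compression (bytes to, usually, fewer bytes). The sample string and the variable names are illustrative assumptions, not anything taken from the posts.

```python
import zlib

# Illustrative sample data; any repetitive (non-random) text will do.
text = "spam and eggs " * 100

# Encoding in the Unicode sense: text (str) -> bytes.
raw = text.encode("utf-8")

# "Encoding" in the broader sense: bytes -> a different, usually smaller, byte format.
packed = zlib.compress(raw)

# Both steps are reversible, and must be undone in the opposite order.
restored = zlib.decompress(packed).decode("utf-8")

print(len(raw), len(packed), restored == text)
```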
 
Gregory Ewing

Steven said:
The whole concept of stdin and stdout is based on the idea of having a
console to read from and write to.

Not really; stdin and stdout are frequently connected to
files, or pipes to other processes. The console, if it
exists, just happens to be a convenient default value for
them. Even on a system without a console, they're still
a useful abstraction.
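
One way to see that the console is only a default is to ask the operating system what a descriptor is actually attached to. The sketch below is an illustration for POSIX-like systems; `describe_fd` is a hypothetical helper name, and the labels it returns are arbitrary.

```python
import os
import stat


def describe_fd(fd):
    """Return a rough label for whatever the file descriptor is attached to."""
    if os.isatty(fd):
        return "terminal"
    mode = os.fstat(fd).st_mode
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISREG(mode):
        return "regular file"
    return "something else"
```

Calling `describe_fd(0)` from a script run as `python script.py`, as `python script.py < data.txt`, and as `cat data.txt | python script.py` should report a terminal, a regular file, and a pipe respectively.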

But we were talking about encodings, and whether stdin
and stdout should be text or binary by default. Well,
one of the design principles behind unix is to make use
of plain text wherever possible. Not just for stuff
meant to be seen on the screen, but for stuff kept in
files as well.

As a result, most unix programs, most of the time, deal
with text on stdin and stdout. So, it makes sense for
them to be text by default. And wherever there's text,
there needs to be an encoding. This is true whether
a console is involved or not.
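
A small sketch of that last point: a text stream is a byte stream plus an encoding. Here `io.TextIOWrapper` is layered over an in-memory byte buffer standing in for stdout; the sample string is an arbitrary choice.

```python
import io

# An in-memory byte stream standing in for a file, pipe or terminal.
raw = io.BytesIO()

# The text layer: it needs an encoding to turn str into bytes.
text = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
text.write("naïve café\n")
text.flush()

# What actually reached the underlying byte stream.
data = raw.getvalue()
print(data)  # b'na\xc3\xafve caf\xc3\xa9\n'
```

In a running interpreter, `sys.stdout` is such a wrapper, and the underlying byte stream is available as `sys.stdout.buffer`.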
 
Akira Li

Steven D'Aprano said:
The whole concept of stdin and stdout is based on the idea of having a
console to read from and write to. Otherwise, what would be the point?
Classic Mac (pre OS X) had no command line interface, and nothing
even remotely like stdin and stdout. But once you have a console, stdin,
stdout, and stderr become useful. And once you have them, then you can
extend the concept using redirection and pipes. But fundamentally, stdin
and stdout are about consoles.

We can consider the "pipes" abstraction to be fundamental. Decades of usage
prove the usefulness of a pipeline of processes, e.g.:

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

See http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/

Whether or not a pipe is connected to a tty is a small
detail. stdin/stdout is about pipes, not consoles.
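
For comparison, a rough Python sketch of what that pipeline computes: the `n` most frequent words in a text. The function name `top_words` is made up for illustration, and ties may come out in a different order than `sort -rn` would give.

```python
import re
from collections import Counter


def top_words(text, n):
    """Return the n most common words as (word, count) pairs."""
    # tr -cs A-Za-z: keep runs of ASCII letters, one word per item
    words = re.findall(r"[A-Za-z]+", text)
    # tr A-Z a-z: fold to lower case
    # sort | uniq -c | sort -rn | sed: count, order by frequency, keep n
    return Counter(word.lower() for word in words).most_common(n)


print(top_words("The cat and the hat and THE bat", 2))  # [('the', 3), ('and', 2)]
```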
 
Marko Rauhamaa

Gregory Ewing said:
As a result, most unix programs, most of the time, deal
with text on stdin and stdout.

Well, ok. But even accepting that premise, that "text" might not be what
Python3 considers "text".

For example, if your program reads in XML, JSON or Python, the parser
object might prefer to take it in as bytes and not have it predecoded by
sys.stdin.
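
A minimal sketch of that case, assuming a JSON document arrives on stdin: `json.loads` accepts `bytes` directly (Python 3.6 and later) and works out the UTF encoding itself, so the program can read from the binary layer instead of the decoded one. `load_json_bytes` is a hypothetical helper name.

```python
import io
import json


def load_json_bytes(stream):
    """Parse a JSON document from a binary stream without decoding it first."""
    data = stream.read()      # bytes, exactly as they arrived
    return json.loads(data)   # the parser detects the UTF encoding itself


# Stand-in for sys.stdin.buffer so the example runs anywhere.
fake_stdin = io.BytesIO('{"name": "Zoë"}'.encode("utf-8"))
result = load_json_bytes(fake_stdin)
print(result)
```

In a real program the call would be `load_json_bytes(sys.stdin.buffer)`.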
Gregory Ewing said:
So, it makes sense for them to be text by default.

I'm not sure. That could lead to nasty surprises.

I've experienced analogous consternations when the "sort" utility hasn't
worked identically for identical input: it is heavily influenced by the
(spit, spit) locale. That's why 99.9% of your scripts should prefix
"sort" and "grep" with LC_ALL=C -- even when the input really is UTF-8.

Should I now take it further and prefix all Python programs with
LC_ALL=C? Probably not, since UTF-8 might cause sys.stdin to barf.

And wherever there's text, there needs to be an encoding.

No problem there, only should sys.stdin and sys.stdout carry the
decoding/encoding out or should it be left for the program?


Marko
 
Chris Angelico

No problem there, only should sys.stdin and sys.stdout carry the
decoding/encoding out or should it be left for the program?

The most normal thing to do with the standard streams is to have them
produce text, and as much as possible, you shouldn't have to go to
great lengths to make that work. If, in Python, I say print("Hello,
world!"), I expect that to produce a line of text on the screen,
without my code having to encode that to bytes, figure out what sort
of newline to add, etc, etc.

Even if stdout isn't a tty, chances are you're still working with
text. Only an extreme few Unix programs actually manipulate binary
standard streams (some, like cat, will pipe binary through unchanged,
but even cat assumes text for options like -n); those few should be
the ones to have to worry about setting stdin and stdout to be binary.
In the same way that we have double-quoted strings being Unicode
strings, we should have print() and input() "naturally just work" with
Unicode, which means they should negotiate encodings with the system
without the programmer having to lift a finger.

ChrisA
 
Marko Rauhamaa

Chris Angelico said:
If, in Python, I say print("Hello, world!"), I expect that to produce
a line of text on the screen, without my code having to encode that to
bytes, figure out what sort of newline to add, etc, etc.

That example in no way represents the typical Python program (if there
is one).

Only an extreme few Unix programs actually manipulate binary standard
streams

That's quite an assumption to make.

we should have print() and input() "naturally just work" with Unicode

No problem there. I couldn't imagine using either function for anything
serious.


Marko
 
Steven D'Aprano

Not really; stdin and stdout are frequently connected to files, or pipes
to other processes. The console, if it exists, just happens to be a
convenient default value for them. Even on a system without a console,
they're still a useful abstraction.

If you had kept reading my post, including the bits you cut out *wink*,
you'd see that I did raise that same point. Having stdin and stdout
trivially generalises to the idea of replacing them with other files, or
pipes. But the idea of having standard input and standard output in the
first place comes about because they are useful for the console. I gave
the example of Mac, which didn't have a command-line interface at all,
hence no console, no stdin, no stdout.

If a system had no command line interface (hence no consoles), why would
you bother with a *standard* input file and output file that are never
used?

But we were talking about encodings, and whether stdin and stdout should
be text or binary by default. Well, one of the design principles behind
unix is to make use of plain text wherever possible.

What's plain text? *half a wink*

It's a serious question. Some people think that "good ol' plain text" is
EBCDIC, like IBM intended. To them, the letter "A" is synonymous with the
byte 0xC1, and there's no need for an encoding (or so they think) because
"A" *is* 0xC1.

Of course, people on ASCII systems know better: who needs encodings when
it is a universal fact that "A" *is* 0x41?

*wink*
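
The point is easy to check from Python, which ships codecs for both worlds. This sketch assumes `cp037` (one of several EBCDIC code pages) as the EBCDIC representative.

```python
# The same one-character text, encoded two ways.
letter = "A"

ebcdic = letter.encode("cp037")   # an EBCDIC code page
ascii_ = letter.encode("ascii")

print(ebcdic.hex(), ascii_.hex())  # c1 41

# Decoding with the wrong assumption silently gives a different character.
wrong = ebcdic.decode("latin-1")
print(repr(wrong))
```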

Not just for stuff
meant to be seen on the screen, but for stuff kept in files as well.

As a result, most unix programs, most of the time, deal with text on
stdin and stdout. So, it makes sense for them to be text by default. And
wherever there's text, there needs to be an encoding. This is true
whether a console is involved or not.


Agreed.
 
Chris Angelico

That example in no way represents the typical Python program (if there
is one).

It's simpler than most, but use of print() is certainly quite common.
A naive search of .py files in my /usr came up with five thousand
instances of ' print(', and given that that search won't necessarily
find a Python 2 print statement (and I'm on Debian Wheezy, so Py2 is
the system Python), I think that's a fairly respectable figure.

That's quite an assumption to make.

Okay. Start listing some. You have (de)compression programs like gzip,
which primarily work with files but can work with standard streams;
some image or movie manipulation programs (eg avconv) can also read
from stdin, although again, it's far more common to use files; cat
will happily transmit binary untouched, but all its options (at least
the ones I can see in my 'man cat') are for working with text.

What else do you have? Let's see... grep, sort, less/more, sed, awk,
these are all text manipulation programs. All your "give me info about
the system" programs (ls, mount, pwd, hostname, date.......) print
text to stdout. Some also read from stdin, like md5sum and related.

Piles and piles of programs that work with text. A small handful that
work with binary, and most of them are more commonly used directly
with files, not with pipes. The most common case is that it all be
text.

No problem there. I couldn't imagine using either function for anything
serious.

I don't know about those exact functions, but I do know that there are
plenty of Python programs that use the console (take hg as one fairly
hefty example). Maybe input() isn't all that heavily used, but
certainly print() is a fine function. I can not only imagine using
them seriously, I *have used* them, and their equivalents in other
languages, seriously.

If the standard streams are so crucial, why are their most obvious
interfaces insignificant to you?

ChrisA
 
Marko Rauhamaa

Steven D'Aprano said:
But the idea of having standard input and standard output in the first
place comes about because they are useful for the console.

I doubt that. Classic programs take input and produce output. Standard
input and output are the default input and output. The textbook Pascal
programs started:

program myprogram(input, output);

If a system had no command line interface (hence no consoles), why
would you bother with a *standard* input file and output file that are
never used?

Because programs are supposed to do useful work. They consume input and
produce output. That concept is older than computers themselves and is
used to define things like computation, algorithm, halting etc.

No, one of the design principles behind unix is that all data is bytes:
memory, files, devices, sockets, pathnames. Yes, the
ASCII-is-good-for-everybody assumption has been there since the
beginning, but Python will not be able to hide the fact that there is no
text data (in the Python sense). There are only bytes.

UTF-8 beautifully gives text a second-class citizenship in unix/linux.
It will never be granted first-class citizenship, though.

Disagreed strongly.

tcpdump -s 0 -w - >error.pcap
tar zxf - <python.tar.gz
sha1sum <smile.jpg
base64 -d <a.dat >a.exe
wget ftp://micorsops.com/something.avi -O - | mplayer -cache 8192 -

Unfortunately, the text/binary dichotomy breaks a beautiful principle in
Python as well. In numerous contexts, any file-like object will be
valid. Now there is no file-like object. Instead, you have
text-file-like objects and binary-file-like objects, which require
special attention since some operate on strings while others operate on
bytes.
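
The split can be seen with the in-memory stream classes in the `io` module. In this sketch `try_write` is a hypothetical helper that just reports whether a write was accepted.

```python
import io


def try_write(stream, data):
    """Report whether the stream accepts this kind of data."""
    try:
        stream.write(data)
        return "ok"
    except TypeError:
        return "TypeError"


print(try_write(io.StringIO(), "text"))    # ok
print(try_write(io.BytesIO(), b"bytes"))   # ok
print(try_write(io.StringIO(), b"bytes"))  # TypeError
print(try_write(io.BytesIO(), "text"))     # TypeError
```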


Marko
 
Marko Rauhamaa

Chris Angelico said:
If the standard streams are so crucial, why are their most obvious
interfaces insignificant to you?

I want the standard streams to consume and produce bytes. I do a lot of
system programming and connect processes to each other with socketpairs,
pipes and the like. I have dealt with plugin APIs that communicate over
stdin and stdout.
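
For programs in that position, Python 3 does expose the byte layer as `sys.stdin.buffer` and `sys.stdout.buffer`. A minimal sketch of a byte-for-byte filter follows; `copy_bytes` and the chunk size are assumptions made for the example.

```python
import io


def copy_bytes(src, dst, chunk_size=64 * 1024):
    """Copy one binary stream to another without decoding anything."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# In-memory stand-ins; a real filter would pass sys.stdin.buffer and sys.stdout.buffer.
payload = bytes(range(256))
sink = io.BytesIO()
copied = copy_bytes(io.BytesIO(payload), sink)
print(copied)
```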

Python is clearly on a crusade to make *text* a first class system
entity. I don't believe that is possible (without casualties) in the
linux world. Python text should only exist inside string objects.


Marko
 
wxjmfauth

On Thursday, 5 June 2014 11:53:00 UTC+2, Marko Rauhamaa wrote:
I want the standard streams to consume and produce bytes. I do a lot of
system programming and connect processes to each other with socketpairs,
pipes and the like. I have dealt with plugin APIs that communicate over
stdin and stdout.

Python is clearly on a crusade to make *text* a first class system
entity. I don't believe that is possible (without casualties) in the
linux world. Python text should only exist inside string objects.

Marko

=====

Are you sure?
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = 'z'")
[0.9457552436453511, 0.9190932610143818, 0.9322044912393039]
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = '\u0fce'")
[2.5541921791045183, 2.52434366066052, 2.5337417948967413]
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = 'z'.encode('utf-8')")
[0.9168235779232532, 0.8989583403075017, 0.8964204541650247]
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
[0.9320969737165115, 0.9086006535332558, 0.9051715140790861]

>>> sys.getsizeof('abc'*1000 + '\u0fce')
6040
>>> sys.getsizeof(('abc'*1000 + '\u0fce').encode('utf-8'))
3020

jmf
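
The size figures above can be re-measured with a short script. This is only a sketch: the exact numbers depend on the CPython version and build, and the explanation in the comments (the flexible string representation introduced by PEP 393 in Python 3.3) is added here, not taken from the post.

```python
import sys

narrow = "abc" * 1000 + "z"        # every character fits in one byte
wide = "abc" * 1000 + "\u0fce"     # one character needs two bytes

# CPython 3.3+ stores a str using the width of its widest character,
# so a single non-Latin-1 character doubles the per-character storage.
sizes = {
    "narrow str": sys.getsizeof(narrow),
    "wide str": sys.getsizeof(wide),
    "wide as UTF-8 bytes": sys.getsizeof(wide.encode("utf-8")),
}
for label, size in sizes.items():
    print(label, size)
```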
 
Rustom Mody

Marko Rauhamaa wrote:
I doubt that. Classic programs take input and produce output. Standard
input and output are the default input and output. The textbook Pascal
programs started:
program myprogram(input, output);
Because programs are supposed to do useful work. They consume input and
produce output. That concept is older than computers themselves and is
used to define things like computation, algorithm, halting etc.
No, one of the design principles behind unix is that all data is bytes:
memory, files, devices, sockets, pathnames. Yes, the
ASCII-is-good-for-everybody assumption has been there since the
beginning, but Python will not be able to hide the fact that there is no
text data (in the Python sense). There are only bytes.
UTF-8 beautifully gives text a second-class citizenship in unix/linux.
It will never be granted first-class citizenship, though.

Disagreed strongly.
tcpdump -s 0 -w - >error.pcap
tar zxf - <python.tar.gz
sha1sum <smile.jpg
base64 -d <a.dat >a.exe
wget ftp://micorsops.com/something.avi -O - | mplayer -cache 8192 -
Unfortunately, the text/binary dichotomy breaks a beautiful principle in
Python as well. In numerous contexts, any file-like object will be
valid. Now there is no file-like object. Instead, you have
text-file-like objects and binary-file-like objects, which require
special attention since some operate on strings while others operate on
bytes.


Pascal is for building pyramids — imposing, breathtaking, static
structures built by armies pushing heavy blocks into place. — Alan Perlis

Lisp is like a ball of mud. Add more and it's still a ball of mud
— it still looks like Lisp. — Guy Steele

There are two fundamental outlooks in computer science —
structuring and universality. And they pull in opposite
directions.

Universality happens when a data-structure can hold everything —
a universal data structure.

Some of the most significant advances in CS come from a universalist vision:

- von Neumann machine storing data+code in memory
- Turing-tape able to store arbitrary turing machines (∴ universal TM)
- Lisp program ≡ Lisp data
- Stream of bytes can handle/represent everything in Unix: memory, files,
devices, sockets, pathnames.

However after the allurement of universality is over, the
realization dawns that we have a mess — Lisp is a 'mud-ball'. At
which point people start needing to make distinctions — code and
data, different data-structures, type-systems etc. IOW imposing
structure on the mud-ball.

Taking a broad view, while structuring trades the power for
order, it is universality that adds significant power.

Python is not as universal as Lisp — it has no homoiconicity.
But it is close enough in that any variable/data-structure can
contain any value.

What Marko is saying is that by imposing the structuring of
unicode on the outside (Unix) world of text=byte, significant power is lost.

This is also Armin's crib.

How significant that loss is, is yet to be seen…
 
Marko Rauhamaa

Rustom Mody said:
What Marko is saying is that by imposing the structuring of unicode on
the outside (Unix) world of text=byte, significant power is lost.

Mostly I'm saying Python3 will not be able to hide the fact that linux
data consists of bytes. It shouldn't even try. The linux OS outside the
Python process talks bytes, not strings.

A different OS might have different assumptions.


Marko
 
Steven D'Aprano

Mostly I'm saying Python3 will not be able to hide the fact that linux
data consists of bytes. It shouldn't even try. The linux OS outside the
Python process talks bytes, not strings.

Data on pretty much *all* computers consists of bytes, regardless of the
language or operating system. There may be a few esoteric or ancient
machines from the Dark Ages that aren't based on bytes, and even fewer
that aren't based on bits (ancient Soviet era mainframes, if any of them
still survive), but they aren't important. Someday esoteric non-byte
machines, perhaps quantum computers, or machines based on DNA, or nano-
sized analog computers made of carbon atoms, say, will be important, but
this is not that day. For now, bytes rule *everywhere*.

Nevertheless, there are important abstractions that are written on top of
the bytes layer, and in the Unix and Linux world, the most important
abstraction is *text*. In the Unix world, text formats and text
processing are much more common in user-space apps than binary processing.
Perhaps the definitive explanation and celebration of the Unix way is
Eric Raymond's "The Art Of Unix Programming":

http://www.catb.org/esr/writings/taoup/html/ch05s01.html
 
Robin Becker

Mostly I'm saying Python3 will not be able to hide the fact that linux
data consists of bytes. It shouldn't even try. The linux OS outside the
Python process talks bytes, not strings.

A different OS might have different assumptions.


Marko

I think I'm in the unix camp as well. I just think that an extra assumption on
input output isn't always helpful. In python 3 byte strings are second class
which I think is wrong; apparently pressure from influential users is pushing to
make byte strings more first class which is a good thing.
 
Chris Angelico

I think I'm in the unix camp as well. I just think that an extra assumption
on input output isn't always helpful. In python 3 byte strings are second
class which I think is wrong; apparently pressure from influential users is
pushing to make byte strings more first class which is a good thing.

I wouldn't say they're second-class; it's more that the bytes type was
considered to be more like a list of ints than like a Unicode string,
and now that there are a few years' worth of real-world usage
information to learn from, it's known that some more string-like
operations will be extremely helpful. So now they're being added,
which I agree is a good thing.

Whether b"a"[0] should be b'a' or ord(b'a') is another sticking point.
The Py2 str does the first, the Py3 bytes does the second. That one's
a bit hard to change, but what I'm not sure of is how significant this
is to new-build Py3 code. Obviously it's a barrier to porting, but is
it important on its own? However, that's still not really "byte
strings are second class".
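
For readers who have not run into it, a short sketch of the behaviour in question on Python 3. Slicing is the usual way to get a length-1 `bytes` object back.

```python
data = b"abc"

first_item = data[0]     # indexing gives an int
first_slice = data[0:1]  # slicing gives bytes

print(first_item, first_slice)   # 97 b'a'
print(first_item == ord(b"a"))   # True
print(list(data))                # [97, 98, 99]
```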

ChrisA
 
Chris Angelico

In the Unix world, text formats and text
processing are much more common in user-space apps than binary processing.
Perhaps the definitive explanation and celebration of the Unix way is
Eric Raymond's "The Art Of Unix Programming":

http://www.catb.org/esr/writings/taoup/html/ch05s01.html

Specifically, this from the opening paragraph:
"""
Text streams are a valuable universal format because they're easy for
human beings to read, write, and edit without specialized tools. These
formats are (or can be designed to be) transparent.
"""

He goes on to talk about network protocols, one of the best examples
of this. I've idly speculated at times about the possibility of
rewriting the Magic: The Gathering Online back-end with a view to
making it easier to work with. Among other changes, I'd be wanting to
make the client-server communication be plain text (an SMTP-style of
protocol), with an external layer of encryption (TLS). This would mean
that:

1) Internal testing can be done without TLS, making the communication
absolutely transparent, easy to debug, easy to watch, everything.
Adding TLS later would have zero impact on the critical code
internally - it's just a layer around the outside.
2) Upgrades to crypto can simply follow industry best-practice.
(Reminder, to anyone who might have been mad enough to consider this:
DO NOT roll your own crypto! Ever! Even if you use a good library for
the heavy lifting!)
3) A debug log of what the client has sent and received could be
included, even in production, at very low cost. You don't need to
decode packets and pretty-print them - you just take the lines of
text, maybe adorn or color them according to which were sent/received,
and dump them into a display box or log file somewhere.
4) The server is forced to acknowledge that the client might not be
the one it expected. Not only do you get better security that way, but
you could also call this a feature.
5) Therefore, you can debug the system with a simple TELNET or MUD
client (okay, most MUD clients don't do SSL, but you can use "openssl
s_client"). As someone who's debugged myriad issues using his trusty
MUD client, I consider this to be a *huge* advantage.

All it takes is a few simple rules, like: All communication is text,
encoded down the wire as UTF-8, and consists of lines (terminated by
U+000A) which consist of a word, a U+0020 space, and then parameters
to the command. There, that's a rigorous definition that covers
everything you'll need of it; compare with what Flash uses, by
default:

https://en.wikipedia.org/wiki/Action_Message_Format

Sure, it might be slightly more compact going down the wire; but what
do you really gain?

Text wins.
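
A minimal sketch of the rules described above (UTF-8 on the wire, lines terminated by U+000A, a command word, one U+0020 space, then the parameters). The function names and the `PLAY` command are hypothetical; nothing here describes the real game's protocol.

```python
def format_line(word, params=""):
    """Build one protocol line as bytes ready to send."""
    if " " in word or "\n" in word or "\n" in params:
        raise ValueError("word must not contain spaces; nothing may contain a newline")
    text = word + " " + params if params else word
    return (text + "\n").encode("utf-8")


def parse_line(raw):
    """Split one received line (bytes, including the trailing LF) into (word, params)."""
    line = raw.decode("utf-8")
    if not line.endswith("\n"):
        raise ValueError("line is not terminated by U+000A")
    word, _, params = line[:-1].partition(" ")
    return word, params


wire = format_line("PLAY", "Forêt noire")
print(wire)              # b'PLAY For\xc3\xaat noire\n'
print(parse_line(wire))  # ('PLAY', 'Forêt noire')
```

Because each message is just a line of text, the same bytes can be typed by hand into a telnet-style client or dumped straight into a log.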

ChrisA
 
Steven D'Aprano

In python 3 byte strings
are second class which I think is wrong

It certainly is wrong. bytes are just as much a first-class built-in type
as list, int, float, bool, set, tuple and str.

There may be missing functionality (relatively easy to add new
functionality), and even poor design choices (like the foolish decision
to have bytes display as if they were ASCII-ish strings, a silly mistake
that simply reinforces the myth that bytes and ASCII are synonymous).
Python 3.4 and 3.5 are in the process of rectifying as many of these
mistakes as possible, e.g. adding back % formatting. But a few mistakes
in the design of bytes' API no more makes it "second-class" than the lack
of dict.contains_value() method makes dict "second-class".

By all means ask for better bytes functionality. But don't libel Python
by pretending that bytes is anything less than one of the most important
and fundamental types in the language. bytes are so important that there
are TWO implementations for them, a mutable and immutable version
(bytearray and bytes), while text strings only have an immutable version.
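
A quick sketch of the two types side by side. The `%` formatting line needs Python 3.5 or later (PEP 461).

```python
frozen = b"spam"
buf = bytearray(frozen)

buf[0] = ord("S")         # bytearray is mutable
buf.extend(b" eggs")

try:
    frozen[0] = ord("S")  # bytes is immutable
except TypeError as exc:
    error = type(exc).__name__

formatted = b"%s: %d" % (b"count", 3)

print(buf)        # bytearray(b'Spam eggs')
print(error)      # TypeError
print(formatted)  # b'count: 3'
```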
 
Robin Becker

On 05/06/2014 16:50, Chris Angelico wrote:
...........
I wouldn't say they're second-class; it's more that the bytes type was
considered to be more like a list of ints than like a Unicode string,
and now that there are a few years' worth of real-world usage
information to learn from, it's known that some more string-like
operations will be extremely helpful. So now they're being added,
which I agree is a good thing.

In Python 2, str and unicode were much more comparable. On balance I think just
reversing them, i.e. str --> bytes and unicode --> str, was probably the right thing
to do if the default conversions had been turned off. However, making bytes a
crippled thing was wrong.

Whether b"a"[0] should be b'a' or ord(b'a') is another sticking point.
The Py2 str does the first, the Py3 bytes does the second. That one's
a bit hard to change, but what I'm not sure of is how significant this
is to new-build Py3 code. Obviously it's a barrier to porting, but is
it important on its own? However, that's still not really "byte
strings are second class".
.......
I dislike the current model, but that's because I had a lot of stuff to convert
and probably made a bunch of blunders. The reportlab code is now a mess of hacks
to keep it alive for 2.7 & >=3.3; I'm probably never going to be convinced that
unicode types are good. Bytes are the underlying concept and should have remained
so for simplicity's sake.
 
Steven D'Aprano

Bytes are the underlying
concept and should have remained so for simplicity's sake.

Bytes are the underlying concept for classes too. Do you think that an
opaque unstructured blob of bytes is "simpler" to use than a class? How
would an unstructured blob of bytes be simpler to use than an array of
multi-byte characters?
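
A small illustration of the difference, using an arbitrary sample word: the str is indexed by character, while the UTF-8 bytes are indexed by byte and can be cut in the middle of a character.

```python
word = "naïve"
blob = word.encode("utf-8")

print(len(word), len(blob))   # 5 6
print(word[:3])               # naï

# The same slice on the raw bytes stops halfway through the two-byte "ï".
try:
    blob[:3].decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError:
    outcome = "UnicodeDecodeError"
print(outcome)                # UnicodeDecodeError
```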

Earlier:
I dislike the current model, but that's because I had a lot of stuff to
convert and probably made a bunch of blunders. The reportlab code is
now a mess of hacks to keep it alive for 2.7 & >=3.3;

Although I've been critical of many of your statements, I am sympathetic
to your pain. There's no doubt that the transition from the old,
broken system of bytes masquerading as text can be hard, especially to
those who never quite get past the misleading and false paradigm that
"bytes are ASCII". It may have been that there were better ways to have
updated to 3.3; perhaps you were merely unfortunate to have updated too
early, and had you waited to 3.4 or 3.5 things would have been better. I
don't know.

But whatever the situation, and despite our differences of opinion about
Unicode, THANK YOU for having updated ReportLab to 3.3.
 
