Python 3.2 has some deadly infection

Marko Rauhamaa

Steven D'Aprano said:
Nevertheless, there are important abstractions that are written on top
of the bytes layer, and in the Unix and Linux world, the most
important abstraction is *text*. In the Unix world, text formats and
text processing are much more common in user-space apps than binary
processing.

That linux text is not the same thing as Python's text. Conceptually,
Python text is a sequence of 32-bit integers. Linux text is a sequence
of 8-bit integers.

It is great that lots of computer-to-computer formats are encoded in
ASCII (~ UTF-8). However, nowhere in linux is there a real abstraction
layer that processes Python-esque text.

Case in point:

$ env | grep UTF
LANG=en_US.UTF-8
$ od -c <<<"Hyvää yötä" # "Good night" in Finnish
0000000 H y v 303 244 303 244 y 303 266 t 303 244 \n
0000017

The "od" utility is asked to display its input as characters. The locale
info gives a hint that all text data is in UTF-8. Yet what comes out is
bytes.

How about:

$ wc -c <<<"Hyvää yötä"
15
$ tr 'ä' 'a' <<<"Hyvää yötä"
Hyvaaaa ya�taa

Grep is smarter:

$ grep v...y <<<"Hyvää yötä"
Hyvää yötä

which is why you should always prefix "grep" with LC_ALL=C in your
scripts (makes it far faster, too).
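
For comparison, the same phrase can be inspected from Python 3, where the code-point view (str) and the byte view (bytes) are separate types. This is only a minimal sketch; the variable names are illustrative and not from the thread:

```python
text = "Hyvää yötä\n"          # the here-string above, including its trailing newline
data = text.encode("utf-8")     # the bytes that od and wc -c operate on

print(len(text))   # number of code points
print(len(data))   # number of bytes, the figure wc -c reports
print(data)
```

The byte count agrees with the wc -c output above, while the code-point count is what a character-aware tool would report.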


Marko
 
Rustom Mody

Specifically, this from the opening paragraph:
"""
Text streams are a valuable universal format because they're easy for
human beings to read, write, and edit without specialized tools. These
formats are (or can be designed to be) transparent.
"""

A fact that stops being true when you tie up text with encodings.
For two reasons:

1. The function/pair encode/decode mapping between byte-string and text
cannot be a bijection because the byte-string set is larger than the text
set. This is the error that Armin was hit by

2. Since there is not one but a zillion encodings possible we are not
talking of one (possibly universal) data structure but a zillion
ones: "Text streams are a universal format" - which encoding-ed
form of text??
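
The first point can be seen directly in Python 3: not every byte string is valid UTF-8, so a strict decode can fail. The surrogateescape error handler (PEP 383) is not mentioned in the thread; it is shown here only as one way Python can make the round trip lossless. A small sketch with made-up variable names:

```python
raw = b"caf\xe9"   # Latin-1 bytes for "café"; not valid UTF-8

try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape carries each undecodable byte through as a lone surrogate
smuggled = raw.decode("utf-8", errors="surrogateescape")
round_trip = smuggled.encode("utf-8", errors="surrogateescape")

print(strict_ok, round_trip == raw)
```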
 
Chris Angelico

In Python 2, str and unicode were much more comparable. On balance I think
just reversing them, i.e. str --> bytes and unicode --> str, was probably the
right thing to do if the default conversions had been turned off. However
making bytes a crippled thing was wrong.

It's easy to build up functionality after the event. Maybe reportlab
will have lots of hacks to support both 2.7 and 3.3, but in a few
years you'll be able to say "supports 2.7 and 3.5" and take advantage
of percent formatting and whatever else is added. But this is just the
way that languages develop; you use them, you find what isn't easy,
and you fix it. The nature of stability is that it takes time before
you can depend on freshly-written functionality (contrast the extreme
instability of running the version from source control - stuff might
be fixed at any time, but you have to do all the work yourself to make
sure your dependencies line up), but over time, you can depend on
improvements making their way out there.

Can you point to specific areas in which the bytes type is "crippled"?
Comparing either to the Py2 str or the Py3 str, or to anything else?
The Python core devs are listening, as evidenced by PEP 461.

ChrisA
 
Ian Kelly

In Python 2, str and unicode were much more comparable. On balance I think
just reversing them, i.e. str --> bytes and unicode --> str, was probably the
right thing to do if the default conversions had been turned off. However
making bytes a crippled thing was wrong.

How should e.g. bytes.upper() be implemented then? The correct
behavior is entirely dependent on the encoding. Python 2 just assumes
ASCII, which at best will correctly upper-case some subset of the
string and leave the rest unchanged, and at worst could corrupt the
string entirely. There are some things that were dropped that should
not have been, but my impression is that those are being worked on,
for example % formatting in PEP 461.
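
To illustrate the point about upper(): in Python 3, bytes.upper() only touches ASCII letters, whatever encoding the bytes happen to be in, while str.upper() works on code points. The last lines show the bytes % formatting that PEP 461 added, which needs Python 3.5 or later. A sketch, with an arbitrary example word and header:

```python
word = "hyvää"

as_text = word.upper()                    # code-point aware
as_bytes = word.encode("utf-8").upper()   # ASCII letters only; other bytes untouched

print(as_text)
print(as_bytes.decode("utf-8"))

# PEP 461 (Python 3.5+): percent formatting on bytes
header = b"Content-Length: %d\r\n" % 15
print(header)
```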
 
Chris Angelico

That linux text is not the same thing as Python's text. Conceptually,
Python text is a sequence of 32-bit integers. Linux text is a sequence
of 8-bit integers.

Point of terminology: Linux is the kernel, everything you say below
here is talking about particular programs. From what I understand,
bash (just another Unix program) treats strings as sequences of
codepoints, just as Python does; though its string manipulation is not
nearly as rich as Python's, so it's harder to prove. Python is itself
a Unix program, so you can do the exact same proofs and demonstrate
that Linux is clearly Unicode-aware. It's not Linux you're testing.

ChrisA
 
Chris Angelico

A fact that stops being true when you tie up text with encodings.
For two reasons:

1. The function/pair encode/decode mapping between byte-string and text
cannot be a bijection because the byte-string set is larger than the text
set. This is the error that Armin was hit by

2. Since there is not one but a zillion encodings possible we are not
talking of one (possibly universal) data structure but a zillion
ones: "Text streams are a universal format" - which encoding-ed
form of text??

As soon as you store or transmit ANY form of information, you need to
worry about encodings. Ever heard of this thing called "network byte
order"? It's part of taming the wilds of integer encodings. The theory
is that the LC environment variables will carry all that crucial
out-of-band information about encodings, and while the practice isn't
perfect, it does still mean that there is such a thing as a text
stream.
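
The integer analogy can be made concrete with the struct module. This is a sketch only; the value 258 is an arbitrary choice for illustration:

```python
import struct

value = 258  # 0x00000102

big = struct.pack("!I", value)      # "!" selects network (big-endian) byte order
little = struct.pack("<I", value)   # little-endian layout

print(big, little)

# Decoding with the wrong byte order silently yields a different integer
(wrong,) = struct.unpack("<I", big)
print(wrong)
```

Just as with text, the bytes alone do not say which interpretation is intended; that information has to travel out of band.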

ChrisA
 
Terry Reedy

Mostly I'm saying Python3 will not be able to hide the fact that linux
data consists of bytes. It shouldn't even try. The linux OS outside the
Python process talks bytes, not strings.

A text file is a binary file wrapped with a codec to translate to and
from a universal text format on input and output. Much of the time, the
wrapping is a great user convenience. Since the wrapping is optional,
nothing is forced or really hidden.

A different OS might have different assumptions.

Different OSes *do* have different assumptions. Both MacOSX and current
Windows use (UCS-2 or) UTF-16 for text. It seems that unicode strings
are better than ascii+??? strings as a universal basis for OS
interfacing. For Windows, at least, the interface is much improved in
Python 3.

I understand that some, but not all, Latin alphabet *nix programmers
wish that Python 3 continued to be strongly in their favor. But they are
a small minority of the world's programmers, and Python 3 is aimed at
everyone on all systems.
 
Marko Rauhamaa

Terry Reedy said:
Different OSes *do* have different assumptions. Both MacOSX and
current Windows use (UCS-2 or) UTF-16 for text.

Linux can use anything for text; UTF-8 has become a de-facto standard.

How text is represented is very different from whether text is a
fundamental data type. A fundamental text file is such that ordinary
operating system facilities can't see inside the black box (that is,
they are *not* encoded as far as the applications go).

I have no idea how opaque text files are in Windows or OS-X.

For Windows, at least, the interface is much improved in Python 3.

Yes, I get the feeling that Python is reaching out to Windows and OS-X
and trying to make linux look like them.

I understand that some, but not all, Latin alphabet *nix programmers
wish that Python 3 continued to be strongly in their favor. But they
are a small minority of the world's programmers, and Python 3 is aimed
at everyone on all systems.

Python allows linux programmers to write native linux programs. Maybe it
allows Windows programmers to write native Windows programs. I certainly
hope so.

I don't want to have to write Windows programs that kinda run on linux.
Java suffers from that: no "import os" in Java.


Marko
 
Terry Reedy

I want the standard streams to consume and produce bytes.

Easy. Read the manual entry for stdxxx. "To write or read binary data
from/to the standard streams, use the underlying binary buffer object.
For example, to write bytes to stdout, use
sys.stdout.buffer.write(b'abc')" To make it easy, use bound methods.

myfilter.py
-----------
import sys
sysin = sys.stdin.buffer.read
sysout = sys.stdout.buffer.write
syserr = sys.stderr.buffer.write

<filter code with calls to sysin, sysout, syserr.>
---

The same trick of defining bound methods to save both writing and
execution time is also useful for text filters when you use
sys.stdin.read, etc, more than once in the text.

When you try this, please report the result, either way.

I do a lot of system programming and connect processes to each other
with socketpairs, pipes and the like. I have dealt with plugin APIs
that communicate over stdin and stdout.

Now you know how to do so on Python 3.

Python is clearly on a crusade to make *text* a first class system
entity. I don't believe that is possible (without casualties) in the
linux world. Python text should only exist inside string objects.

You are clearly on a crusade to push a falsehood. Why?

On Windows and, I believe, Mac, utf-16 encoded text (C widechar type)
*is* a 'first class system entity'. The problem Python has with *nix is
getting text bytes from the system in an unknown or, worse,
wrongly-claimed encoding. The Python developers do their best to cope
with the differences and peculiarities of the systems it runs on.
 
Marko Rauhamaa

Terry Reedy said:
Easy. Read the manual entry for stdxxx. "To write or read binary data
from/to the standard streams, use the underlying binary buffer object.
For example, to write bytes to stdout, use
sys.stdout.buffer.write(b'abc')"

This note from the manual is a bit vague:

Note that the streams can be replaced with objects (like io.StringIO)
that do not support the buffer attribute or the detach() method

"Can be replaced" by who? By the Python developers? By me? By random
library calls?

Does it mean the buffer and detach are not guaranteed to stay with the
API?


Marko
 
Terry Reedy

This note from the manual is a bit vague:

Note that the streams can be replaced with objects (like io.StringIO)
that do not support the buffer attribute or the detach() method

"Can be replaced" by who? By the Python developers? By me? By random
library calls?

Fair question. The Python developers will not fiddle with stdxxx for 3rd
party code on 3rd party systems. We do sometimes *temporarily* replace
the streams with StringIO, either directly or via test.support when
testing Python itself or stdlib modules. That is done in Lib/test, and
except for testing StringIO, it is only done as a convenience, not a
necessity.

To test a binary stream filter, you would have to do something else,
like read from and write to actual files on disk. Otherwise, you seem
unlikely to sabotage yourself, even accidentally.

Random non-stdlib library calls could sabotage you. However, in my
opinion, an imported 3rd party module should never modify std streams,
with one exception. The exception would be a module whose entire purpose
was to put the streams in a known state, as documented, and only if
intentionally asked to.

Having said that, bound methods created (first) should work regardless
of any subsequent manipulation of sys. Here is an experiment, run from
an Idle editor.

import sys
sysout = sys.stdout.write
sys.stdout = None
sysout('works anyway\n')
works anyway

(Of course, subsequent attempts to continue interactively fail. But that
is not your use case.)
 
Rustom Mody

Point of terminology: Linux is the kernel, everything you say below
here is talking about particular programs.

If it helps try the following substitution:

s/Linux/Pretty much all the distros that use Linux for their OS kernel/

BTW the only (other) guy I know who insistently makes that distinction is
Richard Stallman.

From what I understand,
bash (just another Unix program) treats strings as sequences of
codepoints, just as Python does; though its string manipulation is not
nearly as rich as Python's, so it's harder to prove. Python is itself
a Unix program, so you can do the exact same proofs and demonstrate
that Linux is clearly Unicode-aware. It's not Linux you're testing.

In these 'other programs' is it permissible to include the kernel
itself?
And then ask how Linux (in your and Stallman's sense) differs from
Windows in how the filesystem handles things like filenames?
 
Chris Angelico

If it helps try the following substitution:

s/Linux/Pretty much all the distros that use Linux for their OS kernel/

You could look at the Debian Project, which is a full environment with
everything you're talking about. And everything you say would be
equally true of Debian Linux and Debian kfreebsd. :)

BTW the only (other) guy I know who insistently makes that distinction is
Richard Stallman.

Are you an emacs user by any chance <wink>?

Nope! Just a terminology nerd. :)

In these 'other programs' is it permissible to include the kernel
itself?
And then ask how Linux (in your and Stallman's sense) differs from
Windows in how the filesystem handles things like filenames?

What are you testing of the kernel? Most of the kernel doesn't
actually work with text at all - it works with integers, buffers of
memory (which could be seen as streams of bytes, but might be almost
anything), process tables, open file handles... but not usually text.
To you, "EAGAIN" might be a bit of text, but to the Linux kernel, it's
an integer (11 decimal, if I recall correctly). Is that some fancy new
form of encoding? :)
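
The split between the kernel's integer and the text a person reads can be seen from Python's errno module. A brief sketch; the numeric value is platform-dependent, so it is printed rather than assumed:

```python
import errno
import os

code = errno.EAGAIN                 # the integer error number
name = errno.errorcode[code]        # symbolic name, looked up in user space
message = os.strerror(code)         # human-readable text from the C library

print(code, name, message)
```

On platforms where EAGAIN and EWOULDBLOCK share a number, the symbolic name reported for that number may be either one.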

ChrisA
 
Steven D'Aprano

Linux can use anything for text; UTF-8 has become a de-facto standard.

How text is represented is very different from whether text is a
fundamental data type. A fundamental text file is such that ordinary
operating system facilities can't see inside the black box (that is,
they are *not* encoded as far as the applications go).

Wait, are they black-boxes to the *operating system* or to
*applications*? They aren't the same thing.

In any case, I reject your premise. ALL data types are constructed on top
of bytes, and so long as you allow applications *any way* to coerce data
types to different data types, you allow them to see "inside the black
box". I can extract the four bytes from a C long integer, but that
doesn't mean that C longs aren't fundamental data types in Unix/Linux.

I have no idea how opaque text files are in Windows or OS-X.

Exactly as opaque as they are in Unix, which is to say not at all. Just
open the file in binary mode, and voilà you see the underlying bytes.
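
A minimal sketch of that point: the same file is written through the text layer and then read back both as bytes and as text. The temporary file and the sample string are assumptions for illustration only:

```python
import os
import tempfile

text = "Hyvää yötä\n"

# Write through the text layer with an explicit encoding...
with tempfile.NamedTemporaryFile("w", encoding="utf-8", newline="\n",
                                 suffix=".txt", delete=False) as f:
    f.write(text)
    path = f.name

# ...then read the same file back both ways.
with open(path, "rb") as f:
    raw = f.read()            # the underlying bytes
with open(path, encoding="utf-8", newline="") as f:
    decoded = f.read()        # the same bytes seen through the codec
os.remove(path)

print(raw)
print(decoded == text)
```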

All you're doing is pointing out that, in modern electronic computers,
the fundamental data structure which underlies all others (the
indivisible protons and neutrons, so to speak, only there are 256 of them
rather than 2) is the byte. We know this, and don't dispute it.

(Like protons and neutrons, we can see inside bytes to the quark-like
bits that make up bytes. Like quarks, bits do not exist in isolation, but
only inside bytes.)


Yes, I get the feeling that Python is reaching out to Windows and OS-X
and trying to make linux look like them.

Unicode support in OS-X is (I have been assured) very good, probably
better than Linux. Apple has very high standards when it comes to their
apps, and provides rich Unicode-aware APIs.

But Linux Unicode support is much better than Windows. Unicode support in
Windows is crippled by continued reliance on legacy code pages, and by
the assumption deep inside the Windows APIs that Unicode means "16 bit
characters". See, for example, the amount of space spent on fixing
Windows Unicode handling here:

http://www.utf8everywhere.org/
 
Steven D'Aprano

This note from the manual is a bit vague:

Note that the streams can be replaced with objects (like io.StringIO)
that do not support the buffer attribute or the detach() method

"Can be replaced" by who? By the Python developers? By me? By random
library calls?

By you. sys.stdout and friends are writable. Any code you call may have
replaced them with another file-like object, and you should honour that.

The API could have/should have been a little more friendly, but it's
conceptually simple:

* Does sys.stdout have a buffer attribute? Then write raw bytes to
the buffer.

* If not, then write raw bytes to sys.stdout.

* If either fails, then somebody has replaced stdout with something
weird, and they deserve whatever horrible fate their damn fool
move causes. It's not your responsibility to try to keep your
application running under bizarre circumstances.
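
The three steps above can be sketched as a small helper. The name write_bytes is hypothetical, not part of any library, and the in-memory streams stand in for a real or replaced sys.stdout:

```python
import io
import sys

def write_bytes(data, stream=None):
    """Write raw bytes to stream.buffer if it exists, else to stream itself."""
    if stream is None:
        stream = sys.stdout
    target = getattr(stream, "buffer", stream)
    target.write(data)   # any failure here is deliberately left to propagate
    target.flush()

# Case 1: a text wrapper around a binary buffer, like the usual sys.stdout
binary = io.BytesIO()
wrapped = io.TextIOWrapper(binary, encoding="utf-8")
write_bytes(b"abc", wrapped)

# Case 2: a replacement stream that is already binary and has no .buffer
plain = io.BytesIO()
write_bytes(b"abc", plain)

print(binary.getvalue(), plain.getvalue())
```

If the replacement is a text-only object such as io.StringIO, the write raises TypeError, which corresponds to the third bullet.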
 
Marko Rauhamaa

Steven D'Aprano said:
In any case, I reject your premise. ALL data types are constructed on
top of bytes,

Only in a very dull sense.

and so long as you allow applications *any way* to coerce data types
to different data types, you allow them to see "inside the black box".

I can't see the bytes inside Python objects, including strings, and
that's how it is supposed to be.

Similarly, I can't (easily) see how files are laid out on hard disks.
That's a true abstraction. Nothing in linux presents data, though,
except through bytes.


Marko
 
Marko Rauhamaa

Steven D'Aprano said:
By you. sys.stdout and friends are writable. Any code you call may
have replaced them with another file-like object, and you should
honour that.

I can of course overwrite even sys and os and open and all. That hardly
merits mentioning in the API documentation.

What I'm afraid of is that the Python developers are reserving the right
to remove the buffer and detach attributes from the standard streams in
a future version. That would be terrible.

If it means some other module is allowed to commandeer the standard
streams, that would be bad as well.

Worst of all, I don't know why the caveat had to be there.

Or is it maybe because some python command line options could cause
buffer and detach not to be there? That would explain the caveat, but
still would be kinda sucky.


Marko
 
Chris Angelico

I can of course overwrite even sys and os and open and all. That hardly
merits mentioning in the API documentation.

What I'm afraid of is that the Python developers are reserving the right
to remove the buffer and detach attributes from the standard streams in
a future version. That would be terrible.

If it means some other module is allowed to commandeer the standard
streams, that would be bad as well.

Worst of all, I don't know why the caveat had to be there.

Or is it maybe because some python command line options could cause
buffer and detach not to be there? That would explain the caveat, but
still would be kinda sucky.

It's more that replacing sys.std* is considered reasonably normal
(unlike, say, replacing sys.float_info, which would be a weird thing
to do); and you could replace them with something that doesn't have
those attributes. If you're running a top-level script and you never
import anything that changes the streams, you should be able to depend
on those always being there.
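
A short sketch of what such a replacement looks like in practice, using contextlib.redirect_stdout from the standard library as one common way the swap happens. The variable names are illustrative:

```python
import contextlib
import io
import sys

replacement = io.StringIO()

# redirect_stdout swaps sys.stdout for the duration of the block
with contextlib.redirect_stdout(replacement):
    has_buffer = hasattr(sys.stdout, "buffer")
    print("captured")

try:
    replacement.detach()
    detach_ok = True
except io.UnsupportedOperation:
    detach_ok = False

print(has_buffer, detach_ok, repr(replacement.getvalue()))
```

While the redirection is active, code that reaches for sys.stdout.buffer gets an AttributeError, which is the situation the caveat in the manual describes.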

ChrisA
 
Terry Reedy

I can of course overwrite even sys and os and open and all. That hardly
merits mentioning in the API documentation.

What I'm afraid of is that the Python developers are reserving the right
to remove the buffer and detach attributes from the standard streams in
a future version.

No, not at all.

That would be terrible.

Agreed.

If it means some other module is allowed to commandeer the standard
streams, that would be bad as well.

I think that, for the most part, library modules should either open a
file given a filename from outside or read from and write to open files
handed to them from outside, but not hard-code the std streams. The
module doc should say if the file (name or object) must be text or in
particular binary.

The warning is also a hint as to how to solve a problem, such as testing
a binary filter. Assume the module reads from and writes to .buffer and
has a main function. One approach, untested:

    import sys, io, unittest
    from mod import main  # 'mod' is the module under test

    class Binstd:
        # Minimal stand-in for a standard stream: only .buffer is provided.
        def __init__(self):
            self.buffer = io.BytesIO()

    sys.stdin = Binstd()
    sys.stdout = Binstd()

    sys.stdin.buffer.write(b'test data')
    sys.stdin.buffer.seek(0)
    main()
    out = sys.stdout.buffer.getvalue()
    # test that out is as expected for the input
    # seek to 0 and truncate for more tests

Worst of all, I don't know why the caveat had to be there.

Because the streams can be replaced for a variety of good reasons, as above.

Or is it maybe because some python command line options could cause
buffer and detach not to be there? That would explain the caveat, but
still would be kinda sucky.

The doc set documents the Python command line options, as well as any that
are CPython specific. It is possible that some implementation could add
one to open stdxyz in binary mode. CPython does not really need that.
 
Rustom Mody

What are you testing of the kernel? Most of the kernel doesn't
actually work with text at all - it works with integers, buffers of
memory (which could be seen as streams of bytes, but might be almost
anything), process tables, open file handles... but not usually text.
To you, "EAGAIN" might be a bit of text, but to the Linux kernel, it's
an integer (11 decimal, if I recall correctly). Is that some fancy new
form of encoding? :)


| Thanks to the properties of UTF-8 encoding, the Linux kernel, the
| innermost and lowest-level part of the operating system, can
| handle Unicode filenames without even having the user tell it
| that UTF-8 is to be used. All character strings, including
| filenames, are treated by the kernel in such a way that THEY
| APPEAR TO IT ONLY AS STRINGS OF BYTES. Thus, it doesn't care and
| does not need to know whether a pair of consecutive bytes should
| logically be treated as two characters or a single one. The only
| risk of the kernel being fooled would be, for example, for a
| filename to contain a multibyte Unicode character encoded in such
| a way that one of the bytes used to represent it was a slash or
| some other character that has a special meaning in file
| names. Fortunately, as we noted, UTF-8 never uses ASCII
| characters for encoding multibyte characters, so neither the
| slash nor any other special character can appear as part of one
| and therefore there is no risk associated with using Unicode in
| filenames.
|
| Filesystems found on Microsoft Windows machines (NTFS and FAT)
| are different in that THEY STORE FILENAMES ON DISK IN SOME
| PARTICULAR ENCODING. The kernel must translate this encoding to
| the system encoding, which will be UTF-8 in our case.
|
| If you have Windows partitions on your system, you will have to
| take care that they are mounted with correct options. For FAT and
| ISO9660 (used by CD-ROMs) partitions, option utf8 makes the
| system translate the filesystem's character encoding to
| UTF-8. For NTFS, nls=utf8 is the recommended option (utf8 should
| also work).

[Emphases mine]

From: http://michal.kosmulski.org/computing/articles/linux-unicode.html
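
The property the quoted article relies on, that multibyte UTF-8 sequences never contain ASCII bytes, can be spot-checked from Python. This is a sketch that samples code points rather than proving anything; the helper name ascii_free and the sample filename are made up for illustration:

```python
def ascii_free(ch):
    """True if no byte of ch's UTF-8 encoding falls in the ASCII range."""
    return all(b >= 0x80 for b in ch.encode("utf-8"))

# Sample the non-ASCII code points, skipping the surrogate range,
# which cannot be encoded as UTF-8 at all.
sample = [cp for cp in range(0x80, 0x110000, 7)
          if not 0xD800 <= cp <= 0xDFFF]
all_safe = all(ascii_free(chr(cp)) for cp in sample)

# A slash byte can therefore only come from a real "/" character.
slash_in_name = b"/" in "yötä.txt".encode("utf-8")

print(all_safe, slash_in_name)
```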
 
