PEP 358 and operations on bytes

Gerrit Holl · Oct 3, 2006

Hi,

In Python 3, reading from a file gives bytes rather than characters.
Some operations currently performed on strings also make sense when
performed on bytes, either if it's binary data or if it's text of
unknown or mixed encoding. Those include of course slicing and other
operators that exist in lists, but also other operations that aren't
currently defined in PEP 358, like:

- str methods endswith, find, partition, replace, split(lines),
startswith,
- Regular expressions

I think those can be useful on a bytes type. Perhaps bytes and str could
share a common parent class? They certainly share a lot of properties
and possible operations one might want to perform.

kind regards,
Gerrit Holl.
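
For reference, a minimal sketch of how the operations listed above look on the bytes type that Python 3 eventually shipped (3.0 and later, not the PEP 358 draft under discussion). The sample data and variable names are invented for illustration.

```python
import re

# Hypothetical sample: a header line of unknown encoding plus a binary payload.
raw = b"Subject: caf\xe9\r\nX-Data: \x00\x01\x01\x02\r\n"

# The str-style methods named above exist on bytes and take bytes arguments.
has_prefix = raw.startswith(b"Subject:")
has_suffix = raw.endswith(b"\r\n")
pos = raw.find(b"X-Data")
head, sep, tail = raw.partition(b"\r\n")
lines = raw.splitlines()
unix_style = raw.replace(b"\r\n", b"\n")

# Regular expressions accept a bytes pattern when searching bytes.
m = re.search(rb"\x01+", raw)
span = m.span()
group = m.group()
```

Mixing the two types is rejected: a str pattern cannot be used on bytes, so the text/binary split stays explicit.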

John Machin · Oct 3, 2006

Gerrit said:
Hi,

In Python 3, reading from a file gives bytes rather than characters.
Some operations currently performed on strings also make sense when
performed on bytes, either if it's binary data or if it's text of
unknown or mixed encoding. Those include of course slicing and other
operators that exist in lists, but also other operations that aren't
currently defined in PEP 358, like:

- str methods endswith, find, partition, replace, split(lines),
startswith,
- Regular expressions

I think those can be useful on a bytes type. Perhaps bytes and str could
share a common parent class? They certainly share a lot of properties
and possible operations one might want to perform.

I look at it this way::
Processing text? Use unicode.
Binary structures and file I/O, interfacing to 8-bit-wide channels? Use
bytes.
Nostalgic for confused mixed-use? Don't upgrade.

IMHO, core dev time would be better used on:

* making /relevant/ modules (e.g. struct) work with bytes -- this topic
is not mentioned in the PEP.
* ensuring it covers everything that array.array('B', ...) does.
* being able to initialise a bytes array to (typically) all zeroes
without having to instantiate an initialiser e.g. record =
bytes(size=996, fill=0) instead of record = bytes(996 * [0])

than on starts(ends)with etc, and regexes.

Cheers,
John
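
A short sketch of what the first and third wishes look like in Python 3 as released. The bytes(size=..., fill=...) spelling proposed above was not adopted; passing an integer gives a zero-filled object instead. The record layout below is a made-up example.

```python
import struct

# A zero-filled, immutable record: an int argument yields that many NUL bytes.
record = bytes(996)

# A mutable equivalent, closer in spirit to array.array('B', ...).
buf = bytearray(996)

# A non-zero fill can be built by repetition.
ones = b"\xff" * 996

# struct can write into a mutable buffer in place. The layout here
# (little-endian uint16 followed by uint32) is purely illustrative.
struct.pack_into("<HI", buf, 0, 0x0102, 0xDEADBEEF)
fields = struct.unpack_from("<HI", buf, 0)
```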

Gerrit Holl · Oct 4, 2006

I look at it this way::
Processing text? Use unicode.
Binary structures and file I/O, interfacing to 8-bit-wide channels? Use
bytes.

But can I use regular expressions on bytes?
Regular expressions are not limited to text.

Gerrit.

John Machin · Oct 4, 2006

Gerrit said:
But can I use regular expressions on bytes?
Regular expressions are not limited to text.

So why haven't you been campaigning for regular expression support for
sequences of int, and for various array.array subtypes?

Paul Rubin · Oct 4, 2006

John Machin said:
So why haven't you been campaigning for regular expression support for
sequences of int, and for various array.array subtypes?

regexps work on byte arrays.

John Machin · Oct 4, 2006

Paul said:
regexps work on byte arrays.

But not on other integer subtypes. If regexps should not be restricted
to text, they should work on domains whose number of symbols is greater
than 256, shouldn't they?

Paul Rubin · Oct 4, 2006

John Machin said:
But not on other integer subtypes. If regexps should not be restricted
to text, they should work on domains whose number of symbols is greater
than 256, shouldn't they?

I think the underlying regexp C library isn't written that way. I can
see reasons to want a higher-level regexp library that works on
arbitrary sequences, calling a user-supplied function to classify
sequence elements, the way current regexps use the character code to
classify characters.
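
The idea of classifying elements with a user-supplied function can be sketched in a few lines. This is a toy backtracking matcher, not an existing library: the names search and _match_here, and the (classifier, quantifier) pattern format, are all assumptions made for the example.

```python
def _match_here(pattern, seq, pi, si):
    """Return the end index of a match of pattern[pi:] at seq[si:], or None."""
    if pi == len(pattern):
        return si
    pred, quant = pattern[pi]
    lo = 1 if quant in ("1", "+") else 0            # minimum repetitions
    hi = 1 if quant in ("1", "?") else len(seq) - si  # maximum repetitions
    # Greedily count consecutive elements accepted by the classifier.
    n = 0
    while n < hi and si + n < len(seq) and pred(seq[si + n]):
        n += 1
    # Backtrack from the longest run down to the minimum.
    while n >= lo:
        end = _match_here(pattern, seq, pi + 1, si + n)
        if end is not None:
            return end
        n -= 1
    return None


def search(pattern, seq):
    """Find the first match; pattern is a list of (classifier, quantifier)
    pairs, where quantifier is one of "1", "?", "*" or "+"."""
    for start in range(len(seq) + 1):
        end = _match_here(pattern, seq, 0, start)
        if end is not None:
            return (start, end)
    return None


# Example: a string token followed by one or more numbers.
tokens = ["if", 3, 4.5, "x", 7]
is_str = lambda x: isinstance(x, str)
is_num = lambda x: isinstance(x, (int, float))
found = search([(is_str, "1"), (is_num, "+")], tokens)
```

Because the classifier is an arbitrary callable, the alphabet is not limited to 256 symbols, at the cost of running in pure Python.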

bearophileHUGS · Oct 4, 2006

Paul Rubin:

I think the underlying regexp C library isn't written that way. I can
see reasons to want a higher-level regexp library that works on
arbitrary sequences, calling a user-supplied function to classify
sequence elements, the way current regexps use the character code to
classify characters.

To begin with something concrete: some days ago I started to write
a simple RE engine that works on lists/tuples/arrays and uses Psyco in
a good way (but then I stopped developing it). Once, and only once,
some good uses have been found, someone can translate the code to
C if necessary.
It seems an interesting thing, but can you find some uses for it?

Bye,
bearophile

Paul Rubin · Oct 4, 2006

...It seems an interesting thing, but can you find some uses for it?

Yes, I want something like that all the time for file scanning without
having to resort to parser modules or hand coded automata.

Fredrik Lundh · Oct 4, 2006

John said:
But not on other integer subtypes. If regexps should not be restricted
to text, they should work on domains whose number of symbols is greater
than 256, shouldn't they?

they do:

import re, array

data = [0, 1, 1, 2]

array_type = "IH"[re.sre_compile.MAXCODE == 0xffff]

a = array.array(array_type, data)

m = re.search(r"\x01+", a)

if m:
    print m.span()
    print m.group()

</F>

bearophileHUGS · Oct 4, 2006

A simple RE engine written in Python can be short, this is a toy:
http://paste.lisp.org/display/24849
If you can't live without the usual syntax:
http://paste.lisp.org/display/24872

Paul Rubin:

Yes, I want something like that all the time for file scanning without
having to resort to parser modules or hand coded automata.

Once read, a file is a string or unicode object, and you can use normal
REs on those. If you need list-REs, you have probably split the data into
parts. Can you show one or more examples where you think simple list-REs
would be useful?

Bye,
bearophile

John Machin · Oct 4, 2006

Fredrik said:
John said:

But not on other integer subtypes. If regexps should not be restricted
to text, they should work on domains whose number of symbols is greater
than 256, shouldn't they?

they do:

import re, array

data = [0, 1, 1, 2]

array_type = "IH"[re.sre_compile.MAXCODE == 0xffff]

a = array.array(array_type, data)

m = re.search(r"\x01+", a)

if m:
    print m.span()
    print m.group()

Very minor nit: re.sre_compile doesn't exist before Python 2.5.
Presumably sys.maxunicode can substitute for re.sre_compile.MAXCODE.

That aside, I'd like to nominate myself as UGPOTM (utterly gobsmacked
poster of the month). Not only does that work, but so does this, all
the way back to 2.1 at least:

import re, array
data = [0, 1, 1, 2, 257, 257, 258]
# array_type = "IH"[re.sre_compile.MAXCODE == 0xffff] # Python 2.5
array_type = "H"
a = array.array(array_type, data)
for q in (r"\x01+", ur"\u0101+"):
    m = re.search(q, a)
    if m:
        print m.span()
        print m.group()

produces:

(1, 3)
array('H', [1, 1])
(4, 6)
array('H', [257, 257])

Now, scurrying back towards Gerrit's original point: this feature is
not documented, even for array.array('B', ...). Should it be left as a
happy accident of duck-typing, accessible only to those who stumble
over it, or should it be supported? Should it be included in Python 3?

Cheers,
John
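
The snippets above are Python 2. As a hedged sketch of a Python 3 equivalent: as far as I know, CPython's re accepts any bytes-like object with a bytes pattern, which covers array.array('B', ...). For symbols above 255, one portable workaround (an assumption of this example, not something proposed in the thread) is to map each integer to the code point with the same value and search the resulting str.

```python
import array
import re

data = [0, 1, 1, 2, 257, 257, 258]

# 8-bit case: array('B') exposes a byte buffer that a bytes pattern can scan.
small = array.array("B", [0, 1, 1, 2])
m = re.search(rb"\x01+", small)
byte_span = m.span()
byte_group = m.group()

# Wider symbols: one code point per integer, so match offsets in the
# str are also valid indices into the original list.
text = "".join(map(chr, data))
m = re.search("\u0101+", text)
wide_span = m.span()
wide_slice = data[wide_span[0]:wide_span[1]]
```

The chr mapping works for values up to 0x10FFFF, the largest code point a Python 3 str can hold.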
