RE Module Performance

  • Thread starter Devyn Collier Johnson

Devyn Collier Johnson

I recently saw an email on this mailing list about the RE module being
made slower. I no longer have that email. I have since looked through
the source of the RE module, but I did not see any code that would slow
down a script for no valid reason. Can anyone explain what that user
meant, or did I miss that part of the module?

Can the RE module be optimized in any way or at least the "re.sub" portion?

Mahalo,

Devyn Collier Johnson

wxjmfauth

On Friday 12 July 2013 01:44:05 UTC+2, Devyn Collier Johnson wrote:
[...]

----------

I would not care too much about the performance
of re.

With the new Flexible String Representation, you
can use a logarithmic scale to compare re results.
To be honest, there is improvement if you are an
ASCII user.

Am I the only one who tested this? Probably.

jmf
 

Chris Angelico

wxjmfauth wrote:
[...]
Am I the only one who tested this? Probably.

Am I the only one who thinks that Dihedral posted under jmf's name?

ChrisA
 

Chris Angelico

A bot posting as a troll to troll a different troll? Meta.

Yeah, it is. But the only other explanation is that jmf has become
rather more incomprehensible than usual. Normally I can understand
what he's complaining about well enough to refute it, but here I feel
like I'm responding to Dihedral.

ChrisA
 

Devyn Collier Johnson

Could you explain what you mean? What and where is the new Flexible
String Representation?

Devyn Collier Johnson
 

Joshua Landau

Could you explain what you mean? What and where is the new Flexible String
Representation?

Do not worry. jmf is on about his old rant comparing broken previous
versions of Python to newer ones which in some microbenchmarks are
slower. I don't really get why he spends his time on it.

If you're interested, the gist of it is that strings now use a
variable number of bytes to encode their values, depending on whether
values outside of the ASCII range (and certain wider ranges) are used,
as an optimisation.
 

Peter Otten

Joshua said:
[...]

See also <http://www.python.org/dev/peps/pep-0393/>
 

Chris Angelico

Could you explain what you mean? What and where is the new Flexible String
Representation?

(You're top-posting again. Please put your text underneath what you're
responding to - it helps maintain flow and structure.)

Python versions up to and including 3.2 came in two varieties: narrow
builds (commonly found on Windows) and wide builds (commonly found on
Linux). Narrow builds internally represented Unicode strings in
UTF-16, while wide builds used UTF-32. This is a problem, because it
means that taking a program from one to another actually changes its
behaviour:

Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> len(u"\U00012345")
1

Python 2.7.4 (default, Apr 6 2013, 19:54:46) [MSC v.1500 32 bit
(Intel)] on win32
>>> len(u"\U00012345")
2

In fact, the narrow builds are flat-out buggy, because you can put
something in as a single character that simply isn't a single
character. You can then pull that out as two characters and make a
huge mess of things:
>>> s = u"\U00012345"
>>> s[0]
u'\ud808'
>>> s[1]
u'\udf45'

*Any* string indexing on a narrow build will be broken if there is a
single character above U+FFFF ahead of it in the string.
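For contrast, on any CPython 3.3 or later build the same astral
character behaves as a single character everywhere; a quick check:

```python
# One astral-plane character is one character, not a surrogate pair,
# on every build of Python 3.3+ (PEP 393).
s = "\U00012345"
print(len(s))                 # 1
print(s[0] == "\U00012345")   # True
```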

Now, this problem is not unique to Python. Heaps of other languages
have the same issue, the same buggy behaviour, the same compromises.
What's special about Python is that it actually managed to come back
from that problem. (Google's V8 JavaScript engine, for instance, is
stuck with it, because the ECMAScript specification demands UTF-16. I
asked on an ECMAScript list and was told "can't change that, it'd
break code". So it's staying buggy.)

There are a number of languages that take the Texan RAM-guzzling
approach of storing all strings in UTF-32; Python (since version 3.3)
is among a *very* small number of languages that store strings in
multiple different ways according to their content. That's described
in PEP 393 [1], titled "Flexible String Representation". It details a
means whereby a Python string will be represented in, effectively,
UTF-32 with some of the leading zero bytes elided. Or if you prefer,
in either Latin-1, UCS-2, or UCS-4, whichever's the smallest it can
fit in. The difference between a string stored one-byte-per-character
and a string stored four-bytes-per-character is almost invisible to a
Python script; you can find out by checking the string's memory usage,
but otherwise you don't need to worry about it.

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
32 bit (Intel)] on win32

The sizes below are sys.getsizeof() results for the string, in bytes:

41

Adding another character adds another 1 byte. (There's quite a bit of
overhead for small strings - GC headers and such - but it gets dwarfed
by the actual content after a while.)

70

Two bytes to add another character.

104

Four bytes. It uses only what it needs.
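This width-dependent sizing is easy to check directly; a small sketch
(exact totals vary by platform and Python version, but the
per-character growth of 1, 2, or 4 bytes is the point):

```python
import sys

# Per-character cost depends on the widest character in the string.
ascii_s  = "A" * 100           # fits in 1 byte per character (Latin-1)
bmp_s    = "\u1000" * 100      # needs 2 bytes per character (UCS-2)
astral_s = "\U00012345" * 100  # needs 4 bytes per character (UCS-4)

for s in (ascii_s, bmp_s, astral_s):
    # Marginal cost of appending one more character of the same width:
    print(sys.getsizeof(s + s[0]) - sys.getsizeof(s))
# On CPython 3.3+ this prints 1, then 2, then 4.
```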

Strings in Python are immutable, so there's no need to worry about
up-grading or down-grading a string; there are a few optimizations
that can't be done, but they're fairly trivial. Look, I'll pull a jmf
and find a microbenchmark that makes 3.3 look worse:

2.7.4:
>>> timeit.repeat('a=u"A"*100; a+=u"\u1000"')
[0.8175005482540385, 0.789617954237201, 0.8152240019332098]
>>> timeit.repeat('a=u"A"*100; a+=u"a"')
[0.8088905154146744, 0.8123691698246631, 0.8172558244134365]

3.3.0:
>>> timeit.repeat('a=u"A"*100; a+=u"\u1000"')
[0.9623714745976031, 0.970628669281723, 0.9696310564468149]
>>> timeit.repeat('a=u"A"*100; a+=u"a"')
[0.7017891938739922, 0.7024725209339522, 0.6989539173082449]

See? It's clearly worse on the newer Python! But actually, this is an
extremely unusual situation, and 3.3 outperforms 2.7 on the more
common case (where the two strings are of the same width).

Python's PEP 393 strings are following the same sort of model as the
native string type in a semantically-similar but
syntactically-different language, Pike. In Pike (also free software,
like Python), the string type can be indexed character by character,
and each character can be anything in the Unicode range; and just as
in Python 3.3, memory usage goes up by just one byte if every
character in the string fits inside 8 bits. So it's not as if this is
an untested notion; Pike has been running like this for years (I don't
know how long it's had this functionality, but it won't be more than
18 years as Unicode didn't have multiple planes until 1996), and
performance has been *just fine* for all that time. Pike tends to be
run on servers, so memory usage and computation speed translate fairly
directly into TPS. And there are some sizeable commercial entities
using and developing Pike, so if the flexible string representation
had turned out to be a flop, someone would have put in the coding time
to rewrite it by now.

And yet, despite all these excellent reasons for moving to this way of
doing strings, jmf still sees his microbenchmarks as more important,
and so he jumps in on threads like this to whine about how Python 3.3
is somehow US-centric because it more efficiently handles the entire
Unicode range. I'd really like to take some highlights from Python and
Pike and start recommending that other languages take up the ideas,
but to be honest, I hesitate to inflict jmf on them all. ECMAScript
may have the right idea after all - stay with UTF-16 and avoid
answering jmf's stupid objections every week.

[1] http://www.python.org/dev/peps/pep-0393/

ChrisA
 

Devyn Collier Johnson

Chris Angelico wrote:
[...]
Thanks for the thorough response. I learned a lot. You should write
articles on Python.
I plan to spend some time optimizing the re.py module for Unix systems.
I would love to amp up my programs that use that module.

Devyn Collier Johnson
 

Tim Delaney

Thanks for the thorough response. I learned a lot. You should write
articles on Python.
I plan to spend some time optimizing the re.py module for Unix systems. I
would love to amp up my programs that use that module.


If you are finding that regular expressions are taking too much time, have
a look at the https://pypi.python.org/pypi/re2/ and
https://pypi.python.org/pypi/regex/2013-06-26 modules to see if they
already give you enough of a speedup.
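One low-risk way to experiment is to try the third-party `regex`
package first and fall back to the stdlib if it isn't installed; the
`re_mod` alias below is just an illustrative name:

```python
# The third-party `regex` module (pip install regex) is API-compatible
# with the stdlib `re`; fall back to `re` if it's missing.
try:
    import regex as re_mod
except ImportError:
    import re as re_mod

# Either module handles the same patterns identically.
print(re_mod.sub(r"\s+", " ", "too   many    spaces"))  # too many spaces
```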

Tim Delaney
 

Michael Torrie

If you're interested, the gist of it is that strings now use a
variable number of bytes to encode their values, depending on whether
values outside of the ASCII range (and certain wider ranges) are used,
as an optimisation.

"Variable number of bytes" is a problematic way of saying it. UTF-8 is
a variable-width encoding scheme where each character can be 1 to 4
bytes, depending on the Unicode character. As you can imagine, this
sort of encoding scheme would be very slow to do slicing with (looking
up a character at a certain position). Python uses fixed-width
encoding schemes, so it preserves O(1) lookup speed, but Python will
use 1, 2, or 4 bytes for every character in the string, depending on
what is needed. Just in case the OP might have misunderstood what you
are saying.
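The contrast is easy to demonstrate by encoding individual characters;
a small sketch:

```python
# UTF-8 is variable-width: the byte length depends on the character.
# The str itself always reports a length of 1 character.
for ch in ("A", "é", "\u1000", "\U00012345"):
    print(len(ch), len(ch.encode("utf-8")))
# Each line shows 1 character, taking 1, 2, 3, and 4 UTF-8 bytes
# respectively.
```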

jmf sees the case where a string is promoted from one width to another
and thinks that the brief slowdown in string operations to accomplish
this is a problem. In reality I have never seen anyone use the kinds
of string operations his pseudo-benchmarks use, and in general Python
3's string handling is pretty fast - and apparently much more correct
than it would be if jmf's ideas of Unicode were implemented.
 

MRAB

On 13 July 2013 03:58, Devyn Collier Johnson <[email protected]
[...]
If you are finding that regular expressions are taking too much time,
have a look at the https://pypi.python.org/pypi/re2/ and
https://pypi.python.org/pypi/regex/2013-06-26 modules to see if they
already give you enough of a speedup.
FYI, you're better off going to http://pypi.python.org/pypi/regex
because that will take you to the latest version.
 

88888 Dihedral

I plan to spend some time optimizing the re.py module for Unix systems.
I would love to amp up my programs that use that module.

In my experience, often the best way to optimize a regex is to not use
it at all.

[steve@ando ~]$ python -m timeit -s "import re" \
-s "data = 'a'*100+'b'" \
"if re.search('b', data): pass"
100000 loops, best of 3: 2.77 usec per loop

[steve@ando ~]$ python -m timeit -s "data = 'a'*100+'b'" \
"if 'b' in data: pass"
1000000 loops, best of 3: 0.219 usec per loop

In Python, we often use plain string operations instead of regex-based
solutions for basic tasks. Regexes are a 10lb sledge hammer. Don't use
them for cracking peanuts.

OK, let's talk about the indexed search algorithms of
a character stream or string which can be buffered and
indexed randomly for RW operations but faster in sequential
block RW operations after some pre-processing.

This was solved a long time ago in the suffix array or
suffix tree literature and summarized in the famous BWT paper in 199X.

Do we want volunteers to speed up
search operations in the string module in Python?
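For the curious, a suffix array really does allow fast repeated
substring searches over one fixed text; a minimal sketch (names are
illustrative, and the construction is the naive O(n² log n) one):

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions lexicographically.
    # O(n^2 log n) worst case - fine for illustration, not production.
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text, sa, pattern):
    # Binary-search the sorted suffixes for one that starts with `pattern`.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern

text = "a" * 100 + "b"
sa = build_suffix_array(text)
print(contains(text, sa, "ab"))  # True
```

After the one-time sort, each lookup costs O(m log n) string
comparisons; Python's plain `in` operator instead just scans, which is
why it needs no preprocessing at all.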
 

Devyn Collier Johnson

88888 Dihedral wrote:
[...]
Do we want volunteers to speed up
search operations in the string module in Python?
It would be nice if someone could speed it up.
 

Steven D'Aprano

On 07/14/2013 02:17 PM, 88888 Dihedral wrote: [...]
Do we want volunteers to speed up
search operations in the string module in Python?

It would be nice if someone could speed it up.

Devyn,

88888 Dihedral is our resident bot, not a human being. Nobody knows who
controls it, and why they are running it, but we are pretty certain that
it is a bot responding mechanically to keywords in people's posts.

It's a very clever bot, but still a bot. About one post in four is
meaningless jargon, the other three are relevant enough to fool people
into thinking that maybe it is a human being. It had me fooled for a long
time.
 

Devyn Collier Johnson

Steven D'Aprano wrote:
[...]
It's a very clever bot, but still a bot.
Wow! Our mailing list has a pet bot. I bet other mailing lists are so
jealous of us. Whoever created Dihedral is a genius!

Artificial Intelligence developers put chatbots on mailing lists so
that the program can learn. I use Python3 to program AI applications.
If you look at my Launchpad account, you will see my two AI projects -
Neobot and Novabot. (https://launchpad.net/neobot Neo and Nova are
still unstable) AI developers let their bots loose on the Internet to
learn from people. Dihedral is learning from us. Dihedral only
responds when it feels it has sufficient knowledge of the topic.
Chatbots want to appear human. That is their goal. We should feel
honored that Dihedral's botmaster feels that this mailing list would
benefit the development of Dihedral's knowledge.

Devyn Collier Johnson
 

Joel Goldstick

On Mon, Jul 15, 2013 at 8:52 AM, Devyn Collier Johnson wrote:
[...]

I particularly enjoy the misspellings, which seem to be such a human
quality in email messages!
 

Wayne Werner

Devyn Collier Johnson wrote:
[...]
We should feel honored that Dihedral's botmaster feels that this
mailing list would benefit the development of Dihedral's knowledge.

Are *you* a bot? ~_^

That post felt surprisingly like Dihedral...

-W
 
