RE Module Performance

  • Thread starter Devyn Collier Johnson

Devyn Collier Johnson

I recently saw an email on this mailing list about the RE module being
made slower. I no longer have that email. I have since looked through
the source of the RE module, but I did not see any code that would slow
down a script for no valid reason. Can anyone explain what that user
meant, or did I miss that part of the module?

Can the RE module be optimized in any way or at least the "re.sub" portion?

Mahalo,

Devyn Collier Johnson

wxjmfauth

On Friday 12 July 2013 01:44:05 UTC+2, Devyn Collier Johnson wrote:
[...]

----------

I would not care too much about the performance
of re.

With the new Flexible String Representation, you
can use a logarithmic scale to compare re results.
To be honest, there is improvement if you are an
ASCII user.

Am I the only one who tested this? Probably.

jmf
 

Chris Angelico

wxjmfauth wrote:
[...]
Am I the only one who tested this? Probably.

Am I the only one who thinks that Dihedral posted under jmf's name?

ChrisA
 

Chris Angelico

A bot posting as a troll to troll a different troll? Meta.

Yeah, it is. But the only other explanation is that jmf has become
rather more incomprehensible than usual. Normally I can understand
what he's complaining about well enough to refute it, but here I feel
like I'm responding to Dihedral.

ChrisA
 

Devyn Collier Johnson

Could you explain what you mean? What and where is the new Flexible
String Representation?

Devyn Collier Johnson
 

Joshua Landau

Could you explain what you mean? What and where is the new Flexible String
Representation?

Do not worry. jmf is on about his old rant comparing broken previous
versions of Python to newer ones which in some microbenchmarks are
slower. I don't really get why he spends his time on it.

If you're interested, the gist of it is that strings now use a
variable number of bytes to encode their values, depending on whether
values outside of the ASCII range (and certain wider ranges) are used,
as an optimisation.
 

Peter Otten

Joshua said:
[...]

See also <http://www.python.org/dev/peps/pep-0393/>
 

Chris Angelico

Could you explain what you mean? What and where is the new Flexible String
Representation?

(You're top-posting again. Please put your text underneath what you're
responding to - it helps maintain flow and structure.)

Python versions up to and including 3.2 came in two varieties: narrow
builds (commonly found on Windows) and wide builds (commonly found on
Linux). Narrow builds internally represented Unicode strings in
UTF-16, while wide builds used UTF-32. This is a problem, because it
means that taking a program from one to another actually changes its
behaviour:

Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> len(u"\U00012345")
1

Python 2.7.4 (default, Apr 6 2013, 19:54:46) [MSC v.1500 32 bit
(Intel)] on win32
>>> len(u"\U00012345")
2

In fact, the narrow builds are flat-out buggy, because you can put
something in as a single character that simply isn't a single
character. You can then pull that out as two characters and make a
huge mess of things:
>>> s = u"\U00012345"
>>> s[0]
u'\ud808'
>>> s[1]
u'\udf45'

*Any* string indexing on a narrow build will be broken if there is a
single character above U+FFFF ahead of it in the string.
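For contrast, on any CPython 3.3 or later build the same astral
character behaves as a single character everywhere; a quick check:

```python
# One astral-plane character is one character, not a surrogate pair,
# on every build of Python 3.3+ (PEP 393).
s = "\U00012345"
print(len(s))                 # 1
print(s[0] == "\U00012345")   # True
```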

Now, this problem is not unique to Python. Heaps of other languages
have the same issue, the same buggy behaviour, the same compromises.
What's special about Python is that it actually managed to come back
from that problem. (Google's V8 JavaScript engine, for instance, is
stuck with it, because the ECMAScript specification demands UTF-16. I
asked on an ECMAScript list and was told "can't change that, it'd
break code". So it's staying buggy.)

There are a number of languages that take the Texan RAM-guzzling
approach of storing all strings in UTF-32; Python (since version 3.3)
is among a *very* small number of languages that store strings in
multiple different ways according to their content. That's described
in PEP 393 [1], titled "Flexible String Representation". It details a
means whereby a Python string will be represented in, effectively,
UTF-32 with some of the leading zero bytes elided. Or if you prefer,
in either Latin-1, UCS-2, or UCS-4, whichever's the smallest it can
fit in. The difference between a string stored one-byte-per-character
and a string stored four-bytes-per-character is almost invisible to a
Python script; you can find out by checking the string's memory usage,
but otherwise you don't need to worry about it.

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
32 bit (Intel)] on win32

The sizes below are sys.getsizeof() results for the string, in bytes:

41

Adding another character adds another 1 byte. (There's quite a bit of
overhead for small strings - GC headers and such - but it gets dwarfed
by the actual content after a while.)

70

Two bytes to add another character.

104

Four bytes. It uses only what it needs.
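This width-dependent sizing is easy to check directly; a small sketch
(exact totals vary by platform and Python version, but the
per-character growth of 1, 2, or 4 bytes is the point):

```python
import sys

# Per-character cost depends on the widest character in the string.
ascii_s  = "A" * 100           # fits in 1 byte per character (Latin-1)
bmp_s    = "\u1000" * 100      # needs 2 bytes per character (UCS-2)
astral_s = "\U00012345" * 100  # needs 4 bytes per character (UCS-4)

for s in (ascii_s, bmp_s, astral_s):
    # Marginal cost of appending one more character of the same width:
    print(sys.getsizeof(s + s[0]) - sys.getsizeof(s))
# On CPython 3.3+ this prints 1, then 2, then 4.
```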

Strings in Python are immutable, so there's no need to worry about
up-grading or down-grading a string; there are a few optimizations
that can't be done, but they're fairly trivial. Look, I'll pull a jmf
and find a microbenchmark that makes 3.3 look worse:

2.7.4:
>>> timeit.repeat('a=u"A"*100; a+=u"\u1000"')
[0.8175005482540385, 0.789617954237201, 0.8152240019332098]
>>> timeit.repeat('a=u"A"*100; a+=u"a"')
[0.8088905154146744, 0.8123691698246631, 0.8172558244134365]

3.3.0:
>>> timeit.repeat('a=u"A"*100; a+=u"\u1000"')
[0.9623714745976031, 0.970628669281723, 0.9696310564468149]
>>> timeit.repeat('a=u"A"*100; a+=u"a"')
[0.7017891938739922, 0.7024725209339522, 0.6989539173082449]

See? It's clearly worse on the newer Python! But actually, this is an
extremely unusual situation, and 3.3 outperforms 2.7 on the more
common case (where the two strings are of the same width).

Python's PEP 393 strings are following the same sort of model as the
native string type in a semantically-similar but
syntactically-different language, Pike. In Pike (also free software,
like Python), the string type can be indexed character by character,
and each character can be anything in the Unicode range; and just as
in Python 3.3, memory usage goes up by just one byte if every
character in the string fits inside 8 bits. So it's not as if this is
an untested notion; Pike has been running like this for years (I don't
know how long it's had this functionality, but it won't be more than
18 years as Unicode didn't have multiple planes until 1996), and
performance has been *just fine* for all that time. Pike tends to be
run on servers, so memory usage and computation speed translate fairly
directly into TPS. And there are some sizeable commercial entities
using and developing Pike, so if the flexible string representation
had turned out to be a flop, someone would have put in the coding time
to rewrite it by now.

And yet, despite all these excellent reasons for moving to this way of
doing strings, jmf still sees his microbenchmarks as more important,
and so he jumps in on threads like this to whine about how Python 3.3
is somehow US-centric because it more efficiently handles the entire
Unicode range. I'd really like to take some highlights from Python and
Pike and start recommending that other languages take up the ideas,
but to be honest, I hesitate to inflict jmf on them all. ECMAScript
may have the right idea after all - stay with UTF-16 and avoid
answering jmf's stupid objections every week.

[1] http://www.python.org/dev/peps/pep-0393/

ChrisA
 

Devyn Collier Johnson

Chris Angelico wrote:
[...]
Thanks for the thorough response. I learned a lot. You should write
articles on Python.
I plan to spend some time optimizing the re.py module for Unix systems.
I would love to amp up my programs that use that module.

Devyn Collier Johnson
 

Tim Delaney

Thanks for the thorough response. I learned a lot. You should write
articles on Python.
I plan to spend some time optimizing the re.py module for Unix systems. I
would love to amp up my programs that use that module.


If you are finding that regular expressions are taking too much time, have
a look at the https://pypi.python.org/pypi/re2/ and
https://pypi.python.org/pypi/regex/2013-06-26 modules to see if they
already give you enough of a speedup.
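One low-risk way to experiment is to try the third-party `regex`
package first and fall back to the stdlib if it isn't installed; the
`re_mod` alias below is just an illustrative name:

```python
# The third-party `regex` module (pip install regex) is API-compatible
# with the stdlib `re`; fall back to `re` if it's missing.
try:
    import regex as re_mod
except ImportError:
    import re as re_mod

# Either module handles the same patterns identically.
print(re_mod.sub(r"\s+", " ", "too   many    spaces"))  # too many spaces
```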

Tim Delaney
 

Michael Torrie

If you're interested, the gist of it is that strings now use a
variable number of bytes to encode their values, depending on whether
values outside of the ASCII range (and certain wider ranges) are used,
as an optimisation.

"Variable number of bytes" is a problematic way of saying it. UTF-8 is
a variable-width encoding scheme where each character can be 1 to 4
bytes, depending on the Unicode character. As you can imagine, this
sort of encoding scheme would be very slow to do slicing with (looking
up a character at a certain position). Python uses fixed-width
encoding schemes, so it preserves O(1) lookup speed, but Python will
use 1, 2, or 4 bytes for every character in the string, depending on
what is needed. Just in case the OP might have misunderstood what you
are saying.
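The contrast is easy to demonstrate by encoding individual characters;
a small sketch:

```python
# UTF-8 is variable-width: the byte length depends on the character.
# The str itself always reports a length of 1 character.
for ch in ("A", "é", "\u1000", "\U00012345"):
    print(len(ch), len(ch.encode("utf-8")))
# Each line shows 1 character, taking 1, 2, 3, and 4 UTF-8 bytes
# respectively.
```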

jmf sees the case where a string is promoted from one width to another
and thinks that the brief slowdown in string operations to accomplish
this is a problem. In reality I have never seen anyone use the kinds
of string operations his pseudo-benchmarks use, and in general Python
3's string handling is pretty fast - and apparently much more correct
than it would be if jmf's ideas of Unicode were implemented.
 

MRAB

On 13 July 2013 03:58, Devyn Collier Johnson <[email protected]
[...]
If you are finding that regular expressions are taking too much time,
have a look at the https://pypi.python.org/pypi/re2/ and
https://pypi.python.org/pypi/regex/2013-06-26 modules to see if they
already give you enough of a speedup.
FYI, you're better off going to http://pypi.python.org/pypi/regex
because that will take you to the latest version.
 

88888 Dihedral

I plan to spend some time optimizing the re.py module for Unix systems.
I would love to amp up my programs that use that module.

In my experience, often the best way to optimize a regex is to not use
it at all.

[steve@ando ~]$ python -m timeit -s "import re" \
-s "data = 'a'*100+'b'" \
"if re.search('b', data): pass"
100000 loops, best of 3: 2.77 usec per loop

[steve@ando ~]$ python -m timeit -s "data = 'a'*100+'b'" \
"if 'b' in data: pass"
1000000 loops, best of 3: 0.219 usec per loop

In Python, we often use plain string operations instead of regex-based
solutions for basic tasks. Regexes are a 10lb sledge hammer. Don't use
them for cracking peanuts.

OK, let's talk about the indexed search algorithms of
a character stream or string which can be buffered and
indexed randomly for RW operations but faster in sequential
block RW operations after some pre-processing.

This was solved a long time ago in the suffix array or
suffix tree literature and summarized in the famous BWT paper in 199X.

Do we want volunteers to speed up
search operations in the string module in Python?
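For the curious, a suffix array really does allow fast repeated
substring searches over one fixed text; a minimal sketch (names are
illustrative, and the construction is the naive O(n² log n) one):

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions lexicographically.
    # O(n^2 log n) worst case - fine for illustration, not production.
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text, sa, pattern):
    # Binary-search the sorted suffixes for one that starts with `pattern`.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern

text = "a" * 100 + "b"
sa = build_suffix_array(text)
print(contains(text, sa, "ab"))  # True
```

After the one-time sort, each lookup costs O(m log n) string
comparisons; Python's plain `in` operator instead just scans, which is
why it needs no preprocessing at all.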
 

Devyn Collier Johnson

88888 Dihedral wrote:
[...]
Do we want volunteers to speed up
search operations in the string module in Python?
It would be nice if someone could speed it up.
 

Steven D'Aprano

On 07/14/2013 02:17 PM, 88888 Dihedral wrote: [...]
Do we want volunteers to speed up
search operations in the string module in Python?

It would be nice if someone could speed it up.

Devyn,

88888 Dihedral is our resident bot, not a human being. Nobody knows who
controls it, and why they are running it, but we are pretty certain that
it is a bot responding mechanically to keywords in people's posts.

It's a very clever bot, but still a bot. About one post in four is
meaningless jargon, the other three are relevant enough to fool people
into thinking that maybe it is a human being. It had me fooled for a long
time.
 

Devyn Collier Johnson

Steven D'Aprano wrote:
[...]
It's a very clever bot, but still a bot.
Wow! Our mailing list has a pet bot. I bet other mailing lists are so
jealous of us. Whoever created Dihedral is a genius!

Artificial Intelligence developers put chatbots on mailing lists so
that the program can learn. I use Python3 to program AI applications.
If you look at my Launchpad account, you will see my two AI projects -
Neobot and Novabot. (https://launchpad.net/neobot Neo and Nova are
still unstable) AI developers let their bots loose on the Internet to
learn from people. Dihedral is learning from us. Dihedral only
responds when it feels it has sufficient knowledge of the topic.
Chatbots want to appear human. That is their goal. We should feel
honored that Dihedral's botmaster feels that this mailing list would
benefit the development of Dihedral's knowledge.

Devyn Collier Johnson
 

Joel Goldstick

On Mon, Jul 15, 2013 at 8:52 AM, Devyn Collier Johnson wrote:
[...]

I particularly enjoy the misspellings, which seem to be such a human
quality in email messages!
 

Wayne Werner

Devyn Collier Johnson wrote:
[...]
We should feel honored that Dihedral's botmaster feels that this
mailing list would benefit the development of Dihedral's knowledge.

Are *you* a bot? ~_^

That post felt surprisingly like Dihedral...

-W
 
