PATCH: Speed up direct string concatenation by 20+%!

S

Steve Holden

Larry said:
Fredrik said:
so what does the benchmark look like if you actually do this ?


Okay, timing this:
x = ""
for i in range(100000):
x += "a"
t = x[1] # forces the concat object to render

The result:
Python 2.5 release: 30.0s
Python 2.5 locally built: 30.2s
Python 2.5 concat: 4.3s
Improvement: 600% (1/7 of the time)


/larry/
Interesting. Does a comparison also force it to render? If not, I wonder
if comparisons might not end up slower. It does sound like memory usage
might go through the roof with this technique under certain
circumstances, so the more extensive your tests are the more likely you
are to see the change actually used (I'm not convinced you'll persuade
the developers to include this).

Whether or not it does get incorporated I think it's a great
demonstration of how amenable Python is to experimentation with
alternate representations, and I think your project might make a very
interesting PyCon paper for people who were thinking about joining the
development effort but hadn't yet started.

regards
Steve
 
C

Colin J. Williams

Congratulations on the clear way in which you have set out your proposal.

I hope that it will make its way to PEPdom.

Colin W.

Larry said:
This is such a long posting that I've broken it out into sections.
Note that while developing this patch I discovered a Subtle Bug
in CPython, which I have discussed in its own section below.

____________
THE OVERVIEW

I don't remember where I picked it up, but I remember reading years
ago that the simple, obvious Python approach for string concatenation:
x = "a" + "b"
is slow. Obviously it's fine for the simple case above, but if you're
adding together twenty-five things:
x = a + b + c + d + ... + y
the performance is terrible, as the interpreter creates temporary
objects
for each intermediate result. (a + b, then (a + b) + c,
then ((a + b) + c) + d, and so on.)

The suggested idiom for fast string concatenation was this:
x = "".join([a, b, c, d, ... y])
This works fine. But I don't like this approach, not only because it's
kind of ugly and inobvious, but because it violates the "one obvious
way
to do it" principle.

I mentioned all this to a friend of mine (Mike Thornburgh), and told
him matter-of-factly that I'd like to fix this, but the design of
Python
simply precluded it. He said "no it doesn't!" and proceeded to
describe
in great detail an approach to permit fast string concatenation using
+.
I have implemented what he suggested--with only slight changes--and
present it to you below. I only had to touch four files, and only did
major surgery on one. The resulting Python passes the test suite as
well
as my unmodified locally-built Python 2.5. (I didn't install the
external
source trees for stuff like SQLite, so it fails those tests.)

Note that times have changed; using Python 2.5, a + b + c isn't *that*
much slower than "".join([]). But is is still slower. Or, was--with
my
patch, a + b + c becomes *the fastest method* for string concatenation.
Hooray!

(It didn't pay off as *much* as I'd hoped, but it's still an
improvement.)

_________
THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

More to the point, if you add multiple strings together (a + b + c),
it *doesn't* compute the intermediate strings (a + b).

Upsides to this approach:

* String concatenation using + is now the fastest way to
concatenate
strings (that I know of).

* Throw off the shackles of "".join([]), you don't need it
anymore.

* Did I mention it was faster?

Downsides to this approach:

* Changes how PyStringObjects are stored internally; ob_sval is
no
longer a char[1], but a char *. This makes each StringObject
four bytes larger.

* Adds another memory dereference in order to get the value of
a string, which is a teensy-weensy slowdown.

* Would force a recompile of all C modules that deal directly
with string objects (which I imagine is most of them).

* Also, *requires* that C modules use the PyString_AS_STRING()
macro, rather than casting the object and grabbing ob_sval
directly. (I was pleased to see that the Python source
was very good about using this macro; if all Python C
modules are this well-behaved, this point is happily moot.)

* On a related note, the file Mac/Modules/MacOS.c implies
that there are Mac-specific Python scripts that peer
directly into string objects. These would have to be
changed to understand the new semantics.

* String concatenation objects are 36 bytes larger than
string objects, and this space will often go unreclaimed
after the string is rendered.

* When rendered, string concatenation objects storing long
strings will allocate a second buffer from the heap to
store the string. So this adds some minor allocation
overhead (though this is offset by the speed gain from
the approach overall).

* Will definitely need some heavy review before it could
go in, in particular I worry I got the semantics surrounding
"interned" strings wrong.


Internally, a "string concatenation" object is the same as a "string"
object with two extra fields: an array of references to string objects
(and string concatenation objects), and a count. Also, ob_sstate now
stores a third flag, indicating whether or not the string is a string
concatenation object. External users don't need to worry about these
details, they just use PyString_AS_STRING() and all is well with their
world.

I've already added some optimizations to the approach:

* If you're adding a string to an existing string concatenation
object whose reference count is 1, who isn't full, and who
hasn't been rendered yet, it adds the string directly to the
existing concatenation object (rather than creating a new
one).
This works whether the existing object is on the right or the
left side.

* The general case is a + b + c ..., where you're only adding
strings to the *end*. This will build a degenerate tree
where
the first slot points to the child. The tree for
concatenating
the first twenty-two letters of the alphabet would look like
this:

[0 1 2 3 4 5 6 7]
| | | | | | | |
| v v v v v v v
| p q r s t u v
|
v
[0 1 2 3 4 5 6 7]
| | | | | | | |
| v v v v v v v
| i j k l m n o
|
v
[0 1 2 3 4 5 6 7]
| | | | | | | |
v v v v v v v v
a b c d e f g h

The rendering function is by necessity recursive. However,
it is optimized for this shape; when the tree looks like
this,
it will be rendered purely iteratively--no recursion.

* If, after rendering, the resulting string will be small
enough
to fit inside the extra space normally used by the
concatenation
object, the final string will be copied on top of this
otherwise
now-wasted space.

______________
THE BENCHMARKS

Benchmark 1:
def add(a, b, c, ... t): return a + b + c + ... + t
for i in range(10000000): add("aaa", "bbb", "ccc", ..., "ttt")
Benchmark 2:
for i in range(10000000): x = "aaa" + "bbb" + "ccc" + ... +
"ttt"
Benchmark 3:
for i in range(10000000): x = "".join(["aaa", "bbb", "ccc",
..., "ttt"])
Benchmark 4:
for i in range(10000000):
x = "aaa"
x += "bbb"
x += "ccc"
...
x += "ttt"
Benchmark 5:
for i in range(10000000):
array = []
array.append("aaa")
array.append("bbb")
array.append("ccc")
...
array.append("ttt")
x = "".join(array)
Benchmark 6:
for i in range(10000000):
x = cStringIO.StringIO()
x.write("aaa")
x.write("bbb")
x.write("ccc")
...
x.write("ttt")

Benchmark | 2.5 release | 2.5 locally built | 2.5 modified | %
improvement
1 | 32.1s | 32.8s | 26.7s |
22.8%
2 | 18.7s | 18.7s | 15.1s |
23.8%
3 | 16.1s | 16.2s | 16.6s |
-1.1%
4 | 64.6s | 64.0s | 57.1s |
12.1%
5 | 74.1s | 79.4s | 76.7s |
3.5%
6 | 126.0s | too slow, didn't bother!

______________
THE SUBTLE BUG

While developing this patch, I discovered a subtle bug in Python.
It happened only on one gruesome test in the middle of test_descr.py.
The problem was: '' == '' was returning False.

Starting on line 1116 of stringobject.c we have this:

if (op == Py_EQ) {
/* Supporting Py_NE here as well does not save
much time, since Py_NE is rarely used. */
if (a->ob_size == b->ob_size
&& (a->ob_sval[0] == b->ob_sval[0]
&& memcmp(a->ob_sval, b->ob_sval,
a->ob_size) == 0)) {
result = Py_True;
} else {
result = Py_False;
}
goto out;
}

The bug is with the "a->ob_sval[0] == b->ob_sval[0]". This is
apparently
a speed optimization; check that the first byte is the same before
proceeding with the memcmp(). But it does this even when *both* strings
are *zero length*! I doubt the Python source tree has an official
position on what the first byte of a zero-length string should be, but
I assert that it should be undefined. In practice, it's seemingly
always
set to the same value in the released Python, but with my changes it's
now possible for them to be different.

I think it's sinful to compare the first bytes of zero-length strings.
Therefore I suggest this bugfix:

if (a->ob_size == b->ob_size
/* >> change >> */ && ( (!a->ob_size || a->ob_sval[0] ==
b->ob_sval[0])
&& memcmp(a->ob_sval, b->ob_sval,

I've already incorporated this fix in my code.

__________
THE FUTURE

If this patch is accepted, it paves the way for another cheap
optimization: string slices could be done without copying.
Add another field into the PyString structure: a reference
to another PyString *. If non-NULL, it means the local ob_sval
actually points into the ob_sval owned by that other PyString *.
It'd be another four bytes, but we'd only need to incur that
when creating a slice; we could set a bit in ob_sstate indicating
"this is a slice", and if so we'd have to copy-on-write ob_sval,
and free the reference when the object was destroyed.

______________
THE SUBMISSION

I don't know the protocol from this point out; should I email the patch
somewhere? Put it up on a web page? Post some diff output? (Or just
forget about it entirely?)

Also, for what it's worth: I fixed the .dsp so pythoncore builds under
VC6 again. I would be happy to submit that separately. (What can I
say, old habits die hard. I can't stand the newer Microsoft IDEs.)


Cheers,


/larry/

p.s. No, I haven't looked at a Unicode port yet. If there's interest
in
this patch at all, I might take a stab at it.
 
L

Larry Hastings

An update: I have submitted this as a patch on SourceForge.
It's request ID #1569040.
http://sourceforge.net/tracker/?group_id=5470&atid=305470
I invite everyone to take it for a spin!

There are some improvements in this version. Specifically:

* Python will no longer crash if you do ten million prepends
( x = 'a' + x ). Since the problem was blowing the stack
with an incredibly deep render, I now limit the depth of
the string concatenation objects (currently set at 16384).
Note that string prepending is now *immensely* faster, as
prepending in the existing implementation is a worst-case.

* I figured out why my zero-length strings were occasionally
not zero terminated. It had to do with subclassing a string
and storing an attribute in the object, which meant storing
a dict, and where specifically the interpreter chose to store
that. The solution was essentially to ensure there's always
space in the object for the trailing zero.

When running regrtest.py, my patched version produces identical
output to a non-patched build on Windows.


Steve said:
Does a comparison also force it to render?

Yes. Any attempt to examine the string causes it to render.

It does sound like memory usage
might go through the roof with this technique under certain
circumstances, so the more extensive your tests are the more likely you
are to see the change actually used (I'm not convinced you'll persuade
the developers to include this).

Yeah, I expect memory usage to be higher too, but not by a fantastic
amount. Once you render the concatenation, it drops all the references
to the child objects held in the tree, and what's left is a string
object with some extra space on the end.

I think your project might make a very
interesting PyCon paper for people who were thinking about joining the
development effort but hadn't yet started.

Perhaps; I've never been to PyCon, but it might be fun to give a
presentation there. That said, it would be way more relevant if the
patch got accepted, don'tcha think?

Cheers,


/larry/

p.s. Thanks for the sentiment, Colin W.!
 
F

Fredrik Lundh

Larry said:
There are some improvements in this version. Specifically:

* Python will no longer crash if you do ten million prepends
( x = 'a' + x ). Since the problem was blowing the stack
with an incredibly deep render, I now limit the depth of
the string concatenation objects (currently set at 16384).
Note that string prepending is now *immensely* faster, as
prepending in the existing implementation is a worst-case.

You really should look up something called "ropes".

You should also benchmark this against code that uses the ordinary
append/join pattern. (you've posted conflicting benchmarks for 2.5,
but if I'm trusting the benchmarks that looks more reasonable, the
standard implementation pattern is still around 10 times faster than
your approach).
> Perhaps; I've never been to PyCon, but it might be fun to give a
> presentation there. That said, it would be way more relevant if the
> patch got accepted, don'tcha think?

It's rather unlikely that something like this will ever be added to
the 2.X series. It's pretty unlikely for 3.X as well (GvR used a
rope-like structure for ABC, and it was no fun for anyone), but it'll
most likely be a lot easier to provide this as an option for 3.X.

</F>
 
J

John Machin

Fredrik said:
You should also benchmark this against code that uses the ordinary
append/join pattern. (you've posted conflicting benchmarks for 2.5,
but if I'm trusting the benchmarks that looks more reasonable, the
standard implementation pattern is still around 10 times faster than
your approach).

And the OP might like to go a bit further, and as well as benchmarking
this style:
array = []
array.append("aaa")
array.append("bbb")
etc
try benchmarking this ... well "style" may not be the appropriate word
:)
array = []
arrapp = array.append
arrapp('aaa')
arrapp('bbb')
etc

Cheers,
John
 
A

Aahz

Perhaps; I've never been to PyCon, but it might be fun to give a
presentation there. That said, it would be way more relevant if the
patch got accepted, don'tcha think?

Not really. The principles involved are timeless, and I guarantee you a
large audience regardless of whether the patch gets accepted (provided
you specify an appropriate presentation title and summary).
 
L

Larry Hastings

Fredrik said:
You should also benchmark this against code that uses the ordinary
append/join pattern.

Sorry, thought I had. Of course, now that the patch is up on
Sourceforce you could download it and run all the benchmarks you like.

For all the benchmarks I ran below, the number listed is the best of
three runs. Time was computed using sum(os.times()[:2]).

Running this under Python 2.5 release:
x = []
for i in xrange(10000000):
x.append("a")
y = "".join(x)
took 4421ms.

Running this under my patched Python:
x = ""
for i in xrange(10000000):
x += "a"
y = x[1]
took 4406ms.

I assert that my code makes + as fast as the old "".join([]) idiom.

It's rather unlikely that something like this will ever be added to
the 2.X series. It's pretty unlikely for 3.X as well (GvR used a
rope-like structure for ABC, and it was no fun for anyone), but it'll
most likely be a lot easier to provide this as an option for 3.X.

I can't address the ABC implementation as I've never seen it. But my
patch only touches four files. Two are obviously stringobject.c and .h.
The other two files, ceval.c and codeobject.c, only got one-line
changes. My changes to PyStringObject are well-encapsulated; as long
as core / extension programmers continue to use PyString_AS_STRING() to
access the char * in a PyStringObject they will never notice the
difference.


John said:
try benchmarking this ... well "style" may not be the appropriate word

Running this under Python 2.5 release:
x = []
xappend = x.append
for i in xrange(10000000):
xappend("a")
y = "".join(x)
took 3281ms.

Running this under my patched Python 2.5:
x = ""
xappend = x.__add__
for i in xrange(10000000):
xappend("a")
y = "".join(x)
took 3343ms.


/larry/
 
J

John Machin

Larry said:
John said:
try benchmarking this ... well "style" may not be the appropriate word

Running this under Python 2.5 release:
x = []
xappend = x.append
for i in xrange(10000000):
xappend("a")
y = "".join(x)
took 3281ms.

Running this under my patched Python 2.5:
x = ""
xappend = x.__add__
for i in xrange(10000000):
xappend("a")
y = "".join(x)

Don't you mean y = x[1] or something like that? y = "".join(x) looks
like a copy-paste error.

Playing the devil's advocate here: 10M adds each of a single byte
followed by 1 render doesn't seem very typical. How about 10 adds each
of 10 bytes followed by a render, then repeat that 10 M times.

Another question:
How does this:
buff = ""
for datum in data:
if buff:
buff += "|"
buff += str(datum)
compare with
buff = "|".join(str(datum) for datum in data)
?

Some of us have Windows boxes and don't have the necessary MS compiler.
Is there any chance of someone making a 2.5+patch Windows binary?

Cheers,
John
 
L

Larry Hastings

John said:
Don't you mean y = x[1] or something like that? y = "".join(x) looks
like a copy-paste error.

You're right, by gum. Worse than that, my benchmark wasn't actually
*doing* much of anything there; at the end of the run x was still
length 0. That was sloppy, and I apologize.

I cannot find a speedy equivalent to the 'xappend = x.append' trick for
string concatenation using +. 'operator.__iadd__' is slower, a labored
attempt using 'x = x("a").__add__' is better but still slower. So far,
just calling 'x += "a"' is the fastest way, and that's 4.4s, still beat
by the 'xappend = x.append' trick at 3.2s.

Playing the devil's advocate here: 10M adds each of a single byte
followed by 1 render doesn't seem very typical. How about [...]
Some of us have Windows boxes and don't have the necessary MS compiler.
Is there any chance of someone making a 2.5+patch Windows binary?

I actually developed it on Windows. Here you go:
http://larryhastings.com/programming/lch.python.2.5.concat.zip
That contains the new python25.dll and my hacked-up benchmark script.
Just back up your existing python25.dll, then unzip to your Python
directory, and you're ready to rock.

Enjoy,


/larry/
 
L

Larry Hastings

Istvan said:
I remember a similar patch from some time ago:

That patch addressed the same general problem, but took a completely
different approach. For the record, this patch (#980695) optimized "x
+= a" by examining the next instruction after the addition. If it's a
store instruction, and the object being stored to holds a reference to
the left side of the concatenation, it makes the object drop the
reference *before* doing the concatenation. In the case where there
was only one reference to the string on the left side (that being "x"),
this allows the string concatenation function to use a very fast
shortcut: resize the left side and concatenate the right side directly
to the end.

The patch was accepted in August of 2004, so it might have been folded
in to CPython during the 2.3 era; certainly it was part of CPython 2.4.

And, happily, that optimization is complimentary to mine. In my patch,
when the left side only has one reference and it's a "string
concatenation object", it will often have space left in its internal
array of strings objects. In that case I just append the string.
Speeds things up immensely.

Cheers,


/larry/
 
T

Terry Reedy

"> Forgot to mention this, in case you haven't done so post your original
message/patch on the python-dev lists since that's where the decisions
are made. This group is more end-user oriented.

http://mail.python.org/pipermail/python-dev/2006-October/thread.html

It is often good to get community input first, as the OP has done. When he
wishes, the OP should submit his latest (not original) version of the patch
to SourceForge (not the list) and then optionally post a new message with a
link to the SF item for discussion.

tjr
 
N

Nicko

Larry said:
It's *slightly* slower for two:

def addTwoThings(a, b):
return a + b
for i in range(10000000):
x = addTwoThings("aaa", "bbb") ....
But starts paying off already, even with three:

def addThreeThings(a, b, c):
return a + b + c
for i in range(10000000):
x = addThreeThings("aaa", "bbb", "ccc")

I note that in both of those tests you didn't actually ever realise the
concatenated string. Can you give us figures for these tests having
forced the concatenated string to be computed?

Cheers,
Nicko
 
L

Larry Hastings

Nicko said:
I note that in both of those tests you didn't actually ever realise the
concatenated string. Can you give us figures for these tests having
forced the concatenated string to be computed?

Sure, good call. And bad news.

All these benchmarks were with functions taking N arguments that just
added (concatenated) the arguments and returned the result. I only ran
each benchmark once.

| Python 2.5 | Python 2.5
arguments | release | concat
-------------+--------------+--------------
2 | 7718ms | 9078ms
3 | 9203ms | 10500ms
4 | 10656ms | 11656ms
5 | 12156ms | 13390ms
20 | 29359ms | 26703ms

Interpret as you see fit,


/larry/
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,584
Members
45,075
Latest member
MakersCBDBloodSupport

Latest Threads

Top