StringChain -- a data structure for managing large sequences of chunks of bytes

Zooko O'Whielacronx

Folks:

Every couple of years I run into a problem where some Python code that
worked well at small scales starts burning up my CPU at larger scales,
and the underlying issue turns out to be the idiom of accumulating
data by string concatenation. It just happened again
(http://foolscap.lothar.com/trac/ticket/149 ), and as usual it is hard
to make the data accumulator efficient without introducing a bunch of
bugs into the surrounding code. So this time around I decided to try
encapsulating my preferred more efficient idiom into a reusable class.

So I present to you StringChain, which is an efficient way to
accumulate and process data in many chunks:

http://tahoe-lafs.org/trac/stringchain
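
For reference, the problematic string-based idiom looks roughly like
this (a minimal sketch with made-up names, not the actual foolscap
code):

class NaiveBuffer(object):
    def __init__(self):
        self.buf = ""

    def append(self, newdata):
        # Each += may copy the entire accumulated buffer.
        self.buf += newdata

    def popleft(self, numbytes):
        # Slicing off a leading prefix copies the remainder as well.
        prefix, self.buf = self.buf[:numbytes], self.buf[numbytes:]
        return prefix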

Here are some benchmarks generated by running python -OOu -c 'from
stringchain.bench import bench; bench.quick_bench()' as instructed by
the README.txt file.

The N: in the left-hand column is how many bytes were in the test
dataset. The ave rate: number in the right-hand column is how many
bytes per second were processed. "naive" means the string-based idiom
sketched above and "strch" means using the StringChain class.

_buildup init_naive
N: 65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 890, ave
rate: 58350579
N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 265, ave
rate: 34800398
N: 262144, time: best: 0.01, 2th-best: 0.01, ave: 0.01,
2th-worst: 0.01, worst: 0.01 (of 5), reps/s: 79, ave
rate: 20745346
N: 524288, time: best: 0.05, 2th-best: 0.05, ave: 0.05,
2th-worst: 0.05, worst: 0.05 (of 5), reps/s: 20, ave
rate: 10823850
_buildup init_strch
N: 65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 25543, ave
rate: 1674043282
N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 14179, ave
rate: 1858538925
N: 262144, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 8016, ave
rate: 2101513050
N: 524288, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 4108, ave
rate: 2154215572
_consume init_naive_loaded
N: 65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 931, ave
rate: 61037862
N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 270, ave
rate: 35454393
N: 262144, time: best: 0.01, 2th-best: 0.01, ave: 0.01,
2th-worst: 0.01, worst: 0.01 (of 5), reps/s: 74, ave
rate: 19471963
N: 524288, time: best: 0.05, 2th-best: 0.05, ave: 0.05,
2th-worst: 0.05, worst: 0.06 (of 5), reps/s: 19, ave
rate: 10146747
_consume init_strch_loaded
N: 65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 4309, ave
rate: 282447500
N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 2313, ave
rate: 303263357
N: 262144, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 1186, ave
rate: 311159052
N: 524288, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 606, ave
rate: 317814669
_randomy init_naive
N: 65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 479, ave
rate: 31450561
N: 131072, time: best: 0.01, 2th-best: 0.01, ave: 0.01,
2th-worst: 0.01, worst: 0.01 (of 5), reps/s: 140, ave
rate: 18461191
N: 262144, time: best: 0.02, 2th-best: 0.02, ave: 0.02,
2th-worst: 0.03, worst: 0.03 (of 5), reps/s: 42, ave
rate: 11127714
N: 524288, time: best: 0.06, 2th-best: 0.07, ave: 0.08,
2th-worst: 0.08, worst: 0.09 (of 5), reps/s: 13, ave
rate: 6906341
_randomy init_strch
N: 65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 973, ave
rate: 63827127
N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 495, ave
rate: 64970669
N: 262144, time: best: 0.00, 2th-best: 0.00, ave: 0.00,
2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 239, ave
rate: 62913360
N: 524288, time: best: 0.01, 2th-best: 0.01, ave: 0.01,
2th-worst: 0.01, worst: 0.01 (of 5), reps/s: 121, ave
rate: 63811569

The naive approach is slower than the StringChain class, and it gets
slower as the dataset grows. The StringChain class is fast, and its
throughput stays roughly constant as the dataset grows (with regard to
these benchmarks, at least).

Thanks!

Regards,

Zooko
 
Paul Rubin

Zooko O'Whielacronx said:
Every couple of years I run into a problem where some Python code that
worked well at small scales starts burning up my CPU at larger scales,
and the underlying issue turns out to be the idiom of accumulating
data by string concatenation.

I usually use StringIO or cStringIO for that (python 2.x syntax):

import cStringIO

buf = cStringIO.StringIO()
buf.write("first thing")
buf.write("second thing")
result = buf.getvalue()

Sometimes I like to use a generator instead:

def stuff():
    yield "first thing"
    yield "second thing"

result = ''.join(stuff())
 
Steven D'Aprano

Folks:

Every couple of years I run into a problem where some Python code that
worked well at small scales starts burning up my CPU at larger scales,
and the underlying issue turns out to be the idiom of accumulating data
by string concatenation.

I don't mean to discourage you, but the simple way to avoid that is not
to accumulate data by string concatenation.

The usual Python idiom is to append substrings to a list, then once, at
the very end, combine into a single string:


accumulator = []
for item in sequence:
    accumulator.append(process(item))
string = ''.join(accumulator)

It just happened again
(http://foolscap.lothar.com/trac/ticket/149 ), and as usual it is hard
to make the data accumulator efficient without introducing a bunch of
bugs into the surrounding code.

I'm sorry, I don't agree about that at all. I've never come across a
situation where I wanted to use string concatenation and couldn't easily
modify it to use the list idiom above.

[...]
Here are some benchmarks generated by running python -OOu -c 'from
stringchain.bench import bench; bench.quick_bench()' as instructed by
the README.txt file.

To be taken seriously, I think you need to compare stringchain to the
list idiom. If your benchmarks favourably compare to that, then it might
be worthwhile.
 
MRAB

Steven said:
Folks:

Every couple of years I run into a problem where some Python code that
worked well at small scales starts burning up my CPU at larger scales,
and the underlying issue turns out to be the idiom of accumulating data
by string concatenation.

I don't mean to discourage you, but the simple way to avoid that is not
to accumulate data by string concatenation.

The usual Python idiom is to append substrings to a list, then once, at
the very end, combine into a single string:


accumulator = []
for item in sequence:
    accumulator.append(process(item))
string = ''.join(accumulator)

It just happened again
(http://foolscap.lothar.com/trac/ticket/149 ), and as usual it is hard
to make the data accumulator efficient without introducing a bunch of
bugs into the surrounding code.

I'm sorry, I don't agree about that at all. I've never come across a
situation where I wanted to use string concatenation and couldn't easily
modify it to use the list idiom above.

[...]
Here are some benchmarks generated by running python -OOu -c 'from
stringchain.bench import bench; bench.quick_bench()' as instructed by
the README.txt file.

To be taken seriously, I think you need to compare stringchain to the
list idiom. If your benchmarks favourably compare to that, then it might
be worthwhile.
IIRC, someone did some work on making concatenation faster by delaying
it until a certain threshold had been reached (in the string class
implementation).
 
Steven D'Aprano

IIRC, someone did some work on making concatenation faster by delaying
it until a certain threshold had been reached (in the string class
implementation).

I believe you're talking about this patch:

http://bugs.python.org/issue1569040

It's never been formally rejected, but the chances of it being accepted
are pretty low.


However, in Python 2.4 another optimization was added that makes string
concatenation of the form:

a = a + b
a += b

much faster. This is implementation specific (Jython and IronPython don't
have it, and almost certainly won't) and it doesn't work for (e.g.):

a = b + a

See here:

http://mail.python.org/pipermail/python-dev/2004-August/046686.html
http://bugs.python.org/issue980695
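
A rough way to see this effect under CPython (just a probe, not a
rigorous benchmark; timings will vary by version and platform):

import timeit

# Appending to a local name can use the in-place concatenation
# optimization under CPython...
inplace = timeit.Timer("s += 'x' * 100", "s = ''").timeit(5000)
# ...but prepending always builds a brand new string each time.
prepend = timeit.Timer("s = 'x' * 100 + s", "s = ''").timeit(5000)

print("a += b    %.4f seconds" % inplace)
print("a = b + a %.4f seconds" % prepend)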
 
Zooko O'Whielacronx

Folks:

I failed to make something sufficiently clear in my original message
about StringChain. The use case that I am talking about is not simply
that you need to accumulate a sequence of incoming chunks of data,
concatenate them together, and then process the entire result. If that
is all you need to do then indeed you can accumulate the incoming
strings in a list and then run ''.join(thelist) when you have received
all of them and are ready to process them.

But the use case that I am talking about is where you need to
accumulate new incoming strings into your buffer while alternately
processing leading prefixes of the buffer. The typical example is that
sequences of bytes are arriving on your TCP socket and you are
processing the stream incrementally. You need to process the leading
prefixes of the stream as soon as you can (e.g. as soon as a line
followed by a newline has accumulated, or as soon as another complete
element in your data format has accumulated, etc.); you can't simply
wait until the bytes stop coming and then concatenate the entire
collection together at once.
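
To make the access pattern concrete, here is a minimal sketch (purely
hypothetical code with made-up names, not what foolscap actually does):

def handle_data(buf, newdata, process_line):
    # Accumulate the newly arrived chunk, then consume as many complete
    # newline-terminated lines as are available; return the remainder.
    buf += newdata
    while True:
        i = buf.find('\n')
        if i == -1:
            return buf
        line, buf = buf[:i], buf[i+1:]  # each slice copies the remainder
        process_line(line)

The trouble is that the += and the slicing can each copy the whole
remaining buffer over and over again as the data trickles in.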

This is exactly the current situation in foolscap [1] which is causing
a performance problem in Tahoe-LAFS [2].

To illustrate what I mean I cooked up some benchmarks showing the task
of "accumulate a bunch of things then consume them in a single gulp"
versus the task of "alternate between accumulating and consuming"
(with accumulation going faster than consumption until the input
stream runs dry).

I implemented four data structures for benchmarking: StringChain; the
naive "accumulatorstr += newstr" approach (named "Stringy" in the
benchmarks); one based on cStringIO (named "StringIOy"); and one that
accumulates the new strings into a list and does
''.join(accumulatorlist) whenever you need to consume a leading prefix
(called "SimplerStringChain").
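
Roughly, the SimplerStringChain approach is the following (just a
sketch of the idea described above; the class and method names here
are made up, and the real code is in the benchmark module):

class SimpleListBuffer(object):
    def __init__(self):
        self.chunks = []

    def append(self, newstr):
        self.chunks.append(newstr)      # cheap; nothing is copied yet

    def popleft(self, numbytes):
        # All of the deferred work happens here: everything is joined
        # into one big string whenever a leading prefix is consumed.
        whole = ''.join(self.chunks)
        prefix, rest = whole[:numbytes], whole[numbytes:]
        self.chunks = [rest] if rest else []
        return prefix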

Below are the abbreviated results of the benchmark. You can easily run
this benchmark yourself by following the README.txt in the StringChain
source package [3]. These results show that, for the load imposed by
this benchmark, StringChain is the only one of the four that scales,
and that it is also nice and fast.

The left-hand column is the total number of bytes that were run
through the data structure. The results are shown in nanoseconds per
byte, in exponential notation ("e+01" means times 10, "e+02" means
times 100, "e+03" means times 1000, and so on). Each result is the
best of 10 tries, or of 5 tries for the tasks which took too long to
run 10 times.

Regards,

Zooko

[1] http://foolscap.lothar.com/trac/ticket/149
[2] http://allmydata.org/pipermail/tahoe-dev/2010-March/004181.html
[3] http://tahoe-lafs.org/trac/stringchain

impl: StringChain
task: _accumulate_then_one_gulp
10000 best: 2.694e+00
50000 best: 2.742e+00
100000 best: 2.310e+00
500000 best: 2.040e+00
1000000 best: 1.988e+00
5000000 best: 2.193e+00

task: _alternate_str
10000 best: 6.509e+00
50000 best: 4.559e+00
100000 best: 4.308e+00
500000 best: 4.070e+00
1000000 best: 3.991e+00
5000000 best: 4.000e+00

impl: SimplerStringChain
task: _accumulate_then_one_gulp
10000 best: 1.407e+00
50000 best: 2.317e+00
100000 best: 2.012e+00
500000 best: 1.902e+00
1000000 best: 1.897e+00
5000000 best: 2.104e+00

task: _alternate_str
10000 best: 4.888e+00
50000 best: 5.198e+00
100000 best: 1.750e+01
500000 best: 6.233e+01
1000000 best: 1.134e+02
5000000 best: 7.599e+02

impl: StringIOy
task: _accumulate_then_one_gulp
10000 best: 4.196e+00
50000 best: 5.522e+00
100000 best: 4.499e+00
500000 best: 3.756e+00
1000000 best: 4.176e+00
5000000 best: 5.414e+00

task: _alternate_str
10000 best: 5.484e+00
50000 best: 7.863e+00
100000 best: 2.126e+01
500000 best: 6.972e+01
1000000 best: 1.219e+02
5000000 best: 9.463e+02

task: _accumulate_then_one_gulp
10000 best: 1.502e+00
50000 best: 1.420e+01
100000 best: 2.245e+01
500000 best: 8.577e+01
1000000 best: 2.295e+02
5000000 best: 1.326e+03

task: _alternate_str
10000 best: 3.290e+00
50000 best: 4.220e+00
100000 best: 1.665e+01
500000 best: 6.281e+01
1000000 best: 1.127e+02
5000000 best: 7.626e+02
 
Steven D'Aprano

But the use case that I am talking about is where you need to accumulate
new incoming strings into your buffer while alternately processing
leading prefixes of the buffer. [...]
Below are the abbreviated results of the benchmark. You can easily run
this benchmark yourself by following the README.txt in the StringChain
source package [3]. These results show that for the load imposed by this
benchmark StringChain is the only one of the four that scales and that
it is also nice and fast.

I was reading this email and thinking "Why do you need this StringChain
data structure? From the description it sounds like a job for
collections.deque."

And funnily enough, following the links you supplied I came across this:

"You can get the package from http://pypi.python.org/pypi/stringchain or
with darcs get http://tahoe-lafs.org/source/stringchain/trunk. It has
unit tests. It is in pure Python (it uses collections.deque and string)."

http://tahoe-lafs.org/trac/stringchain


Perhaps you should have said that it was a wrapper around deque giving
richer functionality, rather than giving the impression that it was a
brand new data structure invented by you. People are naturally going to
be more skeptical about a newly invented data structure than one based on
a reliable component like deque.

In any case, it looks to me that the StringChain data structure itself is
a little too application-specific for me to be personally interested in
it. Maybe you should consider linking to it on PyPI and seeing if there
is any interest from others?

Regards,

Steven
 
Zooko O'Whielacronx

My apologies; I left out the heading on the last of the four
structures in the benchmark results. Here are those results again with
the missing heading (Stringy) inserted:

Regards,

Zooko

impl: StringChain
task: _accumulate_then_one_gulp
10000 best: 2.694e+00
50000 best: 2.742e+00
100000 best: 2.310e+00
500000 best: 2.040e+00
1000000 best: 1.988e+00
5000000 best: 2.193e+00

task: _alternate_str
10000 best: 6.509e+00
50000 best: 4.559e+00
100000 best: 4.308e+00
500000 best: 4.070e+00
1000000 best: 3.991e+00
5000000 best: 4.000e+00

impl: SimplerStringChain
task: _accumulate_then_one_gulp
10000 best: 1.407e+00
50000 best: 2.317e+00
100000 best: 2.012e+00
500000 best: 1.902e+00
1000000 best: 1.897e+00
5000000 best: 2.104e+00

task: _alternate_str
10000 best: 4.888e+00
50000 best: 5.198e+00
100000 best: 1.750e+01
500000 best: 6.233e+01
1000000 best: 1.134e+02
5000000 best: 7.599e+02

impl: StringIOy
task: _accumulate_then_one_gulp
10000 best: 4.196e+00
50000 best: 5.522e+00
100000 best: 4.499e+00
500000 best: 3.756e+00
1000000 best: 4.176e+00
5000000 best: 5.414e+00

task: _alternate_str
10000 best: 5.484e+00
50000 best: 7.863e+00
100000 best: 2.126e+01
500000 best: 6.972e+01
1000000 best: 1.219e+02
5000000 best: 9.463e+02

impl: Stringy
task: _accumulate_then_one_gulp
10000 best: 1.502e+00
50000 best: 1.420e+01
100000 best: 2.245e+01
500000 best: 8.577e+01
1000000 best: 2.295e+02
5000000 best: 1.326e+03

task: _alternate_str
10000 best: 3.290e+00
50000 best: 4.220e+00
100000 best: 1.665e+01
500000 best: 6.281e+01
1000000 best: 1.127e+02
5000000 best: 7.626e+02
 
Zooko O'Whielacronx

Perhaps you should have said that it was a wrapper around deque giving
richer functionality, rather than giving the impression that it was a
brand new data structure invented by you. People are naturally going to
be more skeptical about a newly invented data structure than one based on
a reliable component like deque.

The fact that StringChain uses a deque to hold the queue of strings
isn't that important. I just benchmarked it with a plain list swapped
in for the deque, and using the list costs about one third of a
nanosecond per byte at the scales that the benchmark exercises (namely
10,000,000 bytes in about 10,000 strings). A third of a nanosecond per
byte is about 4% of the runtime.

I also implemented and benchmarked a simpler deque-based scheme which
puts all of the actual bytes from the strings into a deque with
self.d.extend(newstr). As you would expect, this shows good asymptotic
performance, but the constants are relatively bad, so at all of the
loads measured it was three orders of magnitude worse than both
StringChain and StringChain-with-a-list-instead-of-a-deque. Moral: the
constants matter!
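
For concreteness, that deque-of-bytes scheme is roughly the following
(a sketch with made-up names, not the exact benchmarked code):

import collections

class ByteDeque(object):
    def __init__(self):
        self.d = collections.deque()

    def append(self, newstr):
        self.d.extend(newstr)           # one deque element per byte

    def popleft(self, numbytes):
        # Asymptotically fine, but the per-byte cost of popleft() and
        # the join is exactly where the bad constants come from.
        return ''.join(self.d.popleft() for _ in xrange(numbytes))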

Those benchmarks are appended. You can run the benchmarks yourself per
the README.txt.

But anyway, I take your point and I updated the StringChain README to
explain that it is a pretty simple data structure that holds a list
(actually a deque) of strings and isn't anything too clever or novel.

By the way, we could further micro-optimize this kind of thing if
''.join() would accept either strings or buffers instead of requiring
strings:
>>> ''.join([buffer("abc"), "def"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected string, buffer found

Then whenever StringChain needs to slice a string into leading and
trailing parts, it could construct a buffer() viewing each part
instead of making a copy of each part.
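
Something like this, that is (a sketch of the idea using Python 2's
buffer(); StringChain does not do this today, precisely because of the
limitation shown above):

chunk = "one complete line\nplus the start of the next"
n = chunk.index('\n') + 1
head = buffer(chunk, 0, n)  # view of the leading part, no copy made
tail = buffer(chunk, n)     # view of the trailing part, no copy made
# str(head) and str(tail) would each make a copy, which is exactly the
# cost we would like to defer until the bytes reach the application.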

Maybe you should consider linking to it on PyPI and seeing if there
is any interest from others?

http://pypi.python.org/pypi/stringchain

Regards,

Zooko

impl: StringChain
task: _accumulate_then_one_gulp
10000 best: 5.698e+00, 3th-best: 7.486e+00, mean: 7.758e+00,
100000 best: 4.640e+00, 3th-best: 4.690e+00, mean: 7.674e+00,
1000000 best: 3.789e+00, 3th-best: 3.806e+00, mean: 3.995e+00,
10000000 best: 4.095e+00, 1th-best: 4.095e+00, mean: 4.095e+00,
task: _alternate_str
10000 best: 1.378e+01, 3th-best: 1.390e+01, mean: 1.500e+01,
100000 best: 9.198e+00, 3th-best: 9.248e+00, mean: 9.385e+00,
1000000 best: 8.715e+00, 3th-best: 8.755e+00, mean: 8.808e+00,
10000000 best: 8.738e+00, 1th-best: 8.738e+00, mean: 8.738e+00,
impl: StringChainWithList
task: _accumulate_then_one_gulp
10000 best: 3.600e+00, 3th-best: 3.695e+00, mean: 4.129e+00,
100000 best: 4.070e+00, 3th-best: 4.079e+00, mean: 4.162e+00,
1000000 best: 3.662e+00, 3th-best: 3.688e+00, mean: 3.721e+00,
10000000 best: 3.902e+00, 1th-best: 3.902e+00, mean: 3.902e+00,
1th-worst: 3.902e+00, worst: 3.902e+00 (of 1)
task: _alternate_str
10000 best: 1.369e+01, 3th-best: 1.380e+01, mean: 1.442e+01,
100000 best: 9.251e+00, 3th-best: 9.289e+00, mean: 9.416e+00,
1000000 best: 8.809e+00, 3th-best: 8.876e+00, mean: 8.943e+00,
10000000 best: 9.095e+00, 1th-best: 9.095e+00, mean: 9.095e+00,
impl: Dequey
task: _accumulate_then_one_gulp
10000 best: 2.772e+02, 3th-best: 2.785e+02, mean: 2.911e+02,
100000 best: 2.314e+02, 3th-best: 2.334e+02, mean: 2.422e+02,
1000000 best: 2.282e+02, 3th-best: 2.288e+02, mean: 2.370e+02,
10000000 best: 2.587e+02, 1th-best: 2.587e+02, mean: 2.587e+02,
task: _alternate_str
10000 best: 1.576e+03, 3th-best: 1.580e+03, mean: 1.608e+03,
100000 best: 1.301e+03, 3th-best: 1.303e+03, mean: 1.306e+03,
1000000 best: 1.275e+03, 3th-best: 1.276e+03, mean: 1.278e+03,
10000000 best: 1.280e+03, 1th-best: 1.280e+03, mean: 1.280e+03,
 
Lie Ryan

Perhaps you should have said that it was a wrapper around deque giving
richer functionality, rather than giving the impression that it was a
brand new data structure invented by you. People are naturally going to
be more skeptical about a newly invented data structure than one based on
a reliable component like deque.

Well, technically StringChain is not a data structure in the first
place. StringChain is a string; a string that is implemented using the
deque data structure to make appending algorithmically efficient. It
is not a data structure in the sense that I can't put an arbitrary
"thing" into it. In the same sense, "string" is also not a data
structure, as "string" is an implementation of a stream using the
"array" data structure; "StringChain" is an implementation of a stream
using the "deque" data structure.
 
Steven D'Aprano

Well, technically StringChain is not a data structure in the first
place. StringChain is a string;

And strings are data structures, as are arrays and structs. Just
because they're simple data structures made directly from primitives
rather than rich, complex structures doesn't mean they're not data
structures.

a string that is implemented using the deque
data structure to make appending algorithmically efficient. It is not a
data structure in the sense that I can't put an arbitrary "thing" into
it.

Any "thing" that can be pickled or serialised can be put into a string.
 
Lie Ryan

And strings are data structures, as are arrays and structs. Just
because they're simple data structures made directly from primitives
rather than rich, complex structures doesn't mean they're not data
structures.

An array is a way to structure data and thus a data structure; an
array is the concept of structuring data using contiguous memory with
elements addressed by an index. A string is just contiguous memory
reserved to store data, or in other words: a string is an array. This
is what I meant when I said a string is not itself a data structure.

Any "thing" that can be pickled or serialised can be put into a string.

Fair enough, you're right to think so, but IMHO I disagree. Streams
(or perhaps 'blobs', to avoid the connotation of FIFO) are the data
structure which you can put anything pickleable/serialisable into, but
strings (the implementation of a stream/blob using an array) are just
a special case of the array data structure and not a different,
separate data structure from arrays.

Perhaps I should not make such a bold claim as saying a string is not
a data structure; what I have in mind is that a string is just a
special case of an array and not a distinctly separate data structure
from arrays.
 
Steve Holden

Lie said:
An array is a way to structure data and thus a data structure; an
array is the concept of structuring data using contiguous memory with
elements addressed by an index. A string is just contiguous memory
reserved to store data, or in other words: a string is an array. This
is what I meant when I said a string is not itself a data structure.

Fair enough, you're right to think so, but IMHO I disagree. Streams
(or perhaps 'blobs', to avoid the connotation of FIFO) are the data
structure which you can put anything pickleable/serialisable into, but
strings (the implementation of a stream/blob using an array) are just
a special case of the array data structure and not a different,
separate data structure from arrays.

Perhaps I should not make such a bold claim as saying a string is not
a data structure; what I have in mind is that a string is just a
special case of an array and not a distinctly separate data structure
from arrays.

While this may be true conceptually, it's certainly not true in Python
(and in Python you need to be careful to distinguish between lists and
arrays, since it does have both types).

From the point of view of the language, strings are primitives. But
every Python implementation uses a data structure to store them, and
the structure is definitely not the same as the one used to store
lists or arrays (which in Python are container types, whereas the
string isn't, because the individual indexable elements are immutable).

As always, it's better to seek common ground than to pick nits: I don't
think there's really that much difference between your approach and
Steven's, but he's a well-known nit-picker (and we are *sometimes*
grateful for it).

regards
Steve
 
