modifying small chunks of a long string


MackS

Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. E.g., it locates
approx. 1000-2000 characters and modifies them. Currently I do this
using a string object, but it is getting very slow: although I only
modify a tiny bit of the string at a time, an entire new string gets
created whenever I "merge" it with the rest. E.g.,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:]  # new huge string is created!!

Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...

Thanks for any help,

Mack
 

Steven D'Aprano

MackS said:
This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time.
....

Can I get over this performance problem without reimplementing the
whole thing using a barebones list object?

Is that a problem? Are you using string methods?

Some possibilities are:

array.array('c')
UserString.MutableString
cStringIO

depending on exactly what you are trying to do, but if
it is just straightforward replacement (i.e. you
never actually change the length of the string), I'd
guess the list idiom would be simplest, as in the sketch below.
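
For instance, a minimal sketch of that list idiom (untested; longstr,
beg, end and edited_chunk stand in for whatever names you're using):

chars = list(longstr)   # convert the big string to a mutable list, once

# Edit a small chunk in place with a slice assignment; for a
# same-length replacement nothing else in the list has to move.
chars[beg:end] = list(edited_chunk)

longstr = ''.join(chars)   # join back into one string only when needed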

If you have enough memory, just keep two copies of the
string, the original and the modified:

size = 1024*1024*20  # 20 MB
original = "A" * size
copy = [None] * size
for i in range(size):
    copy[i] = original[i].lower()
copy = ''.join(copy)

This takes 530 seconds (8 minutes) on my
not-especially-powerful system. Ouch. How does that
compare to your code?

If you can operate on moderately sized chunks of text
at a time rather than one character at a time, you'll
see some performance increase.

Also the above code not only has a 20MB string plus a
20MB list, it also carries around a list of 20+ million
int objects. If I had been paying more attention before
I hit the enter key, I would have used xrange instead of
range and saved a lot of memory.

It also has to assemble the new 20MB string before
garbage-collecting the character list. It is possible
that the above test code has well in excess of 100MB of
data at its peak.

With those problems in mind, I try this:

from __future__ import division

size = 1024*1024*20  # 20 MB
original = "A" * size
copy = []
blocksize = 1024
for i in xrange(size//blocksize):
    copy.append(original[i*blocksize:(i+1)*blocksize].lower())
copy = ''.join(copy)

This makes a BIG difference: from 8 minutes down to 0.5
seconds. That's a speed-up by a factor of about one
thousand.

I was so amazed at the speed increase I re-wrote the
code with a couple of simple tests, then ran it again.
Same result. Unless I've made some stupid mistake, I
think this is the way to go.
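
For reference, a rough timing harness along these lines reproduces
the measurement (untested sketch; the helper name is made up):

import time

def lower_in_blocks(original, blocksize=1024):
    # Lowercase one block at a time, joining once at the end.
    # Assumes len(original) is a multiple of blocksize, as above.
    pieces = []
    for i in xrange(len(original) // blocksize):
        pieces.append(original[i*blocksize:(i+1)*blocksize].lower())
    return ''.join(pieces)

original = "A" * (1024*1024*20)
start = time.time()
copy = lower_in_blocks(original)
print "%.2f seconds" % (time.time() - start)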
 

Steven D'Aprano

Replying to myself... the first sign of madness *wink*


Steven D'Aprano wrote:

size = 1024*1024*20  # 20 MB
original = "A" * size
copy = [None] * size
for i in range(size):
    copy[i] = original[i].lower()
copy = ''.join(copy)


Do you notice the premature optimization? Rather than
appending new data to an initially empty list, I
thought I would "optimize" my code by pre-allocating
all the memory I would need for the list.

You can see how well it didn't work:
This takes 530 seconds (8 minutes)

The more sensible algorithm, without the unneeded
pre-allocation of the copy list, runs 1000 times
faster. That's the difference between optimization by
wild guessing and optimization by actually writing good
code :)
 

Bengt Richter

MackS said:
Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. E.g., it locates
approx. 1000-2000 characters and modifies them. Currently I do this
using a string object, but it is getting very slow: although I only
modify a tiny bit of the string at a time, an entire new string gets
created whenever I "merge" it with the rest. E.g.,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:]  # new huge string is created!!

The usual way is to accumulate the edited short pieces of the new version of longstr
in a list and then join them once, if you really need the new longstr in a single piece
for something. I.e., (untested sketch)

chunks_of_new_longstr = []
for chunk in chunker(longstr):
    # edit chunk (your shortstr)...
    chunks_of_new_longstr.append(chunk)  # or append several pieces from the editing of a chunk
longstr = ''.join(chunks_of_new_longstr)

But if you don't really need it except to write it to output, and the next thing would be
open('longstr.txt','wb').write(longstr) # might want 'w' instead of 'wb' for plain text data

then don't bother joining into a new longstr but do
open('longstr.txt','wb').writelines(chunks_of_new_longstr)

instead. But if you are going to do that, why not just
fout = open('longstr.txt','wb')
before the loop, and
fout.write(chunk)
in place of
chunks_of_new_longstr.append(chunk)

Of course, if longstr is coming from a file, maybe you can have
the chunker operate on a file instead of a longstr in memory.
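
A rough sketch of such a file-based chunker (untested; the function
name, file names and 64 KB block size are all made up here):

def file_chunker(fin, blocksize=64*1024):
    # Yield successive blocks read from an open file, so the
    # whole longstr never has to sit in memory at once.
    while True:
        chunk = fin.read(blocksize)
        if not chunk:
            break
        yield chunk

fout = open('longstr.txt','wb')
for chunk in file_chunker(open('input.txt','rb')):
    # edit chunk...
    fout.write(chunk)
fout.close()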
MackS said:
Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...
I imagine you should be able to change just a few lines to switch between
ways of getting your input stream of editable chunks and of accumulating your output.

OTOH, this is all guesswork without more context ;-)

Regards,
Bengt Richter
 

Tony Nelson

"MackS said:
Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. E.g., it locates
approx. 1000-2000 characters and modifies them. Currently I do this
using a string object, but it is getting very slow: although I only
modify a tiny bit of the string at a time, an entire new string gets
created whenever I "merge" it with the rest. E.g.,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:]  # new huge string is created!!

Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...

A couple of minutes experimenting with array.array at the Python command
line indicates that it will work fine for you. It is quite snappy on a 16 MB
array, including a slice assignment of 1 KB near the beginning.
array.array is probably better than lists for speed, and uses less
memory. It is the way to go if you are going to be randomly editing all
over the place but don't need to convert to a string often.

MutableString warns that it is very slow. It seems to work by having a
string data item that it keeps replacing. I didn't try it.

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:]  # new huge string is created!!

Replace this with slice assignment:

import array

longarray = array.array('c', longstr)  # once only, at the beginning!

shortstr = longarray[beg:end].tostring()  # or just edit a subarray

# edit shortstr (or the subarray)...

longarray[beg:end] = array.array('c', shortstr)

longstr = longarray.tostring()  # if needed
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
 

MackS

Thank you all for your great help. One of the few things better than
python is the knowledgeable community around it. : )

Regards,

Mack
 

Steven D'Aprano

Tony said:
A couple of minutes experimenting with array.array at the Python command
line indicates that it will work fine for you. It is quite snappy on a 16 MB
array, including a slice assignment of 1 KB near the beginning.
array.array is probably better than lists for speed, and uses less
memory. It is the way to go if you are going to be randomly editing all
over the place but don't need to convert to a string often.

I have no major objections to using array, but a minor
one: ordinary lists may very well be more than snappy
enough, and they have the advantage of being more
familiar than the array module to many Python programmers.

The time it takes to process a 20MB string will depend
on the details of the processing, but my back-of-the-envelope
test using one large input string and an
intermediate list of strings was *extremely* fast: less
than half a second for a 20MB input. (See my earlier
post for details.)

Given that sort of speed, shifting to the less familiar
array module just to shave the time from 0.49s to 0.45s
is premature optimization. Although, in fairness, if
you could cut the time to 0.04s for 20MB then it would
be worth the extra work to use the array module.
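
For anyone who wants to measure that trade-off on their own data,
a rough comparison along these lines works (untested sketch; the
1 MB size and edit positions are made up):

import array
import time

size = 1024*1024
longstr = "A" * size
beg, end = 1000, 2000
patch = "b" * (end - beg)

# List idiom: convert once, slice-assign the edit, join once.
start = time.time()
chars = list(longstr)
chars[beg:end] = list(patch)
result_list = ''.join(chars)
list_time = time.time() - start

# array.array idiom: the same edit on a mutable character array.
start = time.time()
chararr = array.array('c', longstr)
chararr[beg:end] = array.array('c', patch)
result_array = chararr.tostring()
array_time = time.time() - start

assert result_list == result_array
print "list: %.4fs  array: %.4fs" % (list_time, array_time)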
 
