modifying small chunks of a long string


MackS

Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. E.g., it locates
approx. 1000-2000 characters and modifies them. Currently I do this
using a string object, but it is getting very slow: although I only
modify a tiny bit of the string at a time, an entire new string gets
created whenever I "merge" it with the rest. E.g.,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:]  # new huge string is created!!

Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...

Thanks for any help,

Mack
 

Steven D'Aprano

MackS said:
This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time.
....

Can I get over this performance problem without reimplementing the
whole thing using a barebones list object?

Is that a problem? Are you using string methods?

Some possibilities are:

array.array('c')
UserString.MutableString
cStringIO

depending on exactly what you are trying to do, but if
it is just straightforward replacement (i.e. you
never actually change the length of the string), I'd
guess the list idiom would be simplest, as in the sketch below.
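
For instance, a minimal sketch of that list idiom (untested; longstr,
beg, end and edited_chunk stand in for whatever names you're using):

chars = list(longstr)   # convert the big string to a mutable list, once

# Edit a small chunk in place with a slice assignment; for a
# same-length replacement nothing else in the list has to move.
chars[beg:end] = list(edited_chunk)

longstr = ''.join(chars)   # join back into one string only when needed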

If you have enough memory, just keep two copies of the
string, the original and the modified:

size = 1024*1024*20  # 20 MB
original = "A" * size
copy = [None] * size
for i in range(size):
    copy[i] = original[i].lower()
copy = ''.join(copy)

This takes 530 seconds (8 minutes) on my
not-especially-powerful system. Ouch. How does that
compare to your code?

If you can operate on moderately sized chunks of text
at a time rather than one character at a time, you'll
see some performance increase.

Also the above code not only has a 20MB string plus a
20MB list, it also carries around a list of 20+ million
int objects. If I had been paying more attention before
I hit the enter key, I would have used xrange instead of
range and saved a lot of memory.

It also has to assemble the new 20MB string before
garbage-collecting the character list. It is possible
that the above test code has well in excess of 100MB of
data at its peak.

With those problems in mind, I try this:

from __future__ import division

size = 1024*1024*20  # 20 MB
original = "A" * size
copy = []
blocksize = 1024
for i in xrange(size//blocksize):
    copy.append(original[i*blocksize:(i+1)*blocksize].lower())
copy = ''.join(copy)

This makes a BIG difference: from 8 minutes down to 0.5
seconds. That's a speed-up by a factor of about one
thousand.

I was so amazed at the speed increase I re-wrote the
code with a couple of simple tests, then ran it again.
Same result. Unless I've made some stupid mistake, I
think this is the way to go.
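
For reference, a rough timing harness along these lines reproduces
the measurement (untested sketch; the helper name is made up):

import time

def lower_in_blocks(original, blocksize=1024):
    # Lowercase one block at a time, joining once at the end.
    # Assumes len(original) is a multiple of blocksize, as above.
    pieces = []
    for i in xrange(len(original) // blocksize):
        pieces.append(original[i*blocksize:(i+1)*blocksize].lower())
    return ''.join(pieces)

original = "A" * (1024*1024*20)
start = time.time()
copy = lower_in_blocks(original)
print "%.2f seconds" % (time.time() - start)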
 

Steven D'Aprano

Replying to myself... the first sign of madness *wink*


Steven D'Aprano wrote:

size = 1024*1024*20  # 20 MB
original = "A" * size
copy = [None] * size
for i in range(size):
    copy[i] = original[i].lower()
copy = ''.join(copy)


Do you notice the premature optimization? Rather than
appending new data to an initially empty list, I
thought I would "optimize" my code by pre-allocating
all the memory I would need for the list.

You can see how well it didn't work:
This takes 530 seconds (8 minutes)

The more sensible algorithm, without the unneeded
pre-allocation of the copy list, runs 1000 times
faster. That's the difference between optimization by
wild guessing and optimization by actually writing good
code :)
 

Bengt Richter

MackS said:
Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. E.g., it locates
approx. 1000-2000 characters and modifies them. Currently I do this
using a string object, but it is getting very slow: although I only
modify a tiny bit of the string at a time, an entire new string gets
created whenever I "merge" it with the rest. E.g.,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:]  # new huge string is created!!

The usual way is to accumulate the edited short pieces of the new version of longstr
in a list and then join them once, if you really need the new longstr in a single piece
for something. I.e., (untested sketch)

chunks_of_new_longstr = []
for chunk in chunker(longstr):
    # edit chunk (your shortstr)...
    chunks_of_new_longstr.append(chunk)  # or append several pieces from the editing of a chunk
longstr = ''.join(chunks_of_new_longstr)

But if you don't really need it except to write it to output, and the next thing would be
open('longstr.txt','wb').write(longstr) # might want 'w' instead of 'wb' for plain text data

then don't bother joining into a new longstr but do
open('longstr.txt','wb').writelines(chunks_of_new_longstr)

instead. But if you are going to do that, why not just
fout = open('longstr.txt','wb')
before the loop, and
fout.write(chunk)
in place of
chunks_of_new_longstr.append(chunk)

Of course, if longstr is coming from a file, maybe you can have
the chunker operate on a file instead of a longstr in memory.
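
A rough sketch of such a file-based chunker (untested; the function
name, file names and 64 KB block size are all made up here):

def file_chunker(fin, blocksize=64*1024):
    # Yield successive blocks read from an open file, so the
    # whole longstr never has to sit in memory at once.
    while True:
        chunk = fin.read(blocksize)
        if not chunk:
            break
        yield chunk

fout = open('longstr.txt','wb')
for chunk in file_chunker(open('input.txt','rb')):
    # edit chunk...
    fout.write(chunk)
fout.close()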
MackS said:
Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...
I imagine you should be able to change just a few lines to switch between
ways of getting your input stream of editable chunks and of accumulating your output.

OTOH, this is all guesswork without more context ;-)

Regards,
Bengt Richter
 

Tony Nelson

"MackS said:
Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. E.g., it locates
approx. 1000-2000 characters and modifies them. Currently I do this
using a string object, but it is getting very slow: although I only
modify a tiny bit of the string at a time, an entire new string gets
created whenever I "merge" it with the rest. E.g.,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:]  # new huge string is created!!

Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...

A couple of minutes experimenting with array.array at the Python command
line indicates that it will work fine for you. It is quite snappy on a 16 MB
array, including a slice assignment of 1 KB near the beginning.
array.array is probably better than lists for speed, and uses less
memory. It is the way to go if you are going to be randomly editing all
over the place but don't need to convert to a string often.

MutableString warns that it is very slow. It seems to work by having a
string data item that it keeps replacing. I didn't try it.

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:]  # new huge string is created!!

Replace this with slice assignment:

import array

longarray = array.array('c', longstr)  # once only, at the beginning!

shortstr = longarray[beg:end].tostring()  # or just edit a subarray

# edit shortstr (or the subarray)...

longarray[beg:end] = array.array('c', shortstr)

longstr = longarray.tostring()  # if needed
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
 

MackS

Thank you all for your great help. One of the few things better than
python is the knowledgeable community around it. : )

Regards,

Mack
 

Steven D'Aprano

Tony said:
A couple of minutes experimenting with array.array at the Python command
line indicates that it will work fine for you. It is quite snappy on a 16 MB
array, including a slice assignment of 1 KB near the beginning.
array.array is probably better than lists for speed, and uses less
memory. It is the way to go if you are going to be randomly editing all
over the place but don't need to convert to a string often.

I have no major objections to using array, but a minor
one: ordinary lists may very well be more than snappy
enough, and they have the advantage of being more
familiar than the array module to many Python programmers.

The time it takes to process a 20MB string will depend
on the details of the processing, but my back-of-the-envelope
test using one large input string and an
intermediate list of strings was *extremely* fast: less
than half a second for a 20MB input. (See my earlier
post for details.)

Given that sort of speed, shifting to the less familiar
array module just to shave the time from 0.49s to 0.45s
is premature optimization. Although, in fairness, if
you could cut the time to 0.04s for 20MB then it would
be worth the extra work to use the array module.
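
For anyone who wants to measure that trade-off on their own data,
a rough comparison along these lines works (untested sketch; the
1 MB size and edit positions are made up):

import array
import time

size = 1024*1024
longstr = "A" * size
beg, end = 1000, 2000
patch = "b" * (end - beg)

# List idiom: convert once, slice-assign the edit, join once.
start = time.time()
chars = list(longstr)
chars[beg:end] = list(patch)
result_list = ''.join(chars)
list_time = time.time() - start

# array.array idiom: the same edit on a mutable character array.
start = time.time()
chararr = array.array('c', longstr)
chararr[beg:end] = array.array('c', patch)
result_array = chararr.tostring()
array_time = time.time() - start

assert result_list == result_array
print "list: %.4fs  array: %.4fs" % (list_time, array_time)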
 
