modifying small chunks from long string

Discussion in 'Python' started by MackS, Nov 14, 2005.

  1. MackS

    MackS Guest

    Hello everyone

    I am faced with the following problem. For the first time I've asked
    myself "might this actually be easier to code in C rather than in
    python?", and I am not looking at device drivers. : )

    This program is meant to process relatively long strings (10-20 MB) by
    selectively modifying small chunks one at a time. Eg, it locates
    approx. 1000-2000 characters and modifies them. Currently I was doing
    this using a string object but it is getting very slow: although I only
    modify a tiny bit of the string at a time, a new entire string gets
    created whenever I "merge" it with the rest. Eg,

    shortstr = longstr[beg:end]

    # edit shortstr...

    longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
    created!!

    Can I get over this performance problem without reimplementing the
    whole thing using a barebones list object? I thought I was being "smart"
    by avoiding editing the long list, but then it struck me that I am
    creating a second object of the same size when I put the modified
    shorter string in place...

    Thanks for any help,

    Mack
    MackS, Nov 14, 2005
    #1

  2. MackS wrote:

    > This program is meant to process relatively long strings (10-20 MB) by
    > selectively modifying small chunks one at a time.


    ....

    > Can I get over this performance problem without reimplementing the
    > whole thing using a barebones list object?


    Is that a problem? Are you using string methods?

    Some possibilities are:

    array.array('c')
    UserString.MutableString
    cStringIO

    depending on exactly what you are trying to do, but if
    it is just straightforward replacement (i.e. you
    never actually change the length of the string) I'd
    guess the list idiom would be simplest.
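
    Something like this untested sketch, say (beg, end and newchunk are just
    placeholders for whatever your program works out):

    chars = list(longstr)        # one-off conversion to a mutable list of characters
    # ... locate beg:end and build newchunk, the edited text ...
    chars[beg:end] = newchunk    # slice assignment into the list
    longstr = ''.join(chars)     # join back to a string only when you need one

    The join at the end is the only step that touches the whole 10-20 MB at
    once, so do it as rarely as you can.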

    If you have enough memory, just keep two copies of the
    string, the original and the modified:

    size = 1024*1024*20 # 20 MB
    original = "A" * size
    copy = [None] * size
    for i in range(size):
        copy[i] = original[i].lower()
    copy = ''.join(copy)

    This takes 530 seconds (8 minutes) on my
    not-especially-powerful system. Ouch. How does that
    compare to your code?

    If you can operate on moderately sized chunks of text
    at a time rather than one character at a time, you'll
    see some performance increase.

    Also the above code not only has a 20MB string, plus a
    20MB list, it also carries around a list of 20+ million
    int objects. If I had been paying more attention before
    I hit the enter key, I would have used xrange instead
    of range and saved a lot of memory.

    It also has to assemble the new 20MB string before
    garbage-collecting the character list. It is possible
    that the above test code has well in excess of 100MB of
    data at its peak.

    With those problems in mind, I try this:

    from __future__ import division

    size = 1024*1024*20 # 20 MB
    original = "A" * size
    copy = []
    blocksize = 1024
    for i in xrange(size//blocksize):
        copy.append(original[i*blocksize:(i+1)*blocksize].lower())
    copy = ''.join(copy)

    This makes a BIG difference: from 8 minutes down to 0.5
    seconds. That's a speed-up by a factor of about one
    thousand.

    I was so amazed at the speed increase I re-wrote the
    code with a couple of simple tests, then ran it again.
    Same result. Unless I've made some stupid mistake, I
    think this is the way to go.



    --
    Steven.
    Steven D'Aprano, Nov 14, 2005
    #2

  3. Replying to myself... the first sign of madness *wink*


    Steven D'Aprano wrote:


    > size = 1024*1024*20 # 20 MB
    > original = "A" * size
    > copy = [None] * size
    > for i in range(size):
    >     copy[i] = original[i].lower()
    > copy = ''.join(copy)


    Do you notice the premature optimization? Rather than
    appending new data to an initially empty list, I
    thought I would "optimize" my code by pre-allocating
    all the memory I would need for the list.

    You can see how well it didn't work:

    > This takes 530 seconds (8 minutes)


    The more sensible algorithm, without the unneeded
    pre-allocation of the copy list, runs 1000 times
    faster. That's the difference between optimization by
    wild guessing and optimization by actually writing good
    code :)


    --
    Steven.
    Steven D'Aprano, Nov 14, 2005
    #3
  4. On 13 Nov 2005 22:57:50 -0800, "MackS" <> wrote:

    >Hello everyone
    >
    >I am faced with the following problem. For the first time I've asked
    >myself "might this actually be easier to code in C rather than in
    >python?", and I am not looking at device drivers. : )
    >
    >This program is meant to process relatively long strings (10-20 MB) by
    >selectively modifying small chunks one at a time. Eg, it locates
    >approx. 1000-2000 characters and modifies them. Currently I was doing
    >this using a string object but it is getting very slow: although I only
    >modify a tiny bit of the string at a time, a new entire string gets
    >created whenever I "merge" it with the rest. Eg,
    >
    >shortstr = longstr[beg:end]
    >
    ># edit shortstr...
    >
    >longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
    >created!!
    >

    The usual way is to accumulate the edited short pieces of the new version of longstr
    in a list and then join them once, if you really need the new longstr in a single piece
    for something. I.e., (untested sketch)

    chunks_of_new_longstr = []
    for chunk in chunker(longstr):
        # edit chunk (your shortstr)
        chunks_of_new_longstr.append(chunk) # or do several appends of pieces from the editing of a chunk
    longstr = ''.join(chunks_of_new_longstr)

    But if you don't really need it except to write it to output and the next thing would be
    open('longstr.txt','wb').write(longstr) # might want 'w' instead of 'wb' for plain text data

    then don't bother joining into a new longstr but do
    open('longstr.txt','wb').writelines(chunks_of_new_longstr)

    instead. But if you are going to do that, why not just
    fout = open('longstr.txt','wb')
    before the loop, and
    fout.write(chunk)
    in place of
    chunks_of_new_longstr.append(chunk)

    Of course, if longstr is coming from a file, maybe you can have
    the chunker operate on a file instead of a longstr in memory.
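
    Pulling that together, something like this (again an untested sketch; this
    chunker is just a stand-in that yields fixed-size slices -- yours would
    locate your 1000-2000 character chunks however it actually does):

    def chunker(s, chunksize=2048):
        # placeholder chunker: walk the string in fixed-size slices
        for start in xrange(0, len(s), chunksize):
            yield s[start:start+chunksize]

    fout = open('longstr.txt','wb')
    for chunk in chunker(longstr):
        # edit chunk here
        fout.write(chunk)   # write it straight out instead of accumulating it
    fout.close()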

    >Can I get over this performance problem without reimplementing the
    >whole thing using a barebones list object? I thought I was being "smart"
    >by avoiding editing the long list, but then it struck me that I am
    >creating a second object of the same size when I put the modified
    >shorter string in place...
    >

    I imagine you should be able to change a very few lines to switch between
    ways of getting your input stream of editable chunks and accumulating your output.

    OTOH, this is all guesswork without more context ;-)

    Regards,
    Bengt Richter
    Bengt Richter, Nov 14, 2005
    #4
  5. Tony Nelson

    Tony Nelson Guest

    In article <>,
    "MackS" <> wrote:

    > Hello everyone
    >
    > I am faced with the following problem. For the first time I've asked
    > myself "might this actually be easier to code in C rather than in
    > python?", and I am not looking at device drivers. : )
    >
    > This program is meant to process relatively long strings (10-20 MB) by
    > selectively modifying small chunks one at a time. Eg, it locates
    > approx. 1000-2000 characters and modifies them. Currently I was doing
    > this using a string object but it is getting very slow: although I only
    > modify a tiny bit of the string at a time, a new entire string gets
    > created whenever I "merge" it with the rest. Eg,
    >
    > shortstr = longstr[beg:end]
    >
    > # edit shortstr...
    >
    > longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
    > created!!
    >
    > Can I get over this performance problem without reimplementing the
    > whole thing using a barebones list object? I thought I was being "smart"
    > by avoiding editing the long list, but then it struck me that I am
    > creating a second object of the same size when I put the modified
    > shorter string in place...


    A couple of minutes experimenting with array.array at the python command
    line indicates that it will work fine for you. Quite snappy on a 16 MB
    array, including a slice assignment of 1 KB near the beginning.
    Array.array is probably better than lists for speed, and uses less
    memory. It is the way to go if you are going to be randomly editing all
    over the place but don't need to convert to string often.

    MutableString warns that it is very slow. It seems to work by having a
    string data item that it keeps replacing. I didn't try it.


    > shortstr = longstr[beg:end]
    >
    > # edit shortstr...
    >
    > longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
    > created!!


    Replace this with slice assignment:

    import array

    longarray = array.array('c',longstr) # once only at beginning!

    shortstring = longarray[beg:end].tostring() # or just edit an array

    # edit shortstring (or shortarray)

    longarray[beg:end] = array.array('c',shortstring)

    longstring = longarray.tostring() # if needed
    ________________________________________________________________________
    TonyN.:' *firstname*nlsnews@georgea*lastname*.com
    ' <http://www.georgeanelson.com/>
    Tony Nelson, Nov 14, 2005
    #5
  6. MackS

    MackS Guest

    Thank you all for your great help. One of the few things better than
    python is the knowledgeable community around it. : )

    Regards,

    Mack
    MackS, Nov 15, 2005
    #6
  7. Tony Nelson wrote:

    > >Can I get over this performance problem without reimplementing the
    > >whole thing using a barebones list object? I thought I was being "smart"
    > >by avoiding editing the long list, but then it struck me that I am
    > >creating a second object of the same size when I put the modified
    > >shorter string in place...

    >
    >
    > A couple of minutes experimenting with array.array at the python command
    > line indicates that it will work fine for you. Quite snappy on a 16 MB
    > array, including a slice assignment of 1 KB near the beginning.
    > Array.array is probably better than lists for speed, and uses less
    > memory. It is the way to go if you are going to be randomly editing all
    > over the place but don't need to convert to string often.


    I have no major objections to using array, but a minor
    one: ordinary lists may very well be more than snappy
    enough, and they have the advantage of being more
    familiar than the array module to many Python programmers.

    The time it takes to process a 20MB string will depend
    on the details of the processing, but my back of the
    envelope test using one large input string and an
    intermediate list of strings was *extremely* fast, less
    than half a second for a 20MB input. (See my earlier
    post for details.)

    Given that sort of speed, shifting to the less familiar
    array module just to shave the time from 0.49s to 0.45s
    is premature optimization. Although, in fairness, if
    you could cut the time to 0.04s for 20MB then it would
    be worth the extra work to use the array module.
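
    If anyone wants to check the numbers on their own machine, a rough timing
    harness along these lines will do (my own quick sketch, not a careful
    benchmark; the sizes and the fake "edit" are arbitrary):

    import array, time

    longstr = "A" * (1024*1024*20)   # 20 MB test string

    def with_list(s):
        chars = list(s)
        chars[1000:2000] = s[1000:2000].lower()   # stand-in for a real edit
        return ''.join(chars)

    def with_array(s):
        a = array.array('c', s)
        a[1000:2000] = array.array('c', s[1000:2000].lower())
        return a.tostring()

    for func in (with_list, with_array):
        start = time.time()
        func(longstr)
        print func.__name__, "took", time.time() - start, "seconds"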


    --
    Steven.
    Steven D'Aprano, Nov 15, 2005
    #7