optimization pointers?

Discussion in 'Python' started by r.e.s., Dec 11, 2003.

  1. r.e.s.

    r.e.s. Guest

    Would anyone care to offer pointers on how the following code
    might be optimized? Or alternatives? ...

    ---
    def lz(string):
        """ Return the Lempel-Ziv complexity (LZ78) of string. """

        vocabulary = []
        word = ''
        for char in string:
            word += char
            if word not in vocabulary:
                vocabulary.append(word)
                word = ''
        return len(vocabulary)
    ---
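    For example, running the lz above on a short string shows what gets
    counted:

    print lz('aababc')   # -> 3; the new words are 'a', 'ab', 'abc'

    Each new word is the shortest prefix of the remaining input that is
    not yet in the vocabulary, and the complexity is the number of such
    words.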

    On my 2 GHz system running PythonWin, the above code averages
    about 1.3 hrs to process a 1 MB string. I'd hoped to compute
    LZ78 for multi-GB files, but it seems infeasible. (?)

    Thanks for any feedback.
    --
    r.e.s.
     
    r.e.s., Dec 11, 2003
    #1

  2. "r.e.s." <> wrote:

    >Would anyone care to offer pointers on how the following code
    >might be optimized? Or alternatives? ...
    >
    >---
    >def lz(string):
    >    """ Return the Lempel-Ziv complexity (LZ78) of string. """
    >
    >    vocabulary = []
    >    word = ''
    >    for char in string:
    >        word += char
    >        if word not in vocabulary:
    >            vocabulary.append(word)
    >            word = ''
    >    return len(vocabulary)
    >---
    >
    >On my 2 GHz system running PythonWin, the above code averages
    >about 1.3 hrs to process a 1 MB string. I'd hoped to compute
    >LZ78 for multi-GB files, but it seems infeasible. (?)


    On a 300 MHz Celeron the script below takes about ten or twenty
    seconds; the original lz is commented out, as it never seems to
    finish for longer strings.

    Anton

    p.s. the file I used is here:
    http://sailor.gutenberg.org/etext97/1donq10.zip

    import sets

    def lz(string):
        """ Return the Lempel-Ziv complexity (LZ78) of string. """

        vocabulary = []
        word = ''
        for char in string:
            word += char
            if word not in vocabulary:
                vocabulary.append(word)
                word = ''
        return len(vocabulary)

    def lzx(string):
        """ Return the Lempel-Ziv complexity (LZ78) of string. """

        vocabulary = sets.Set()
        word = ''
        for char in string:
            word += char
            if word not in vocabulary:
                vocabulary.add(word)
                word = ''
        return len(vocabulary)

    def test():
        fn = '1donq10.txt'
        s = file(fn, 'rb').read(1000000)
        print lzx(s)
        #print lz(s)

    if __name__ == '__main__':
        test()
     
    Anton Vredegoor, Dec 11, 2003
    #2

  3. r.e.s.

    r.e.s. Guest

    "Mel Wilson" <> wrote ...

    > One simple improvement is to get rid of
    >
    > if an_item not in a_list
    >
    > which does a linear search over the list.
    > A dictionary is faster.
    >
    > Without a full benchmark, it seems that

    [...]
    > will show about a x6 speedup.


    I really appreciate the suggestions -- both yours
    and Anton's.

    After making the changes you indicated, the code
    runs in about 1/500th the time of the original
    (i.e. about 2 sec per MB for strings in RAM). The
    sets idea also speeds things up tremendously, but
    not as much -- it takes about 70% longer than the
    dictionary approach.
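    For anyone wanting to reproduce the comparison, a minimal timing
    harness along these lines should do (assuming Mel's lz_complexity
    and Anton's lzx from the posts above, and Anton's Gutenberg test
    file):

    import time

    def timed(func, s):
        """ Return (result, seconds) for one call of func(s). """
        t = time.clock()
        n = func(s)
        return n, time.clock() - t

    s = file('1donq10.txt', 'rb').read(1000000)
    print 'dict:', timed(lz_complexity, s)
    print 'sets:', timed(lzx, s)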

    Thanks again.
    --
    r.e.s.
     
    r.e.s., Dec 12, 2003
    #3
  4. "Mel Wilson" <> wrote in message
    news:rzO2/ks/...
    >
    > One simple improvement is to get rid of
    >
    > if an_item not in a_list
    >
    > which does a linear search over the list. A dictionary is faster.
    >
    > Without a full benchmark, it seems that appending
    >
    >
    >def lz_complexity (s):
    >    voc = {}
    >    word = ''
    >    for c in s:
    >        word += c
    >        if not voc.get (word, False):
    >            voc[word] = True
    >            word = ''
    >    return len (voc.keys())
    >
    >if __name__ == '__main__':
    >    import time
    >    text = file ('lzc.py', 'rt').read()
    >
    >    t0 = time.clock()
    >    t0 = time.clock() - t0
    >
    >    c = lz (text)
    >    t1 = time.clock()
    >    c = lz (text)
    >    t1 = time.clock() - t1
    >
    >    c = lz_complexity (text)
    >    t2 = time.clock()
    >    c = lz_complexity (text)
    >    t2 = time.clock() - t2
    >
    >    print 'T1:', t1 - t0
    >    print 'T2:', t2 - t0
    >
    >
    > will show about a x6 speedup.
    >
    > Avoiding
    >
    > a_string += another_string
    >
    > is a traditional speedup, but it seems you'll be needing the
    > string form for every character processed, so it may not buy
    > much in this case.
    >
    > Regards. Mel.
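    The traditional speedup Mel refers to is collecting the pieces in a
    list and joining once at the end, roughly:

    parts = []
    for c in s:
        parts.append(c)          # cheap appends in the loop ...
    result = ''.join(parts)      # ... and one join at the end

    though, as he says, it doesn't buy much here, since the running
    word has to be tested against the vocabulary after every character.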


    Here's my contribution, removing the string append in favour of
    slices. It buys a little speed, thanks to xrange.

    def lz_comp_mine(text):
        voc = {}
        st = 0
        # track the current word as the slice text[st:cur] instead of
        # appending characters to a string
        for cur in xrange(1, len(text)+1):
            if not voc.get(text[st:cur], False):
                voc[text[st:cur]] = True
                st = cur
        return len(voc)

    Anthony McDonald
     
    Anthony McDonald, Dec 12, 2003
    #4
  5. "Mel Wilson" <> wrote in message
    news:xIf2/ks/...
    >
    > Nice touch! I tried slices and took a huge performance
    > hit (almost 3x the list version) but I didn't use `for ... in
    > xrange ...`. It must have all been in the while-loop test
    > and index incrementing.
    >
    > Regards. Mel.


    Thanks.

    My original attempt, which incidentally used range, was about half
    a second slower than yours with 700K of test data. Just as I was
    about to close the editor window I noticed your return statement:

    return len (voc.keys())

    which creates a list of keys just to apply len to it. That's an
    unneeded step, since len applied directly to a dictionary returns
    the number of keys. I made the change and gained seven hundredths
    of a second. Not much, and still behind yours, but it suggested
    that changing range to xrange and avoiding the list creation might
    help. Voilà!
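    That observation is easy to confirm: in Python 2, dict.keys()
    builds a real list just so len can count it, while len on the
    dictionary reads its size directly:

    voc = {'a': True, 'ab': True, 'abc': True}
    print len(voc)           # 3 -- no intermediate list
    print len(voc.keys())    # 3 -- builds a throwaway list first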

    The interesting thing that benchmarking with test data shows is
    that the difference in speed between our two routines is about
    0.05 secs per 700K processed. That difference remains constant at
    least up to 80 MB of input data, implying that slices are no more
    efficient than string appending, and in this case may actually be
    less efficient.

    Anthony McDonald
     
    Anthony McDonald, Dec 13, 2003
    #5
  6. "Anthony McDonald" <tonym1972(at)club-internet(in)fr> wrote:

    >return len (voc.keys())
    >
    >Which creates a list of keys to then apply len to, but thats an unneeded
    >step as len applied directly to a dictionary returns the number of keys.


    Below are two more small optimization possibilities. Originally I
    wasn't going to post them because they are way into
    micro-optimization country and because the setdefault solution is
    beautiful but blond. At least on some of my machines, however,
    they are both faster than the solutions offered so far.

    Anton

    def lzx(s):
        word, D = '', {}
        # bind the dict methods once, outside the loop, to skip the
        # attribute lookup on every character
        Dget, Dset = D.get, D.__setitem__
        for c in s:
            word += c
            if not Dget(word):
                Dset(word, True)
                word = ''
        return len(D)

    def lzy(s):
        j, D = 0, {}
        func = D.setdefault
        for i in xrange(1, len(s)+1):
            # setdefault hands back our own j only if s[j:i] was new
            if func(s[j:i], j) is j:
                j = i
        return len(D)
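    The trick in lzy is that setdefault returns the stored value, and
    that value is our own j (the very same object) only when the key
    was new. A small demonstration, with ints large enough that
    CPython's small-int cache can't blur the identity test:

    D = {}
    j = 1000
    print D.setdefault('w', j) is j   # True: 'w' was new, j itself was stored
    k = 2000
    print D.setdefault('w', k) is k   # False: the old value (j) comes back

    In lzy itself the stored value is always the start index of an
    earlier word, hence numerically smaller than the current j, so the
    identity test cannot give a false positive.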
     
    Anton Vredegoor, Dec 14, 2003
    #6
