Optimizing a text statistics function

Discussion in 'Python' started by Nickolay Kolev, Apr 21, 2004.

  1. Hi all,

    I am currently writing some simple functions in the process of learning
    Python. I have a task where the program has to read in a text file and
    display some statistics about the tokens in that file.

    The text I have been feeding it is Dickens' David Copperfield.

    It is really simple - it reads the file into memory, splits it on
    whitespace, strips punctuation characters and transforms all remaining
    elements to lowercase. It then looks through what has been left and
    creates a list of tuples (count, word) which contains each unique word
    and the number of times it appears in the text.

    The code (~30 lines and easy to read :) can be found at
    http://www.uni-bonn.de/~nmkolev/python/textStats.py
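
    In outline, the script does roughly this (a simplified sketch, not the
    file itself):

    import string
    import sys

    def textStats(filename):
        text = open(filename).read()              # read the whole file into memory
        strippedWords = [w.strip(string.punctuation).lower()
                         for w in text.split()]   # split, strip punctuation, lowercase
        unique = {}
        for w in strippedWords:                   # first pass: initialise counts
            unique[w] = 0
        for w in strippedWords:                   # second pass: count occurrences
            unique[w] += 1
        wordCountList = [(count, word) for word, count in unique.items()]
        wordCountList.sort()
        wordCountList.reverse()
        return wordCountList

    if __name__ == '__main__':
        for count, word in textStats(sys.argv[1])[:10]:
            print count, word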

    I am now looking for a way to make the whole thing run faster. I have
    already made many changes since the initial version, realising many
    mistakes along the way. As I cannot think of anything else to improve,
    I thought I would ask the more knowledgeable.

    I find the two loops through the initial list a bit troubling. Could
    this be avoided?

    Any other remarks and suggestions will also be greatly appreciated.

    Many thanks in advance,
    Nicky
    Nickolay Kolev, Apr 21, 2004
    #1

  2. On Wed, Apr 21, 2004 at 04:51:56PM +0200, Nickolay Kolev wrote:
    > It is really simple - it reads the file in memory, splits it on
    > whitespace, strips punctuation characters and transforms all remaining
    > elements to lowercase. It then looks through what has been left and
    > creates a list of tuples (count, word) which contain each unique word
    > and the number of time it appears in the text.
    >
    > The code (~30 lines and easy to read :) can be found at
    > http://www.uni-bonn.de/~nmkolev/python/textStats.py
    >
    > I am now looking for a way to make the whole thing run faster.


    Do you actually need it to be faster? If the answer is "no, but
    it would be nice." then you are already done *wink*.

    A good profiling strategy is to wrap each part in a function
    so you can see which lines consume the most CPU. Just make sure
    to wrap big pieces so the function call overhead doesn't get
    added ten thousand times and distort the picture.
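
    For example (an untested sketch; proc() here just stands for your
    parsing code wrapped in a function):

    import profile

    def proc(filename):
        pass    # your parsing code, wrapped in a function

    profile.run("proc('dickens.txt')")    # prints a table of time spent per function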

    You will get a bunch of suggestions on how to make the code faster,
    so I'll skip those. What you want to do is only do the expensive
    parsing once and not every time you run your program. Try pickle.
    [untested code follows]

    import pickle

    def proc(txt_filename): # txt_filename like 'dickens.txt'
        ... existing code ...
        return wordCountList

    def procWrap(txt_filename):
        cache_filename = txt_filename.replace('.txt', '.pickle')
        try:
            fob = open(cache_filename, 'rb')
            wordCountList = pickle.load(fob)
        except IOError:
            wordCountList = proc(txt_filename)
            fob = open(cache_filename, 'wb')  # binary mode for protocol -1
            pickle.dump(wordCountList, fob, -1) # see the docs about the '-1'
        return wordCountList

    Use procWrap() instead of proc() to get the list. You'll need
    to delete the .pickle file every time you change proc() so the
    pickle gets refreshed. This way you never have to care about how
    efficient the parsing loop is, because you only have to call it once.
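
    So the main script ends up doing something like (again untested):

    wordCountList = procWrap('dickens.txt')    # parses once, then loads dickens.pickle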

    -jackdied
    Jack Diederich, Apr 21, 2004
    #2

  3. On Wed, 21 Apr 2004 16:51:56 +0200, Nickolay Kolev <>
    declaimed the following in comp.lang.python:

    > It is really simple - it reads the file in memory, splits it on
    > whitespace, strips punctuation characters and transforms all remaining
    > elements to lowercase. It then looks through what has been left and
    > creates a list of tuples (count, word) which contain each unique word
    > and the number of time it appears in the text.
    >

    Without looking at the code, I'd probably drop the tuples in
    favour of a dictionary keyed by the word, with the count as the value.
    Granted, the output would not be easily sorted...

    Okay, you do have a dictionary...

    I suggest dropping the initialization pass -- for common words
    it's going to be redundant... Dictionaries have a method for supplying a
    default value if a key doesn't exist. See the following:


    ## for x in strippedWords:
    ##     unique[x] = 0
    ##
    ## for x in strippedWords:
    ##     unique[x] += 1

    for x in strippedWords:
        unique[x] = unique.get(x, 0) + 1


    --
    > ============================================================== <
    > | Wulfraed Dennis Lee Bieber KD6MOG <
    > | Bestiaria Support Staff <
    > ============================================================== <
    > Home Page: <http://www.dm.net/~wulfraed/> <
    > Overflow Page: <http://wlfraed.home.netcom.com/> <
    Dennis Lee Bieber, Apr 21, 2004
    #3
  4. Hello,

    I don't know python as well as most people on this list so
    this is a leading question.

    In other languages I've used (mainly Java, although some C, C# and VB
    <wash out your mouth>), the way I would look at speeding this up is to
    avoid loading all the words into memory in one go and then working upon
    them. I'd create one stream which reads through the file and passes
    each word it finds from the lexing (making the input into tokens) on to
    a listener, then another stream listening to this which sorts out the
    detail from these tokens (parsing), and finally an output stream which
    puts this data wherever it needs to be (DB, screen, file, etc.). This
    means that the program would scale better (if you passed the European
    voting register through your system it would take much longer, as you
    must scan the information twice).

    However, as more experienced Python programmers have not suggested
    this, is that because:

    a. There is something I'm not getting about Python text handling
    b. It is not easy/possible in Python

    I ask this because I've been looking at (C)StringIO and it is OK for
    this kind of behaviour using strings (reading from a serial port and
    passing on to middleware), but it doesn't seem to have the ability to
    remove characters from the buffer once they have been read, so the
    buffer will grow and grow as the process runs. For now I'm having to
    use strings, which is less than ideal because they are immutable (I was
    thinking of writing my own StringBuffer class which will discard
    characters once they have been read from it), and therefore
    my program scales badly.
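
    Something like the following is what I had in mind (a naive, untested
    sketch -- the string slicing still copies, which is part of the problem):

    class DiscardingBuffer:
        # A string buffer that forgets characters once they have been read.
        def __init__(self):
            self._data = ''

        def write(self, text):
            self._data += text            # append incoming characters

        def read(self, n):
            taken = self._data[:n]        # hand back up to n characters...
            self._data = self._data[n:]   # ...and discard them from the buffer
            return taken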

    However I do agree with the earlier poster - don't optimise for
    speed unless you need to (I'm assuming this is an academic exercise and
    I'm waiting to go to the pub)!!! Simplicity of design is usually a
    better win.

    Cheers,

    Neil

    Dennis Lee Bieber wrote:

    >On Wed, 21 Apr 2004 16:51:56 +0200, Nickolay Kolev <>
    >declaimed the following in comp.lang.python:
    >
    >>It is really simple - it reads the file in memory, splits it on
    >>whitespace, strips punctuation characters and transforms all remaining
    >>elements to lowercase. It then looks through what has been left and
    >>creates a list of tuples (count, word) which contain each unique word
    >>and the number of time it appears in the text.
    >
    > Without looking at the code, I'd probably drop the use of the
    >tuples for a dictionary keyed by the word, with the data being the
    >count. Granted, the output would not be easily sorted...
    >
    > Okay, you do have a dictionary...
    >
    > I suggest dropping the initialization pass -- for common words
    >it's going to be redundant... Dictionaries have a method for supplying a
    >default value if a key doesn't exist. See the following:
    >
    >## for x in strippedWords:
    >##     unique[x] = 0
    >##
    >## for x in strippedWords:
    >##     unique[x] += 1
    >
    >for x in strippedWords:
    >    unique[x] = unique.get(x, 0) + 1


    --

    Neil Benn
    Senior Automation Engineer
    Cenix BioScience
    PfotenhauerStrasse 108
    D-01307
    Dresden
    Germany

    Tel : +49 (351) 210 1300
    e-mail :
    Cenix Website : http://www.cenix-bioscience.com
    Neil Benn, Apr 21, 2004
    #4
  5. Dennis Lee Bieber wrote:

    > ## for x in strippedWords:
    > ##     unique[x] = 0
    > ##
    > ## for x in strippedWords:
    > ##     unique[x] += 1
    >
    > for x in strippedWords:
    >     unique[x] = unique.get(x, 0) + 1



    Just the thing I was looking for. :)

    To clarify:

    unique.get(x, 0)

    This says *give me the value of unique[x] if it exists, else give me 0*.
    Correct?
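
    A quick check in the interpreter seems to confirm it:

    >>> unique = {'copperfield': 3}
    >>> unique.get('copperfield', 0)    # key exists: its value is returned
    3
    >>> unique.get('micawber', 0)       # key missing: the default 0 is returned
    0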
    Nickolay Kolev, Apr 21, 2004
    #5
  6. Nickolay Kolev wrote:

    > I am currently writing some simple functions in the process of learning
    > Python. I have a task where the program has to read in a text file and
    > display some statistics about the tokens in that file.
    >
    > The text I have been feeding it is Dickens' David Copperfield.
    >
    > It is really simple - it reads the file in memory, splits it on
    > whitespace, strips punctuation characters and transforms all remaining
    > elements to lowercase. It then looks through what has been left and
    > creates a list of tuples (count, word) which contain each unique word
    > and the number of time it appears in the text.
    >
    > The code (~30 lines and easy to read :) can be found at
    > http://www.uni-bonn.de/~nmkolev/python/textStats.py
    >
    > I am now looking for a way to make the whole thing run faster. I have
    > already made many changes since the initial version, realising many
    > mistakes. As I do not think of anything else, I thought I would ask the
    > more knowledgeable.
    >
    > I find the two loops through the initial list a bit troubling. Could
    > this be avoided?
    >
    > Any other remarks and suggestions will also be greatly appreciated.


    I'd probably do this with regular expressions, either with r.split() where r
    matches the spaces between the words or r.finditer() where r matches the
    words.

    But since you were asking for speed, another approach might be faster. First
    I normalize all non-word characters to space and all alpha chars to
    lowercase. Note that this may yield different word frequencies; e.g.
    "two.words" will be two words with my approach and one word with yours.

    from __future__ import division
    import string, sys

    def main(filename):
        # caveat: assumes one byte per character
        repl = string.whitespace + string.punctuation
        tr = string.maketrans(repl, len(repl)*" ").lower()
        assert len(tr) == 256

        words = file(filename).read().translate(tr).split()
        histogram = {}
        wordCount = len(words)
        for word in words:
            histogram[word] = histogram.get(word, 0) + 1

        wordCountList = [(c, w) for w, c in histogram.items()]
        wordCountList.sort()
        wordCountList.reverse()

        print
        for freq, word in wordCountList[:10]:
            print '%15r %5d %5.2f%%' % (word, freq, freq / wordCount * 100)
        print

    if __name__ == "__main__":
        main(sys.argv[1])

    Other remarks:
    Speed is not that important most of the time.
    Psyco offers speed for free.
    You might consider switching to 4-space indent.
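
    For the Psyco remark, the usual recipe is just two lines at the top of
    the script (assuming the psyco extension is installed):

    import psyco
    psyco.full()    # JIT-specialise all functions in the module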

    Peter
    Peter Otten, Apr 21, 2004
    #6
  7. "Neil Benn" <> wrote in message
    news:...
    > In other languages I've used (mainly java although some C, C# VB
    > <wash out your mouth>), the way I woud look at speeding this up is to
    > avoid loading all the words into memory in one go and then working upon
    > them. I'd create one stream which reads through the file, then passes
    > onto a listener each word it finds from the lexing (making the input
    > into tokens) and then another stream listening to this which will then
    > sort out the detail from these tokens (parsing), finally an output
    > stream which put this data wherever it needs to be (DB, screen, file,
    > etc). This means that the program would scale better (if you pass the
    > European voting register through your system it would take exponentially
    > much longer as you must scan the information twice).


    You are talking about chaining iterators, which the generators and the new
    iterator protocol make easier than before -- intentionally. Something like
    the following (untested).

    filename = 'whatever'

    def wordify(source):
        for line in source:
            for word in line.split():
                yield word.strip()

    def tabulate(words):
        counts = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        for wordcount in counts.iteritems():
            yield wordcount

    def disposer(wordcounts):
        for wordcount in wordcounts:
            print wordcount

    disposer(tabulate(wordify(file(filename))))


    > However as more experienced python programmers have not suggested
    > this is this because there is :
    >
    > a. Something I'm not getting about python text handling


    Nothing obvious.

    > b. Not easy/possible in python


    Wrong (see above)

    c. The OP's question (speedup) was answered half an hour after posting by an
    experienced Python programmer (use dict.get) -- which answer makes the
    processing one-pass, which in turn makes chaining possible.

    d. It has been less than 5 hours since OP posted.

    e. The OP did not ask how to restructure the program to make it more modular.
    Thinking ahead to make code more reusable and scalable is a second-order
    concern after learning basics like getting .get(). But since you brought
    the subject up ...

    Terry J. Reedy
    Terry Reedy, Apr 21, 2004
    #7
  8. Peter Otten wrote:

    Thanks for taking the time to rewrite the whole thing. I will try to
    work through it, as I do not understand most of what you are doing
    there.

    > Speed is not that important most of the time.


    Agreed. I am not looking for a magical solution that will shave some
    time off at the price of using things I do not understand and/or are not
    readable.

    > You might consider switching to 4-space indent.


    Not wanting to start a flame here (this is probably a much discussed
    topic): why? I write in VIM and use tabs. The default representation
    width of the tab char there is 8 spaces. If I do get many, many nested
    blocks, I could just set it to a lower width, no?

    Nicky
    Nickolay Kolev, Apr 21, 2004
    #8
  9. Nickolay Kolev <> wrote in message news:<c661pe$tie$-bonn.de>...
    > I find the two loops through the initial list a bit troubling. Could
    > this be avoided?


    How about instead of:

    for x in strippedWords:
        unique[x] = 0

    for x in strippedWords:
        unique[x] += 1

    You do something like (untested):

    for word in strippedWords:
        unique[word] = unique.get(word, 0) + 1

    --
    Benji York
    http://benjiyork.com
    Benji York, Apr 21, 2004
    #9
  10. Peter Otten wrote:
    > Nickolay Kolev wrote:

    Playing along, simply because it's fun.

    > def main(filename):
    > ...


    #> words = file(filename).read().translate(tr).split()
    #> histogram = {}
    #> wordCount = len(words)
    #> for word in words:
    #>     histogram[word] = histogram.get(word, 0) + 1

    Better not to do several huge string allocs above (I suspect).
    This method lets you work on files too large to read into memory:

    wordCount = 0
    histogram = {}

    for line in file(filename):
        words = line.translate(tr).split()
        wordCount += len(words)
        for word in words:
            histogram[word] = histogram.get(word, 0) + 1

    > ...



    - Scott David Daniels
    Scott David Daniels, Apr 21, 2004
    #10
  11. Scott David Daniels wrote:

    > Peter Otten wrote:
    >> Nickolay Kolev wrote:

    > Playing along, simply because it's fun.
    >
    >> def main(filename):
    >> ...

    >
    > #> words = file(filename).read().translate(tr).split()
    > #> histogram = {}
    > #> wordCount = len(words)
    > #> for word in words:
    > #>     histogram[word] = histogram.get(word, 0) + 1
    >
    > Better not to do several huge string allocs above (I suspect).
    > This method lets you to work on files too large to read into memory:
    >
    > wordCount = 0
    > histogram = {}
    >
    > for line in file(filename):
    >     words = line.translate(tr).split()
    >     wordCount += len(words)
    >     for word in words:
    >         histogram[word] = histogram.get(word, 0) + 1
    >
    >> ...


    In theory you are right; in practice most text files are tiny compared to
    the amount of RAM available on a fairly recent machine.

    However, I readily admit that your variant plays nicely with *very* large
    files as long as they have newlines, and, contrary to my expectation, is
    even a bit faster with my ad hoc test case, a 670,000-byte, 8,700-line
    HTML file.

    Peter
    Peter Otten, Apr 21, 2004
    #11
  12. On Wed, 21 Apr 2004 18:18:16 +0200, Nickolay Kolev <>
    declaimed the following in comp.lang.python:

    >
    > unique.get(x, 0)
    >
    > This says *give me the value of unique[x] if it exists, else give me 0*.
    > Correct?


    Unless I misread the help file... I do know there IS a method
    that does this, as I've used it in the past. The other scheme would be a
    try/except block...

    try:
        unique[x] = unique[x] + 1   # maybe use the newer += operator
    except KeyError:
        unique[x] = 1

    At first, this will tend to trigger the exception on nearly every
    word, but as you start getting repeated words, the first branch will
    succeed more and more often.


    --
    > ============================================================== <
    > | Wulfraed Dennis Lee Bieber KD6MOG <
    > | Bestiaria Support Staff <
    > ============================================================== <
    > Home Page: <http://www.dm.net/~wulfraed/> <
    > Overflow Page: <http://wlfraed.home.netcom.com/> <
    Dennis Lee Bieber, Apr 22, 2004
    #12
  13. Scott David Daniels wrote:

    > Better not to do several huge string allocs above (I suspect).
    > This method lets you to work on files too large to read into memory:


    .... code ...

    Now this is something I was thinking about when I started considering
    options for this task.

    From my understanding, for line in file('...'): is only useful if I
    want to make checks on the lines and then do something with them:

    for line in file('...'):
        if line.startswith('XXX'):
            myDesiredLines.append(line)

    This avoids creating a huge list of all lines and then filtering it to
    another list with only the desired ones.

    In my case I want to get all the lines regardless of any condition. I
    also do not even need them as a list of lines, a single huge string is
    completely sufficient. It will be split on whitespace later anyway.

    Both methods should produce about the same memory usage, as all words
    end up stored in a list. Reading the file line by line should be slower,
    as Python would have to check where the newline characters are. Please
    comment and correct; I am making assumptions here. Clarification from
    someone who knows how these things are implemented internally would be nice.

    Nicky
    Nickolay Kolev, Apr 22, 2004
    #13
  14. "Nickolay Kolev" <> wrote in message
    news:c67kmc$utu$-bonn.de...
    > From my understanding, for line in file('...'): is only useful if I
    > want to make checks on the lines and then do something with them:


    It is also useful for limiting the amount of data held in RAM at one
    time. The main downside would be if the units you want to analyse, in
    this case words, were split across line endings. But split() on
    whitespace also splits the units at line endings.

    > Both methods should produce the same amount of memory usage as all words
    > are stored in a list.


    It seems to me that 300 meg file chunk + 400 meg list/dict is larger than
    small file chunk + 400 meg list/dict, but maybe I misunderstand what you
    meant.

    > Reading a file line by line should be slower, as
    > Python would have to check where the newline characters are.


    'for line in file' has now been optimized to run quickly. Behind the
    scenes, sensibly sized chunks (such as 64K) are read from disk into a
    memory buffer. Lines are then doled out one at a time. This is done with
    compiled C. So I suspect the slowdown compared to read-all and split is
    minimal.
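
    If you want to check this on your own machine, a rough (untested) timing
    sketch would be something like the following; the filename is just a
    placeholder:

    import time

    def time_read_all(filename):
        start = time.time()
        words = open(filename).read().split()
        return len(words), time.time() - start

    def time_line_by_line(filename):
        start = time.time()
        count = 0
        for line in open(filename):
            count += len(line.split())
        return count, time.time() - start

    print time_read_all('copperfield.txt')
    print time_line_by_line('copperfield.txt')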

    Terry J. Reedy
    Terry Reedy, Apr 22, 2004
    #14
