large dictionary creation takes a LOT of time.

Discussion in 'Python' started by possibilitybox, Apr 29, 2005.

  1. this code here:


    def wordcount(lines):
        for i in range(len(lines)/8):
            words = lines[i].split(" ")
            if not locals().has_key("frequency"):
                frequency = {}
            for word in words:
                if frequency.has_key(word):
                    frequency[word] += 1
                else:
                    frequency[word] = 1
        return frequency
    wordcount(lines)

    is taking over six minutes to run on a two megabyte text file. i
    realize that's a big text file, a really big one (it's actually the
    full text of don quixote). i'm trying to figure out why. is there a
    better way for me to do a frequency count of all the words in the text?
    it seems to me like this should scale linearly, but perhaps it isn't?
    i don't know much about algorithmic complexity. if someone could give
    a breakdown of this function's complexity as well i'd be much obliged.

    lines is expected to be a list of lines as provided by file.readlines()
    possibilitybox, Apr 29, 2005
    #1

  2. Paul Rubin Guest

    "possibilitybox" <> writes:

    > this code here:
    >
    >
    > def wordcount(lines):
    >     for i in range(len(lines)/8):
    >         words = lines[i].split(" ")
    >         if not locals().has_key("frequency"):
    >             frequency = {}
    >         for word in words:
    >             if frequency.has_key(word):
    >                 frequency[word] += 1
    >             else:
    >                 frequency[word] = 1
    >     return frequency
    > wordcount(lines)
    >
    > is taking over six minutes to run on a two megabyte text file. i
    > realize that's a big text file, a really big one (it's actually the
    > full text of don quixote). i'm trying to figure out why. is there a
    > better way for me to do a frequency count of all the words in the text?


    2MB is not that large. Your method is ok and shouldn't be that slow
    unless you're on a pretty slow PC. Could your machine be short of
    memory and paging a lot? You could tweak the code somewhat by moving
    the initialization of the frequency dict out of the loop and combining
    a few other statements. Also you should use xrange instead of range,
    to avoid allocating a big list in memory:

    def wordcount(lines):
        frequency = {}
        for i in xrange(len(lines)/8):
            for word in lines[i].split():
                frequency[word] = 1 + frequency.get(word, 0)
        return frequency
    wordcount(lines)

    > it seems to me like this should scale linearly, but perhaps it isn't?
    > i don't know much about algorithmic complexity. if someone could give
    > a breakdown of this function's complexity as well i'd be much obliged.


    It should be close to linear, or at worst n log n, depending on what
    happens when dicts have to be enlarged as the # of elements increases.
    Why are you only processing 1/8th of the lines?
    Paul Rubin, Apr 29, 2005
    #2
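
    For what it's worth, a quick usage sketch of the function above. The
    filename and the top-ten report are illustrative assumptions, not part
    of the original post (and sorted() with a key argument needs Python 2.4):

    lines = open('996.txt').readlines()   # placeholder filename
    frequency = wordcount(lines)

    # show the ten most frequent words
    pairs = sorted(frequency.items(), key=lambda item: item[1], reverse=True)
    for word, count in pairs[:10]:
        print '%6d %s' % (count, word)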

  3. Kent Johnson Guest

    possibilitybox wrote:
    > this code here:
    >
    >
    > def wordcount(lines):
    >     for i in range(len(lines)/8):
    >         words = lines[i].split(" ")
    >         if not locals().has_key("frequency"):
    >             frequency = {}
    >         for word in words:
    >             if frequency.has_key(word):
    >                 frequency[word] += 1
    >             else:
    >                 frequency[word] = 1
    >     return frequency
    > wordcount(lines)
    >
    > is taking over six minutes to run on a two megabyte text file. i
    > realize that's a big text file, a really big one (it's actually the
    > full text of don quixote). i'm trying to figure out why. is there a
    > better way for me to do a frequency count of all the words in the text?
    > it seems to me like this should scale linearly, but perhaps it isn't?
    > i don't know much about algorithmic complexity. if someone could give
    > a breakdown of this function's complexity as well i'd be much obliged.
    >
    > lines is expected to be a list of lines as provided by file.readlines()
    >


    Here is a little cleaner version. It takes about a second to run on my PC. What hardware are you
    running on?

    path = 'DonQuixote.txt'

    frequency = {}

    for line in open(path):
        for word in line.split():
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1

    print len(frequency), 'words'


    Kent
    Kent Johnson, Apr 29, 2005
    #3
  4. Ville Vainio Guest

    >>>>> "Kent" == Kent Johnson <> writes:

    Kent> if frequency.has_key(word):
    Kent>     frequency[word] += 1
    Kent> else:
    Kent>     frequency[word] = 1

    This is a good place to use the 'get' method of dict:

    frequency[word] = frequency.get(word,0) + 1

    --
    Ville Vainio http://tinyurl.com/2prnb
    Ville Vainio, Apr 29, 2005
    #4
  5. Bearophile Guest

    Ville Vainio:
    > This is a good place to use the 'get' method of dict:
    > frequency[word] = frequency.get(word,0) + 1


    I think Kent Johnson's version is longer, but a bit faster...

    Bye,
    Bearophile
    , Apr 29, 2005
    #5
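
    A rough way to check speed claims like this is the standard timeit
    module. A minimal sketch, with a made-up word list as a stand-in (real
    text, with many more distinct words, may rank the variants differently):

    import timeit

    WORDS = ['the', 'quick', 'brown', 'fox', 'the', 'the'] * 1000

    def count_ifelse():
        freq = {}
        for w in WORDS:
            if w in freq:              # the same test has_key() performs
                freq[w] += 1
            else:
                freq[w] = 1

    def count_get():
        freq = {}
        for w in WORDS:
            freq[w] = freq.get(w, 0) + 1

    def count_tryexcept():
        freq = {}
        for w in WORDS:
            try:
                freq[w] += 1
            except KeyError:
                freq[w] = 1

    for name in ('count_ifelse', 'count_get', 'count_tryexcept'):
        timer = timeit.Timer('%s()' % name, 'from __main__ import %s' % name)
        print name, min(timer.repeat(3, 100))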
  6. Roy Smith Guest

    In article <>,
    "possibilitybox" <> wrote:

    > this code here:
    >
    >
    > def wordcount(lines):
    >     for i in range(len(lines)/8):
    >         words = lines[i].split(" ")
    >         if not locals().has_key("frequency"):
    >             frequency = {}
    >         for word in words:
    >             if frequency.has_key(word):
    >                 frequency[word] += 1
    >             else:
    >                 frequency[word] = 1
    >     return frequency
    > wordcount(lines)
    >
    > is taking over six minutes to run on a two megabyte text file. i
    > realize that's a big text file, a really big one (it's actually the
    > full text of don quixote).


    Something doesn't make sense with your timing.

    I just downloaded the text of Don Quixote
    (http://www.gutenberg.org/dirs/9/9/996/996.txt). It's about 2.3 Mbytes,
    428 kwords, 40 klines. This includes the text of the novel itself, plus a
    little boilerplate text added by the Gutenberg Project folks.

    I ran your program against it on my PowerBook (1 GHz PowerPC, Python 2.3.4,
    768 Mbytes RAM). It took about 0.4 seconds. When I got rid of the "/8" in
    the range() call (so it processed the whole text), it took about 1.8
    seconds (0.24 of which were just reading the file). Some other posters
    reported similar findings.

    What kind of system are you running it on? The only thing I can think of
    is that you've got way too little memory and your machine is just
    thrashing. My Python process is about 31 Mb when it first starts up, grows
    to 35 when the file is read into a list, then gets to 38 after the call to
    wordcount().

    > i'm trying to figure out why. is there a
    > better way for me to do a frequency count of all the words in the text?
    > it seems to me like this should scale linearly, but perhaps it isn't?
    > i don't know much about algorithmic complexity. if someone could give
    > a breakdown of this function's complexity as well i'd be much obliged.


    Well, I don't see anything in your code which isn't linear. There are some
    places where you could do things a little more efficiently, but these would
    all be replacing one linear process with a better linear process. Small
    speedups, but not the drastic kind of speedups you would expect by
    replacing a quadratic process with a linear one.

    Here's a few things I would change:

    > if not locals().has_key("frequency"):
    >     frequency = {}


    This is kind of silly. Just factor this out of the main loop, and do
    "frequency = {}" before the first for loop. This won't speed things up
    other than trivially, but it'll make your code easier to read and
    understand.

    Next, replace

    > for i in range(len(lines)):
    >     words = lines[i].split(" ")


    with something like

    > for line in lines:
    >     words = line.split()


    It's marginally faster, easier to read, and is actually more correct;
    calling split() with no arguments makes it split on arbitrary white space,
    which is probably what you really want:

    >>> "foo bar baz".split(" ")

    ['foo', 'bar', '', '', '', 'baz']

    >>> "foo bar baz".split()

    ['foo', 'bar', 'baz']

    And lastly, instead of

    > if frequency.has_key(word):
    >     frequency[word] += 1
    > else:
    >     frequency[word] = 1


    I would do:

    try:
        frequency[word] += 1
    except KeyError:
        frequency[word] = 1

    which is usually a little bit faster. Somebody else mentioned using the
    get() method of dictionaries here; that might be even better, but I learned
    the try/except trick before get() existed, so I tend to stick to that :)

    But, as I said, all of these are just minor tweaks. Overall, your code
    looks like it should run in O(n), so fixing your code is not where you
    should be looking. Something external (i.e. memory thrashing or some other
    exceptionally bad bit of system performance) has to be causing the
    horrendously bad performance you're seeing, and that's where you should be
    concentrating your efforts.

    BTW, if you suspected you had some kind of non-linear algorithm, and didn't
    trust code inspection to verify its existence, you could just run your
    program a bunch of times on different sized data sets, and plot runtime vs.
    input size to see what kind of curve you get.
    Roy Smith, Apr 29, 2005
    #6
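
    A minimal sketch of that experiment, assuming the Gutenberg text
    mentioned above is saved locally as 996.txt and using a cleaned-up
    counter along the lines suggested in this thread:

    import time

    def wordcount(lines):
        frequency = {}
        for line in lines:
            for word in line.split():
                frequency[word] = frequency.get(word, 0) + 1
        return frequency

    lines = open('996.txt').readlines()

    # time the count over growing prefixes of the file; roughly constant
    # time-per-line across the runs indicates linear behaviour
    for fraction in (8, 4, 2, 1):
        subset = lines[:len(lines) / fraction]
        start = time.time()
        wordcount(subset)
        print '%6d lines: %.3f seconds' % (len(subset), time.time() - start)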
  7. Maksim Kasimov Guest

    Kent Johnson wrote:
    >
    > Here is a little cleaner version. It takes about a second to run on my
    > PC. What hardware are you running on?
    >
    > path = 'DonQuixote.txt'
    >
    > frequency = {}
    >
    > for line in open(path):
    >     for word in line.split():
    >         if frequency.has_key(word):
    >             frequency[word] += 1
    >         else:
    >             frequency[word] = 1
    >
    > print len(frequency), 'words'
    >
    >
    > Kent


    > for line in open(path):

    the line of your example raises another question: will the opened file be read all at once, as the readlines() method does, or line by line, as the readline() method does?
    as far as i know, it depends on the implementation of the "__iter__" method of the object that "open()" returns, so another question: where can i find such information (about how these functions work)?


    --
    Best regards,
    Maksim Kasimov
    mailto:
    Maksim Kasimov, Apr 29, 2005
    #7
  8. Kent Johnson Guest

    Maksim Kasimov wrote:
    > Kent Johnson wrote:
    > > for line in open(path):

    > the line of your example raises another question: will the opened file
    > be read all at once, as the readlines() method does, or line by line,
    > as the readline() method does?


    It will be read line by line as readline() does.

    > as far as i know, it depends on the implementation of the "__iter__"
    > method of the object that "open()" returns, so another question: where
    > can i find such information (about how these functions work)?


    http://docs.python.org/lib/built-in-funcs.html
    http://docs.python.org/lib/bltin-file-objects.html

    Kent
    Kent Johnson, Apr 29, 2005
    #8
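
    The protocol behind "for line in f" is small enough to sketch directly.
    A toy class (not the real file implementation, which lives in C) that is
    its own iterator, the way file objects are:

    class FakeFile:
        """Toy stand-in that mimics line-by-line file iteration."""
        def __init__(self, text):
            self.lines = text.splitlines(True)   # True keeps line endings
            self.index = 0

        def __iter__(self):
            # like a real file object, it is its own iterator
            return self

        def next(self):
            # Python 2 iterator protocol: return the next line, or raise
            # StopIteration at end of input
            if self.index >= len(self.lines):
                raise StopIteration
            line = self.lines[self.index]
            self.index += 1
            return line

    for line in FakeFile('one\ntwo\nthree\n'):
        print repr(line)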
  9. In fact, as one of the Peters (either Otten or Hansen) explained to me,

    for line in open(file):

    is actually both faster (being buffered) and generally better for very
    large files because it doesn't read the whole file into memory, like
    readlines does (if you have a memory limitation).

    On Fri, 29 Apr 2005 12:00:37 -0400, Kent Johnson <> wrote:

    > Maksim Kasimov wrote:
    >> Kent Johnson wrote:
    >> > for line in open(path):

    >> the line of your example raises another question: will the opened file
    >> be read all at once, as the readlines() method does, or line by line,
    >> as the readline() method does?

    >
    > It will be read line by line as readline() does.
    >
    >> as far as i know, it depends on the implementation of the "__iter__"
    >> method of the object that "open()" returns, so another question: where
    >> can i find such information (about how these functions work)?

    >
    > http://docs.python.org/lib/built-in-funcs.html
    > http://docs.python.org/lib/bltin-file-objects.html
    >
    > Kent
    Caleb Hattingh, Apr 29, 2005
    #9
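
    A small sketch of the difference (the filename is a placeholder): the
    first form holds every line in memory at once, the second touches one
    buffered line at a time.

    # readlines() materialises the whole file as a list of lines up front
    lines = open('996.txt').readlines()

    # iterating over the file object reads buffered chunks as needed
    total = 0
    for line in open('996.txt'):
        total += len(line.split())
    print total, 'words'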
  10. On Friday 29 April 2005 11:53, Ville Vainio wrote:
    > >>>>> "Kent" == Kent Johnson <> writes:

    >
    > Kent> if frequency.has_key(word):
    > Kent>     frequency[word] += 1
    > Kent> else:
    > Kent>     frequency[word] = 1
    >
    > This is a good place to use the 'get' method of dict:
    >
    > frequency[word] = frequency.get(word,0) + 1


    try/except might be fastest of all:

    http://gumuz.looze.net/wordpress/index.php/archives/2005/04/28/python-dictionary-speed-optimisation/

    R. C. James Harlow, Apr 29, 2005
    #10
  11. oh, right, i did only one eighth to check and see if it was scaling
    near linearly, as i couldn't even run profiling without python dying.

    i have 400mb of ram and a 2ghz processor, on freebsd, so the hardware
    shouldn't be the problem. i'll try your suggestions and see how it works.
    possibilitybox, Apr 30, 2005
    #11
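
    If profiling the whole file is too much for the machine, the standard
    profile module can be pointed at a slice first. A minimal sketch (the
    filename and slice size are placeholders):

    import profile

    def wordcount(lines):
        frequency = {}
        for line in lines:
            for word in line.split():
                frequency[word] = frequency.get(word, 0) + 1
        return frequency

    lines = open('996.txt').readlines()
    subset = lines[:5000]    # profile a manageable slice first

    # profile.run() executes the statement and prints a per-function breakdown
    profile.run('wordcount(subset)')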
  12. sorry for my question, but i've read the documentation and can't find the explanation of how exactly it works (but of course i do believe you). if it is a built-in function, can i see the source code of the method to find it out?

    Kent Johnson wrote:
    > Maksim Kasimov wrote:
    >
    >> Kent Johnson wrote:
    >> > for line in open(path):

    >> the line of your example raises another question: will the opened file
    >> be read all at once, as the readlines() method does, or line by line,
    >> as the readline() method does?

    >
    >
    > It will be read line by line as readline() does.
    >
    >> as far as i know, it depends on the implementation of the "__iter__"
    >> method of the object that "open()" returns, so another question: where
    >> can i find such information (about how these functions work)?

    >
    >
    > http://docs.python.org/lib/built-in-funcs.html
    > http://docs.python.org/lib/bltin-file-objects.html
    >
    > Kent



    --
    Best regards,
    Maxim Kasimov
    mailto:
    Maksim Kasimov, Apr 30, 2005
    #12
  13. Kent Johnson Guest

    Maksim Kasimov wrote:
    >
    > sorry for my question, but i've read the documentation and can't find
    > the explanation of how exactly it works (but of course i do believe
    > you). if it is a built-in function, can i see the source code of the
    > method to find it out?
    >
    > Kent Johnson wrote:
    >> http://docs.python.org/lib/built-in-funcs.html


    From the above page:
    open( filename[, mode[, bufsize]])
    An alias for the file() function above.

    file( filename[, mode[, bufsize]])
    Return a new file object (described in section 2.3.9, ``File Objects'').

    >> http://docs.python.org/lib/bltin-file-objects.html


    2.3.9 File Objects

    next( )
    A file object is its own iterator, for example iter(f) returns f (unless f is closed). When a
    file is used as an iterator, typically in a for loop (for example, for line in f: print line), the
    next() method is called repeatedly. This method returns the next input line, or raises StopIteration
    when EOF is hit. <etc>

    or look at Objects/fileobject.c in the source code.

    Kent
    Kent Johnson, Apr 30, 2005
    #13
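
    That is easy to verify interactively; a quick sketch (the filename is a
    placeholder):

    f = open('996.txt')
    print iter(f) is f       # True: a file object is its own iterator
    print repr(f.next())     # first line
    print repr(f.next())     # second line; the file position advances
    f.close()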
