GC is very expensive: am I doing something wrong?

Discussion in 'Python' started by Weeble, Mar 19, 2010.

  1. Weeble

    Weeble Guest

    I am loading a dictionary from a text file and constructing a trie
    data structure in memory. However, it takes longer than I'm happy with
    - about 12 seconds on my computer. I profiled it, came up with some
    clever ideas to cut down on the work (such as by exploiting the fact
    that the dictionary is sorted) and was only able to shave a small
    fraction of the time off. However, then I tried calling gc.disable()
    before loading the trie and it halved the running time! I was
    surprised. Is that normal? I thought that the cost of garbage
    collection would be in some way proportional to the amount of garbage
    created, but I didn't think I was creating any: as far as I can tell
    the only objects deallocated during the load are strings, which could
    not be participating in cycles.
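
    One quick way to confirm how often the collector actually runs during
    the load is gc.set_debug(gc.DEBUG_STATS), which makes CPython print a
    summary line for every collection pass. A minimal sketch, using the
    same trie module and word file as below:

    import gc
    gc.set_debug(gc.DEBUG_STATS)   # report every collection pass

    import trie
    t = trie.Trie()
    t.load(open('sowpods.txt'))    # watch the gen-0/1/2 passes scroll by

    gc.set_debug(0)                # turn the reporting back off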

    I have spent a lot of time in C#, where the standard advice is not to
    mess about with the garbage collector because you'll probably just
    make things worse. Does that advice apply in Python? Is it a bad idea
    to call gc.disable() before loading the trie and then enable it again
    afterwards? Does the fact that the garbage collector is taking so much
    time indicate I'm doing something particularly bad here? Is there some
    way to give the garbage collector a hint that the whole trie is going
    to be long-lived and get it promoted straight to generation 2 rather
    than scanning it over and over?
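
    For concreteness, a minimal sketch of the disable/re-enable pattern in
    question, with try/finally so the collector comes back even if the
    load raises; the final gc.collect() is the closest CPython gets to
    "promote the survivors to the oldest generation":

    import gc
    import trie

    def load_trie(path):
        gc.disable()      # pause the cyclic collector for the bulk load
        try:
            t = trie.Trie()
            with open(path) as f:
                t.load(f)
        finally:
            gc.enable()   # always restore the collector
        gc.collect()      # full pass; survivors move to the oldest generation
        return t

    t = load_trie('sowpods.txt')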

    $ python
    Python 2.6.4 (r264:75706, Dec 7 2009, 18:43:55)
    [GCC 4.4.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

    $ time python -c "import
    trie;t=trie.Trie();t.load(open('sowpods.txt'))"

    real 0m12.523s
    user 0m12.380s
    sys 0m0.140s
    $ time python -c "import
    trie;t=trie.Trie();t.load(open('sowpods.txt'))"

    real 0m12.592s
    user 0m12.480s
    sys 0m0.110s
    $ time python -c "import gc;gc.disable();import
    trie;t=trie.Trie();t.load(open('sowpods.txt'))"

    real 0m6.176s
    user 0m5.980s
    sys 0m0.190s
    $ time python -c "import gc;gc.disable();import
    trie;t=trie.Trie();t.load(open('sowpods.txt'))"

    real 0m6.331s
    user 0m5.530s
    sys 0m0.170s


    === trie.py ===

    class Trie(object):
        __slots__ = ("root", "active")

        def __init__(self):
            self.root = []
            self.active = False

        def insert(self, word):
            if len(word) == 0:
                self.active = True
            else:
                head = word[0]
                for ch, child in reversed(self.root):
                    if ch == head:
                        child.insert(word[1:])
                        return
                child = Trie()
                self.root.append((head, child))
                child.insert(word[1:])

        def seek(self, word):
            if len(word) == 0:
                return self
            head = word[0]
            for ch, child in self.root:
                if ch == head:
                    return child.seek(word[1:])
            return EMPTY_TRIE

        def load(self, file):
            for line in file:
                self.insert(line.strip().lower())

        def empty(self):
            return (not self.root) and not self.active

        def endings(self, prefix=""):
            if self.active:
                yield prefix
            for ch, child in self.root:
                for ending in child.endings(prefix + ch):
                    yield ending

    EMPTY_TRIE = Trie()
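
    A quick sanity check of the API above, with made-up words (endings
    yields words in insertion order):

    t = Trie()
    for w in ("cat", "cats", "car"):
        t.insert(w)
    print list(t.seek("ca").endings("ca"))   # ['cat', 'cats', 'car']
    print t.seek("dog") is EMPTY_TRIE        # True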
    Weeble, Mar 19, 2010
    #1

  2. Terry Reedy

    Terry Reedy Guest

    On 3/18/2010 8:13 PM, Weeble wrote:
    > I thought that the cost of garbage
    > collection would be in some way proportional to the amount of garbage
    > created, but I didn't think I was creating any: as far as I can tell
    > the only objects deallocated during the load are strings, which could
    > not be participating in cycles.
    >
    > I have spent a lot of time in C#, where the standard advice is not to
    > mess about with the garbage collector because you'll probably just
    > make things worse. Does that advice apply in Python? Is it a bad idea
    > to call gc.disable() before loading the trie and then enable it again
    > afterwards?


    I believe not. It is known that certain patterns of object creation and
    destruction can lead to bad gc behavior. No one has discovered a setting
    of the internal tuning parameters for which there are no bad patterns,
    and I suspect no such setting exists. This does not negate Xavier's
    suggestion that a code change might also solve your problem.
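
    For reference, the tuning parameters in question are exposed as
    gc.get_threshold() and gc.set_threshold(); a sketch of the usual
    workaround of raising the generation-0 threshold around a bulk load
    (the numbers are illustrative, not recommendations):

    import gc

    print gc.get_threshold()          # CPython defaults: (700, 10, 10)

    old = gc.get_threshold()
    gc.set_threshold(100000, 10, 10)  # make gen-0 collection passes much rarer
    try:
        pass                          # ... bulk allocation goes here ...
    finally:
        gc.set_threshold(*old)        # restore the previous tuning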

    tjr
    Terry Reedy, Mar 19, 2010
    #2

  3. On Mar 18, 7:13 pm, Weeble <> wrote:
    > I am loading a dictionary from a text file and constructing a trie
    > data structure in memory. However, it takes longer than I'm happy with
    > - about 12 seconds on my computer. I profiled it, came up with some
    > clever ideas to cut down on the work (such as by exploiting the fact
    > that the dictionary is sorted) and was only able to shave a small
    > fraction of the time off. However, then I tried calling gc.disable()
    > before loading the trie and it halved the running time! I was
    > surprised. Is that normal? I thought that the cost of garbage
    > collection would be in some way proportional to the amount of garbage
    > created, but I didn't think I was creating any: as far as I can tell
    > the only objects deallocated during the load are strings, which could
    > not be participating in cycles.


    Well, you are creating and destroying a lot of objects in the process,
    and CPython's collector is triggered by allocation counts rather than
    by how much actual garbage exists, so the periodic full passes end up
    re-traversing the whole trie you have built so far. But you are also
    doing reversing and searching, and that's slow. Does your application
    really need to be able to keep things in order like this, or do you
    just want to know if a word is in the dictionary? If you just want to
    load up a dictionary and be able to see if words are in it, I would
    use a dict instead of a list.
    Even if you want to be able to print out the data in order, you might
    consider using a dict instead of a list. The print operation could
    use one sort at the end, instead of searching all the nodes on
    insertion.

    You could also use a character that doesn't appear in your data as a
    sentinel, and then you don't need a separate active indicator -- every
    leaf node will be empty and be referred to by the sentinel character.

    You are also doing a lot of recursive operations that could be done
    without recursing.
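
    For instance, insert can walk the tree with a plain loop instead of
    one Python-level recursive call per character; an untested sketch
    against the list-based Trie above:

    def insert_iter(trie, word):
        node = trie
        for head in word:
            for ch, child in reversed(node.root):
                if ch == head:
                    break              # found an existing branch
            else:
                child = Trie()         # no match: grow a new branch
                node.root.append((head, child))
            node = child
        node.active = True             # mark the end of a word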

    Finally, do you really need to keep an additional object around for
    each node in the tree?

    I have modified your trie code to use a dict and a sentinel, while
    keeping basically the same API. This may or may not be the best way
    to do this, depending on your other code which uses this data
    structure. It could also probably be made faster by removing the
    setdefault and by not re-building objects when you need them; even
    this implementation will load faster if you disable the gc. In any
    case, this might give you some ideas about how to make your code go
    faster.

    Regards,
    Pat


    from collections import defaultdict

    class TrieTop(object):
        sentinel = ' '  # Something not in the data

        def __init__(self, data=None):
            def defaultrecurse():
                return defaultdict(defaultrecurse)
            if data is None:
                data = defaultrecurse()
            self.data = data

        def insert(self, word):
            data = self.data
            for ch in word:
                data = data[ch]
            data[self.sentinel]

        def seek(self, word):
            data = self.data
            for ch in word:
                data = data.get(ch)
                if data is None:
                    return EMPTY_TRIE
            return TrieTop(data)

        def load(self, file):
            for line in file:
                self.insert(line.strip().lower())

        def empty(self):
            return (not self.data)

        def endings(self, prefix=""):
            def recurse(data, prefix):
                if not data:
                    yield prefix[:-1]
                    return
                for ch, data in data.iteritems():
                    for result in recurse(data, prefix + ch):
                        yield result
            return sorted(recurse(self.data, prefix))

    EMPTY_TRIE = TrieTop()
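
    A quick usage sketch of this API (made-up words; note that endings
    returns a sorted list here, unlike the generator above):

    t = TrieTop()
    for w in ("cat", "cats", "car"):
        t.insert(w)
    print t.seek("ca").endings("ca")   # ['car', 'cat', 'cats']
    print t.seek("dog").empty()        # True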
    Patrick Maupin, Mar 19, 2010
    #3
  4. In message <>, Terry
    Reedy wrote:

    > No one has discovered a setting
    > of the internal tuning parameters for which there are no bad patterns,
    > and I suspect no such setting exists. This does not negate Xavier's
    > suggestion that a code change might also solve your problem.


    Could it be that for implementing a structure like a trie as the OP is,
    where a lot of CPU cycles can be spent manipulating the structure, a high-
    level language like Python, Perl or Ruby just gets in the way?

    My feeling would be, try to get the language to do as much of the work for
    you as possible. If you can’t do that, then you might be better off with a
    lower-level language.
    Lawrence D'Oliveiro, Mar 21, 2010
    #4
  5. Lawrence D'Oliveiro, 22.03.2010 00:36:
    > Terry Reedy wrote:
    >> No one has discovered a setting
    >> of the internal tuning parameters for which there are no bad patterns,
    >> and I suspect no such setting exists. This does not negate Xavier's
    >> suggestion that a code change might also solve your problem.

    >
    > Could it be that for implementing a structure like a trie as the OP is,
    > where a lot of CPU cycles can be spent manipulating the structure, a high-
    > level language like Python, Perl or Ruby just gets in the way?


    I would rather say that the specific problem of the trie data structure is
    that it has extremely little benefit over other available data structures.
    There may still be a couple of niches where it makes sense to consider it
    as an alternative, but given that dicts are so heavily optimised in Python,
    it'll be hard for tries to compete even when written in a low-level language.

    Remember that the original use case was to load a dictionary from a text
    file. For this use case, a trie can be very wasteful in terms of memory and
    rather CPU cache unfriendly on traversal, whereas hash values are a) rather
    fast to calculate for a string, and b) often just calculated once and then
    kept alive in the string object for later reuse.
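
    For the exact-lookup use case, the hash-based alternative is just a
    set of the words; a minimal sketch against the same sowpods.txt:

    words = set()
    with open('sowpods.txt') as f:
        for line in f:
            words.add(line.strip().lower())

    print 'zymurgy' in words   # O(1) expected; the string caches its hash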


    > My feeling would be, try to get the language to do as much of the work for
    > you as possible. If you can’t do that, then you might be better off with a
    > lower-level language.


    I agree with the first sentence, but I'd like to underline the word 'might'
    in the second. As this newsgroup shows, very often it's enough to look for
    a better algorithmic approach first.

    Stefan
    Stefan Behnel, Mar 22, 2010
    #5
  6. tan

    tan Guest

    In article <>, Stefan Behnel <> wrote:
    >Lawrence D'Oliveiro, 22.03.2010 00:36:
    >> Terry Reedy wrote:
    >>> No one has discovered a setting
    >>> of the internal tuning parameters for which there are no bad patterns,
    >>> and I suspect no such setting exists. This does not negate Xavier's
    >>> suggestion that a code change might also solve your problem.

    >>
    >> Could it be that for implementing a structure like a trie as the OP is,
    >> where a lot of CPU cycles can be spent manipulating the structure, a high-
    >> level language like Python, Perl or Ruby just gets in the way?

    >
    >I would rather say that the specific problem of the trie data structure is
    >that it has extremely little benefit over other available data structures.


    Not true.

    >There may still be a couple of niches where it makes sense to consider it
    >as an alternative, but given that dicts are so heavily optimised in Python,
    >it'll be hard for tries to compete even when written in a low-level language.


    It depends. If your data is not in nearly sorted order,
    trees are some of the best mechanisms available.

    >Remember that the original use case was to load a dictionary from a text
    >file. For this use case, a trie can be very wasteful in terms of memory and
    >rather CPU cache unfriendly on traversal, whereas hash values are a) rather
    >fast to calculate for a string, and b) often just calculated once and then
    >kept alive in the string object for later reuse.


    You still have to walk the bucket in a hash map/table.
    Performance may be orders of magnitude worse than for trees.

    >> My feeling would be, try to get the language to do as much of the work for
    >> you as possible. If you can’t do that, then you might be better off with a
    >> lower-level language.

    >
    >I agree with the first sentence, but I'd like to underline the word 'might'
    >in the second. As this newsgroup shows, very often it's enough to look for
    >a better algorithmic approach first.
    >
    >Stefan
    >


    tan, Mar 22, 2010
    #6
  7. On Mon, 22 Mar 2010 23:40:16 +0000, tan wrote:
    >
    >>Remember that the original use case was to load a dictionary from a text
    >>file. For this use case, a trie can be very wasteful in terms of memory
    >>and rather CPU cache unfriendly on traversal, whereas hash values are a)
    >>rather fast to calculate for a string, and b) often just calculated once
    >>and then kept alive in the string object for later reuse.

    >
    > You still have to walk the bucket in a hash map/table. Performance may
    > be orders of magnitude worse than for trees.


    "walk the bucket" shouldn't be a significant cost factor here, especially
    if you are doing meaningful work with the traversed items. In the CPython
    implementation the total hash table size is less than a constant times
    the number of actual items. Moreover, it's a linear scan over an array
    rather than having to dereference pointers as in a tree.

    "Orders of magnitude worse", in any case, sounds very exaggerated.

    (and, of course, as the OP said, there's the major difference that the
    dict type is implemented in C, which makes constant factors an order of
    magnitude smaller than for a Python trie implementation)
    Antoine Pitrou, Mar 23, 2010
    #7
  8. Paul Rubin

    Paul Rubin Guest

    Antoine Pitrou <> writes:
    > "Orders of magnitude worse", in any case, sounds very exaggerated.


    The worst case can lose orders of magnitude if a lot of values hash
    to the same bucket.
    Paul Rubin, Mar 23, 2010
    #8
  9. On Mon, 22 Mar 2010 22:05:40 -0700, Paul Rubin wrote:

    > Antoine Pitrou <> writes:
    >> "Orders of magnitude worse", in any case, sounds very exaggerated.

    >
    > The worst case can lose orders of magnitude if a lot of values hash to
    > the same bucket.



    Well, perhaps one order of magnitude.


    >>> for i in xrange(100):
    ...     n = 32*i+1
    ...     assert hash(2**n) == hash(2)
    ...
    >>> d1 = dict.fromkeys(xrange(100))
    >>> d2 = dict.fromkeys([2**(32*i+1) for i in xrange(100)])
    >>>
    >>> from timeit import Timer
    >>> setup = "from __main__ import d1, d2"
    >>> t1 = Timer("for k in d1.keys(): x = d1[k]", setup)
    >>> t2 = Timer("for k in d2.keys(): x = d2[k]", setup)
    >>>
    >>> min(t1.repeat(number=1000, repeat=5))
    0.026707887649536133
    >>> min(t2.repeat(number=1000, repeat=5))
    0.33103203773498535




    --
    Steven
    Steven D'Aprano, Mar 23, 2010
    #9
  10. Paul Rubin

    Paul Rubin Guest

    Steven D'Aprano <> writes:
    > Well, perhaps one order of magnitude.
    > >>> for i in xrange(100):
    > ...     n = 32*i+1
    > ...     assert hash(2**n) == hash(2)


    Try with much more than 100 items (you might want to construct the
    entries a little more intricately to avoid such big numbers). The point
    is that access becomes O(N) instead of O(1). See:

    http://www.cs.rice.edu/~scrosby/hash/

    for the consequences. http://cr.yp.to/critbit.html discusses the
    issue a little more.
    Paul Rubin, Mar 23, 2010
    #10
  11. Peter Otten

    Peter Otten Guest

    Steven D'Aprano wrote:

    > On Mon, 22 Mar 2010 22:05:40 -0700, Paul Rubin wrote:
    >
    >> Antoine Pitrou <> writes:
    >>> "Orders of magnitude worse", in any case, sounds very exaggerated.
    >>
    >> The worst case can lose orders of magnitude if a lot of values hash to
    >> the same bucket.
    >
    > Well, perhaps one order of magnitude.
    >
    > >>> for i in xrange(100):
    > ...     n = 32*i+1
    > ...     assert hash(2**n) == hash(2)
    > ...
    > >>> d1 = dict.fromkeys(xrange(100))
    > >>> d2 = dict.fromkeys([2**(32*i+1) for i in xrange(100)])
    > >>>
    > >>> from timeit import Timer
    > >>> setup = "from __main__ import d1, d2"
    > >>> t1 = Timer("for k in d1.keys(): x = d1[k]", setup)
    > >>> t2 = Timer("for k in d2.keys(): x = d2[k]", setup)
    > >>>
    > >>> min(t1.repeat(number=1000, repeat=5))
    > 0.026707887649536133
    > >>> min(t2.repeat(number=1000, repeat=5))
    > 0.33103203773498535


    But the ratio grows with the number of collisions:

    $ python extrapolate.py
    10
    0.00120401382446
    0.00753307342529
    ratio: 6.25663366337

    100
    0.00542402267456
    0.316139936447
    ratio: 58.2851428571

    1000
    0.00553417205811
    3.36690688133
    ratio: 608.384930209

    $ cat extrapolate.py
    from timeit import Timer

    class Item(object):
        def __init__(self, value, hash=None):
            self.value = value
            self.hash = value if hash is None else hash
        def __eq__(self, other):
            return self.value == other.value
        def __hash__(self):
            return self.hash

    setup = "from __main__ import d"
    bench = "for k in d: x = d[k]"

    for n, number in (10, 100), (100, 100), (1000, 10):
        print n
        d1 = dict.fromkeys(Item(i) for i in xrange(n))
        d2 = dict.fromkeys(Item(i, 0) for i in xrange(n))
        ab = []
        for d in d1, d2:
            t = Timer(bench, setup)
            ab.append(min(t.repeat(number=number, repeat=3)))
            print ab[-1]
        print "ratio:", ab[1]/ab[0]
        print


    See also http://xkcd.com/605/

    Peter
    Peter Otten, Mar 23, 2010
    #11
  12. Paul Rubin

    Paul Rubin Guest

    Stefan Behnel <> writes:
    > While this is theoretically true, and it's good to be aware of this
    > possibility, common string hash functions make it so rare in practice
    > that a hash table will almost always outperform a trie for exact
    > lookups. If it happens, it will either show up clearly enough in
    > benchmarks or not be worth bothering.


    It is unlikely to happen by accident. You might care that it can
    happen on purpose. See: http://www.cs.rice.edu/~scrosby/hash/
    that I cited in another post. The article shows some sample attacks
    on Python cgi's.
    Paul Rubin, Mar 23, 2010
    #12
  13. Paul Rubin, 23.03.2010 06:05:
    > Antoine Pitrou writes:
    >> "Orders of magnitude worse", in any case, sounds very exaggerated.

    >
    > The worst case can lose orders of magnitude if a lot of values hash
    > to the same bucket.


    While this is theoretically true, and it's good to be aware of this
    possibility, common string hash functions make it so rare in practice that
    a hash table will almost always outperform a trie for exact lookups. If it
    happens, it will either show up clearly enough in benchmarks or not be
    worth bothering.

    Stefan
    Stefan Behnel, Mar 23, 2010
    #13
  14. On Tue, 23 Mar 2010 02:57:56 -0700, Paul Rubin wrote:
    >
    > It is unlikely to happen by accident. You might care that it can happen
    > on purpose. See: http://www.cs.rice.edu/~scrosby/hash/ that I cited in
    > another post. The article shows some sample attacks on Python cgi's.


    Certainly interesting from a purely academic point of view, but in real
    life if you want to cause a denial of service by overwhelming a server,
    there are far more obvious options than trying to guess the server's use
    of hash tables and trying to cause lots of collisions in them.
    Antoine Pitrou, Mar 23, 2010
    #14
  15. Paul Rubin

    Paul Rubin Guest

    Antoine Pitrou <> writes:
    >> See: http://www.cs.rice.edu/~scrosby/hash/ ...

    > Certainly interesting in a purely academic point of view, but in real
    > life if you want to cause a denial of service by overwhelming a server,
    > there are far more obvious options than trying to guess the server's use
    > of hash tables and trying to cause lots of collisions in them.


    If you look at the very low bandwidth used in some of those hashtable
    attacks, it's hard to see any other somewhat-generic attack that's
    comparably effective. Usually we think of "DOS" as involving massive
    botnets and the like, not a dribble of a few hundred characters per
    second.
    Paul Rubin, Mar 24, 2010
    #15
