Interning own classes like strings for speed and size?

Discussion in 'Python' started by Ulrich Eckhardt, Dec 27, 2010.

  1. Hi!

    I'm trying to solve a computational problem and of course speed and size is
    important there. Apart from picking the right algorithm, I came across an
    idea that could help speed up things and keep memory requirements down. What
    I have is regions described by min and max coordinates. At first, I just
    modeled these as a simple class containing two values for each axis.

    In a second step, I derived this class from tuple instead of object. Some
    code then moved from __init__ to __new__ and some code that modified these
    objects had to be changed to replace them instead. The upside to this is
    that they can be used as keys in sets and dicts, which isn't the case for
    mutable types[1].

    What I'm now considering is to only allow a single instance of these objects
    for each set of values, similar to interned strings. What I would gain is
    that I could safely compare objects for identity instead of equality. What
    I'm not yet sure is how much overhead that would give me and/or how to keep
    it low. The idea is to store each instance in a set and after creating a new
    object I would first look up an equal object in the global set and return
    that instead, otherwise add the new one.

    The problem I foresee is that if I define equality as identity, this lookup
    when creating will never eliminate duplicates. If I only fall back to
    equality comparison for non-identical objects, I would probably sacrifice
    most of the gain. If I build a dict mapping between the values and the
    actual objects, I would have doubled the required memory and uselessly store
    the same values twice there.

    Am I looking in the wrong direction? Is there some better approach? Please
    don't tell me to use C, as I'm specifically interested in learning Python,
    I'm pretty sure I could have solved the problem quickly in C++ otherwise.
    Other suggestions?

    Cheers!

    Uli


    [1] Somebody correct me if I'm wrong, but I believe I could have defined a
    hashing function for the type and thus allowed its use in a set or dict,
    right? However, goofing up because you accidentally modified an object and
    changed its hash value is something I don't want to risk anyway.
    Ulrich Eckhardt, Dec 27, 2010
    #1
    1. Advertising

  2. > I'm trying to solve a computational problem and of course speed and size is
    > important there. Apart from picking the right algorithm, I came across an
    > idea that could help speed up things and keep memory requirements down. What
    > I have is regions described by min and max coordinates. At first, I just
    > modeled these as a simple class containing two values for each axis.
    >
    > In a second step, I derived this class from tuple instead of object. Some
    > code then moved from __init__ to __new__ and some code that modified these
    > objects had to be changed to replace them instead. The upside to this is
    > that they can be used as keys in sets and dicts, which isn't the case for
    > mutable types[1].
    >
    > What I'm now considering is to only allow a single instance of these objects
    > for each set of values, similar to interned strings. What I would gain is
    > that I could safely compare objects for identity instead of equality. What
    > I'm not yet sure is how much overhead that would give me and/or how to keep
    > it low. The idea is to store each instance in a set and after creating a new
    > object I would first look up an equal object in the global set and return
    > that instead, otherwise add the new one.
    >
    > The problem I foresee is that if I define equality as identity, this lookup
    > when creating will never eliminate duplicates. If I only fall back to
    > equality comparison for non-identical objects, I would probably sacrifice
    > most of the gain. If I build a dict mapping between the values and the
    > actual objects, I would have doubled the required memory and uselessly store
    > the same values twice there.
    >
    > Am I looking in the wrong direction? Is there some better approach? Please
    > don't tell me to use C, as I'm specifically interested in learning Python,
    > I'm pretty sure I could have solved the problem quickly in C++ otherwise.
    > Other suggestions?
    >
    > Cheers!
    >
    > Uli
    >
    >
    > [1] Somebody correct me if I'm wrong, but I believe I could have defined a
    > hashing function for the type and thus allowed its use in a set or dict,
    > right? However, goofing up because you accidentally modified an object and
    > changed its hash value is something I don't want to risk anyway.


    I believe what you are looking for is (some variant of) the singleton pattern:

    http://en.wikipedia.org/wiki/Singleton_pattern

    How it's done in python see http://www.google.com/search?q=python singleton

    Cheers,
    Daniel

    --
    Psss, psss, put it down! - http://www.cafepress.com/putitdown
    Daniel Fetchinson, Dec 27, 2010
    #2
    1. Advertising

  3. Ulrich Eckhardt, Dec 27, 2010
    #3
  4. >> I believe what you are looking for is (some variant of) the singleton
    >> pattern:
    >>
    >> http://en.wikipedia.org/wiki/Singleton_pattern

    >
    > Actually, no. What I want is the flyweight pattern instead:
    >
    > http://en.wikipedia.org/wiki/Flyweight_pattern


    Oh I see. I did not know about this pattern, but in my defense it
    looks like a variant of the singleton pattern :)

    Thanks! One always learns something new on python-list.

    Cheers,
    Daniel


    --
    Psss, psss, put it down! - http://www.cafepress.com/putitdown
    Daniel Fetchinson, Dec 27, 2010
    #4
  5. Ulrich Eckhardt

    Terry Reedy Guest

    On 12/27/2010 6:05 AM, Ulrich Eckhardt wrote:
    > Hi!
    >
    > I'm trying to solve a computational problem and of course speed and size is
    > important there. Apart from picking the right algorithm, I came across an
    > idea that could help speed up things and keep memory requirements down. What
    > I have is regions described by min and max coordinates.


    What sort of numbers are the coordinates? If integers in a finite range,
    your problem is a lot simpler than if float of indefinite precision.

    --
    Terry Jan Reedy
    Terry Reedy, Dec 27, 2010
    #5
  6. On Mon, 27 Dec 2010 12:05:10 +0100, Ulrich Eckhardt wrote:

    > What I'm now considering is to only allow a single instance of these
    > objects for each set of values, similar to interned strings. What I
    > would gain is that I could safely compare objects for identity instead
    > of equality. What I'm not yet sure is how much overhead that would give
    > me and/or how to keep it low. The idea is to store each instance in a
    > set and after creating a new object I would first look up an equal
    > object in the global set and return that instead, otherwise add the new
    > one.


    Try this technique:

    >>> class InternedTuple(tuple):

    .... _cache = {}
    .... def __new__(cls, *args):
    .... t = super().__new__(cls, *args)
    .... return cls._cache.setdefault(t, t)
    ....
    >>>
    >>>
    >>> t1 = InternedTuple((1.0, 2.0))
    >>> t2 = InternedTuple((0.0, 0.0))
    >>> t3 = InternedTuple((1.0, 2.0))
    >>>
    >>> t1 is t2

    False
    >>> t1 is t3

    True
    >>> t1 == t2

    False
    >>> t1 == t3

    True



    --
    Steven
    Steven D'Aprano, Dec 27, 2010
    #6
  7. Steven D'Aprano wrote:
    >>>> class InternedTuple(tuple):

    > ... _cache = {}
    > ... def __new__(cls, *args):
    > ... t = super().__new__(cls, *args)
    > ... return cls._cache.setdefault(t, t)


    That looks good. The only thing that first bothered me is that it creates an
    object and then possibly discards it again. However, there is no way around
    that, since at least the key to the dict must be created for lookup. Since
    key and value are the same here, this is even for free.

    What I also found was that with the above, I can't provide __eq__ and __ne__
    that just check for identity. If I do, the lookup in setdefault() will never
    find an existing tuple and I will never save memory for a single object.


    Thanks!

    Uli
    Ulrich Eckhardt, Dec 28, 2010
    #7
  8. On Tue, 28 Dec 2010 13:42:39 +0100, Ulrich Eckhardt wrote:

    > Steven D'Aprano wrote:
    >>>>> class InternedTuple(tuple):

    >> ... _cache = {}
    >> ... def __new__(cls, *args):
    >> ... t = super().__new__(cls, *args)
    >> ... return cls._cache.setdefault(t, t)

    >
    > That looks good. The only thing that first bothered me is that it
    > creates an object and then possibly discards it again. However, there is
    > no way around that, since at least the key to the dict must be created
    > for lookup. Since key and value are the same here, this is even for
    > free.
    >
    > What I also found was that with the above, I can't provide __eq__ and
    > __ne__ that just check for identity. If I do, the lookup in setdefault()
    > will never find an existing tuple and I will never save memory for a
    > single object.


    If all you want is to save memory, you don't need to change the __eq__
    method. But if you still want to, try this:


    # Untested
    class InternedTuple(tuple):
    _cache = {}
    def __new__(cls, *args):
    t = super().__new__(cls, *args)
    return cls._cache.setdefault(args, t)
    def __eq__(self, other):
    return self is other
    def __ne__(self, other):
    return self is not other



    --
    Steven
    Steven D'Aprano, Dec 28, 2010
    #8
  9. Steven D'Aprano, 28.12.2010 15:11:
    > On Tue, 28 Dec 2010 13:42:39 +0100, Ulrich Eckhardt wrote:
    >
    >> Steven D'Aprano wrote:
    >>>>>> class InternedTuple(tuple):
    >>> ... _cache = {}
    >>> ... def __new__(cls, *args):
    >>> ... t = super().__new__(cls, *args)
    >>> ... return cls._cache.setdefault(t, t)

    >>
    >> That looks good. The only thing that first bothered me is that it
    >> creates an object and then possibly discards it again. However, there is
    >> no way around that, since at least the key to the dict must be created
    >> for lookup. Since key and value are the same here, this is even for
    >> free.
    >>
    >> What I also found was that with the above, I can't provide __eq__ and
    >> __ne__ that just check for identity. If I do, the lookup in setdefault()
    >> will never find an existing tuple and I will never save memory for a
    >> single object.

    >
    > If all you want is to save memory, you don't need to change the __eq__
    > method. But if you still want to, try this:
    >
    > # Untested


    Yep, that' the problem. ;)


    > class InternedTuple(tuple):
    > _cache = {}
    > def __new__(cls, *args):
    > t = super().__new__(cls, *args)
    > return cls._cache.setdefault(args, t)
    > def __eq__(self, other):
    > return self is other
    > def __ne__(self, other):
    > return self is not other


    What Ulrich meant, was: doing this will actually kill the caching, because
    the first time the comparison is called is when looking up the tuple while
    adding it to the interning dict. Since the new tuple is, well, new, it will
    not be equal (read: identical) to any cached tuple, thus resulting in a new
    entry regardless of its content.

    Stefan
    Stefan Behnel, Dec 28, 2010
    #9
  10. Ulrich Eckhardt

    John Nagle Guest

    On 12/27/2010 3:05 AM, Ulrich Eckhardt wrote:
    > Hi!
    >
    > I'm trying to solve a computational problem and of course speed and size is
    > important there.


    Then you probably shouldn't be using CPython.

    > Apart from picking the right algorithm, I came across an
    > idea that could help speed up things and keep memory requirements down. What
    > I have is regions described by min and max coordinates. At first, I just
    > modeled these as a simple class containing two values for each axis.


    There are more effective techniques for that. In game physics
    collision detection systems, it's common to use axis-ordered lists
    of bounding boxes to partition objects. This still works even when
    the ranges overlap. Look up "I-Collide" for a classic implementation.

    > Am I looking in the wrong direction? Is there some better approach?
    > Please
    > don't tell me to use C, as I'm specifically interested in
    > learning Python,
    > I'm pretty sure I could have solved the problem quickly in
    > C++ otherwise.


    CPython is a naive interpreter which does dictionary
    lookups for every variable and attribute access. If you're doing
    fast manipulation of linked data structures in CPython, performance
    is going to suck.

    Check out Shed Skin 0.7, which is a much faster Python
    system. It's limited, but for the kind of data structure work
    you're doing, all the features you should need already work.

    John Nagle
    >
    > In a second step, I derived this class from tuple instead of object. Some
    > code then moved from __init__ to __new__ and some code that modified these
    > objects had to be changed to replace them instead. The upside to this is
    > that they can be used as keys in sets and dicts, which isn't the case for
    > mutable types[1].
    >
    > What I'm now considering is to only allow a single instance of these objects
    > for each set of values, similar to interned strings. What I would gain is
    > that I could safely compare objects for identity instead of equality. What
    > I'm not yet sure is how much overhead that would give me and/or how to keep
    > it low. The idea is to store each instance in a set and after creating a new
    > object I would first look up an equal object in the global set and return
    > that instead, otherwise add the new one.
    >
    > The problem I foresee is that if I define equality as identity, this lookup
    > when creating will never eliminate duplicates. If I only fall back to
    > equality comparison for non-identical objects, I would probably sacrifice
    > most of the gain. If I build a dict mapping between the values and the
    > actual objects, I would have doubled the required memory and uselessly store
    > the same values twice there.
    >
    >
    > Cheers!
    >
    > Uli
    >
    >
    > [1] Somebody correct me if I'm wrong, but I believe I could have defined a
    > hashing function for the type and thus allowed its use in a set or dict,
    > right? However, goofing up because you accidentally modified an object and
    > changed its hash value is something I don't want to risk anyway.
    >
    John Nagle, Dec 28, 2010
    #10
  11. Christian Heimes <> writes:

    > Also this code is going to use much more memory than an ordinary tuple
    > since every instance of InternedTuple has a __dict__ attribute. This
    > code works but I had to move the cache outside the class because of
    > __slots__.


    Wouldn't it work inside the class as well? __slots__ applies to
    instances, not to classes themselves. In Python 3.1:

    >>> class Foo(tuple):

    .... __slots__ = ()
    .... _cache = {}
    ....
    >>> Foo._cache # works as expected

    {}
    >>> Foo()._cache # and so does this

    {}
    Hrvoje Niksic, Dec 28, 2010
    #11
  12. Christian Heimes <> writes:

    > You are right as long as you don't try to rebind the variable.


    And then only if you assign through an instance, which doesn't make
    sense for a class-level cache. Assignment like Example2._cache = {}
    would work.
    Hrvoje Niksic, Dec 29, 2010
    #12
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Stefan Siegl
    Replies:
    1
    Views:
    758
  2. Mike Thompson

    interning strings

    Mike Thompson, Nov 7, 2004, in forum: Python
    Replies:
    7
    Views:
    308
    Peter Otten
    Nov 8, 2004
  3. Terry Hancock
    Replies:
    0
    Views:
    459
    Terry Hancock
    Oct 12, 2005
  4. George Sakkis

    Disable automatic interning

    George Sakkis, Mar 18, 2009, in forum: Python
    Replies:
    9
    Views:
    283
    Hrvoje Niksic
    Mar 23, 2009
  5. Chris Angelico
    Replies:
    0
    Views:
    90
    Chris Angelico
    Jan 24, 2012
Loading...

Share This Page