Interning own classes like strings for speed and size?


Ulrich Eckhardt

Hi!

I'm trying to solve a computational problem, and of course speed and size
are important there. Apart from picking the right algorithm, I came across
an idea that could help speed things up and keep memory requirements down.
What I have is regions described by min and max coordinates. At first, I
just modeled these as a simple class containing two values for each axis.

In a second step, I derived this class from tuple instead of object. Some
code then moved from __init__ to __new__, and some code that modified these
objects had to be changed to replace them instead. The upside is that they
can be used as keys in sets and dicts, which isn't the case for mutable
types[1].
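
Roughly like this (the class name and axis layout are just for
illustration, not my exact code):

class Region(tuple):
    """An immutable 2-D region stored as (min_x, min_y, max_x, max_y)."""
    def __new__(cls, min_x, min_y, max_x, max_y):
        # Construction happens in __new__ because tuples are immutable
        # and already fully built by the time __init__ runs.
        return super().__new__(cls, (min_x, min_y, max_x, max_y))

    def moved(self, dx, dy):
        # "Modification" now means building a replacement object.
        min_x, min_y, max_x, max_y = self
        return Region(min_x + dx, min_y + dy, max_x + dx, max_y + dy)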

What I'm now considering is to allow only a single instance of these
objects for each set of values, similar to interned strings. What I would
gain is that I could safely compare objects for identity instead of
equality. What I'm not yet sure about is how much overhead that would cost
me and/or how to keep it low. The idea is to store each instance in a set;
after creating a new object, I would first look up an equal object in the
global set and return that instead, otherwise add the new one.

The problem I foresee is that if I define equality as identity, this
lookup when creating will never eliminate duplicates. If I only fall back
to equality comparison for non-identical objects, I would probably
sacrifice most of the gain. If I build a dict mapping between the values
and the actual objects, I would have doubled the required memory and would
uselessly store the same values twice.
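
To make the idea concrete, the dict variant I have in mind would be
something like this (names just for illustration):

# Module-level pool of canonical instances.
_pool = {}

def intern_region(region):
    # Return the canonical instance equal to `region`, registering
    # `region` itself as canonical on first sight.
    return _pool.setdefault(region, region)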

Am I looking in the wrong direction? Is there some better approach? Please
don't tell me to use C, as I'm specifically interested in learning Python;
I'm pretty sure I could have solved the problem quickly in C++ otherwise.
Other suggestions?

Cheers!

Uli


[1] Somebody correct me if I'm wrong, but I believe I could have defined a
hashing function for the type and thus allowed its use in a set or dict,
right? However, goofing up because I accidentally modified an object and
thereby changed its hash value is a risk I'd rather not take anyway.
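
For instance, a value-based hash on a mutable class silently breaks set
membership after a mutation (illustrative names):

class MutableRegion:
    def __init__(self, min_x, max_x):
        self.min_x, self.max_x = min_x, max_x
    def __eq__(self, other):
        return (self.min_x, self.max_x) == (other.min_x, other.max_x)
    def __hash__(self):
        return hash((self.min_x, self.max_x))

r = MutableRegion(0, 1)
s = {r}
r.max_x = 2     # hash changes while r is inside the set
print(r in s)   # False -- the set can no longer find it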
 

Daniel Fetchinson

What I'm now considering is to allow only a single instance of these
objects for each set of values, similar to interned strings. What I would
gain is that I could safely compare objects for identity instead of
equality. [...]

I believe what you are looking for is (some variant of) the singleton pattern:

http://en.wikipedia.org/wiki/Singleton_pattern

For how it's done in Python, see http://www.google.com/search?q=python+singleton
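
The per-value flavor is sometimes called the multiton pattern; a minimal
sketch (class name just for illustration):

class Multiton:
    _instances = {}

    def __new__(cls, *args):
        # One shared instance per distinct argument tuple.
        if args not in cls._instances:
            cls._instances[args] = super().__new__(cls)
        return cls._instances[args]

# Caveat: __init__, if defined, still runs on every call, even when a
# cached instance is returned.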

Cheers,
Daniel
 

Terry Reedy

Hi!

I'm trying to solve a computational problem and of course speed and size is
important there. Apart from picking the right algorithm, I came across an
idea that could help speed up things and keep memory requirements down. What
I have is regions described by min and max coordinates.

What sort of numbers are the coordinates? If they are integers in a finite
range, your problem is a lot simpler than if they are floats of arbitrary
precision.
 

Steven D'Aprano

What I'm now considering is to allow only a single instance of these
objects for each set of values, similar to interned strings. What I would
gain is that I could safely compare objects for identity instead of
equality. What I'm not yet sure about is how much overhead that would cost
me and/or how to keep it low. The idea is to store each instance in a set;
after creating a new object, I would first look up an equal object in the
global set and return that instead, otherwise add the new one.

Try this technique:
>>> class InternedTuple(tuple):
...     _cache = {}
...     def __new__(cls, *args):
...         t = super().__new__(cls, *args)
...         return cls._cache.setdefault(t, t)
...
>>> InternedTuple("ab") is InternedTuple("ab")
True
 

Ulrich Eckhardt

Steven said:
class InternedTuple(tuple):
    _cache = {}
    def __new__(cls, *args):
        t = super().__new__(cls, *args)
        return cls._cache.setdefault(t, t)

That looks good. The only thing that bothered me at first is that it
creates an object and then possibly discards it again. However, there is no
way around that, since at least the key for the dict lookup must be created
anyway. And since key and value are the same object here, this comes
essentially for free.

What I also found is that, with the above, I can't provide __eq__ and
__ne__ that just check for identity. If I do, the lookup in setdefault()
will never find an existing tuple, and I will never save memory on a
single object.


Thanks!

Uli
 

Steven D'Aprano

That looks good. The only thing that bothered me at first is that it
creates an object and then possibly discards it again. However, there is no
way around that, since at least the key for the dict lookup must be created
anyway. And since key and value are the same object here, this comes
essentially for free.

What I also found is that, with the above, I can't provide __eq__ and
__ne__ that just check for identity. If I do, the lookup in setdefault()
will never find an existing tuple, and I will never save memory on a
single object.

If all you want is to save memory, you don't need to change the __eq__
method. But if you still want to, try this:


# Untested
class InternedTuple(tuple):
    _cache = {}
    def __new__(cls, *args):
        t = super().__new__(cls, *args)
        return cls._cache.setdefault(args, t)
    def __eq__(self, other):
        return self is other
    def __ne__(self, other):
        return self is not other
 

Stefan Behnel

Steven D'Aprano, 28.12.2010 15:11:
If all you want is to save memory, you don't need to change the __eq__
method. But if you still want to, try this:

# Untested

Yep, that's the problem. ;)

class InternedTuple(tuple):
    _cache = {}
    def __new__(cls, *args):
        t = super().__new__(cls, *args)
        return cls._cache.setdefault(args, t)
    def __eq__(self, other):
        return self is other
    def __ne__(self, other):
        return self is not other

What Ulrich meant was: doing this will actually kill the caching, because
the first time the comparison is called is when the tuple is looked up
while being added to the interning dict. Since the new tuple is, well, new,
it will not be equal (read: identical) to any cached tuple, thus resulting
in a new entry regardless of its content.
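
You can see the effect with a stand-in class that hashes by value but
compares by identity (names just for illustration):

class IdEq:
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return hash(self.value)
    def __eq__(self, other):
        return self is other   # identity, as in the code above

cache = {}
a = IdEq(1)
b = IdEq(1)
cache.setdefault(a, a)
# b lands in the same hash bucket as a, but __eq__ says "different",
# so setdefault stores b instead of returning the cached a:
print(cache.setdefault(b, b) is a)   # False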

Stefan
 

John Nagle

Hi!

I'm trying to solve a computational problem, and of course speed and size
are important there.

Then you probably shouldn't be using CPython.

Apart from picking the right algorithm, I came across an idea that could
help speed things up and keep memory requirements down. What I have is
regions described by min and max coordinates. At first, I just modeled
these as a simple class containing two values for each axis.

There are more effective techniques for that. In game physics
collision detection systems, it's common to use axis-ordered lists
of bounding boxes to partition objects. This still works even when
the ranges overlap. Look up "I-Collide" for a classic implementation.
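
Here's a rough one-axis illustration of the sweep idea (not the actual
I-Collide algorithm):

def overlapping_pairs(boxes):
    # boxes: list of (lo, hi, name) intervals on one axis.
    # Sorting by the low edge lets a single sweep keep only the
    # "active" intervals whose high edge hasn't been passed yet.
    pairs = []
    active = []
    for lo, hi, name in sorted(boxes):
        # Drop intervals that end before this one starts.
        active = [box for box in active if box[1] >= lo]
        pairs.extend((other[2], name) for other in active)
        active.append((lo, hi, name))
    return pairs

# overlapping_pairs([(0, 2, "a"), (1, 3, "b"), (4, 5, "c")])
# -> [("a", "b")]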
Am I looking in the wrong direction? Is there some better approach? Please
don't tell me to use C, as I'm specifically interested in learning Python;
I'm pretty sure I could have solved the problem quickly in C++ otherwise.

CPython is a naive interpreter which does dictionary lookups for every
global-variable and attribute access. If you're doing fast manipulation of
linked data structures in CPython, performance is going to suck.

Check out Shed Skin 0.7, which is a much faster Python
system. It's limited, but for the kind of data structure work
you're doing, all the features you should need already work.

John Nagle
 

Hrvoje Niksic

Christian Heimes said:
Also this code is going to use much more memory than an ordinary tuple
since every instance of InternedTuple has a __dict__ attribute. This
code works but I had to move the cache outside the class because of
__slots__.

Wouldn't it work inside the class as well? __slots__ applies to
instances, not to classes themselves. In Python 3.1:
>>> class Example(tuple):
...     __slots__ = ()
...     _cache = {}
...
>>> Example._cache
{}
 

Hrvoje Niksic

Christian Heimes said:
You are right as long as you don't try to rebind the variable.

And then only if you assign through an instance, which doesn't make sense
for a class-level cache. An assignment like Example2._cache = {} would
work.
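
For example (continuing with a tuple subclass that has empty __slots__,
as above; the exact error message may vary by version):

>>> class Example2(tuple):
...     __slots__ = ()
...     _cache = {}
...
>>> Example2()._cache = {}      # through an instance: no __dict__
Traceback (most recent call last):
  ...
AttributeError: 'Example2' object attribute '_cache' is read-only
>>> Example2._cache = {}        # rebinding on the class works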
 
