Better dict of dicts

Bill Jackson

I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory. I'm looking for better data structures.

Here is an example:
.... "string_3":5,
.... "string_4":10},
.... "string_2": {"string_2":12,
.... "string_6":2,
.... "string_1":4}}

So as my strings get longer and longer, it seems that the dictionary of
dictionary representation is less and less efficient.

My thought was to subclass the dictionary structure....

keys = {"string_1":1,
"string_2":2,
"string_3":3,
"string_4":4,
"string_6":5}

Then the underlying dictionary of dictionaries would look like:

a = {1:{2:1,3:5,4:10},2:{2:12,5:2,1:4}}

Somehow I need to intercept every possible call though....such that

a["string_1"]["string_2"] actually calls a[1][2]

and

a.keys() returns ["string_1", "string_2", "string_3"....]
rather than [1,2,3,4,5]

etc.

Ideally, I would like the option to have different key hashes for the
rows and columns as well.
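
Roughly, I am imagining something along these lines (untested; the class
name and internals are just to illustrate the idea):

class KeyedDict(dict):
    """Store long keys as small ints internally, but present the originals."""
    def __init__(self):
        dict.__init__(self)
        self._ids = {}      # original key -> small int
        self._names = {}    # small int -> original key
    def _id(self, key):
        if key not in self._ids:
            n = len(self._ids)
            self._ids[key] = n
            self._names[n] = key
        return self._ids[key]
    def __setitem__(self, key, value):
        dict.__setitem__(self, self._id(key), value)
    def __getitem__(self, key):
        return dict.__getitem__(self, self._ids[key])
    def keys(self):
        return [self._names[n] for n in dict.keys(self)]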

Any ideas?
 
Adam Atlas

I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory.

I wouldn't worry about it. Try doing hash('string_2') in the
interpreter -- the output thereof is what's really being used as the
key. It doesn't use up any more memory than the integer 2.
 
John Bauman

Adam said:
I wouldn't worry about it. Try doing hash('string_2') in the
interpreter -- the output thereof is what's really being used as the
key. It doesn't use up any more memory than the integer 2.
Are you sure about that? Most dictionaries need to store the actual key,
in case of a collision, so when you lookup a key they can tell which
you're really looking for.
 
Martin v. Löwis

Bill said:
I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary.

What I don't understand here: you say the keys are tuples, yet later,
you show that the keys are strings. Which one is it?
The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory. I'm looking for better data structures.

Here is an example:

... "string_3":5,
... "string_4":10},
... "string_2": {"string_2":12,
... "string_6":2,
... "string_1":4}}

So as my strings get longer and longer, it seems that the dictionary of
dictionary representation is less and less efficient.

You seem to assume that every occurrence of "string_2" has separate
memory. That may or may not be the case, depending on where the strings
come from. For example, it may be that the keys are the very same strings:

s1 = "string_1"
s2 = "string_2"

a[s1][s2] = 1
a[s2][s1] = 4

Then, the memory for the string would exist only once.
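
For instance, a quick check along these lines shows both dictionaries
holding a reference to the very same string object:

>>> s1 = "string_1"
>>> a = {s1: {}}
>>> b = {s1: {}}
>>> a.keys()[0] is b.keys()[0]
True
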
My thought was to subclass the dictionary structure....

keys = {"string_1":1,
"string_2":2,
"string_3":3,
"string_4":4,
"string_6":5}

Instead of doing that, I would use a procedure called "interning".
You may want to use the builtin intern function, or you can
come up with your own interning:

interns = {}
def local_intern(s):
    return interns.setdefault(s, s)

Then, instead of

a[k1][k2] = value

do

a[local_intern(k1)][local_intern(k2)] = value

then all strings occur only once as keys, as long as the interns
dictionary isn't cleared.
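
For example, something along these lines should show two equal but
distinct strings collapsing to a single key object:

>>> interns = {}
>>> def local_intern(s):
...     return interns.setdefault(s, s)
...
>>> k1 = "".join(["string", "_1"])   # built at runtime: a fresh string object
>>> k2 = "".join(["string", "_1"])   # equal to k1, but a distinct object
>>> k1 == k2, k1 is k2
(True, False)
>>> local_intern(k1) is local_intern(k2)
True
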
Any ideas?

HTH,
Martin
 
DillonCo

I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory. I'm looking for better data structures.

I think you may want to look into that rarely used function "intern"
(under "Non-essential Built-in Functions").

Basically, Python keeps a cache of certain strings that are frequently used, so
comparisons and dictionary lookups only require a pointer comparison. You
could then subclass dict (though using "DictMixin" could be better) like:

class IDict(DictMixin):
    def __setitem__(self, key, value):
        key = intern(key)
        self.__dict[key] = value

That's totally untested and incomplete, but you hopefully get the idea.
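
A slightly more filled-in (still untested) sketch, letting DictMixin supply
the rest of the mapping interface; the internal attribute name is arbitrary:

from UserDict import DictMixin

class IDict(DictMixin):
    """A mapping that interns its string keys on the way in."""
    def __init__(self):
        self._d = {}
    def __getitem__(self, key):
        return self._d[intern(key)]
    def __setitem__(self, key, value):
        self._d[intern(key)] = value
    def __delitem__(self, key):
        del self._d[intern(key)]
    def keys(self):
        return self._d.keys()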

Python (or at least CPython) seems to auto-intern some strings on its own
(you could look at the source if you care about the exact rules).
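
For instance, literals that look like identifiers appear to get interned
automatically in CPython 2.x, so a check along these lines should print True:

>>> a = "string_2"
>>> b = "string_2"
>>> a is b
True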

So you don't have all that much to worry about.
 
Bill Jackson

Martin v. Löwis wrote the following on 04/19/2007 02:43 PM:
What I don't understand here: you say the keys are tuples, yet later,
you show that the keys are strings. Which one is it?

Sorry, I was just lazy. The keys will always be tuples...tuple of
strings, tuples of numbers, tuples of objects....simply tuples.
Instead of doing that, I would use a procedure called "interning".
You may want to use the builtin intern function, or your can
come up with your own interning:

interns = {}
def local_intern(s):
    return interns.setdefault(s, s)

Then, instead of

a[k1][k2] = value

do

a[local_intern(k1)][local_intern(k2)] = value

then all strings occur only once as keys, as long as the interns
dictionary isn't cleared.

So, my structure is something like this:

a = {tuple_1: {tuple_2:value_1, tuple_3:value_2},
     tuple_4: {tuple_1:value_3, tuple_3:value_4},
     tuple_5: {tuple_2:value_5, tuple_3:value_6, tuple_4:value_7}}

Since the tuples from the inner dictionaries and the outer dictionaries
are frequently the same, I would benefit from using a single intern
function. Then, the tuples will always be "pointing" to the values
stored in the intern dictionary.

Now suppose there is little overlap between the keys for the outer
dictionary and the inner dictionaries...but still much overlap between
the various inner dictionaries. Then, there is no point in using an
intern function for the outer dictionary, but still a benefit for the
inner dictionary. Thus, something like the following would be appropriate:

a[k1][local_intern(k2)] = value

Have I understood this properly?
 
Steven D'Aprano

Are you sure about that? Most dictionaries need to store the actual key,
in case of a collision, so when you lookup a key they can tell which
you're really looking for.


The key is already stored, so long as the string 'string_2' exists. And it
will continue to exist so long as the dictionary includes it as a key. An
extra copy isn't made (unless you make an extra copy yourself).

In other words, the dictionary stores a reference to the string, not a
copy of it:
s = "this is a long string"
d = {s: 1}
id(s) -1209073840
id(d.keys()[0])
-1209073840
 
DillonCo

Martin v. Löwis wrote the following on 04/19/2007 02:43 PM:

Sorry, I was just lazy. The keys will always be tuples...tuple of
strings, tuples of numbers, tuples of objects....simply tuples.

That'll change things a bit because intern only works with strings.

Of course, it's not so big a deal, but you will have to put together a class
to implement interning.
I wrote one for fun:

class ITuple(tuple):
    _interns = {}
    def __new__(cls, tup):
        if tup not in cls._interns:
            itup = tuple.__new__(cls, tup)
            cls._interns[tup] = itup
        return cls._interns[tup]
    def __init__(self, *args):
        # Prevent multiple calls to __init__
        if hasattr(self, "_inited"): return
        tuple.__init__(self, *args)
        self._inited = True
    def __eq__(self, o):
        # If the other is an ITuple, self == o iff self is o
        if isinstance(o, ITuple):
            return self is o
        return tuple.__eq__(self, o)
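
For instance, something like this should hold:

>>> t1 = ITuple((1, 2, 3))
>>> t2 = ITuple((1, 2, 3))
>>> t1 is t2
True
>>> t1 == (1, 2, 3)
True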


That seems to work. Something to note is that the call overhead of the __eq__
function is large enough that, unless you have a slow-comparing tuple,
comparisons will be faster without it. Comparisons are fast if they are done
internally, i.e. between builtin objects or identical ("is") objects.

For an example, suppose you have:

class TTT(object):
    def __eq__(self, o): return True

a, b = TTT(), TTT()

Then the following comparisons are fast:
(1,2,3)==(1,2,3)
(1,2,3,a)==(1,2,3,a)
(0,0,0,a)==(1,2,3,b)

The following are slow:
(1,2,3,a)==(1,2,3,b)

Note that the only slow case is the one where a.__eq__(b) is called. However,
a.__eq__(b) is assumed True if "a is b" is True. So chances are you'll want
to comment out the __eq__ function.
 
Paddy

Martin v. Löwis wrote the following on 04/19/2007 02:43 PM:
Sorry, I was just lazy. The keys will always be tuples...tuple of
strings, tuples of numbers, tuples of objects....simply tuples.

That'll change things a bit because intern only works with strings.

Of course, It's not so big a deal, but you will have to put together a class
to implement interning.

Hi DillonCo,
Martin's earlier local_intern function would work for tuples as well as
strings.
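
For example (assuming the tuple's elements are themselves hashable),
something along these lines should work:

>>> interns = {}
>>> def local_intern(x):
...     return interns.setdefault(x, x)
...
>>> t1 = ("string_1", 42)
>>> t2 = ("string_1", 42)
>>> local_intern(t1) is local_intern(t2)
True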

- Paddy.
 
DillonCo

Martins earlier local_intern function would work for tuples as well as
strings.

It certainly would. I had written that class, though, primarily to offer a
performance improvement in the __eq__ and perhaps __hash__ methods. However,
I ended up being rather surprised by just how much overhead there was in
calling the Python methods vs. the builtin ones.

So really mine only ends up being useful if the tuple consists of a couple
(i.e. 2+) of objects with Python __eq__ methods. Oh well.

As an interesting aside:

>>> class T(object):
...     def __eq__(self, o): return False
...
>>> a = T()
>>> (a,) == (a,)
True

Apparently the tuple's __eq__ assumes: "a is b" => "a==b"
Bug? Or am I creating evil corner cases ;)?
 
Martin v. Löwis

Now suppose there is little overlap between the keys for the outer
dictionary and the inner dictionaries...but still much overlap between
the various inner dictionaries. Then, there is no point in using an
intern function for the outer dictionary, but still a benefit for the
inner dictionary. Thus, something like the following would be appropriate:

a[k1][local_intern(k2)] = value

Have I understood this properly?

Correct. Notice that this thing is a time-space tradeoff resolved in
favor of space - quite uncommon, as people these days always choose to
resolve them in favor of time. Interning a tuple will take some time,
as it is a dictionary lookup (which requires hashing the tuple); then
you will pass the interned tuple again for another dictionary lookup.

So make sure that you don't apply local_intern() twice to the very
same tuple object. E.g. when you also store the tuple in a variable,
don't do

var = (..., ..., ...)
a[foo][local_intern(var)] = foovar
print a[foo][local_intern(var)]

instead do

var = local_intern((..., ..., ...))
a[foo][var] = foovar
print a[foo][var]

Strictly speaking, you can omit the interning on read operations,
as that won't affect the keys stored in the dictionary. However,
performing the lookup with the interned key gives you speed back:
the dictionary lookup won't need to compare the tuples, but will
find that the search key is identical to the stored key, so the
keys are certainly equal.

Regards,
Martin
 
Bruno Desthuilliers

Bill Jackson wrote:
I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory. I'm looking for better data structures.
Is this actually a *real* problem (or do you have evidence, based on
both measurements of the behaviour with test data sets and knowledge of
the real data sets, that there will be a problem)? Or is this just
"worrying"? In the latter case, I suggest that you benchmark and profile
your code to know for sure...
 
