Better dict of dicts

Bill Jackson

I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory. I'm looking for better data structures.

Here is an example:
.... "string_3":5,
.... "string_4":10},
.... "string_2": {"string_2":12,
.... "string_6":2,
.... "string_1":4}}

So as my strings get longer and longer, it seems that the dictionary of
dictionary representation is less and less efficient.

My thought was to subclass the dictionary structure....

keys = {"string_1":1,
"string_2":2,
"string_3":3,
"string_4":4,
"string_6":5}

Then the underlying dictionary of dictionaries would look like:

a = {1:{2:1,3:5,4:10},2:{2:12,5:2,1:4}}

Somehow I need to intercept every possible call though....such that

a["string_1"]["string_2"] actually calls a[1][2]

and

a.keys() returns ["string_1", "string_2", "string_3"....]
rather than [1,2,3,4,5]

etc.

Ideally, I would like the option to have different key hashes for the
rows and columns as well.
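
Roughly, I am imagining something along these lines (untested; the class
name and internals are just to illustrate the idea):

class KeyedDict(dict):
    """Store long keys as small ints internally, but present the originals."""
    def __init__(self):
        dict.__init__(self)
        self._ids = {}      # original key -> small int
        self._names = {}    # small int -> original key
    def _id(self, key):
        if key not in self._ids:
            n = len(self._ids)
            self._ids[key] = n
            self._names[n] = key
        return self._ids[key]
    def __setitem__(self, key, value):
        dict.__setitem__(self, self._id(key), value)
    def __getitem__(self, key):
        return dict.__getitem__(self, self._ids[key])
    def keys(self):
        return [self._names[n] for n in dict.keys(self)]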

Any ideas?
 
Adam Atlas

I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory.

I wouldn't worry about it. Try doing hash('string_2') in the
interpreter -- the output thereof is what's really being used as the
key. It doesn't use up any more memory than the integer 2.
 
John Bauman

Adam said:
I wouldn't worry about it. Try doing hash('string_2') in the
interpreter -- the output thereof is what's really being used as the
key. It doesn't use up any more memory than the integer 2.
Are you sure about that? Most dictionaries need to store the actual key,
in case of a collision, so when you lookup a key they can tell which
you're really looking for.
 
Martin v. Löwis

Bill said:
I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary.

What I don't understand here: you say the keys are tuples, yet later,
you show that the keys are strings. Which one is it?
The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory. I'm looking for better data structures.

Here is an example:

... "string_3":5,
... "string_4":10},
... "string_2": {"string_2":12,
... "string_6":2,
... "string_1":4}}

So as my strings get longer and longer, it seems that the dictionary of
dictionary representation is less and less efficient.

You seem to assume that every occurrence of "string_2" has separate
memory. That may or may not be the case, depending on where the strings
come from. For example, it may be that the keys are the very same strings:

s1 = "string_1"
s2 = "string_2"

a[s1][s2] = 1
a[s2][s1] = 4

Then, the memory for the string would exist only once.
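
For instance, a quick check along these lines shows both dictionaries
holding a reference to the very same string object:

>>> s1 = "string_1"
>>> a = {s1: {}}
>>> b = {s1: {}}
>>> a.keys()[0] is b.keys()[0]
True
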
My thought was to subclass the dictionary structure....

keys = {"string_1":1,
"string_2":2,
"string_3":3,
"string_4":4,
"string_6":5}

Instead of doing that, I would use a procedure called "interning".
You may want to use the builtin intern function, or you can
come up with your own interning:

interns = {}
def local_intern(s):
    return interns.setdefault(s, s)

Then, instead of

a[k1][k2] = value

do

a[local_intern(k1)][local_intern(k2)] = value

then all strings occur only once as keys, as long as the interns
dictionary isn't cleared.
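
For example, something along these lines should show two equal but
distinct strings collapsing to a single key object:

>>> interns = {}
>>> def local_intern(s):
...     return interns.setdefault(s, s)
...
>>> k1 = "".join(["string", "_1"])   # built at runtime: a fresh string object
>>> k2 = "".join(["string", "_1"])   # equal to k1, but a distinct object
>>> k1 == k2, k1 is k2
(True, False)
>>> local_intern(k1) is local_intern(k2)
True
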
Any ideas?

HTH,
Martin
 
DillonCo

I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory. I'm looking for better data structures.

I think you may want to look into that rarely used function "intern"
(under "Non-essential Built-in Functions").

Basically, Python keeps a cache of certain strings that are frequently used, so
comparisons and dictionary lookups only require a pointer comparison. You
could then subclass dict (though using "DictMixin" could be better) like:

class IDict(DictMixin):
    def __setitem__(self, key, value):
        key = intern(key)
        self.__dict[key] = value

That's totally untested and incomplete, but you hopefully get the idea.
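
A slightly more filled-in (still untested) sketch, letting DictMixin supply
the rest of the mapping interface; the internal attribute name is arbitrary:

from UserDict import DictMixin

class IDict(DictMixin):
    """A mapping that interns its string keys on the way in."""
    def __init__(self):
        self._d = {}
    def __getitem__(self, key):
        return self._d[intern(key)]
    def __setitem__(self, key, value):
        self._d[intern(key)] = value
    def __delitem__(self, key):
        del self._d[intern(key)]
    def keys(self):
        return self._d.keys()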

Python (or at least CPython) seems to auto-intern some strings on its own
(you could look at the source if you care about the exact rules).
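
For instance, literals that look like identifiers appear to get interned
automatically in CPython 2.x, so a check along these lines should print True:

>>> a = "string_2"
>>> b = "string_2"
>>> a is b
True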

So you don't have all that much to worry about.
 
Bill Jackson

Martin v. Löwis wrote the following on 04/19/2007 02:43 PM:
What I don't understand here: you say the keys are tuples, yet later,
you show that the keys are strings. Which one is it?

Sorry, I was just lazy. The keys will always be tuples...tuple of
strings, tuples of numbers, tuples of objects....simply tuples.
Instead of doing that, I would use a procedure called "interning".
You may want to use the builtin intern function, or your can
come up with your own interning:

interns = {}
def local_intern(s):
    return interns.setdefault(s, s)

Then, instead of

a[k1][k2] = value

do

a[local_intern(k1)][local_intern(k2)] = value

then all strings occur only once as keys, as long as the interns
dictionary isn't cleared.

So, my structure is something like this:

a = {tuple_1: {tuple_2:value_1, tuple_3:value_2},
     tuple_4: {tuple_1:value_3, tuple_3:value_4},
     tuple_5: {tuple_2:value_5, tuple_3:value_6, tuple_4:value_7}}

Since the tuples from the inner dictionaries and the outer dictionaries
are frequently the same, I would benefit from using a single intern
function. Then, the tuples will always be "pointing" to the values
stored in the intern dictionary.

Now suppose there is little overlap between the keys for the outer
dictionary and the inner dictionaries...but still much overlap between
the various inner dictionaries. Then, there is no point in using an
intern function for the outer dictionary, but still a benefit for the
inner dictionary. Thus, something like the following would be appropriate:

a[k1][local_intern(k2)] = value

Have I understood this properly?
 
Steven D'Aprano

Are you sure about that? Most dictionaries need to store the actual key,
in case of a collision, so when you lookup a key they can tell which
you're really looking for.


The key is already stored, so long as the string 'string_2' exists. And it
will continue to exist so long as the dictionary includes it as a key. An
extra copy isn't made (unless you make an extra copy yourself).

In other words, the dictionary stores a reference to the string, not a
copy of it:
s = "this is a long string"
d = {s: 1}
id(s) -1209073840
id(d.keys()[0])
-1209073840
 
DillonCo

Martin v. Löwis wrote the following on 04/19/2007 02:43 PM:

Sorry, I was just lazy. The keys will always be tuples...tuple of
strings, tuples of numbers, tuples of objects....simply tuples.

That'll change things a bit because intern only works with strings.

Of course, it's not so big a deal, but you will have to put together a class
to implement interning.
I wrote one for fun:

class ITuple(tuple):
    _interns = {}
    def __new__(cls, tup):
        if tup not in cls._interns:
            itup = tuple.__new__(cls, tup)
            cls._interns[tup] = itup
        return cls._interns[tup]
    def __init__(self, *args):
        # Prevent multiple calls to __init__
        if hasattr(self, "_inited"): return
        tuple.__init__(self, *args)
        self._inited = True
    def __eq__(self, o):
        # If the other is an ITuple, self == o iff self is o
        if isinstance(o, ITuple):
            return self is o
        return tuple.__eq__(self, o)
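
For instance, something like this should hold:

>>> t1 = ITuple((1, 2, 3))
>>> t2 = ITuple((1, 2, 3))
>>> t1 is t2
True
>>> t1 == (1, 2, 3)
True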


That seems to work. Something to note is that the call overhead of the __eq__
function is large enough that, unless you have a slow-comparing tuple,
comparisons will be faster without it. Comparisons are fast if they are done
internally, i.e. between builtin objects or identical ("is") objects.

For an example, suppose you have:

class TTT(object):
    def __eq__(self, o): return True

a, b = TTT(), TTT()

Then the following comparisons are fast:
(1,2,3)==(1,2,3)
(1,2,3,a)==(1,2,3,a)
(0,0,0,a)==(1,2,3,b)

The following are slow:
(1,2,3,a)==(1,2,3,b)

Note that the only slow case is the one where a.__eq__(b) is called. However,
a.__eq__(b) is assumed True if "a is b" is True. So chances are you'll want
to comment out the __eq__ function.
 
Paddy

Martin v. Löwis wrote the following on 04/19/2007 02:43 PM:
Sorry, I was just lazy. The keys will always be tuples...tuple of
strings, tuples of numbers, tuples of objects....simply tuples.

That'll change things a bit because intern only works with strings.

Of course, It's not so big a deal, but you will have to put together a class
to implement interning.

Hi DillonCo,
Martin's earlier local_intern function would work for tuples as well as
strings.
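
For example (assuming the tuple's elements are themselves hashable),
something along these lines should work:

>>> interns = {}
>>> def local_intern(x):
...     return interns.setdefault(x, x)
...
>>> t1 = ("string_1", 42)
>>> t2 = ("string_1", 42)
>>> local_intern(t1) is local_intern(t2)
True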

- Paddy.
 
DillonCo

Martins earlier local_intern function would work for tuples as well as
strings.

It certainly would. I had written that class, though, primarily to offer a
performance improvement in the __eq__ and perhaps __hash__ methods. However,
I ended up being rather surprised by just how much overhead there was in
calling the Python methods vs. the builtin ones.

So really mine only ends up being useful if the tuple consists of a couple
(i.e. 2+) of objects with Python __eq__ methods. Oh well.

As an interesting aside:

>>> class T(object):
...     def __eq__(self, o): return False
...
>>> a = T()
>>> (a,) == (a,)
True

Apparently the tuple's __eq__ assumes: "a is b" => "a==b"
Bug? Or am I creating evil corner cases ;)?
 
Martin v. Löwis

Now suppose there is little overlap between the keys for the outer
dictionary and the inner dictionaries...but still much overlap between
the various inner dictionaries. Then, there is no point in using an
intern function for the outer dictionary, but still a benefit for the
inner dictionary. Thus, something like the following would be appropriate:

a[k1][local_intern(k2)] = value

Have I understood this properly?

Correct. Notice that this thing is a time-space tradeoff resolved in
favor of space - quite uncommon, as people these days always choose to
resolve them in favor of time. Interning a tuple will take some time,
as it is a dictionary lookup (which requires hashing the tuple); then
you will pass the interned tuple again for another dictionary lookup.

So make sure that you don't apply local_intern() twice to the very
same tuple object. E.g. when you also store the tuple in a variable,
don't do

var = (..., ..., ...)
a[foo][local_intern(var)] = foovar
print a[foo][local_intern(var)]

instead do

var = local_intern((..., ..., ...))
a[foo][var] = foovar
print a[foo][var]

Strictly speaking, you can omit the interning on read operations,
as that won't affect the keys stored in the dictionary. However,
performing the lookup with the interned key gives you speed back:
the dictionary lookup won't need to compare the tuples, but will
find that the search key is identical to the stored key, so the
keys are certainly equal.

Regards,
Martin
 
Bruno Desthuilliers

Bill Jackson wrote:
I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory. I'm looking for better data structures.
Is this actually a *real* problem (or do you have evidence, based on
both measurements of the behaviour with test data sets and knowledge of
the real data sets, that there will be a problem)? Or is this just
"worrying"? In the latter case, I suggest that you benchmark and profile
your code to know for sure...
 
