don't need dictionary's keys - hash table?

P

Peter Otten

Terry said:
Nick said:
The original post just said that he wanted to index some values by
their urls and didn't really care about the urls themselves, so I
suggested that he just use the hash of the key as the key. The
dictionary will then take a hash of that [note that:
hash(string)=hash(hash(string)) ] then the dictionary will not keep
the reference to the urls and GC will collect them. So instead of
dic[urlstring']=value he will do dic[hash(urlstring)]=value. But then
to retrieve he will have to do old_value=dic[hash(urlstring]].

Note that it is trivial to catch collisions on entry and correct them:

key = hash(url_string)
i = 0
if d.has_key(key):
while d.has_key(key):
i += 1
d[key + repr(i)] = val
else:
d[key] = val

or something along those lines. No need to keep the original string.

How do you intend to get the value back out of the dictionary?

candidates = []
for key in xrange(hash(url)):
try:
candidates.append(d[key])
except KeyError:
break

Instead, I would put all values with the same hash into a container, along
the lines of

d.setdefault(hash(url), []).append(value))

Here is a simple implementation that can be used to put the idea to a test.
It keeps all but the first url for a cluster with identical hashes.

class Cluster(object):
"""A cluster of dictionary values with the same hash(key).

Helper for Dict.
"""
def __init__(self, default):
self.pairs = {}
self.default = default
def __setitem__(self, key, value):
assert key not in self.pairs
self.pairs[key] = value
def __getitem__(self, key):
return self.pairs.get(key, self.default)
def __repr__(self):
return "Cluster(default=%r, pairs=%r)" % (self.default, self.pairs)
def itervalues(self):
yield self.default
for value in self.pairs.itervalues():
yield value

class Dict(object):
"""A dictionary that stores hashes instead of keys as long as there are
no collisions.

The value entered first becomes the default for a cluster of values with
the same hash. It follows that you cannot implement a reliable
containment test and must make sure by exterior means that only existing
keys are looked up.
"""
def __init__(self):
self.dict = {}
def __setitem__(self, key, value):
short_key = hash(key)
try:
cluster = self.dict[short_key]
except KeyError:
self.dict[short_key] = value
else:
if isinstance(cluster, Cluster):
cluster[key] = value
else:
cluster = Cluster(cluster)
cluster[key] = value
self.dict[short_key] = cluster
def __getitem__(self, key):
cluster = self.dict[hash(key)]
if not isinstance(cluster, Cluster):
return cluster
return cluster[key]
def itervalues(self):
for value in self.dict.itervalues():
if isinstance(value, Cluster):
for v in value.itervalues():
yield v
else:
yield value
def __repr__(self):
return "Dict(%s)" % repr(self.dict)[1:-1]

if __name__ == "__main__":
class BadHash(object):
"""Key with user-defined hash.

Allows enforcing hash collisions.
"""
def __init__(self, value, hash):
self.value = value
self.hash = hash
def __hash__(self):
return hash(self.hash)
def __eq__(self, other):
return self.value == other.value
def __repr__(self):
return "BadHash(value=%r, hash=%r)" % (self.value, self.hash)
d = Dict()
for i, v in enumerate("alpha beta gamma delta".split()):
d[BadHash(v, i)] = v
for v in "sigma tau omega".split():
d[BadHash(v, 42)] = v
print d
assert d[BadHash("gamma", 2)] == "gamma"

assert d[BadHash("sigma", 42)] == "sigma"
assert d[BadHash("tau", 42)] == "tau"

# but be warned that
assert d[BadHash("whatever", 42)] == "sigma"

expected = sorted("alpha beta gamma delta sigma tau omega".split())
assert list(sorted(d.itervalues())) == expected

I fear that the memory footprint of a url is not large enough to warrant the
approach, though.

Peter
 
P

Peter Otten

Terry said:
Nick said:
The original post just said that he wanted to index some values by
their urls and didn't really care about the urls themselves, so I
suggested that he just use the hash of the key as the key. The
dictionary will then take a hash of that [note that:
hash(string)=hash(hash(string)) ] then the dictionary will not keep
the reference to the urls and GC will collect them. So instead of
dic[urlstring']=value he will do dic[hash(urlstring)]=value. But then
to retrieve he will have to do old_value=dic[hash(urlstring]].

Note that it is trivial to catch collisions on entry and correct them:

key = hash(url_string)
i = 0
if d.has_key(key):
while d.has_key(key):
i += 1
d[key + repr(i)] = val
else:
d[key] = val

or something along those lines. No need to keep the original string.

How do you intend to get the value back out of the dictionary?

candidates = []
for key in xrange(hash(url), sys.maxint):
try:
candidates.append(d[key])
except KeyError:
break

Instead, I would put all values with the same hash into a container, along
the lines of

d.setdefault(hash(url), []).append(value))

Here is a simple implementation that can be used to put the idea to a test.
It keeps all but the first url for a cluster with identical hashes.

class Cluster(object):
"""A cluster of dictionary values with the same hash(key).

Helper for Dict.
"""
def __init__(self, default):
self.pairs = {}
self.default = default
def __setitem__(self, key, value):
assert key not in self.pairs
self.pairs[key] = value
def __getitem__(self, key):
return self.pairs.get(key, self.default)
def __repr__(self):
return "Cluster(default=%r, pairs=%r)" % (self.default, self.pairs)
def itervalues(self):
yield self.default
for value in self.pairs.itervalues():
yield value

class Dict(object):
"""A dictionary that stores hashes instead of keys as long as there are
no collisions.

The value entered first becomes the default for a cluster of values with
the same hash. It follows that you cannot implement a reliable
containment test and must make sure by exterior means that only existing
keys are looked up.
"""
def __init__(self):
self.dict = {}
def __setitem__(self, key, value):
short_key = hash(key)
try:
cluster = self.dict[short_key]
except KeyError:
self.dict[short_key] = value
else:
if isinstance(cluster, Cluster):
cluster[key] = value
else:
cluster = Cluster(cluster)
cluster[key] = value
self.dict[short_key] = cluster
def __getitem__(self, key):
cluster = self.dict[hash(key)]
if not isinstance(cluster, Cluster):
return cluster
return cluster[key]
def itervalues(self):
for value in self.dict.itervalues():
if isinstance(value, Cluster):
for v in value.itervalues():
yield v
else:
yield value
def __repr__(self):
return "Dict(%s)" % repr(self.dict)[1:-1]

if __name__ == "__main__":
class BadHash(object):
"""Key with user-defined hash.

Allows enforcing hash collisions.
"""
def __init__(self, value, hash):
self.value = value
self.hash = hash
def __hash__(self):
return hash(self.hash)
def __eq__(self, other):
return self.value == other.value
def __repr__(self):
return "BadHash(value=%r, hash=%r)" % (self.value, self.hash)
d = Dict()
for i, v in enumerate("alpha beta gamma delta".split()):
d[BadHash(v, i)] = v
for v in "sigma tau omega".split():
d[BadHash(v, 42)] = v
print d
assert d[BadHash("gamma", 2)] == "gamma"

assert d[BadHash("sigma", 42)] == "sigma"
assert d[BadHash("tau", 42)] == "tau"

# but be warned that
assert d[BadHash("whatever", 42)] == "sigma"

expected = sorted("alpha beta gamma delta sigma tau omega".split())
assert list(sorted(d.itervalues())) == expected

I fear that the memory footprint of a url is not large enough to warrant the
approach, though.

Peter
 
F

Fredrik Lundh

Thanks (everyone) for the comments. I like the idea of the bloom
filter or using an md5 hash, since a rare collision will not be a
show-stopper in my case.

btw, since my brain is on vacation, could anyone who already has all the necessary
background information in their head (Paul?) tell me if using a chopped-up MD5 or
SHA-1 (or whatever) digest as the hash functions for a bloom filter is a good idea...

i.e. something like:

import sha

try:
all
except NameError:
def all(S):
for x in S:
if not x:
return False
return True

def digest(data):
d = sha.sha(data).digest()
return [d[i:i+4] for i in range(0, len(d), 4)]

class bloom(object):

def __init__(self):
self.filter = [set() for i in range(len(digest("")))]

def add(self, data):
for s, p in zip(self.filter, digest(data)):
s.add(p)
def __contains__(self, data):

return all(p in s for s, p in zip(self.filter, digest(data)))

b = bloom()
b.add("showbiz")
b.add("absolution")
print "hullaballo" in b
print "showbiz" in b
print "origin of symmetry" in b

(and if this isn't a good idea, why it isn't).

</F>
 
P

Paul Rubin

Fredrik Lundh said:
btw, since my brain is on vacation, could anyone who already has all
the necessary background information in their head (Paul?) tell me
if using a chopped-up MD5 or SHA-1 (or whatever) digest as the hash
functions for a bloom filter is a good idea...

Erm, my brain isn't on vacation, it's just plain missing ;-). However
I'd say there's nothing wrong in principle with chopping up sha/md5
output if you want to get several hashes as you're doing, assuming the
total hash length is enough. The advantage over doing separate hashes
for each piece is optimization: you don't compute as many hashes.
But there's more flexibility with something like (untested):

def multi_hash(data, i, size): # find i'th slice of hash
return sha.new('%x,%s'% (i, data)).digest()[:size]

Also, the idea of a Bloom filter is to use less memory than a hash
table, by not storing the keys. In your implementation you chop the
sha/md5 output into 4-byte pieces, which means about 4 billion "slots"
in each set, which might be too many for a typical computer. Also,
you end up storing each of those 4-byte values as a Python string. So
I think in practice you'd want smaller sets, and to represent them as
bit vectors instead of as Python sets (untested):

import array
class bit_vector(object):
def __init__(self, size_in_bits):
nbytes = (size_in_bits + 7) // 8
self.vec = array.array('B', '\0' * nbytes)

def __getattr__(self, i): # get i'th bit from vector
a,b = divmod(i, 8)
return bool(self.vec[a] & (1 << b))

def __setattr__(self, i, value): # set i'th bit in vector
a,b = divmod(i, 8)
v = int(bool(value)) # the actual bit to set, 1 or 0
self.vec[a] = (self.vec[a] & ~(1 << b)) | (v << b)

If we use a bit vector representation we want the hash value to be an
integer (untested):

def bloom_hash(data, i, size_in_bits):
# my favorite way to get an integer value from sha
h = int(sha.new('%x,%s'% (i, data)).hexdigest(), 16)
return h % size_in_bits

We'd then implement the Bloom filter (untested):

class bloom(object):
def __init__(self, nlayers, layersize):
self.nlayers, self.layersize = nlayers, layersize
self.filter = [bit_vector(layersize) for i in xrange(nlayers)]

def add(self, data):
for i in xrange(self.nlayers):
self.filter[bloom_hash(data, i, self.layersize)] = True

def __contains__(self, data):
return False not in \
(self.filter[bloom_hash(data, i, self.layersize)] \
for i in xrange(self.nlayers))

Finally, I think to use Bloom filters effectively you have to choose
nlayers and layersize carefully, according to the number of keys you
expect to see, the amount of memory you have available, and/or the
probability of false matches that you're willing to accept. The
standard references (whatever they are) about Bloom filters probably
have some suitable formulas for computing these things.
 
P

Paul Rubin

Paul Rubin said:
Finally, I think to use Bloom filters effectively you have to choose
nlayers and layersize carefully, according to the number of keys you
expect to see, the amount of memory you have available, and/or the
probability of false matches that you're willing to accept. The
standard references (whatever they are) about Bloom filters probably
have some suitable formulas for computing these things.

Heh, there's an online calculator:

http://www.cc.gatech.edu/fac/Pete.Manolios/bloom-filters/calculator.html

Link is from

http://en.wikipedia.org/wiki/Bloom_filter

which is a good article. As described in the article, there's only
one "layer" in the Bloom filter (you use n different hashes but they
all touch the same bit map) which corresponds to how the ancient Unix
spell checker worked, I think. It would take some analysis to show
which way is better. I may try working out the formulas as an
exercise but I'm too sleepy for that right now.
 
P

Piet van Oostrum

k> Hello,
k> I am using some very large dictionaries with keys that are long strings
k> (urls). For a large dictionary these keys start to take up a
k> significant amount of memory. I do not need access to these keys -- I
k> only need to be able to retrieve the value associated with a certain
k> key, so I do not want to have the keys stored in memory. Could I just
k> hash() the url strings first and use the resulting integer as the key?
k> I think what I'm after here is more like a tradition hash table. If I
k> do it this way am I going to get the memory savings I am after? Will
k> the hash function always generate unique keys? Also, would the same
k> technique work for a set?

Maybe a Berkeley DB hash file would be a good alternative. It can contain
all your key,value pairs but will only keep a small amount in memory.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,432
Messages
2,571,680
Members
48,796
Latest member
Greg L.

Latest Threads

Top