Today's fun and educational Python recipe

  • Thread starter Raymond Hettinger

Irmen de Jong

Here's a 22-line beauty for a classic and amazing algorithm:
http://bit.ly/bloom_filter

The wiki article on the algorithm is brief and well-written:
http://en.wikipedia.org/wiki/Bloom_filter

It turns out that people in the 1970's were pretty smart :)

I think that often, the cleverness of people is inversely proportional
to the amount of CPU power and RAM that they have in their computer.

"Meh, what would I need such a thing for, I could just as well stick
everything into a list"

Thankfully there are also people to whom this doesn't apply.
(Are they stuck in the '70s?)

Also: wasn't there a talk at PyCon in which a Bloom filter was mentioned?


Irmen
 

Raymond Hettinger

It turns out that people in the 1970's were pretty smart :)
I think that often, the cleverness of people is inversely proportional
to the amount of CPU power and RAM that they have in their computer.

The Google guys have plenty of CPU power *and* plenty of
cleverness :)

According to the wikipedia article, Google BigTable uses Bloom filters
to reduce the disk lookups for non-existent rows or columns. The
Google Chrome web browser also uses Bloom filters to speed up its Safe
Browsing service.
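
The win in both cases is that a small in-memory bit array can answer
"definitely not present" without touching disk or network. For anyone who
hasn't seen one, the core idea fits in about twenty lines (a from-scratch
sketch for illustration, not the recipe's code):

    import hashlib

    class TinyBloom:
        def __init__(self, num_bits=1024, num_probes=8):
            self.num_bits = num_bits
            self.num_probes = num_probes
            self.bits = 0                      # one big int as the bit array

        def _positions(self, key):
            # Slice a single sha512 digest into num_probes probe positions.
            h = int.from_bytes(hashlib.sha512(key.encode('utf-8')).digest(), 'big')
            for _ in range(self.num_probes):
                h, pos = divmod(h, self.num_bits)
                yield pos

        def add(self, key):
            for pos in self._positions(key):
                self.bits |= 1 << pos

        def __contains__(self, key):
            # False means definitely absent; True means probably present.
            return all(self.bits >> pos & 1 for pos in self._positions(key))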

Also: wasn't there a talk at PyCon in which a Bloom filter was mentioned?

Yes! As a matter of fact there was:
http://www.slideshare.net/c.titus.brown/pycon-2011-talk-ngram-assembly-with-bloom-filters


Raymond
 

Grant Edwards

I think that often, the cleverness of people is inversely
proportional to the amount of CPU power and RAM that they have in
their computer.

True.

Unfortunately the difficulty in debugging and maintaining code is
often directly proportional to the cleverness exhibited by the
original programmer.
 

Paul Rubin

Raymond Hettinger said:
Here's a 22-line beauty for a classic and amazing algorithm:
http://bit.ly/bloom_filter

The use of pickle to serialize the keys is a little bit suspicious if
there might be a reason to dump the filter to disk and re-use it in
another run of the program. Pickle representation might change between
Python releases, for example. It's just supposed to stay interoperable
between versions, not necessarily bitwise-identical.
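
For string keys, one version-stable alternative would be hashing the UTF-8
bytes directly (a sketch):

    import hashlib

    def stable_digest(key):
        # The UTF-8 encoding of a str, unlike its pickle, won't change
        # between Python releases, so the filter can be dumped and reloaded.
        return hashlib.sha512(key.encode('utf-8')).digest()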

Otherwise it's quite nice. I'd suggest adding a .update() operation
that adds keys from a user-supplied iterator.
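Something like this minimal sketch would do, assuming the recipe's
single-key add() method:

    def update(self, keys):
        # Add every key from a user-supplied iterable to the filter.
        for key in keys:
            self.add(key)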
 

Irmen de Jong

The Google guys have plenty of CPU power *and* plenty of
cleverness :)

Haha, true. We need to add a Googlyness factor to the equation.
Or perhaps: think what they could achieve if they only had a few
machines instead of thousands ;-)


Thanks, that was the one.

I didn't attend PyCon, but I watched a truckload of talks on blip.tv and
that one caught my attention because of its somewhat funny title,
'Handling ridiculous amounts of data with probabilistic data structures'.

I didn't understand all of it but the name Bloom Filter stuck, it seems.
Adding it to my list of bookmarks of
useful-stuff-I-intend-to-use-one-day-in-the-future...

Irmen
 

Terry Reedy

Here's a 22-line beauty for a classic and amazing algorithm:
http://bit.ly/bloom_filter

The wiki article on the algorithm is brief and well-written:
http://en.wikipedia.org/wiki/Bloom_filter

As I understand the article, the array of num_bits should have
num_probes (more or less independent) bits set for each key. But as I
understand the code

         for i in range(self.num_probes):
             h, array_index = divmod(h, num_words)
             h, bit_index = divmod(h, 32)
             yield array_index, 1 << bit_index

the same bit is being set or tested num_probes times. The last three
lines have no dependence on i that I can see, so they appear to do the
same thing each time. This seems like a bug.

The article says "For a good hash function with a wide output, there
should be little if any correlation between different bit-fields of such
a hash, so this type of hash can be used to generate multiple
"different" hash functions by slicing its output into multiple bit
fields. Alternatively, one can pass k different initial values (such as
0, 1, ..., k − 1) to a hash function that takes an initial value; or add
(or append) these values to the key." I do not see the code doing either
of these.
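
For reference, the second of those alternatives might look like this sketch
(not the recipe's code; names are illustrative):

    import hashlib

    def probes(key, k, num_bits):
        # Simulate k "different" hash functions by appending the probe
        # number to the key before hashing.
        for i in range(k):
            digest = hashlib.sha512((key + str(i)).encode('utf-8')).digest()
            yield int.from_bytes(digest, 'big') % num_bits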
 

Raymond Hettinger

As I understand the article, the array of num_bits should have
num_probes (more or less independent) bits set for each key. But as I
understand the code

         for i in range(self.num_probes):
             h, array_index = divmod(h, num_words)
             h, bit_index = divmod(h, 32)
             yield array_index, 1 << bit_index

the same bit is being set or tested num_probes times. The last three
lines have no dependence on i that I can see, so they appear to do the
same thing each time. This seems like a bug.

The 512 bits in h are progressively eaten-up between iterations. So
each pass yields a different (array index, bit_mask) pair.

It's easy to use the interactive prompt to show that different probes
are produced on each pass:
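For example (the filter instance and the key here are illustrative):

    >>> from pprint import pprint
    >>> pprint(list(bf.get_probes('Alabama')))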
[(19, 1073741824),
(11, 64),
(9, 134217728),
(25, 1024),
(24, 33554432),
(6, 16),
(7, 16777216),
(22, 1048576)]

The 512 bits are uncorrelated -- otherwise sha512 wouldn't be much of
a cryptographic hash ;)

The fifty state example in the recipe is a reasonable demonstration
that the recipe works as advertised. It successfully finds all fifty
states (the true positives) and it tries 100,000 negatives resulting
in only a handful of false negatives. That should be somewhat
convincing that it all works.


Raymond
 

Raymond Hettinger

The use of pickle to serialize the keys is a little bit suspicious if
there might be a reason to dump the filter to disk and re-use it in
another run of the program.  Pickle representation might change between
Python releases, for example.  It's just supposed to stay interoperable
between versions, not necessarily bitwise-identical.

Otherwise it's quite nice.  I'd suggest adding a .update() operation
that adds keys from a user-supplied iterator.

I chose pickle because it solved the problem of turning arbitrary
objects into bytes which are needed as inputs to sha512.

It seems that this particular choice of hash function is distracting
some readers away from the interesting part of how a Bloom filter
works.

Since the choice of hash functions is completely arbitrary, I'm
thinking of substituting something a little more basic. What do you
think of this alternative as a way to make it clearer that each
successive probe is uncorrelated, yet fully dependent on the key?

    def get_probes(self, key):
        # (uses Random from the random module: from random import Random)
        hasher = Random(key).randrange
        num_words = len(self.arr)
        for _ in range(self.num_probes):
            array_index = hasher(num_words)
            bit_index = hasher(32)
            yield array_index, 1 << bit_index


Raymond
 

Terry Reedy

The 512 bits in h are progressively eaten-up between iterations. So
each pass yields a different (array index, bit_mask) pair.

Yeh, obvious now that I see it.
It's easy to use the interactive prompt to show that different probes
are produced on each pass:
[(19, 1073741824),
(11, 64),
(9, 134217728),
(25, 1024),
(24, 33554432),
(6, 16),
(7, 16777216),
(22, 1048576)]

Should have tried that.
The 512 bits are uncorrelated -- otherwise sha512 wouldn't be much of
a cryptographic hash ;)
The fifty state example in the recipe is a reasonable demonstration
that the recipe works as advertised. It successfully finds all fifty
states (the true positives) and it tries 100,000 negatives resulting
in only a handful of false negatives.

I presume you mean 'false positives', as in the program comment and
Wikipedia.

The test would be more convincing to many with 100,000 other geographic
names (hard to come by, I know), or other English names or words, or even
with longer random strings that match the lengths of the state names.
But an average of 5/100,000 false positives in 5 runs is good.
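
Generating the length-matched random strings is easy enough; a sketch (the
states argument stands for whatever list of names the recipe uses):

    import random
    import string

    def length_matched_negatives(states, n=100000):
        # Random lowercase strings whose lengths mirror the state names,
        # for probing the filter with plausible non-members.
        lengths = [len(s) for s in states]
        for _ in range(n):
            k = random.choice(lengths)
            yield ''.join(random.choice(string.ascii_lowercase) for _ in range(k))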
 

Raymond Hettinger

The test would be more convincing to many with 100000 other geographic
names (hard to come by, I know), or other english names or words or even
with longer random strings that matched the lengths of the state names.
But an average of 5/100000 false positives in 5 runs is good.

I've just posted an update with a spell checker using a large
dictionary. That should make it easy to validate against some real
world text samples.
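
In outline, the spell check is just membership testing against a big word
list; a sketch (the word-list path and the constructor arguments are
assumptions for illustration, not the recipe's actual code):

    # Populate a filter with dictionary words, then flag unknown words.
    with open('/usr/share/dict/words') as f:
        words = [line.strip().lower() for line in f]

    bf = BloomFilter(num_bits=32 * len(words), num_probes=13)  # hypothetical signature
    for w in words:
        bf.add(w)

    def misspelled(text):
        return sorted({w for w in text.lower().split() if w not in bf})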


Raymond
 

Chris Angelico

I think that often, the cleverness of people is inversely proportional to
the amount of CPU power and RAM that they have in their computer.

As Mark Rosewater is fond of saying, restrictions breed creativity.
Lack of computational resources is a major restriction (for an extreme
example of RAM shortage, look at how much code you can fit into a boot
sector without loading anything more from the disk). Take away all the
restrictions, and people will tend to code sloppily.

Chris Angelico
 
