__hash__ and __eq__


Aaron Brady

I am writing a mapping object, and I want to ask about the details of
__hash__ and __eq__. If I understand correctly, a Python dict looks up
the key's hash code first in O(1), then compares all the entries with a
matching hash for equality in O(n). That is, the hash code really just
narrows down the search. Am I correct?

P.S. I always feel like my posts should start like, "A mapping object
am writing I." Not too many verbs you can do that with in English.
What if I did it?
 

Dennis Lee Bieber

P.S. I always feel like my posts should start like, "A mapping object
am writing I." Not too many verbs you can do that with in English.
What if I did it?

You turn into a short fuzzy large eared critter in a robe with a
light-saber?
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 

Aaron Brady

        You turn into a short fuzzy large eared critter in a robe with a
light-saber?

A fuzzy large eared critter turn I into.

"Ending sentences in prepositions is nonsense up with which I will not
put." -Churchill

I'm mimicking the dict type with an SQL table (something to which I
alluded some time ago). I'm going to 'select' on the hash code, then
(*THEN*) revive the keys and compare on equality. The values don't
need to be revived.

P.S. def revive!
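
A minimal sketch of that select-on-hash-then-compare idea (my own
illustration, not Aaron's actual code; the table layout and names are
made up), using sqlite3 and pickle to "revive" keys:

import pickle
import sqlite3

class SQLDict:
    """Toy sqlite3-backed mapping: select candidate rows by hash, then
    unpickle ('revive') only the keys and compare them with ==."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (hash INTEGER, key BLOB, value BLOB)")
        self.conn.execute(
            "CREATE INDEX IF NOT EXISTS items_hash ON items (hash)")

    def _find_rowid(self, key, h):
        # Narrow the search with the hash, then compare revived keys.
        for rowid, stored_key in self.conn.execute(
                "SELECT rowid, key FROM items WHERE hash = ?", (h,)):
            if pickle.loads(stored_key) == key:
                return rowid
        return None

    def __setitem__(self, key, value):
        h = hash(key)
        rowid = self._find_rowid(key, h)
        if rowid is None:
            self.conn.execute("INSERT INTO items VALUES (?, ?, ?)",
                              (h, pickle.dumps(key), pickle.dumps(value)))
        else:
            self.conn.execute("UPDATE items SET value = ? WHERE rowid = ?",
                              (pickle.dumps(value), rowid))

    def __getitem__(self, key):
        rowid = self._find_rowid(key, hash(key))
        if rowid is None:
            raise KeyError(key)
        (value,) = self.conn.execute(
            "SELECT value FROM items WHERE rowid = ?", (rowid,)).fetchone()
        return pickle.loads(value)    # values are only revived on a hit

The values never need unpickling until a key actually matches, which is
the point Aaron is making.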
 

Piet van Oostrum

Aaron Brady said:
AB> I am writing a mapping object, and I want to ask about the details of
AB> __hash__ and __eq__. If I understand correctly, a Python dict looks up
AB> the key's hash code first in O(1), then compares all the entries with a
AB> matching hash for equality in O(n). That is, the hash code really just
AB> narrows down the search. Am I correct?

The O(1) part is correct. The O(n) is incorrect if n as usual is taken
to be the number of keys in the dict. But it is true if n is the number
of keys in the `bucket' that belongs to the hash value. The average
complexity of looking up a key is O(1).

See http://wiki.python.org/moin/DictionaryKeys
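
You can watch that order for yourself with a throwaway key class (my
sketch, not from the thread) that reports each __hash__ and __eq__ call:

class NoisyKey:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        print("__hash__(%r)" % (self.value,))
        return hash(self.value)

    def __eq__(self, other):
        print("__eq__(%r, %r)" % (self.value, getattr(other, "value", other)))
        return isinstance(other, NoisyKey) and self.value == other.value

d = {NoisyKey("spam"): 1, NoisyKey("eggs"): 2}
d[NoisyKey("spam")]   # one __hash__ call, then __eq__ only against the candidate

The hash is computed once per lookup; __eq__ only runs for entries that
land in the same slot.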
 

Arnaud Delobelle

Piet van Oostrum said:
The O(1) part is correct. The O(n) is incorrect if n as usual is taken
to be the number of keys in the dict. But it is true if n is the number
of keys in the `bucket' that belongs to the hash value. The average
complexity of looking up a key is O(1).

As all the keys could be in the same bucket, O(n) seems correct to me
(with n the total number of keys).
 

Terry Reedy

Aaron said:
I am writing a mapping object, and I want to ask about the details of
__hash__ and __eq__. If I understand correctly, a Python dict looks up
the key's hash code first in O(1), then compares all the entries with a
matching hash for equality in O(n). That is, the hash code really just
narrows down the search. Am I correct?

Yes. Even if your mapping object did a linear O(n) search on the hash
values, testing for equality only after finding an equal hash would
still be faster than comparing every key directly. Your own mapping
object can do whatever you want it to do internally, as long as it
provides the appropriate interface. You could, for instance, make a map
that allowed unhashable lists, (unfrozen) sets, and dicts as keys by
using len() as a surrogate for hash().

tjr
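
A minimal sketch of that len()-as-surrogate idea (my own illustration,
not Terry's code): group entries in buckets keyed by len(key) and fall
back to == within a bucket, so unhashable lists, sets and dicts can
serve as keys.

class LenKeyedMap:
    def __init__(self):
        self._buckets = {}          # len(key) -> list of (key, value) pairs

    def __setitem__(self, key, value):
        bucket = self._buckets.setdefault(len(key), [])
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def __getitem__(self, key):
        for k, v in self._buckets.get(len(key), []):
            if k == key:
                return v
        raise KeyError(key)

m = LenKeyedMap()
m[[1, 2, 3]] = "a list key"
m[{1, 2, 3}] = "a set key"       # same length, so same bucket, but != the list
print(m[[1, 2, 3]])              # -> a list key

As Robert Kern notes in the next reply, this only stays consistent as
long as a key's length never changes while it is stored.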
 

Robert Kern

Yes. Even if your mapping object did a linear O(n) search on the hash
values, testing for equality only after finding an equal hash would
still be faster than comparing every key directly. Your own mapping
object can do whatever you want it to do internally, as long as it
provides the appropriate interface. You could, for instance, make a map
that allowed unhashable lists, (unfrozen) sets, and dicts as keys by
using len() as a surrogate for hash().

Only as long as you don't modify the length of those objects serving as keys in
the meantime. But if you're careful about that, knock yourself out.

The important property to maintain is that if two keys compare equal to each
other, then their hashes should equal each other. The reverse does not have to
be true.
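
For illustration (my example, not Robert's): deriving __hash__ from
exactly the fields that __eq__ compares keeps that property
automatically. Unequal objects may still share a hash; that is allowed,
it just costs an extra __eq__ call on a collision.

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))   # same fields as __eq__

assert Point(1, 2) == Point(1, 2)
assert hash(Point(1, 2)) == hash(Point(1, 2))
d = {Point(1, 2): "found"}
print(d[Point(1, 2)])                   # found via hash, confirmed via __eq__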

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Piet van Oostrum

Arnaud Delobelle said:
AB> I am writing a mapping object, and I want to ask about the details of
AB> __hash__ and __eq__. If I understand correctly, a Python dict looks up
AB> the key's hash code first in O(1), then compares all the entries with a
AB> matching hash for equality in O(n). That is, the hash code really just
AB> narrows down the search. Am I correct?
AD> As all the keys could be in the same bucket, O(n) seems correct to me
AD> (with n the total number of keys).

For worst case only (mainly with a badly designed hash function).
 

Arnaud Delobelle

Piet van Oostrum said:
For worst case only (mainly with a badly designed hash function).

... or a cleverly designed data set!

AFAIK, 'complexity' means 'worst case complexity' unless otherwise
stated.
 

Steven D'Aprano

AFAIK, 'complexity' means 'worst case complexity' unless otherwise
stated.

No, it means "average or typical case" unless otherwise specified.
Consult almost any comp sci text book and you'll see hash tables with
chaining (like Python dicts) described as O(1) rather than O(N),
Quicksort as O(N log N) instead of O(N**2), and similar. If the default
was "worst case unless otherwise specified", then Quicksort would be
called "Slower than Bubblesort Sort".

(Both are O(N**2), but Quicksort does more work.)

Here's a quote on-line:

"You should be clear about which cases big-oh notation describes. By
default it usually refers to the average case, using random data.
However, the characteristics for best, worst, and average cases can be
very different..."

http://leepoint.net/notes-java/algorithms/big-oh/bigoh.html


It is simply misleading to describe dicts as O(N) without the
qualification "if the data is chosen maliciously, or otherwise by an
incredible fluke". And even if it isn't, Piet explicitly said he was
talking about the average behaviour, not worst.
 

Arnaud Delobelle

Steven D'Aprano said:
No, it means "average or typical case" unless otherwise specified.
Consult almost any comp sci text book and you'll see hash tables with
chaining (like Python dicts) described as O(1) rather than O(N),
Quicksort as O(N log N) instead of O(N**2), and similar. If the default
was "worst case unless otherwise specified", then Quicksort would be
called "Slower than Bubblesort Sort".

(Both are O(N**2), but Quicksort does more work.)

OK. I remember from my student years, when studying complexity theory,
that 'complexity' implicitly referred to worst-case complexity. I
assumed this was a widely used convention; I don't have any evidence
that it is, so you may very well be correct.

Anyway, it's good to know that quicksort is O(n^2) in the worst case -
and that this worst case can crop up very easily in some situations,
especially if not too much care has been taken implementing it.

[...]
It is simply misleading to describe dicts as O(N) without the
qualification "if the data is chosen maliciously, or otherwise by an
incredible fluke". And even if it isn't, Piet explicitly said he was
talking about the average behaviour, not worst.

But the OP only mentioned 'complexity', not 'average complexity', and I
imagine that he knows the average case complexity of accessing a key in
a hashtable.
 

Steven D'Aprano

Anyway, it's good to know that quicksort is O(n^2) in the worst case -
and that this worst case can crop up very easily in some situations,
especially if not too much care has been taken implementing it.

The naive quicksort algorithm, where you recursively split the list into
two halves, performs badly if the list is already sorted (or nearly so).
It's easy to fix that: randomise the list before you sort! You can do
that with a single pass of the list. That has the advantage of ensuring
that no specific input can cause degraded performance, so an attacker
can't DOS your application by passing sorted lists to be sorted.

Another strategy is to randomly exchange the pivot element when sorting.
This avoids having to do a separate pass over the list, while still
ensuring that no particular input can cause degraded performance.

A third strategy is to use a hybrid insertion/quicksort function. This is
probably a good idea regardless, because for small lists, the overhead of
quicksort generally makes insertion sort faster anyway. (CPython's sort
used to be similar: it was a hybrid insertion/samplesort until it was
replaced by Timsort in version 2.3, if memory serves.)
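
A sketch of the random-pivot strategy (my illustration; real
implementations partition in place, this version just shows the pivot
choice): picking the pivot at random means no fixed input, sorted or
otherwise, can reliably trigger the O(n**2) worst case.

import random

def quicksort(items):
    if len(items) <= 1:
        return items
    pivot = random.choice(items)            # random pivot, not items[0]
    smaller = [x for x in items if x < pivot]
    equal   = [x for x in items if x == pivot]
    larger  = [x for x in items if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))   # [1, 1, 2, 3, 4, 5, 6, 9]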
 

Terry Reedy

Steven said:
The naive quicksort algorithm, where you recursively split the list into
two halves, performs badly if the list is already sorted (or nearly so).
It's easy to fix that: randomise the list before you sort! You can do
that with a single pass of the list. That has the advantage of ensuring
that no specific input can cause degraded performance, so an attacker
can't DOS your application by passing sorted lists to be sorted.

Another strategy is to randomly exchange the pivot element when sorting.
This avoids having to do a separate pass over the list, while still
ensuring that no particular input can cause degraded performance.

A third strategy is to use a hybrid insertion/quicksort function. This is
probably a good idea regardless, because for small lists, the overhead of
quicksort generally makes insertion sort faster anyway. (CPython's sort
used to be similar: it was a hybrid insertion/samplesort until it was
replaced by Timsort in version 2.3, if memory serves.)

Which is a hybrid of binary insertion sort and mergesort, with binary
insertion sort used for sublists of fewer than 64 items.
 

Aahz

No, it means "average or typical case" unless otherwise specified.
Consult almost any comp sci text book and you'll see hash tables with
chaining (like Python dicts) described as O(1) rather than O(N),
Quicksort as O(N log N) instead of O(N**2), and similar. If the default
was "worst case unless otherwise specified", then Quicksort would be
called "Slower than Bubblesort Sort".

(Both are O(N**2), but Quicksort does more work.)

Here's a quote on-line:

"You should be clear about which cases big-oh notation describes. By
default it usually refers to the average case, using random data.
However, the characteristics for best, worst, and average cases can be
very different..."

http://leepoint.net/notes-java/algorithms/big-oh/bigoh.html

When talking about big-oh, I prefer to define "worst-case" as "real world
likely worst-case" -- for example, feeding an already-sorted list to
Quicksort. However, the kind of worst-case that causes a dict to have
O(N) behavior I would call "pathological case". I generally try to
define big-oh in terms of what I call worst-case, because I think it's
important to keep track of where your algorithm is likely to run into
problems, but I don't think it's worth the effort in most cases to worry
about pathological cases.

In the case of dicts, it's important to remember that pathological cases
are far more likely to arise from poorly designed hashable user classes
than from data problems per se; Python's hashing of built-in types has
almost twenty years of tuning.
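
An example of the kind of poorly designed hashable class Aahz means
(mine, for illustration): a constant __hash__ is legal, since equal
objects still hash equal, but it forces every key into the same slot
and turns each dict lookup into a linear scan of __eq__ calls.

import time

class BadKey:
    def __init__(self, n):
        self.n = n

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.n == other.n

    def __hash__(self):
        return 42            # every key collides

def build(key_type, count):
    start = time.time()
    dict((key_type(i), i) for i in range(count))
    return time.time() - start

print("BadKey: %.3fs" % build(BadKey, 2000))   # roughly quadratic
print("int:    %.3fs" % build(int, 2000))      # effectively instant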
 

Albert van der Horst

The naive quicksort algorithm, where you recursively split the list into
two halves, performs badly if the list is already sorted (or nearly so).

No it doesn't (in general).
It only does so if you use the first element as the pivot,
resulting in a set with one element and a set with n-1 elements.
It is an incredibly long time since anyone implemented qsort
in this very naive way, especially since Knuth warned about it
back in the 80's.

A current naive way is to use element (m+n)/2 when sorting the range m..n.
This is nearly optimal for a sorted set. For a random set the chance
of a consistently bad choice is astronomically low. I have a QSORT
implemented in the library for my lina Forth and didn't bother to do
better than this. (Used in some of the demanding problems on
projecteuler.net.)

More sophisticated qsorts select three elements at random and use the
best of the three. The worst behaviour is still O(n^2), but the
multiplication factor is lower and the chance is astronomically low to
the third power.

        It's easy to fix that: randomise the list before you sort! You can do
        that with a single pass of the list. That has the advantage of ensuring
        that no specific input can cause degraded performance, so an attacker
        can't DOS your application by passing sorted lists to be sorted.
        Another strategy is to randomly exchange the pivot element when sorting.

Randomly *choose*, you mean?

A good sort utility with facilities for records with fields
(using a modified heap sort, and a disk-based merge sort for very
large sets) can be found on my site below (called ssort).
Because of the heap sort, even the worst case is O(N log N), if
the O(N^2) behaviour of qsort is your worry.

Groetjes Albert
 
