__hash__ and __eq__


Aaron Brady

I am writing a mapping object, and I want to ask about the details of
__hash__ and __eq__. If I understand correctly, a Python dict looks up
the key's hash code first in O(1), then compares all the entries with a
matching hash for equality in O(n). That is, the hash code really just
narrows down the search. Am I correct?

P.S. I always feel like my posts should start like, "A mapping object
am writing I." Not too many verbs you can do that with in English.
What if I did it?
 

Dennis Lee Bieber

P.S. I always feel like my posts should start like, "A mapping object
am writing I." Not too many verbs you can do that with in English.
What if I did it?

You turn into a short fuzzy large eared critter in a robe with a
light-saber?
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 

Aaron Brady

        You turn into a short fuzzy large eared critter in a robe with a
light-saber?

A fuzzy large eared critter turn I into.

"Ending sentences in prepositions is nonsense up with which I will not
put." -Churchill

I'm mimicking the dict type with an SQL table (something to which I
alluded some time ago). I'm going to 'select' on the hash code, then
(*THEN*) revive the keys and compare on equality. The values don't
need to be revived.

P.S. def revive!
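
A minimal sketch of that select-on-hash-then-compare idea (my own
illustration, not Aaron's actual code; the table layout and names are
made up), using sqlite3 and pickle to "revive" keys:

import pickle
import sqlite3

class SQLDict:
    """Toy sqlite3-backed mapping: select candidate rows by hash, then
    unpickle ('revive') only the keys and compare them with ==."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (hash INTEGER, key BLOB, value BLOB)")
        self.conn.execute(
            "CREATE INDEX IF NOT EXISTS items_hash ON items (hash)")

    def _find_rowid(self, key, h):
        # Narrow the search with the hash, then compare revived keys.
        for rowid, stored_key in self.conn.execute(
                "SELECT rowid, key FROM items WHERE hash = ?", (h,)):
            if pickle.loads(stored_key) == key:
                return rowid
        return None

    def __setitem__(self, key, value):
        h = hash(key)
        rowid = self._find_rowid(key, h)
        if rowid is None:
            self.conn.execute("INSERT INTO items VALUES (?, ?, ?)",
                              (h, pickle.dumps(key), pickle.dumps(value)))
        else:
            self.conn.execute("UPDATE items SET value = ? WHERE rowid = ?",
                              (pickle.dumps(value), rowid))

    def __getitem__(self, key):
        rowid = self._find_rowid(key, hash(key))
        if rowid is None:
            raise KeyError(key)
        (value,) = self.conn.execute(
            "SELECT value FROM items WHERE rowid = ?", (rowid,)).fetchone()
        return pickle.loads(value)    # values are only revived on a hit

The values never need unpickling until a key actually matches, which is
the point Aaron is making.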
 

Piet van Oostrum

Aaron Brady said:
AB> I am writing a mapping object, and I want to ask about the details of
AB> __hash__ and __eq__. If I understand correctly, a Python dict looks up
AB> the key's hash code first in O(1), then compares all the entries with a
AB> matching hash for equality in O(n). That is, the hash code really just
AB> narrows down the search. Am I correct?

The O(1) part is correct. The O(n) is incorrect if n as usual is taken
to be the number of keys in the dict. But it is true if n is the number
of keys in the `bucket' that belongs to the hash value. The average
complexity of looking up a key is O(1).

See http://wiki.python.org/moin/DictionaryKeys
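
You can watch that order for yourself with a throwaway key class (my
sketch, not from the thread) that reports each __hash__ and __eq__ call:

class NoisyKey:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        print("__hash__(%r)" % (self.value,))
        return hash(self.value)

    def __eq__(self, other):
        print("__eq__(%r, %r)" % (self.value, getattr(other, "value", other)))
        return isinstance(other, NoisyKey) and self.value == other.value

d = {NoisyKey("spam"): 1, NoisyKey("eggs"): 2}
d[NoisyKey("spam")]   # one __hash__ call, then __eq__ only against the candidate

The hash is computed once per lookup; __eq__ only runs for entries that
land in the same slot.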
 

Arnaud Delobelle

Piet van Oostrum said:
The O(1) part is correct. The O(n) is incorrect if n as usual is taken
to be the number of keys in the dict. But it is true if n is the number
of keys in the `bucket' that belongs to the hash value. The average
complexity of looking up a key is O(1).

As all the keys could be in the same bucket, O(n) seems correct to me
(with n the total number of keys).
 

Terry Reedy

Aaron said:
I am writing a mapping object, and I want to ask about the details of
__hash__ and __eq__. If I understand correctly, a Python dict looks up
the key's hash code first in O(1), then compares all the entries with a
matching hash for equality in O(n). That is, the hash code really just
narrows down the search. Am I correct?

Yes. Even if your mapping object did a linear O(n) search on the hash
values, testing for equality only after finding an equal hash would
still be faster than comparing every key directly. Your own mapping
object can do whatever you want it to do internally, as long as it
provides the appropriate interface. You could, for instance, make a map
that allowed unhashable lists, (unfrozen) sets, and dicts as keys by
using len() as a surrogate for hash().

tjr
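
A minimal sketch of that len()-as-surrogate idea (my own illustration,
not Terry's code): group entries in buckets keyed by len(key) and fall
back to == within a bucket, so unhashable lists, sets and dicts can
serve as keys.

class LenKeyedMap:
    def __init__(self):
        self._buckets = {}          # len(key) -> list of (key, value) pairs

    def __setitem__(self, key, value):
        bucket = self._buckets.setdefault(len(key), [])
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def __getitem__(self, key):
        for k, v in self._buckets.get(len(key), []):
            if k == key:
                return v
        raise KeyError(key)

m = LenKeyedMap()
m[[1, 2, 3]] = "a list key"
m[{1, 2, 3}] = "a set key"       # same length, so same bucket, but != the list
print(m[[1, 2, 3]])              # -> a list key

As Robert Kern notes in the next reply, this only stays consistent as
long as a key's length never changes while it is stored.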
 

Robert Kern

Yes. Even if your mapping object did a linear O(n) search on the hash
values, testing for equality only after finding an equal hash would
still be faster than comparing every key directly. Your own mapping
object can do whatever you want it to do internally, as long as it
provides the appropriate interface. You could, for instance, make a map
that allowed unhashable lists, (unfrozen) sets, and dicts as keys by
using len() as a surrogate for hash().

Only as long as you don't modify the length of those objects serving as keys in
the meantime. But if you're careful about that, knock yourself out.

The important property to maintain is that if two keys compare equal to each
other, then their hashes should equal each other. The reverse does not have to
be true.
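
For illustration (my example, not Robert's): deriving __hash__ from
exactly the fields that __eq__ compares keeps that property
automatically. Unequal objects may still share a hash; that is allowed,
it just costs an extra __eq__ call on a collision.

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))   # same fields as __eq__

assert Point(1, 2) == Point(1, 2)
assert hash(Point(1, 2)) == hash(Point(1, 2))
d = {Point(1, 2): "found"}
print(d[Point(1, 2)])                   # found via hash, confirmed via __eq__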

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Piet van Oostrum

Arnaud Delobelle said:
AB> I am writing a mapping object, and I want to ask about the details of
AB> __hash__ and __eq__. If I understand correctly, a Python dict looks up
AB> the key's hash code first in O(1), then compares all the entries with a
AB> matching hash for equality in O(n). That is, the hash code really just
AB> narrows down the search. Am I correct?
AD> As all the keys could be in the same bucket, O(n) seems correct to me
AD> (with n the total number of keys).

For worst case only (mainly with a badly designed hash function).
 

Arnaud Delobelle

Piet van Oostrum said:
For worst case only (mainly with a badly designed hash function).

... or a cleverly designed data set!

AFAIK, 'complexity' means 'worst case complexity' unless otherwise
stated.
 

Steven D'Aprano

AFAIK, 'complexity' means 'worst case complexity' unless otherwise
stated.

No, it means "average or typical case" unless otherwise specified.
Consult almost any comp sci text book and you'll see hash tables with
chaining (like Python dicts) described as O(1) rather than O(N),
Quicksort as O(N log N) instead of O(N**2), and similar. If the default
was "worst case unless otherwise specified", then Quicksort would be
called "Slower than Bubblesort Sort".

(Both are O(N**2), but Quicksort does more work.)

Here's a quote on-line:

"You should be clear about which cases big-oh notation describes. By
default it usually refers to the average case, using random data.
However, the characteristics for best, worst, and average cases can be
very different..."

http://leepoint.net/notes-java/algorithms/big-oh/bigoh.html


It is simply misleading to describe dicts as O(N) without the
qualification "if the data is chosen maliciously, or otherwise by an
incredible fluke". And even if it isn't, Piet explicitly said he was
talking about the average behaviour, not worst.
 

Arnaud Delobelle

Steven D'Aprano said:
No, it means "average or typical case" unless otherwise specified.
Consult almost any comp sci text book and you'll see hash tables with
chaining (like Python dicts) described as O(1) rather than O(N),
Quicksort as O(N log N) instead of O(N**2), and similar. If the default
was "worst case unless otherwise specified", then Quicksort would be
called "Slower than Bubblesort Sort".

(Both are O(N**2), but Quicksort does more work.)

OK. I remember from my student years, when studying complexity theory,
that 'complexity' implicitly referred to worst-case complexity. I
assumed this was a widely used convention; I don't have any evidence
that it is, so you may very well be correct.

Anyway, it's good to know that quicksort is O(n^2) in the worst case -
and that this worst case can crop up very easily in some situations,
especially if not too much care has been taken implementing it.

[...]
It is simply misleading to describe dicts as O(N) without the
qualification "if the data is chosen maliciously, or otherwise by an
incredible fluke". And even if it isn't, Piet explicitly said he was
talking about the average behaviour, not worst.

But the OP only mentioned 'complexity', not 'average complexity', and I
imagine that he knows the average case complexity of accessing a key in
a hashtable.
 

Steven D'Aprano

Anyway, it's good to know that quicksort is O(n^2) in the worst case -
and that this worst case can crop up very easily in some situations,
especially if not too much care has been taken implementing it.

The naive quicksort algorithm, where you recursively split the list into
two halves, performs badly if the list is already sorted (or nearly so).
It's easy to fix that: randomise the list before you sort! You can do
that with a single pass of the list. That has the advantage of ensuring
that no specific input can cause degraded performance, so an attacker
can't DOS your application by passing sorted lists to be sorted.

Another strategy is to randomly exchange the pivot element when sorting.
This avoids having to do a separate pass over the list, while still
ensuring that no particular input can cause degraded performance.

A third strategy is to use a hybrid insertion/quicksort function. This is
probably a good idea regardless, because for small lists, the overhead of
quicksort generally makes insertion sort faster anyway. (CPython's sort
used to be similar: it was a hybrid insertion/samplesort until it was
replaced by Timsort in version 2.3, if memory serves.)
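
A sketch of the random-pivot strategy (my illustration; real
implementations partition in place, this version just shows the pivot
choice): picking the pivot at random means no fixed input, sorted or
otherwise, can reliably trigger the O(n**2) worst case.

import random

def quicksort(items):
    if len(items) <= 1:
        return items
    pivot = random.choice(items)            # random pivot, not items[0]
    smaller = [x for x in items if x < pivot]
    equal   = [x for x in items if x == pivot]
    larger  = [x for x in items if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))   # [1, 1, 2, 3, 4, 5, 6, 9]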
 

Terry Reedy

Steven said:
The naive quicksort algorithm, where you recursively split the list into
two halves, performs badly if the list is already sorted (or nearly so).
It's easy to fix that: randomise the list before you sort! You can do
that with a single pass of the list. That has the advantage of ensuring
that no specific input can cause degraded performance, so an attacker
can't DOS your application by passing sorted lists to be sorted.

Another strategy is to randomly exchange the pivot element when sorting.
This avoids having to do a separate pass over the list, while still
ensuring that no particular input can cause degraded performance.

A third strategy is to use a hybrid insertion/quicksort function. This is
probably a good idea regardless, because for small lists, the overhead of
quicksort generally makes insertion sort faster anyway. (CPython's sort
used to be similar: it was a hybrid insertion/samplesort until it was
replaced by Timsort in version 2.3, if memory serves.)

Which is a hybrid of binary insertion sort and mergesort, with binary
insertion sort used for sublists of fewer than 64 items.
 

Aahz

No, it means "average or typical case" unless otherwise specified.
Consult almost any comp sci text book and you'll see hash tables with
chaining (like Python dicts) described as O(1) rather than O(N),
Quicksort as O(N log N) instead of O(N**2), and similar. If the default
was "worst case unless otherwise specified", then Quicksort would be
called "Slower than Bubblesort Sort".

(Both are O(N**2), but Quicksort does more work.)

Here's a quote on-line:

"You should be clear about which cases big-oh notation describes. By
default it usually refers to the average case, using random data.
However, the characteristics for best, worst, and average cases can be
very different..."

http://leepoint.net/notes-java/algorithms/big-oh/bigoh.html

When talking about big-oh, I prefer to define "worst-case" as "real world
likely worst-case" -- for example, feeding an already-sorted list to
Quicksort. However, the kind of worst-case that causes a dict to have
O(N) behavior I would call "pathological case". I generally try to
define big-oh in terms of what I call worst-case, because I think it's
important to keep track of where your algorithm is likely to run into
problems, but I don't think it's worth the effort in most cases to worry
about pathological cases.

In the case of dicts, it's important to remember that pathological cases
are far more likely to arise from poorly designed hashable user classes
than from data problems per se; Python's hashing of built-in types has
almost twenty years of tuning.
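
An example of the kind of poorly designed hashable class Aahz means
(mine, for illustration): a constant __hash__ is legal, since equal
objects still hash equal, but it forces every key into the same slot
and turns each dict lookup into a linear scan of __eq__ calls.

import time

class BadKey:
    def __init__(self, n):
        self.n = n

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.n == other.n

    def __hash__(self):
        return 42            # every key collides

def build(key_type, count):
    start = time.time()
    dict((key_type(i), i) for i in range(count))
    return time.time() - start

print("BadKey: %.3fs" % build(BadKey, 2000))   # roughly quadratic
print("int:    %.3fs" % build(int, 2000))      # effectively instant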
 

Albert van der Horst

The naive quicksort algorithm, where you recursively split the list into
two halves, performs badly if the list is already sorted (or nearly so).

No it doesn't (in general).
It only does so if you use the first element as the pivot,
resulting in a set with one element and a set with n-1 elements.
It is an incredibly long time since anyone implemented qsort
in this very naive way, especially since Knuth warned about it
back in the 80's.

A current naive way is to use element (m+n)/2 when sorting the range m..n.
This is nearly optimal for a sorted set. For a random set the chance
of a consistently bad choice is astronomically low. I have a QSORT
implemented in the library for my lina Forth and didn't bother to do
better than this. (Used in some of the demanding problems on
projecteuler.net.)

More sophisticated qsorts select three elements at random and use the
best of the three. The worst behaviour is still O(n^2), but the
multiplication factor is lower and the chance is astronomically low to
the third power.

        It's easy to fix that: randomise the list before you sort! You can do
        that with a single pass of the list. That has the advantage of ensuring
        that no specific input can cause degraded performance, so an attacker
        can't DOS your application by passing sorted lists to be sorted.
        Another strategy is to randomly exchange the pivot element when sorting.

Randomly *choose*, you mean?

A good sort utility with facilities for records with fields
(using a modified heap sort, and a disk-based merge sort for very
large sets) can be found on my site below (called ssort).
Because of the heap sort, even the worst case is O(N log N), if
the O(N^2) behaviour of qsort is your worry.

Groetjes Albert
 
