Hash stability


Steven D'Aprano

On the Python Dev mailing list, there is a discussion going on about the
stability of the hash function for strings.

How many people rely on hash(some_string) being stable across Python
versions? Does anyone have code that will be broken if the string hashing
algorithm changes?
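
For concreteness: with hash randomization enabled (the behaviour under discussion, and the default from Python 3.3 on), the same string hashes differently on every interpreter run, so any hash value that leaks into a file or protocol breaks:

    $ python3 -c 'print(hash("spam"))'   # one value
    $ python3 -c 'print(hash("spam"))'   # a different value on the next run
    $ PYTHONHASHSEED=0 python3 -c 'print(hash("spam"))'   # repeatable again

Even with a fixed seed, the result still differs between 32-bit and 64-bit builds.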
 

Peter Otten

Steven said:
On the Python Dev mailing list, there is a discussion going on about the
stability of the hash function for strings.

How many people rely on hash(some_string) being stable across Python
versions? Does anyone have code that will be broken if the string hashing
algorithm changes?

Nobody who understands the question ;)
 

Heiko Wundram

On 14.01.2012 10:46, Peter Otten wrote:
Nobody who understands the question ;)

Erm, not exactly true. There are actually some packages out there (take
suds [https://fedorahosted.org/suds/], for example) that rely on the
hashing algorithm being stable in order to function "properly" (suds
uses hash() of strings to create caches of objects/XML Schemas on the
filesystem). This, in a different context, bit me at the end of last
week, when I was required to use suds to access EWS.

I'd personally question the wisdom of this decision on the part of the
suds developers, but... that's not the question. ;-)
 

Chris Angelico

Steven D'Aprano said:
On the Python Dev mailing list, there is a discussion going on about the
stability of the hash function for strings.

How many people rely on hash(some_string) being stable across Python
versions? Does anyone have code that will be broken if the string hashing
algorithm changes?

On reading your post I immediately thought that you could, if changing
the algorithm, simultaneously fix the issue of malicious collisions, but
that appears to be what you're doing it for primarily :)

Suggestion: Create a subclass of dict, the SecureDict or something,
which could either perturb the hashes or even use a proper
cryptographic hash function; normal dictionaries can continue to use
the current algorithm. The description in Objects/dictnotes.txt
suggests that it's still well worth keeping the current system for
programmer-controlled dictionaries, and only change user-controlled
ones (such as POST data etc).
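
A rough sketch of what that subclass could look like (all names here are
invented, and a real version would need the remaining mapping methods
too): wrap each string key in an object whose __hash__ mixes in a
per-process random salt, so an attacker cannot precompute collisions:

    import os
    from hashlib import sha256

    _SALT = os.urandom(16)   # fresh random salt for every process

    class _SaltedStr(object):
        """Wrapper that gives a string a salted, per-process hash."""
        __slots__ = ('value',)

        def __init__(self, value):
            self.value = value

        def __eq__(self, other):
            return isinstance(other, _SaltedStr) and self.value == other.value

        def __hash__(self):
            digest = sha256(_SALT + self.value.encode('utf-8')).digest()
            return int.from_bytes(digest[:8], 'big', signed=True)

    class SecureDict(dict):
        """dict whose string keys get attacker-unpredictable hashes."""
        def __setitem__(self, key, value):
            dict.__setitem__(self, _SaltedStr(key), value)

        def __getitem__(self, key):
            return dict.__getitem__(self, _SaltedStr(key))

        def __contains__(self, key):
            return dict.__contains__(self, _SaltedStr(key))

Iteration would hand back the wrapper objects, which is one of several
details glossed over here.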

It would then be up to the individual framework and module authors to
make use of this, but it would not impose any cost on the myriad other
uses of dictionaries - there's no point adding extra load to every
name lookup just because of a security issue in an extremely narrow
situation. It would also mean that code relying on hash(str) stability
wouldn't be broken.

ChrisA
 

Roy Smith

Steven D'Aprano said:
On the Python Dev mailing list, there is a discussion going on about the
stability of the hash function for strings.

How many people rely on hash(some_string) being stable across Python
versions? Does anyone have code that will be broken if the string hashing
algorithm changes?

I would never rely on something like that unless the docs unambiguously
stated it was so. Which they don't. All I can find about hash() is:

"Return the hash value of the object (if it has one). Hash values are
integers. They are used to quickly compare dictionary keys during a
dictionary lookup. Numeric values that compare equal have the same hash
value (even if they are of different types, as is the case for 1 and
1.0)."
 

Terry Reedy

Roy Smith said:
I would never rely on something like that unless the docs unambiguously
stated it was so. Which they don't. All I can find about hash() is:

"Return the hash value of the object (if it has one).

Based on the pydev discussion since, it appears that enough people have
inferred stability, either from that wording or from its empirical
stability, that it will not be broken, by default, in pre-3.3 releases.
Whatever option is chosen to guard against attacks will probably be the
default in 3.3.
 

Stefan Behnel

Heiko Wundram, 14.01.2012 23:45:
On 14.01.2012 10:46, Peter Otten wrote:
Nobody who understands the question ;)

Erm, not exactly true. There are actually some packages out there (take
suds [https://fedorahosted.org/suds/], for example) that rely on the
hashing algorithm being stable in order to function "properly" (suds uses
hash() of strings to create caches of objects/XML Schemas on the
filesystem).

That's a stupid design. Using a hash function that the application does
not control to index into persistent storage is just asking for the code
to break at some point.

Stefan
 

Heiko Wundram

On 15.01.2012 11:13, Stefan Behnel wrote:
That's a stupid design. Using a hash function that the application does not
control to index into persistent storage just screams for getting the code
broken at some point.

I agree completely with that (I hit the corresponding problem with suds
while transitioning from 32-bit Python to 64-bit Python, where hashes
aren't stable either), but as stated in my mail: that wasn't the
original question. ;-)
 

Bryan

Chris said:
Suggestion: Create a subclass of dict, the SecureDict or something,
which could either perturb the hashes or even use a proper
cryptographic hash function; normal dictionaries can continue to use
the current algorithm. The description in Objects/dictnotes.txt
suggests that it's still well worth keeping the current system for
programmer-controlled dictionaries, and only change user-controlled
ones (such as POST data etc).

I have to disagree; that's not how the world works, at least not
anymore. Competent, skilled, dedicated programmers have over and over
again failed to appreciate the importance and the difficulty of
maintaining proper function in an adversarial environment. The tactic
of ignoring security issues unless and until they are proven
problematic stands utterly discredited.

Chris also said:
It would then be up to the individual framework and module authors to
make use of this, but it would not impose any cost on the myriad other
uses of dictionaries - there's no point adding extra load to every
name lookup just because of a security issue in an extremely narrow
situation. It would also mean that code relying on hash(str) stability
wouldn't be broken.

That seemingly "extremely narrow situation" turns out to be wide as
Montana. Maybe Siberia. Does your program take input? Does it accept a
format that could possibly be downloaded from a malicious site on the
Internet? Does your market include users who occasionally make
mistakes? If not, enjoy your utter irrelevance. If so,
congratulations: you write Internet software.

Varying the hash function is just the first step. Plausible attacks
dynamically infer how to induce degenerate behavior. Replacing the
dictionary hash function with a "proper cryptographic hash function"
is a naive non-solution; all things considered it's somewhat worse
than useless. An old and interesting and relevant exercise is to
implement a dictionary with O(1) insert, look-up, and delete in the
average non-adversarial case; and O(lg n) insert, look-up, and delete
in the worst case.
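
For the curious, a toy sketch of that exercise (illustrative only: keys
must be mutually orderable, and a real implementation would put a
balanced tree in each bucket, since list insertion is itself O(n)):

    import bisect

    class TwoTierDict(object):
        """Hash into buckets as usual, but keep each bucket sorted so a
        flood of deliberately colliding keys only degrades lookups to
        binary search."""

        def __init__(self, nbuckets=1024):
            self._buckets = [[] for _ in range(nbuckets)]
            self._n = nbuckets

        def __setitem__(self, key, value):
            bucket = self._buckets[hash(key) % self._n]
            i = bisect.bisect_left(bucket, (key,))
            if i < len(bucket) and bucket[i][0] == key:
                bucket[i] = (key, value)        # overwrite existing key
            else:
                bucket.insert(i, (key, value))  # O(n) memmove; a tree
                                                # would make this O(lg n)

        def __getitem__(self, key):
            bucket = self._buckets[hash(key) % self._n]
            i = bisect.bisect_left(bucket, (key,))
            if i < len(bucket) and bucket[i][0] == key:
                return bucket[i][1]
            raise KeyError(key)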
 

Chris Angelico

Bryan said:
That seemingly "extremely narrow situation" turns out to be wide as
Montana. Maybe Siberia. Does your program take input? Does it accept a
format that could possibly be downloaded from a malicious site on the
Internet? Does your market include users who occasionally make
mistakes? If not, enjoy your utter irrelevance. If so,
congratulations: you write Internet software.

Yes, but in that "Internet software", there will only be a small
number of dictionaries that an attacker can stuff with keys (GET/POST
data, headers, cookies, etc, and anything derived therefrom); compare
the huge number of dictionaries that exist elsewhere in your Python
program. Adding load to dictionaries will add load to a huge number of
lookups that can never come under attack.

However, since posting that I've read the entire thread on the
python-dev archive. (It is, I might mention, a LOT of text.) A number
of suggestions and arguments are put forth, including a subclassing
notion similar to mine, and the same point is raised: that
app/framework developers won't secure their apps. Other options are
also offered (personally, I like the one where an exception is
raised if something collides with too many keys - the current suggestion
is 1000, although it could possibly work well with something that scales
with the dictionary size), and I'm sure that something will be done
that's a lot smarter than one quick idea spun off in response to a
separate query. So, I retract this idea :)

ChrisA
 

Peter Otten

Heiko said:
On 15.01.2012 11:13, Stefan Behnel wrote:

I agree completely with that (I hit the corresponding problem with suds
while transitioning from 32-bit Python to 64-bit Python, where hashes
aren't stable either), but as stated in my mail: that wasn't the
original question. ;-)

I'm curious: did you actually get false cache hits or just slower responses?
 

Heiko Wundram

On 15.01.2012 13:22, Peter Otten wrote:
I'm curious: did you actually get false cache hits or just slower responses?

It broke the application using suds, not due to false cache hits, but
due to not getting a cache hit anymore at all.

Long story: to interpret WSDL files, suds has to fetch all related DTDs
for the WSDL file, and Microsoft (as I wrote, I was querying Exchange Web
Services) insists on using http://www.w3.org/2001/xml.dtd for the XML
spec path. This path is sometimes functional as a GET URL, but mostly
not (due to overload of the W3C servers), so basically I worked around
the problem by creating a cache entry with the appropriate name based on
hash(), using a local copy of xml.dtd I had around. This took place on a
development machine (32-bit), and when migrating the application to a
production machine (64-bit), the cache file wasn't used anymore (due to
the hash not being stable).

It's not that this came as a surprise (I quickly knew the "workaround"
of simply rehashing on the target machine and moving the cache file
accordingly), and I already said that this is mostly just a plain bad
design decision on the part of the suds developers, but it's one of
those cases where a non-stable hash() can break applications, and unless
you know the internal workings of suds, this will seriously bite the
developer.

I don't know the prevalence of suds, but I guess there are more people
than me using it to query SOAP services - all of those will be affected
if the hash() output is changed. Additionally, if hash() isn't stable
between runs (the randomized hash() solution which is preferred, and
would also be my preference), suds caching becomes completely useless.
And for the results, see above.
 

Chris Angelico

Heiko said:
I don't know the prevalence of suds, but I guess there are more people
than me using it to query SOAP services - all of those will be affected
if the hash() output is changed. Additionally, if hash() isn't stable
between runs (the randomized hash() solution which is preferred, and
would also be my preference), suds caching becomes completely useless.
And for the results, see above.

Or you could just monkey-patch it so that 'hash' points to an old
hashing function. If the current hash() is kept in builtins as (say)
hash_320() or hash_272() or something, then anyone who wants the old
version of the hash can still get it.
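
To illustrate the monkey-patch (the module path is my guess; the point
is that suds calls hash() by name, so a module-level global shadows the
builtin without touching dict's own hashing):

    import hashlib
    import suds.reader  # assumed location of the hash()-using code

    def _stable_hash(name):
        # a version- and platform-independent stand-in for hash()
        return int(hashlib.md5(name.encode('utf-8')).hexdigest()[:16], 16)

    suds.reader.hash = _stable_hash  # shadows builtins.hash in that module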

Of course, it's still dodgy to depend on the stability of something
that isn't proclaimed stable, and it would be far better to use some
other hashing algorithm (MD5 or SHA for uber-reliability).

ChrisA
 

Heiko Wundram

On 15.01.2012 17:13, Chris Angelico wrote:
Or you could just monkey-patch it so that 'hash' points to an old
hashing function. If the current hash() is kept in builtins as (say)
hash_320() or hash_272() or something, then anyone who wants the old
version of the hash can still get it.

Or even easier: overwrite the default caching module (called FileCache)
with something that implements "sensible" caching, for example by using
the complete URL (with special characters replaced) of the DTD as a
cache index, instead of hash()ing it. ;-)
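
The "special characters replaced" part can be as small as this (the
function name is mine):

    import re

    def url_to_cache_name(url):
        # keep the URL recognizable but safe as a file name
        return re.sub(r'[^A-Za-z0-9._-]', '_', url)

so http://www.w3.org/2001/xml.dtd becomes http___www.w3.org_2001_xml.dtd,
stable across builds, versions and runs (though two URLs differing only
in replaced characters would still collide).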

There's "workarounds", I know - and I may be implementing one of them if
the time comes. Again, my mail was only to point at the fact that there
are (serious) projects out there relying on the "stableness" of hash(),
and that these will get bitten when hash() is replaced. Which is not a
bad thing if you ask me. ;-)
 

Stefan Behnel

Chris Angelico, 15.01.2012 17:13:
Of course, it's still dodgy to depend on the stability of something
that isn't proclaimed stable, and it would be far better to use some
other hashing algorithm (MD5 or SHA for uber-reliability).

I've seen things like MD5 or SHA* being used quite commonly for file caches
(or file storage in general, e.g. for related files referenced in a text
document). Given that these algorithms are right there in the stdlib, I
find them a rather obvious choice.

However, note that they may also be subject to complexity attacks at
some point, although likely requiring substantially more input data. In
the specific case of a cache, an attacker may only need an arbitrary set
of colliding hashes, and those can be calculated in advance for a given
hash function. For example, Wikipedia currently lists MD5 with a
collision complexity of ~2^20, which sounds a bit weak. Something like
SHA256 should be substantially more robust.

https://en.wikipedia.org/wiki/Cryptographic_hash_function#Cryptographic_hash_algorithms
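
For a file cache, that boils down to something like this, using only the
stdlib:

    import hashlib

    def cache_key(url):
        # stable across runs, versions and architectures; finding two
        # URLs with the same key is computationally infeasible
        return hashlib.sha256(url.encode('utf-8')).hexdigest()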

Stefan
 

Peter Otten

Heiko said:
It broke the application using suds, not due to false cache hits, but
due to not getting a cache hit anymore at all. [...] so basically I
worked around the problem by creating a cache entry with the appropriate
name based on hash(), using a local copy of xml.dtd I had around. This
took place on a development machine (32-bit), and when migrating the
application to a production machine (64-bit), the cache file wasn't used
anymore (due to the hash not being stable).

Thanks for the explanation.

Heiko also said:
[...] all of those will be affected if the hash() output is changed.
Additionally, if hash() isn't stable between runs (the randomized hash()
solution which is preferred, and would also be my preference), suds
caching becomes completely useless. And for the results, see above.

I've taken a quick look into the suds source; the good news is that you have
to change a single method, reader.Reader.mangle(), to fix the problem with
hash stability.
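
For reference, a stable replacement along those lines; the method
signature and body here are my guess, only the name reader.Reader.mangle()
comes from the post above:

    import hashlib

    def mangle(self, name, x):
        # derive the cache file name from a stable digest of the name
        # instead of the interpreter's hash()
        h = hashlib.md5(name.encode('utf-8')).hexdigest()
        return '%s-%s' % (h, x)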

However, I didn't see any code to deal with hash collisions at all.
 

Heiko Wundram

On 16.01.2012 09:44, Christian Heimes wrote:
On 16.01.2012 09:18, Peter Otten wrote:

It smells like suds is vulnerable to cache poisoning.

That it is, yes, at least partially. Generally, this is only relevant
if you are actually caching DTDs (which is the default), if you are
querying untrusted SOAP servers (in which case you most likely should
not be using caching anyway), and if the attacker has control over the
URL namespace of a DTD-serving host (because the host part of the DTD
URL is used in the cache filename unhashed; only the actual path is
hashed to form the cache index).

The easier way to poison the cache is most probably through actual
traffic modification, as most DTD URLs are served through plain http and
thus are susceptible to MitM modifications anyway.
 
