unicode and hashlib


Jeff H

hashlib.md5 does not appear to like unicode:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
position 1650: ordinal not in range(128)

After googling, I've found BDFL and others on Py3K talking about the
problems of hashing non-bytes (i.e. buffers)
http://www.mail-archive.com/[email protected]/msg09824.html

So what is the canonical way to hash unicode?
* convert unicode to the locale's encoding
* hash in the current locale
???
but what if the locale has ordinals outside of 128?

Is this just a problem for md5 hashes that I would not encounter using
a different method? i.e. Should I just use the built-in hash function?
 

MRAB

Jeff said:
> hashlib.md5 does not appear to like unicode:
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> position 1650: ordinal not in range(128)
>
> After googling, I've found BDFL and others on Py3K talking about the
> problems of hashing non-bytes (i.e. buffers)
> http://www.mail-archive.com/[email protected]/msg09824.html
>
> So what is the canonical way to hash unicode?
> * convert unicode to the locale's encoding
> * hash in the current locale
> ???
> but what if the locale has ordinals outside of 128?
>
> Is this just a problem for md5 hashes that I would not encounter using
> a different method? i.e. Should I just use the built-in hash function?
It can handle bytestrings, but if you give it unicode it performs a
default encoding to ASCII, which fails if there's a codepoint >=
U+0080. Personally, I'd recommend encoding the unicode to UTF-8.
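
[A minimal sketch of that recommendation (Python 2, matching the
thread; the sample string is invented for illustration):

    import hashlib

    text = u'caf\xa6'  # a unicode object with a codepoint >= U+0080
    # Encode to a concrete byte representation first, then hash.
    print hashlib.md5(text.encode('utf-8')).hexdigest()
]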
 

Terry Reedy

Jeff said:
> hashlib.md5 does not appear to like unicode:
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> position 1650: ordinal not in range(128)

It is the (default) ascii encoder that does not like non-ascii chars.
I suspect that if you encode to bytes first with an encoder that does
work (latin-???), md5 will be happy.

Reports like this should include Python version.
 

Paul Boddie

> It is the (default) ascii encoder that does not like non-ascii chars.
> I suspect that if you encode to bytes first with an encoder that does
> work (latin-???), md5 will be happy.

I know that the "Python roadmap" answer to such questions might refer
to Python 3.0 and its "strings are Unicode" features, and having seen
this mentioned a lot recently, I'm surprised that no one has done so
at the time of writing. But I do wonder whether good old Python 2.x
wouldn't benefit from a more explicit error message in these
situations.

Since the introduction of Unicode in Python 1.6/2.0, I've always tried
to make the distinction between what I call "plain strings" or "byte
strings" and "Unicode objects" or "character strings", and perhaps the
UnicodeEncodeError message should be enhanced to say what is actually
going on: that an attempt is being made to convert characters into
byte values and that the chosen way of doing so (which often involves
the default, ASCII encoding) cannot manage the job.

Paul
 

Jeff H

> Unicode is characters, not a character encoding.
> You could hash on a utf-8 encoding of the Unicode.
>
> There is no _the_ way to hash Unicode, any more than
> there is _the_ way to hash vectors. You need to
> convert the abstract entity to something concrete with
> a well-defined representation in bytes, and hash that.
>
> No, it is a definitional problem. Perhaps you could explain how you
> want to use the hash. If the internal hash is acceptable (e.g. for
> grouping in dictionaries within a single run), use that. If you intend
> to store and compare on the same system, say that. If you want cross-
> platform execution of your code to produce the same hashes, say that.
> A hash is a means to an end, and it is hard to give advice without
> knowing the goal.
I am checking for changes to large text objects stored in a database
against outside sources. So the hash needs to be reproducible/stable.
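
[Since the goal is a reproducible/stable digest, one way to get there
is to pin down both the Unicode normalization form and the byte
encoding before hashing. A sketch under those assumptions, in the
thread's Python 2 (the helper name stable_digest is invented):

    import hashlib
    import unicodedata

    def stable_digest(text):
        # Normalize so canonically-equivalent strings hash identically,
        # then fix the byte representation to UTF-8.
        normalized = unicodedata.normalize('NFC', text)
        return hashlib.md5(normalized.encode('utf-8')).hexdigest()

    # Compare stable_digest(current_text) against the stored digest
    # to detect changes.
]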
 

Jeff H

> It is the (default) ascii encoder that does not like non-ascii chars.
> I suspect that if you encode to bytes first with an encoder that does
> work (latin-???), md5 will be happy.
>
> Reports like this should include Python version.

Python 2.5.2 -- however, this is not really a bug report because your
analysis is correct. I am converting cp1252 strings to unicode before
I persist them in a database. I am looking for advice/direction/
wisdom on how to sling these strings<g>

-Jeff
 

Jeff H

> Python 2.5.2 -- however, this is not really a bug report because your
> analysis is correct. I am converting cp1252 strings to unicode before
> I persist them in a database. I am looking for advice/direction/
> wisdom on how to sling these strings<g>
>
> -Jeff

Actually, what I am surprised by is the fact that hashlib cares at
all about the encoding. An md5 hash can be produced for an .iso file,
which means it can handle bytes, so why does it care what it is being
fed, as long as there are bytes? I would have assumed that it would
take whatever was fed to it, view it as a byte array, and then hash
it. You can read a binary file and hash it:

    print md5.new(file('foo.iso').read()).hexdigest()

What do I need to do to tell hashlib not to try and decode, just treat
the data as binary?
 

Marc 'BlackJack' Rintsch

> Actually, what I am surprised by is the fact that hashlib cares at all
> about the encoding. An md5 hash can be produced for an .iso file, which
> means it can handle bytes, so why does it care what it is being fed, as
> long as there are bytes?

But you don't have bytes, you have a `unicode` object. The internal byte
representation is implementation specific and not your business.
> I would have assumed that it would take whatever was fed to it,
> view it as a byte array, and then hash it.

How? There is no (sane) way to get at the internal byte representation.
And that byte representation might contain things like pointers to memory
locations that are different for two `unicode` objects which compare
equal, so you would get different hash values for objects that otherwise
look the same from the Python level. Not very useful.
> You can read a binary file and hash it:
>
>     print md5.new(file('foo.iso').read()).hexdigest()
>
> What do I need to do to tell hashlib not to try and decode, just treat
> the data as binary?

It's not about *de*coding, it is about *en*coding your `unicode` object
so you get bytes to feed to the MD5 algorithm.

Ciao,
Marc 'BlackJack' Rintsch
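
[To make the encode-vs-decode point concrete, a small sketch (Python 2;
the sample string is invented):

    import hashlib

    text = u'\xa6 broken bar'

    # Feeding the unicode object directly makes Python encode it with
    # the default ASCII codec, which fails on u'\xa6'.
    try:
        hashlib.md5(text)
    except UnicodeEncodeError, e:
        print 'implicit ascii encode failed:', e

    # Encoding explicitly yields bytes, which md5 accepts.
    print hashlib.md5(text.encode('utf-8')).hexdigest()
]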
 

Jeff H

Scott David Daniels wrote:

> [...]
>
> Of course my dyslexia sticks out here as I get encode and decode exactly
> backwards -- Marc 'BlackJack' Rintsch has it right.
>
> Characters (a concept) are "encoded" to a byte format (representation).
> Bytes (a precise representation) are "decoded" to characters (a format
> with semantics).
>
> --Scott David Daniels

Ok, so the fog lifts, thanks to Scott and Marc, and I begin to realize
that hashlib was trying to encode (not decode) my unicode object as
'ascii' (my default encoding), and since the text contained characters
above 127 -- shhh'boom. So once I have character strings transformed
internally to unicode objects, I should encode them in 'utf-8' before
attempting to do things that guess at the proper way to encode them
for further processing (i.e. hashlib).

Scott then points out that utf-8 is probably superior (for use within
the code I control) to utf-16 and utf-32, which both have two variants
(big- and little-endian), and which variant gets used can depend on the
installed software and/or processor. utf-8, unlike -16/-32, stays
reliable and reproducible irrespective of software or hardware.

decode vs encode
You decode from a character set to a unicode object
You encode from a unicode object to a specified character set
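
[A round-trip sketch of that rule (Python 2; the sample bytes are
invented):

    raw = 'caf\xe9'                # byte string as stored in cp1252
    text = raw.decode('cp1252')    # decode: byte string -> unicode object
    data = text.encode('utf-8')    # encode: unicode object -> utf-8 bytes
    assert data.decode('utf-8') == text
]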

Please correct me if you see something wrong and thank you for your
advice and direction.

u'unicordial-ly yours. ;)'
Jeff
 

Bryan Olson

Jeff said:
> [...] So once I have character strings transformed
> internally to unicode objects, I should encode them in 'utf-8' before
> attempting to do things that guess at the proper way to encode them
> for further processing (i.e. hashlib).

It looks like hashlib in Python 3 will not even attempt to digest a
unicode object. Trying to hash 'abcdefg' in Python 3.0rc3 I get:

TypeError: object supporting the buffer API required

I think that's good behavior, except that the error message is likely to
send beginners to look up the obscure buffer interface before they find
they just need mystring.decode('utf8') or bytes(mystring, 'utf8').

Incidentally, MD5 has fallen and SHA-1 is falling. Python's hashlib also
includes the stronger SHA-2 family.
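
[For instance (a sketch; the sample string is invented), the stronger
digests share the same interface:

    import hashlib

    text = u'large text object from the database'
    print hashlib.sha256(text.encode('utf-8')).hexdigest()
]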
 

Bryan Olson

> Oops, careful here (I made this mistake once in this thread as well).
> You _encode_ from unicode to bytes. The code you quoted doesn't run.

Doh! I even tested it with .encode(), then wrote it wrong.

Just in case anyone Googles the error message and lands here: If you are
working with a Python str (string) object and get,

TypeError: object supporting the buffer API required

Then you probably want to encode the string to a bytes object, and
UTF-8 is likely the encoding of choice, as in:

mystring.encode('utf8')

or

bytes(mystring, 'utf8')


Thanks for the correction.
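
[Putting the fix together in Python 3 (a sketch; the sample string is
invented):

    import hashlib

    mystring = 'caf\xe9'  # a Python 3 str; '\xe9' is e-acute
    # hashlib.md5(mystring) raises the TypeError quoted above;
    # encode to bytes first.
    print(hashlib.md5(mystring.encode('utf8')).hexdigest())
]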
 
