md5.hexdigest() converting unicode string to ascii

uebertester

I'm trying to get an MD5 hash of a registry key value on Windows 2000
Server. The registry value is a string, which is stored as a Unicode
string. I can get the registry key value and pass it to md5 via
update(). However, hexdigest() returns the hash of the ASCII
equivalent of the Unicode string, not the hash of the Unicode string
itself. I've verified this by creating one text file containing an
ASCII form of the string and another containing a Unicode form of the
string. The hash returned by md5.hexdigest() matches the one I get when
I run md5sum on the ASCII file.

Here is the code I'm using:

import _winreg
from md5 import md5

x = _winreg.ConnectRegistry(None, _winreg.HKEY_LOCAL_MACHINE)
y = _winreg.OpenKey(x,
    r"SOFTWARE\Microsoft\Windows\CurrentVersion\URL\DefaultPrefix")

sValue = _winreg.QueryValueEx(y, "")
print sValue

m = md5()
m.update(unicode(sValue[0]))
MD5 = m.hexdigest()

print "%s \n%d" % (sValue[0], sValue[1])
print MD5

_winreg.CloseKey(y)
_winreg.CloseKey(x)

Any help would be appreciated.
 
Krzysztof Stachlewski

uebertester said:
I'm trying to get an MD5 hash of a registry key value on Windows 2000
Server. The registry value is a string, which is stored as a Unicode
string. I can get the registry key value and pass it to md5 via
update(). However, hexdigest() returns the hash of the ASCII
equivalent of the Unicode string, not the hash of the Unicode string
itself. I've verified this by creating one text file containing an
ASCII form of the string and another containing a Unicode form of the
string. The hash returned by md5.hexdigest() matches the one I get when
I run md5sum on the ASCII file.

The md5() function is defined on character strings,
not on Unicode strings. A Unicode string is a sequence
of integers. Such a sequence may be converted to a character
string, but there are many different ways of doing that.
In Python you convert a Unicode string to a character
string by encoding it with an encoding of your choice.
For instance:
u"abcd".encode("utf-16")
utf-16 is just an example; you have to decide which
encoding to use.

Stach
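
To illustrate the point about choosing an encoding, here is a minimal sketch. It assumes Python 3 and the hashlib module rather than the Python 2 md5 module used in the thread, and the helper name md5_hex is made up for this example. It hashes the same text under several encodings to show that the digest depends on the bytes produced, not on the characters.

```python
import hashlib

text = "abcd"  # Python 3 str, the counterpart of u"abcd" in the thread


def md5_hex(s, encoding):
    """Encode s with the given codec, then hash the resulting bytes."""
    return hashlib.md5(s.encode(encoding)).hexdigest()


# Hypothetical selection of encodings; the right one depends on your use case.
digests = {
    enc: md5_hex(text, enc)
    for enc in ("ascii", "utf-8", "utf-16-le", "utf-16-be")
}

for enc, digest in digests.items():
    print(enc, digest)
```

For pure ASCII text, the ascii and utf-8 digests agree because both codecs produce the same bytes; the two utf-16 variants differ from those and from each other.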
 
Fredrik Lundh

Krzysztof said:
The md5() function is defined on character strings,
not on Unicode strings. A Unicode string is a sequence
of integers. Such a sequence may be converted to a character
string

message.replace("character", "byte")

(unicode characters are characters too, you know...)

</F>
 
uebertester

Krzysztof Stachlewski said:
You're right. :)

Stach

None of the suggestions seem to address the issue. sValue =
_winreg.QueryValueEx(y,"") returns a tuple containing the following
(u'http://', 1). The string u'http://' is added to the md5 object via
the update() and then hashed via hexdigest(). How do I keep the
unicode string from being converted to ascii with the md5 functions?
Or can I?

Thanks again.
 
Peter Hansen

uebertester said:
None of the suggestions seem to address the issue. sValue =
_winreg.QueryValueEx(y,"") returns a tuple containing the following
(u'http://', 1). The string u'http://' is added to the md5 object via
the update() and then hashed via hexdigest(). How do I keep the
unicode string from being converted to ascii with the md5 functions?
Or can I?

You cannot. You missed the key fact, which is that Unicode strings
are sequences of "characters" (roughly, 16-bit values), not sequences
of bytes. MD5 is defined on byte sequences. *You* must specify the
encoding scheme you want to use, by converting the string before
passing it to the hash function.

If you are trying to match the MD5 values calculated by some other
tool, you must find out what encoding scheme that other tool was
using (maybe by trial and error, starting with utf-8 probably).
If this is just for your own purposes, simply pick a convenient
scheme and encode consistently.

md5.update(yourUnicode.encode('utf-8')) for example...

-Peter
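
The trial-and-error approach mentioned above could look roughly like this. It is a sketch only, assuming Python 3 and hashlib; the function name find_matching_encodings and the candidate list are hypothetical, and the reference digest is generated locally as a stand-in for whatever the other tool reports.

```python
import hashlib


def find_matching_encodings(text, expected_hex, candidates):
    """Return the candidate encodings whose MD5 digest equals expected_hex."""
    matches = []
    for enc in candidates:
        try:
            data = text.encode(enc)
        except UnicodeEncodeError:
            continue  # this codec cannot represent the text at all
        if hashlib.md5(data).hexdigest() == expected_hex.lower():
            matches.append(enc)
    return matches


# Stand-in for the digest reported by the other tool (an assumption for
# this example; in practice you would paste the tool's output here).
reference = hashlib.md5("http://".encode("utf-16-le")).hexdigest()

candidates = ["utf-8", "utf-16", "utf-16-le", "utf-16-be", "latin-1"]
print(find_matching_encodings("http://", reference, candidates))
```

If nothing matches, the difference may lie outside the encoding itself, for example in extra bytes that one side adds or strips before hashing.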
 
Krzysztof Stachlewski

uebertester said:
None of the suggestions seem to address the issue. sValue =
_winreg.QueryValueEx(y,"") returns a tuple containing the following
(u'http://', 1). The string u'http://' is added to the md5 object via
the update() and then hashed via hexdigest(). How do I keep the
unicode string from being converted to ascii with the md5 functions?

You *have to* convert the Unicode string to a byte string
(I'm trying not to call it a 'character string' :) ).
md5 needs bytes to work on, not Unicode characters.
As I have already said, you have some ready-to-use conversions
to choose from. utf8? Maybe utf16? I don't know which one you want.
If you don't specify the codec yourself, the ascii codec is used
by default.
Unicode characters are implemented in Python as either 2- or 4-byte integers.
You can't rely on the in-memory representation of those integers (although
you can play with the ord() function). You need a codec.
 
Fredrik Lundh

uebertester said:
None of the suggestions seem to address the issue. sValue =
_winreg.QueryValueEx(y,"") returns a tuple containing the following
(u'http://', 1). The string u'http://' is added to the md5 object via
the update() and then hashed via hexdigest(). How do I keep the
unicode string from being converted to ascii with the md5 functions?

krzysztof already explained this:

- MD5 is calculated on bytes, not characters.
- Unicode strings contain characters, not bytes.
- if you pass in a Unicode string where Python expects a byte string,
Python converts the Unicode string to an 8-bit string using the default
rules (which simply creates 8-bit bytes with the same values as the
corresponding Unicode characters, as long as the Unicode string only
contains characters for which ord(ch) < 128).
- if you're not happy with that rule, you have to convert the Unicode
string to a byte string yourself, using the "encode" method.

m.update(u.encode(encoding))

- if you don't know what encoding you're supposed to use, you have
to guess. if it doesn't matter, as long as you remember what you used,
I'd suggest "utf-8" or perhaps "utf-16-le".

uebertester said:
Or can I?

given how things work, the "how do I keep the string from being
converted" doesn't really make sense.

</F>
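
The implicit conversion described in the list above is Python 2 behaviour. As a hedged illustration, assuming Python 3 and hashlib (neither of which the thread uses), the sketch below shows that the newer API refuses to guess, and that encoding explicitly as ASCII yields the bytes the old default rule would have produced.

```python
import hashlib

m = hashlib.md5()

# Python 3 refuses to guess an encoding: passing str raises TypeError.
try:
    m.update("http://")
    rejected = False
except TypeError:
    rejected = True
print("str rejected:", rejected)

# Encoding explicitly as ASCII gives the bytes that the implicit
# Python 2 conversion described above would have produced.
m.update("http://".encode("ascii"))
ascii_digest = m.hexdigest()
print(ascii_digest)

# A character with ord(ch) >= 128 cannot pass through the ASCII codec.
try:
    "caf\u00e9".encode("ascii")
    non_ascii_ok = True
except UnicodeEncodeError:
    non_ascii_ok = False
print("non-ASCII encodable as ascii:", non_ascii_ok)
```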
 
uebertester

Fredrik Lundh said:
krzysztof already explained this:

- MD5 is calculated on bytes, not characters.
- Unicode strings contain characters, not bytes.
- if you pass in a Unicode string where Python expects a byte string,
Python converts the Unicode string to an 8-bit string using the default
rules (which simply creates 8-bit bytes with the same values as the
corresponding Unicode characters, as long as the Unicode string only
contains characters for which ord(ch) < 128).
- if you're not happy with that rule, you have to convert the Unicode
string to a byte string yourself, using the "encode" method.

m.update(u.encode(encoding))

- if you don't know what encoding you're supposed to use, you have
to guess. if it doesn't matter, as long as you remember what you used,
I'd suggest "utf-8" or perhaps "utf-16-le".


given how things work, the "how do I keep the string from being
converted" doesn't really make sense.

</F>

Thanks for the clarification. My confusion stemmed from the Python
Library Reference, which states, "Its use is quite straightforward: use
new() to create an md5 object. You can now feed this object with
arbitrary strings using the update() method, and at any point you can
ask it for the digest...".

I've tried the suggested solution with several encodings, but the
hash value returned does not match the one I expect from another
utility I'm checking against. Hash value returned by Python with
utf16 encoding: 731f46dd88cb3a67a4ee1392aa84c6f4. Hash value returned
by the other utility: 0b0ebc769e2b89cf61a10a72d5a11dda. Note: I've
tried other encodings as well. As the utility I'm verifying against is
extensively used, I'm assuming it is returning the correct value. I'd
appreciate any help in resolving this, as I'm trying to enhance an
automated test suite written in Python.

Thanks
 
Peter Hansen

uebertester said:
I've attempted the suggested solution specifying different encodings,
however, the hash value that is returned does not match what I expect
based upon another utility I'm checking against. Hash value returned
by python specifying utf16 encoding: 731f46dd88cb3a67a4ee1392aa84c6f4
. Hash value returned by other utility:
0b0ebc769e2b89cf61a10a72d5a11dda . Note: I've tried other encoding
also. As the utility I'm verifying against is extensively used, I'm
assuming it is returning the correct value.

If this utility is so extensively used, it's almost certain
that someone, somewhere, knows precisely what encoding scheme
was used for Unicode strings. Isn't there documentation on
how it calculates the hash? Source code? An expert? A
vendor?

This is not exactly something that is standardized or obvious,
so it seems very unlikely they just picked some weird scheme
and didn't note anywhere what they did.

It's not, however, a Python question at this point, so you've
probably got no choice but to search elsewhere. (What is the
utility, by the way?)

-Peter
 
Heather Coppersmith

On 20 Apr 2004 16:16:33 -0700, uebertester said:
I've attempted the suggested solution specifying different
encodings, however, the hash value that is returned does not
match what I expect based upon another utility I'm checking
against. Hash value returned by python specifying utf16
encoding: 731f46dd88cb3a67a4ee1392aa84c6f4 . Hash value
returned by other utility: 0b0ebc769e2b89cf61a10a72d5a11dda .
Note: I've tried other encoding also. As the utility I'm
verifying against is extensively used, I'm assuming it is
returning the correct value. I appreciate any help in resolving
this as I'm trying to enhance an automated test suite written in
python.

Other things that may bite or may have bitten you:

o the byte order marker or lack thereof
o different newline conventions
o trailing newlines or lack thereof
o don't forget that there are two utf16 encodings, big endian
and little endian

Adding to what Peter indicated, the source (code and/or persons)
of your extensively used utility may also contain specific test
vectors.

Regards,
Heather
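
Two of the pitfalls listed above, byte order and trailing newlines, are easy to see at the byte level. This is a small sketch assuming Python 3 and hashlib; the variable names are chosen for the example only.

```python
import hashlib

text = "http://"

# Same characters, opposite byte order within each 16-bit unit.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
print(le[:4])  # b'h\x00t\x00'
print(be[:4])  # b'\x00h\x00t'

# A trailing newline, in either convention, changes the input bytes
# and therefore the digest.
variants = {"no newline": text, "LF": text + "\n", "CRLF": text + "\r\n"}
newline_digests = {
    name: hashlib.md5(value.encode("utf-8")).hexdigest()
    for name, value in variants.items()
}
for name, digest in newline_digests.items():
    print(name, digest)
```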
 
uebertester

Heather Coppersmith said:
Other things that may bite or may have bitten you:

o the byte order marker or lack thereof
o different newline conventions
o trailing newlines or lack thereof
o don't forget that there are two utf16 encodings, big endian
and little endian

Adding to what Peter indicated, the source (code and/or persons)
of your extensively used utility may also contain specific test
vectors.

Regards,
Heather

Your first bullet is the problem. _winreg.QueryValueEx(y,"") returns
a unicode string with the BOM "fffe". The utility I'm comparing
against removes this prior to hashing the registry value. Thanks for
all the input I received from everyone.

Mark
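
The thread does not show the final code, so the following is only a sketch of one way to hash the value without a BOM, assuming Python 3 and hashlib. One plausible reading is that the "fffe" bytes were prepended by the "utf-16" codec during encoding rather than stored in the registry value itself; that is an assumption, not something the thread confirms. The helper name encode_utf16_without_bom is hypothetical.

```python
import codecs
import hashlib
import sys

text = "http://"

# The plain "utf-16" codec writes a BOM first, in the machine's native order.
with_bom = text.encode("utf-16")
print(with_bom[:2] == codecs.BOM_UTF16)


def encode_utf16_without_bom(s):
    """Encode as native-order UTF-16 and drop a leading BOM if present."""
    data = s.encode("utf-16")
    if data.startswith(codecs.BOM_UTF16):
        data = data[len(codecs.BOM_UTF16):]
    return data


stripped = encode_utf16_without_bom(text)

# Naming the byte order explicitly avoids the BOM in the first place.
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
print(stripped == text.encode(native))

print(hashlib.md5(with_bom).hexdigest())
print(hashlib.md5(stripped).hexdigest())
```

On a little-endian machine such as the Windows server in question, the BOM bytes are ff fe, which matches the "fffe" reported above, and encoding with "utf-16-le" would give the BOM-less bytes directly.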
 
