MD5 algorithm

C

Christoph Burschka

I'm trying to implement the MD5 algorithm. (Note: I'm not looking for an
existing MD5 library; this is an exercise). It compiles and returns
something that looks like a hash superficially, but that's clearly the
wrong result. Even if there were a Unicode/Ascii problem, the empty
string should return the correct result (since it is always padded to
one 1-bit and 511 0-bits), but it doesn't.

Also, several strings (say, "" and "The") return the same incorrect hash
- but by dumping the 128-bit hash in each iteration, I've found only
very little in common between the intermediate steps, only the final one
is completely identical.

This is the function, as implemented from the pseudocode on Wikipedia.
http://en.wikipedia.org/wiki/MD5#Pseudocode

--------------------------------------------------------

Constants:

int[4] s = {
0x67452301,
0xEFCDAB89,
0x98BADCFE,
0x10325476
};
int[64] shift= {
7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22,
5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20,
4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23,
6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21
};


int[64] k;

for (int i=0;i<64;i++)
{
k=(int)(Math.abs(Math.sin(i+1))*1073741824*4);
}


private static int[] processFin(int[] message)
{
int[] result=s.clone(); // initialize
// message is already padded, just split:
int blocks=message.length/16;
for (int i=0;i<blocks;i++)
{ // for each 16-int block
int[] state=result.clone(); // copy state
for (int j=0;j<64;j++)
{
int f,g,temp;
if (j<16)
{
f= state[1] & state[2] | ~state[1] & state[3];
g=j;
}
else if (j<32)
{
f=state[3] & state[1] | ~state[3] & state[2];
g=(5*j+1)%16;
}
else if (j<48)
{
f=state[1]^state[2]^state[3];
g=(3*j+5)%16;
}
else
{
f=state[2]^ (state[1] | ~state[3]);
g=(7*j)%16;
}
temp=state[3];
state[3]=state[2];
state[2]=state[1];
state[1]=(state[0]+f+k[j]+message[i*16+g]) << shift[j];
state[0]=temp;
}
for(int j=0;j<4;j++) result+=state;
}
return result;
}

---------------------------------------------
The hex string returned for both "" and "The" is
3c00a101efcdab8998badcfe10325476.

I just can't see what's wrong with the code - all parts of the
pseudocode description seem to be implemented exactly as they should
be... unless the signed ints don't wrap modulo 2^32 as unsigned ones
should. But they do for all examples I could think of, and anyway, that
shouldn't cause those hashes to be exactly the same...

--cb
 
C

Christoph Burschka

sebbaer said:
Her i have found some intressting webpages about it. Maybe it helps.

http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html

If I had found something that helped me in the first 10 Google results,
I wouldn't have had to ask here, now would I? :)

The only thing I could use would be some kind of step-by-step
walkthrough for the MD5 calculation of a short string, so I could check
it against the intermediate steps in my own calculation

--cb
 
R

Roedy Green

The only thing I could use would be some kind of step-by-step
walkthrough for the MD5 calculation of a short string, so I could check
it against the intermediate steps in my own calculation

Can you find Sun's code? Do you have an assembler level bebugger than
will let you single step through the code?
 
R

Roedy Green

The only thing I could use would be some kind of step-by-step
walkthrough for the MD5 calculation of a short string, so I could check
it against the intermediate steps in my own calculation
 
F

Filip Larsen

Christoph said:
private static int[] processFin(int[] message)
{
int[] result=s.clone(); // initialize
// message is already padded, just split:
int blocks=message.length/16;
for (int i=0;i<blocks;i++)
{ // for each 16-int block

It seems you are skipping the last 16 element in messages. Perhaps the
"i < blocks" condition should have been "i <= blocks"?


Regards,
 
R

rossum

I'm trying to implement the MD5 algorithm. (Note: I'm not looking for an
existing MD5 library; this is an exercise). It compiles and returns
something that looks like a hash superficially, but that's clearly the
wrong result. Even if there were a Unicode/Ascii problem, the empty
string should return the correct result (since it is always padded to
one 1-bit and 511 0-bits), but it doesn't.

Also, several strings (say, "" and "The") return the same incorrect hash
- but by dumping the 128-bit hash in each iteration, I've found only
very little in common between the intermediate steps, only the final one
is completely identical.

This is the function, as implemented from the pseudocode on Wikipedia.
http://en.wikipedia.org/wiki/MD5#Pseudocode

--------------------------------------------------------

[snip]

private static int[] processFin(int[] message)
{
int[] result=s.clone(); // initialize
// message is already padded, just split:
Have you implemented the padding correctly? To quote RFC 1321:
"3.1 Step 1. Append Padding Bits

The message is "padded" (extended) so that its length (in bits) is
congruent to 448, modulo 512. That is, the message is extended so
that it is just 64 bits shy of being a multiple of 512 bits long.
Padding is always performed, even if the length of the message is
already congruent to 448, modulo 512.

Padding is performed as follows: a single "1" bit is appended to
the message, and then "0" bits are appended so that the length in
bits of the padded message becomes congruent to 448, modulo 512. In
all, at least one bit and at most 512 bits are appended.

3.2 Step 2. Append Length

A 64-bit representation of b (the length of the message before the
padding bits were added) is appended to the result of the previous
step. In the unlikely event that b is greater than 2^64, then only
the low-order 64 bits of b are used. (These bits are appended as
two 32-bit words and appended low-order word first in accordance
with the previous conventions.)

At this point the resulting message (after padding with bits and
with b) has a length that is an exact multiple of 512 bits.
Equivalently, this message has a length that is an exact multiple
of 16 (32-bit) words. Let M[0 ... N-1] denote the words of the
resulting message, where N is a multiple of 16."

Note the endianness of the 64 bit message length field, which is not
clearly explained in the Wikipedia article - two big-endian 32 bit
words in little-endian order. This is not a standard byte order and
you will need to implement it carefully.

It is also necessary to get all endian issues resolved correctly
within the algorithm itself - they can often be a source of problems
when implementing cryptographic algorithms.

RFC 1321 (http://tools.ietf.org/html/rfc1321) also includes a
canonical implementation of MD5 in C, which might be helpful to read.
It also contains a few more Test Vectors than the Wikipedia article.

rossum
 
O

Oliver Wong

Christoph Burschka said:
If I had found something that helped me in the first 10 Google results,
I wouldn't have had to ask here, now would I? :)

You'd be surprised...

- Oliver
 
C

Christoph Burschka

Filip said:
Christoph said:
private static int[] processFin(int[] message)
{
int[] result=s.clone(); // initialize
// message is already padded, just split:
int blocks=message.length/16;
for (int i=0;i<blocks;i++)
{ // for each 16-int block

It seems you are skipping the last 16 element in messages. Perhaps the
"i < blocks" condition should have been "i <= blocks"?


Regards,

Mh... you had me there for a few moments. But no, the message is already padded
to a multiple of 16, so blocks is exactly the right number of blocks (if the
message has 32 elements, blocks will be 2, and the loop will go from block #0 to
block #1).

If i is the number of the block, the messages in the block are message[16*i ..
16*i+15]. Since blocks is exactly the message length divided by 16, the loop I
have now goes from message[0] to message[(message.length / 16 -1) * 16 +15], or
message[message.length-1]. If I changed the boundary condition to i<=blocks, I'd
be running out of the array boundary.

--cb
 
C

Christoph Burschka

rossum said:
Note the endianness of the 64 bit message length field, which is not
clearly explained in the Wikipedia article - two big-endian 32 bit
words in little-endian order. This is not a standard byte order and
you will need to implement it carefully.

It is also necessary to get all endian issues resolved correctly
within the algorithm itself - they can often be a source of problems
when implementing cryptographic algorithms.

RFC 1321 (http://tools.ietf.org/html/rfc1321) also includes a
canonical implementation of MD5 in C, which might be helpful to read.
It also contains a few more Test Vectors than the Wikipedia article.

rossum

Whoops. I didn't get the little-endian thing. So far I just padded my int[]
array to a length of 14 mod 16, then added the length value (long) as two int
words in normal big-endian order.

So what I should do is just swap the order of the two ints that I split the long
value into, leaving the ints themselves unchanged.

Mh. But that can't be my only mistake. Consider the zero-length example - ""
will always be padded to 0x8000[...]00, and the unpadded length, 0, is the same
regardless of endianness. Still, this message gives me the wrong result.

Not even mentioning that I'm somewhat skeptic of the possibility that the order
of two 32bit words at the end of the message could so badly break the hashing
function as to give it the same hash for two random short messages... but I'll
fix this and see what happens.

--cb
 
F

Filip Larsen

Christoph said:
Mh... you had me there for a few moments. But no, the message is already padded
to a multiple of 16, so blocks is exactly the right number of blocks (if the
message has 32 elements, blocks will be 2, and the loop will go from block #0 to
block #1).

Ok, hard to discount that without seeing the padding code.

for(int j=0;j<4;j++) result+=state;


Next bug candidate is this line, where it looks like the arrays should
have been indexed by j and not i.


Regards,
 
C

Christoph Burschka

Christoph said:
rossum said:
Note the endianness of the 64 bit message length field, which is not
clearly explained in the Wikipedia article - two big-endian 32 bit
words in little-endian order. This is not a standard byte order and
you will need to implement it carefully.

It is also necessary to get all endian issues resolved correctly
within the algorithm itself - they can often be a source of problems
when implementing cryptographic algorithms.

RFC 1321 (http://tools.ietf.org/html/rfc1321) also includes a
canonical implementation of MD5 in C, which might be helpful to read.
It also contains a few more Test Vectors than the Wikipedia article.

rossum

Whoops. I didn't get the little-endian thing. So far I just padded my int[]
array to a length of 14 mod 16, then added the length value (long) as two int
words in normal big-endian order.

So what I should do is just swap the order of the two ints that I split the long
value into, leaving the ints themselves unchanged.

Mh. But that can't be my only mistake. Consider the zero-length example - ""
will always be padded to 0x8000[...]00, and the unpadded length, 0, is the same
regardless of endianness. Still, this message gives me the wrong result.

Not even mentioning that I'm somewhat skeptic of the possibility that the order
of two 32bit words at the end of the message could so badly break the hashing
function as to give it the same hash for two random short messages... but I'll
fix this and see what happens.

--cb

Whoops again. There's a lot more wrong with my padding than I thought. I appear
to be adding the message length some 8 words before the actual end, and I don't
even know how I managed to do that.

Also, the initialization of my sine table is messed up - casting from double to
int will apparently just set too-high values to INT_MAX_VALUE. But if I cast the
sine to int before multiplying it by 2^32, well... it just ends up at 0, of
course. I'll need to do something else here...

Aha! Casting first to long to get a whole number, then to int, appears to give
me something closer to what I want - the long->int conversion wraps instead of
setting to INT_MAX_VALUE. Whether the result is also the correct one remains to
be seen...

--cb
 
C

Christoph Burschka

Filip said:
Christoph said:
Mh... you had me there for a few moments. But no, the message is
already padded
to a multiple of 16, so blocks is exactly the right number of blocks
(if the
message has 32 elements, blocks will be 2, and the loop will go from
block #0 to
block #1).

Ok, hard to discount that without seeing the padding code.

for(int j=0;j<4;j++) result+=state;


Next bug candidate is this line, where it looks like the arrays should
have been indexed by j and not i.


Regards,


Wow... that thing is literally crawling with bugs. Unfortunately, that still
didn't do the trick - now the result is different again, but the hashes are
still wrong and also nearly identical:

[]: 0112a34159cdab8981f7dcfea9eb2476
[ ]: 0112a34159cdab8981f7dcfea9eb2476
[The]: 8112a34159cdab8981f7dcfea9eb2476
[ttr]: 8112a34159cdab8981f7dcfea9eb2476
[ t t]: 7512a34159cdab8981f7dcfea9eb2476
[The quick brown fox jumps over the lazy dog]: f3b76b417dcdab89cf30dcfe97eb2476

I think I'll sleep on it and look at the code again when I'm more awake. Thanks
very much for the help! :)
 
C

Christoph Burschka

Mh... several more problems fixed. For example, I missed another
Endianness reversal: The words of the message itself are processed with
the lowest byte first.

My fix doesn't look too pretty, but I think it's the fastest it gets
without heavily optimizing:

private static int littleEndian(int in)
{
int out=0;
if (in<0)
{
in-=Integer.MIN_VALUE;
out+=128;
}
for (int i=0;i<4;i++)
{
out+=in%256*Math.pow(256,3-i);
in/=256;
}
return out;
}


Also, I missed a crucial part of the algorithm:

b := ((a + f + k + w[g]) leftrotate r) + b

And finally, it appears that the bitwise shift operator isn't doing what
I thought it would - no rotation; bits to the left are removed, 0s are
added to the right. Too bad - I'd hoped I could use the built-in <<
operator; now I need a new function for it. (i=i*2+(i<0?1:0)) is
reasonably elegant for this, however.

Oh, and the final problem: I forgot to reverse the Endianness of the
hash result after calculation.

With all this, I am now finally getting the correct result for the
zero-length message (d41d8cd98f00b204e9800998ecf8427e). I'm still
getting incorrect results for actual messages, but they have nothing in
common and I can be reasonably sure that's because of incorrect padding
or character encoding, not the algorithm.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,011
Latest member
AjaUqq1950

Latest Threads

Top