# Re: question: why isn't a byte of a hash more uniform? how couldI improve my code to cure that?

Discussion in 'Python' started by Tim Chase, Aug 7, 2009.

1. ### Tim ChaseGuest

> After I have written a short Python script that hashes my textfile line by
> line and collects the numbers next to the original, I checked what I got.
> Instead of getting around 25% in each treatment, the range is 17.8%-31.3%.

That sounds suspiciously like 25% with a +/- 7% fluctuation one
might expect to see from non-random source data.

Remember that your outputs are driven purely by your inputs in a
deterministic fashion -- if your inputs are purely random, then
your inputs aren't random, you get a taste of your own medicine
("my file has just the number 42 on every line...why isn't my
output random?"). And randomness-of-hash-output is a red herring
since hashing is *not* random.

Your input is also finite -- an aspect which leaves you a far cry
from the full hash-space. If an md5 has 32 bytes (256 bits) of
data, your input would have to cover 2**256 possible inputs to
see the full profile of your outputs. That's a lot of input

-tkc

Tim Chase, Aug 7, 2009

2. ### László SándorGuest

Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

Thank you, Tim. My comments are below.

On 2009-08-07 13:19:47 -0400, Tim Chase <> said:

>> After I have written a short Python script that hashes my textfile line by
>> line and collects the numbers next to the original, I checked what I got.
>> Instead of getting around 25% in each treatment, the range is 17.8%-31.3%.

>
> That sounds suspiciously like 25% with a +/- 7% fluctuation one might
> expect to see from non-random source data.

Could you help me where this range comes from? (If all my lines were
"42", I wouldn't hit this range. So it cannot be a rule. Right?)

>
> Remember that your outputs are driven purely by your inputs in a
> deterministic fashion -- if your inputs are purely random, then your
> outputs should more closely match your expected bin'ing. If your
> inputs aren't random, you get a taste of your own medicine ("my file
> has just the number 42 on every line...why isn't my output random?").
> And randomness-of-hash-output is a red herring since hashing is *not*
> random.

Thanks, I tried to be correct with "pseudo-random". I understand that
everything is dependent on input. I want it to be the case. However, I
hoped that good hashes produce random-looking output from input with
very little variation. It would be strange if I could not get more than
18% of lines with a remainder of 3 (after division by 4), whatever hash
I try just because these are names of people.

>
> Your input is also finite -- an aspect which leaves you a far cry from
> the full hash-space. If an md5 has 32 bytes (256 bits) of data, your
> input would have to cover 2**256 possible inputs to see the full
> profile of your outputs. That's a lot of input
>
> -tkc

OK, I understand. Could anyone suggest a better way to do this, then?

(Recap: random-looking, close-to uniform assignment of one number out
of four possibilities to strings.)

Thanks, everyone.

Laszlo

László Sándor, Aug 7, 2009

3. ### Ethan FurmanGuest

Re: question: why isn't a byte of a hash more uniform? how couldI improve my code to cure that?

László Sándor wrote:
> Thank you, Tim. My comments are below.
>
> On 2009-08-07 13:19:47 -0400, Tim Chase <>
> said:
>
>>> After I have written a short Python script that hashes my textfile
>>> line by
>>> line and collects the numbers next to the original, I checked what I
>>> got.
>>> Instead of getting around 25% in each treatment, the range is
>>> 17.8%-31.3%.

>>
>>
>> That sounds suspiciously like 25% with a +/- 7% fluctuation one might
>> expect to see from non-random source data.

>
>
> Could you help me where this range comes from? (If all my lines were
> "42", I wouldn't hit this range. So it cannot be a rule. Right?)
>
>>
>> Remember that your outputs are driven purely by your inputs in a
>> deterministic fashion -- if your inputs are purely random, then your
>> outputs should more closely match your expected bin'ing. If your
>> inputs aren't random, you get a taste of your own medicine ("my file
>> has just the number 42 on every line...why isn't my output random?").
>> And randomness-of-hash-output is a red herring since hashing is *not*
>> random.

>
>
> Thanks, I tried to be correct with "pseudo-random". I understand that
> everything is dependent on input. I want it to be the case. However, I
> hoped that good hashes produce random-looking output from input with
> very little variation. It would be strange if I could not get more than
> 18% of lines with a remainder of 3 (after division by 4), whatever hash
> I try just because these are names of people.
>
>>
>> Your input is also finite -- an aspect which leaves you a far cry from
>> the full hash-space. If an md5 has 32 bytes (256 bits) of data, your
>> input would have to cover 2**256 possible inputs to see the full
>> profile of your outputs. That's a lot of input
>>
>> -tkc

>
>
> OK, I understand. Could anyone suggest a better way to do this, then?
>
> (Recap: random-looking, close-to uniform assignment of one number out of
> four possibilities to strings.)
>
> Thanks, everyone.
>
> Laszlo
>

Well, a very simplistic method is to use the length of the input string
modulus four. If the names have decently "random" lengths, that may work.

~Ethan~

Ethan Furman, Aug 7, 2009
4. ### Paul RubinGuest

Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

László Sándor <> writes:
> OK, I understand. Could anyone suggest a better way to do this, then?
>
> (Recap: random-looking, close-to uniform assignment of one number out
> of four possibilities to strings.)

Use a cryptographic hash function like md5 (deprecated for security
purposes but should be ok for this application). The built-in Python
hash function is made for hashing symbols and is designed to avoid
collisions when you have a bunch of sequential identifiers (a,b,c...).
That is, it deliberately does NOT resemble a random function, which
would have a certain number of collisions by chance.

Paul Rubin, Aug 8, 2009