Re: question: why isn't a byte of a hash more uniform? how couldI improve my code to cure that?

Discussion in 'Python' started by Tim Chase, Aug 7, 2009.

  1. Tim Chase

    Tim Chase Guest

    > After I have written a short Python script that hashes my textfile line by
    > line and collects the numbers next to the original, I checked what I got.
    > Instead of getting around 25% in each treatment, the range is 17.8%-31.3%.


    That sounds suspiciously like 25% with a +/- 7% fluctuation one
    might expect to see from non-random source data.

    Remember that your outputs are driven purely by your inputs in a
    deterministic fashion -- if your inputs are purely random, then
    your outputs should more closely match your expected bin'ing. If
    your inputs aren't random, you get a taste of your own medicine
    ("my file has just the number 42 on every line...why isn't my
    output random?"). And randomness-of-hash-output is a red herring
    since hashing is *not* random.

    Your input is also finite -- an aspect which leaves you a far cry
    from the full hash-space. If an md5 has 32 bytes (256 bits) of
    data, your input would have to cover 2**256 possible inputs to
    see the full profile of your outputs. That's a lot of input :)

    -tkc
    Tim Chase, Aug 7, 2009
    #1
    1. Advertising

  2. Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

    Thank you, Tim. My comments are below.

    On 2009-08-07 13:19:47 -0400, Tim Chase <> said:

    >> After I have written a short Python script that hashes my textfile line by
    >> line and collects the numbers next to the original, I checked what I got.
    >> Instead of getting around 25% in each treatment, the range is 17.8%-31.3%.

    >
    > That sounds suspiciously like 25% with a +/- 7% fluctuation one might
    > expect to see from non-random source data.


    Could you help me where this range comes from? (If all my lines were
    "42", I wouldn't hit this range. So it cannot be a rule. Right?)

    >
    > Remember that your outputs are driven purely by your inputs in a
    > deterministic fashion -- if your inputs are purely random, then your
    > outputs should more closely match your expected bin'ing. If your
    > inputs aren't random, you get a taste of your own medicine ("my file
    > has just the number 42 on every line...why isn't my output random?").
    > And randomness-of-hash-output is a red herring since hashing is *not*
    > random.


    Thanks, I tried to be correct with "pseudo-random". I understand that
    everything is dependent on input. I want it to be the case. However, I
    hoped that good hashes produce random-looking output from input with
    very little variation. It would be strange if I could not get more than
    18% of lines with a remainder of 3 (after division by 4), whatever hash
    I try just because these are names of people.

    >
    > Your input is also finite -- an aspect which leaves you a far cry from
    > the full hash-space. If an md5 has 32 bytes (256 bits) of data, your
    > input would have to cover 2**256 possible inputs to see the full
    > profile of your outputs. That's a lot of input :)
    >
    > -tkc


    OK, I understand. Could anyone suggest a better way to do this, then?

    (Recap: random-looking, close-to uniform assignment of one number out
    of four possibilities to strings.)

    Thanks, everyone.

    Laszlo
    László Sándor, Aug 7, 2009
    #2
    1. Advertising

  3. Tim Chase

    Ethan Furman Guest

    Re: question: why isn't a byte of a hash more uniform? how couldI improve my code to cure that?

    László Sándor wrote:
    > Thank you, Tim. My comments are below.
    >
    > On 2009-08-07 13:19:47 -0400, Tim Chase <>
    > said:
    >
    >>> After I have written a short Python script that hashes my textfile
    >>> line by
    >>> line and collects the numbers next to the original, I checked what I
    >>> got.
    >>> Instead of getting around 25% in each treatment, the range is
    >>> 17.8%-31.3%.

    >>
    >>
    >> That sounds suspiciously like 25% with a +/- 7% fluctuation one might
    >> expect to see from non-random source data.

    >
    >
    > Could you help me where this range comes from? (If all my lines were
    > "42", I wouldn't hit this range. So it cannot be a rule. Right?)
    >
    >>
    >> Remember that your outputs are driven purely by your inputs in a
    >> deterministic fashion -- if your inputs are purely random, then your
    >> outputs should more closely match your expected bin'ing. If your
    >> inputs aren't random, you get a taste of your own medicine ("my file
    >> has just the number 42 on every line...why isn't my output random?").
    >> And randomness-of-hash-output is a red herring since hashing is *not*
    >> random.

    >
    >
    > Thanks, I tried to be correct with "pseudo-random". I understand that
    > everything is dependent on input. I want it to be the case. However, I
    > hoped that good hashes produce random-looking output from input with
    > very little variation. It would be strange if I could not get more than
    > 18% of lines with a remainder of 3 (after division by 4), whatever hash
    > I try just because these are names of people.
    >
    >>
    >> Your input is also finite -- an aspect which leaves you a far cry from
    >> the full hash-space. If an md5 has 32 bytes (256 bits) of data, your
    >> input would have to cover 2**256 possible inputs to see the full
    >> profile of your outputs. That's a lot of input :)
    >>
    >> -tkc

    >
    >
    > OK, I understand. Could anyone suggest a better way to do this, then?
    >
    > (Recap: random-looking, close-to uniform assignment of one number out of
    > four possibilities to strings.)
    >
    > Thanks, everyone.
    >
    > Laszlo
    >


    Well, a very simplistic method is to use the length of the input string
    modulus four. If the names have decently "random" lengths, that may work.

    ~Ethan~
    Ethan Furman, Aug 7, 2009
    #3
  4. Tim Chase

    Paul Rubin Guest

    Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

    László Sándor <> writes:
    > OK, I understand. Could anyone suggest a better way to do this, then?
    >
    > (Recap: random-looking, close-to uniform assignment of one number out
    > of four possibilities to strings.)


    Use a cryptographic hash function like md5 (deprecated for security
    purposes but should be ok for this application). The built-in Python
    hash function is made for hashing symbols and is designed to avoid
    collisions when you have a bunch of sequential identifiers (a,b,c...).
    That is, it deliberately does NOT resemble a random function, which
    would have a certain number of collisions by chance.
    Paul Rubin, Aug 8, 2009
    #4
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Casper B
    Replies:
    3
    Views:
    436
    eranb
    Jan 13, 2005
  2. tobiah
    Replies:
    3
    Views:
    256
    tobiah
    Sep 14, 2006
  3. Mr. SweatyFinger
    Replies:
    2
    Views:
    1,763
    Smokey Grindel
    Dec 2, 2006
  4. Chris F.A. Johnson

    how to cure FF problem

    Chris F.A. Johnson, Apr 22, 2008, in forum: HTML
    Replies:
    10
    Views:
    649
    Chris F.A. Johnson
    Apr 23, 2008
  5. rp
    Replies:
    1
    Views:
    499
    red floyd
    Nov 10, 2011
Loading...

Share This Page