hash() yields different results for different platforms

Discussion in 'Python' started by Qiangning Hong, Jul 12, 2006.

  1. I'm writing a spider. I have millions of urls in a table (mysql) to
    check if a url has already been fetched. To check fast, I am
    considering to add a "hash" column in the table, make it a unique key,
    and use the following sql statement:
    insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
    to add new url.

    I believe this will be faster than making the "url" column unique key
    and doing string comparation. Right?

    However, when I come to Python's builtin hash() function, I found it
    produces different values in my two computers! In a pentium4,
    hash('a') -> -468864544; in a amd64, hash('a') -> 12416037344. Does
    hash function depend on machine's word length?

    If it does, I must consider another hash algorithm because the spider
    will run concurrently in several computers, some are 32-bit, some are
    64-bit. Is md5 a good choice? Will it be too slow that I have no
    performance gain than using the "url" column directly as the unique
    key?

    I will do some benchmarking to find it out. But while making my hands
    dirty, I would like to hear some advice from experts here. :)
     
    Qiangning Hong, Jul 12, 2006
    #1
    1. Advertising

  2. Qiangning Hong

    Paul Rubin Guest

    "Qiangning Hong" <> writes:
    > However, when I come to Python's builtin hash() function, I found it
    > produces different values in my two computers! In a pentium4,
    > hash('a') -> -468864544; in a amd64, hash('a') -> 12416037344. Does
    > hash function depend on machine's word length?


    The hash function is unspecified and can depend on anything the
    implementers feel like. It may(?) even be permitted to differ from
    one run of the interpreter to another (I haven't checked the spec for
    this). Don't count on it being consistent from machine to machine.

    > If it does, I must consider another hash algorithm because the spider
    > will run concurrently in several computers, some are 32-bit, some are
    > 64-bit. Is md5 a good choice? Will it be too slow that I have no
    > performance gain than using the "url" column directly as the unique key?


    If you're going to accept the overhead of an SQL database you might as
    well enjoy the use of the abstraction it gives you, instead of trying
    to implement what amounts to your own form of indexing instead of
    letting the db take care of it. But md5(url) is certainly very fast
    compared with processing the outgoing http connection that you
    presumably plan to open for each url.

    > I will do some benchmarking to find it out.


    That's the right way to answer questions like this.
     
    Paul Rubin, Jul 12, 2006
    #2
    1. Advertising

  3. On 2006-07-11, Qiangning Hong <> wrote:

    > I'm writing a spider. I have millions of urls in a table (mysql) to
    > check if a url has already been fetched. To check fast, I am
    > considering to add a "hash" column in the table, make it a unique key,
    > and use the following sql statement:
    > insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
    > to add new url.
    >
    > I believe this will be faster than making the "url" column unique key
    > and doing string comparation. Right?


    I doubt it will be significantly faster. Comparing two strings
    and hashing a string are both O(N).

    > However, when I come to Python's builtin hash() function, I
    > found it produces different values in my two computers! In a
    > pentium4, hash('a') -> -468864544; in a amd64, hash('a') ->
    > 12416037344. Does hash function depend on machine's word
    > length?


    Apparently. :)

    The low 32 bits match, so perhaps you should just use that
    portion of the returned hash?

    >>> hex(12416037344)

    '0x2E40DB1E0L'
    >>> hex(-468864544 & 0xffffffffffffffff)

    '0xFFFFFFFFE40DB1E0L'

    >>> hex(12416037344 & 0xffffffff)

    '0xE40DB1E0L'
    >>> hex(-468864544 & 0xffffffff)

    '0xE40DB1E0L'

    --
    Grant Edwards grante Yow! Uh-oh!! I forgot
    at to submit to COMPULSORY
    visi.com URINALYSIS!
     
    Grant Edwards, Jul 12, 2006
    #3
  4. Qiangning Hong

    Carl Banks Guest

    Grant Edwards wrote:
    > On 2006-07-11, Qiangning Hong <> wrote:
    >
    > > I'm writing a spider. I have millions of urls in a table (mysql) to
    > > check if a url has already been fetched. To check fast, I am
    > > considering to add a "hash" column in the table, make it a unique key,
    > > and use the following sql statement:
    > > insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
    > > to add new url.
    > >
    > > I believe this will be faster than making the "url" column unique key
    > > and doing string comparation. Right?

    >
    > I doubt it will be significantly faster. Comparing two strings
    > and hashing a string are both O(N).


    Playing Devil's Advocate: The hash would be a one-time operation during
    database insertion, whereas string comparison would happen every
    search. Conceivably, comparing hash strings (which is O(1)) could
    result in a big savings compared to comparing regular strings; but I
    expect most decent sql implementations already hash data internally, so
    rolling your own hash would be useless at best.

    If the OP's database is lacking, md5 is probably fine. Perhaps using a
    subset of the md5 (the low 32 bits, say) could speed up comparisons at
    risk of more collisions. Probably a good trade off unless the DB is
    humungous.


    Carl Banks
     
    Carl Banks, Jul 12, 2006
    #4
  5. On 11 Jul 2006 16:33:50 -0700, "Qiangning Hong" <>
    declaimed the following in comp.lang.python:

    >
    > I believe this will be faster than making the "url" column unique key
    > and doing string comparation. Right?
    >

    Given that the queries pass the data as strings, you're trading an
    integer->string, string->integer conversion pair on each search,
    followed by an index file look-up, for a straight index file look-up.

    Granted, my college days were a quarter century ago, but at the
    time, the data structures class required implementing a hash table
    look-up program (hashed-head, multiple-linked list). Hashes weren't
    guaranteed to be unique -- the hash only gave an entry point from which
    one then performed string comparisons. *

    You could maybe use MD5 encoding if you don't want the pure URL in
    the look-up, and MySQL, at least, has a built-in for MD5 so you wouldn't
    even have to worry about "differences".


    * The Amiga file-system was a hashed-head, multiple-linked list -- the
    only place I'd ever encountered one outside of that 3rd year college
    course. File-names were hashed into an index into a directory block; the
    directory block had pointers to file-header/subdirectory blocks... those
    blocks then had a linked list to other fh/sd blocks that shared the hash
    value.
    --
    Wulfraed Dennis Lee Bieber KD6MOG

    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: )
    HTTP://www.bestiaria.com/
     
    Dennis Lee Bieber, Jul 12, 2006
    #5
  6. On 11 Jul 2006 21:37:46 -0700, "Carl Banks"
    <> declaimed the following in
    comp.lang.python:


    >
    > If the OP's database is lacking, md5 is probably fine. Perhaps using a
    > subset of the md5 (the low 32 bits, say) could speed up comparisons at
    > risk of more collisions. Probably a good trade off unless the DB is
    > humungous.
    >

    And if there are collisions, a unique index on the hash is not
    usable...
    --
    Wulfraed Dennis Lee Bieber KD6MOG

    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: )
    HTTP://www.bestiaria.com/
     
    Dennis Lee Bieber, Jul 12, 2006
    #6
  7. Grant Edwards wrote:
    > On 2006-07-11, Qiangning Hong <> wrote:
    > > However, when I come to Python's builtin hash() function, I
    > > found it produces different values in my two computers! In a
    > > pentium4, hash('a') -> -468864544; in a amd64, hash('a') ->
    > > 12416037344. Does hash function depend on machine's word
    > > length?

    >
    > Apparently. :)
    >
    > The low 32 bits match, so perhaps you should just use that
    > portion of the returned hash?
    >
    > >>> hex(12416037344)

    > '0x2E40DB1E0L'
    > >>> hex(-468864544 & 0xffffffffffffffff)

    > '0xFFFFFFFFE40DB1E0L'
    >
    > >>> hex(12416037344 & 0xffffffff)

    > '0xE40DB1E0L'
    > >>> hex(-468864544 & 0xffffffff)

    > '0xE40DB1E0L'


    Is this relationship (same low 32 bits) guaranteed? Will it change in
    the future version?
     
    Qiangning Hong, Jul 12, 2006
    #7
  8. Qiangning Hong

    Tim Peters Guest

    [Grant Edwards]
    >> ...
    >> The low 32 bits match, so perhaps you should just use that
    >> portion of the returned hash?
    >>
    >> >>> hex(12416037344)

    >> '0x2E40DB1E0L'
    >> >>> hex(-468864544 & 0xffffffffffffffff)

    >> '0xFFFFFFFFE40DB1E0L'
    >>
    >> >>> hex(12416037344 & 0xffffffff)

    >> '0xE40DB1E0L'
    >> >>> hex(-468864544 & 0xffffffff)

    >> '0xE40DB1E0L'


    [Qiangning Hong]
    > Is this relationship (same low 32 bits) guaranteed?


    No. Nothing about hashes is guaranteed, except that when x and y are
    of a hashable type, and x == y, then hash(x) == hash(y) too.

    > Will it change in the future version?


    That's possible, but not planned. Note that the guts of string
    hashing in CPython today is implemented via

    while (--len >= 0)
    x = (1000003*x) ^ *p++;

    where x is C type "long", and the C language doesn't even define what
    that does (behavior when signed multiplication overflows isn't defined
    in C).
     
    Tim Peters, Jul 12, 2006
    #8
  9. Qiangning Hong wrote:

    > /.../ add a "hash" column in the table, make it a unique key


    at this point, you should have slapped yourself on the forehead, and gone
    back to the drawing board.

    </F>
     
    Fredrik Lundh, Jul 12, 2006
    #9
  10. Using Python's hash as column in the table might not be a good idea.
    You just found out why. So you could instead just use the base url and
    create an index based on that so next time just quickly get all urls
    from same base address then do a linear search for a specific one, or
    even easier, implement your own hashes without using any of the
    Python's built-in hash() functions. For example, transform each
    character to an int and multply them all mod 2^32-1 or something like
    that. Even better I think someone already posted the Python's way of
    generating hashes for string, well, just re-implement it in Python such
    that your version will yield the same hash on any platform.

    Hope this helps,
    Nick V.

    Qiangning Hong wrote:
    > I'm writing a spider. I have millions of urls in a table (mysql) to
    > check if a url has already been fetched. To check fast, I am
    > considering to add a "hash" column in the table, make it a unique key,
    > and use the following sql statement:
    > insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
    > to add new url.
    >
    > I believe this will be faster than making the "url" column unique key
    > and doing string comparation. Right?
    >
    > However, when I come to Python's builtin hash() function, I found it
    > produces different values in my two computers! In a pentium4,
    > hash('a') -> -468864544; in a amd64, hash('a') -> 12416037344. Does
    > hash function depend on machine's word length?
    >
    > If it does, I must consider another hash algorithm because the spider
    > will run concurrently in several computers, some are 32-bit, some are
    > 64-bit. Is md5 a good choice? Will it be too slow that I have no
    > performance gain than using the "url" column directly as the unique
    > key?
    >
    > I will do some benchmarking to find it out. But while making my hands
    > dirty, I would like to hear some advice from experts here. :)
     
    Nick Vatamaniuc, Jul 12, 2006
    #10
  11. >>>>> Grant Edwards <> (GE) wrote:

    >GE> The low 32 bits match, so perhaps you should just use that
    >GE> portion of the returned hash?


    If the hashed should be unique, 32 bits is much too low if you have
    millions of entries.
    --
    Piet van Oostrum <>
    URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
    Private email:
     
    Piet van Oostrum, Jul 12, 2006
    #11
  12. On 2006-07-12, Carl Banks <> wrote:
    > Grant Edwards wrote:
    >> On 2006-07-11, Qiangning Hong <> wrote:
    >>
    >> > I'm writing a spider. I have millions of urls in a table (mysql) to
    >> > check if a url has already been fetched. To check fast, I am
    >> > considering to add a "hash" column in the table, make it a unique key,
    >> > and use the following sql statement:
    >> > insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
    >> > to add new url.
    >> >
    >> > I believe this will be faster than making the "url" column unique key
    >> > and doing string comparation. Right?

    >>
    >> I doubt it will be significantly faster. Comparing two strings
    >> and hashing a string are both O(N).

    >
    > Playing Devil's Advocate: The hash would be a one-time operation during
    > database insertion, whereas string comparison would happen every
    > search.


    Good point.

    > Conceivably, comparing hash strings (which is O(1)) could
    > result in a big savings compared to comparing regular strings;


    Still, I doubt that the URLs are long enough so that there's a
    significant difference.

    > but I expect most decent sql implementations already hash data
    > internally, so rolling your own hash would be useless at best.


    Precisely. DB designers and implementers have been working on
    this problem for 30 years. I doubt the OP is going to be able
    to best them with a few minutes work.

    > If the OP's database is lacking, md5 is probably fine. Perhaps
    > using a subset of the md5 (the low 32 bits, say) could speed
    > up comparisons at risk of more collisions. Probably a good
    > trade off unless the DB is humungous.


    My advice: do it the simple way first (let the DB handle it).
    Don't try to fix a problem until you know it exists.

    Premature optimization....

    --
    Grant Edwards grante Yow! It's strange, but I'm
    at only TRULY ALIVE when I'm
    visi.com covered in POLKA DOTS and
    TACO SAUCE...
     
    Grant Edwards, Jul 12, 2006
    #12
  13. On 2006-07-12, Qiangning Hong <> wrote:
    > Grant Edwards wrote:
    >> On 2006-07-11, Qiangning Hong <> wrote:
    >> > However, when I come to Python's builtin hash() function, I
    >> > found it produces different values in my two computers! In a
    >> > pentium4, hash('a') -> -468864544; in a amd64, hash('a') ->
    >> > 12416037344. Does hash function depend on machine's word
    >> > length?

    >>
    >> Apparently. :)
    >>
    >> The low 32 bits match, so perhaps you should just use that
    >> portion of the returned hash?
    >>
    >> >>> hex(12416037344)

    >> '0x2E40DB1E0L'
    >> >>> hex(-468864544 & 0xffffffffffffffff)

    >> '0xFFFFFFFFE40DB1E0L'
    >>
    >> >>> hex(12416037344 & 0xffffffff)

    >> '0xE40DB1E0L'
    >> >>> hex(-468864544 & 0xffffffff)

    >> '0xE40DB1E0L'

    >
    > Is this relationship (same low 32 bits) guaranteed?


    No, I don't believe so.

    > Will it change in the future version?


    It may.

    --
    Grant Edwards grante Yow! Is this an out-take
    at from the "BRADY BUNCH"?
    visi.com
     
    Grant Edwards, Jul 12, 2006
    #13
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Kerry, Richard
    Replies:
    2
    Views:
    427
    Piet van Oostrum
    Jul 13, 2006
  2. =?ISO-8859-1?Q?Bj=F8rn_Augestad?=

    double to int conversion yields strange results

    =?ISO-8859-1?Q?Bj=F8rn_Augestad?=, Feb 11, 2005, in forum: C Programming
    Replies:
    31
    Views:
    943
    Tim Rentsch
    Feb 16, 2005
  3. Dirk T. Shelley

    Converting Floats to Strings yields erratic results

    Dirk T. Shelley, Jun 7, 2011, in forum: C Programming
    Replies:
    29
    Views:
    817
    Seebs
    Jun 10, 2011
  4. Rafael Nenninger

    File System Search on an asp file yields not results

    Rafael Nenninger, Nov 1, 2004, in forum: ASP General
    Replies:
    2
    Views:
    210
    Aaron [SQL Server MVP]
    Nov 1, 2004
  5. x1
    Replies:
    11
    Views:
    278
Loading...

Share This Page