hash

Discussion in 'Perl Misc' started by George Mpouras, May 12, 2012.

  1. The clever smartmatch $a ~~ @array is a lot slower than exists $hash{$a}.
    Also the hashing algorithm is much slower than binary search, so wouldn't
    it be better if we stored key/value pairs in btree structures instead of hashes?
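
    George's speed claim is easy to check with the core Benchmark module. A
    minimal sketch (data sizes and key names are arbitrary; grep stands in
    for the linear scan that $a ~~ @array performs against an array):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Arbitrary test data: 1000 keys, stored both as an array and a hash.
    my @array  = map { "key$_" } 1 .. 1000;
    my %hash   = map { $_ => 1 } @array;
    my $needle = 'key500';

    cmpthese(-1, {
        # Linear scan, comparable to what $needle ~~ @array does for arrays.
        array_scan  => sub { my $found = grep { $_ eq $needle } @array },
        # Single hash lookup, O(1) on average.
        hash_exists => sub { my $found = exists $hash{$needle} },
    });
    ```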
     
    George Mpouras, May 12, 2012
    #1

  2. "George Mpouras" <> wrote:
    [...]
    >Also hashing algorithm is much slower than binary search,


    ???

    hash access is typically O(1) while binary search is O(log n). Why do
    you think hashing is slower than binary search?

    >so wouldn't be
    >better if we store key,values to btree structures instead of hashes ?


    Is this a question about the internal implementation of Perl hashes deep
    down in the guts of the abstract Perl machine?

    jue
     
    Jürgen Exner, May 12, 2012
    #2

  3. Jürgen Exner <> writes:
    > "George Mpouras" <> wrote:
    > [...]
    >>Also hashing algorithm is much slower than binary search,

    >
    > ???
    >
    > hash access is typically O(1) while binary search is O(log n). Why do
    > you think hashing is slower than binary search?


    Hashing itself is not really an algorithm; it's a data structure.

    Some particular thing you do with hashing may well be slower than binary
    search, but without knowing what the OP is doing it's difficult to
    speculate.

    [...]

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Will write code for food.
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, May 13, 2012
    #3
  4. Keith Thompson <> wrote:
    >Jürgen Exner <> writes:
    >> "George Mpouras" <> wrote:
    >> [...]
    >>>Also hashing algorithm is much slower than binary search,

    >>
    >> ???
    >>
    >> hash access is typically O(1) while binary search is O(log n). Why do
    >> you think hashing is slower than binary search?

    >
    >Hashing itself is not really an algorithm; it's a data structure.


    True

    >Some particular thing you do with hashing may well be slower than binary
    >search,


    Absolutely. That's why I said "typically". Of course, if you get a
    degenerate hash then access might be as bad as a linear search.

    >but without knowing what the OP is doing it's difficult to
    >speculate.


    ACK

    jue
     
    Jürgen Exner, May 13, 2012
    #4

  5. >> hash access is typically O(1) while binary search is O(log n). Why do
    >> you think hashing is slower than binary search?

    >
    >Hashing itself is not really an algorithm; it's a data structure.


    True

    A hash is a data structure, and there is internal logic for how to
    retrieve values. If there is a big number of keys then the hash may have
    conflicting keys; these conflicts are stored as lists, where retrieval is
    slow, O(n). Retrieving btree values is much faster, O(log n). There is no
    particular problem; I had a conversation with my colleague discussing this
    and maybe there is a need for a module to make it easy.
     
    George Mpouras, May 13, 2012
    #5
  6. On 2012-05-13 03:10, Jürgen Exner <> wrote:
    > Keith Thompson <> wrote:
    >>Jürgen Exner <> writes:
    >>> "George Mpouras" <> wrote:
    >>> [...]
    >>>>Also hashing algorithm is much slower than binary search,
    >>>
    >>> ???
    >>>
    >>> hash access is typically O(1) while binary search is O(log n). Why do
    >>> you think hashing is slower than binary search?


    Big-O notation only tells you how an algorithm scales with increasing n.
    It doesn't tell you how it performs for any particular n (especially not
    for a small n).


    >>Hashing itself is not really an algorithm; it's a data structure.

    >
    > True


    <nitpick>
    The data structure is called a "hash table". There is also the "hash
    function". Both are frequently abbreviated as "hash". "Hashing" may
    refer to computing a hash function or to filling a hash table. In
    both cases it is an algorithm.
    </nitpick>


    >>Some particular thing you do with hashing may well be slower than binary
    >>search,

    >
    > Absolutely. That's why I said "typically". Of course, if you get a
    > degenerate hash then access might be as bad as a linear search.


    A degenerate hash is not necessary.

    A successful lookup of an element in a hash consists of the following
    steps:

    1) compute the hash of the lookup key (cost: c1 * length(lookup_key))
    2) compute the location of the element in the hash (cost: negligible)
    3) compare the lookup key to the element key
    (cost c2 * length(lookup_key) if successful, almost zero if
    unsuccessful, because two strings that collide in the hash are
    unlikely to share a common prefix).
    4) If unsuccessful, compute the location of the next element (this may
    be a simple pointer lookup or involve the computation of a new hash)
    and continue at 3.

    If the hash is properly constructed, step 4 is very rare, so the access
    time is

    c1 * length(lookup_key) + c2 * length(lookup_key)

    For a binary tree, you have to descend d levels, and do a string
    comparison at every level, so the cost is

    c2 * length(common_prefix(lookup_key, node1_key)) +
    c2 * length(common_prefix(lookup_key, node2_key)) +
    ...
    c2 * length(lookup_key)

    (the common prefix is likely to get longer as we descend down the tree)

    Note that the last term is the same, so we can eliminate it.

    It follows that the hash access is faster than the binary tree[1] access
    if the computation of the hash is faster than the (d-1) partial string
    compares. This depends on the length of lookup_key, the distribution of
    the keys in the dictionary, the hash algorithm, the depth of the tree,
    how much of your data is already in the CPU cache, etc.

    In short, it is impossible to say that a hash is faster than a binary
    tree or vice versa. You have to benchmark particular implementations for
    particular data.

    There are some general observations, though:

    The tree gets slower with increasing depth while the hash is quite
    insensitive to the number of elements, so the hash has an advantage
    for a large number of elements.

    For a hash lookup you have to compute a hash of the entire key up front,
    while for the tree you only have to compare the common prefix (+1 byte),
    so the tree has an advantage if the keys are long (and don't have a
    common prefix).

    The hash is likely to be more cache friendly because you need to
    look only at two contiguous memory locations while in the tree you need
    to look at (d+1) memory locations.
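
    The "benchmark particular implementations" advice can be followed with a
    small sketch: a hash lookup versus a binary search over a sorted array,
    which stands in for a balanced tree here (core Perl has no tree type, and
    both do about log2(n) string comparisons per lookup). Key format and data
    size are arbitrary assumptions:

    ```perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Binary search over a sorted array, standing in for a balanced tree.
    sub bsearch {
        my ($aref, $key) = @_;
        my ($lo, $hi) = (0, $#$aref);
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            my $cmp = $aref->[$mid] cmp $key;
            if    ($cmp < 0) { $lo = $mid + 1 }
            elsif ($cmp > 0) { $hi = $mid - 1 }
            else             { return $mid }    # found: return index
        }
        return -1;                              # not found
    }

    # Zero-padded keys so lexical order matches numeric order.
    my @sorted = map { sprintf 'key%05d', $_ } 1 .. 10_000;
    my %hash   = map { $_ => 1 } @sorted;
    my $needle = 'key05000';

    cmpthese(-1, {
        hash    => sub { my $x = exists $hash{$needle} },
        bsearch => sub { my $x = bsearch(\@sorted, $needle) },
    });
    ```

    As the post argues, the winner depends on key length, key distribution,
    and cache behaviour, so results from one data set don't generalize.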


    >>but without knowing what the OP is doing it's difficult to
    >>speculate.

    >
    > ACK


    Also ACK.

    hp

    [1] George also wrote "btree", which is not a binary tree. For a btree
    it gets even more complicated, but btrees are normally used for disk
    data structures, not in memory, although some in-memory data structures
    (e.g. Judy arrays) use similar techniques to exploit caching.


    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
     
    Peter J. Holzer, May 13, 2012
    #6
  7. Keith Thompson <> writes:
    > Jürgen Exner <> writes:
    >> "George Mpouras" <> wrote:
    >> [...]
    >>>Also hashing algorithm is much slower than binary search,

    >>
    >> ???
    >>
    >> hash access is typically O(1) while binary search is O(log n). Why do
    >> you think hashing is slower than binary search?

    >
    > Hashing itself is not really an algorithm; it's a data structure.


    Hashing is an algorithm and not a data structure: Usually, it refers
    to 'calculate a "hash value"' (relatively small integer) from some
    (significantly) larger 'input data value'. Usually, this hash value is
    then used as index into a vector of pointers to locate 'a list' on
    which some kind of data item associated with this 'input data value'
    (key) should either exist or needs to be put on.
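
    That description maps directly onto a few lines of code. A toy sketch
    (the djb2-style constants 5381 and 33 are illustrative choices, not what
    perl itself uses):

    ```perl
    use strict;
    use warnings;

    # Toy hash function: reduce a string to a relatively small integer
    # (djb2-style; perl's real hash function is different).
    sub toy_hash {
        my ($key) = @_;
        my $h = 5381;
        $h = ($h * 33 + ord $_) % 2**32 for split //, $key;
        return $h;
    }

    # Use the hash value as an index into a table of lists (buckets).
    my $nbuckets = 8;
    my @table = map { [] } 1 .. $nbuckets;
    for my $key (qw(foo bar baz quux)) {
        push @{ $table[ toy_hash($key) % $nbuckets ] }, $key;
    }
    ```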
     
    Rainer Weikusat, May 13, 2012
    #7
  8. "George Mpouras" <>
    writes:

    [...]

    > hash is a data structure and there is a an internal logic of how to
    > retrieve values. If there is a big number of keys then hash may have
    > conflict keys; these conflicts are stored as lists where its retrieval
    > is slow (n). Retrieving btree values is much faster O(log n).


    This doesn't exactly make sense: Let's assume I'm storing 1024
    key-value pairs in a hash table and what I end up with are 128 lists of
    length 8. Ignoring the overhead for calculating the hash values, this
    means that I can locate each of these 1024 pairs with at most 8
    key comparisons. If the same 1024 key-value pairs were organized as
    some kind of balanced binary search tree, that would be log2(1024) =
    10 key comparisons. Since '128 pointers' isn't exactly a lot of data
    nowadays, it should be possible to double the size of the table and
    thus end up with 256 lists of length 4 and so on.

    NB: This is a theoretical consideration and 'in practice' things
    aren't that simple. But it should be sufficient to show that 'searching
    a linked list is slow while ...' doesn't hold water.
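
    The arithmetic in this example is easy to check; a quick sketch:

    ```perl
    use strict;
    use warnings;

    my $pairs = 1024;
    my $slots = 128;

    # Worst-case comparisons walking one collision chain:
    my $chain_comparisons = $pairs / $slots;                 # 1024/128 = 8

    # Comparisons descending a balanced binary search tree:
    my $tree_comparisons = int(log($pairs) / log(2) + 0.5);  # log2(1024) = 10

    printf "chain: %d comparisons, balanced BST: %d comparisons\n",
        $chain_comparisons, $tree_comparisons;
    ```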
     
    Rainer Weikusat, May 13, 2012
    #8
  9. Ben Morrow <> writes:
    > Quoth Rainer Weikusat <>:
    >> Keith Thompson <> writes:
    >> > Jürgen Exner <> writes:
    >> >> "George Mpouras" <> wrote:
    >> >> [...]
    >> >>>Also hashing algorithm is much slower than binary search,
    >> >>
    >> >> hash access is typically O(1) while binary search is O(log n). Why do
    >> >> you think hashing is slower than binary search?
    >> >
    >> > Hashing itself is not really an algorithm; it's a data structure.

    >>
    >> Hashing is an algorithm and not a data structure: Usually, it refers
    >> to 'calculate a "hash value"' (relatively small integer) from some
    >> (significantly) larger 'input data value'. Usually, this hash value is
    >> then used as index into a vector of pointers to locate 'a list' on
    >> which some kind of data item associated with this 'input data value'
    >> (key) should either exist or needs to be put on.

    >
    > I don't know about 'usually'. One of the more common uses of hash
    > algorithms is in message digests and digital signatures and such.


    The common use of hashing in perl is the implementation of so-called
    hashes (if you think that hashes in perl are sufficiently uncommon
    that referring to them as 'common' seems unwarranted, please feel
    free to argue for another definition of 'commonly used'). In this
    case, it reduces a sequence of bytes to an unsigned 32-bit value
    which is used as 'base value' for an index into a table of
    lists.

    So-called 'hash functions' have other uses than in lookup
    tables (such as for generating message digests which can then be
    'signed' by encrypting them with the private key of a public/private
    key pair for public key cryptography to produce a so-called 'digital
    signature', or to calculate hashed message authentication codes [HMAC]),
    but in Perl such uses are relatively rare, and in any case, this is
    beside the point in a discussion about 'fast/slow lookups'.

    > I think perhaps Keith meant to say 'A hash *table* is not an
    > algorithm, it's a data structure', which is entirely true.


    The term is commonly used but this is really just as sloppy as
    referring to 'the usual case' as 'the usual case': A 'hash table' is
    inherently in no way different from any other 'table' (for this
    definition of table), just the way it is being used differs.
     
    Rainer Weikusat, May 13, 2012
    #9
  10. On Sun, 13 May 2012 19:11:39 +0100, Rainer Weikusat wrote:

    > "George Mpouras" <>
    > writes:
    >
    > [...]
    >
    >> hash is a data structure and there is an internal logic of how to
    >> retrieve values. If there is a big number of keys then hash may have
    >> conflict keys; these conflicts are stored as lists where retrieval
    >> is slow (n). Retrieving btree values is much faster O(log n).

    >
    > This doesn't exactly make sense: Let's assume I'm storing 1024 key-value
    > pairs in a hash table and what I end up with are 128 lists of length 8.
    > Ignoring the overhead for calculating the hash values, this means that I
    > can locate each of these 1024 pairs with at most 8 key
    > comparisons.


    If you end up with so many lists of that length, something is seriously
    wrong with your hash table. It's either too small, or you have a very bad
    hashing function.

    > If the same 1024 key-value pairs were organized as some
    > kind of balanced binary search tree, that would be log2(1024) = 10 key
    > comparisons. Since '128 pointers' isn't exactly a lot of data nowadays,
    > it should be possible to double the size of the table and thus end up
    > with 256 lists of length 4 and so on.


    George was talking about a btree, not a binary tree. Those are fairly
    different beasts.

    HTH,
    M4
     
    Martijn Lievaart, May 14, 2012
    #10
  11. On 2012-05-14 06:47, Martijn Lievaart <> wrote:
    > On Sun, 13 May 2012 19:11:39 +0100, Rainer Weikusat wrote:
    >> "George Mpouras" <>
    >> writes:
    >>> hash is a data structure and there is an internal logic of how to
    >>> retrieve values. If there is a big number of keys then hash may have
    >>> conflict keys; these conflicts are stored as lists where retrieval
    >>> is slow (n). Retrieving btree values is much faster O(log n).

    >>
    >> This doesn't exactly make sense: Let's assume I'm storing 1024 key-value
    >> pairs in a hash table and what I end up with are 128 lists of length 8.
    >> Ignoring the overhead for calculating the hash values, this means that I
    >> can locate each of these 1024 pairs with at most 8 key
    >> comparisons.

    >
    > If you end up with so many lists of that length, something is seriously
    > wrong with your hash table. It's either too small, or you have a very bad
    > hashing function.


    The perl hash implementation always keeps the load factor <= 1, so a
    hash table with 1024 elements has at least 1024 slots. Of course it's
    possible that 896 of these slots are empty and 128 slots have 8 elements
    each, but as you say that would mean a very bad hash function or a
    deliberate attack. To guard against the latter possibility, perl >= 5.8
    (IIRC) monitors the chain length and changes the seed of the hash
    function if a chain grows too long.
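
    On perls of that era the bucket usage can be inspected directly: a
    non-empty hash in scalar context reports a 'used/total' bucket string
    (since 5.26 it simply returns the key count instead). A small sketch:

    ```perl
    use strict;
    use warnings;

    my %h = map { ("key$_" => 1) } 1 .. 1024;

    # Pre-5.26 perls print something like "775/1024" here (used buckets /
    # total buckets); 5.26 and later just print the number of keys.
    print "scalar context: ", scalar(%h), "\n";
    print "keys:           ", scalar(keys %h), "\n";
    ```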

    But I think Rainer was trying to come up with a worst-case scenario
    where a hash is still faster (requires fewer comparisons) than a binary
    tree, as the next paragraph shows:

    >> If the same 1024 key-value pairs were organized as some
    >> kind of balanced binary search tree, that would be log2(1024) = 10 key
    >> comparisons. Since '128 pointers' isn't exactly a lot of data nowadays,
    >> it should be possible to double the size of the table and thus end up
    >> with 256 lists of length 4 and so on.

    >
    > George was talking about a btree, not a binary tree. Those are fairly
    > different beasts.


    Yes, I noted that too in an earlier posting, but btrees are normally
    used for on-disk data structures and they don't implement a binary
    search, so it seems likely that George really meant binary trees, not
    btrees.

    hp


    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
     
    Peter J. Holzer, May 14, 2012
    #11
  12. On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:

    >> George was talking about a btree, not a binary tree. Those are fairly
    >> different beasts.

    >
    > Yes, I noted that too in an earlier posting, but btrees are normally
    > used for on-disk data structures and they don't implement a binary


    Btrees are used all over the place, not just on-disk. OTOH, in memory you
    are much more likely to actually use a red-black tree or similar, but the
    interface and performance characteristics are so similar to a btree that
    they are often called btrees as well.

    > search, so it seems likely that George really meant binary trees, not
    > btrees.


    Possible, even probable.

    M4
     
    Martijn Lievaart, May 14, 2012
    #12
  13. Martijn Lievaart <> writes:
    > On Sun, 13 May 2012 19:11:39 +0100, Rainer Weikusat wrote:
    >> "George Mpouras" <>
    >> writes:
    >>
    >> [...]
    >>
    >>> hash is a data structure and there is an internal logic of how to
    >>> retrieve values. If there is a big number of keys then hash may have
    >>> conflict keys; these conflicts are stored as lists where retrieval
    >>> is slow (n). Retrieving btree values is much faster O(log n).

    >>
    >> This doesn't exactly make sense: Let's assume I'm storing 1024 key-value
    >> pairs in a hash table and what I end up with are 128 lists of length 8.
    >> Ignoring the overhead for calculating the hash values, this means that I
    >> can locate each of these 1024 pairs with at most 8 key
    >> comparisons.

    >
    > If you end up with so many lists of that length, something is seriously
    > wrong with your hash table. It's either too small, or you have a very bad
    > hashing function.


    This was an example meant to illustrate that 'searching a linked
    list is slow while ...' doesn't make sense.

    Apart from that: If I have a fixed size table (128 slots in this
    example) and my key/value pairs are distributed evenly onto these 128
    slots (yielding 128 lists of length 8), my hash function quite
    obviously did the best job it is theoretically capable of doing.

    >> If the same 1024 key-value pairs were organized as some
    >> kind of balanced binary search tree, that would be log2(1024) = 10 key
    >> comparisons. Since '128 pointers' isn't exactly a lot of data nowadays,
    >> it should be possible to double the size of the table and thus end up
    >> with 256 lists of length 4 and so on.

    >
    > George was talking about a btree, not a binary tree. Those are fairly
    > different beasts.


    And I was writing about a data structure with the performance
    characteristics he mentioned. Even if he was really referring to 'a
    B-tree' and not just using btree as an abbreviation for 'binary tree'
    (which I very strongly suspect), that doesn't matter for this example.
     
    Rainer Weikusat, May 14, 2012
    #13
  14. Martijn Lievaart <> writes:
    > On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:
    >>> George was talking about a btree, not a binary tree. Those are fairly
    >>> different beasts.

    >>
    >> Yes, I noted that too in an earlier posting, but btrees are normally
    >> used for on-disk data structures and they don't implement a binary

    >
    > Btrees are used all over the place, not just on-disk. OTOH, in memory you
    > are much more likely to actually use a red-black tree or similar, but the
    > interface and performance characteristics are so similar to a btree that
    > they are often called btrees as well.


    A 'red-black tree' is a balanced, binary search tree with the
    'red-black' referring to a specific balancing algorithm.
     
    Rainer Weikusat, May 14, 2012
    #14
  15. On Mon, 14 May 2012 13:49:48 +0100, Rainer Weikusat wrote:

    > Martijn Lievaart <> writes:
    >> On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:
    >>>> George was talking about a btree, not a binary tree. Those are fairly
    >>>> different beasts.
    >>>
    >>> Yes, I noted that too in an earlier posting, but btrees are normally
    >>> used for on-disk data structures and they don't implement a binary

    >>
    >> Btrees are used all over the place, not just on-disk. OTOH, in memory
    >> you are much more likely to actually use a red-black tree or similar,
    >> but the interface and performance characteristics are so similar to a
    >> btree that they are often called btrees as well.

    >
    > A 'red-black tree' is a balanced, binary search tree with the
    > 'red-black' referring to a specific balancing algorithm.


    I actually meant skip lists (brain fart), but the statement is still not
    incorrect.

    M4
     
    Martijn Lievaart, May 14, 2012
    #15
  16. Martijn Lievaart <> writes:
    > On Mon, 14 May 2012 13:49:48 +0100, Rainer Weikusat wrote:
    >> Martijn Lievaart <> writes:
    >>> On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:
    >>>>> George was talking about a btree, not a binary tree. Those are fairly
    >>>>> different beasts.
    >>>>
    >>>> Yes, I noted that too in an earlier posting, but btrees are normally
    >>>> used for on-disk data structures and they don't implement a binary
    >>>
    >>> Btrees are used all over the place, not just on-disk. OTOH, in memory
    >>> you are much more likely to actually use a red-black tree or similar,
    >>> but the interface and performance characteristics are so similar to a
    >>> btree that they are often called btrees as well.

    >>
    >> A 'red-black tree' is a balanced, binary search tree with the
    >> 'red-black' referring to a specific balancing algorithm.

    >
    > I actually meant skip lists (brain fart), but the statement is still not
    > incorrect.


    Since a B-tree is a more general tree structure than a binary search
    tree, every binary search tree is also a B-tree, just not vice
    versa. While that's not consistent with a statement you made in another
    subthread, it is surely 'not incorrect'. But this "data structures I
    heard of!"-bingo is IMO pretty pointless.
     
    Rainer Weikusat, May 14, 2012
    #16
  17. On Mon, 14 May 2012 22:49:11 +0100, Rainer Weikusat wrote:

    > Martijn Lievaart <> writes:
    >> On Mon, 14 May 2012 13:49:48 +0100, Rainer Weikusat wrote:
    >>> Martijn Lievaart <> writes:
    >>>> On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:
    >>>>>> George was talking about a btree, not a binary tree. Those are
    >>>>>> fairly different beasts.
    >>>>>
    >>>>> Yes, I noted that too in an earlier posting, but btrees are normally
    >>>>> used for on-disk data structures and they don't implement a binary
    >>>>
    >>>> Btrees are used all over the place, not just on-disk. OTOH, in memory
    >>>> you are much more likely to actually use a red-black tree or similar,
    >>>> but the interface and performance characteristics are so similar to a
    >>>> btree that they are often called btrees as well.
    >>>
    >>> A 'red-black tree' is a balanced, binary search tree with the
    >>> 'red-black' referring to a specific balancing algorithm.

    >>
    >> I actually meant skip lists (brain fart), but the statement is still not
    >> incorrect.

    >
    > Since a B-tree is a more general tree structure than a binary search
    > tree, every binary search tree is also a B-tree, just not vice versa.
    > While that's not consistent with a statement you made in another
    > subthread, it is surely 'not incorrect'. But this "data structures I
    > heard of!"-bingo is IMO pretty pointless.


    Well, that was not my intention. Mind responding to the point?

    M4
     
    Martijn Lievaart, May 15, 2012
    #17
  18. Martijn Lievaart <> writes:
    > On Mon, 14 May 2012 22:49:11 +0100, Rainer Weikusat wrote:
    >
    >> Martijn Lievaart <> writes:
    >>> On Mon, 14 May 2012 13:49:48 +0100, Rainer Weikusat wrote:
    >>>> Martijn Lievaart <> writes:
    >>>>> On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:
    >>>>>>> George was talking about a btree, not a binary tree. Those are
    >>>>>>> fairly different beasts.
    >>>>>>
    >>>>>> Yes, I noted that too in an earlier posting, but btrees are normally
    >>>>>> used for on-disk data structures and they don't implement a binary
    >>>>>
    >>>>> Btrees are used all over the place, not just on-disk. OTOH, in memory
    >>>>> you are much more likely to actually use a red-black tree or similar,
    >>>>> but the interface and performance characteristics are so similar to a
    >>>>> btree that they are often called btrees as well.
    >>>>
    >>>> A 'red-black tree' is a balanced, binary search tree with the
    >>>> 'red-black' referring to a specific balancing algorithm.
    >>>
    >>> I actually meant skip lists (brain fart), but the statement is still not
    >>> incorrect.

    >>
    >> Since a B-tree is a more general tree structure than a binary search
    >> tree, every binary search tree is also a B-tree, just not vice versa.
    >> While that's not consistent with a statement you made in another
    >> subthread, it is surely 'not incorrect'. But this "data structures I
    >> heard of!"-bingo is IMO pretty pointless.

    >
    > Well, that was not my intention. Mind responding to the point?


    Which point?
     
    Rainer Weikusat, May 15, 2012
    #18
  19. On Tue, 15 May 2012 12:11:40 +0100, Rainer Weikusat wrote:

    > Which point?


    Never mind. I don't feel like explaining everything.

    M4
     
    Martijn Lievaart, May 15, 2012
    #19
  20. Martijn Lievaart <> writes:
    > On Tue, 15 May 2012 12:11:40 +0100, Rainer Weikusat wrote:
    >> Which point?

    >
    > Never mind. I don't feel like explaining everything.


    And I 'feel' that you neither wrote anything even remotely related to
    the original lookup question nor anything which could be considered
    'a point' at all, just a bunch of rambling statements about various
    data structures. This may be wrong, but if you can't be bothered to
    express yourself clearly in the face of a misunderstanding, whatever
    you meant to express doesn't matter.
     
    Rainer Weikusat, May 16, 2012
    #20
