HashMap vs linear table lookup

Discussion in 'Java' started by Roedy Green, Feb 15, 2008.

  1. Roedy Green

    Roedy Green Guest

    For sufficiently small tables, a linear array lookup to find a string
    would be faster than a HashMap. Has anyone experimented to get a rough
    idea where the break-even point is?
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Feb 15, 2008
    #1
    1. Advertising

  2. Roedy Green

    Mark Space Guest

    Roedy Green wrote:
    > For sufficiently small tables, a linear array lookup to find a string
    > would be faster than a HashMap. Has anyone experimented to get a rough
    > idea where the break-even point is?



    If you try to measure this, I think it will depend on the length of the
    Strings. The hashing algorithm for Strings used to only pick a few
    characters to "save time." This didn't work well for large hash tables
    since many similar strings would have the letters chosen in common,
    resulting in poor hash distribution. So now I think every letter in a
    String is used when computing the hash.

    So anyway, yeah, don't forget about string size.
     
    Mark Space, Feb 15, 2008
    #2
    1. Advertising

  3. Roedy Green

    Christian Guest

    Roedy Green schrieb:
    > For sufficiently small tables, a linear array lookup to find a string
    > would be faster than a HashMap. Has anyone experimented to get a rough
    > idea where the break-even point is?
    > --
    >
    > Roedy Green Canadian Mind Products
    > The Java Glossary
    > http://mindprod.com


    well Intresting.. hope you accept this as valid test:

    public class TestString {

    /**
    * @param args
    */
    public static void main(String[] args) {

    for (int size=1;size < 100; size++) {
    if (testHashMap(size) <= testArray(size)) {
    System.out.println("hashmap wins at: "+size);
    }
    }


    }


    private static long testHashMap(int size) {
    Set<String> map = new HashSet<String>();
    for (int i=0 ; i < size; i++) {
    map.add(i+"+"+(-i)+"= 0");
    }

    long time = System.nanoTime();
    for (int repetitions=0; repetitions < 10000; repetitions++ ) {
    for (int i=0;i < size; i++) {
    if (!map.contains(i+"+"+(-i)+"= 0")) {
    throw new IllegalStateException();
    }
    }
    }
    return System.nanoTime()-time;
    }

    private static long testArray(int size) {
    String[] array = new String[size];
    for (int i=0 ; i < size; i++) {
    array =i+"+"+(-i)+"= 0";
    }

    long time = System.nanoTime();
    for (int repetitions=0; repetitions < 10000; repetitions++ ) {
    for (int i=0;i < size; i++) {
    String onSearch = i+"+"+(-i)+"= 0";
    for (int x=0; x < size; x++) {
    if (onSearch.equals(array[x])) {
    break;
    }
    }
    }
    }
    return System.nanoTime()-time;
    }
    }

    on my machine with 5 elements it seems about even .. sometimes the
    HashSet wins .. and sometimes the array.. after that HashSet is the winner.

    Christian
     
    Christian, Feb 15, 2008
    #3
  4. Roedy Green

    Roedy Green Guest

    On Fri, 15 Feb 2008 19:31:58 GMT, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >For sufficiently small tables, a linear array lookup to find a string
    >would be faster than a HashMap. Has anyone experimented to get a rough
    >idea where the break-even point is?
    >--


    I wrote this code to experiment. Turns out binary search is the pits.
    HashSet starts to get better around 11 items.
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Feb 15, 2008
    #4
  5. Roedy Green

    Roedy Green Guest

    On Fri, 15 Feb 2008 20:29:30 GMT, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >I wrote this code to experiment. Turns out binary search is the pits.
    >HashSet starts to get better around 11 items.


    The sample keys are nicely scrambled to make HashSet happier than it
    deserves to be.


    package com.mindprod.example;

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Random;

    /*
    * Test relative speed of 3 methods of looking up a key to see if it is
    is the list of legal keys.
    * HashSet, binary search, linear search for different numbers of keys
    n.
    * <p/>
    * composed with IntelliJ IDEA
    *
    * @author Roedy Green, Canadian Mind Products
    * @version 1.0, 2008-02-15
    */
    public class TestFinding
    {
    // ------------------------------ FIELDS
    ------------------------------

    /**
    * lookup the random strings
    */
    private static HashSet<String> h;

    /**
    * random number generator
    */
    private final static Random wheel = new Random();

    /**
    * list of strings we want to look up
    */
    private static String[] keys;

    // -------------------------- STATIC METHODS
    --------------------------

    /**
    * build lookup structure
    */
    private static void build()
    {
    // build HashSet
    h = new HashSet<String>( Arrays.asList( keys ) );

    // build binary search
    Arrays.sort( keys );
    }

    /**
    * find all the keys by BinarySearch
    */
    private static void findAllViaBinarySearch()
    {
    for ( String lookFor : keys )
    {
    int whereFound = Arrays.binarySearch( keys, lookFor );
    // -ve means not found +ve tells where found.
    }
    }

    /**
    * find all the keys by HashSet
    */
    private static void findAllViaHashSet()
    {
    for ( String lookFor : keys )
    {
    boolean found = h.contains( lookFor );
    }
    }

    /**
    * find all the keys by linear search
    */
    private static void findAllViaLinearSearch()
    {
    for ( String lookFor : keys )
    {
    boolean found = false;
    for ( String candidate : keys )
    {
    if ( lookFor.equals( candidate ) )
    {
    found = true;
    break;
    }
    }
    }
    }

    /**
    * generate N random keys and their lookup structure
    */
    private static void generate( int n )
    {
    keys = new String[n];
    for ( int i = 0; i < n; i++ )
    {
    keys[ i ] = Long.toString( wheel.nextLong() );
    }
    }

    // --------------------------- main() method
    ---------------------------

    /**
    * main method
    *
    * @param args not used
    */
    public static void main( String[] args )
    {

    for ( int n = 1; n <= 10000; n *= 10 )
    {
    generate( n );
    build();
    final long t1 = System.nanoTime();
    findAllViaBinarySearch();
    final long t2 = System.nanoTime();
    findAllViaHashSet();
    final long t3 = System.nanoTime();
    findAllViaLinearSearch();
    final long t4 = System.nanoTime();

    System.out
    .println( "n:" +
    n +
    " binary search: " +
    ( t2 - t1 ) +
    " HashSet: " +
    ( t3 - t2 ) +
    " Linear: " +
    ( t4 - t3 ) );
    }
    // Here in output on my machine With JDK 1.6.0_04
    //Linear is best for 10 or fewer, then HashSet.
    // Binary search is never optimal.
    // n:1 binary search: 21600 HashSet: 9720 Linear: 6400*
    // n:10 binary search: 58720 HashSet: 11360 Linear: 11120*
    // n:100 binary search: 707320 HashSet: 97520* Linear: 695960
    // n:1000 binary search: 1294360 HashSet: 1255480* Linear:
    10911040
    // n:10000 binary search: 9823600 HashSet: 3920200* Linear:
    1513846600

    // Here in output on my machine with Jet
    // n:1 binary search: 5520 HashSet: 4760 Linear: 1240*
    // n:10 binary search: 3560 HashSet: 2480 Linear: 2400*
    // n:100 binary search: 28160 HashSet: 5480* Linear: 44840
    // n:1000 binary search: 408840 HashSet: 43560* Linear:
    7862960
    // n:10000 binary search: 9142440 HashSet: 2295320* Linear:
    1852690520
    }
    }
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Feb 15, 2008
    #5
  6. Roedy Green

    Lord Zoltar Guest

    On Feb 15, 3:14 pm, Mark Space <> wrote:
    > Roedy Green wrote:
    > > For sufficiently small tables, a linear array lookup to find a string
    > > would be faster than a HashMap. Has anyone experimented to get a rough
    > > idea where the break-even point is?

    >
    > If you try to measure this, I think it will depend on the length of the
    > Strings.  The hashing algorithm for Strings used to only pick a few
    > characters to "save time."  This didn't work well for large hash tables
    > since many similar strings would have the letters chosen in common,
    > resulting in poor hash distribution.  So now I think every letter in a
    > String is used when computing the hash.
    >
    > So anyway, yeah, don't forget about string size.


    It would probably also be possible to write your own hash function, if
    you know more about the input (such as avg length of strings, if the
    strings only use a certain subset of all characters, etc...). Doing
    that might provide better performance, in certain cases. You'd just
    have to try it to see ;)
     
    Lord Zoltar, Feb 15, 2008
    #6
  7. Roedy Green wrote:
    ....
    > The sample keys are nicely scrambled to make HashSet happier than it
    > deserves to be.

    ....

    Although the approximate break-even point is interesting, if I had a
    performance problem involving this sort of lookup I would do my own
    experiments, using typical table contents and probe sequences from my
    application.

    Patricia
     
    Patricia Shanahan, Feb 15, 2008
    #7
  8. Roedy Green

    Stefan Ram Guest

    Lord Zoltar <> writes:
    >It would probably also be possible to write your own hash function,


    Here is an example, using a lookup indexed by the final
    two letters of the month TLA. (This might actually be a
    perfect hash, I am not sure about this.)

    A small table might be used instead of the »switch« in »show«.
    This is no hash map, but a direct array look up by a special
    hash function, I believe O(1).

    public class Main
    {
    int[] v;

    public Main()
    { v = new int[ 256 ]; for( int i = 0; i < 256; ++i )v[ i ]=15;
    v[ 97 ]= 0; v[ 98 ]= 9; v[ 99 ]= 4; v[ 101 ]= 2; v[ 103 ]= 9;
    v[ 108 ]= 8; v[ 110 ]= 2; v[ 111 ]= 4; v[ 112 ]= 3; v[ 114 ]= 1;
    v[ 116 ]= 3; v[ 117 ]= 1; v[ 118 ]= 4; v[ 121 ]= 0; }

    int hash( final java.lang.String s )
    { return v[ s.charAt( 1 )]+v[ s.charAt( 2 )] ; }

    void show( final java.lang.String s )
    { switch ( hash( s ))
    { case 4: case 3: case 5: case 8:
    java.lang.System.out.println( 30 ); break;
    case 11: break;
    default: java.lang.System.out.println( 31 ); break; }}

    public static void main( final java.lang.String[] args )
    { new Main().show( "Jun" ); new Main().show( "Oct" ); }}

    It prints:

    30
    31
     
    Stefan Ram, Feb 15, 2008
    #8
  9. Roedy Green wrote:
    > For sufficiently small tables, a linear array lookup to find a
    > string
    > would be faster than a HashMap. Has anyone experimented to get a
    > rough
    > idea where the break-even point is?


    I presume that it's small enough that, in all nromal cases, they're
     
    Mike Schilling, Feb 16, 2008
    #9
  10. Roedy Green

    Roedy Green Guest

    On Fri, 15 Feb 2008 20:31:19 GMT, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    > for ( int n = 1; n <= 10000; n *= 10 )


    at n = 100,000 the benchmark becomes impossibly slow. The linear
    search total time is o(n^2).
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Feb 16, 2008
    #10
  11. Roedy Green

    none Guest

    Roedy Green wrote:
    > For sufficiently small tables, a linear array lookup to find a string
    > would be faster than a HashMap. Has anyone experimented to get a rough
    > idea where the break-even point is?
    > --
    >
    > Roedy Green Canadian Mind Products
    > The Java Glossary
    > http://mindprod.com


    HashSet have to compute the hashCode of all Strings on the whole
    content, but hashCode is compute only once per String.
    String.equals first use length and then each caracter in order to return
    result. You have to do it each time in linear comparison.

    Basically,
    if n is the total number of strings
    p is the avg length,
    hashset is O(n*p) to O(pn²), worse case if hash function is poor.
    linear is O(pn²) to O(n²), best case if strings are very different
    lengths or beginings.

    Hence you have to profile your dataset (avg length, number of unique,
    similarity, etc.) to know the break point.
     
    none, Feb 18, 2008
    #11
  12. Roedy Green

    Lew Guest

    none wrote:
    > HashSet have to compute the hashCode of all Strings on the whole
    > content, but hashCode is compute only once per String.
    > String.equals first use length and then each caracter in order to return
    > result. You have to do it each time in linear comparison.


    The HashSet has to compute the hash exactly once for each entry when it's
    inserted, as you seem to say.

    > if n is the total number of strings
    > p is the avg length,
    > hashset [sic] is O(n*p) to O(pn²), worse case if hash function is poor.


    The String hashCode() computation should be reasonable for most purposes. But
    even if the hash were worst-case (e.g., returned 17 for every String), there'd
    still only be 'n' comparisons when doing a lookup.

    And in big-O notation the 'p' factor goes away.

    > linear is O(pn²) to O(n²), best case if strings are very different
    > lengths or beginings.


    How are you getting n squared? A linear lookup only needs to make n
    comparisons to find a single input.

    > Hence you have to profile your dataset (avg length, number of unique,
    > similarity, etc.) to know the break point.


    On lookup, i.e., to compare if a given input is already in the Set, you only
    compute the hash for the input, not for the elements in the Set already. The
    range for HashSet lookups should be O(1) to O(n) depending on the hash
    quality. (O(n) occurring when all entries hash to the same value, which
    devolves the HashSet into the linear lookup.)

    --
    Lew
     
    Lew, Feb 19, 2008
    #12
  13. Roedy Green

    Roedy Green Guest

    On Mon, 18 Feb 2008 22:10:04 +0100, none <""mrloyal\"@(none)"> wrote,
    quoted or indirectly quoted someone who said :

    >HashSet have to compute the hashCode of all Strings on the whole
    >content, but hashCode is compute only once per String.


    So you will compute it to put the strings in the HashSet, and you will
    compute it for each key to look them up. You don't need to compute it
    when you chase a HashSet chain to compare colliding strings. They are
    pre-computed.
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Feb 21, 2008
    #13
  14. Roedy Green

    Boris Stumm Guest

    Roedy Green wrote:
    > keys[ i ] = Long.toString( wheel.nextLong() );


    What about intern'ed strings, i.e. Long.toString().intern()?

    If string creation is rare and lookup is often needed, one
    would probably do this, and then an .equals() would nearly
    behave like an ==, which should much faster for the linear
    case.
     
    Boris Stumm, Feb 21, 2008
    #14
  15. Roedy Green

    EJP Guest

    >> HashSet have to compute the hashCode of all Strings on the whole
    >> content, but hashCode is compute only once per String.

    >
    > So you will compute it to put the strings in the HashSet, and you will
    > compute it for each key to look them up.

    As the poster said, the hashCode for a String is computed once per
    String. It is not recomputed every time you call String.hashCode().
     
    EJP, Feb 21, 2008
    #15
  16. Roedy Green

    Lew Guest

    EJP wrote:
    > As the poster said, the hashCode for a String is computed once per
    > String. It is not recomputed every time you call String.hashCode().


    Are you sure of that? The Javadocs don't mention that.

    However, it is true that the Map does not invoke hashCode() but once for each
    key, upon insertion, and once for the query object upon lookup.

    P.S., please attribute your citations.

    --
    Lew
     
    Lew, Feb 21, 2008
    #16
  17. Lew wrote:
    > EJP wrote:
    >> As the poster said, the hashCode for a String is computed once per
    >> String. It is not recomputed every time you call String.hashCode().

    >
    > Are you sure of that? The Javadocs don't mention that.


    I guess java implementations are not required to
    but the sun implementation caches the hash
    (kinda changing the internal state of an immutable object...)

    It is only recomputed every time for the empty String and for any other
    String that has a hash of 0(zero).
     
    Thomas Schodt, Feb 21, 2008
    #17
  18. Roedy Green

    Lew Guest

    Thomas Schodt wrote:
    > Lew wrote:
    >> EJP wrote:
    >>> As the poster said, the hashCode for a String is computed once per
    >>> String. It is not recomputed every time you call String.hashCode().

    >>
    >> Are you sure of that? The Javadocs don't mention that.

    >
    > I guess java implementations are not required to
    > but the sun implementation caches the hash
    > (kinda changing the internal state of an immutable object...)


    Actually, the immutability of an object only applies to its external state.
    Lazy instantiation of things like hashCode() or (for classes other than
    String) the toString() result is a standard idiom. One hopes that the String
    implementation of hashCode() is suitably thread-safe, otherwise they are
    changing the external state.

    --
    Lew
     
    Lew, Feb 21, 2008
    #18
  19. "Lew" <> wrote in message
    news:...
    > Thomas Schodt wrote:
    >> Lew wrote:
    >>> EJP wrote:
    >>>> As the poster said, the hashCode for a String is computed once
    >>>> per String. It is not recomputed every time you call
    >>>> String.hashCode().
    >>>
    >>> Are you sure of that? The Javadocs don't mention that.

    >>
    >> I guess java implementations are not required to
    >> but the sun implementation caches the hash
    >> (kinda changing the internal state of an immutable object...)

    >
    > Actually, the immutability of an object only applies to its external
    > state. Lazy instantiation of things like hashCode() or (for classes
    > other than String) the toString() result is a standard idiom. One
    > hopes that the String implementation of hashCode() is suitably
    > thread-safe, otherwise they are changing the external state.


    What would "suitably thread-safe" require here? The external behavior
    is that all calls to hashCode() should return the same value. I think
    this algorithm works fine:

    1. define a private int _hashCode;
    2. on a call to hashCode(), check the value of _hashCode. If it's
    non-zero, return it.
    3. otherwise, calculate the hash code of the string.
    4. store this value in _hashcode
    5. return _hashcode

    The only thread-safety issue is that the store of _hashcode in step 4
    be atomic, and I think that that's guaranteed by the JVM.. You could
    minimize the number of hash code calculations by locking between steps
    2 and 4, ensuring that all threads will see the changed value of
    _hashcode once it's ready, but that's merely an optimization.
     
    Mike Schilling, Feb 21, 2008
    #19
  20. EJP <> writes:

    >> So you will compute it to put the strings in the HashSet, and you
    >> will compute it for each key to look them up.


    > As the poster said, the hashCode for a String is computed once per
    > String. It is not recomputed every time you call String.hashCode().


    Even if they are computed every time, you don't need to know the hashCode
    of elements in the HashSet to find them.

    When you add an element to a HashSet/HashMap, the hash code is used
    once, to pick a bucket, and the element is put into the linked list in
    that bucket.

    To look up an element, you use the hash code of the key/element to
    search for, find a bucket based on that, and traverse the linked list
    to find one that .equals() what you search for.

    If the HashSet/HashMap grows its internal table, it needs the hash
    codes again to put elements into new buckets.


    The implementation of String in the Sun standard library does
    cache the hashCode (unless it happens to be 0, which they take
    as an uninitialized cache - but that's documented :).

    /L
    --
    Lasse Reichstein Nielsen -
    DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
    'Faith without judgement merely degrades the spirit divine.'
     
    Lasse Reichstein Nielsen, Feb 21, 2008
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. gerry
    Replies:
    0
    Views:
    561
    gerry
    Apr 24, 2004
  2. Vince Darley
    Replies:
    4
    Views:
    4,504
    emilchacko
    Mar 2, 2010
  3. Replies:
    1
    Views:
    1,192
    Roedy Green
    Nov 15, 2005
  4. Ian Pilcher

    Hash table: linear probe vs. chaining

    Ian Pilcher, Dec 9, 2005, in forum: Java
    Replies:
    18
    Views:
    9,524
    Charles Richmond
    Dec 13, 2005
  5. Rakesh
    Replies:
    10
    Views:
    12,231
    Mike Schilling
    Apr 8, 2008
Loading...

Share This Page