performance of HashSet with strings and integers

Discussion in 'Java' started by Frederik, Oct 7, 2009.

  1. Frederik

    Frederik Guest

    Hi,

    I thought of replacing strings with integers in some code I wrote,
    because I believed it would benefit performance. But before doing so,
    I made a small test class:

    public class testStringSetVsIntSet {
    public static void main(String[] args) {
    long time;
    boolean b;
    Set set;

    Integer I1 = new Integer(100), I2 = new Integer(500);

    set = new HashSet();
    set.add(I1);
    set.add(900);

    time = System.currentTimeMillis();
    for (int i=0; i<50000000; i++) {
    b = set.contains(I1);
    b = set.contains(I2);
    }

    time = System.currentTimeMillis() - time;
    System.out.println("Time 1: " + time);

    String headStr = "Head";
    String agentStr = "Agent";
    String qualifStr = "Qualif";

    set = new HashSet();
    set.add(headStr);
    set.add(agentStr);

    time = System.currentTimeMillis();
    for (int i=0; i<50000000; i++) {
    b = set.contains(headStr);
    b = set.contains(qualifStr);
    }

    time = System.currentTimeMillis() - time;
    System.out.println("Time 2: " + time);
    }
    }

    But to my surprise, the second loop with the strings appeared to be
    twice as fast as the first one with the integers! (first loop 3
    seconds, second 1.5 seconds)
    I didn't expect this because calculating the hashcode for a string is
    normally slower than for an integer (every string character is taken
    into account) and I thought the "equals" method for a string should be
    slower than for an Integer as well.
    Can anybody explain this to me?
     
    Frederik, Oct 7, 2009
    #1
    1. Advertising

  2. Frederik wrote:
    > Hi,
    >
    > I thought of replacing strings with integers in some code I wrote,
    > because I believed it would benefit performance. But before doing
    > so,
    > I made a small test class:
    >
    > public class testStringSetVsIntSet {
    > public static void main(String[] args) {
    > long time;
    > boolean b;
    > Set set;
    >
    > Integer I1 = new Integer(100), I2 = new Integer(500);
    >
    > set = new HashSet();
    > set.add(I1);
    > set.add(900);
    >
    > time = System.currentTimeMillis();
    > for (int i=0; i<50000000; i++) {
    > b = set.contains(I1);
    > b = set.contains(I2);
    > }
    >
    > time = System.currentTimeMillis() - time;
    > System.out.println("Time 1: " + time);
    >
    > String headStr = "Head";
    > String agentStr = "Agent";
    > String qualifStr = "Qualif";
    >
    > set = new HashSet();
    > set.add(headStr);
    > set.add(agentStr);
    >
    > time = System.currentTimeMillis();
    > for (int i=0; i<50000000; i++) {
    > b = set.contains(headStr);
    > b = set.contains(qualifStr);
    > }
    >
    > time = System.currentTimeMillis() - time;
    > System.out.println("Time 2: " + time);
    > }
    > }
    >
    > But to my surprise, the second loop with the strings appeared to be
    > twice as fast as the first one with the integers! (first loop 3
    > seconds, second 1.5 seconds)
    > I didn't expect this because calculating the hashcode for a string
    > is
    > normally slower than for an integer (every string character is taken
    > into account) and I thought the "equals" method for a string should
    > be
    > slower than for an Integer as well.
    > Can anybody explain this to me?


    1. Since Strings are immutable, their hash code is calculated only
    once and then stored in the String object.
    2. Since your strings are of different lengths, they're discovered to
    be unequal almost immediately (the second thing equals() checks is the
    number of characters).
    3. When contains() returns true, its parameter is nott just equal to
    the one in the set but the same object; thus it's discovered to be
    equal even faster (since the first thing equals() checks is "this ==
    param").
    4.This one is just a guess, but I wouldn't be surprised if the Integer
    version became faster if you reversed the order. In general, Java
    gets faster as it goes and the JIT can optimize the code.
     
    Mike Schilling, Oct 7, 2009
    #2
    1. Advertising

  3. On 7 oct, 16:06, Patricia Shanahan <> wrote:
    >
    > The equals method only gets called if the HashSet contains an element
    > whose hash code is equal to the probe's hash code but that is not the
    > same object as the probe. That is very unlikely in a test with so few
    > objects.


    Not really. If I remember correctly, the equals is always called
    (because same
    hash not imply same object). But in this case, the equals first test
    that the
    references are the same. As the Strings of the example are all in the
    constant pool,
    the references are the same, and so the equals test is fast.

    >
    > I strongly suspect that your real use of the HashSet is significantly
    > different from your benchmark. The large number of contains calls with
    > the same key, the small number of distinct strings, and the lack of any
    > case in which you have two distinct but equal objects may all affect the
    > results.
    >


    +1
     
    Pascal Lecointe, Oct 7, 2009
    #3
  4. Patricia Shanahan wrote:
    >
    > The equals method only gets called if the HashSet contains an
    > element
    > whose hash code is equal to the probe's hash code but that is not
    > the
    > same object as the probe. That is very unlikely in a test with so
    > few
    > objects.


    Does HashSet check == rather than letting the equals() method do it?
    So it does. You learn something new every day. Though in the case,
    as I mentioned earlier, String.equals() would be as fast as
    Integer.equals().
     
    Mike Schilling, Oct 7, 2009
    #4
  5. Frederik

    Frederik Guest

    > 4.This one is just a guess, but I wouldn't be surprised if the Integer
    > version became faster if you reversed the order.  In general, Java
    > gets faster as it goes and the JIT can optimize the code.


    I tried that, but that is not the case, the version with the strings
    remains twice as fast.

    Thank you both for your help.
     
    Frederik, Oct 7, 2009
    #5
  6. Frederik wrote:
    >> 4.This one is just a guess, but I wouldn't be surprised if the
    >> Integer version became faster if you reversed the order. In
    >> general,
    >> Java gets faster as it goes and the JIT can optimize the code.

    >
    > I tried that, but that is not the case, the version with the strings
    > remains twice as fast.


    In that case, I'd guess you picked numbers that hash to the same
    bucket. Yup. The initial capacity of a HashSet is 16, and all the
    numbers equal 0 mod 16.
     
    Mike Schilling, Oct 7, 2009
    #6
  7. Frederik

    Lew Guest

    Patricia Shanahan wrote:
    >> The equals method only gets called if the HashSet contains an element
    >> whose hash code is equal to the probe's hash code but that is not the
    >> same object as the probe. That is very unlikely in a test with so few
    >> objects.

    >


    Pascal Lecointe erroneously claimed:
    > Not really. If I remember correctly, the equals is always called


    You don't remember correctly.

    > (because same hash not imply same object).


    No, but different hash codes do imply different objects. Note that
    Patricia pointed out that 'equals()' is only called if the hash codes
    are the same. It is not called if the hash codes differ, because in
    that case 'equals()' would always return 'false', so there's no point.

    > But in this case, the equals first test
    > that the
    > references are the same. As the Strings of the example are all in the
    > constant pool,
    > the references are the same, and so the equals test is fast.
    >


    Really, really fast in the OP's example, because 'equals()' is not
    called.

    It has nothing to do with the Strings being in the constant pool.

    --
    Lew
     
    Lew, Oct 7, 2009
    #7
  8. Frederik

    Lew Guest

    On Oct 7, 10:59 am, Patricia Shanahan <> wrote:
    > Pascal Lecointe wrote:
    > > On 7 oct, 16:06, Patricia Shanahan <> wrote:
    > >> The equals method only gets called if the HashSet contains an element
    > >> whose hash code is equal to the probe's hash code but that is not the
    > >> same object as the probe. That is very unlikely in a test with so few
    > >> objects.

    >
    > > Not really. If I remember correctly, the equals is always called
    > > (because same
    > > hash not imply same object). But in this case, the equals first test
    > > that the
    > > references are the same. As the Strings of the example are all in the
    > > constant pool,
    > > the references are the same, and so the equals test is fast.

    >
    > The match test in HashMap's getEntry is such that either hash code
    > mismatch or reference equality prevents execution of the equals call.
    > The implementation of the HashSet contains uses the HashMap containsKey,
    > which uses getEntry.
    >


    Note also that typical implementations of 'equals()' that do not do
    mere reference equality do check reference equality first, and only
    compare attributes if the references differ. Of course, there's no
    guarantee that any particular 'equals()' implementation does this. In
    the case of the logic just described, it may well be that reference
    equality is checked twice - once by the 'Map' and once by the element
    - before value equality is finally invoked.

    To review:

    HashMap equality checks:
    hashCode(): if unequal, no need to continue
    reference == : if 'true', no need to continue
    value equality: return result

    --
    Lew
     
    Lew, Oct 7, 2009
    #8
  9. Frederik

    Roedy Green Guest

    On Wed, 7 Oct 2009 06:48:14 -0700 (PDT), Frederik
    <> wrote, quoted or indirectly quoted someone who
    said :

    >Set set;
    >Integer I1 = new Integer(100), I2 = new Integer(500);
    >set = new HashSet();


    Aside from the main thrust of your question, you should code this with
    generics and autoboxing like this:

    final Set<Integer> set = new HashSet<Integer>(7);
    // give it a size estimate.
    Integer i1 = 100; // follow caps naming conventions
    Integer i2 = 500; // use autoboxing for brevity
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    I advocate that super programmers who can juggle vastly more complex balls than average guys can, should be banned, by management, from dragging the average crowd into system complexity zones where the whole team will start to drown.
    ~ Jan V.
     
    Roedy Green, Oct 15, 2009
    #9
  10. In article
    <>,
    Frederik <> wrote:

    > Hi,
    >
    > I thought of replacing strings with integers in some code I wrote,
    > because I believed it would benefit performance. But before doing so,
    > I made a small test class:
    >
    > public class testStringSetVsIntSet {
    > public static void main(String[] args) {
    > long time;
    > boolean b;
    > Set set;
    >
    > Integer I1 = new Integer(100), I2 = new Integer(500);
    >
    > set = new HashSet();
    > set.add(I1);
    > set.add(900);
    >
    > time = System.currentTimeMillis();
    > for (int i=0; i<50000000; i++) {
    > b = set.contains(I1);
    > b = set.contains(I2);
    > }
    >
    > time = System.currentTimeMillis() - time;
    > System.out.println("Time 1: " + time);
    >
    > String headStr = "Head";
    > String agentStr = "Agent";
    > String qualifStr = "Qualif";
    >
    > set = new HashSet();
    > set.add(headStr);
    > set.add(agentStr);
    >
    > time = System.currentTimeMillis();
    > for (int i=0; i<50000000; i++) {
    > b = set.contains(headStr);
    > b = set.contains(qualifStr);
    > }
    >
    > time = System.currentTimeMillis() - time;
    > System.out.println("Time 2: " + time);
    > }
    > }
    >
    > But to my surprise, the second loop with the strings appeared to be
    > twice as fast as the first one with the integers! (first loop 3
    > seconds, second 1.5 seconds)
    > I didn't expect this because calculating the hashcode for a string is
    > normally slower than for an integer (every string character is taken
    > into account) and I thought the "equals" method for a string should be
    > slower than for an Integer as well.
    > Can anybody explain this to me?


    A few things:

    - Strings cache their hash value in recent JVMs.
    - The performance of hashing varies for individual keys based on
    collisions and locations within the collision chain.
    - All of your map hits can use the fast '==' test rather than equals(),
    making collisions an unusually large factor of performance.
    - Hotspot was compiling during loop 1. Call the test method twice and
    the second results will probably be different.

    nitpick: Integer.valueOf(n) and autoboxing returns pooled objects for a
    range of values so it's sometimes much more efficient than the Integer
    constructor. It's not in the loop so it doesn't matter here.
    --
    I won't see Goolge Groups replies because I must filter them as spam
     
    Kevin McMurtrie, Oct 15, 2009
    #10
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. La'ie Techie
    Replies:
    0
    Views:
    417
    La'ie Techie
    Sep 26, 2003
  2. Anony!

    HashSet is strange

    Anony!, Aug 13, 2004, in forum: Java
    Replies:
    8
    Views:
    600
    Timo Kinnunen
    Aug 29, 2004
  3. zero

    HashSet performance

    zero, Nov 12, 2005, in forum: Java
    Replies:
    14
    Views:
    14,514
  4. Ye Dafeng

    HashSet and TreeSet

    Ye Dafeng, Nov 15, 2006, in forum: Java
    Replies:
    4
    Views:
    1,677
    Mike Schilling
    Nov 16, 2006
  5. Marcin

    TreeSet and HashSet

    Marcin, Feb 2, 2007, in forum: Java
    Replies:
    18
    Views:
    7,208
Loading...

Share This Page