which collection to use

Discussion in 'Java' started by gaurav v bagga, Dec 22, 2006.

  1. hi all,

    I have key/value pairs of a structure and I want to store them in a
    collection. Which collection will be good with respect to retrieval
    speed?

    I found out from the net that HashMap is better than TreeMap in
    retrieval speed.


    // needs: import java.util.HashMap; import java.util.TreeMap;
    HashMap<String, String> hm = new HashMap<String, String>();
    hm.put("1", "r");
    hm.put("2", "r");
    hm.put("3", "r");
    hm.put("4", "r");
    hm.put("5", "r");
    hm.put("6", "r");
    hm.put("7", "rd");
    hm.put("8", "r");
    hm.put("9", "r");
    hm.put("10", "r");
    hm.put("11", "r");
    hm.put("12", "r");
    hm.put("13", "r");
    hm.put("14", "rd");

    long time=System.nanoTime();
    System.out.println(hm.get("10"));
    System.out.println(System.nanoTime()-time);

    TreeMap<String, String> tm = new TreeMap<String, String>();
    tm.put("1", "r");
    tm.put("2", "r");
    tm.put("3", "r");
    tm.put("4", "r");
    tm.put("5", "r");
    tm.put("6", "r");
    tm.put("7", "rd");
    tm.put("8", "r");
    tm.put("9", "r");
    tm.put("10", "r");
    tm.put("11", "r");
    tm.put("12", "r");
    tm.put("13", "r");
    tm.put("14", "rd");

    time=System.nanoTime();
    System.out.println(tm.get("10"));
    System.out.println(System.nanoTime()-time);


    but this code of mine gives the following result:

    r
    2405892
    r
    832229

    it seems the tree is better, or is it that I am measuring the speed
    in the wrong way?

    regards
    gaurav
    gaurav v bagga, Dec 22, 2006
    #1

  2. In article <>,
    "gaurav v bagga" <> wrote:

    > hi all,
    >
    > I have key/value pairs of a structure and I want to store them in a
    > collection. Which collection will be good with respect to retrieval
    > speed?
    >
    > [code and timings snipped]
    >
    > it seems the tree is better, or is it that I am measuring the speed
    > in the wrong way?
    >
    > regards
    > gaurav


    I doubt that your test is really meaningful, having so few actual values
    and using String for your keys as you do.

    TreeMap is a sorted map. HashMap is not. Based on that, it's logical
    to expect some overhead on the TreeMap that could impact performance
    somewhat. Whether and when it becomes meaningful probably depends
    largely on how much data you've got and your retrieval patterns.

    But your choice of one or the other should not be driven by your
    perception of speed in retrieving their contents, but by your program's
    needs. If you never need to iterate over your collection in order of
    its keys, why use a TreeMap? Understand the difference between the two,
    determine what your need is, and choose whichever best fits.
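
    To make the difference concrete, here is a rough illustrative sketch;
    the HashMap ordering noted in the comment is not guaranteed and will
    vary between runs and implementations:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class OrderDemo {
        public static void main(String[] args) {
            Map<String, String> hm = new HashMap<String, String>();
            Map<String, String> tm = new TreeMap<String, String>();
            for (String k : new String[] {"pear", "apple", "mango"}) {
                hm.put(k, "x");
                tm.put(k, "x");
            }
            // TreeMap always iterates in sorted key order: apple, mango, pear
            System.out.println("TreeMap keys: " + tm.keySet());
            // HashMap iteration order is unspecified and may change
            System.out.println("HashMap keys: " + hm.keySet());
        }
    }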

    = Steve =
    --
    Steve W. Jackson
    Montgomery, Alabama
    Steve W. Jackson, Dec 22, 2006
    #2

  3. And thus spoke gaurav v bagga...

    > I have key/value pairs of a structure and I want to store them in a
    > collection. Which collection will be good with respect to retrieval
    > speed?

    ....
    > I found out from the net that HashMap is better than TreeMap in
    > retrieval speed.


    NEVER EVER decide something like this based on throwing 14 values at
    it and looking at how much "Date" it takes... Sorry, but it doesn't
    say much. I wouldn't even trust such an experiment if you threw
    thousands of random numbers at it :)

    A better way to do it...

    1. Read the JavaDoc of HashMap and TreeMap. There you'll find which
    algorithms are used by these classes. In some cases you'll find the
    complexity of these algorithms there.

    2. Use the web or a good book to find out which algorithm has which
    complexity for which method, if you didn't find it in the JavaDoc.

    THEN you can decide which algorithm to use. If you do not understand
    an algorithm, you will never be able to decide if it's the best one
    for your job. If you don't know what "complexity" or "O(n log n)"
    means, I would advise you to read _at least_ what Wikipedia has to
    offer about these topics. You don't have to know how to prove a
    complexity, but you should understand what it means :)
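
    For what it's worth, the class documentation does state the key
    facts: TreeMap guarantees log(n) time cost for containsKey, get, put
    and remove, while HashMap gives constant-time get and put, assuming
    the hash function disperses the keys properly. As a rough worked
    number: with n = 40 keys a balanced tree needs at most about
    log2(40), i.e. roughly 6, comparisons per lookup, while a hash lookup
    costs one hash computation plus a few equality checks. At that size
    the constant factors, not the O() class, decide the winner.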

    Flo
    Flo 'Irian' Schaetz, Dec 22, 2006
    #3
  4. Flo 'Irian' Schaetz wrote:
    > And thus spoke gaurav v bagga...
    >
    >> I have key/value pairs of a structure and I want to store them in a
    >> collection. [snip]
    >
    > A better way to do it...
    >
    > 1. Read the JavaDoc of HashMap and TreeMap. There you'll find which
    > algorithms are used by these classes. In some cases you'll find the
    > complexity of these algorithms there.
    >
    > 2. Use the web or a good book to find out which algorithm has which
    > complexity for which method, if you didn't find it in the JavaDoc.
    >
    > [rest snipped]
    >
    > Flo


    I disagree somewhat with this advice. Big-O notation only tells you
    about the limiting behavior for large problems.

    If there is a very frequent need to retrieve from small collections, the
    performance could be dominated by overheads and constant factor
    differences that disappear in the limiting case. There is then no
    substitute for measurement.

    However, the measurement needs to be realistic. In particular, timing
    one retrieval makes no sense at all.

    Patricia
    Patricia Shanahan, Dec 22, 2006
    #4
  5. And thus spoke Patricia Shanahan...

    > I disagree somewhat with this advice. Big-O notation only tells you
    > about the limiting behavior for large problems.


    I just said "better" way, not "best" way :) But you're right, I
    simply assumed that he has a somewhat "bigger" problem to solve and
    not just 14 pairs, my fault. x*n can of course be bigger than n*n for
    small n, my fault, really.

    Anyway, understanding the algorithm seems like the best way to help
    in making a decision. Taking two classes which seem to do approximately
    what one wants done and then comparing them by using "Date" isn't in
    my top 10, at least :)

    Flo
    Flo 'Irian' Schaetz, Dec 22, 2006
    #5
  6. gaurav v bagga wrote:
    > hi all,

    .....
    > r
    > 2405892
    > r
    > 832229
    >
    > it seems the tree is better, or is it that I am measuring the speed
    > in the wrong way?


    You are measuring speed in a very wrong way. Just taking your test, but
    putting the work in separate methods and calling them multiple times, I got:

    r
    Hash 1437744
    r
    Tree 333823
    r
    Hash 151559
    r
    Tree 172166
    r
    Hash 2030422
    r
    Tree 182390
    r
    Hash 165632
    r
    Tree 181563
    r
    Hash 173731
    r
    Tree 177306
    r
    Hash 185852
    r
    Tree 177680
    r
    Hash 154439
    r
    Tree 183960


    Obviously, the times are very variable, probably due to issues such as
    cache misses, other programs running...

    To measure this properly, go back to the program in which this is a
    critical issue. Pick a typical list of items, and distribution of
    requests. Make sure your test program reproduces those. Measure many
    requests, and do the tests several times, alternating methods, so that
    you are not charging one method with initial overheads.

    The measured piece of work should take on the order of seconds or
    tens of seconds, so that the odd cache miss or page fault cannot
    affect the results.
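
    A minimal sketch of that kind of harness (the key set, value strings
    and repetition counts below are placeholders; substitute your real
    data and request mix):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class MapBench {
        static long time(Map<String, String> map, String[] keys, int reps) {
            long start = System.nanoTime();
            long sink = 0;  // accumulate something so the JIT can't discard the work
            for (int i = 0; i < reps; i++) {
                for (String k : keys) {
                    sink += map.get(k).length();
                }
            }
            long elapsed = System.nanoTime() - start;
            if (sink == 42) System.out.println(sink);  // defeat dead-code elimination
            return elapsed;
        }

        public static void main(String[] args) {
            String[] keys = new String[14];
            Map<String, String> hm = new HashMap<String, String>();
            Map<String, String> tm = new TreeMap<String, String>();
            for (int i = 1; i <= 14; i++) {
                keys[i - 1] = Integer.toString(i);
                hm.put(keys[i - 1], "r");
                tm.put(keys[i - 1], "r");
            }
            // Alternate the maps over several rounds so neither one is
            // charged with the one-off warm-up costs.
            for (int round = 0; round < 5; round++) {
                System.out.println("Hash " + time(hm, keys, 1000000));
                System.out.println("Tree " + time(tm, keys, 1000000));
            }
        }
    }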

    Patricia
    Patricia Shanahan, Dec 22, 2006
    #6
  7. Patricia Shanahan wrote:
    > gaurav v bagga wrote:
    >> it seems the tree is better, or is it that I am measuring the
    >> speed in the wrong way?

    >

    Yes, you are measuring it the wrong way.

    In any situation where the system is multi-tasking (and Java is
    always multitasking to a first approximation, because there are at
    least two threads running) it's meaningless to measure elapsed time.
    The only valid measure of efficiency for a task that doesn't involve
    i/o is the amount of CPU time it consumes, because that, at least, is
    not affected by other processes unless the machine is so overloaded
    that every task switch involves paging or task swapping.

    Patricia's runs show interference by other processes very nicely.

    If you are using Linux or a UNIX you should at least use the "time"
    utility to measure the CPU time used. As you don't say what OS you're
    using I can't say more than that.
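
    Whatever the OS, you can also ask the JVM itself for the current
    thread's CPU time via java.lang.management (Java 5 and later). A
    rough sketch, assuming the JVM supports thread CPU time measurement;
    note it only counts the measuring thread, not GC or JIT threads:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    public class CpuTimeDemo {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            if (!mx.isCurrentThreadCpuTimeSupported()) {
                System.out.println("CPU time measurement not supported here");
                return;
            }
            long cpuStart = mx.getCurrentThreadCpuTime();  // ns of CPU used
            long wallStart = System.nanoTime();            // ns of elapsed time
            doWork();
            System.out.println("CPU:  " + (mx.getCurrentThreadCpuTime() - cpuStart));
            System.out.println("Wall: " + (System.nanoTime() - wallStart));
        }

        // Placeholder workload; substitute the code you actually want to measure.
        static void doWork() {
            long sum = 0;
            for (int i = 0; i < 10000000; i++) sum += i;
            if (sum == 42) System.out.println(sum);  // keep the loop alive
        }
    }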

    > To measure this properly, go back to the program in which this is a
    > critical issue. Pick a typical list of items, and distribution of
    > requests. Make sure your test program reproduces those. Measure many
    > requests, and do the tests several times, alternating methods, so that
    > you are not charging one method with initial overheads.
    >

    Exactly. We can't advise you because we haven't any idea of:
    - the number of items you want to search
    - the relative size of the key and value in each item
    - the degree of key duplication in your data
    - the degree of key value clustering in your data
    - the complexity of the key comparisons [1]

    As Patricia said, the number of items and the key value distribution are
    both vital information which you didn't provide.

    The following assumes that all keys are unique. If that's not true
    then you need a more complex approach that can handle duplicates.

    If the data is static the following is generally true: if you only do
    infrequent searches of under 10 items, a simple sequential search is
    probably best because its overheads are minimal. With a few hundred
    items and many searches of static data, it's better to sort the data
    once it's loaded and use a binary chop search: SortedMap will work
    this way provided that you build an unsorted map first and then
    construct the SortedMap from it. Adding unsorted data item by item to
    an empty SortedMap is likely to be very slow.

    If the data isn't static and is initially unordered then either a hash
    table (HashMap) or a red-black binary tree (TreeMap) is the way to go.
    If the HashMap hashing algorithm generates synonyms because of your key
    value distribution or because you've set the initial capacity too low
    its performance will degrade fast and the TreeMap would be better. If
    you need to read out the data in order then TreeMap is the only game in
    town.
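
    If you do go the sorted route with mostly static data, the bulk
    construction idiom looks like this (a sketch; the keys and values
    here are placeholders):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class BulkBuild {
        public static void main(String[] args) {
            // Load the static data into an unsorted map first...
            Map<String, String> staging = new HashMap<String, String>();
            staging.put("14", "rd");
            staging.put("7", "rd");
            staging.put("1", "r");
            // ...then build the sorted map from it in one constructor call,
            // rather than put-by-put into an empty TreeMap.
            SortedMap<String, String> sorted = new TreeMap<String, String>(staging);
            System.out.println(sorted.firstKey() + " .. " + sorted.lastKey());
        }
    }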

    [1] This alone can invalidate a theoretical analysis. For instance, most
    people will swear blind that Quicksort will always blow a Replacement
    sort into the weeds. No less an authority than Niklaus Wirth says so
    without any qualification of this opinion so it *must* be so! But his
    analysis assumes that swapping items is much more expensive than
    comparing them. If your item swap method is very fast (e.g. swapping
    references to large objects rather than copying the objects about in an
    array) and the key comparison is slow (e.g. a lexicographic string
    comparison over multiple keys rather than comparing single integer keys)
    then you'll find in practice that a Replacement sort is several times
    faster than Quicksort. I know this because I've been there.

    > The measured piece of work should take on the order of seconds or
    > tens of seconds, so that the odd cache miss or page fault cannot
    > affect the results.
    >

    Yes. You need a sample of real data to test against so you get a
    representative key distribution and the sample has to be large enough to
    give a realistic performance estimation for live data. This means it
    *must* be within 2 - 3 orders of magnitude of the design data volume and
    preferably within one order of magnitude. By way of illustration, I've
    seen several databases that ran just fine on typical programmer test
    volumes (under 10-50 items in the biggest table) but that had unusably
    bad performance with a few tens of thousands of rows loaded.

    If all this is new to you, go and find a copy of Sedgewick: Algorithms.
    Its code examples are in Pascal, but any competent programmer should be
    able to translate that to Java in their head. There are also Java
    versions of this book - at a price.


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
    Martin Gregorie, Dec 23, 2006
    #7
  8. Martin Gregorie wrote:
    > Patricia Shanahan wrote:
    >> gaurav v bagga wrote:
    >>> it seems the tree is better, or is it that I am measuring the
    >>> speed in the wrong way?

    >>

    > Yes, you are measuring it the wrong way.
    >
    > In any situation where the system is multi-tasking (and Java is
    > always multitasking to a first approximation, because there are at
    > least two threads running) it's meaningless to measure elapsed
    > time. The only valid measure of efficiency for a task that doesn't
    > involve i/o is the amount of CPU time it consumes, because that, at
    > least, is not affected by other processes unless the machine is so
    > overloaded that every task switch involves paging or task swapping.


    Yes and no. If the real objective is to measure CPU time it is fine to
    measure that. However, measuring CPU time does not just eliminate
    interference from other tasks, it also eliminates the job's own disk
    wait time.

    Also, CPU time can show strange effects when measuring multiple threads
    on a hyperthreaded processor.

    Ultimately, elapsed time is what matters.

    Here's what I've done in the past when I wanted an accurate measure of
    elapsed time for one job:

    1. Disconnect all networking.

    2. Only one user logged in.

    3. Kill all non-essential daemons. I used to know exactly what was
    needed to keep a Solaris system operational.

    4. Kill all non-essential user jobs. Typically, I had one shell and the
    job being measured, no window manager.

    5. Hands off keyboard and mouse for the duration of the run.

    However, for many cases that is overkill, and I just go with the time to
    run a reasonably long piece of work.

    Patricia
    Patricia Shanahan, Dec 23, 2006
    #8
  9. Patricia Shanahan wrote:
    > However, for many cases that is overkill, and I just go with the time to
    > run a reasonably long piece of work.


    I'd agree here. I'd tend to measure average elapsed time over a large
    number of jobs representative of the expected usage. The result should
    be a time measurement representative of what the users can expect -- and
    that number's more important than cycles or whatever.

    I'd even have the system in the kind of state, multitasking-wise, that's
    likely to be representative of the typical deployment environment.
    Something that runs fast when it has sole use of the CPU but becomes a
    pig in molasses due to cache misses or whatever when it runs in a
    real-world situation will get caught this way, for starters. So will
    something that makes the rest of the system unusable while it's
    "thinking", where that may be a showstopper for the users. (Such a
    something at minimum needs to use a separate, lower-priority thread for
    the "thinking" and maybe explicitly yield now and again. I've also seen
    systems slowed to a crawl by software that grovels over the whole
    filesystem for hours, because of contention for disk access rather than
    CPU use. The antivirus on my development box runs daily at 8 am,
    generally taking until 10 to finish, and the system's a pain in the ass
    to use during those two hours. Of course, a virus-riddled system would
    be an even bigger pain, so...)
    John Ersatznom, Dec 23, 2006
    #9
  10. gaurav v bagga wrote:
    >
    > r
    > 2405892
    > r
    > 832229


    Yeah, I get:

    r
    617475
    r
    187102

    Oh, but I switched and did TreeMap first...

    > it seems the tree is better, or is it that I am measuring the speed
    > in the wrong way?


    Most modern JREs use adaptive compilers. They will start interpreting
    code and gradually compile bits and pieces, perhaps the same piece a
    number of times.

    So you need to be extremely careful. Assuming you care about 'hot
    spots', then your microbenchmarks should loop many times and alternate
    between algorithms. Better, use real code.
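
    In practice that means timing the same work over many rounds and only
    trusting the later, post-compilation ones. A rough sketch (runWorkload
    is a hypothetical stand-in for whatever you're really measuring):

    public class WarmUp {
        public static void main(String[] args) {
            for (int i = 0; i < 10; i++) {
                long t = System.nanoTime();
                runWorkload();  // the code under test
                // Early rounds include interpretation and JIT compilation;
                // later rounds are closer to steady-state performance.
                System.out.println("round " + i + ": " + (System.nanoTime() - t) + " ns");
            }
        }

        // Hypothetical placeholder workload; substitute the real code.
        static void runWorkload() {
            long sum = 0;
            for (int i = 0; i < 1000000; i++) sum += i;
            if (sum == 42) System.out.println(sum);  // keep the loop alive
        }
    }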

    One thing microbenchmarks don't cover is seas of not-very-warm code.
    Apparently Swing is like this. For good performance there you need to
    write less code ("there is no code faster than no code").

    Either way you tend to get better performance, surprisingly, by keeping
    methods small (Swing is really bad at this). In particular keep
    low-level hot spots away from less well used code (Swing is really bad
    at this). And factor out common code as best you can (Swing is really
    bad at this).

    Tom Hawtin
    Thomas Hawtin, Dec 23, 2006
    #10
  11. John Ersatznom wrote:
    > Patricia Shanahan wrote:
    >> However, for many cases that is overkill, and I just go with the time to
    >> run a reasonably long piece of work.

    >
    > I'd agree here. I'd tend to measure average elapsed time over a large
    > number of jobs representative of the expected usage. The result should
    > be a time measurement representative of what the users can expect -- and
    > that number's more important than cycles or whatever.


    I agree when the objective is to find out if a job, as a whole, has
    acceptable performance. If the job is too slow, the next stage is
    hunt-the-bottlenecks.

    After the bottlenecks have been found, I often want to test alternative
    solutions with minimal interference from other jobs, so that relatively
    short runs will tell me which solution is faster.

    Patricia
    Patricia Shanahan, Dec 23, 2006
    #11
  12. Patricia Shanahan wrote:
    > Martin Gregorie wrote:
    >> Patricia Shanahan wrote:
    >>> gaurav v bagga wrote:
    >>>> it seems the tree is better, or is it that I am measuring the
    >>>> speed in the wrong way?
    >>>

    >> Yes, you are measuring it the wrong way.
    >>
    >> In any situation where the system is multi-tasking (and Java is
    >> always multitasking to a first approximation, because there are at
    >> least two threads running) it's meaningless to measure elapsed
    >> time. The only valid measure of efficiency for a task that doesn't
    >> involve i/o is the amount of CPU time it consumes, because that,
    >> at least, is not affected by other processes unless the machine is
    >> so overloaded that every task switch involves paging or task
    >> swapping.

    >
    > Yes and no. If the real objective is to measure CPU time it is fine to
    > measure that. However, measuring CPU time does not just eliminate
    > interference from other tasks, it also eliminates the job's own disk
    > wait time.
    >

    I believe I qualified my statement by saying that this was the better
    measurement for tasks that do not involve i/o.

    If i/o is involved I pick a reasonably sized data set - as you do - but
    tempered to be (hopefully) representative of the real life work and key
    mix and run it several times, discarding inconsistent run times.

    > Also, CPU time can show strange effects when measuring multiple threads
    > on a hyperthreaded processor.
    >

    I'll take your word for that. The oddest platform I've used for
    performance prediction was what used to be Digital UNIX (based on the
    Mach kernel) running on Alpha chips which had odd and non-obvious chunks
    of serialized code in its guts, like the lstat() SVC.

    > Ultimately, elapsed time is what matters.
    >
    > Here's what I've done in the past when I wanted an accurate measure of
    > elapsed time for one job:
    >
    > 1. Disconnect all networking.
    >
    > 2. Only one user logged in.
    >
    > 3. Kill all non-essential daemons. I used to know exactly what was
    > needed to keep a Solaris system operational.
    >
    > 4. Kill all non-essential user jobs. Typically, I had one shell and the
    > job being measured, no window manager.
    >
    > 5. Hands off keyboard and mouse for the duration of the run.
    >
    > However, for many cases that is overkill, and I just go with the time to
    > run a reasonably long piece of work.
    >

    Looks good to me, if, as you say, slightly overkill. Of course the
    real fun comes from tracking down and eliminating the bottlenecks....


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
    Martin Gregorie, Dec 23, 2006
    #12
  13. hi all,
    thanks for the advice, but many things being discussed here are new
    to me, so it took time to reply.

    As asked:
    "Exactly. We can't advise you because we haven't any idea of:"
    - the number of items you want to search
    = not more than 15 for 70% of the time, 30 for 25%, and 40 for 5%
    - the relative size of the key and value in each item (if I guessed
    it right, you mean string length?)
    = always less than 20-25
    - the degree of key duplication in your data
    = no duplicates
    - the degree of key value clustering in your data
    = didn't get this
    - the complexity of the key comparisons [1]
    = went through [1] but things were not clear, will need time to
    assimilate it; things new to me..... :)
    - OS
    = presently Windows XP Pro
    - static data?
    = data will be static most of the time (changes rarely)



    "Assuming you care about 'hot spots', then your microbenchmarks
    should loop many times and alternate between algorithms. Better, use
    real code."
    WHAT DO YOU MEAN BY "hot spots"?

    "Obviously, the times are very variable, probably due to issues such
    as cache misses, other programs running..."
    WHAT DO YOU MEAN BY "cache misses"?

    well, this discussion has taught me many different perspectives on
    performance measurement.
    once again, thanks all, good learning for me :)


    regards
    gaurav
    gaurav v bagga, Dec 24, 2006
    #13
  14. gaurav v bagga wrote:
    ....
    > "Assuming you care about 'hot spots', then your microbenchmarks
    > should loop many times and alternate between algorithms. Better,
    > use real code."
    > WHAT DO YOU MEAN BY "hot spots"?


    A "hot spot" is a small piece of code that contributes significantly to
    the execution time.

    >
    >
    > "Obviously, the times are very variable, probably due to issues
    > such as cache misses, other programs running..."
    > WHAT DO YOU MEAN BY "cache misses"?


    See http://en.wikipedia.org/wiki/Cache for a general introduction to
    caches. A "cache miss" is any load, store, or instruction fetch for
    which the processor has to go to a slower cache, or even memory, because
    the data is not in a level 1 cache.

    Patricia
    Patricia Shanahan, Dec 25, 2006
    #14
  15. hi Patricia,

    thanks for the reply

    regards
    gaurav
    gaurav v bagga, Dec 26, 2006
    #15
