which collection to use


gaurav v bagga

hi all,

I have key/value pairs of a structure and I want to store them in a
collection. Which collection will be good with respect to retrieval
speed?

I found out from the net that HashMap is better than TreeMap in
retrieval speed.


HashMap<String, String> hm = new HashMap<String, String>();
hm.put("1", "r");
hm.put("2", "r");
hm.put("3", "r");
hm.put("4", "r");
hm.put("5", "r");
hm.put("6", "r");
hm.put("7", "rd");
hm.put("8", "r");
hm.put("9", "r");
hm.put("10", "r");
hm.put("11", "r");
hm.put("12", "r");
hm.put("13", "r");
hm.put("14", "rd");

long time=System.nanoTime();
System.out.println(hm.get("10"));
System.out.println(System.nanoTime()-time);

TreeMap<String, String> tm = new TreeMap<String, String>();
tm.put("1", "r");
tm.put("2", "r");
tm.put("3", "r");
tm.put("4", "r");
tm.put("5", "r");
tm.put("6", "r");
tm.put("7", "rd");
tm.put("8", "r");
tm.put("9", "r");
tm.put("10", "r");
tm.put("11", "r");
tm.put("12", "r");
tm.put("13", "r");
tm.put("14", "rd");

time=System.nanoTime();
System.out.println(tm.get("10"));
System.out.println(System.nanoTime()-time);


but this code of mine gives this result:

r
2405892
r
832229

It seems the tree is better, or is it that I am measuring the speed in
the wrong way?

regards
gaurav
 

Steve W. Jackson

"gaurav v bagga said:
hi all,

I have key,value pair of structure and I want to store them in a
collection. Which collection will be good with respect to retrieval
speed.

I found out from net that HashMap is better then TreeMap in retrieval
speed.


HashMap<String, String> hm = new HashMap<String, String>();
hm.put("1", "r");
hm.put("2", "r");
hm.put("3", "r");
hm.put("4", "r");
hm.put("5", "r");
hm.put("6", "r");
hm.put("7", "rd");
hm.put("8", "r");
hm.put("9", "r");
hm.put("10", "r");
hm.put("11", "r");
hm.put("12", "r");
hm.put("13", "r");
hm.put("14", "rd");

long time=System.nanoTime();
System.out.println(hm.get("10"));
System.out.println(System.nanoTime()-time);

TreeMap<String, String> tm = new TreeMap<String, String>();
tm.put("1", "r");
tm.put("2", "r");
tm.put("3", "r");
tm.put("4", "r");
tm.put("5", "r");
tm.put("6", "r");
tm.put("7", "rd");
tm.put("8", "r");
tm.put("9", "r");
tm.put("10", "r");
tm.put("11", "r");
tm.put("12", "r");
tm.put("13", "r");
tm.put("14", "rd");

time=System.nanoTime();
System.out.println(tm.get("10"));
System.out.println(System.nanoTime()-time);


but this code of mine gives result as

r
2405892
r
832229

it seems tree is better or is it that I am measuring the speed in wrong
way?

regards
gaurav

I doubt that your test is really meaningful, having so few actual values
and using String for your keys as you do.

TreeMap is a sorted map. HashMap is not. Based on that, it's logical
to expect some overhead on the TreeMap that could impact performance
somewhat. Whether and when it becomes meaningful probably depends
largely on how much data you've got and your retrieval patterns.

But your choice of one or the other should not be driven by your
perception of speed in retrieving their contents, but by your program's
needs. If you never need to iterate over your collection in order of
its keys, why use a TreeMap? Understand the difference between the two,
determine what your need is, and choose whichever best fits.
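To make that concrete, here is a minimal sketch (class name and keys are
invented purely for illustration): only the TreeMap hands the keys back
in sorted order when you iterate.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class IterationOrderDemo {
    public static void main(String[] args) {
        Map<String, String> hash = new HashMap<String, String>();
        Map<String, String> tree = new TreeMap<String, String>();
        for (String key : new String[] {"banana", "apple", "cherry"}) {
            hash.put(key, "x");
            tree.put(key, "x");
        }
        // HashMap iteration order depends on hash codes and table size.
        System.out.println("HashMap keys: " + hash.keySet());
        // TreeMap always iterates in sorted key order: [apple, banana, cherry]
        System.out.println("TreeMap keys: " + tree.keySet());
    }
}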

= Steve =
 

Flo 'Irian' Schaetz

And thus spoke gaurav v bagga...
I have key/value pairs of a structure and I want to store them in a
collection. Which collection will be good with respect to retrieval
speed? ....
I found out from the net that HashMap is better than TreeMap in
retrieval speed.

NEVER EVER decide something like this by throwing 14 values at it and
looking at how much time it takes... Sorry, but it doesn't say much. I
wouldn't even trust such an experiment if you threw thousands of random
numbers at it :)

A better way to do it...

1. Read the JavaDoc of HashMap and TreeMap. There you'll find which
algorithms are used by these classes. In some cases you'll find the
complexity of these algorithms there.

2. Use the web or a good book to find out which algorithm has which
complexity for which method, if you didn't find it in the JavaDoc.

THEN you can decide which algorithm to use. If you do not understand an
algorithm, you will never be able to decide if it's the best one for
your job. If you don't know what "complexity" or "O(n log n)" means, I
would advise you to read _at least_ what Wikipedia has to offer about
these topics. You don't have to know how to prove a complexity, but you
should understand what it means :)
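For these two classes the JavaDoc does state the costs; the comments
below paraphrase it (check the exact wording in your own JDK's
documentation):

import java.util.HashMap;
import java.util.TreeMap;

public class DocumentedComplexity {
    public static void main(String[] args) {
        // HashMap JavaDoc: constant-time performance for the basic
        // operations (get and put), assuming the hash function disperses
        // the elements properly among the buckets.
        HashMap<String, String> hm = new HashMap<String, String>();

        // TreeMap JavaDoc: a red-black tree with guaranteed log(n) time
        // cost for containsKey, get, put and remove.
        TreeMap<String, String> tm = new TreeMap<String, String>();

        hm.put("key", "value");   // expected O(1)
        tm.put("key", "value");   // O(log n)
    }
}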

Flo
 

Patricia Shanahan

Flo said:
And thus spoke gaurav v bagga...
....
THEN you can decide which algorithm to use. If you do not understand an
algorithm, you will never be able to decide if it's the best one for
your job. If you don't know what "complexity" or "O(n log n)" means, I
would advise you to read _at least_ what Wikipedia has to offer about
these topics. You don't have to know how to prove a complexity, but you
should understand what it means :)

Flo

I disagree somewhat with this advice. Big-O notation only tells you
about the limiting behavior for large problems.

If there is a very frequent need to retrieve from small collections, the
performance could be dominated by overheads and constant factor
differences that disappear in the limiting case. There is then no
substitute for measurement.

However, the measurement needs to be realistic. In particular, timing
one retrieval makes no sense at all.

Patricia
 

Flo 'Irian' Schaetz

And thus spoke Patricia Shanahan...
I disagree somewhat with this advice. Big-O notation only tells you
about the limiting behavior for large problems.

I just said "better" way, not "best" way :) But you're right, I simply
assumed that he has a somewhat "bigger" problem to solve and not just 14
pairs, my fault. x*n can of course be bigger than n*n for small n, my
fault, really.

Anyway, understanding the algorithm seems like the best way to help make
a decision. Taking two classes which seem to do approximately what one
wants done and then comparing them by using "Date" isn't in my Top
10, at least :)

Flo
 

Patricia Shanahan

gaurav said:
hi all, .....
r
2405892
r
832229

It seems the tree is better, or is it that I am measuring the speed in
the wrong way?

You are measuring speed in a very wrong way. Just taking your test, but
putting the work in separate methods and calling them multiple times, I got:

r
Hash 1437744
r
Tree 333823
r
Hash 151559
r
Tree 172166
r
Hash 2030422
r
Tree 182390
r
Hash 165632
r
Tree 181563
r
Hash 173731
r
Tree 177306
r
Hash 185852
r
Tree 177680
r
Hash 154439
r
Tree 183960


Obviously, the times are very variable, probably due to issues such as
cache misses, other programs running...

To measure this properly, go back to the program in which this is a
critical issue. Pick a typical list of items, and distribution of
requests. Make sure your test program reproduces those. Measure many
requests, and do the tests several times, alternating methods, so that
you are not charging one method with initial overheads.

The measured piece of work should take on the order of seconds or tens
of seconds, so that the odd cache miss or page fault cannot affect the
results.
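Something along these lines (the class name, key set and repetition
counts below are invented placeholders, not the exact test used): time a
whole batch of lookups per measurement, and alternate between the two
maps over several runs.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapLookupBenchmark {

    // Time a batch of lookups; accumulating and checking the result keeps
    // the JIT from treating the loop as dead code.
    static long timeLookups(Map<String, String> map, String[] keys, int reps) {
        long start = System.nanoTime();
        int sum = 0;
        for (int r = 0; r < reps; r++) {
            for (int k = 0; k < keys.length; k++) {
                sum += map.get(keys[k]).length();
            }
        }
        long elapsed = System.nanoTime() - start;
        if (sum < 0) System.out.println(sum);   // never true with these inputs
        return elapsed;
    }

    public static void main(String[] args) {
        Map<String, String> hash = new HashMap<String, String>();
        Map<String, String> tree = new TreeMap<String, String>();
        String[] keys = new String[30];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = Integer.toString(i);
            hash.put(keys[i], "r");
            tree.put(keys[i], "r");
        }
        // Several alternating runs, so neither map is charged all the
        // one-off start-up costs.
        for (int run = 0; run < 8; run++) {
            System.out.println("Hash " + timeLookups(hash, keys, 100000));
            System.out.println("Tree " + timeLookups(tree, keys, 100000));
        }
    }
}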

Patricia
 

Martin Gregorie

Yes, you are measuring it the wrong way.

In any situation where the system is multi-tasking (and Java is always
multitasking to a first approximation because there are at least two
threads running) it's meaningless to measure elapsed time. The only valid
measure of efficiency for a task that doesn't involve i/o is the amount
of CPU time it consumes because that, at least, is not affected by other
processes unless the machine is so overloaded that every task switch
involves paging or task swapping.

Patricia's runs show interference by other processes very nicely.

If you are using Linux or a UNIX you should at least use the "time"
utility to measure the CPU time used. As you don't say what OS you're
using I can't say more than that.
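Wherever the JVM supports it, you can also read per-thread CPU time from
inside Java through java.lang.management; a minimal sketch (the
doSomeWork method is just a stand-in for whatever you are measuring):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuVsWallClock {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported()) {
            System.out.println("Thread CPU time not supported on this JVM/OS");
            return;
        }
        long cpuStart = bean.getCurrentThreadCpuTime();   // ns of CPU used by this thread
        long wallStart = System.nanoTime();

        doSomeWork();   // stand-in for the code being measured

        System.out.println("CPU  ns: " + (bean.getCurrentThreadCpuTime() - cpuStart));
        System.out.println("Wall ns: " + (System.nanoTime() - wallStart));
    }

    private static void doSomeWork() {
        double x = 0;
        for (int i = 0; i < 10000000; i++) {
            x += Math.sqrt(i);
        }
        if (x < 0) System.out.println(x);   // keeps the loop from being optimized away
    }
}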
Patricia said:
To measure this properly, go back to the program in which this is a
critical issue. Pick a typical list of items, and distribution of
requests. Make sure your test program reproduces those. Measure many
requests, and do the tests several times, alternating methods, so that
you are not charging one method with initial overheads.
Exactly. We can't advise you because we haven't any idea of:
- the number of items you want to search
- the relative size of the key and value in each item
- the degree of key duplication in your data
- the degree of key value clustering in your data
- the complexity of the key comparisons [1]

As Patricia said, the number of items and the key value distribution are
both vital information which you didn't provide.

The following assumes that all keys are unique. If that's not true,
then you need a more complex approach that can handle duplicates.

If the data is static the following is generally true: if you only do
infrequent searches of under 10 items, a simple sequential search is
probably best because its overheads are minimal. With a few hundred
items and many searches of static data it's better to sort the data once
it's loaded and use a binary chop search: SortedMap will work this way
provided that you build an unsorted map first and then construct the
SortedMap from it. Adding unsorted data item by item to an empty
SortedMap is likely to be very slow.

If the data isn't static and is initially unordered then either a hash
table (HashMap) or a red-black binary tree (TreeMap) is the way to go.
If the HashMap hashing algorithm generates synonyms because of your key
value distribution, or because you've set the initial capacity too low,
its performance will degrade fast and the TreeMap would be better. If
you need to read out the data in order then TreeMap is the only game in
town.
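For what it's worth, the "build an unsorted map first, then construct
the SortedMap from it" route can be as short as this sketch (keys
invented for illustration; note that with String keys the ordering is
lexicographic, so "14" sorts before "7"):

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class StaticDataSketch {
    public static void main(String[] args) {
        // Load the (mostly static) data into an unsorted map first...
        Map<String, String> loaded = new HashMap<String, String>();
        loaded.put("1", "r");
        loaded.put("7", "rd");
        loaded.put("14", "rd");

        // ...then construct the sorted map from it in one step.
        TreeMap<String, String> sorted = new TreeMap<String, String>(loaded);

        System.out.println(sorted.firstKey());   // 1
        System.out.println(sorted.keySet());     // [1, 14, 7] -- lexicographic order
    }
}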

[1] This alone can invalidate a theoretical analysis. For instance, most
people will swear blind that Quicksort will always blow a Replacement
sort into the weeds. No less an authority than Niklaus Wirth says so
without any qualification of this opinion so it *must* be so! But his
analysis assumes that swapping items is much more expensive than
comparing them. If your item swap method is very fast (e.g. swapping
references to large objects rather than copying the objects about in an
array) and the key comparison is slow (e.g. a lexicographic string
comparison over multiple keys rather than comparing single integer keys)
then you'll find in practice that a Replacement sort is several times
faster than Quicksort. I know this because I've been there.
Patricia said:
The measured piece of work should take on the order of seconds or tens
of seconds, so that the odd cache miss or page fault cannot affect the
results.
Yes. You need a sample of real data to test against so you get a
representative key distribution and the sample has to be large enough to
give a realistic performance estimation for live data. This means it
*must* be within 2 - 3 orders of magnitude of the design data volume and
preferably within one order of magnitude. By way of illustration, I've
seen several databases that ran just fine on typical programmer test
volumes (under 10-50 items in the biggest table) but that had unusably
bad performance with a few tens of thousands of rows loaded.

If all this is new to you, go and find a copy of Sedgewick: Algorithms.
Its code examples are in Pascal, but any competent programmer should be
able to translate that to Java in their head. There are also Java
versions of this book - at a price.
 

Patricia Shanahan

Martin said:
Yes, you are measuring it the wrong way.

In any situation where the system is multi-tasking (and Java is always
multitasking to a first approximation because there are at least two
threads running) its meaningless to measure elapsed time. The only valid
measure of efficiency for an task that doesn't involve i/o is the amount
of CPU time it consumes because that, at least, is not affected by other
processes unless the machine is so overloaded that every task switch
involves paging or task swapping.

Yes and no. If the real objective is to measure CPU time it is fine to
measure that. However, measuring CPU time does not just eliminate
interference from other tasks, it also eliminates the job's own disk
wait time.

Also, CPU time can show strange effects when measuring multiple threads
on a hyperthreaded processor.

Ultimately, elapsed time is what matters.

Here's what I've done in the past when I wanted an accurate measure of
elapsed time for one job:

1. Disconnect all networking.

2. Only one user logged in.

3. Kill all non-essential daemons. I used to know exactly what was
needed to keep a Solaris system operational.

4. Kill all non-essential user jobs. Typically, I had one shell and the
job being measured, no window manager.

5. Hands off keyboard and mouse for the duration of the run.

However, for many cases that is overkill, and I just go with the time to
run a reasonably long piece of work.

Patricia
 

John Ersatznom

Patricia said:
However, for many cases that is overkill, and I just go with the time to
run a reasonably long piece of work.

I'd agree here. I'd tend to measure average elapsed time over a large
number of jobs representative of the expected usage. The result should
be a time measurement representative of what the users can expect -- and
that number's more important than cycles or whatever.

I'd even have the system in the kind of state, multitasking-wise, that's
likely to be representative of the typical deployment environment.
Something that runs fast when it has sole use of the CPU but becomes a
pig in molasses due to cache misses or whatever when it runs in a
real-world situation will get caught this way, for starters. So will
something that makes the rest of the system unusable while it's
"thinking", where that may be a showstopper for the users. (Such a
something at minimum needs to use a separate, lower-priority thread for
the "thinking" and maybe explicitly yield now and again. I've also seen
systems slowed to a crawl by software that grovels over the whole
filesystem for hours, because of contention for disk access rather than
CPU use. The antivirus on my development box runs daily at 8 am,
generally taking until 10 to finish, and the system's a pain in the ass
to use during those two hours. Of course, a virus-riddled system would
be an even bigger pain, so...)
 

Thomas Hawtin

gaurav said:
r
2405892
r
832229

Yeah, I get:

r
617475
r
187102

Oh, but I switched and did TreeMap first...
It seems the tree is better, or is it that I am measuring the speed in
the wrong way?

Most modern JREs use adaptive compilers. They will start interpreting
code and gradually compile bits and pieces, perhaps the same piece a
number of times.

So you need to be extremely careful. Assuming you care about 'hot
spots', then your microbenchmarks should loop many times and alternate
between algorithms. Better, use real code.
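For example, something along these lines (the maps, key and counts are
placeholders only; a benchmarking framework would do all of this far
more carefully): exercise both get() paths before taking any timings,
then time whole batches, alternating between the maps.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class WarmUpSketch {
    public static void main(String[] args) {
        Map<String, String> hm = new HashMap<String, String>();
        Map<String, String> tm = new TreeMap<String, String>();
        for (int i = 1; i <= 14; i++) {
            hm.put(Integer.toString(i), "r");
            tm.put(Integer.toString(i), "r");
        }

        // Warm-up: let the adaptive compiler do its work on both lookup
        // paths before anything is measured.
        int sink = 0;
        for (int i = 0; i < 100000; i++) {
            sink += hm.get("10").length();
            sink += tm.get("10").length();
        }

        // Timed, alternating batches.
        for (int run = 0; run < 3; run++) {
            long t = System.nanoTime();
            for (int i = 0; i < 100000; i++) sink += hm.get("10").length();
            System.out.println("Hash " + (System.nanoTime() - t));

            t = System.nanoTime();
            for (int i = 0; i < 100000; i++) sink += tm.get("10").length();
            System.out.println("Tree " + (System.nanoTime() - t));
        }
        System.out.println(sink);   // keep the lookups from being optimized away
    }
}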

One thing microbenchmarks don't cover is seas of not-very-warm code.
Apparently Swing is like this. For good performance there you need to
write less code ("there is no code faster than no code").

Either way you tend to get better performance, surprisingly, by keeping
methods small (Swing is really bad at this). In particular keep
low-level hot spots away from less well used code (Swing is really bad
at this). And factor out common code as best you can (Swing is really
bad at this).

Tom Hawtin
 

Patricia Shanahan

John said:
I'd agree here. I'd tend to measure average elapsed time over a large
number of jobs representative of the expected usage. The result should
be a time measurement representative of what the users can expect -- and
that number's more important than cycles or whatever.

I agree when the objective is to find out if a job, as a whole, has
acceptable performance. If the job is too slow, the next stage is
hunt-the-bottlenecks.

After the bottlenecks have been found, I often want to test alternative
solutions with minimal interference from other jobs, so that relatively
short runs will tell me which solution is faster.

Patricia
 

Martin Gregorie

Patricia said:
Yes and no. If the real objective is to measure CPU time it is fine to
measure that. However, measuring CPU time does not just eliminate
interference from other tasks, it also eliminates the job's own disk
wait time.
I believe I qualified my statement by saying that this was the better
measurement for tasks that do not involve i/o.

If i/o is involved I pick a reasonably sized data set - as you do - but
tempered to be (hopefully) representative of the real life work and key
mix and run it several times, discarding inconsistent run times.
Also, CPU time can show strange effects when measuring multiple threads
on a hyperthreaded processor.
I'll take your word for that. The oddest platform I've used for
performance prediction was what used to be Digital UNIX (based on the
Mach kernel) running on Alpha chips which had odd and non-obvious chunks
of serialized code in its guts, like the lstat() SVC.
Ultimately, elapsed time is what matters.

Here's what I've done in the past when I wanted an accurate measure of
elapsed time for one job:

1. Disconnect all networking.

2. Only one user logged in.

3. Kill all non-essential daemons. I used to know exactly what was
needed to keep a Solaris system operational.

4. Kill all non-essential user jobs. Typically, I had one shell and the
job being measured, no window manager.

5. Hands off keyboard and mouse for the duration of the run.

However, for many cases that is overkill, and I just go with the time to
run a reasonably long piece of work.
Looks good to me if, as you say, slight overkill. Of course the real fun
comes from tracking down and eliminating the bottlenecks....
 

gaurav v bagga

hi all,
thanks for the advice, but many things being discussed here are new to
me, so it took time to reply.

as asked ------>
Exactly. We can't advise you because we haven't any idea of:
- the number of items you want to search
= not more than 15 for 70% of the time, 30 for 25%, and 40 for 5%
- the relative size of the key and value in each item (if I guessed it
right, you mean string length?)
= always less than 20-25
- the degree of key duplication in your data
= no duplicates
- the degree of key value clustering in your data
= didn't get this
- the complexity of the key comparisons [1]
= went through [1] but things were not clear, will need time to
assimilate [1]
things new to me..... :)
- OS
= presently Windows XP Pro
- static data?
= data will be static most of the time (changes rarely)



Assuming you care about 'hot
spots', then your microbenchmarks should loop many times and alternate
between algorithms. Better, use real code.
WHAT DO YOU MEAN BY "hot spots"?


Obviously, the times are very variable, probably due to issues such as
cache misses, other programs running...
WHAT DO YOU MEAN BY "cache misses"?

Well, this discussion has taught me many different perspectives on
performance measurement.
Once again, thanks all, good learning for me :)


regards
gaurav
 

Patricia Shanahan

gaurav v bagga wrote:
....
Assuming you care about 'hot
spots', then your microbenchmarks should loop many times and alternate
between algorithms. Better, use real code.
WHAT DO YOU MEAN BY "hot spots"?

A "hot spot" is a small piece of code that contributes significantly to
the execution time.
Obviously, the times are very variable, probably due to issues such as
cache misses, other programs running...
WHAT DO YOU MEAN BY "cache misses"?

See http://en.wikipedia.org/wiki/Cache for a general introduction to
caches. A "cache miss" is any load, store, or instruction fetch for
which the processor has to go to a slower cache, or even memory, because
the data is not in a level 1 cache.

Patricia
 
