SortedMap question?

Knute Johnson

I have been operating under a misperception that a SortedMap would be
more quickly searched than an unsorted Map. More specifically, that the
cost would come when adding an element to the SortedMap. I wrote a
little test program to measure the amount of time it took to look for
non-existent data in a Map and a SortedMap and it took somewhere between
twice and three times as long to look for the data in the SortedMap.

So am I now correct in thinking that the only real advantage of a
SortedMap is if you wish to iterate over the keys of the map in some order?

Thanks,

knute...
 
Joerg Meier

I have been operating under a misperception that a SortedMap would be
more quickly searched than an unsorted Map. More specifically, that the
cost would come when adding an element to the SortedMap. I wrote a
little test program to measure the amount of time it took to look for
non-existent data in a Map and a SortedMap and it took somewhere between
twice and three times as long to look for the data in the SortedMap.

Considering Map is an Interface, I question your claim that you measured
its get(x) performance :p

Assuming you used HashMap, as far as I recall, its get(x) performance is
already at O(1) unless you mess up your hashcodes. Doesn't really go lower.
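
For illustration, a minimal sketch (the BadKey class is hypothetical,
not from the original test) of what "messing up your hashcodes" does: a
constant hashCode() funnels every entry into one bucket, so get()
degrades into a scan of the collision chain (newer JDKs mitigate this by
treeifying crowded bins, but the constant factor still suffers).

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical key whose hashCode() puts every instance into the
    // same bucket, defeating HashMap's O(1) lookup.
    final class BadKey {
        private final int id;
        BadKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int hashCode() { return 42; } // constant: everything collides
    }

    public class BadHashDemo {
        public static void main(String[] args) {
            Map<BadKey, String> map = new HashMap<>();
            for (int i = 0; i < 10_000; i++) map.put(new BadKey(i), "v" + i);
            long t0 = System.nanoTime();
            map.get(new BadKey(9_999)); // must wade through the collisions
            System.out.println((System.nanoTime() - t0) / 1_000 + " us");
        }
    }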

Liebe Gruesse,
Joerg
 
Robert Klemme

I have been operating under a misperception that a SortedMap would be
more quickly searched than an unsorted Map. More specifically, that the
cost would come when adding an element to the SortedMap. I wrote a
little test program to measure the amount of time it took to look for
non-existent data in a Map and a SortedMap and it took somewhere between
twice and three times as long to look for the data in the SortedMap.

The TreeMap is a tree implementation and you typically have O(log n) for
insertion and lookup operations. For a HashMap it's O(1) - as has been
stated already. So the expectation would be for the TreeMap to be
slower - at least asymptotically.

So am I now correct in thinking that the only real advantage of a
SortedMap is if you wish to iterate over the keys of the map in some order?

SortedMap also supports creation of various subsets of the map based on
the key order as well as access to the min and max keys. The JavaDoc is
pretty comprehensive.
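
A minimal sketch of those extras (using TreeMap, the standard SortedMap
implementation; the keys and values are made up for illustration):

    import java.util.SortedMap;
    import java.util.TreeMap;

    public class SortedMapDemo {
        public static void main(String[] args) {
            SortedMap<String, Integer> m = new TreeMap<>();
            m.put("apple", 1);
            m.put("banana", 2);
            m.put("cherry", 3);

            System.out.println(m.firstKey());        // apple  (min key)
            System.out.println(m.lastKey());         // cherry (max key)
            System.out.println(m.headMap("cherry")); // {apple=1, banana=2}
            System.out.println(m.subMap("b", "d"));  // {banana=2, cherry=3}
        }
    }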

Kind regards

robert
 
Arne Vajhøj

I have been operating under a misperception that a SortedMap would be
more quickly searched than an unsorted Map. More specifically, that the
cost would come when adding an element to the SortedMap. I wrote a
little test program to measure the amount of time it took to look for
non-existent data in a Map and a SortedMap and it took somewhere between
twice and three times as long to look for the data in the SortedMap.

So am I now correct in thinking that the only real advantage of a
SortedMap is if you wish to iterate over the keys of the map in some order?

Map and SortedMap are just interfaces.

HashMap get is O(1) but does not maintain order.

SortedMap get is O(log n) but does keep order.

So keeping order costs performance.

Which makes sense to me.

Arne
 
markspace

So am I now correct in thinking that the only real advantage of a
SortedMap is if you wish to iterate over the keys of the map in some order?


What Robert said. Also HashMap might have to "grow" itself, resulting
in re-inserting every single element already in the Map. I think TreeMap
has a more predictable O(log n) insertion time for each insertion.

I might be wrong about that: a balanced tree could rebalance itself
pretty heavily too on any given insertion.
 
David Lamb

Considering Map is an Interface, I question your claim that you measured
its get(x) performance :p

Assuming you used HashMap, as far as I recall, its get(x) performance is
already at O(1) unless you mess up your hashcodes. Doesn't really go lower.

The constant factors can matter a lot. 100 > 1 even though both are O(1).
 
David Lamb

The constant factors can matter a lot. 100 > 1 even though both are O(1).

Which, sigh, of course they're not, as Arne pointed out. O(log N) for
SortedMap, as I should have immediately picked up on from the name alone.
 
Eric Sosman

Considering Map is an Interface, I question your claim that you measured
its get(x) performance :p

Assuming you used HashMap, as far as I recall, its get(x) performance is
already at O(1) unless you mess up your hashcodes. Doesn't really go lower.

"HashMap is O(1)" is sort of true, but sweeps a lot of dirt
under the rug. Most obviously, there's the possibility of a bad
or unlucky hash distribution -- by no means a remote possibility,
as <http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/>
demonstrates. Even if the hash distribution is acceptably uniform,
there's also the matter of building the map in the first place: As
the insertions accumulate, HashMap may need to expand and re-insert
everything it already contains. With N=n1+n2+...+nk mappings, there
could be n1+(n1+n2)+...+(n1+n2+...+nk) = k*n1+(k-1)*n2+...+1*nk
insertions altogether (a good up-front estimate of N can help here).
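
As a rough sketch of the remedy (the sizing arithmetic assumes the
default load factor of 0.75; names and counts are made up):

    import java.util.HashMap;
    import java.util.Map;

    public class PresizeDemo {
        public static void main(String[] args) {
            int expected = 1_000_000;
            // capacity >= N / loadFactor keeps N entries below the
            // resize threshold, so no re-insertion ever happens.
            Map<String, String> map = new HashMap<>((int) (expected / 0.75f) + 1);
            for (int i = 0; i < expected; i++) map.put("k" + i, "v" + i);
        }
    }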

Consider also the operation cost. Searching in a TreeMap of
N mappings takes about lg(N) comparisons -- but with String keys
the comparisons will be of varying difficulties. Comparisons near
the root will probably examine only the first few characters to
determine the result; only as the search approaches the leaves will
the Strings' tail characters come into play. On a successful search
HashMap must iterate over the key String's entire length twice: Once
to compute the hashCode(), and again to verify that a match has in
fact been found (comparisons against non-matching keys in the same
bucket will be quick, like those near the root of a TreeMap).

Finally, O() itself hides a lot of detail -- that is, after all,
its purpose. Given the choice between O(N*N) Algorithm X and O(1)
Algorithm Y, which would you choose? Might you change your mind
if you learned that X takes N*N nanoseconds while Y takes a constant
eighteen hours? Might you change your mind yet again if you learned
that N was ten million? O() alone is not enough to decide a question
of speed.

Still and all: HashMap makes a good "default" choice, and is
"likely" to outperform TreeMap over a "reasonable range" of inputs.
Turn to TreeMap when the order of the keys is significant -- for
example, when you need "closest match" for an absent key.
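
A minimal sketch of that "closest match" usage, via the floorKey and
ceilingKey methods TreeMap picks up from NavigableMap (the example data
is made up):

    import java.util.TreeMap;

    public class ClosestMatchDemo {
        public static void main(String[] args) {
            TreeMap<Integer, String> m = new TreeMap<>();
            m.put(10, "ten");
            m.put(20, "twenty");
            m.put(30, "thirty");

            // 25 is absent; find its nearest neighbors in key order.
            System.out.println(m.floorKey(25));   // 20 (greatest key <= 25)
            System.out.println(m.ceilingKey(25)); // 30 (smallest key >= 25)
        }
    }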
 
Robert Klemme

What Robert said. Also HashMap might have to "grow" itself, resulting
in re-inserting every single element already in the Map. I think TreeMap
has a more predictable O(log n) insertion time for each insertion.

I might be wrong about that: a balanced tree could rebalance itself
pretty heavily too on any given insertion.

TreeMap uses a red-black tree:
http://docs.oracle.com/javase/6/docs/api/java/util/TreeMap.html

It has some characteristics which, according to my memory, avoid the
extreme rebalancing overhead of other binary trees:
http://en.wikipedia.org/wiki/Red_black_tree

Kind regards

robert
 
Arne Vajhøj

Finally, O() itself hides a lot of detail -- that is, after all,
its purpose. Given the choice between O(N*N) Algorithm X and O(1)
Algorithm Y, which would you choose? Might you change your mind
if you learned that X takes N*N nanoseconds while Y takes a constant
eighteen hours? Might you change your mind yet again if you learned
that N was ten million? O() alone is not enough to decide a question
of speed.

Obviously true.

But I would consider it very rare that a worse big-O
algorithm/data-structure is faster in a case where it has
a measurable impact on overall performance.

Still and all: HashMap makes a good "default" choice, and is
"likely" to outperform TreeMap over a "reasonable range" of inputs.
Turn to TreeMap when the order of the keys is significant -- for
example, when you need "closest match" for an absent key.

Yep.

Arne
 
Arne Vajhøj

Which, sigh, of course they're not, as Arne pointed out. O(log N) for
SortedMap, as I should have immediately picked up on from the name alone.

SortedMap is still just an interface.

Arne
 
Eric Sosman

Obviously true.

But I would consider it very rare that a worse big-O
algorithm/data-structure is faster in a case where it has
a measurable impact on overall performance.

That's why Quicksort implementations never use an O(N*N)
InsertionSort for small subfiles.

Oh -- Hey, wait ...
 
Joshua Cranmer

Obviously true.

But I would consider it very rare that a worse big-O
algorithm/data-structure is faster in a case where it has
a measurable impact on overall performance.

Well, we always do matrix multiplication using that O(n^2.3727)
algorithm... oh wait.

And we always use merge sort or heap sort instead of all the other
sorts... oh wait. Maybe radix sort if we're doing integers... oh wait.

I read a paper which said you could use log space to see if an
undirected graph was connected. The hidden constant is 3^16, and it
looks like you'd need inputs far larger than that constant before it
becomes usable in practice.

Hidden constants in big-O actually matter a lot.
 
Arne Vajhøj

That's why Quicksort implementations never use an O(N*N)
InsertionSort for small subfiles.

Oh -- Hey, wait ...

And is it not rare that this has a measurable impact on
overall performance?

Arne
 
Arne Vajhøj

Well, we always do matrix multiplication using that O(n^2.3727)
algorithm... oh wait.

OK. That is a good counterexample.

And we always use merge sort or heap sort instead of all the other
sorts... oh wait.

I don't see the point here. Those have the same big O as
quick sort, not worse.

I read a paper which said you could use log space to see if an
undirected graph was connected. The hidden constant is 3^16, and it
looks like you'd need inputs far larger than that constant before it
becomes usable in practice.

And that contradicts it being rare?

Arne
 
Eric Sosman

And is it not rare that this has a measurable impact on
overall performance?

Every non-student Quicksort I've ever seen has used an O(N*N)
method for short subfiles. Every single one. That's the opposite
of "rare," I'd say.

Or look at String#indexOf(String). Some fast algorithms
like Boyer-Moore have already been mentioned in this thread, yet
Java's own implementation uses a naive method that is linear at
best, super-linear if unlucky (I think it can be O(M*N), where M
and N are the lengths of the two Strings). If the naive method
were replaced by Boyer-Moore or Knuth-Morris-Pratt or some other
better-O() algorithm, would you expect performance to improve?
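
A sketch of the sort of input that provokes that worst case (an
assumption about naive searching in general, not a measurement of any
particular JDK):

    public class IndexOfWorstCase {
        public static void main(String[] args) {
            // The needle almost matches at every offset, so a naive
            // search re-scans ~M characters at each of ~N positions.
            String haystack = "a".repeat(1_000_000);      // String.repeat: Java 11+
            String needle = "a".repeat(1_000) + "b";
            long t0 = System.nanoTime();
            System.out.println(haystack.indexOf(needle)); // -1
            System.out.println((System.nanoTime() - t0) / 1_000_000 + " ms");
        }
    }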

The ordinary Quickselect algorithm finds a median (or other
quantile) in O(N) expected time but O(N*N) worst-case time. The
Blum-Floyd-Pratt-Rivest-Tarjan variant is guaranteed O(N) in all
cases. Which would you choose? (Hint: Wikipedia says of BFPRT
"Although this approach optimizes quite well, it is typically
outperformed in practice by the expected linear algorithm with
random pivot choices.")

If Algorithm F takes f(N) time while G takes g(N), we might
inspect the two functions and observe that O(f(N)) < O(g(N)).
This allows us to conclude that F will be faster than G *if* N
is sufficiently large -- but the big-Oh statement says nothing
about how large N must be for F to be faster. Nothing. F may
be faster than G for all N, or only for N > 100, or only for
N > 1E100. That's why we see mixed-strategy implementations so
frequently: The programmer will test the value of N and select F
or G depending on which is (believed to be) faster *for that N*.

To put it even more simply: F's greater asymptotic speed
probably comes at the expense of more complication than G, and
the extra complication takes time. For large enough N the time
is well-spent, but for small N it's very likely wasted.
 
Joshua Cranmer

I don't see the point here. Those have the same big O not
worse than quick sort.

We tend to use quicksort as often as merge sort, if not more, and
quicksort has O(N^2) worst-case time. As mentioned earlier,
quicksorts--including Java's implementation--often use insertion sort
for very small arrays (when you get down to 7-16 or so). Heapsort tends
to be extremely rare, despite being an O(n lg n) in-place sort, since
maintaining the heap tends to come with a relatively oppressive constant
factor.

Insertion sort, our favorite O(N^2) sort, is actually a very common
sorting algorithm, especially if you realize that keeping an array
sorted as you add elements to it tends to end up being an insertion sort.
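
A minimal sketch of that pattern (made-up data): Collections.binarySearch
finds the slot in O(log n), and List.add(pos, x) does the O(n) shifting
that makes the whole thing an insertion sort.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class SortedInsertDemo {
        public static void main(String[] args) {
            List<Integer> sorted = new ArrayList<>();
            for (int x : new int[] {5, 1, 4, 2, 3}) {
                int pos = Collections.binarySearch(sorted, x);
                if (pos < 0) pos = -pos - 1;  // decode "not found" into an insertion point
                sorted.add(pos, x);           // O(n) shift: the insertion-sort step
            }
            System.out.println(sorted);       // [1, 2, 3, 4, 5]
        }
    }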

Radix sort, which is O(N) for fixed-size elements (like primitive
integers), is quite rare in practice despite being asymptotically better
than quicksort or mergesort. A bucket sort would actually be
asymptotically superior in Java for bytes, shorts, and chars, too (256
or 64K integer arrays are not onerous in memory usage on modern computers).
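
A counting-sort sketch for bytes to make that concrete (the 256-entry
histogram is the entire extra memory cost; note it orders by unsigned
value):

    import java.util.Arrays;

    public class ByteCountingSort {
        // Sorts a byte[] in O(n) time with a 256-entry histogram.
        static void sort(byte[] a) {
            int[] counts = new int[256];
            for (byte b : a) counts[b & 0xFF]++; // tally each unsigned value
            int i = 0;
            for (int v = 0; v < 256; v++)
                while (counts[v]-- > 0) a[i++] = (byte) v;
        }

        public static void main(String[] args) {
            byte[] data = {3, 1, 2, 1, 0};
            sort(data);
            System.out.println(Arrays.toString(data)); // [0, 1, 1, 2, 3]
        }
    }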

And that contradicts it being rare?

No, it's asymptotically better space-wise than BFS or DFS (which are
both linear in space). But I highly doubt anyone has even bothered to
write an implementation in any computer language (the paper itself
didn't specify the whole algorithm concisely).
 
Knute Johnson

Or, if the key comparison is cheap, i.e. short strings or integer
comparisons, and the key population is very large, TreeMap can become
attractive: with 8 million nodes you only need 23 comparisons to find a
miss. Growing the tree isn't too expensive either: rebalancing a balanced
red-black tree, which is necessary from time to time as it grows, is only
a matter of adjusting pointers.

That's what I thought, that the searching for a match would be very
quick once the data was in the TreeMap. The test code I wrote may not
have been any good. I created a map with Strings for the keys and
values. I could get about 2 million entries before I started running
into virtual memory issues slowing things down. I searched for
non-existent elements. Using both a HashMap and a TreeMap, the TreeMap
was significantly slower than the HashMap. I even tried a
ConcurrentHashMap and multiple threads to do the search to see if that
would speed things up. While that technique was better than TreeMap it
was still slower than a plain HashMap.

So for the actual case that I am working on, I have a collection of
about 5000 elements and am using an Integer as the key. Data is rarely
changed but often accessed. There should never be a case where the
searched-for key will not exist. What would you use, a HashMap or a
TreeMap?

Thanks,

knute...
 
Arne Vajhøj

That's what I thought, that the searching for a match would be very
quick once the data was in the TreeMap. The test code I wrote may not
have been any good. I created a map with Strings for the keys and
values. I could get about 2 million entries before I started running
into virtual memory issues slowing things down. I searched for
non-existent elements. Using both a HashMap and a TreeMap, the TreeMap
was significantly slower than the HashMap. I even tried a
ConcurrentHashMap and multiple threads to do the search to see if that
would speed things up. While that technique was better than TreeMap it
was still slower than a plain HashMap.

So for the actual case that I am working on, I have a collection of
about 5000 elements and am using an Integer as the key. Data is rarely
changed but often accessed. There should never be a case where the
searched-for key will not exist. What would you use, a HashMap or a
TreeMap?

If it is important then you should measure!

But I would still expect HashMap to be faster than TreeMap.
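
A naive sketch of such a measurement (for anything serious use a real
harness like JMH, since JIT warmup and dead-code elimination distort
numbers like these):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class MapBench {
        public static void main(String[] args) {
            Map<Integer, String> hash = new HashMap<>();
            Map<Integer, String> tree = new TreeMap<>();
            for (int i = 0; i < 5_000; i++) {
                hash.put(i, "v" + i);
                tree.put(i, "v" + i);
            }
            System.out.println("HashMap: " + time(hash) + " ms");
            System.out.println("TreeMap: " + time(tree) + " ms");
        }

        // Look up every key many times; return elapsed milliseconds.
        static long time(Map<Integer, String> m) {
            int sink = 0;
            long t0 = System.nanoTime();
            for (int rep = 0; rep < 1_000; rep++)
                for (int i = 0; i < 5_000; i++)
                    sink += m.get(i).length(); // use the result so it isn't optimized away
            if (sink == -1) System.out.println();  // keep 'sink' observably live
            return (System.nanoTime() - t0) / 1_000_000;
        }
    }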

Arne
 
