SortedMap question?

L

Lew

Knute said:
That's what I thought, that the searching for a match would be very
quick once the data was in the TreeMap. The test code I wrote may not
have been any good. I created a map with Strings for the keys and
values. I could get about 2 million entries before I started running
into virtual memory issues slowing things down. I searched for
non-existent elements. Using both a HashMap and a TreeMap, the TreeMap
was significantly slower than the HashMap. I even tried a
ConcurrentHashMap and multiple threads to do the search to see if that
would speed things up. While that technique was better than TreeMap it
was still slower than a plain HashMap.

So for the actual case that I am working on, I have a collection of
about 5000 elements and am using an Integer as the key. Data is rarely
changed but often accessed. There should never be a case where the
searched-for key will not exist. What would you use, a HashMap or a
TreeMap?

That would depend entirely on whether I needed to iterate the map in
sorted order. There is little reason to prefer 'TreeMap' absent a sorting
requirement except, as pointed out upthread, if you need guaranteed O(log n)
performance for gets rather than the probabilistic promise of hashing.
Such scenarios would require specialized knowledge of the data sets to be
mapped.

Kudos for your efforts to obtain objective performance evidence.
 
A

Arne Vajhøj

(I guess that's "rare" and a typo.)
Yes.

Every non-student Quicksort I've ever seen has used an O(N*N)
method for short subfiles. Every single one. That's the opposite
of "rare," I'd say.

Yes.

But "rare occurrence" and "rare measurable impact on overall
performance" are not quite the same thing.
Or look at String#indexOf(String). Some fast algorithms
like Boyer-Moore have already been mentioned in this thread, yet
Java's own implementation uses a naive method that is linear at
best, super-linear if unlucky (I think it can be O(M*N), where M
and N are the lengths of the two Strings). If the naive method
were replaced by Boyer-Moore or Knuth-Morris-Pratt or some other
better-O() algorithm, would you expect performance to improve?

I would not expect it to become significantly worse.

Of course I could be wrong.

Joshua already found the matrix multiplication example.
If Algorithm F takes f(N) time while G takes g(N), we might
inspect the two functions and observe that O(f(N)) < O(g(N)).
This allows us to conclude that F will be faster than G *if* N
is sufficiently large -- but the big-Oh statement says nothing
about how large N must be for F to be faster. Nothing. F may
be faster than G for all N, or only for N > 100, or only for
N > 1E100. That's why we see mixed-strategy implementations so
frequently: The programmer will test the value of N and select F
or G depending on which is (believed to be) faster *for that N*.

To put it even more simply: F's greater asymptotic speed
probably comes at the expense of more complication than G, and
the extra complication takes time. For large enough N the time
is well-spent, but for small N it's very likely wasted.

All true.

And if one knows the algorithm's full characteristics and knows
one's input data well, then one can pick the best algorithm
with certainty.

But if one really does not know, then I would go for the best
big-O: even if it is not the fastest, it is most likely not that
much worse than the best (maybe +100%), whereas if one goes for a
worse big-O and it is not the fastest, you can easily be killed
performance-wise (x100).
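
The mixed-strategy pattern described above can be made concrete with a
short Java sketch. The class name and the cutoff are illustrative, not
from any real library; 47 happens to match the insertion-sort threshold
in some OpenJDK versions of Arrays.sort, but the right value can only
come from measurement:

import java.util.Arrays;

public class MixedStrategySort {
    // Hypothetical cutoff; tune by benchmarking, not by guessing.
    private static final int INSERTION_SORT_CUTOFF = 47;

    public static void sort(int[] a, int lo, int hi) {
        if (hi - lo < INSERTION_SORT_CUTOFF) {
            insertionSort(a, lo, hi);  // O(N*N), but a tiny constant for small N
        } else {
            Arrays.sort(a, lo, hi);    // O(N log N), worth its overhead for large N
        }
    }

    // Straightforward insertion sort on a[lo, hi).
    private static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i < hi; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }
}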

Arne
 
A

Arne Vajhøj

We tend to use quicksort as often as merge sort, if not more, and
quicksort has O(N^2) worst-case time. As mentioned earlier,
quicksorts--including Java's implementation--often use insertion sort
for very small arrays (when you get down to 7-16 or so). Heapsort tends
to be extremely rare, despite being an O(n lg n) in-place sort, since
maintaining the heap tends to come with a relatively oppressive constant
factor.

I thought it was just x2.

Which of course is important enough to prefer quicksort, but
not a disaster.

And still not what I was talking about.
Radix sort, which is O(N) for fixed-size elements (like primitive
integers), is quite rare in practice despite being asymptotically better
than quicksort or mergesort. A bucket sort would actually be
asymptotically superior in Java for bytes, shorts, and chars, too (256
or 64K integer arrays are not onerous in memory usage on modern computers).

But is it due to high constant factors that they are not used, or
because memory was more limited when Java was invented?

NB: a quick look at some Java sources indicates that Java actually does
use counting sort for small arrays of byte/char/short.
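
For concreteness, counting sort for a byte[] fits in a dozen lines.
This is a simplified sketch of the technique, not the JDK's actual
code:

public class ByteCountingSort {
    public static void sort(byte[] a) {
        int[] count = new int[256];          // one counter per possible byte value
        for (byte b : a) {
            count[b & 0xFF]++;               // mask maps -128..127 onto 0..255
        }
        int i = 0;
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            int c = count[v & 0xFF];
            while (c-- > 0) {
                a[i++] = (byte) v;           // emit each value count times, in order
            }
        }
    }
}

Two passes, O(N + 256) time, and no comparisons at all, which is why
the constant factor beats quicksort once N is large enough relative to
the 256-entry table.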

Arne
 
D

David Lamb

SortedMap is still just an interface.

Maybe so, but I suspect programmers would be unhappy if an
implementation didn't actually sort anything. I suppose there might be
special cases; a SortedMap from small integers to anything could be
O(1) for get().
 
A

Arne Vajhøj

David Lamb said:
Maybe so, but I suspect programmers would be unhappy if an
implementation didn't actually sort anything. I suppose there might be
special cases; a SortedMap from small integers to anything could be
O(1) for get().

Better is difficult for the general case. Worse is easy.

Arne
 
E

Eric Sosman

If it is important then you should measure!
Aaay-men!

But I would still expect HashMap to be faster than TreeMap.

Yes. A HashMap with 5000 keys and the default 0.75 load
factor will want 5000/0.75 ~= 6667 buckets, which it will round
up to 8192. With normal luck the 5000 keys will land in ~3743
buckets, about two-thirds of which will hold just one key. So
a typical search involves a few arithmetic operations to locate
the bucket, then about 5000/3743 ~= 1.3 Integer#equals() calls.

Searching a TreeMap of 5000 keys, on the other hand, will
use about lg(5000) ~= 12.3 Integer#compareTo() calls. Uses of
equals() and compareTo() have different costs (equals() needs to
work with Objects and check their classes before comparing, but
compareTo() needs to produce a three-way answer rather than a
simpler boolean ...), but any differences are almost certainly
swamped by the disparity in call counts.

If the key values are not too widely scattered, other
possibilities exist. A BitSet can answer "present or absent"
queries very quickly and without taking much space, or a simple
array can serve as a Map stand-in. But follow Arne's advice
when making your choice: Measure It!
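
In that spirit, a crude timing sketch for the 5000-Integer-key case
might look like the following. All names here are illustrative, and a
serious measurement should use a harness such as JMH to deal with JIT
warmup and dead-code elimination:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapLookupTiming {
    public static void main(String[] args) {
        Map<Integer, String> hash = new HashMap<>();
        Map<Integer, String> tree = new TreeMap<>();
        for (int i = 0; i < 5000; i++) {
            hash.put(i, "v" + i);
            tree.put(i, "v" + i);
        }
        System.out.println("HashMap: " + time(hash) + " ms");
        System.out.println("TreeMap: " + time(tree) + " ms");
    }

    private static long time(Map<Integer, String> map) {
        long sink = 0;                       // consumed below so the JIT
        long start = System.nanoTime();      // cannot discard the lookups
        for (int round = 0; round < 1000; round++) {
            for (int i = 0; i < 5000; i++) {
                sink += map.get(i).length();
            }
        }
        long millis = (System.nanoTime() - start) / 1_000_000;
        if (sink == -1) System.out.println("impossible");
        return millis;
    }
}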
 
J

Jim Janney

Eric Sosman said:
Or look at String#indexOf(String). Some fast algorithms
like Boyer-Moore have already been mentioned in this thread, yet
Java's own implementation uses a naive method that is linear at
best, super-linear if unlucky (I think it can be O(M*N), where M
and N are the lengths of the two Strings). If the naive method
were replaced by Boyer-Moore or Knuth-Morris-Pratt or some other
better-O() algorithm, would you expect performance to improve?

Boyer-Moore and its descendants invest more time up front building the
jump table, in return for faster searching once the table is built.
That pays off if you're searching for the same pattern across a large
volume of text. But for the common case where strings are relatively
short, I'd expect the brute-force (not naive) approach to be the better
choice.
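
For reference, the brute-force approach described above is only a few
lines. This sketch is in the spirit of String.indexOf, not the JDK's
actual code, and it shows where the O(M*N) worst case comes from
(think needle "aaab" against a haystack of all 'a's):

public class BruteForceSearch {
    public static int indexOf(String haystack, String needle) {
        int n = haystack.length(), m = needle.length();
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m && haystack.charAt(i + j) == needle.charAt(j)) {
                j++;                         // extend the partial match
            }
            if (j == m) {
                return i;                    // full match starts at i
            }
            // mismatch: the partial match is discarded and we restart at
            // i+1, exactly the backtracking Boyer-Moore-style tables avoid
        }
        return -1;
    }
}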
 
J

Joshua Cranmer

Jim Janney said:
Boyer-Moore and its descendants invest more time up front building the
jump table, in return for faster searching once the table is built.
That pays off if you're searching for the same pattern across a large
volume of text. But for the common case where strings are relatively
short, I'd expect the brute-force (not naive) approach to be the better
choice.

Well, it really pays off when the following conditions hold:
1. The string to search for is long.
2. The string to search for repeats a largish prefix of itself later
(for some of the later descendants).
3. You are likely to find *a lot* of partial matches of the string (so
full backtracking measurably hurts).

I suspect English words in English text don't satisfy these conditions
very well, and I presume that scenario (or similar ones) accounts for
a lot of String.indexOf searches.

Where this would pay off would be something like trying to find DNA
subsequences: you have a 1-in-4 chance of partially matching a given
letter, your target string is likely to be lengthy and have lots of
similarities within it, and the text being searched is extremely long.

There are other penalties to advanced substring-finding algorithms
beyond the cost of building the lookup table. The most important is
that consulting the jump table carries a minor microarchitectural
penalty. Others include the extra cache memory the table occupies and
an increased surface for bugs.
 
D

Daniel Pitts

David Lamb said:
Maybe so, but I suspect programmers would be unhappy if an
implementation didn't actually sort anything. I suppose there might be
special cases; a SortedMap from small integers to anything could be
O(1) for get().

That special case happens to be called BitSet, which doesn't implement
SortedMap :)
 
E

Eric Sosman

Daniel Pitts said:
That special case happens to be called BitSet, which doesn't implement
SortedMap :)

Another name for that special case is Value[] -- which
doesn't actually implement SortedMap, but does all that's needed.
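
What "Value[] does all that's needed" can look like in practice,
sketched with illustrative names: when the keys are small non-negative
integers, the array index is the key, so get() is O(1) with no hashing
or comparisons at all.

public class ArrayBackedMap<V> {
    private final Object[] slots;

    public ArrayBackedMap(int maxKey) {
        slots = new Object[maxKey + 1];     // one slot per possible key
    }

    public void put(int key, V value) {
        slots[key] = value;                 // the key is the index
    }

    @SuppressWarnings("unchecked")
    public V get(int key) {
        return (V) slots[key];              // null means absent, like Map.get
    }
}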
 
