SortedMap question?

L

Lew

Knute said:
That's what I thought, that the searching for a match would be very
quick once the data was in the TreeMap. The test code I wrote may not
have been any good. I created a map with Strings for the keys and
values. I could get about 2 million entries before I started running
into virtual memory issues slowing things down. I searched for
non-existent elements. Using both a HashMap and a TreeMap, the TreeMap
was significantly slower than the HashMap. I even tried a
ConcurrentHashMap and multiple threads to do the search to see if that
would speed things up. While that technique was better than TreeMap it
was still slower than a plain HashMap.

So for the actual case that I am working on, I have a collection of
about 5000 elements and am using an Integer as the key. Data is rarely
changed but often accessed. There should never be a case where the
searched-for key will not exist. What would you use, a HashMap or a
TreeMap?

That would depend entirely on whether I needed to iterate the map in
sorted order. There is little reason to prefer 'TreeMap' absent a sorting
requirement except, as pointed out upthread, if you need guaranteed O(log n)
performance for gets rather than the probabilistic promise of hashing.
Such scenarios would require specialized knowledge of the data sets to be
mapped.

Kudos for your efforts to obtain objective performance evidence.
 
A

Arne Vajhøj

(I guess that's "rare" and a typo.)
Yes.

Every non-student Quicksort I've ever seen has used an O(N*N)
method for short subfiles. Every single one. That's the opposite
of "rare," I'd say.

Yes.

But "rare occurrence" and "rare measurable impact on overall
performance" are not quite the same thing.
Or look at String#indexOf(String). Some fast algorithms
like Boyer-Moore have already been mentioned in this thread, yet
Java's own implementation uses a naive method that is linear at
best, super-linear if unlucky (I think it can be O(M*N), where M
and N are the lengths of the two Strings). If the naive method
were replaced by Boyer-Moore or Knuth-Morris-Pratt or some other
better-O() algorithm, would you expect performance to improve?

I would not expect it to become significantly worse.

Of course I could be wrong.

Joshua already found the matrix multiplication example.
If Algorithm F takes f(N) time while G takes g(N), we might
inspect the two functions and observe that O(f(N)) < O(g(N)).
This allows us to conclude that F will be faster than G *if* N
is sufficiently large -- but the big-Oh statement says nothing
about how large N must be for F to be faster. Nothing. F may
be faster than G for all N, or only for N > 100, or only for
N > 1E100. That's why we see mixed-strategy implementations so
frequently: The programmer will test the value of N and select F
or G depending on which is (believed to be) faster *for that N*.

To put it even more simply: F's greater asymptotic speed
probably comes at the expense of more complication than G, and
the extra complication takes time. For large enough N the time
is well-spent, but for small N it's very likely wasted.

All true.

And if one knows the algorithm's full characteristics and knows
one's input data well, then one can pick the best algorithm
with certainty.

But if one really does not know, then I would go for the best
big-O: even if it is not the fastest, it is most likely not that
much worse than the best (maybe +100%), whereas if one goes for a
worse big-O and it is not the fastest, you can easily be killed
performance-wise (x100).
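
The mixed-strategy pattern described above can be made concrete with a
short Java sketch. The class name and the cutoff are illustrative, not
from any real library; 47 happens to match the insertion-sort threshold
in some OpenJDK versions of Arrays.sort, but the right value can only
come from measurement:

import java.util.Arrays;

public class MixedStrategySort {
    // Hypothetical cutoff; tune by benchmarking, not by guessing.
    private static final int INSERTION_SORT_CUTOFF = 47;

    public static void sort(int[] a, int lo, int hi) {
        if (hi - lo < INSERTION_SORT_CUTOFF) {
            insertionSort(a, lo, hi);  // O(N*N), but a tiny constant for small N
        } else {
            Arrays.sort(a, lo, hi);    // O(N log N), worth its overhead for large N
        }
    }

    // Straightforward insertion sort on a[lo, hi).
    private static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i < hi; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }
}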

Arne
 
A

Arne Vajhøj

We tend to use quicksort as often as merge sort, if not more, and
quicksort has O(N^2) worst-case time. As mentioned earlier,
quicksorts--including Java's implementation--often use insertion sort
for very small arrays (when you get down to 7-16 or so). Heapsort tends
to be extremely rare, despite being an O(n lg n) in-place sort, since
maintaining the heap tends to come with a relatively oppressive constant
factor.

I thought it was just x2.

Which of course is important enough to prefer quicksort, but
not a disaster.

And still not what I was talking about.
Radix sort, which is O(N) for fixed-size elements (like primitive
integers), is quite rare in practice despite being asymptotically better
than quicksort or mergesort. A bucket sort would actually be
asymptotically superior in Java for bytes, shorts, and chars, too (256
or 64K integer arrays are not onerous in memory usage on modern computers).

But is it due to high constant factors that they are not used, or
because memory was more limited when Java was invented?

NB: a quick look at some Java sources indicates that Java actually does
use counting sort for small arrays of byte/char/short.
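
For concreteness, counting sort for a byte[] fits in a dozen lines.
This is a simplified sketch of the technique, not the JDK's actual
code:

public class ByteCountingSort {
    public static void sort(byte[] a) {
        int[] count = new int[256];          // one counter per possible byte value
        for (byte b : a) {
            count[b & 0xFF]++;               // mask maps -128..127 onto 0..255
        }
        int i = 0;
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            int c = count[v & 0xFF];
            while (c-- > 0) {
                a[i++] = (byte) v;           // emit each value count times, in order
            }
        }
    }
}

Two passes, O(N + 256) time, and no comparisons at all, which is why
the constant factor beats quicksort once N is large enough relative to
the 256-entry table.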

Arne
 
D

David Lamb

SortedMap is still just an interface.

Maybe so, but I suspect programmers would be unhappy if an
implementation didn't actually sort anything. I suppose there might be
special cases; a SortedMap from small integers to anything could be
O(1) for get().
 
A

Arne Vajhøj

David Lamb said:
Maybe so, but I suspect programmers would be unhappy if an
implementation didn't actually sort anything. I suppose there might be
special cases; a SortedMap from small integers to anything could be
O(1) for get().

Better is difficult for the general case. Worse is easy.

Arne
 
E

Eric Sosman

If it is important then you should measure!
Aaay-men!

But I would still expect HashMap to be faster than TreeMap.

Yes. A HashMap with 5000 keys and the default 0.75 load
factor will want 5000/0.75 ~= 6667 buckets, which it will round
up to 8192. With normal luck the 5000 keys will land in ~3743
buckets, about two-thirds of which will hold just one key. So
a typical search involves a few arithmetic operations to locate
the bucket, then about 5000/3743 ~= 1.3 Integer#equals() calls.

Searching a TreeMap of 5000 keys, on the other hand, will
use about lg(5000) ~= 12.3 Integer#compareTo() calls. Uses of
equals() and compareTo() have different costs (equals() needs to
work with Objects and check their classes before comparing, but
compareTo() needs to produce a three-way answer rather than a
simpler boolean ...), but any differences are almost certainly
swamped by the disparity in call counts.

If the key values are not too widely scattered, other
possibilities exist. A BitSet can answer "present or absent"
queries very quickly and without taking much space, or a simple
array can serve as a Map stand-in. But follow Arne's advice
when making your choice: Measure It!
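
In that spirit, a crude timing sketch for the 5000-Integer-key case
might look like the following. All names here are illustrative, and a
serious measurement should use a harness such as JMH to deal with JIT
warmup and dead-code elimination:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapLookupTiming {
    public static void main(String[] args) {
        Map<Integer, String> hash = new HashMap<>();
        Map<Integer, String> tree = new TreeMap<>();
        for (int i = 0; i < 5000; i++) {
            hash.put(i, "v" + i);
            tree.put(i, "v" + i);
        }
        System.out.println("HashMap: " + time(hash) + " ms");
        System.out.println("TreeMap: " + time(tree) + " ms");
    }

    private static long time(Map<Integer, String> map) {
        long sink = 0;                       // consumed below so the JIT
        long start = System.nanoTime();      // cannot discard the lookups
        for (int round = 0; round < 1000; round++) {
            for (int i = 0; i < 5000; i++) {
                sink += map.get(i).length();
            }
        }
        long millis = (System.nanoTime() - start) / 1_000_000;
        if (sink == -1) System.out.println("impossible");
        return millis;
    }
}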
 
J

Jim Janney

Eric Sosman said:
Or look at String#indexOf(String). Some fast algorithms
like Boyer-Moore have already been mentioned in this thread, yet
Java's own implementation uses a naive method that is linear at
best, super-linear if unlucky (I think it can be O(M*N), where M
and N are the lengths of the two Strings). If the naive method
were replaced by Boyer-Moore or Knuth-Morris-Pratt or some other
better-O() algorithm, would you expect performance to improve?

Boyer-Moore and its descendants invest more time up front building the
jump table, in return for faster searching once the table is built.
That pays off if you're searching for the same pattern across a large
volume of text. But for the common case where strings are relatively
short, I'd expect the brute-force (not naive) approach to be the better
choice.
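
For reference, the brute-force approach described above is only a few
lines. This sketch is in the spirit of String.indexOf, not the JDK's
actual code, and it shows where the O(M*N) worst case comes from
(think needle "aaab" against a haystack of all 'a's):

public class BruteForceSearch {
    public static int indexOf(String haystack, String needle) {
        int n = haystack.length(), m = needle.length();
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m && haystack.charAt(i + j) == needle.charAt(j)) {
                j++;                         // extend the partial match
            }
            if (j == m) {
                return i;                    // full match starts at i
            }
            // mismatch: the partial match is discarded and we restart at
            // i+1, exactly the backtracking Boyer-Moore-style tables avoid
        }
        return -1;
    }
}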
 
J

Joshua Cranmer

Jim Janney said:
Boyer-Moore and its descendants invest more time up front building the
jump table, in return for faster searching once the table is built.
That pays off if you're searching for the same pattern across a large
volume of text. But for the common case where strings are relatively
short, I'd expect the brute-force (not naive) approach to be the better
choice.

Well, it really pays off when the following conditions hold:
1. The string to search for is long.
2. The string to search for repeats a largish prefix of itself later
(for some of the later descendants).
3. You are likely to find *a lot* of partial matches of the string (so
full backtracking measurably hurts).

I suspect English words in English text don't satisfy these conditions
very well, and I presume that scenario (or similar ones) accounts for
a lot of String.indexOf searches.

Where this would pay off would be something like trying to find DNA
subsequences: you have a 1-in-4 chance of partially matching a given
letter, your target string is likely to be lengthy and have lots of
similarities within it, and the text being searched is extremely long.

There are other penalties to advanced substring-finding algorithms
beyond the cost of building the lookup table. The most important is
that consulting the jump table carries a minor microarchitectural
penalty. Others include the extra cache memory the table occupies and
an increased surface for bugs.
 
D

Daniel Pitts

David Lamb said:
Maybe so, but I suspect programmers would be unhappy if an
implementation didn't actually sort anything. I suppose there might be
special cases; a SortedMap from small integers to anything could be
O(1) for get().

That special case happens to be called BitSet, which doesn't implement
SortedMap :)
 
E

Eric Sosman

Daniel Pitts said:
That special case happens to be called BitSet, which doesn't implement
SortedMap :)

Another name for that special case is Value[] -- which
doesn't actually implement SortedMap, but does all that's needed.
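
What "Value[] does all that's needed" can look like in practice,
sketched with illustrative names: when the keys are small non-negative
integers, the array index is the key, so get() is O(1) with no hashing
or comparisons at all.

public class ArrayBackedMap<V> {
    private final Object[] slots;

    public ArrayBackedMap(int maxKey) {
        slots = new Object[maxKey + 1];     // one slot per possible key
    }

    public void put(int key, V value) {
        slots[key] = value;                 // the key is the index
    }

    @SuppressWarnings("unchecked")
    public V get(int key) {
        return (V) slots[key];              // null means absent, like Map.get
    }
}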
 
