best collection to use for fast lookups

Rob Baxter · Aug 20, 2004

I have a situation where I will need to load a large number
(potentially several thousand) of relatively small objects into an in
memory collection. Once the objects are initially loaded the
collection can assumed to be static (ie no more objects will be
added/removed).

This collection will be used to lookup objects based on a few string
and integer attributes. My primary concern here is to optimize to
speed of the lookups as much as possible as i will potentially have to
do several hundred lookups per second. Given this criteria, can anyone
suggest the optimal container framework I should use to store the
objects?

I am considering using an Array to store the objects and then
implementing the java.util.Comparator interface for the objects and
passing that to the Array.sort to presort them. Then using the
Array.binarySearch (and the Comparator again) method to perform the
lookup.

How would this approach compare performance with a hash table or some
other collection? Any advice or recomendations would be greatly
appreciated. TIA,

</rob>

Paul Lutus · Aug 20, 2004

Rob said:
I have a situation where I will need to load a large number
(potentially several thousand) of relatively small objects into an in
memory collection. Once the objects are initially loaded the
collection can assumed to be static (ie no more objects will be
added/removed).

This collection will be used to lookup objects based on a few string
and integer attributes.

"A few string and integer attributes." Perhaps an example. If you cannot
specify the exact desired data in advance, the fastest search would be a
binary search on a sorted list. If you CAN specify the exact desired data
in advance, a hashtable would be faster.

My primary concern here is to optimize to
speed of the lookups as much as possible as i will potentially have to
do several hundred lookups per second. Given this criteria, can anyone
suggest the optimal container framework I should use to store the
objects?

I am considering using an Array to store the objects and then
implementing the java.util.Comparator interface for the objects and
passing that to the Array.sort to presort them. Then using the
Array.binarySearch (and the Comparator again) method to perform the
lookup.

How would this approach compare performance with a hash table or some
other collection?

They require and meet different criteria, so they are not really comparable.
For the hashtable approach, the table doesn't have to be sorted, but for
the binary search it does.

If, as seems likely, you are expecting to enter a search string that
approximates the data to be found, you are limited to binary search.

Nick Pomfret · Aug 21, 2004

I would use a Map, or more specifically a number of Maps.

If you have an object that you want to be able to look up by a string value
and a number value I would create your own collection with its own
get(String) and get(int) methods. The collection might contain two HashMaps
each indexed differently:

class MyBean {
private String name;
private int age;

public MyBean(String name, int age) {
this.name = name;
this.age = age;
}

public String getName() {
return name;
}

public int getAge() {
return age;
}
}

class MyCollection {
private Map beansByName = new HashMap();
private Map beansByAge = new HashMap();

public void add(MyBean value) {
beansByAge.put(new Integer(value.getAge()), value);
beansByName.put(value.getName(), value);
}

public MyBean get(int age) {
return (MyBean) beansByAge.get(new Integer(age));
}

public MyBean get(String name) {
return (MyBean) beansByName.get(name);
}
}

--

* --------------------------------------------------*

For feature-rich JTables visit
http://www.tabletoolkit.com

Frank · Aug 21, 2004

Rob said:
How would this approach compare performance with a hash table or some
other collection? Any advice or recomendations would be greatly
appreciated. TIA,

</rob>

The complexity of a hash table lookup is O(1), constant time, while for
binary search it's O(log N) iirc.

If i get you correctly, what you want to do is look up one or multiple(a
Set/List of) objects using some form of key, hash tables are what you want.

If you need to dig out ranges, say where "abc" < MyClass.getLastName() <
"def", you'll need a sorted structure; either a tree, or since your data
is static, a sorted array. SortedMap's subMap(K fromKey, K toKey) would
be convenient? Tree lookup is ~ O(log N) as well, i think.

Personally, I'd go with prebuilt container classes all the way. Hacking
your own with arrays is just a waste of time.

Real containers do take up a little more memory than arrays, but the
memory cost of the containers themselves are always neglible in the big
picture.
And if your app actually should be pushing the physical memory limits,
you're better off with a hash map anyway, as it's more paging friendly
than binary search / tree lookups.

Not about to give a whole data structures & algorithms 101 here; post
again for specifics.
And do stay away from the HashTable class, it has a lot of overhead due
to synchronization - go with HashMap instead.

David Segall · Aug 21, 2004

I have a situation where I will need to load a large number
(potentially several thousand) of relatively small objects into an in
memory collection. Once the objects are initially loaded the
collection can assumed to be static (ie no more objects will be
added/removed).

This collection will be used to lookup objects based on a few string
and integer attributes. My primary concern here is to optimize to
speed of the lookups as much as possible as i will potentially have to
do several hundred lookups per second. Given this criteria, can anyone
suggest the optimal container framework I should use to store the
objects?

I am considering using an Array to store the objects and then
implementing the java.util.Comparator interface for the objects and
passing that to the Array.sort to presort them. Then using the
Array.binarySearch (and the Comparator again) method to perform the
lookup.

How would this approach compare performance with a hash table or some
other collection? Any advice or recomendations would be greatly
appreciated. TIA,

</rob>

This does not answer your Java question but if the collection is
really static you can probably create a perfect hash function. By
"really static" I mean that it is worth preprocessing with a program
like Gperf (http://directory.fsf.org/gperf.html).

Chris Uppal · Aug 22, 2004

Rob said:
How would this approach compare performance with a hash table or some
other collection? Any advice or recomendations would be greatly
appreciated. TIA,

It's quite complicated if you want a general purpose solution.

If I were in this position then I'd start by looking for a the lightweight,
in-memory, SQL databases. I know that such things exist, but can't make any
recommendations, since I've never had to use one.

If that's not suitable (and BTW, "several hundred per second" is /not/ a taxing
performance target) then you'll have to roll your own. Things to consider
(some of these have been mentioned before in this thread):

[Assuming the list is of people with first name, second name and age]

1) Do you only need to be able to locate items based on /all/ their attributes
at once ? Or do you need to be able to find all people with a given second
name, but with any first name.

2) Do you need to be able search on any of the "columns" ? Or can you assume
that only some of them (or even only one of them) is significant for searching
?

3) Do you need to find ranges ? E.g. all people more than 64 years old.

4) Is there any skew in the answers to these questions ? E.g. if you often
need to search by last name, but will only occasionally need to search by age.

5) Will there be duplicates in the list ? E.g. do you need to search for all
people with second name = "Smith". If so then will there be duplicates in the
list even if you are searching on all the (searchable) attributes at once ?
(E.g more than one person named "John Smith" with age = 45).

[There are undoubtedly more things to consider, but the above is a start].

For some combinations of answers, the implementation can be very simple (e.g.
if you always search on all of first name, last name and age, and there are no
duplicates, then a hash table is /ideal/). Other combinations could be very
messy, and you'd be much better off punting and leaving it to a database.

-- chris

Rob Baxter · Aug 23, 2004

Hi guys, thanks for all the replies. To clarify a few points that were
raised..

The objects in the collection have 3 attributes, one is a an integer
and two are strings. The int value and one of the strings combine to
form a unique key which can be used to identify the object. I will be
recieving many different int/string pairs representing these two
attributes. What I need to do is use this key to retrieve the
corresponding object from the collection so I can get the value of the
2nd string. There will be no range queries and the "get" method should
never return more than one object.

Based on the earlier comments, it sounds like a HashMap is the way to
go. In that case, if I am careful to not perform any lookups until I
am sure the map is initialized, can I get away with not synchronizing
access to the map (since it will not change)?

Michael Borgwardt · Aug 23, 2004

Rob said:
The objects in the collection have 3 attributes, one is a an integer
and two are strings. The int value and one of the strings combine to
form a unique key which can be used to identify the object. I will be
recieving many different int/string pairs representing these two
attributes. What I need to do is use this key to retrieve the
corresponding object from the collection so I can get the value of the
2nd string. There will be no range queries and the "get" method should
never return more than one object.

Based on the earlier comments, it sounds like a HashMap is the way to
go.

You'll need to put the two parts of the key in a separate class and
overwrite that class's equals() and hashCode() methods. Doing this
both correctly and efficiently can be tricky, and (as has been
discussed here hundreds of times before) if done incorrectly, the
Map will behave very wrongly.

In that case, if I am careful to not perform any lookups until I
am sure the map is initialized, can I get away with not synchronizing
access to the map (since it will not change)?

Yes, that should be OK.

John C. Bollinger · Aug 23, 2004

Rob said:
Based on the earlier comments, it sounds like a HashMap is the way to
go. In that case, if I am careful to not perform any lookups until I
am sure the map is initialized, can I get away with not synchronizing
access to the map (since it will not change)?

Yes. It is not necessary to synchronize a HashMap during a read if you
can be confident that it cannot be modified during the process.

John Bollinger
(e-mail address removed)

Frank · Aug 24, 2004

Rob said:
Based on the earlier comments, it sounds like a HashMap is the way to
go. In that case, if I am careful to not perform any lookups until I
am sure the map is initialized, can I get away with not synchronizing
access to the map (since it will not change)?

Synchronization is really a non-issue unless you're using multiple
threads, and even so I'd be using HashMap, if nothing else than for
consistensy with the other map classes.

Normally your locks would be coarser than on just a single container,
and in the few cases where synching on the map made sense, I'd decorate
it using Collections.synchronizedMap(). Same argument goes for ArrayList
vs. Vector, btw.

Chris Uppal · Aug 24, 2004

Rob said:
Based on the earlier comments, it sounds like a HashMap is the way to
go. In that case, if I am careful to not perform any lookups until I
am sure the map is initialized, can I get away with not synchronizing
access to the map (since it will not change)?

Assuming that you create and populate the hash table in one thread during
startup, and /then/ create the threads that will share access to it, then that
is fine[*].

If the other threads are already running when you populate it (doing their own
startup and initialisation perhaps, or because you were using lazy
initialisation) then you must ensure that each thread (including the one that
created it) has passed through a synchronised section before they make use of
it. The simplest way to do that is to "synchronized" the method they use to
get the reference to the hash table. Once they have the reference, they can
use it without further synchronisation.

[*] Technically, I believe it is possible to create an implementation of
HashMap for which this is not the case; it is not clear to me that the HashMap
contract forbids it. However the Sun implementation is safe and I doubt if
anyone would produce an implementation that differed in such an essential way
from the reference.]

-- chris

bugbear · Aug 24, 2004

Michael said:
You'll need to put the two parts of the key in a separate class and
overwrite that class's equals() and hashCode() methods. Doing this
both correctly and efficiently can be tricky, and (as has been
discussed here hundreds of times before) if done incorrectly, the
Map will behave very wrongly.

Or you could use the Apache collections MultiKeyMap class
or MultiKey class.

They're very nice; you "separate" class may be better in this
instance, but I though I'd mention an alternative.

BugBear

table lookups	1	May 20, 2005
What Collection to use?	9	Jan 29, 2007
which collection to use	14	Dec 22, 2006
calling System.gc to suggest garbage collection	6	Nov 26, 2009
How to sort a CSV file with merge sort JAVA	7	May 6, 2021
Disk Backed Collection/DB for Extremely Large Datasets - Best Option	5	Aug 10, 2005
How to iterate a collection fast and memory-efficient?	21	Jan 22, 2005
Pythonification of the asterisk-based collection packing/unpacking syntax	121	Dec 17, 2011

best collection to use for fast lookups

Rob Baxter

Paul Lutus

Nick Pomfret

Frank

David Segall

Chris Uppal

Rob Baxter

Michael Borgwardt

John C. Bollinger

Frank

Chris Uppal

bugbear

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads