Hash or Tree

W

Wojtek

I have a HashMap which contains about 2K items. The key is a string,
and the value is a string. The HashMap is written to once upon
application start, then read many, many times. It holds I18N info, and
is used for every Web page.

Would it be faster (more efficient) to use a TreeMap? Or some other
type of collection?
 
J

Jack

Wojtek a écrit :
I have a HashMap which contains about 2K items. The key is a string, and
the value is a string. The HashMap is written to once upon application
start, then read many, many times. It holds I18N info, and is used for
every Web page.

I you want to make it faster, you can precompute the hashCode of all the
keys, in subclass of String:

public int hashCode() {
if (hashDone)
return hash;
else
return super.hashCode();
}
 
S

Stefan Ram

Eric Sosman said:
measure them, and eliminate guesswork.

Yes, but this will give an answer for a specific
implementation of Java under a specific processor and
operating system with specific JVM options used only.

Opposite results might be obtained under other
implementations, environments, and options.

If one is distributing his program as source code (as in »free
software« or in a tutorial or so), and even under other
circumstances, he can not know nor control the Java
implementation or JVM used to run it.

Also, the results of future implementations might differ.

So, one also might consider to use the data structure that
seems to be reasonable given the knowledge one has about
the algorithms employed.

A measurement is worth the effort, if run time turns out to be
an issue and if the target enviroment can be controlled or is
known.
 
R

Roedy Green

Would it be faster (more efficient) to use a TreeMap? Or some other
type of collection?

HashMaps are when you need to look up by precise key. TreeMaps are
for when you need to look up by approximate key, or to find the next
key.

HashMaps are simpler and faster.
 
S

Stefan Ram

Yes, but this will give an answer for a specific
implementation of Java under a specific processor and
operating system with specific JVM options used only.

Not to forget: ... and for specific test data.

The performance of a sorted data structure might
vary depending on the test data used. For example,
test data read might already be nearly sorted, or
unsorted or nearly inversely sorted.

To choose relevant test data, one needs to have an
idea about the statistics of the data in the
relevant application cases.
 
M

Mark Space

Wojtek said:
I have a HashMap which contains about 2K items. The key is a string, and
the value is a string. The HashMap is written to once upon application
start, then read many, many times. It holds I18N info, and is used for
every Web page.

Would it be faster (more efficient) to use a TreeMap? Or some other type
of collection?

If you have a working program with data, why not just test it and see?
I'd time it externally (CPU time from the command line) as well as run a
shorter test case in a profiler.
 
N

Neil Coffey

Wojtek said:
I have a HashMap which contains about 2K items. The key is a string, and
the value is a string. The HashMap is written to once upon application
start, then read many, many times. It holds I18N info, and is used for
every Web page.

Would it be faster (more efficient) to use a TreeMap? Or some other type
of collection?

On paper, it will be less efficient.

HashMap is designed to give you "constant" access time in the
average case, whatever the number of elements in the map. That is,
on average, a seek will involve calculating the hash code of the key
being sought, going to the corresponding bucket in the hash table
and performing exactly one comparison to confirm that the item in
that bucket does indeed correspond to the key being sought. (Occasionaly
or if you have an unlucky key distribution, there will be more items
in the bucket and more comparisons will be made.)

A TreeMap stores its data in a binary tree structure, and so after
computing the hash code of the key being sought, a seek will on average
have to "go down an extra level in the tree" (=make an extra
comparison) for each doubling of the number of elements in the map.

The trade-off for this is that the TreeMap offers extra functionality:
the keys are maintained in sorted order, so that it is efficient to
enumerate a particular range of keys in order.

In general, go for the minimal implementation that offers the
functionality you need. Unless you need to enumerate ranges of keys
in order, stick to HashMap.

Neil
 
L

Lasse Reichstein Nielsen

Wojtek said:
I have a HashMap which contains about 2K items. The key is a string,
and the value is a string. The HashMap is written to once upon
application start, then read many, many times. It holds I18N info, and
is used for every Web page.

Would it be faster (more efficient) to use a TreeMap? Or some other
type of collection?

I doubt TreeMap is faster than HashMap. To traverse the tree, you need
to compare to each string key on the path. With 2K nodes that's ~19
expected comparisons, and as you go down the tree, you are more and
more likely to have common prefixes of the strings, making comparison
take longer. The final matching comparison, if the key exists, will
require traversing the entire string, just as an equality test.

HashMap on the other hand needs only compute the hash code of the
query key once, and then do very few equality comparisons (which start
by comparing hash codes). With 2K elements, the default HashMap would
have a table size of 4096, so there will be likely be very few collisions.
I.e., an expected number of equality comparisons slightly over one.

Or, in other words, lookup is logarithmic in a TreeSet but constant
time in a HashSet (given properly distributed hash codes and table
size).

The advantage of TreeSet over HashSet is not in speed, but in the
invariant that the elements are sorted. It allows you to iterate in
sorted order, or find smallest or largest elements more efficiently.
If you don't need that, use HashSet for efficiency.

The String.hashCode() method and HashSet defaults are pretty good for
most uses.

/L
 
M

Mike Schilling

Eric said:
Question: This "subclass of String" that you mention: How
do you extend a `final' class?

A quick patch to rt.jar should do the trick ...
 
H

Hendrik Maryns

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jack schreef:
| Wojtek a écrit :
|> I have a HashMap which contains about 2K items. The key is a string,
|> and the value is a string. The HashMap is written to once upon
|> application start, then read many, many times. It holds I18N info, and
|> is used for every Web page.
|
| I you want to make it faster, you can precompute the hashCode of all the
| keys, in subclass of String:
|
| public int hashCode() {
| if (hashDone)
| return hash;
| else
| return super.hashCode();
| }

No need for that, String does it already (except if the hash code is 0).

H.
- --
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4-svn0 (GNU/Linux)
Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org

iD8DBQFIUOhAe+7xMGD3itQRAnGTAJ0ZfpNn3C85JehI6qrkUJ8pPc6CpQCdFfqF
n95uFbBUPG1jre71TSnjwTA=
=mzVo
-----END PGP SIGNATURE-----
 
P

Philipp

Wojtek said:
I have a HashMap which contains about 2K items. The key is a string, and
the value is a string. The HashMap is written to once upon application
start, then read many, many times. It holds I18N info, and is used for
every Web page.

Would it be faster (more efficient) to use a TreeMap? Or some other type
of collection?

As you seem to known the number of objects you want to store in advance,
may I draw your attention on this line of the HashMap doc:

"If many mappings are to be stored in a HashMap instance, creating it
with a sufficiently large capacity will allow the mappings to be stored
more efficiently than letting it perform automatic rehashing as needed
to grow the table."


Also, if you want to optimize further, and you know you have a well
distributed hashCode (which is the case for the String class), you may
copy the code of HashMap and remove the call to the hash(int h) method
which serves only to prevent very bad hashes to alter the HashMap's
performance prohibitively.

Although, as others will say, don't optimize early... Is access speed
really a bottleneck?

HTH
Phil
 
W

Wojtek

Thank-you for all for responding

Philipp wrote :
As you seem to known the number of objects you want to store in advance, may
I draw your attention on this line of the HashMap doc:

"If many mappings are to be stored in a HashMap instance, creating it with a
sufficiently large capacity will allow the mappings to be stored more
efficiently than letting it perform automatic rehashing as needed to grow the
table."

Ah yes, I seem to have forgetten about that :)
Also, if you want to optimize further, and you know you have a well
distributed hashCode (which is the case for the String class), you may copy
the code of HashMap and remove the call to the hash(int h) method which
serves only to prevent very bad hashes to alter the HashMap's performance
prohibitively.

If I do not use the call, how could its removal speed thihgs up? I only
fill the HashMap up once, when the app starts. It NEVER gets any more
values put in.
Although, as others will say, don't optimize early... Is access speed really
a bottleneck?

It is being used to hold key/language pairs for making the appliction
internationalization (I18N) aware. So EVERY page hit uses the HashMap
mutilple times:

title
msg to the user about what the page is about
prompt -- help
prompt -- help
propmt -- help
button, button, button
GMT date link to top of page
lcl date copyright

So this small page has 15 language elements, so 15 calls to the
HashMap. The HasMap is the single most used part of the entire app.

I create the menu on app startup and cache it for each language as that
would add 70 or so language elements if it was not cached.

So yes, I have spent a little time trying to make it faster.
 
H

Hendrik Maryns

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Wojtek schreef:
| Thank-you for all for responding
|
| Philipp wrote :
|> Wojtek wrote:
|>> I have a HashMap which contains about 2K items. The key is a string,
|>> and the value is a string. The HashMap is written to once upon
|>> application start, then read many, many times. It holds I18N info,
|>> and is used for every Web page.
|>>
|>> Would it be faster (more efficient) to use a TreeMap? Or some other
|>> type of collection?
|>>
|>
|> As you seem to known the number of objects you want to store in
|> advance, may I draw your attention on this line of the HashMap doc:
|>
|> "If many mappings are to be stored in a HashMap instance, creating it
|> with a sufficiently large capacity will allow the mappings to be
|> stored more efficiently than letting it perform automatic rehashing as
|> needed to grow the table."
|
| Ah yes, I seem to have forgetten about that :)
|
|> Also, if you want to optimize further, and you know you have a well
|> distributed hashCode (which is the case for the String class), you may
|> copy the code of HashMap and remove the call to the hash(int h)
|> method which serves only to prevent very bad hashes to alter the
|> HashMap's performance prohibitively.
|
| If I do not use the call, how could its removal speed thihgs up? I only
| fill the HashMap up once, when the app starts. It NEVER gets any more
| values put in.
|
|> Although, as others will say, don't optimize early... Is access speed
|> really a bottleneck?
|
| It is being used to hold key/language pairs for making the appliction
| internationalization (I18N) aware. So EVERY page hit uses the HashMap
| mutilple times:

You do know this is provided for by Java, right? Of course you do.

http://java.sun.com/developer/technicalArticles/Intl/ResourceBundles/
http://java.sun.com/javase/6/docs/api/java/util/ResourceBundle.html

and have a look at Eclipse’s Source → Externalize strings…

H.
- --
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4-svn0 (GNU/Linux)
Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org

iD8DBQFIUS81e+7xMGD3itQRAoTMAJ49ZvnqKBVj4EtkZb0C30O9my4vCQCeNxwk
uKPhXJs34B6SzjenDtKu9oo=
=f5cl
-----END PGP SIGNATURE-----
 
P

Philipp

Wojtek said:
Thank-you for all for responding

Philipp wrote :

If I do not use the call, how could its removal speed thihgs up? I only
fill the HashMap up once, when the app starts. It NEVER gets any more
values put in.

The internal code of HashMap for get(Object key) (which is the method
you call very often) looks like this:

public V get(Object key) {
if (key == null)
return getForNullKey();
int hash = hash(key.hashCode());
for (Entry<K,V> e = table[indexFor(hash, table.length)];
e != null;
e = e.next) {
Object k;
if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
return e.value;
}
return null;
}



Note in particular that the hashcode of the key is hashed once again by
the hash(int) method. If you have a well distributed hashCode, you can
forget about this step and implement a get which uses the hashCode of
the String directly.

I don't know if this will speed things up significantly, although if you
happen to measure it, please post your results.

Phil
 
W

Wojtek

Hendrik Maryns wrote :

Well yes. But they also use a HashMap.
and have a look at Eclipse’s Source → Externalize strings…
Never neede to. The app was designed with I18N in mind, so no need for
an external tool to manage this.

Besides, the key are designed to hold information so a button key would
look like:
cmn.button.cancel

where as the prompt for login:
person.prompt.login
person.promt.loginex

the ex stands for explanation

So a tool like Eclipse would only produce a mess (from our standpoint).
 
H

Hendrik Maryns

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Wojtek schreef:
| Hendrik Maryns wrote :
|>
|> You do know this is provided for by Java, right? Of course you do.
|>
|> http://java.sun.com/developer/technicalArticles/Intl/ResourceBundles/
|> http://java.sun.com/javase/6/docs/api/java/util/ResourceBundle.html
|
| Well yes. But they also use a HashMap.
|
|> and have a look at Eclipse’s Source → Externalize strings…
|>
| Never neede to. The app was designed with I18N in mind, so no need for
| an external tool to manage this.

Um, did you even have a look at it? It is a wizard which you run once,
which sets up the needed ResourceBundles and some code around it.
That’s all. You fully control the names of the keys.

Hey, it’s you who’s reinventing the wheel, not me.

H.
- --
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4-svn0 (GNU/Linux)
Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org

iD8DBQFIUTzre+7xMGD3itQRAot+AJ9lDcpQVjqHcwik65rWth42ohbZKwCggV6I
j5dlEZtZMXCir+1sN+WOr6M=
=iZvR
-----END PGP SIGNATURE-----
 
W

Wojtek

Philipp wrote :
Wojtek said:
Thank-you for all for responding

Philipp wrote :

If I do not use the call, how could its removal speed thihgs up? I only
fill the HashMap up once, when the app starts. It NEVER gets any more
values put in.

The internal code of HashMap for get(Object key) (which is the method you
call very often) looks like this:

public V get(Object key) {
if (key == null)
return getForNullKey();
int hash = hash(key.hashCode());
for (Entry<K,V> e = table[indexFor(hash, table.length)];
e != null;
e = e.next) {
Object k;
if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
return e.value;
}
return null;
}



Note in particular that the hashcode of the key is hashed once again by the
hash(int) method. If you have a well distributed hashCode, you can forget
about this step and implement a get which uses the hashCode of the String
directly.

Hmmm, store the key hashcode, then use it directly. This bears thinking
on.

We are already using an object (LangKey) to store the key, so storing
the hashcode for each key is trivial. But HashMap does not expose any
method which can get a value from a passed hash code.

BUT, I could place each key during the loading process into an
ArrayList, then when done create an array from it, then run through the
array and set the index value into the LangKey. Then use that index
value to directly access the array to get the value.
I don't know if this will speed things up significantly, although if you
happen to measure it, please post your results.

I will need to use a VERY fine-grained timer run over the same page
sequence. System.nanoTime() here we come...
 
R

Roger Lindsjö

Wojtek said:
It is being used to hold key/language pairs for making the appliction
internationalization (I18N) aware. So EVERY page hit uses the HashMap
mutilple times:

title
msg to the user about what the page is about
prompt -- help
prompt -- help
propmt -- help
button, button, button
GMT date link to top of page
lcl date copyright

So this small page has 15 language elements, so 15 calls to the HashMap.
The HasMap is the single most used part of the entire app.

I create the menu on app startup and cache it for each language as that
would add 70 or so language elements if it was not cached.

So yes, I have spent a little time trying to make it faster.

But is it actually a bottleneck? I wrote the small program below which
creates a map with 5000 entries, which it then iterates over, and adding
the string lengths (to make sure not too much is optimized away by
hotspot). I start one thread per cpy, and on my Core 2 Quad 2.4 GHz I
get about 40 million lookups (including the &, the sum and length()) per
core.

<sscce>
import java.util.HashMap;
import java.util.Map;

public class HashMapSpeed implements Runnable {

private static final int RUNS = 50;
private static final int LOOPS = 10000000;
private Map<String,String> map;
private String[] keys;

public static void main(String...arg) {
HashMapSpeed hms = new HashMapSpeed();
for (int i = 0; i < Runtime.getRuntime().availableProcessors();
i ++) {
new Thread(hms).start();
}
}

public void run() {
for(int i = 0; i < RUNS; i ++) {
long start = System.nanoTime();
int sum = doIt(LOOPS);
long delta = System.nanoTime() - start;
System.err.format("Loops/s %.0f dummy %s%n", LOOPS *
1000000000d / delta, sum);
}
}

public HashMapSpeed() {
keys = new String[5000];
map = new HashMap<String,String>();
for (int i = 0; i < keys.length; i ++) {
keys = "key-"+i;
map.put(keys, "Some fairly Long String " + i);
}
}

public int doIt(int times) {
int sum = 0;
for (int i = 0; i < times; i++) {
sum += map.get(keys[i % 0xfff]).length();
}
return sum;
}
}
</sscce>
 
N

Neil Coffey

Wojtek said:
So this small page has 15 language elements, so 15 calls to the HashMap.
The HasMap is the single most used part of the entire app. [...]
So yes, I have spent a little time trying to make it faster.

Silly question, but have you profiled it and found it to be "slow"? Or,
put another way, take up significant proportion of the overall CPU time
used to process a page request from start to finish?-- and is your
server actually strapped for CPU time? For the sake of argument, let's
say rendering a page takes 50 lookups and you get 20 page requests
per second, that's still "only" 1000 lookups per second. Unless you're
running your web server on a ZX Spectrum (and if you're getting a
serious volume of requests, you're probably using something reasonably
beefy in the first place...), this is probably a blip on the
horizon. And if in the meantime you're doing some I/O, database requests
etc, you may as well give your CPUs something to do while they're
waiting...

That said... there may be a possible improvement if the strings you're
currently looking up are *dynamic*. That is if you're doing something
like:

message = messages.get(languageCode + "menu");

In this case, there may be a slight performance gain in changing things
so that you always look up *static* strings. So you start with a map-
of-maps:

Map<String,Map<String,String>> messagesByCountry =
new HashMap<String,new IdentityHashMap<String,String>>;

Then when serving a page, you do one initial lookup to get the map for
your language, and after that, your keys are static strings:

Map<String,String> messages = messagesByCountry.get(langCode);
...
String menuStr = messages.get("menu");

If your strings are all static, you can use an IdentityHashMap (not
actually sure if this will give much gain since your static strings will
have cached hash codes anyway, and IIRC String.equals() tests for
object equality as the first test). If the keys that you're using to
*populate* the map at startup aren't static (i.e. not "embedded" in
the source code but read from a file etc), in any case
remember to call intern() on key string and put the interned version
into the map -- this way you will be literally putting in the same
String object as the one you later use to look up, and the equality
test will be fast.

I think it would be a horrid design for not much gain, but you could
also have a map of language code to array-of-strings, and have each
message type have a known position in the array. Probably not worth
doing unless you really *are* running your server on a ZX Spectrum,
though.

I repeat, though: if you haven't profiled your code and found the hash
map gets to actually be significant, you may risk putting a lot of work
in changing your code for no tangible gain.

Neil
 
W

Wojtek

Hendrik Maryns wrote :
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Wojtek schreef:
| Hendrik Maryns wrote :
|>
|> You do know this is provided for by Java, right? Of course you do.
|>
|> http://java.sun.com/developer/technicalArticles/Intl/ResourceBundles/
|> http://java.sun.com/javase/6/docs/api/java/util/ResourceBundle.html
|
| Well yes. But they also use a HashMap.
|
|> and have a look at Eclipse’s Source → Externalize strings…
|>
| Never neede to. The app was designed with I18N in mind, so no need for
| an external tool to manage this.

Um, did you even have a look at it? It is a wizard which you run once,
which sets up the needed ResourceBundles and some code around it.
That’s all. You fully control the names of the keys.

We found that people tended to mis-spell the keys, which meant the
wrong text, or no text was being displayed.

So we created a class named LangKey which holds the key and the text
pattern. The text format method only accepts a LangKey plus an Object
array for parameters.

It is not possible for keying mistakes with this method as the compiler
catches them.
Hey, it’s you who’s reinventing the wheel, not me.

Yup.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,780
Messages
2,569,611
Members
45,276
Latest member
Sawatmakal

Latest Threads

Top