Collections API--counterintuitive behavior

S

Sierra Bravo

Dear all
I'm adding 80368 words (sorted alphabetically) into different data
structures and then serializing the entire container into a file. The
times that I got are shown below (in millseconds), the first column
being the time taken to populate the container (by reading each word
from the wordlist and adding) and the second, for serializing the
entire object.

The original project was to figure out if restoring from a serialized
container was faster than parsing words into a freshly created
container.

The results are counterintuitive for me--the synchronized (and
supposedly-obsolete) vector beats the non-synchronized hashset by
almost a factor of two for both operations! What gives?

Tree 884 2430
Linkedhash 703 2387
Hash 660 2355
Vector 392 987
Array 381 913

sb
-------------

import java.io.*;
import java.util.*;

public class DictCreate {
String word;
Set<String> set;
long t;
boolean flag = true;

public void create() throws Exception {
int count=0;

t = System.currentTimeMillis();
show("Opening word list");
set = new HashSet<String>();

BufferedReader reader = new BufferedReader(new
FileReader("dict.txt"));
ObjectOutputStream out = new ObjectOutputStream(
new FileOutputStream("words.ser"));

while (true) {
word = reader.readLine();
if (word == null) break;
set.add(word);
count++;
}

show(""+count+" words added. Beginning serialization...");
out.writeObject(set);
show("List serialized");
out.close();
reader.close();
}

public void show(String s) {
System.err.println(s+"["+(System.currentTimeMillis()-t)+"]");
}

public static void main(String s[]) throws Exception {
DictCreate d = new DictCreate();
d.create();
}
}
 
M

Mike Schilling

Sierra Bravo said:
Dear all
I'm adding 80368 words (sorted alphabetically) into different data
structures and then serializing the entire container into a file. The
times that I got are shown below (in millseconds), the first column
being the time taken to populate the container (by reading each word
from the wordlist and adding) and the second, for serializing the
entire object.

The original project was to figure out if restoring from a serialized
container was faster than parsing words into a freshly created
container.

The results are counterintuitive for me--the synchronized (and
supposedly-obsolete) vector beats the non-synchronized hashset by
almost a factor of two for both operations! What gives?

Tree 884 2430
Linkedhash 703 2387
Hash 660 2355
Vector 392 987
Array 381 913

Vectors and HashSets are quite different things. A Vector is. in effect, a
synchronized ArrayList, and I expect you'll find the Vector somewhat slower
then ArrayList due to synchronization overhead.
 
R

Roedy Green

The results are counterintuitive for me--the synchronized (and
supposedly-obsolete) vector beats the non-synchronized hashset by
almost a factor of two for both operations! What gives?

Tree 884 2430
Linkedhash 703 2387
Hash 660 2355
Vector 392 987
Array 381 913

When you read a HashSet back it recomputes the hashes, and rebuilds
the lookup structure. All that get written out are the raw objects,
the capacity and load factor. This takes longer than vector's just
reading an array of Objects.

here is the code for HashSet.readObject

/**
* Reconstitute the <tt>HashSet</tt> instance from a stream (that
is,
* deserialize it).
*/
private void readObject(java.io_ObjectInputStream s)
throws java.io.IOException, ClassNotFoundException {
// Read in any hidden serialization magic
s.defaultReadObject();

// Read in HashMap capacity and load factor and create backing
HashMap
int capacity = s.readInt();
float loadFactor = s.readFloat();
map = (((HashSet)this) instanceof LinkedHashSet ?
new LinkedHashMap<E,Object>(capacity, loadFactor) :
new HashMap<E,Object>(capacity, loadFactor));

// Read in size
int size = s.readInt();

// Read in all elements in the proper order.
for (int i=0; i<size; i++) {
E e = (E) s.readObject();
map.put(e, PRESENT);
}
}



--
Bush crime family lost/embezzled $3 trillion from Pentagon.
Complicit Bush-friendly media keeps mum. Rumsfeld confesses on video.
http://www.infowars.com/articles/us/mckinney_grills_rumsfeld.htm

Canadian Mind Products, Roedy Green.
See http://mindprod.com/iraq.html photos of Bush's war crimes
 
S

Sierra Bravo

For completeness, I'm adding the results for ArrayList...(LinkedList is
too slow, and therefore I haven't considered it)

Tree 884 2430
Linkedhash 703 2387
Hash 660 2355
Vector 392 987
Array 381 913
ArrayList 612 2506

In general, List should be faster than Set since it does not have the
referential integrity constraint that Set has (of no duplicates). But
ArrayList turns out to be the slowest for serialization...!

But how come the add operation is way faster for Vector despite it
being synchronized? Is it implemented in some kind of native code?

sb.
 
M

Mike Schilling

Sierra Bravo said:
For completeness, I'm adding the results for ArrayList...(LinkedList is
too slow, and therefore I haven't considered it)

Tree 884 2430
Linkedhash 703 2387
Hash 660 2355
Vector 392 987
Array 381 913
ArrayList 612 2506

In general, List should be faster than Set since it does not have the
referential integrity constraint that Set has (of no duplicates). But
ArrayList turns out to be the slowest for serialization...!

But how come the add operation is way faster for Vector despite it
being synchronized? Is it implemented in some kind of native code?

You need to double check your code (or post it and let the rest of us help
you with that.) The numbers for ArrayList make no sense.
 
T

Thomas Hawtin

Mike said:
You need to double check your code (or post it and let the rest of us help
you with that.) The numbers for ArrayList make no sense.

Particularly as the readObject implementation for Vector is

s.defaultWriteObject();


Some rules for microbenchmarking:

o Remember results may not reflect operation of a real application.
o Take several measurements from each run. Some may be dominated by
HotSpot (or similar) recompiling.
o Only measure the code you want to measure. Typically don't include
any IO.
o Don't put everything into one method. HotSpot (and similar) are
designed to compile real code, so they may well perform badly on Mickey
Mouse code.

Tom Hawtin
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,482
Members
44,901
Latest member
Noble71S45

Latest Threads

Top