Performance of hash_set vs. Java


Alex Gerdemann

Hello,

I have spent a bunch of time converting a Java program I wrote to C++ in
order to improve performance, and have found that it is not necessarily
faster.

Specifically, I'm writing a circuit simulator, so I must first parse the
netlist file. The program goes through an input file, and makes a hash_set
of unique nodes (std::string's). The words are then sorted and numbered by
copying the hash_set into a vector and sorting. Then, I run a series of
binary searches on the vector to allocate lines in a matrix for each
element.

I've modified STL's lower_bound() to my getIndex() which functions like
Java's equivalent by returning negative indices when the element is not
found.

Also, Java provides a HashSet class that can hash any kind of element it may
contain. Since the STL only hashes selected items, I had to convert my
std::strings to const char*'s for hashing.

Other than that, my code in the Java and C++ is quite similar. However, the
parse of a 100k line file took around a second for Java, but more like 17s
for C++ . In fact, using std::set instead of hash_set actually improved
performance to about 15s in C++.

Any idea why there would be such a difference? In particular, I suspect
that my hash function may be slow from the call to c_str(). Does this have
to allocate new memory or just point to somewhere inside the object? I'm
kind of curious what Java actually uses for its hash. Any general
performance/other criticisms on my scheme are certainly welcome.

On the plus side there is a great direct sparse linear solver for C. I've
only been able to find an iterative sparse solver for Java. This makes the
C++ code overall much faster than Java, but I suspect that the parse stage
should run at least as fast or faster.

Thanks,

Alex Gerdemann
University of Illinois at Urbana-Champaign

For reference, I've included a sketch of the code

So I did something like the following:

struct hashString {
    __gnu_cxx::hash<const char*> h;

    size_t operator()(const std::string& s) const {
        return h(s.c_str());
    }
};

inline int getIndex(const std::vector<std::string>& list,
                    const std::string& item) {
    std::vector<std::string>::const_iterator i =
        std::lower_bound(list.begin(), list.end(), item);
    if (i == list.end() || *i != item) {
        return -(static_cast<int>(i - list.begin()) + 1);
    }
    return static_cast<int>(i - list.begin());
}

__gnu_cxx::hash_set<std::string, hashString> nodeSet;
std::vector<std::string> nodeVector;

while(!done) {
    //read a line of file
    //split the line into words and push into a vector
    //check for syntax errors
    nextNode = some word in line;
    nodeSet.insert(nextNode);
}
std::copy(nodeSet.begin(), nodeSet.end(), std::back_inserter(nodeVector));
std::sort(nodeVector.begin(), nodeVector.end());
while(!done) {
    //read line from previously stored words
    //more error checking
    thisNode = some word in line;
    getIndex(nodeVector, thisNode);
    //allocate an object representing the element, containing the proper
    //indices representing its nodes
    //push pointer to element into a vector
}
 

Nils O. Selåsdal

Alex said:
Hello,
struct hashString {
    __gnu_cxx::hash<const char*> h;

    size_t operator()(const std::string& s) const {
        return h(s.c_str());
    }
};

inline int getIndex(const std::vector<std::string>& list,
                    const std::string& item) {
    std::vector<std::string>::const_iterator i =
        std::lower_bound(list.begin(), list.end(), item);
    if (i == list.end() || *i != item) {
        return -(static_cast<int>(i - list.begin()) + 1);
    }
    return static_cast<int>(i - list.begin());
}

__gnu_cxx::hash_set<std::string, hashString> nodeSet;
std::vector<std::string> nodeVector;
When you use std::string, you will have *a lot* of string copying
going on. Java will just pass its "reference"/"pointer" along.
Try converting your lists, etc. to hold std::string* (pointers) rather than copies.
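A minimal sketch of that idea (the names below are mine, and note the usual caveat: pointers into a vector are invalidated if the vector reallocates, so take the addresses only after the storage has stopped growing):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Orders string pointers by the strings they point to.
struct DerefLess {
    bool operator()(const std::string* a, const std::string* b) const {
        return *a < *b;
    }
};

int main() {
    // The strings live in one place...
    std::vector<std::string> storage;
    storage.push_back("n3");
    storage.push_back("n1");
    storage.push_back("n2");

    // ...and only cheap pointers get shuffled around by the sort.
    std::vector<const std::string*> byName;
    for (std::size_t i = 0; i < storage.size(); ++i)
        byName.push_back(&storage[i]);

    std::sort(byName.begin(), byName.end(), DerefLess());
    return 0;
}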
 

Tom Widmer

Hello,

I have spent a bunch of time converting a Java program I wrote to C++ in
order to improve performance, and have found that it is not necessarily
faster.

No, it probably won't be unless you use a better class than
std::string to hold the strings. java.lang.String is in several ways
more efficient than typical implementations of std::string.
Specifically, I'm writing a circuit simulator, so I must first parse the
netlist file. The program goes through an input file, and makes a hash_set
of unique nodes (std::string's). The words are then sorted and numbered by
copying the hash_set into a vector and sorting. Then, I run a series of
binary searches on the vector to allocate lines in a matrix for each
element.

I've modified STL's lower_bound() to my getIndex() which functions like
Java's equivalent by returning negative indices when the element is not
found.

Also, Java provides a HashSet class that can hash any kind of element it may
contain. Since the STL only hashes selected items, I had to convert my
std::strings to const char*'s for hashing.

The main benefit Java has in hashing is that Strings cache their
hashcodes, so they need only be calculated once. std::string knows
nothing about hashing, and therefore doesn't perform this
optimization.
Other than that, my code in the Java and C++ is quite similar. However, the
parse of a 100k line file took around a second for Java, but more like 17s
for C++ . In fact, using std::set instead of hash_set actually improved
performance to about 15s in C++.

How long did it take just to read in the file? I don't know much about
GCC's hash_set, but I assume it is configurable for bucket size, etc.
You might want to tweak this. What version of GCC are you using?
Any idea why there would be such a difference? In particular, I suspect
that my hash function may be slow from the call to c_str(). Does this have
to allocate new memory or just point to somewhere inside the object?

Sometimes it has to do the former, due to reference counting problems,
but usually it just returns a pointer to the storage. Tracing into it with
the debugger would tell you what's going on in the hash calls.

I'm
kind of curious what Java actually uses for its hash. Any general
performance/other criticisms on my scheme are certainly welcome.

Java just iterates over the whole String, generating the hash value,
and then caches it for future calls.
On the plus side there is a great direct sparse linear solver for C. I've
only been able to find an iterative sparse solver for Java. This makes the
C++ code overall much faster than Java, but I suspect that the parse stage
should run at least as fast or faster.

Thanks,

Alex Gerdemann
University of Illinois at Urbana-Champaign

For reference, I've included a sketch of the code

So I did something like the following:

struct hashString {
    __gnu_cxx::hash<const char*> h;

    size_t operator()(const std::string& s) const {
        return h(s.c_str());
    }

I doubt that is your bottleneck, but if the strings are long, then
Java's hashcode caching might be giving it a major advantage.
};

inline int getIndex(const std::vector<std::string>& list,
                    const std::string& item) {
    std::vector<std::string>::const_iterator i =
        std::lower_bound(list.begin(), list.end(), item);
    if (i == list.end() || *i != item) {
        return -(static_cast<int>(i - list.begin()) + 1);
    }
    return static_cast<int>(i - list.begin());
}

__gnu_cxx::hash_set<std::string, hashString> nodeSet;
std::vector<std::string> nodeVector;

while(!done) {
    //read a line of file
    //split the line into words and push into a vector
    //check for syntax errors
    nextNode = some word in line;
    nodeSet.insert(nextNode);

The above code may be where your main bottleneck is. How do you read
the line and then split it into words? If you are creating a number of
temporary strings, that will be slowing you down. You might consider
using a fixed vector or two and reusing them, to avoid temporaries.

Here you definitely want:
nodeVector.reserve(nodeSet.size());
std::copy(nodeSet.begin(), nodeSet.end(), std::back_inserter(nodeVector));
std::sort(nodeVector.begin(),nodeVector.end());

Both of those operations should be fast if you are using a reference
counted string class. I believe GCC 3+ uses such a beast.
while(!done) {
    //read line from previously stored words
    //more error checking
    thisNode = some word in line;
    getIndex(nodeVector, thisNode);
    //allocate an object representing the element, containing the proper
    //indices representing its nodes
    //push pointer to element into a vector
}

Overall, I think your best bet would be to benchmark smaller bits of
the code or put it through a profiler (gprof, IIRC, though I probably don't
recall the details), to find out where the bottleneck actually lies.
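If the profiler's numbers look suspect, a crude but serviceable alternative is to time each phase by hand with std::clock() (which measures CPU time, not wall-clock time). A rough sketch, with made-up phase comments:

#include <cstdio>
#include <ctime>

int main() {
    std::clock_t t0 = std::clock();
    // ... read the file and build the node set ...
    std::clock_t t1 = std::clock();
    // ... copy to the vector, sort, and run the binary searches ...
    std::clock_t t2 = std::clock();

    std::printf("parse/insert: %.2fs\n", double(t1 - t0) / CLOCKS_PER_SEC);
    std::printf("sort/search:  %.2fs\n", double(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}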

Tom
 

Ivan Vecerina

Alex Gerdemann said:
I have spent a bunch of time converting a Java program I wrote to C++ in
order to improve performance, and have found that it is not necessarily
faster.
This can often be the case. Some native Java features can be more efficient
than the C++ equivalents (which are sometimes designed for more
flexibility).
This is especially true for C++ streams.
Specifically, I'm writing a circuit simulator, so I must first parse the
netlist file. The program goes through an input file, and makes a
hash_set of unique nodes (std::string's). The words are then sorted and
numbered by copying the hash_set into a vector and sorting. Then, I run a
series of binary searches on the vector to allocate lines in a matrix for
each element.
As someone already pointed out, copying strings can be expensive in C++.
An alternative would be to keep the originally read std::string,
then use only a "const char*" pointer obtained using the string::c_str()
member function (I'd do this rather than use a std::string*).
I've modified STL's lower_bound() to my getIndex() which functions like
Java's equivalent by returning negative indices when the element is not
found.
This wrapper is an unnecessary Java-ism, but ok,
it should be irrelevant to performance.
Also, Java provides a HashSet class that can hash any kind of element it
may contain. Since the STL only hashes selected items, I had to convert
my std::strings to const char*'s for hashing.
All C++ hash_set implementations I've seen allow users to provide
a custom hash function as an additional parameter. But this shouldn't
be a big problem.
Other than that, my code in the Java and C++ is quite similar. However,
the parse of a 100k line file took around a second for Java, but more like
17s for C++ . In fact, using std::set instead of hash_set actually
improved performance to about 15s in C++.
Unfortunately, hash_set is not yet in the C++ standard, so the explanation
has to be implementation-specific.
However, you may want to see if you can pre-allocate or predefine the size
of your hash table. Also, as previously pointed out, beware of std::string
instances being copied.
Any idea why there would be such a difference? In particular, I suspect
that my hash function may be slow from the call to c_str(). Does this
have to allocate new memory or just point to somewhere inside the object?
I'm kind of curious what Java actually uses for its hash. Any general
performance/other criticisms on my scheme are certainly welcome.
Are you sure that the hashing is the performance bottleneck?
C++ i/o streams could also be a likely cause for the weak performance.

All in all, I am very confident that better performance can be achieved
in C++ than in Java. But in some cases, especially string and file i/o,
this can require extra work :(

Unfortunately, the code you posted shows very little about the file
reading code and hash table, so we can't really help there.
Regarding the vector and sorting only, I can say that using const char*
instead of std::string is likely to improve execution speed.

Cheers -Ivan
 

Alex Gerdemann

Tom said:
How long did it take just to read in the file? I don't know much about
GCC's hash_set, but I assume it is configurable for bucket size, etc.
You might want to tweak this. What version of GCC are you using?

I'm using Cygwin's special version of GCC 3.3.3. I can't really separate
how much time is spent just reading the file, as I process it line by line,
doing some work between each I/O call.
The above code may be where your main bottleneck is. How do you read
the line and then split it into words? If you are creating a number of
temporary strings, that will be slowing you down. You might consider
using a fixed vector or two and reusing them, to avoid temporaries.

Since you asked here's exactly what I did:

std::vector< std::vector< std::string > > lines;
std::string line;
std::vector<std::string> thisLine;

do {
    std::getline(file, line);
    if (line.length() != 0) {
        split(line, thisLine);
        lines.push_back(thisLine);
        //process the line
    }
} while (!file.eof());

inline void split(const std::string& line, std::vector<std::string>& words)
{
    std::string::size_type firstMark = 0, lastMark = 0;
    words.clear();
    while (lastMark != std::string::npos) {
        firstMark = line.find_first_not_of(" ", lastMark);
        if (firstMark == std::string::npos) break;
        lastMark = line.find_first_of(" ", firstMark);
        words.push_back(line.substr(firstMark, lastMark - firstMark));
    }
}

Here you definitely want:
nodeVector.reserve(nodeSet.size());


Both of those operations should be fast if you are using a reference
counted string class. I believe GCC 3+ uses such a beast.

I added the reserve call, which saved me a couple of tenths of a second.
Nice, but it must not be the major bottleneck. What is a "reference counted"
class?
Overall, I think your best bet would be to benchmark smaller bits of
the code or put it through a profiler (-gprof IIRC, which I probably
don't), to find out where the bottleneck actually lies.

I went ahead and tried this, but can't quite understand the results. The
total time it computes is way off (it says it's 8s vs. the actual 14s).
Also, it thinks that only 30% of run time was spent in the main() function.
This can't possibly be right.

-Alex Gerdemann
University of Illinois Urbana-Champaign
 

Alex Gerdemann

Ivan said:
This wrapper is an unnecessary Java-ism, but ok,
it should be irrelevant to performance.

Well the specific negative value returned is unnecessary, but I do need to
know if the string isn't found. It's kind of annoying that the function
doesn't return this information because to find it myself I have to:

1) check that the returned iterator does not point to set.end()
2) check that the iterator points to what it claims.

Certainly, the search algorithm knows internally if it found what it was
looking for. Having to do that job again will certainly cost at least some
time.

The reason the node may not be in the list has to do with the details of
circuit analysis. Specifically, the matrix describing the system should not
include a row or column representing the ground node. So, my program
collects all the nodes the user defines, searches for the ground node, and
erases it from the list. That way, I can sort the list, and use the index
of each node to allocate a line in the matrix. Since I've collected all the
nodes, I know that if a particular node isn't found in the list, it must be
the ground node. It's kind of an ugly mechanism, I guess, but I currently
don't have a better idea.
Unfortunatly, hash_set is not yet in the C++ standard, so the explanation
has to be implementation-specific.
However, you may want to see if you can pre-allocate or predefine the size
of your hash table. Also, as previously pointed out, beware of std::string
instances being copied.

I would like to convert the vector to store pointers to strings, rather than
the strings themselves, but then I cannot use the built-in sort and binary
searches to find a particular string efficiently.
Unfortunately, the code you posted shows very little about the file
reading code and hash table, so we can't really help there.
Regarding the vector and sorting only, I can say that using const char*
instead of std::string is likely to improve execution speed.

I didn't write my own hash table. I thought hash_set handled this problem on
its own. Not coming from a CS background, I don't know how to write a good
hash scheme, and it seems like this is the sort of thing that should be
provided by the built in libraries. Since it works faster, for now, I've
switched back to the regular (tree?) set. I want to use a set so repeat
copies of nodes will not be duplicated when added to the set. I actually
don't do any of the searches until after the set is copied into a vector.
Given that, should I actually expect the hash_set to have a faster insertion
time? Not having a CS background, I just read Java's documentation which
says that a hash set is faster in most cases.

On the I/O code, I posted this in my other reply, but here's another copy:

std::vector< std::vector< std::string > > lines;
std::string line;
std::vector<std::string> thisLine;

do {
    std::getline(file, line);
    if (line.length() != 0) {
        split(line, thisLine);
        lines.push_back(thisLine);
        //process the line
    }
} while (!file.eof());

inline void split(const std::string& line, std::vector<std::string>& words)
{
    std::string::size_type firstMark = 0, lastMark = 0;
    words.clear();
    while (lastMark != std::string::npos) {
        firstMark = line.find_first_not_of(" ", lastMark);
        if (firstMark == std::string::npos) break;
        lastMark = line.find_first_of(" ", firstMark);
        words.push_back(line.substr(firstMark, lastMark - firstMark));
    }
}

Thanks for the tips,

-Alex Gerdemann
University of Illinois Urbana-Champaign
 

Ivan Vecerina

Alex Gerdemann said:
Well the specific negative value returned is unnecessary, but I do need to
know if the string isn't found. It's kind of annoying that the function
doesn't return this information because to find it myself I have to:

1) check that the returned iterator does not point to set.end()
2) check that the iterator points to what it claims.

Certainly, the search algorithm knows internally if it found what it was
looking for. Having to do that job again will certainly cost at least some
time.
Actually, lower_bound will not test that the hit value is actually
equal to the one being looked for (because it only uses '<' for comparison),
but I agree the interface is cumbersome.
std::equal_range could be an alternative...
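For illustration, a getIndex() built on std::equal_range might look like the following untested sketch (the function name is mine; it keeps the original's negative-index convention for misses):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Returns the index of item, or a negative value when it is absent; the
// "was it found?" answer comes from equal_range itself (an empty range).
inline int getIndexER(const std::vector<std::string>& list,
                      const std::string& item)
{
    std::pair<std::vector<std::string>::const_iterator,
              std::vector<std::string>::const_iterator> r =
        std::equal_range(list.begin(), list.end(), item);
    int pos = static_cast<int>(r.first - list.begin());
    return (r.first == r.second) ? -(pos + 1) : pos;
}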
I would like to convert the vector to store pointers to strings, rather than
the strings themselves, but then I cannot use the built-in sort and binary
searches to find a particular string efficiently.

Actually you can, but you will need to provide these algorithms with an
additional parameter, which is a comparison function:

// add this declaration (it needs <cstring> for std::strcmp)
struct StrPtrCompare
{
    bool operator()(char const* a, char const* b) const
    { return std::strcmp(a, b) < 0; }
};

// and from your function, call:
std::sort(vect.begin(), vect.end(), StrPtrCompare());
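The same comparator can then drive the binary search, for example (a sketch; the contains() helper is only for illustration):

#include <algorithm>
#include <cstring>
#include <string>
#include <vector>

// Search the sorted vector of C strings, reusing StrPtrCompare (declared
// above) so the lookup uses the same ordering as the sort.
bool contains(const std::vector<const char*>& vect, const std::string& item)
{
    std::vector<const char*>::const_iterator it =
        std::lower_bound(vect.begin(), vect.end(),
                         item.c_str(), StrPtrCompare());
    return it != vect.end() && std::strcmp(*it, item.c_str()) == 0;
}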

I didn't write my own hash table.

I did not suggest you should. Just some implementations of hash_set
have an equivalent of vector::reserve() to preallocate a larger
table (and avoid later reallocations). But no big deal...
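For what it's worth, the SGI-style hash_set shipped with GCC takes an initial bucket-count hint in its constructor and also has a resize() member, if I remember the extension interface correctly; a sketch (expectedNodeCount is a made-up name):

#include <ext/hash_set>
#include <string>

// Assumes the SGI/libstdc++ extension interface: the constructor accepts a
// bucket-count hint and resize() pre-grows the table. hashString is the
// hash functor defined earlier in the thread.
__gnu_cxx::hash_set<std::string, hashString> nodeSet(100000);
// ...or later, once an estimate is known:
// nodeSet.resize(expectedNodeCount);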
copies of nodes will not be duplicated when added to the set. I actually
don't do any of the searches until after the set is copied into a vector.
Given that, should I actually expect the hash_set to have a faster insertion
time? Not having a CS background, I just read Java's documentation which
says that a hash set is faster in most cases.
What may be faster is to first copy and sort everything into a vector,
then look for (contiguous) duplicate items and remove them.
[ since you need a sorted vector anyway, the hash may be redundant ]
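In code, that alternative might look like this sketch: collect every node name, duplicates and all, while parsing, then sort and drop the adjacent duplicates in one pass.

#include <algorithm>
#include <string>
#include <vector>

// Sort the raw node list, then erase the adjacent duplicates that
// std::unique shuffles to the end of the vector.
void sortAndUnique(std::vector<std::string>& nodeVector)
{
    std::sort(nodeVector.begin(), nodeVector.end());
    nodeVector.erase(std::unique(nodeVector.begin(), nodeVector.end()),
                     nodeVector.end());
}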
On the I/O code, I posted this in my other reply, but here's another copy:

std::vector< std::vector< std::string > > lines;

Unfortunately, such a data structure can be inefficient in C++,
and involve many object copies and memory allocations.
But there are a few tricks that can help....
std::string line;
std::vector<std::string> thisLine;

do {
    std::getline(file, line);
    if (line.length() != 0) {
        split(line, thisLine);
        lines.push_back(thisLine);

Instead of the last two lines, the following will avoid
an intermediate copy of the objects and sub-objects:
lines.push_back( std::vector<std::string>() );
split( line, lines.back() );

Also, it will help if you call
lines.reserve( someGuessOfTheNumberOfInputLines );
prior to reading the file.
Alternatively, you could use a different type:
        //process the line
    }
} while (!file.eof());

inline void split(const std::string& line, std::vector<std::string>& words)
{
    std::string::size_type firstMark = 0, lastMark = 0;
    words.clear();
    while (lastMark != std::string::npos) {
        firstMark = line.find_first_not_of(" ", lastMark);
        if (firstMark == std::string::npos) break;
        lastMark = line.find_first_of(" ", firstMark);
        words.push_back(line.substr(firstMark, lastMark - firstMark));
    }
}

Now, not talking about efficiency, all the input code above probably
could be simplified as follows (including <sstream> and <iterator>):

while( std::getline(file, line) )
{
    std::istringstream iss(line);   // a named stream; a temporary won't bind
    lines.push_back( std::vector<std::string>() ); // add empty object
    lines.back().assign( std::istream_iterator<std::string>(iss),
                         std::istream_iterator<std::string>() );
}

This is the C++ input code I would start with.
[ for the rest, the tips above - use char* after this input code
and skip the hash if possible - should improve execution speed ]


If the reading code remains a critical bottleneck, the
ultimate solution would be to:
1) skip iostreams, and use memory mapping or a single fread
to bring the whole file into memory
2) parse the file in-place, and add null chars at the end of
each 'word', so I can use a simple char* to access all
strings in-place, without any memory allocation.
That would take me an extra day of programming and be less
portable/maintainable, but be blazingly fast...
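A bare-bones sketch of that approach (error handling omitted, the file name is made up, and for brevity the line structure is flattened into a single word list):

#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    // 1) One read to pull the whole file into memory.
    std::FILE* f = std::fopen("netlist.txt", "rb");   // hypothetical name
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<char> buf(size + 1);
    std::fread(&buf[0], 1, size, f);
    std::fclose(f);
    buf[size] = '\0';

    // 2) Parse in place: strtok writes '\0' after each word, so the
    //    collected char* pointers reference the buffer directly and no
    //    per-word allocation happens.
    std::vector<char*> words;
    for (char* w = std::strtok(&buf[0], " \r\n"); w != 0;
         w = std::strtok(0, " \r\n")) {
        words.push_back(w);
    }
    return 0;
}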



Well... I hope some of this stuff will be useful.
[I'm a bit in a rush]

Cheers,
Ivan
 

Ivan Vecerina

I (Ivan Vecerina) wrote in message news:[email protected]...
copies of nodes will not be duplicated when added to the set. I actually
don't do any of the searches until after the set is copied into a vector.
Given that, should I actually expect the hash_set to have a faster insertion
time? Not having a CS background, I just read Java's documentation which
says that a hash set is faster in most cases.
What may be faster is to first copy and sort everything into a vector,
then look for (contiguous) duplicate items and remove them.
[ since you need a sorted vector anyway, the hash may be redundant ]
NB: however this only makes sense if there are few duplicate strings to
detect.
Well... I hope some of this stuff will be useful.
And of course it is difficult to suggest solutions
without seeing the big picture...

Regards
-Ivan
 

Tom Widmer

I'm using Cygwin's special version of GCC 3.3.3. I can't really separate
how much time is spent just reading the file, as I process it line by line,
doing some work between each I/O call.

Ok, I'm 95% sure that gcc 3.3 uses a copy-on-write (COW) string
implementation, with reference counting (gcc 3.4 certainly does). This
means that copying a string doesn't require that any memory is
allocated - the copy shares representation with the original and only
creates its own unique copy when it might modify it. This also means
that copying strings is cheap, as long as those strings are only
accessed through const member functions after the copy has occurred.
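As a toy illustration only (this is not GCC's actual implementation, and it ignores thread safety and the copy-on-write part), a reference-counted string shares one buffer and a use count between copies, so copying never duplicates the character data:

#include <cstring>

// Toy reference-counted string: copies share one buffer and a use count.
class RcString {
    struct Rep { char* data; int count; };
    Rep* rep_;
public:
    explicit RcString(const char* s) : rep_(new Rep) {
        rep_->data = new char[std::strlen(s) + 1];
        std::strcpy(rep_->data, s);
        rep_->count = 1;
    }
    RcString(const RcString& other) : rep_(other.rep_) {
        ++rep_->count;                 // copying is just a counter bump
    }
    ~RcString() {
        if (--rep_->count == 0) {      // the last owner frees the buffer
            delete[] rep_->data;
            delete rep_;
        }
    }
    const char* c_str() const { return rep_->data; }
private:
    RcString& operator=(const RcString&);  // assignment omitted in this sketch
};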
Since you asked here's exactly what I did:

std::vector< std::vector< std::string > > lines;

You definitely want:
lines.reserve(estimateOfNumberOfLines);
std::string line;
std::vector<std::string> thisLine;

do {

line.reserve(enoughForALine); //may well help
std::getline(file,line);
if (line.length() != 0) {
split(line,thisLine);
lines.push_back(thisLine);

The above line is going to be slow, since it involves copying the
whole vector. Instead, you might do:
//avoid copying the vector:
lines.resize(lines.size() + 1); //add extra default element
lines.back().swap(thisLine); //swap it with the current element
        //process the line
    }
} while (!file.eof());

inline void split(const std::string& line, std::vector<std::string>& words)
{
    std::string::size_type firstMark = 0, lastMark = 0;
    words.clear();

words.reserve(50); //say
    while (lastMark != std::string::npos) {
        firstMark = line.find_first_not_of(" ", lastMark);
        if (firstMark == std::string::npos) break;
        lastMark = line.find_first_of(" ", firstMark);
        words.push_back(line.substr(firstMark, lastMark - firstMark));

That might be slightly more efficient as:

words.push_back(std::string(line, firstMark, lastMark - firstMark));

but I doubt it will make much difference with a COW string
implementation.
}
}



I added the reserve call, which saved me a couple of tenths of a second.
Nice, but it must not be the major bottleneck. What is a "reference counted"
class?

See above.
I went ahead and tried this, but can't quite understand the results. The
total time it computes is way off (it says it's 8s vs. the actual 14s).
Also, it thinks that only 30% of run time was spent in the main() function.
This can't possibly be right.

It might not be counting time spent in system functions, such as IO
(system vs user time?).

If you want to optimize the above, I'd do this:

Write a simple immutable string type that is initialized with a
pointer and a length, but allocate no memory and does nothing in the
destructor. Include in the string a cached hashcode value, so that the
hashcode need only be calculated once. e.g.

class mystring
{
    char const* m_ptr;
    std::size_t m_length;
    mutable unsigned long m_hashCode;
public:
    mystring(char const* ptr, std::size_t length)
      : m_ptr(ptr), m_length(length),
        m_hashCode(static_cast<unsigned long>(-1))
    {
    }

    unsigned long hashCode() const
    {
        if (m_hashCode == static_cast<unsigned long>(-1))
        {
            //calculate hashCode (copy java.lang.String code?)
        }
        return m_hashCode;
    }

    char operator[](std::size_t index) const
    {
        return m_ptr[index];
    }

    //operator==, <, etc.
    //compiler generated destructor, copy, assignment are fine.
};

Read the entire file into a vector<char>.

Iterate over the vector, creating "mystring"s pointing into the
vector for each word. You might also consider replacing ' ' characters
with '\0's, so that you can add a c_str() method to mystring that simply
looks like this:
char const* c_str() const
{
return m_ptr;
}

Operate on this new vector<vector<mystring> >.
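To make the recipe concrete, the reading and tokenizing steps might look roughly like this sketch, built around the mystring class above (the function name is mine; separators are overwritten with '\0' so c_str() works, and the caller must keep buf alive as long as the mystrings are in use):

#include <cstddef>
#include <fstream>
#include <iterator>
#include <vector>

// Read the whole file into one buffer, then walk it once, terminating each
// word in place and wrapping it in a mystring (no per-word allocation).
std::vector<mystring> tokenize(const char* filename, std::vector<char>& buf)
{
    std::ifstream in(filename, std::ios::binary);
    buf.assign(std::istreambuf_iterator<char>(in),
               std::istreambuf_iterator<char>());
    buf.push_back('\0');

    std::vector<mystring> words;
    std::size_t i = 0;
    while (i < buf.size()) {
        while (i < buf.size() && (buf[i] == ' ' || buf[i] == '\n' ||
                                  buf[i] == '\r' || buf[i] == '\0'))
            ++i;                                   // skip separators
        std::size_t start = i;
        while (i < buf.size() && buf[i] != ' ' && buf[i] != '\n' &&
               buf[i] != '\r' && buf[i] != '\0')
            ++i;                                   // scan one word
        if (i > start) {
            buf[i] = '\0';                         // terminate in place
            words.push_back(mystring(&buf[start], i - start));
        }
    }
    return words;
}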

That should be much, much faster, since the memory allocation overhead
will be vastly decreased. If you don't need a vector of lines, just
have an overall vector of words for another speed up. Essentially,
optimization in non-numerical C++ is often about reducing the number
of calls to "new" and "delete", which are often even slower than the
Java versions (new + gc).

Tom
 

Karl Heinz Buchegger

Alex said:
[snip]
Other than that, my code in the Java and C++ is quite similar. However, the
parse of a 100k line file took around a second for Java, but more like 17s
for C++ . In fact, using std::set instead of hash_set actually improved
performance to about 15s in C++.

Hmm. Just a quick question:
Are you sure the C++ optimizer ran over your code and you are not
timing a debug version?

Especially with container templates the optimizer can often dramatically
reduce execution time.
 

Tom Widmer

Alex said:
[snip]
Other than that, my code in the Java and C++ is quite similar. However, the
parse of a 100k line file took around a second for Java, but more like 17s
for C++ . In fact, using std::set instead of hash_set actually improved
performance to about 15s in C++.

Hmm. Just a quick question:
Are you sure the C++ optimizer ran over your code and you are not
timing a debug version?

Especially with container templates the optimizer can often dramatically
reduce execution time.

Good point. For GCC, use -O3 as a minimum (you can add architecture-specific
optimizations too if you like).

Tom
 
