Performance of hash_set vs. Java

Discussion in 'C++' started by Alex Gerdemann, Oct 11, 2004.

  1. Hello,

    I have spent a bunch of time converting a Java program I wrote to C++ in
    order to improve performance, and have found that it is not necessarily
    faster.

    Specifically, I'm writing a circuit simulator, so I must first parse the
    netlist file. The program goes through an input file, and makes a hash_set
    of unique nodes (std::string's). The words are then sorted and numbered by
    copying the hash_set into a vector and sorting. Then, I run a series of
    binary searches on the vector to allocate lines in a matrix for each
    element.

    I've modified STL's lower_bound() to my getIndex() which functions like
    Java's equivalent by returning negative indices when the element is not
    found.

    Also, Java provides a HashSet class that can hash any kind of element it may
    contain. Since, the STL only hashes selected items I had to convert my
    std::strings to const char*'s for hashing.

    Other than that, my code in the Java and C++ is quite similar. However, the
    parse of a 100k line file took around a second for Java, but more like 17s
    for C++ . In fact, using std::set instead of hash_set actually improved
    performance to about 15s in C++.

    Any idea why there would be such a difference? In particular, I suspect
    that my hash function may be slow from the call to c_str(). Does this have
    to allocate new memory or just point to somewhere inside the object? I'm
    kind of curious what Java actually uses for its hash. Any general
    performace/other criticisms on my scheme are certainly welcome.

    On the plus side there is a great direct sparse linear solver for C. I've
    only been able to find a iterative sparse solver for Java. This makes the
    C++ code overall much faster than Java, but I suspect that the parse stage
    should run at least as fast or faster.

    Thanks,

    Alex Gerdemann
    University of Illinois at Urbana-Champaign

    For reference, I've included a sketch of the code

    So I did something like the following:

    struct hashSet {
    __gnu_css::hash<const char*> h;

    bool operator()(std::string& s) const {
    return h(s.ctr());
    }
    };

    inline int getIndex(const std::vector<std::string>& list, const std::string&
    item) {
    std::vector<std::string>::const_iterator i =
    std::lower_bound(list.begin(), list.end(), item);
    if(i==list.end()) {
    return -(static_cast<int>(i-list.begin())+1);
    } else if(*i != item) {
    return -(static_cast<int>(i-list.begin())+1);
    }
    return static_cast<int>(i-list.begin());
    }

    __gnu_cxx::hash_set<std::string, hashString> nodeSet;
    std::vector<std::string> nodeVector;

    while(!done) {
    //read a line of file
    //split the line into words and push into a vector
    //check for syntax errors
    nextNode = some word in line;
    nodeSet.insert(nextNode);
    }
    std::copy(nodeSet.begin(),node.end(),std::back_inserter(nodeVector));
    std::sort(nodeVector.begin(),nodeVector.end());
    while(!done) {
    //read line from previously stored words
    //more error checking
    thisNode = some word in line
    getIndex(nodeVector, thisNode);
    //allocate an object representing element containing proper indices
    representing nodes
    //push pointer to element into a vector
    }
    Alex Gerdemann, Oct 11, 2004
    #1
    1. Advertising

  2. Alex Gerdemann wrote:
    > Hello,
    > struct hashSet {
    > __gnu_css::hash<const char*> h;
    >
    > bool operator()(std::string& s) const {
    > return h(s.ctr());
    > }
    > };
    >
    > inline int getIndex(const std::vector<std::string>& list, const std::string&
    > item) {
    > std::vector<std::string>::const_iterator i =
    > std::lower_bound(list.begin(), list.end(), item);
    > if(i==list.end()) {
    > return -(static_cast<int>(i-list.begin())+1);
    > } else if(*i != item) {
    > return -(static_cast<int>(i-list.begin())+1);
    > }
    > return static_cast<int>(i-list.begin());
    > }
    >
    > __gnu_cxx::hash_set<std::string, hashString> nodeSet;
    > std::vector<std::string> nodeVector;

    When you use std::string , you will have *alot* of string copying
    going on. Java will just pass its "refrece"/"pointer" along.
    Try convert lists, etc. to hold std::string* (pointers )rather
    =?ISO-8859-1?Q?=22Nils_O=2E_Sel=E5sdal=22?=, Oct 11, 2004
    #2
    1. Advertising

  3. Alex Gerdemann

    Tom Widmer Guest

    On Mon, 11 Oct 2004 11:56:38 GMT, "Alex Gerdemann"
    <> wrote:

    >Hello,
    >
    >I have spent a bunch of time converting a Java program I wrote to C++ in
    >order to improve performance, and have found that it is not necessarily
    >faster.


    No, it probably won't be unless you use a better class than
    std::string to hold the strings. java.lang.String is in several ways
    more efficient than typical implementations of std::string.

    >Specifically, I'm writing a circuit simulator, so I must first parse the
    >netlist file. The program goes through an input file, and makes a hash_set
    >of unique nodes (std::string's). The words are then sorted and numbered by
    >copying the hash_set into a vector and sorting. Then, I run a series of
    >binary searches on the vector to allocate lines in a matrix for each
    >element.
    >
    >I've modified STL's lower_bound() to my getIndex() which functions like
    >Java's equivalent by returning negative indices when the element is not
    >found.
    >
    >Also, Java provides a HashSet class that can hash any kind of element it may
    >contain. Since, the STL only hashes selected items I had to convert my
    >std::strings to const char*'s for hashing.


    The main benefit Java has in hashing is that Strings cache their
    hashcodes, so they need only be calculated once. std::string knows
    nothing about hashing, and therefore doesn't perform this
    optimization.

    >Other than that, my code in the Java and C++ is quite similar. However, the
    >parse of a 100k line file took around a second for Java, but more like 17s
    >for C++ . In fact, using std::set instead of hash_set actually improved
    >performance to about 15s in C++.


    How long did it take just to read in the file? I don't know much about
    GCC's hash_set, but I assume it is configurable for bucket size, etc.
    You might want to tweak this. What version of GCC are you using?

    >Any idea why there would be such a difference? In particular, I suspect
    >that my hash function may be slow from the call to c_str(). Does this have
    >to allocate new memory or just point to somewhere inside the object?


    Sometimes is has to do the former, due to reference counting problems,
    but usually it just returns a pointer to the storage. Tracing in using
    the debugger would tell you what's going on in the hash calls.

    I'm
    >kind of curious what Java actually uses for its hash. Any general
    >performace/other criticisms on my scheme are certainly welcome.


    Java just iterates over the whole String, generating the hash value,
    and then caches it for future calls.

    >On the plus side there is a great direct sparse linear solver for C. I've
    >only been able to find a iterative sparse solver for Java. This makes the
    >C++ code overall much faster than Java, but I suspect that the parse stage
    >should run at least as fast or faster.
    >
    >Thanks,
    >
    >Alex Gerdemann
    >University of Illinois at Urbana-Champaign
    >
    >For reference, I've included a sketch of the code
    >
    >So I did something like the following:
    >
    >struct hashSet {
    > __gnu_css::hash<const char*> h;
    >
    > bool operator()(std::string& s) const {
    > return h(s.ctr());
    > }


    I doubt that is your bottleneck, but if the strings are long, then
    Java's hashcode caching might be giving it a major advantage.

    >};
    >
    >inline int getIndex(const std::vector<std::string>& list, const std::string&
    >item) {
    > std::vector<std::string>::const_iterator i =
    >std::lower_bound(list.begin(), list.end(), item);
    > if(i==list.end()) {
    > return -(static_cast<int>(i-list.begin())+1);
    > } else if(*i != item) {
    > return -(static_cast<int>(i-list.begin())+1);
    > }
    > return static_cast<int>(i-list.begin());
    >}
    >
    >__gnu_cxx::hash_set<std::string, hashString> nodeSet;
    >std::vector<std::string> nodeVector;
    >
    >while(!done) {
    > //read a line of file
    > //split the line into words and push into a vector
    > //check for syntax errors
    > nextNode = some word in line;
    > nodeSet.insert(nextNode);


    The above code may be where your main bottleneck is. How do you read
    the line and then split it into words? If you are creating a number of
    temporary strings, that will be slowing you down. You might consider
    using a fixed vector or two and reusing them, to avoid temporaries.

    >}


    Here you definitely want:
    nodeVector.reserve(nodeSet.size());

    >std::copy(nodeSet.begin(),node.end(),std::back_inserter(nodeVector));
    >std::sort(nodeVector.begin(),nodeVector.end());


    Both of those operations should be fast if you are using a reference
    counted string class. I believe GCC 3+ uses such a beast.

    >while(!done) {
    > //read line from previously stored words
    > //more error checking
    > thisNode = some word in line
    > getIndex(nodeVector, thisNode);
    > //allocate an object representing element containing proper indices
    >representing nodes
    > //push pointer to element into a vector
    >}


    Overall, I think your best bet would be to benchmark smaller bits of
    the code or put it through a profiler (-gprof IIRC, which I probably
    don't), to find out where the bottleneck actually lies.

    Tom
    Tom Widmer, Oct 11, 2004
    #3
  4. "Alex Gerdemann" <> wrote in message
    news:Wxuad.369933$Fg5.53609@attbi_s53...
    > I have spent a bunch of time converting a Java program I wrote to C++ in
    > order to improve performance, and have found that it is not necessarily
    > faster.

    This can often be the case. Some native Java features can be more efficient
    that the C++ equivalents (which sometimes are designed for more
    flexibility).
    This is especially true for C++ streams.

    > Specifically, I'm writing a circuit simulator, so I must first parse the
    > netlist file. The program goes through an input file, and makes a
    > hash_set of unique nodes (std::string's). The words are then sorted and
    > numbered by copying the hash_set into a vector and sorting. Then, I run a
    > series of binary searches on the vector to allocate lines in a matrix for
    > each element.

    As someone already pointed out, copying strings can be expensive in C++.
    An alternative would be to keep the originally read std::string,
    then use only a "const char*" pointer obtained using the string::c_str()
    member function (I'd do this rather than use a std::string*).

    > I've modified STL's lower_bound() to my getIndex() which functions like
    > Java's equivalent by returning negative indices when the element is not
    > found.

    This wrapper is an unnecessary Java-ism, but ok,
    it should be irrelevant to performance.

    > Also, Java provides a HashSet class that can hash any kind of element it
    > may contain. Since, the STL only hashes selected items I had to convert
    > my std::strings to const char*'s for hashing.

    All C++ hash_set implementations I've seen allow users to provide
    a custom hash function as an additional parameter. But this shouldn't
    be a big problem.

    > Other than that, my code in the Java and C++ is quite similar. However,
    > the parse of a 100k line file took around a second for Java, but more like
    > 17s for C++ . In fact, using std::set instead of hash_set actually
    > improved performance to about 15s in C++.

    Unfortunatly, hash_set is not yet in the C++ standard, so the explanation
    has to be implementation-specific.
    However, you may want to see if you can pre-allocate or predefine the size
    of your hash table. Also, as previously pointed out, beware of std::string
    instances being copied.

    > Any idea why there would be such a difference? In particular, I suspect
    > that my hash function may be slow from the call to c_str(). Does this
    > have to allocate new memory or just point to somewhere inside the object?
    > I'm kind of curious what Java actually uses for its hash. Any general
    > performace/other criticisms on my scheme are certainly welcome.

    Are you sure that the hashing is the performance bottleneck?
    C++ i/o streams could also be a likely cause for the weak performance.

    All in all, I am very confident that better performance can be achieved
    in C++ than in Java. But in some cases, especially string and file i/o,
    this can require extra work :(

    Unfortunately, the code you posted shows very little about the file
    reading code and hash table, so we can't really help there.
    Regarding the vector and sorting only, I can say that using const char*
    instead of std::string is likely to improve execution speed.

    Cheers -Ivan
    --
    http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form
    Ivan Vecerina, Oct 11, 2004
    #4
  5. Tom wrote:
    >>Other than that, my code in the Java and C++ is quite similar. However,
    >>the
    >>parse of a 100k line file took around a second for Java, but more like 17s
    >>for C++ . In fact, using std::set instead of hash_set actually improved
    >>performance to about 15s in C++.

    >
    > How long did it take just to read in the file? I don't know much about
    > GCC's hash_set, but I assume it is configurable for bucket size, etc.
    > You might want to tweak this. What version of GCC are you using?
    >


    I'm using Cygwin's special version of GCC 3.3.3. I can't really seperate
    how much time is spent just reading the file, as I process it line by line,
    doing some work between each I/O call.

    >>while(!done) {
    >> //read a line of file
    >> //split the line into words and push into a vector
    >> //check for syntax errors
    >> nextNode = some word in line;
    >> nodeSet.insert(nextNode);

    >
    > The above code may be where your main bottleneck is. How do you read
    > the line and then split it into words? If you are creating a number of
    > temporary strings, that will be slowing you down. You might consider
    > using a fixed vector or two and reusing them, to avoid temporaries.
    >


    Since you asked here's exactly what I did:

    std::vector< std::vector< std::string > > lines;
    std::string line;
    std::vector<std::string> thisLine;

    do {
    std::getline(file,line);
    if (line.length() != 0) {
    split(line,thisLine);
    lines.push_back(thisLine);
    //process the line
    } while(!netlist.eof());

    inline void split(const std::string& line, std::vector<std::string>& words)
    {
    unsigned int firstMark = 0, lastMark = 0;
    words.clear();
    while(lastMark!=std::string::npos) {
    firstMark=line.find_first_not_of(" ",lastMark);
    if(firstMark==std::string::npos) break;
    lastMark=line.find_first_of(" ",firstMark);
    words.push_back(line.substr(firstMark,lastMark-firstMark));
    }
    }


    >>}

    >
    > Here you definitely want:
    > nodeVector.reserve(nodeSet.size());
    >
    >>std::copy(nodeSet.begin(),node.end(),std::back_inserter(nodeVector));
    >>std::sort(nodeVector.begin(),nodeVector.end());

    >
    > Both of those operations should be fast if you are using a reference
    > counted string class. I believe GCC 3+ uses such a beast.
    >


    I added the reserve call which saved me a couple of tenths of a second.
    Nice, but must not be the major bottleneck. What is a "reference counted"
    class?

    > Overall, I think your best bet would be to benchmark smaller bits of
    > the code or put it through a profiler (-gprof IIRC, which I probably
    > don't), to find out where the bottleneck actually lies.


    I went ahead and tried this, but can't quite understand the results. The
    total time it computes is way off. (It says its 8s vs. the actual 14s).
    Also, it thinks that only 30% of run time was spent in the main() function.
    This can't possibly be right.

    -Alex Gerdemann
    University of Illinois Urbana-Champaign
    Alex Gerdemann, Oct 12, 2004
    #5
  6. Ivan wrote:
    >> I've modified STL's lower_bound() to my getIndex() which functions like
    >> Java's equivalent by returning negative indices when the element is not
    >> found.

    > This wrapper is an unnecessary Java-ism, but ok,
    > it should be irrelevant to performance.


    Well the specific negative value returned is unnecessary, but I do need to
    know if the string isn't found. It's kind of annoying that the function
    doesn't return this information because to find it myself I have to:

    1) check that the returned iterator does not point to set.end()
    2) check that the iterator points to what it claims.

    Certainly, the search algorithm knows internally if it found what it was
    looking for. Having to do that job again will certainly cost at least some
    time.

    The reason the node may not be in the list has to do with the details of
    circuit analysis. Specifically, the matrix describing the system should not
    include a row or column representing the ground node. So, my program
    collects all the nodes the user defines, searches for the ground node, and
    erases it from the list. That way, I can sort the list, and use the index
    of each node to allocate a line in the matrix. Since I've collected all the
    nodes, I know that if a particular node isn't found in the list, it must be
    the ground node. It's kind of an ugly mechanism, I guess, but I currently
    don't have a better idea.

    >> Also, Java provides a HashSet class that can hash any kind of element it
    >> Other than that, my code in the Java and C++ is quite similar. However,
    >> the parse of a 100k line file took around a second for Java, but more
    >> like 17s for C++ . In fact, using std::set instead of hash_set actually
    >> improved performance to about 15s in C++.

    > Unfortunatly, hash_set is not yet in the C++ standard, so the explanation
    > has to be implementation-specific.
    > However, you may want to see if you can pre-allocate or predefine the size
    > of your hash table. Also, as previously pointed out, beware of std::string
    > instances being copied.


    I would like to convert the vector to store pointers to strings, rather than
    the strings themselves, but then I cannot search use the built in sort, and
    binary searches to find a particular string efficiently.

    > Unfortunately, the code you posted shows very little about the file
    > reading code and hash table, so we can't really help there.
    > Regarding the vector and sorting only, I can say that using const char*
    > instead of std::string is likely to improve execution speed.


    I didn't write my own hash table. I though hash_set handled this problem on
    its own. Not coming from a CS background, I don't know how to write a good
    hash scheme, and it seems like this is the sort of thing that should be
    provided by the built in libraries. Since it works faster, for now, I've
    switched back to the regular (tree?) set. I want to use a set so repeat
    copies of nodes will not be duplicated when added to the set. I actually
    don't do any of the searches until after the set is copied into a vector.
    Given that, should I actually expect the hash_set to have a faster insertion
    time? Not having a CS background, I just read Java's documentation which
    says that a hash set is faster in most cases.

    On the I/O code, I posted this in my other reply, but here's another copy:

    std::vector< std::vector< std::string > > lines;
    std::string line;
    std::vector<std::string> thisLine;

    do {
    std::getline(file,line);
    if (line.length() != 0) {
    split(line,thisLine);
    lines.push_back(thisLine);
    //process the line
    } while(!netlist.eof());

    inline void split(const std::string& line, std::vector<std::string>& words)
    {
    unsigned int firstMark = 0, lastMark = 0;
    words.clear();
    while(lastMark!=std::string::npos) {
    firstMark=line.find_first_not_of(" ",lastMark);
    if(firstMark==std::string::npos) break;
    lastMark=line.find_first_of(" ",firstMark);
    words.push_back(line.substr(firstMark,lastMark-firstMark));
    }
    }

    Thanks for the tips,

    -Alex Gerdemann
    University of Illinois Urbana-Champaign
    Alex Gerdemann, Oct 12, 2004
    #6
  7. "Alex Gerdemann" <> wrote in message
    news:%WTad.234454$D%.43700@attbi_s51...
    > Ivan wrote:
    >>> I've modified STL's lower_bound() to my getIndex() which functions like
    >>> Java's equivalent by returning negative indices when the element is not
    >>> found.

    >> This wrapper is an unnecessary Java-ism, but ok,
    >> it should be irrelevant to performance.

    >
    > Well the specific negative value returned is unnecessary, but I do need to
    > know if the string isn't found. It's kind of annoying that the function
    > doesn't return this information because to find it myself I have to:
    >
    > 1) check that the returned iterator does not point to set.end()
    > 2) check that the iterator points to what it claims.
    >
    > Certainly, the search algorithm knows internally if it found what it was
    > looking for. Having to do that job again will certainly cost at least
    > some
    > time.

    Actually lower_bound may will not test that the hit value is actually
    equal to the one being looked for (because it only uses '<' for comparison)
    but I agree the interface is cumbersome.
    std::equal_range could be an alternative...

    >>> Also, Java provides a HashSet class that can hash any kind of element it
    >>> Other than that, my code in the Java and C++ is quite similar. However,
    >>> the parse of a 100k line file took around a second for Java, but more
    >>> like 17s for C++ . In fact, using std::set instead of hash_set actually
    >>> improved performance to about 15s in C++.

    >> Unfortunatly, hash_set is not yet in the C++ standard, so the explanation
    >> has to be implementation-specific.
    >> However, you may want to see if you can pre-allocate or predefine the
    >> size
    >> of your hash table. Also, as previously pointed out, beware of
    >> std::string
    >> instances being copied.

    >
    > I would like to convert the vector to store pointers to strings, rather
    > than
    > the strings themselves, but then I cannot search use the built in sort,
    > and
    > binary searches to find a particular string efficiently.


    Actually you can, but you will need to provide these algorithms with an
    additional parameter, which is a comparison function:

    // add this declaration
    struct StrPtrCompare
    {
    bool operator()(char const* a, char const* b) const
    { return std::strcmp( a, b ) < 0; }
    };

    //and from your function call call:
    std::sort( vect.begin(), vect.end(), StrPtrCompare() );


    >> Unfortunately, the code you posted shows very little about the file
    >> reading code and hash table, so we can't really help there.
    >> Regarding the vector and sorting only, I can say that using const char*
    >> instead of std::string is likely to improve execution speed.

    >
    > I didn't write my own hash table.


    I did not suggest you should. Just some implementations of hash_set
    have an equivalent of vector::reserve() to preallocate a larger
    table (and avoid later reallocations). But no big deal...

    > copies of nodes will not be duplicated when added to the set. I actually
    > don't do any of the searches until after the set is copied into a vector.
    > Given that, should I actually expect the hash_set to have a faster
    > insertion
    > time? Not having a CS background, I just read Java's documentation which
    > says that a hash set is faster in most cases.

    What may be faster is to first copy and sort everything into a vector,
    then look for (contiguous) duplicate items and remove them.
    [ since you need a sorted vector anyway, the hash may be redundant ]

    > On the I/O code, I posted this in my other reply, but here's another copy:
    >
    > std::vector< std::vector< std::string > > lines;


    Unfortunately, such a data structure can be inefficient in C++,
    and involve many object copies an memory allocations.
    But there are a few tricks that can help....

    > std::string line;
    > std::vector<std::string> thisLine;
    >
    > do {
    > std::getline(file,line);
    > if (line.length() != 0) {
    > split(line,thisLine);
    > lines.push_back(thisLine);


    Instead of the last two lines, the following will avoid
    an intermediate copy of the objects and sub-objects:
    lines.push_back( std::vector<std::string>() );
    split( line, lines.back() );

    Also, it will help if you call
    lines.reserve( someGuessOfTheNumberOfInputLines );
    prior to reading the file.
    Alternatively, you could use a different type:
    std::vector< std::vector< std::string > > lines;
    // will be faster on some platforms.

    > //process the line
    > } while(!netlist.eof());
    >
    > inline void split(const std::string& line, std::vector<std::string>&
    > words)
    > {
    > unsigned int firstMark = 0, lastMark = 0;
    > words.clear();
    > while(lastMark!=std::string::npos) {
    > firstMark=line.find_first_not_of(" ",lastMark);
    > if(firstMark==std::string::npos) break;
    > lastMark=line.find_first_of(" ",firstMark);
    > words.push_back(line.substr(firstMark,lastMark-firstMark));
    > }
    > }


    Now not talking about efficiency, all the input code above probably
    could be simplified as follows (including <sstream> and <iterator>):

    while( std::getline(file,line) )
    {
    lines.push_back( std::vector<std::string>() ); // add empty object
    lines.back().assign(
    std::istream_iterator<std::string>( istringstream(line) )
    std::istream_iterator<std::string>() );
    }

    This is the C++ input code I would start with.
    [ for the rest, the tips above - use char* after this input code
    and skip the hash if possible - should improve execution speed ]


    If the reading code remains a critical bottleneck, the
    ultimate solution would be to:
    1) skip iostreams, and use memory mapping or a single fread
    to bring the whole file into memory
    2) parse the file in-place, and add null chars at the end of
    each 'word', so I can use a simple char* to access all
    strings in-place, without any memory allocation.
    That would take me an extra day of programming and be less
    portable/maintainable, but be blazingly fast...



    Well... I hope some of this stuff will be useful.
    [I'm a bit in a rush]

    Cheers,
    Ivan
    --
    http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form
    Ivan Vecerina, Oct 12, 2004
    #7
  8. I (Ivan Vecerina) wrote in message news:ckh97b$sna$...
    > > copies of nodes will not be duplicated when added to the set. I

    actually
    > > don't do any of the searches until after the set is copied into a

    vector.
    > > Given that, should I actually expect the hash_set to have a faster
    > > insertion
    > > time? Not having a CS background, I just read Java's documentation

    which
    > > says that a hash set is faster in most cases.

    > What may be faster is to first copy and sort everything into a vector,
    > then look for (contiguous) duplicate items and remove them.
    > [ since you need a sorted vector anyway, the hash may be redundant ]

    NB: however this only makes sense if there are few duplicate strings to
    detect.

    > Well... I hope some of this stuff will be useful.

    And of course it is difficult to suggest solutions
    without seeing the big picture...

    Regards
    -Ivan
    --
    http://ivan.vecerina.com/contact/?subject=NG_POST <- e-mail contact form
    Ivan Vecerina, Oct 13, 2004
    #8
  9. Alex Gerdemann

    Tom Widmer Guest

    On Tue, 12 Oct 2004 16:28:16 GMT, "Alex Gerdemann"
    <> wrote:

    >I'm using Cygwin's special version of GCC 3.3.3. I can't really seperate
    >how much time is spent just reading the file, as I process it line by line,
    >doing some work between each I/O call.


    Ok, I'm 95% sure that gcc 3.3 uses a copy-on-write (COW) string
    implementation, with reference counting (gcc 3.4 certainly does). This
    means that copying a string doesn't require that any memory is
    allocated - the copy shares representation with the original and only
    creates its own unique copy when it might modify it. This also means
    that copying strings is cheap, as long as those strings are only
    accessed through const member functions after the copy has occurred.

    >>>while(!done) {
    >>> //read a line of file
    >>> //split the line into words and push into a vector
    >>> //check for syntax errors
    >>> nextNode = some word in line;
    >>> nodeSet.insert(nextNode);

    >>
    >> The above code may be where your main bottleneck is. How do you read
    >> the line and then split it into words? If you are creating a number of
    >> temporary strings, that will be slowing you down. You might consider
    >> using a fixed vector or two and reusing them, to avoid temporaries.
    >>

    >
    >Since you asked here's exactly what I did:
    >
    >std::vector< std::vector< std::string > > lines;


    You definitely want:
    lines.reserve(estimateOfNumberOfLines);

    >std::string line;
    >std::vector<std::string> thisLine;
    >
    >do {


    line.reserve(enoughForALine); //may well help

    > std::getline(file,line);
    > if (line.length() != 0) {
    > split(line,thisLine);
    > lines.push_back(thisLine);


    The above line is going to be slow, since it involves copying the
    whole vector. Instead, you might do:
    //avoid coping the vector:
    lines.resize(lines.size() + 1); //add extra default element
    lines.back().swap(thisLine); //swap it with the current element

    > //process the line
    >} while(!netlist.eof());
    >
    >inline void split(const std::string& line, std::vector<std::string>& words)
    >{
    > unsigned int firstMark = 0, lastMark = 0;
    > words.clear();


    words.reserve(50); //say

    > while(lastMark!=std::string::npos) {
    > firstMark=line.find_first_not_of(" ",lastMark);
    > if(firstMark==std::string::npos) break;
    > lastMark=line.find_first_of(" ",firstMark);
    > words.push_back(line.substr(firstMark,lastMark-firstMark));


    That might be slightly more efficient as:

    words.push_back(std::string(line, firstMark, lastMark - firstMark));

    but I doubt it will make much difference with a COW string
    implementation.

    > }
    >}
    >
    >
    >>>}

    >>
    >> Here you definitely want:
    >> nodeVector.reserve(nodeSet.size());
    >>
    >>>std::copy(nodeSet.begin(),node.end(),std::back_inserter(nodeVector));
    >>>std::sort(nodeVector.begin(),nodeVector.end());

    >>
    >> Both of those operations should be fast if you are using a reference
    >> counted string class. I believe GCC 3+ uses such a beast.
    >>

    >
    >I added the reserve call which saved me a couple of tenths of a second.
    >Nice, but must not be the major bottleneck. What is a "reference counted"
    >class?


    See above.

    >
    >> Overall, I think your best bet would be to benchmark smaller bits of
    >> the code or put it through a profiler (-gprof IIRC, which I probably
    >> don't), to find out where the bottleneck actually lies.

    >
    >I went ahead and tried this, but can't quite understand the results. The
    >total time it computes is way off. (It says its 8s vs. the actual 14s).
    >Also, it thinks that only 30% of run time was spent in the main() function.
    >This can't possibly be right.


    It might not be counting time spent in system functions, such as IO
    (system vs user time?).

    If you want to optimize the above, I'd do this:

    Write a simple immutable string type that is initialized with a
    pointer and a length, but allocate no memory and does nothing in the
    destructor. Include in the string a cached hashcode value, so that the
    hashcode need only be calculated once. e.g.

    class mystring
    {
    char const* m_ptr;
    std::size_t m_length;
    mutable unsigned long m_hashCode;
    public:
    mystring(char const* ptr, std::size_t length)
    :m_ptr(ptr), m_length(length),
    m_hashCode(static_cast<unsigned long>(-1))
    {
    }

    unsigned long hashCode() const
    {
    if (m_hashCode == static_cast<unsigned long>(-1))
    {
    //calculate hashCode (copy java.lang.String code?)
    }
    return m_hashCode;
    }

    char operator[](std::size_t index) const
    {
    return m_ptr[index];
    }

    //operator==, <, etc.
    //compiler generated destructor, copy, assignment are fine.
    };

    Read the entire file into a vector<char>.

    Iterate over the vector, adding creating "mystring"s pointing into the
    vector for each word. You might also consider replacing ' ' characters
    with '\0's, so that you can add c_str() method to mystring that simply
    looks like this:
    char const* c_str() const
    {
    return m_ptr;
    }

    Operate on this new vector<vector<mystring> >.

    That should be much much faster, since the memory allocation overhead
    will be vastly decreased. If you don't need a vector of lines, just
    have an overall vector of words for another speed up. Essentially,
    optimization in non-numerical C++ is often about reducing the number
    of calls to "new" and "delete", which are often even slower than the
    Java versions (new + gc).

    Tom
    Tom Widmer, Oct 13, 2004
    #9
  10. Alex Gerdemann wrote:
    >

    [snip]
    > Other than that, my code in the Java and C++ is quite similar. However, the
    > parse of a 100k line file took around a second for Java, but more like 17s
    > for C++ . In fact, using std::set instead of hash_set actually improved
    > performance to about 15s in C++.


    Hmm. Just a quick question:
    Are you sure the C++ optimizer run over your code and you are not
    timing a debug version?

    Especially with container templates the optimizer can often dramatically
    reduce execution time.

    --
    Karl Heinz Buchegger
    Karl Heinz Buchegger, Oct 13, 2004
    #10
  11. Alex Gerdemann

    Tom Widmer Guest

    On Wed, 13 Oct 2004 15:59:23 +0200, Karl Heinz Buchegger
    <> wrote:

    >Alex Gerdemann wrote:
    >>

    >[snip]
    >> Other than that, my code in the Java and C++ is quite similar. However, the
    >> parse of a 100k line file took around a second for Java, but more like 17s
    >> for C++ . In fact, using std::set instead of hash_set actually improved
    >> performance to about 15s in C++.

    >
    >Hmm. Just a quick question:
    >Are you sure the C++ optimizer run over your code and you are not
    >timing a debug version?
    >
    >Especially with container templates the optimizer can often dramatically
    >reduce execution time.


    Good point. For GCC, use -O3 as a minimum (you can add architecture
    specific optimizations too if you like).

    Tom
    Tom Widmer, Oct 13, 2004
    #11
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Vasileios
    Replies:
    4
    Views:
    475
    Rolf Magnus
    Nov 4, 2003
  2. Abhijit Ray
    Replies:
    1
    Views:
    597
    John Harrison
    Apr 27, 2004
  3. Bart Blommerde

    using hash_set in gcc3.3

    Bart Blommerde, Oct 13, 2004, in forum: C++
    Replies:
    5
    Views:
    518
    John Harrison
    Oct 13, 2004
  4. Timo Qvist

    STL hash_set problem, SGI impl

    Timo Qvist, Nov 18, 2004, in forum: C++
    Replies:
    1
    Views:
    869
    Victor Bazarov
    Nov 18, 2004
  5. eric
    Replies:
    3
    Views:
    1,305
    Michael DOUBEZ
    Jul 12, 2011
Loading...

Share This Page