read/parse flat file / performance / boost::tokenizer


Knackeback

task:
- read/parse CSV file

code snippet:
string key, line;
typedef tokenizer<char_separator<char> > tokenizer;
tokenizer tok(string(""), sep);
while ( getline(f, line) ){
    ++lineNo;
    tok.assign(line, sep);
    short tok_counter = 0;
    for ( tokenizer::iterator beg = tok.begin(); beg != tok.end(); ++beg ){
        if ( ( idx = lineArr[tok_counter] ) != -1 ){ // check whether the token
            keyArr[idx] = *beg;                      // should be part of the key
        }
        ++tok_counter;
    }
    for ( int i = 0; i < keySize; i++ ){ // build a key; say the first and
        key += keyArr[i];                // third tokens form the key
        key += delim;
    }
    m.insert(make_pair(key, LO(new Line(line, lineNo)))); // m is a multimap
    key.erase();
}

gprof hits:
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name

16.89 0.50 0.50 2621459 0.00 0.00 bool boost::char_separator<char, std::char_traits<char> >::operator()<__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >(__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, __gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&)

11.99 0.85 0.35 24903838 0.00 0.00 boost::char_separator<char, std::char_traits<char> >::is_dropped(char) const

7.09 1.06 0.21 28508346 0.00 0.00 bool __gnu_cxx::operator!=<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >(__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, __gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)

problem:
I want to improve the performance of this piece of code.

questions:
I hope the goal is somewhat clear. I want to read all line objects (consisting
of a line number and line content) of a file, identified by a key, into the
container. Every idea which improves the style and performance of this snippet
is welcome!

Thomas
 

John Harrison

Knackeback said:
task:
- read/parse CSV file
[snip: code snippet and gprof output quoted in full above]
I hope the goal is somewhat clear. I want to read all line objects (consisting
of a line number and line content) of a file, identified by a key, into the
container. Every idea which improves the style and performance of this snippet
is welcome!

Thomas

All your performance bottlenecks seem to be inside the boost tokenizer
library. The obvious answer, then, is to replace that code with your own
custom code. The tokenizer library is a generic tokenizer; you have a
specific requirement to solve, so you should be able to beat the
performance of boost by taking advantage of the specific knowledge you have
about your application.
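For example, a hand-rolled splitter can walk the line with `std::string::find_first_of` and materialize a `std::string` only for the token positions that go into the key. A minimal sketch (the function name `pickTokens` and the `wanted` array are illustrative; `wanted` plays the role of `lineArr` in the original snippet):

```cpp
#include <string>
#include <vector>

// Split 'line' on any character in 'seps', but copy out only the tokens
// whose position is marked wanted (wanted[pos] != -1 gives the slot in
// the result, mirroring lineArr/keyArr in the original code).
std::vector<std::string> pickTokens(const std::string& line,
                                    const std::string& seps,
                                    const std::vector<int>& wanted,
                                    std::size_t keySize)
{
    std::vector<std::string> keyArr(keySize);
    std::size_t begin = 0;
    std::size_t pos = 0;                       // token counter
    while (begin <= line.size() && pos < wanted.size()) {
        std::size_t end = line.find_first_of(seps, begin);
        if (end == std::string::npos)
            end = line.size();
        int idx = wanted[pos];
        if (idx != -1)                         // copy only wanted tokens
            keyArr[idx] = line.substr(begin, end - begin);
        begin = end + 1;
        ++pos;
    }
    return keyArr;
}
```

With `pickTokens("a,b,c,d", ",", {0, -1, 1, -1}, 2)` the unwanted fields `b` and `d` are never copied into strings at all, which is exactly the saving a generic tokenizer cannot make for you.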

john
 

Knackeback

Yes, I will try a handcrafted line reader.
But can you say a bit more about what you mean by "generic tokenizer"?
My task is to split a line into tokens, and the example from boost::tokenizer
does exactly that.
At the moment I don't need ALL the tokens for my line key. Therefore I think
the boost::tokenizer is too expensive.
BTW, I compiled my program with g++ and icc (Intel's C++ compiler for Linux).
The icc-compiled code was five times faster, and the compile warnings from icc
are very good. Nice work!

Thomas
 

John Harrison

Knackeback said:
Yes, I will try a handcrafted line reader.
But can you say a bit more about what you mean by "generic tokenizer"?
My task is to split a line into tokens, and the example from boost::tokenizer
does exactly that.
At the moment I don't need ALL the tokens for my line key. Therefore I think
the boost::tokenizer is too expensive.

That's exactly what I mean. For instance, boost will probably create a string
for each token, but you throw some of those tokens away. Your custom code
will only create a string for the tokens you actually need.

Also, looking at your original code, it seems that after extracting a token
you add the delimiter back into the key you are building up. That would be
another improvement: for your purposes a token can include the trailing
delimiter.
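Putting that another way: a field plus its trailing delimiter is just a slice of the original line, so the key can be appended straight from the line with no intermediate token strings. A sketch (the helper name `appendField` is illustrative, and a single-character separator is assumed, as in a simple CSV):

```cpp
#include <string>

// Append field number 'field' (0-based) of 'line', plus a trailing
// separator, directly onto 'key'. No intermediate token string is built.
void appendField(const std::string& line, char sep,
                 std::size_t field, std::string& key)
{
    std::size_t begin = 0;
    for (std::size_t i = 0; i < field; ++i) {   // skip earlier fields
        begin = line.find(sep, begin);
        if (begin == std::string::npos)
            return;                             // field missing: append nothing
        ++begin;
    }
    std::size_t end = line.find(sep, begin);
    if (end == std::string::npos)
        end = line.size();
    key.append(line, begin, end - begin);
    key += sep;                                 // keep the trailing delimiter
}
```

For a key built from the first and third fields of `"a,b,c,d"`, two calls with `field = 0` and `field = 2` produce `"a,c,"` without ever copying `b` or `d`.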

john
 

Knackeback

Thanks for your hint. The handcrafted solution is now three times faster than
the boost tokenizer!
 

John Harrison

Knackeback said:
Thanks for your hint. The handcrafted solution is now three times faster than
the boost tokenizer!

Don't take that as an argument against boost tokenizer. It still does its
job, and presumably does it efficiently (I haven't looked at the code).

What I liked about your post was that you did things the right way round.
First you got a working solution using general purpose tools available to
you, then you decided that it wasn't fast enough so you looked to replace
general purpose code with hand crafted code. That's the way it should be
done.

And of course many times, the hand crafted code isn't necessary at all.

john
 
