Performance File Parsing

Thomas Kowalski

Hi,
I have to parse a plain ASCII text file (on a local HD). Since the file
might be many millions of lines long, I want to improve the efficiency of
my parsing process. The resulting data structure shall look like the
following:

class A {
    ...
    int value;
};

class B {
    ...
    std::vector<A> as; // NOTE: can't be a pointer
};

// All objects of type B in bs have objects of type A with the same value.
class C {
    ...
    std::vector<B*> bs;
};

In the text file each (non-empty) line represents one object A. We
know the total number of As in advance, but we don't know how many As will
be part of each B. Each empty line represents the creation of a new B
to which the next As should be attached. The objects of type B are
grouped into collections in objects of type C using a value in their A
objects. That means all As that are part of one B have the same value. If
two Bs have As with different values, they belong to different Cs.

I know in advance (it's written in the first line of the text file) how many
instances of A are inside the file. But unfortunately I have no
information about the structure of the tree (the objects A and B) in
advance.
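To make that concrete, a file following this description might look like
the example below (the values and counts are invented for illustration):

7
42
42
42

42

17
17
17

The first line says there are seven As in total. The three 42s form the
first B, the single 42 after the first empty line forms a second B (it
would end up in the same C, since the value is still 42), and the three
17s form a third B that would belong to a different C.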

Currently there are three methods in my mind to create the tree. In
all cases the C structure (usually just a few hundred Cs) will be created
afterwards by grouping the Bs (a few thousand).

1) Go through the text file twice. Count how many objects A belong to each
B and use this information to initialise the vector size of each B.

2) Create one big vector to store every A (since we know the number in
advance). Then count the lines processed until an empty line appears
and set B's vector size accordingly (a rough sketch is below, after this
list). This method might be the fastest, I suppose, but it uses up a lot
of memory.

3) Estimate the vector size in advance, set B's capacity accordingly,
and use push_back for each A. Reduce the capacity to the correct vector
size at each empty line.
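Here is a minimal sketch of what I mean by 2), assuming each non-empty
line can be turned into an A by some parse_line() function (parse_line and
the exact members of A and B are just placeholders):

#include <fstream>
#include <string>
#include <vector>

struct A { int value; /* ... */ };
struct B { std::vector<A> as; /* ... */ };

A parse_line(const std::string& line);         // placeholder: one text line -> one A

std::vector<B> read_all(const char* filename)
{
    std::ifstream in(filename);
    std::string line;

    std::getline(in, line);                    // first line: total number of As
    std::size_t total = std::stoul(line);

    std::vector<A> all;                        // one big buffer holding every A
    all.reserve(total);

    std::vector<std::size_t> starts(1, 0);     // index in 'all' where each B begins
    while (std::getline(in, line)) {
        if (line.empty())
            starts.push_back(all.size());      // empty line: the next As belong to a new B
        else
            all.push_back(parse_line(line));
    }
    starts.push_back(all.size());

    std::vector<B> bs;
    for (std::size_t i = 0; i + 1 < starts.size(); ++i) {
        if (starts[i] == starts[i + 1])
            continue;                          // skip runs of empty lines
        B b;
        b.as.assign(all.begin() + starts[i], all.begin() + starts[i + 1]);
        bs.push_back(b);
    }
    return bs;
}

The extra memory cost comes from the fact that, while the Bs are being
built, every A exists twice: once in the big buffer and once in a B.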

Which method would you prefer? Do you have ideas how to speed the
parsing process up? Might there be a chance to speed the whole thing up
using asynchronous I/O? I don't really know anything about this, but
usually it only makes sense for network I/O, right?

Thanks in advance,
Thomas Kowalski
 

Jerry Coffin

[ ... parsing a text file ]
1) Go through the text file twice. Count how many objects A belong to each
B and use this information to initialise the vector size of each B.

Bad idea. Basic rule: processing is a lot faster than I/O, so you have
to save a LOT of processing to justify even a little extra I/O. Reading
a multi-million line file twice to save a few dynamic allocations is
_almost_ certain to be a loss.
2) Create one big vector to store every A (since we know the number in
advance). Then count the lines processed until an empty line appears
and set B's vector size accordingly.
This method might be the fastest, I suppose, but it uses up a lot of memory.

Quite true -- whether it works out well is likely to depend heavily upon
whether you have _free_ memory available. Assuming a virtual memory
system, if most memory is in use, you'll (usually) be writing out other
data to make room for the new data, which means more I/O (about which,
see above).
3) Estimate the vector size in advance, set B's capacity accordingly,
and use push_back for each A. Reduce the capacity to the correct vector
size at each empty line.

This would be the most obvious and easiest approach. If you have working
code already, it's probably a lot like this anyway...
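For what it's worth, a minimal sketch of that single-pass approach might
look roughly like this (parse_line and guess_per_b are made-up
placeholders; shrink_to_fit is the C++11 way to ask for the excess
capacity back, the pre-C++11 equivalent is the swap trick):

#include <fstream>
#include <string>
#include <vector>

struct A { int value; /* ... */ };
struct B { std::vector<A> as; /* ... */ };

A parse_line(const std::string& line);              // placeholder: one text line -> one A

std::vector<B> read_single_pass(const char* filename, std::size_t guess_per_b)
{
    std::ifstream in(filename);
    std::string line;
    std::getline(in, line);                          // first line: total A count (not needed here)

    std::vector<B> bs(1);
    bs.back().as.reserve(guess_per_b);               // estimated capacity for the first B

    while (std::getline(in, line)) {
        if (line.empty()) {
            if (bs.back().as.empty())
                continue;                            // ignore runs of blank lines
            bs.back().as.shrink_to_fit();            // give back the unused capacity
            bs.push_back(B());                       // empty line: start the next B
            bs.back().as.reserve(guess_per_b);
        } else {
            bs.back().as.push_back(parse_line(line));
        }
    }
    if (bs.back().as.empty())
        bs.pop_back();                               // file ended with blank lines
    else
        bs.back().as.shrink_to_fit();
    return bs;
}

Whether the trim step actually pays off depends on how good the estimate
is: if the guess is far too large, each B briefly wastes memory, and if
it's too small, push_back reallocates anyway.
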
Which method would you prefer? Do you have ideas how to speed the
parsing process up? Might there be a chance to speed the whole thing up
using asynchronous I/O? I don't really know anything about this, but
usually it only makes sense for network I/O, right?

Asynch I/O can make a lot of sense for access to a local HD as well.
Though a local drive will (normally) give you faster access than a
network, it's still usually a lot slower than processing.

I suppose I should also add that IF you have some insanely fast RAID
array (say, data striped across a dozen disks or so) it's barely
possible to have I/O that actually can keep up with the processing, so
your priorities would change -- but at least IME, this is fairly rare.
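One portable way to get that kind of overlap, without going to the
platform-specific async APIs (overlapped I/O on Windows, POSIX aio, and
so on), is simply to read on one thread and parse on another. A
bare-bones producer/consumer sketch using C++11 threads (handle_line()
is a placeholder for whatever builds the tree):

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

void handle_line(const std::string& line);      // placeholder: consume one line of input

void parse_overlapped(const char* filename)
{
    std::queue<std::string> lines;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Producer: does nothing but pull lines off the disk.
    std::thread reader([&] {
        std::ifstream in(filename);
        std::string line;
        while (std::getline(in, line)) {
            std::lock_guard<std::mutex> lock(m);
            lines.push(line);
            cv.notify_one();
        }
        std::lock_guard<std::mutex> lock(m);
        done = true;
        cv.notify_one();
    });

    // Consumer: parses while the reader is (ideally) waiting on the disk.
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !lines.empty() || done; });
        if (lines.empty() && done)
            break;
        std::string line = std::move(lines.front());
        lines.pop();
        lock.unlock();
        handle_line(line);
    }
    reader.join();
}

Handing over one line at a time like this has a fair amount of locking
overhead; in practice you'd batch lines (or whole blocks of the file)
per queue entry. Whether any of it is worth doing is exactly the sort of
thing the profiling advice below should decide.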

As with most questions about optimization, the real answer is to profile
your code, and see where it's spending its time, then work at optimizing
the parts that are slow. Just for example, if only 1% of the time is
spent doing memory allocation, and 90+% is spent reading from the disk,
then using a two-pass algorithm to optimize memory allocation will
almost certainly slow things down.
 
