Performance File Parsing

Thomas Kowalski

Hi,
I have to parse a plain ASCII text file (on a local HD). Since the file
might be many millions of lines long, I want to improve the efficiency of
my parsing process. The resulting data structure shall look like the
following:

class A {
    ...
    int value;
};

class B {
    ...
    std::vector<A> as; // NOTE: can't be a pointer
};

// All objects of type B in bs have objects of type A with the same value.
class C {
    ...
    std::vector<B*> bs;
};

In the text file each (non-empty) line represents one object A. We
know the total number of As in advance, but we don't know how many As will
be part of each B. Each empty line represents the creation of a new B
to which the next As should be attached. The objects of type B are
grouped into collections in objects of type C using a value in their A
objects. That means all As that are part of one B have the same value. If
two Bs have As with different values, they belong to different Cs.

I know in advance (it's written in the first line of the text file) how many
instances of A are inside the file. But unfortunately I have no
information about the structure of the tree (the objects A and B) in
advance.
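To make that concrete, a file following this description might look like
the example below (the values and counts are invented for illustration):

7
42
42
42

42

17
17
17

The first line says there are seven As in total. The three 42s form the
first B, the single 42 after the first empty line forms a second B (it
would end up in the same C, since the value is still 42), and the three
17s form a third B that would belong to a different C.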

Currently there are three methods in my mind to create the tree. In
all cases the C structure (usually just a few hundred Cs) will be created
afterwards by grouping the Bs (a few thousand).

1) Go through the text file twice. Count how many objects A belong to each
B and use this information to initialise the vector size of each B.

2) Create one big vector to store every A (since we know the number in
advance). Then count the lines processed until an empty line appears
and set B's vector size accordingly (a rough sketch is below, after this
list). This method might be the fastest, I suppose, but it uses up a lot
of memory.

3) Estimate the vector size in advance, set B's capacity accordingly,
and use push_back for each A. Reduce the capacity to the correct vector
size at each empty line.
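Here is a minimal sketch of what I mean by 2), assuming each non-empty
line can be turned into an A by some parse_line() function (parse_line and
the exact members of A and B are just placeholders):

#include <fstream>
#include <string>
#include <vector>

struct A { int value; /* ... */ };
struct B { std::vector<A> as; /* ... */ };

A parse_line(const std::string& line);         // placeholder: one text line -> one A

std::vector<B> read_all(const char* filename)
{
    std::ifstream in(filename);
    std::string line;

    std::getline(in, line);                    // first line: total number of As
    std::size_t total = std::stoul(line);

    std::vector<A> all;                        // one big buffer holding every A
    all.reserve(total);

    std::vector<std::size_t> starts(1, 0);     // index in 'all' where each B begins
    while (std::getline(in, line)) {
        if (line.empty())
            starts.push_back(all.size());      // empty line: the next As belong to a new B
        else
            all.push_back(parse_line(line));
    }
    starts.push_back(all.size());

    std::vector<B> bs;
    for (std::size_t i = 0; i + 1 < starts.size(); ++i) {
        if (starts[i] == starts[i + 1])
            continue;                          // skip runs of empty lines
        B b;
        b.as.assign(all.begin() + starts[i], all.begin() + starts[i + 1]);
        bs.push_back(b);
    }
    return bs;
}

The extra memory cost comes from the fact that, while the Bs are being
built, every A exists twice: once in the big buffer and once in a B.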

Which method would you prefer? Do you have ideas how to speed the
parsing process up? Might there be a chance to speed the whole thing up
using asynchronous I/O? I don't really know anything about this, but
usually it only makes sense for network I/O, right?

Thanks in advance,
Thomas Kowalski
 

Jerry Coffin

[ ... parsing a text file ]
1) Go through the text file twice. Count how many objects A belong to each
B and use this information to initialise the vector size of each B.

Bad idea. Basic rule: processing is a lot faster than I/O, so you have
to save a LOT of processing to justify even a little extra I/O. Reading
a multi-million line file twice to save a few dynamic allocations is
_almost_ certain to be a loss.
2) Create one big vector to store every A (since we know the number in
advance). Then count the lines processed until an empty line appears
and set B's vector size accordingly.
This method might be the fastest, I suppose, but it uses up a lot of memory.

Quite true -- whether it works out well is likely to depend heavily upon
whether you have _free_ memory available. Assuming a virtual memory
system, if most memory is in use, you'll (usually) be writing out other
data to make room for the new data, which means more I/O (about which,
see above).
3) Estimate the vector size in advance, set B's capacity accordingly,
and use push_back for each A. Reduce the capacity to the correct vector
size at each empty line.

This would be the most obvious and easiest approach. If you have working
code already, it's probably a lot like this anyway...
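For what it's worth, a minimal sketch of that single-pass approach might
look roughly like this (parse_line and guess_per_b are made-up
placeholders; shrink_to_fit is the C++11 way to ask for the excess
capacity back, the pre-C++11 equivalent is the swap trick):

#include <fstream>
#include <string>
#include <vector>

struct A { int value; /* ... */ };
struct B { std::vector<A> as; /* ... */ };

A parse_line(const std::string& line);              // placeholder: one text line -> one A

std::vector<B> read_single_pass(const char* filename, std::size_t guess_per_b)
{
    std::ifstream in(filename);
    std::string line;
    std::getline(in, line);                          // first line: total A count (not needed here)

    std::vector<B> bs(1);
    bs.back().as.reserve(guess_per_b);               // estimated capacity for the first B

    while (std::getline(in, line)) {
        if (line.empty()) {
            if (bs.back().as.empty())
                continue;                            // ignore runs of blank lines
            bs.back().as.shrink_to_fit();            // give back the unused capacity
            bs.push_back(B());                       // empty line: start the next B
            bs.back().as.reserve(guess_per_b);
        } else {
            bs.back().as.push_back(parse_line(line));
        }
    }
    if (bs.back().as.empty())
        bs.pop_back();                               // file ended with blank lines
    else
        bs.back().as.shrink_to_fit();
    return bs;
}

Whether the trim step actually pays off depends on how good the estimate
is: if the guess is far too large, each B briefly wastes memory, and if
it's too small, push_back reallocates anyway.
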
Which method would you prefer? Do you have ideas how to speed the
parsing process up? Might there be a chance to speed the whole thing up
using asynchronous I/O? I don't really know anything about this, but
usually it only makes sense for network I/O, right?

Asynch I/O can make a lot of sense for access to a local HD as well.
Though a local drive will (normally) give you faster access than a
network, it's still usually a lot slower than processing.

I suppose I should also add that IF you have some insanely fast RAID
array (say, data striped across a dozen disks or so) it's barely
possible to have I/O that actually can keep up with the processing, so
your priorities would change -- but at least IME, this is fairly rare.
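One portable way to get that kind of overlap, without going to the
platform-specific async APIs (overlapped I/O on Windows, POSIX aio, and
so on), is simply to read on one thread and parse on another. A
bare-bones producer/consumer sketch using C++11 threads (handle_line()
is a placeholder for whatever builds the tree):

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

void handle_line(const std::string& line);      // placeholder: consume one line of input

void parse_overlapped(const char* filename)
{
    std::queue<std::string> lines;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Producer: does nothing but pull lines off the disk.
    std::thread reader([&] {
        std::ifstream in(filename);
        std::string line;
        while (std::getline(in, line)) {
            std::lock_guard<std::mutex> lock(m);
            lines.push(line);
            cv.notify_one();
        }
        std::lock_guard<std::mutex> lock(m);
        done = true;
        cv.notify_one();
    });

    // Consumer: parses while the reader is (ideally) waiting on the disk.
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !lines.empty() || done; });
        if (lines.empty() && done)
            break;
        std::string line = std::move(lines.front());
        lines.pop();
        lock.unlock();
        handle_line(line);
    }
    reader.join();
}

Handing over one line at a time like this has a fair amount of locking
overhead; in practice you'd batch lines (or whole blocks of the file)
per queue entry. Whether any of it is worth doing is exactly the sort of
thing the profiling advice below should decide.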

As with most questions about optimization, the real answer is to profile
your code, and see where it's spending its time, then work at optimizing
the parts that are slow. Just for example, if only 1% of the time is
spent doing memory allocation, and 90+% is spent reading from the disk,
then using a two-pass algorithm to optimize memory allocation will
almost certainly slow things down.
 
