Thomas Kowalski
Hi,
I have to parse a plain ASCII text file (on a local hard disk). Since
the file might be many millions of lines long, I want to improve the
efficiency of my parsing process. The resulting data structure should
look like the following:
class A {
...
int value;
};
class B {
...
std::vector<A> as; // NOTE: can't be a pointer
};
// All objects of type B in bs hold objects of type A with the same
// value.
class C {
...
std::vector<B*> bs;
};
In the text file, each non-empty line represents one object A. We
know the total number of As in advance, but don't know how many As will
be part of each B. Each empty line represents the creation of a new B,
to which the following As should be attached. The objects of type B are
grouped into collections in objects of type C using some value in their
A objects. That means all As that are part of a B have the same value.
If two Bs have As with different values, they belong to different Cs.
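To make the format concrete, here is a minimal sketch of a single-pass
reader plus the grouping step. It rests on assumptions that are not part
of my real code: each non-empty line holds just one integer (the value
of the A), the classes are reduced to plain structs, and the names
read_bs, read_bs_from_text and group_by_value are made up for the
example.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the real classes (assumption: public members only).
struct A { int value; };
struct B { std::vector<A> as; };
struct C { std::vector<B*> bs; };

// Reads the count line, then one A per non-empty line.
// An empty line closes the current B and starts a new one.
std::vector<B> read_bs(std::istream& in) {
    std::vector<B> bs;
    std::string line;
    if (!std::getline(in, line)) return bs;  // first line: total number of As (not needed here)

    B current;
    while (std::getline(in, line)) {
        if (line.empty()) {
            if (!current.as.empty()) {
                bs.push_back(std::move(current));
                current = B{};               // reset after the move
            }
        } else {
            // Assumption: the whole line is a single integer.
            current.as.push_back(A{std::stoi(line)});
        }
    }
    if (!current.as.empty()) bs.push_back(std::move(current));
    return bs;
}

// Convenience wrapper for trying the reader on an in-memory string.
std::vector<B> read_bs_from_text(const std::string& text) {
    std::istringstream in(text);
    return read_bs(in);
}

// Groups the Bs into Cs, keyed by the value shared by all As of a B.
// The returned pointers refer into bs, so bs must not be resized afterwards.
std::map<int, C> group_by_value(std::vector<B>& bs) {
    std::map<int, C> cs;
    for (B& b : bs) {
        assert(!b.as.empty());               // read_bs never produces an empty B
        cs[b.as.front().value].bs.push_back(&b);
    }
    return cs;
}
```

For the input "5", "7", "7", "", "9", "", "7", "7" (one entry per line)
this gives three Bs and two Cs: one C for value 7 holding two Bs and one
C for value 9 holding one B.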
I know in advance (it's written in the first line of the file) how many
instances of A are inside the file. But unfortunately I have no
information about the structure of the tree (the objects A and B) in
advance.
Currently I have three methods in mind to create the tree. In all
cases the C structure (usually just a few hundred objects) will be
created afterwards by grouping the Bs (a few thousand).
1) Go through the text file twice. Count how many objects A belong to
each B and use this information to initialise the vector size of each B.
2) Create one big vector to store every A (since we know the number in
advance). Then count the lines processed until an empty line appears
and set B's vector size accordingly.
This method might be the fastest, I suppose, but uses a lot of memory.
3) Estimate the vector size in advance, set B's capacity accordingly
and use push_back for each A. Reduce the capacity to the correct vector
size at each empty line.
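To show what I mean by method 2, here is a rough sketch, not my real
code. It assumes that each non-empty line holds a single integer, that
the first line holds the total count, and it uses simplified structs;
the function names are invented for the example. All As go into one
vector reserved up front, the end index of each B is recorded at every
empty line, and each B's vector is then built from its exact range.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Simplified stand-ins for the real classes (assumption: public members only).
struct A { int value; };
struct B { std::vector<A> as; };

std::vector<B> read_bs_via_flat_buffer(std::istream& in) {
    std::string line;
    if (!std::getline(in, line)) return {};
    const std::size_t total = std::stoul(line);  // first line: number of As

    std::vector<A> all;
    all.reserve(total);                  // one allocation for every A
    std::vector<std::size_t> ends;       // index one past the last A of each B
    std::size_t current_begin = 0;       // index of the first A of the open B

    while (std::getline(in, line)) {
        if (line.empty()) {
            if (all.size() > current_begin) {    // ignore repeated empty lines
                ends.push_back(all.size());
                current_begin = all.size();
            }
        } else {
            all.push_back(A{std::stoi(line)});   // assumption: line is one integer
        }
    }
    if (all.size() > current_begin) ends.push_back(all.size());

    // Second step: copy each range into a B whose vector is sized from the range.
    std::vector<B> bs;
    bs.reserve(ends.size());
    std::size_t first = 0;
    for (std::size_t last : ends) {
        assert(first < last);
        bs.push_back(B{std::vector<A>(all.begin() + first, all.begin() + last)});
        first = last;
    }
    return bs;
}

// Convenience wrapper for trying the reader on an in-memory string.
std::vector<B> flat_from_text(const std::string& text) {
    std::istringstream in(text);
    return read_bs_via_flat_buffer(in);
}
```

The extra memory I am worried about is visible here: until the function
returns, every A exists twice, once in the big vector and once in its B.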
Which method would you prefer? Do you have ideas on how to speed up
the parsing process? Might there be a chance to speed the whole thing
up using asynchronous IO? I don't really know anything about this, but
usually it only makes sense for network IO, right?
Thanks in advance,
Thomas Kowalski