parsing an ifstream to get some specific text

toton

Hi,
I have some ASCII files that contain formatted text. I want to read
only certain sections of a file.
To do that, I index the sections (each introduced by .START in the
file) by their location, and for a particular section I parse only
that section.

The file is something like,

... DATAS
....
.START
...
....
.START
...
.....
etc.
I need to parse the data between two .START markers only when that
section is needed. I don't load all of the data into memory at once, as
the file is big, 4 MB to 20 MB in size.
To mark every .START, I scan the file once, checking only for .START
and recording its position; when the detailed data is actually needed,
I seek to the recorded position and parse from there.

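For quick parsing, I do
    std::string currentLine;
    // Let getline drive the loop so a failed read is never processed.
    while (std::getline(_stream, currentLine)) {
        // utils::trim removes whitespace from front and back.
        currentLine = utils::trim(currentLine);
        if (currentLine == ".START") {
            // Position just after the .START line, i.e. the section start.
            _pos.push_back(_stream.tellg());
        }
    }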
But this code runs slower than I expect. Can anything better be done
here, such as some buffering in the stream?

abir
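
A minimal sketch of the index-then-seek approach described above, with
the read loop driven by std::getline itself. The names trim,
indexSections and readSection are hypothetical (trim stands in for
utils::trim, which the post does not show), and positions are kept as
std::streampos rather than int.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <string>
#include <vector>

// Hypothetical stand-in for utils::trim: strips whitespace at both ends.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return std::string();
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Pass 1: record the stream position just after every ".START" line.
std::vector<std::streampos> indexSections(std::istream& in) {
    std::vector<std::streampos> positions;
    std::string line;
    while (std::getline(in, line)) {
        if (trim(line) == ".START") {
            const std::streampos pos = in.tellg();
            // tellg reports -1 if the marker was the last line with no newline.
            if (pos != std::streampos(-1)) positions.push_back(pos);
        }
    }
    in.clear();  // reset eof/fail so the stream can be repositioned later
    return positions;
}

// Pass 2: read the lines of section n, stopping at the next ".START" or EOF.
std::vector<std::string> readSection(std::istream& in,
                                     const std::vector<std::streampos>& positions,
                                     std::size_t n) {
    assert(n < positions.size());
    in.clear();
    in.seekg(positions[n]);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        if (trim(line) == ".START") break;
        lines.push_back(line);
    }
    return lines;
}
```

With a real file, the same std::ifstream would be opened once, passed
to indexSections, and reused for each readSection call.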
 
Ondra Holub

The input stream is already buffered. Your operating system probably
buffers files as well, so buffering should not be the problem.

I have some ideas that could help:
- You should parse the file in one pass. It is faster than two-pass
parsing, and you can then also read data from standard input or pipes.
- Where do you store the positions (what is the type of _pos)? It
should be a list, queue or stack, not a vector.
- You can treat the input as a binary file (no different from a text
file on many systems, but on Windows, for example, it is different),
use the read method to fill a buffer, and search for ".START" on your
own. [In fact I do not believe it will make a big difference.]
- Although any assumption like "something will probably not exceed xyz
MB of memory" is wrong, you can load the data into memory and process
it there (20 MB is not that much unless you are working on an embedded
system).
- You can use a system-dependent solution: a memory-mapped file.
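
The "load it into memory and search on your own" suggestions could be
sketched roughly as below. The names slurp and findMarkerLines are
hypothetical, and the sketch assumes the marker starts in the first
column of its line (the original code trims whitespace first).

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read everything the stream still holds into one string.
// For a file, open it as std::ifstream in(path, std::ios::binary).
std::string slurp(std::istream& in) {
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// Return the byte offset of every line that consists of exactly `marker`.
std::vector<std::size_t> findMarkerLines(const std::string& data,
                                         const std::string& marker) {
    assert(!marker.empty());
    std::vector<std::size_t> offsets;
    std::size_t pos = data.find(marker);
    while (pos != std::string::npos) {
        const std::size_t end = pos + marker.size();
        const bool atLineStart = (pos == 0) || (data[pos - 1] == '\n');
        // Accept "\n", "\r\n" or end of data right after the marker.
        const bool atLineEnd = (end == data.size()) ||
                               data[end] == '\n' || data[end] == '\r';
        if (atLineStart && atLineEnd) offsets.push_back(pos);
        pos = data.find(marker, pos + 1);
    }
    return offsets;
}
```

Each returned offset points at the marker itself; the section body
starts after the marker and its line ending.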
 
toton

Ondra Holub wrote:
> The input stream is already buffered. Your operating system probably
> buffers files as well, so buffering should not be the problem.
>
> I have some ideas that could help:
> - You should parse the file in one pass. It is faster than two-pass
> parsing, and you can then also read data from standard input or pipes.
> - Where do you store the positions (what is the type of _pos)? It
> should be a list, queue or stack, not a vector.

_pos is a std::vector<pos_type>; again, pos_type is usually int, so
_pos can also be treated as std::vector<int>.
I am using a pseudo two-pass parse. In the first pass I only mark the
location (in bytes, as returned by tellg()) of each .START. The second
pass is needed only when someone wants to parse the data between two
.START markers, so I can quickly go to the marked location using
seekg(). Usually with an XML-type file I can quickly jump to a
particular element without going into the details of the other
elements. Here the format is somewhat different, so I keep a positional
reference (in bytes) for the sections marked by .START and store them
for later parsing.
Here the I/O is done twice, but loading a 20 MB file is even slower.
And the second I/O operation may not cover the whole file; for example,
I may parse only one section out of 20 sections marked by .START.

> - You can treat the input as a binary file (no different from a text
> file on many systems, but on Windows, for example, it is different),
> use the read method to fill a buffer, and search for ".START" on your
> own. [In fact I do not believe it will make a big difference.]

My primary system is Windows :(
I have an estimate of how many bytes of buffer I need to reach the next
.START. Can the buffer size be set for the stream, or is it entirely
implementation dependent / OS dependent?

> - Although any assumption like "something will probably not exceed xyz
> MB of memory" is wrong, you can load the data into memory and process
> it there (20 MB is not that much unless you are working on an embedded
> system).

This is what I want, in an automated way: instead of loading a fixed
number of bytes into a buffer myself, let the stream load the bytes
under the hood. As you mentioned, it may be doing that already; I only
want to control the size.

> - You can use a system-dependent solution: a memory-mapped file.

I don't know of any C++ library for it. Boost does not provide a
memory-mapped file either.
 
Ondra Holub

toton wrote:
> _pos is a std::vector<pos_type>; again, pos_type is usually int, so
> _pos can also be treated as std::vector<int>.

Yes, a vector works from a functional point of view, but it may be less
efficient for this kind of use: a vector has a preallocated amount of
memory, and when that is exceeded it must reallocate, which may mean
copying the items from the old area to the new one. A list does not
need to do that. That is why I suggested not using a vector.
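
For comparison, if reallocation is the only concern, std::vector also
lets the caller reserve capacity up front when a rough section count is
known. A small sketch; makePositionIndex and expectedSections are
hypothetical names, and whether this beats a list for a given file is
something to measure rather than assume.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <vector>

// Create an empty position index with room for the expected number of
// sections, so push_back does not reallocate until that count is exceeded.
std::vector<std::streampos> makePositionIndex(std::size_t expectedSections) {
    std::vector<std::streampos> positions;
    positions.reserve(expectedSections);
    assert(positions.capacity() >= expectedSections);
    return positions;
}
```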

> I am using a pseudo two-pass parse. In the first pass I only mark the
> location (in bytes, as returned by tellg()) of each .START. The second
> pass is needed only when someone wants to parse the data between two
> .START markers, so I can quickly go to the marked location using
> seekg(). Usually with an XML-type file I can quickly jump to a
> particular element without going into the details of the other
> elements.

It is similar to parsing XML. XML is usually parsed either with a
DOM-like parser or with a SAX parser.

DOM (typically) loads the whole document into memory and then works
with it. You can then simply access any element, but the data is held
in memory. It is simpler to work with, but less efficient for large
documents.

SAX (typically) reads the document and, while reading, calls methods
that process the data just read. It is not as simple to use as DOM, but
it is more efficient for large documents.

> Here the format is somewhat different, so I keep a positional
> reference (in bytes) for the sections marked by .START and store them
> for later parsing.
> Here the I/O is done twice, but loading a 20 MB file is even slower.
> And the second I/O operation may not cover the whole file; for example,
> I may parse only one section out of 20 sections marked by .START.
>> - You can treat the input as a binary file (no different from a text
>> file on many systems, but on Windows, for example, it is different),
>> use the read method to fill a buffer, and search for ".START" on your
>> own. [In fact I do not believe it will make a big difference.]
> My primary system is Windows :(
> I have an estimate of how many bytes of buffer I need to reach the next
> .START. Can the buffer size be set for the stream, or is it entirely
> implementation dependent / OS dependent?

You could work with the filebuf (or implement your own class derived
from streambuf), but I do not think it would be useful (too much effort
for no big effect).
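
On the question of setting the buffer size: the standard hook is
pubsetbuf on the stream's filebuf. A sketch follows, assuming a
hypothetical wrapper class BufferedInput. What the call does with a
caller-supplied buffer is implementation-defined, and on common
implementations it has to happen before the file is opened, so the size
is best treated as a request rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical wrapper that keeps the buffer alive as long as the stream.
class BufferedInput {
public:
    BufferedInput(const std::string& path, std::size_t bufferSize)
        : buffer_(bufferSize) {
        assert(bufferSize > 0);
        // Ask the filebuf to use our buffer; done before open() on purpose.
        stream_.rdbuf()->pubsetbuf(buffer_.data(),
                                   static_cast<std::streamsize>(buffer_.size()));
        stream_.open(path.c_str(), std::ios::in | std::ios::binary);
    }

    std::ifstream& stream() { return stream_; }

private:
    // Declared before stream_ so it is destroyed after the stream.
    std::vector<char> buffer_;
    std::ifstream stream_;
};
```

The stream returned by stream() is then used exactly like any other
std::ifstream, including tellg and seekg.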

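If you do not use C files (FILE* from stdio.h or cstdio), you can
disable synchronization of the C++ standard streams with C stdio by
calling std::ios_base::sync_with_stdio(false). Note that this affects
the standard stream objects such as cin and cout, not an ifstream you
open yourself. If you do it, it becomes your responsibility that nobody
mixes FILE*-based I/O with those streams (not even a library).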

> This is what I want, in an automated way: instead of loading a fixed
> number of bytes into a buffer myself, let the stream load the bytes
> under the hood. As you mentioned, it may be doing that already; I only
> want to control the size.
> I don't know of any C++ library for it. Boost does not provide a
> memory-mapped file either.

There is no such standard C++ library; you have to use the API of your
OS or a library that supports many platforms and wraps the
platform-dependent code in its functions (for example ACE).
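
As an illustration of the OS-API route, here is a rough sketch using
the POSIX calls open, fstat and mmap. MappedFile and findFirst are
hypothetical names. On Windows, the poster's platform, the
corresponding calls are CreateFileMapping and MapViewOfFile, which are
not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only view of a whole file, mapped into memory (POSIX only).
class MappedFile {
public:
    explicit MappedFile(const char* path) : data_(nullptr), size_(0) {
        const int fd = ::open(path, O_RDONLY);
        if (fd == -1) throw std::runtime_error("open failed");
        struct stat st;
        if (::fstat(fd, &st) == -1) {
            ::close(fd);
            throw std::runtime_error("fstat failed");
        }
        size_ = static_cast<std::size_t>(st.st_size);
        if (size_ > 0) {  // mmap rejects zero-length mappings
            void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) {
                ::close(fd);
                throw std::runtime_error("mmap failed");
            }
            data_ = static_cast<const char*>(p);
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }
    ~MappedFile() {
        if (data_ != nullptr) ::munmap(const_cast<char*>(data_), size_);
    }
    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    const char* begin() const { return data_; }
    const char* end() const { return data_ + size_; }
    std::size_t size() const { return size_; }

private:
    const char* data_;
    std::size_t size_;
};

// Offset of the first occurrence of `marker`, or std::string::npos.
std::size_t findFirst(const MappedFile& file, const std::string& marker) {
    assert(!marker.empty());
    const char* hit = std::search(file.begin(), file.end(),
                                  marker.begin(), marker.end());
    return hit == file.end() ? std::string::npos
                             : static_cast<std::size_t>(hit - file.begin());
}
```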
 
