Parsing large files

aditya.raghunath

hi,

I'm trying to read text files and then parse them. Some of these files
are several hundred megabytes, or in some cases gigabytes. Reading them
with the getline function slows my program down a lot; it takes more
than 15-20 minutes to read them. I want to know efficient ways to read
these files. Any ideas?

TIA
Aditya
 
Phlip

aditya.raghunath said:
I'm trying to read text files and then parse them. Some of these files
are several hundred megabytes, or in some cases gigabytes. Reading them
with the getline function slows my program down a lot; it takes more
than 15-20 minutes to read them. I want to know efficient ways to read
these files. Any ideas?

getline() reads each line into a string, copying it. That wastes time
both in allocating a randomly sized block of memory and in copying
within the CPU. A hard drive has a DMA channel that its driver can
exploit, but string copies probably can't use it.

Then, your OS and possibly your C++ runtime are buffering the file
ahead of the string. This is partly because the read-write head keeps
flying over the file, so the drive's buffer might as well take the data
in, and partly because some Standard Library implementations also
buffer the file.

One way to fix this is to not use getline(), and to not copy the string
at all. Stream each byte of the file into your program, and use a state
table to parse it and decide what to do with each byte, as in the
sketch below. This technique makes better use of the read-ahead
buffers, and it ought to lead to a better design.
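
Here is a minimal sketch of that byte-streaming approach, assuming the
parse job is something simple like counting newline-delimited records;
the file name "input.txt" and the two-state table are placeholders for
your own grammar:

#include <fstream>
#include <iostream>
#include <string>   // std::char_traits

int main() {
    std::ifstream file("input.txt", std::ios::binary);
    if (!file) {
        std::cerr << "cannot open input.txt\n";
        return 1;
    }

    enum State { BetweenRecords, InRecord };
    State state = BetweenRecords;
    long records = 0;

    // Pull bytes straight from the stream buffer; nothing is copied
    // into an intermediate std::string.
    std::streambuf* buf = file.rdbuf();
    for (int c = buf->sbumpc(); c != std::char_traits<char>::eof();
         c = buf->sbumpc()) {
        switch (state) {
        case BetweenRecords:
            if (c != '\n') {        // first byte of a new record
                state = InRecord;
                ++records;
            }
            break;
        case InRecord:
            if (c == '\n')          // record ends at the newline
                state = BetweenRecords;
            break;
        }
    }
    std::cout << records << " records\n";
}

A real parser would just have more states and more transitions; the
shape of the loop stays the same.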

Another way is to use OS-specific functions (which are off-topic here)
to map the file into memory. Then you can point into the file with a
real C++ pointer. If you run this pointer from one end of the file to
the other, you should fully exploit the DMA channel between the hard
drive and main memory. If your pointer instead skips around, you will
at least use only the OS's virtual paging mechanism to read the actual
file, with no intervening OS or C++ buffers.
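
For what it's worth, here is a POSIX-only sketch of that mapping
approach (so it is OS-specific, as noted). It maps a hypothetical
"input.txt" read-only and walks it with a plain pointer, counting
newlines as a stand-in for real parsing:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <algorithm>
#include <cstdio>
#include <iostream>

int main() {
    int fd = open("input.txt", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { std::perror("fstat"); return 1; }

    // Map the whole file; the OS pages it in behind the pointer.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

    const char* data = static_cast<const char*>(p);
    long lines = std::count(data, data + st.st_size, '\n');
    std::cout << lines << " lines\n";

    munmap(p, st.st_size);
    close(fd);
}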

The next way is to use OS-specific functions that batch together many
commands to your hard drive's driver. Obviously, only an OS-specific
newsgroup can advise you about those; the sketch below is just one
POSIX example.
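
The post is deliberately vague about which calls it means, but POSIX
readv() is one illustration of batching: it hands the kernel several
destination buffers in a single system call. The buffer sizes and the
file name here are arbitrary placeholders:

#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#include <cstdio>
#include <iostream>
#include <vector>

int main() {
    int fd = open("input.txt", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    // Four 64 KiB buffers, all filled by one readv() call instead of
    // four separate read() calls.
    std::vector<std::vector<char>> bufs(4, std::vector<char>(64 * 1024));
    iovec iov[4];
    for (int i = 0; i < 4; ++i) {
        iov[i].iov_base = bufs[i].data();
        iov[i].iov_len  = bufs[i].size();
    }

    ssize_t got = readv(fd, iov, 4);
    if (got < 0) { std::perror("readv"); return 1; }
    std::cout << "read " << got << " bytes in one system call\n";
    close(fd);
}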
 
Jerry Coffin

aditya.raghunath said:
I'm trying to read text files and then parse them. Some of these files
are several hundred megabytes, or in some cases gigabytes. Reading them
with the getline function slows my program down a lot; it takes more
than 15-20 minutes to read them. I want to know efficient ways to read
these files.

You might try opening them with fopen and reading them with fgets
instead. Quite a few standard library implementations give a
substantial speed improvement from that change.

There are also quite a few platform-dependent optimizations. For
example, on Windows you can often gain a substantial amount of speed by
opening files in binary (untranslated) mode, but doing the same on UNIX
or anything very similar normally won't make any difference at all.
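
A minimal sketch of that fopen()/fgets() route, opened in binary
("rb") mode as suggested above; on UNIX the "b" is harmless:

#include <cstdio>

int main() {
    std::FILE* f = std::fopen("input.txt", "rb");
    if (!f) { std::perror("fopen"); return 1; }

    char line[64 * 1024];   // fgets fills a caller-supplied buffer:
    long lines = 0;         // no per-line heap allocation
    while (std::fgets(line, sizeof line, f))
        ++lines;            // a line longer than the buffer counts twice

    std::printf("%ld lines\n", lines);
    std::fclose(f);
}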
 
