aditya.raghunath said:
I'm trying to read text files and then parse them. Some of these files
are several hundred megabytes, or in some cases gigabytes, in size.
Reading them with the getline function slows my program down a lot; it
takes more than 15-20 minutes to read them. I want to know efficient
ways to read these files. Any ideas??
getline() reads each line into a string, copying it. That wastes time
twice over: first allocating a block of memory of unpredictable size, then
copying the bytes on the CPU. A hard drive has a DMA channel that its
driver can exploit to move data without the CPU's help, but per-line
string copies can't benefit from it.
On top of that, your OS, and possibly your C++ Standard Library, buffer
the file ahead of the string. This is partly because the read-write head
is flying over the file anyway, so the drive's buffer might as well take
the data in, and partly because some Standard Library implementations also
buffer the file themselves.
One way to fix this is to stop using getline() and stop copying strings.
Stream each byte of the file into your program, and let a state table
decide what to do with each one as it arrives. This technique makes better
use of the read-ahead buffers, and it usually leads to a cleaner parser
design as well.
Another way is to use OS-specific functions (which are off-topic here) to
map the file into memory. You can then point into the file with a real C++
pointer. If you run that pointer from one end of the file to the other,
you exploit the DMA channel between the hard drive and memory almost
perfectly. If your pointer instead skips around, you will at least touch
only the OS's virtual-memory paging mechanism to read and write the actual
file, with no intervening OS or C++ buffers.
The next way is to use OS-specific functions that batch many commands to
your hard drive's driver into one request. Naturally, only an OS-specific
newsgroup can advise you on the details.