Efficiently reading large blocks from file


persres

Hello,
I couldn't find a clear answer to this question. I need a very
efficient way to read a large amount of data (integers, separated by
white space and newlines). The file has over a billion integers, so I need to
read it block by block. What is the best way to do it?

I have never been comfortable using ifstreams, as I always
assumed they were slow. Perhaps not. Could anyone advise?

Am I better off using some Win32 API to read a file block? Or can I
just read, say, 2 MB into a buffer?

Then how do I read the integers into a vector? I need a really
efficient way.
So far, I only know the following way:

#include <algorithm>
#include <fstream>
#include <iterator>
#include <vector>

std::vector<unsigned int> myvec;
//myvec.reserve(const_buf_ints);
std::ifstream infile("abc");

std::copy(std::istream_iterator<unsigned int>(infile),
          std::istream_iterator<unsigned int>(),
          std::back_inserter(myvec));

infile.close();


However, even this is not right for me, because I only want to read the
first few thousand numbers, not the whole file. Please help.

To summarize:
1) I need to read a lot (say 100,000) of integers at a time, repeatedly,
from a huge file (over a billion integers). What is the best way to
do that?

Thank you very much
 

Rune Allnor

Hello,
I couldn't find a clear answer to this question. I need a very
efficient way to read a large amount of data (integers, separated by
white space and newlines). The file has over a billion integers, so I need to
read it block by block. What is the best way to do it?

So this is a text file?

I have never been comfortable using ifstreams, as I always
assumed they were slow. Perhaps not. Could anyone advise?

Text formats are slow. If speed is a concern, convert
the file to binary format.
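A minimal sketch of such a one-time conversion (the file names and the
fixed 32-bit unsigned type are assumptions, not something given in the
thread):

#include <cstdint>
#include <fstream>

int main()
{
    std::ifstream in("abc");                        // text file of integers
    std::ofstream out("abc.bin", std::ios::binary); // raw binary output
    std::uint32_t value;
    while (in >> value)                             // parse the text once...
        out.write(reinterpret_cast<const char*>(&value), sizeof value);
    // ...afterwards abc.bin holds fixed-width 4-byte integers.
}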

Rune
 

persres

Paavo wrote:

If I needed to do this really efficiently, I would create a file mapping
into memory (not portable!) and use the strtol() function (part of C++ by
heritage). All these C++ stream facilities are nice and shiny, but I
would not be sure if all the abstraction layers are optimized away
properly.

One probably cannot map the whole file of a billion integers into memory at
once, at least not in a 32-bit program. However, if one just needs a
relatively small portion of it, and one knows the position and the upper
bound of the portion size, then one should be able to create a mapping
view which contains the whole portion. This would make the processing a
bit simpler.
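As a rough illustration of the strtol() part only (not an actual file
mapping; here the block is filled with a plain ifstream::read, and the
function name, path and block size are made-up placeholders):

#include <cstdlib>
#include <fstream>
#include <vector>

std::vector<long> parse_block(const char* path, std::size_t block_size)
{
    std::vector<char> buf(block_size + 1);
    std::ifstream in(path, std::ios::binary);
    in.read(buf.data(), static_cast<std::streamsize>(block_size));
    buf[static_cast<std::size_t>(in.gcount())] = '\0'; // keep strtol inside the buffer

    std::vector<long> values;
    const char* p = buf.data();
    char* end = nullptr;
    for (;;) {
        long v = std::strtol(p, &end, 10); // skips whitespace, parses one integer
        if (end == p)                      // nothing parsed: end of block reached
            break;
        values.push_back(v);
        p = end;                           // continue after the parsed number
    }
    return values;
}

A complete version would also have to handle a number cut in half at the end
of the block, e.g. by re-reading from the last whitespace when the next block
is loaded.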

hth
Paavo

OK, that probably is what I will have to do eventually. For now, could
you please advise on the best way using C++ / the STL? I need to
get something working quickly, and I am not familiar with memory mapping
etc. (and that may not be portable).

Is using fread better? Or should I use binary mode and copy everything
to a buffer? Or shall I use istream_iterator? Or operator>>? Or perhaps most
of them are the same and it doesn't matter much?
Thanks
 

Rune Allnor

OK, that probably is what I will have to do eventually. For now, could
you please advise on the best way using C++ / the STL? I need to
get something working quickly, and I am not familiar with memory mapping
etc. (and that may not be portable).

One very fast way (from the programmer's perspective) is to
use std::getline() to skip ahead to the start of the sequence of
numbers you want to load, and then use operator>> or std::atoi()
to load the numbers. Depending on how much you trust whatever
program generated the numbers, you might want to include
error checking and format validation, which will slow you
down considerably.
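A minimal sketch of that approach, assuming the caller knows how many lines
precede the block it wants (the helper name and its parameters are invented
for illustration):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

std::vector<unsigned int> read_block(const char* path,
                                     std::size_t lines_to_skip,
                                     std::size_t count)
{
    std::ifstream in(path);
    std::string line;
    for (std::size_t i = 0; i != lines_to_skip && std::getline(in, line); ++i)
        ;                                        // discard lines before the block

    std::vector<unsigned int> values;
    values.reserve(count);
    unsigned int v;
    while (values.size() < count && in >> v)     // stop after 'count' numbers
        values.push_back(v);
    return values;
}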

This still involves converting the numbers from text to binary
format, which will be one bottleneck you can avoid by converting
the file to binary first.

Is using fread better? Or should I use binary mode and copy everything
to a buffer?

Provided you have a large enough buffer - maybe. But again, you
still have to convert from text to binary.

Or shall I use istream_iterator? Or operator>>? Or perhaps most
of them are the same and it doesn't matter much?

Converting between text and binary formats is common
among all the approaches. If you work on the same file
several times, convert it to binary *once* up front,
and save time later.
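Once the file is in that raw form, each later read is just a seek plus a
single block read. A hedged sketch, assuming 4-byte integers in the
machine's native byte order (the file name and parameters are placeholders):

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <vector>

std::vector<std::uint32_t> read_binary_block(const char* path,
                                             std::size_t first_index,
                                             std::size_t count)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(first_index * sizeof(std::uint32_t)));

    std::vector<std::uint32_t> block(count);
    in.read(reinterpret_cast<char*>(block.data()),
            static_cast<std::streamsize>(count * sizeof(std::uint32_t)));
    // keep only the integers actually read (the last block may be short)
    block.resize(static_cast<std::size_t>(in.gcount()) / sizeof(std::uint32_t));
    return block;
}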

Rune
 

Michael Tsang


Hello,
I couldn't find a clear answer to this question. I need a very
efficient way to read a large amount of data (integers, separated by
white space and newlines). The file has over a billion integers, so I need to
read it block by block. What is the best way to do it?

Use a binary format for efficiency. For portability, you need to decide how
the numbers are stored. For example, "4 bytes per integer in
little-endian order and 8 bits per byte" is a well-defined format that would not
cause ambiguity. In this case, you can really read block by block: reading
4,000,000 bytes into memory is the same as reading 1,000,000 integers. The
format mentioned above is the native format for x86 processors, and even on
other architectures converting endianness is a very fast job:

#include <algorithm> // std::swap
#include <cstddef>   // std::size_t

void convert_endian(
    unsigned char *array,   // byte array
    std::size_t size,       // size of an integer in bytes
    std::size_t n           // number of integers to convert
) {
    for (std::size_t i = 0; i != n; ++i)              // for each integer...
        for (std::size_t j = 0; j != size >> 1; ++j)  // ...mirror its bytes
            std::swap(array[size * i + j], array[size * (i + 1) - 1 - j]);
}
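
A small usage sketch of that idea (the file name and block size are
placeholders; the convert_endian() call is only needed on a big-endian host):

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
    const std::size_t count = 1000000;        // integers per block
    std::vector<std::uint32_t> block(count);

    // 4 bytes per integer, little-endian, 8 bits per byte
    std::ifstream in("abc.bin", std::ios::binary);
    in.read(reinterpret_cast<char*>(block.data()),
            static_cast<std::streamsize>(count * sizeof(std::uint32_t)));

    // On a big-endian machine the bytes would have to be reordered first:
    // convert_endian(reinterpret_cast<unsigned char*>(block.data()),
    //                sizeof(std::uint32_t), count);
}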
 
