Efficiently reading large blocks from file


persres

Hello,
I couldn't find a clear answer to this question. I need a very
efficient way to read a large amount of data (integers, separated by
white space and newlines). The file has over a billion integers, so I need to
read it block by block. What is the best way to do it?

I have never been comfortable using ifstreams, as I always
assumed they were slow. Perhaps not. Could anyone advise?

Am I better off using some Win32 API to read a file block? Or can I
just read, say, 2 MB into a buffer?

Then how do I read the integers into a vector? I need a really
efficient way.
So far, I only know the following way:

#include <algorithm>
#include <fstream>
#include <iterator>
#include <vector>

std::vector<unsigned int> myvec;
//myvec.reserve(const_buf_ints);
std::ifstream infile("abc");

std::copy(std::istream_iterator<unsigned int>(infile),
          std::istream_iterator<unsigned int>(),
          std::back_inserter(myvec));

infile.close();


However, even this is not right for me, because I only want to read the
first few thousand numbers, not the whole file. Please help.

To summarize:
1) I need to read a lot (say 100,000) of integers at a time, repeatedly,
from a huge file (over a billion integers). What is the best way to
do that?

Thank you very much
 

Rune Allnor

Hello,
I couldn't find a clear answer to this question. I need a very
efficient way to read a large amount of data (integers, separated by
white space and newlines). The file has over a billion integers, so I need to
read it block by block. What is the best way to do it?

So this is a text file?

I have never been comfortable using ifstreams, as I always
assumed they were slow. Perhaps not. Could anyone advise?

Text formats are slow. If speed is a concern, convert
the file to binary format.
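A minimal sketch of such a one-time conversion (the file names and the
fixed 32-bit unsigned type are assumptions, not something given in the
thread):

#include <cstdint>
#include <fstream>

int main()
{
    std::ifstream in("abc");                        // text file of integers
    std::ofstream out("abc.bin", std::ios::binary); // raw binary output
    std::uint32_t value;
    while (in >> value)                             // parse the text once...
        out.write(reinterpret_cast<const char*>(&value), sizeof value);
    // ...afterwards abc.bin holds fixed-width 4-byte integers.
}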

Rune
 

persres

Paavo wrote:

If I needed to do this really efficiently, I would create a file mapping
into memory (not portable!) and use the strtol() function (part of C++ by
heritage). All these C++ stream facilities are nice and shiny, but I
would not be sure if all the abstraction layers are optimized away
properly.

One probably cannot map the whole file of a billion integers into memory at
once, at least not in a 32-bit program. However, if one just needs a
relatively small portion of it, and one knows the position and the upper
bound of the portion size, then one should be able to create a mapping
view which contains the whole portion. This would make the processing a
bit simpler.
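As a rough illustration of the strtol() part only (not an actual file
mapping; here the block is filled with a plain ifstream::read, and the
function name, path and block size are made-up placeholders):

#include <cstdlib>
#include <fstream>
#include <vector>

std::vector<long> parse_block(const char* path, std::size_t block_size)
{
    std::vector<char> buf(block_size + 1);
    std::ifstream in(path, std::ios::binary);
    in.read(buf.data(), static_cast<std::streamsize>(block_size));
    buf[static_cast<std::size_t>(in.gcount())] = '\0'; // keep strtol inside the buffer

    std::vector<long> values;
    const char* p = buf.data();
    char* end = nullptr;
    for (;;) {
        long v = std::strtol(p, &end, 10); // skips whitespace, parses one integer
        if (end == p)                      // nothing parsed: end of block reached
            break;
        values.push_back(v);
        p = end;                           // continue after the parsed number
    }
    return values;
}

A complete version would also have to handle a number cut in half at the end
of the block, e.g. by re-reading from the last whitespace when the next block
is loaded.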

hth
Paavo

OK, that probably is what I will have to do eventually. For now, could
you please advise on the best way using C++ / the STL? I need to
get something working quickly, and I am not familiar with memory mapping
etc. (and that may not be portable).

Is using fread better? Or should I use binary mode and copy everything
to a buffer? Or shall I use istream_iterator? Or operator>>? Or perhaps most
of them are the same and it doesn't matter much?
Thanks
 

Rune Allnor

OK, that probably is what I will have to do eventually. For now, could
you please advise on the best way using C++ / the STL? I need to
get something working quickly, and I am not familiar with memory mapping
etc. (and that may not be portable).

One very fast way (from the programmer's perspective) is to
use std::getline() to skip ahead to the start of the sequence of
numbers you want to load, and then use operator>> or std::atoi()
to load the numbers. Depending on how much you trust whatever
program generated the numbers, you might want to include
error checking and format validation, which will slow you
down considerably.
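A minimal sketch of that approach, assuming the caller knows how many lines
precede the block it wants (the helper name and its parameters are invented
for illustration):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

std::vector<unsigned int> read_block(const char* path,
                                     std::size_t lines_to_skip,
                                     std::size_t count)
{
    std::ifstream in(path);
    std::string line;
    for (std::size_t i = 0; i != lines_to_skip && std::getline(in, line); ++i)
        ;                                        // discard lines before the block

    std::vector<unsigned int> values;
    values.reserve(count);
    unsigned int v;
    while (values.size() < count && in >> v)     // stop after 'count' numbers
        values.push_back(v);
    return values;
}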

This still involves converting the numbers from text to binary
format, which will be one bottleneck you can avoid by converting
the file to binary first.

Is using fread better? Or should I use binary mode and copy everything
to a buffer?

Provided you have a large enough buffer - maybe. But again, you
still have to convert from text to binary.

Or shall I use istream_iterator? Or operator>>? Or perhaps most
of them are the same and it doesn't matter much?

Converting between text and binary formats is common
among all the approaches. If you work on the same file
several times, convert it to binary *once* up front,
and save time later.
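Once the file is in that raw form, each later read is just a seek plus a
single block read. A hedged sketch, assuming 4-byte integers in the
machine's native byte order (the file name and parameters are placeholders):

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <vector>

std::vector<std::uint32_t> read_binary_block(const char* path,
                                             std::size_t first_index,
                                             std::size_t count)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(first_index * sizeof(std::uint32_t)));

    std::vector<std::uint32_t> block(count);
    in.read(reinterpret_cast<char*>(block.data()),
            static_cast<std::streamsize>(count * sizeof(std::uint32_t)));
    // keep only the integers actually read (the last block may be short)
    block.resize(static_cast<std::size_t>(in.gcount()) / sizeof(std::uint32_t));
    return block;
}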

Rune
 

Michael Tsang


Hello,
I couldn't find a clear answer to this question. I need a very
efficient way to read a large amount of data (integers, separated by
white space and newlines). The file has over a billion integers, so I need to
read it block by block. What is the best way to do it?

Use a binary format for efficiency. For portability, you need to decide how
the numbers are stored. For example, "4 bytes per integer in
little-endian order and 8 bits per byte" is a well-defined format that would not
cause ambiguity. In this case, you can really read block by block: reading
4,000,000 bytes into memory is the same as reading 1,000,000 integers. The
format mentioned above is the native format for x86 processors, and even on
other architectures converting endianness is a very fast job:

#include <algorithm> // std::swap
#include <cstddef>   // std::size_t

void convert_endian(
    unsigned char *array,   // byte array
    std::size_t size,       // size of an integer in bytes
    std::size_t n           // number of integers to convert
) {
    for (std::size_t i = 0; i != n; ++i)              // for each integer...
        for (std::size_t j = 0; j != size >> 1; ++j)  // ...mirror its bytes
            std::swap(array[size * i + j], array[size * (i + 1) - 1 - j]);
}
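
A small usage sketch of that idea (the file name and block size are
placeholders; the convert_endian() call is only needed on a big-endian host):

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
    const std::size_t count = 1000000;        // integers per block
    std::vector<std::uint32_t> block(count);

    // 4 bytes per integer, little-endian, 8 bits per byte
    std::ifstream in("abc.bin", std::ios::binary);
    in.read(reinterpret_cast<char*>(block.data()),
            static_cast<std::streamsize>(count * sizeof(std::uint32_t)));

    // On a big-endian machine the bytes would have to be reordered first:
    // convert_endian(reinterpret_cast<unsigned char*>(block.data()),
    //                sizeof(std::uint32_t), count);
}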
 
