Efficiently reading a string from a specific point in a file

random guy

Hi,

I'm writing a program which creates an index of text files. For each
file it processes, the program records the start and end positions (as
returned by tellg()) of sections of interest, and then some time later
uses these positions to read the interesting sections from the file.

When reading the sections, I'm currently using get() to read characters
from the file one by one and concatenating them to what has already
been read. However, I guess this will be fairly inefficient if the text
to extract is long.

Is there a more efficient way to do this, perhaps using an existing
library function? I'd imagine that this question has been asked before,
but when googling for answers I could only find solutions for reading
entire files completely; I can't do that because the files are too
large to store in memory.

My code is below; any advice would be gratefully received!

#include <iostream>
#include <string>
#include <fstream>


std::string get_string(std::ifstream &in,
                       std::ifstream::pos_type start,
                       std::ifstream::pos_type end) {

    in.seekg(start);

    std::string s;

    while (in.tellg() != end) {
        s += in.get(); // Not very efficient?
    }

    return s;
}

int main(void) {

    std::ifstream in("test_file", std::ios_base::binary);

    // Hard-coded positions below; these would normally be returned
    // from tellg()
    std::cout << "\"" << get_string(in, 10, 19) << "\"" << std::endl;

    return 0;
}
 
Erik Wikström

You can do something like this:

std::string get_string(std::ifstream &in,
                       std::ifstream::pos_type start,
                       std::ifstream::pos_type end)
{
    char* s = new char[end - start + 1];
    in.get(s, end - start + 1);
    return std::string(s);
}

Notice that by default get() stops reading at \n. If you don't want
that behaviour, you need to provide a third argument, which is a
delimiting character; \0 should work if you never want it to stop
reading. I'm not sure what will happen if it reaches EOF.
 
Richard Herring

std::string get_string(std::ifstream &in,
std::ifstream::pos_type start,
std::ifstream::pos_type end)
{
char* s = new char[end - start + 1];

no corresponding delete[] ...

use std::vector<char> s(end - start + 1);

in.get(s, end - start + 1);
return std::string(s);

}

Notice that by default get() stops reading at \n. If you don't want
that behaviour, you need to provide a third argument, which is a
delimiting character; \0 should work if you never want it to stop
reading.

If you know exactly how many characters you want to read, use in.read().
 
Erik Wikström

No, read() is for unformatted data (binary); get() should be used for text.
 
James Kanze

On 2007-05-11 17:28, Richard Herring wrote:
[...]
If you know exactly how many characters you want to read, use in.read().

No, read() is for unformatted data (binary); get() should be used for text.

What makes you say that? read() works perfectly well for text.

Note, however, that there is not necessarily a relationship
between the number of characters and the difference end -
start converted to an integral type. It will probably work
under Unix, but will certainly result in too many characters
under Windows, and on some systems it may result in nothing
even remotely usable.

Also, of course, on a lot of systems, you can't necessarily
allocate a buffer this big anyway.
 
Erik Wikström

Well, you can of course use whichever one you like, but with get() you
get the null-character at the end of the array for free, which you don't
with read().
 
James Kanze

On 2007-05-11 21:56, James Kanze wrote:

[...]
Well, you can of course use whichever one you like, but with get() you
get the null-character at the end of the array for free, which you don't
with read().

He's using it to construct a string, so he doesn't need the null
character.

FWIW: the next version of the standard will allow reading the
string "in place". Something like:

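std::string result ;
result.resize( size ) ;
// read() rather than get(): a std::string needs no '\0' terminator
if ( ! in.read( &result[ 0 ], result.size() ) ) {
    // short read (e.g. EOF): keep only the characters actually obtained
    result.resize( in.gcount() ) ;
}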

This will also work with all current implementations, and since
it will be standard in the future, the probability of an
implementation changing so that it won't work is pretty small.

The real problem in his code, of course, was the arithmetic on
streampos, which isn't guaranteed to give anything usable for
other than positioning in a file. (In particular, under most
systems---Unix is the only exception I know of---the difference
between two streampos will *not* result in the number of chars
that can be read between those two positions. Under Windows,
the number will typically be somewhat larger, and on other
systems, who knows.)
 
