Modifying the contents of a file

J

Jason Heyes

I would like to modify the contents of a file, replacing all occurances of
one string with another. I wrote these functions:

bool read_file(std::string name, std::string &s);
bool write_file(std::string name, const std::string &s);
void find_replace(std::string &s, std::string first, std::string second);

bool find_replace_file(std::string name, std::string first, std::string
second)
{
std::string s;
if (!read_file(name, s))
return false;
find_replace(s, first, second); // replace first with second in s
return write_file(name, s);
}

Have I got the right idea? If not, what is wrong? Thanks.
 
R

Robbie Hatley

I would like to modify the contents of a file, replacing all occurances of
one string with another. I wrote these functions:

bool read_file(std::string name, std::string &s);
bool write_file(std::string name, const std::string &s);
void find_replace(std::string &s, std::string first, std::string second);

bool find_replace_file(std::string name, std::string first, std::string
second)
{
std::string s;
if (!read_file(name, s))
return false;
find_replace(s, first, second); // replace first with second in s
return write_file(name, s);
}

Have I got the right idea? If not, what is wrong? Thanks.

If there's more than one line of text in the file you're
reading, I don't see how that would work. I'd think you'd
need to suck the contents of the "raw" file into a container,
such as a vector or list of strings; operate on the strings;
then write the results, either to a new file or over-writing
the old one.

Of course, you can open a file for read-write, and operate
on it one line at a time; but I prefer not to leave files
open that long. What happens if there's a problem? You bump
the reset button, or Windows crashes? You could loose data.
So I tend to suck all the data into RAM, close the file,
operate on the data, open the output file, write, close.

I've written programs that do what you describe, and my basic
modus operandi has been something like this (mixed C++ and
pseudocode):

std::ifstream IFS (InputFile);
std::string Buffer;
std::list<std::string> Text;
while (IFS)
{
getline(IFS, Buffer);
if(IFS.eof()) break;
Text.push_back(Buffer);
}
IFS.close();
typedef std::list<std::string>::const_iterator LSCI;
for (LSCI i = Text.begin(); i != Text.end(); ++i)
{
// iterate through Text one line at a time, replacing
// stuff at will. (*i) will always refer to the
// "current" line of text.
}

// Now write the processed text to an output file:
std::eek:fstream OFS (OutputFile);
for (LSCI i = Text.begin(); OFS && i != Text.end(); ++i)
{
OFS << (*i) << endl;
}
OFS.close();

Something like that should work for you, I think.


--
Cheers,
Robbie Hatley
Tustin, CA, USA
email: lonewolfintj at pacbell dot net
web: home dot pacbell dot net slant earnur slant
 
R

Robbie Hatley

Actually, there's one glaring error in my last post:
When iterating through a list of strings, altering them,
one can NOT use a const_iterator; a regular iterator
must be used instead. (The const_iterator is fine for
writing strings to file, because the list is not being
altered.) So I should have written:

std::ifstream IFS (InputFile);
std::string Buffer;
std::list<std::string> Text;
while (IFS)
{
getline(IFS, Buffer);
if(IFS.eof()) break;
Text.push_back(Buffer);
}
IFS.close();

// Note NON-const iterator so we can WRITE to list:
typedef std::list<std::string>::iterator LSI;
for (LSI i = Text.begin(); i != Text.end(); ++i)
{
// iterate through Text one line at a time, replacing
// stuff at will. (*i) will always refer to the
// "current" line of text.
}

// Now write the processed text to an output file:
std::eek:fstream OFS (OutputFile);
typedef std::list<std::string>::const_iterator LSCI;
for (LSCI i = Text.begin(); OFS && i != Text.end(); ++i)
{
OFS << (*i) << endl;
}
OFS.close();

--
Cheers,
Robbie Hatley
Tustin, CA, USA
email: lonewolfintj at pacbell dot net
web: home dot pacbell dot net slant earnur slant
 
P

Phil Endecott

Yes, I think you do have the right idea.
If there's more than one line of text in the file you're
reading, I don't see how that would work. I'd think you'd
need to suck the contents of the "raw" file into a container,
such as a vector or list of strings; operate on the strings;
then write the results, either to a new file or over-writing
the old one.

I don't think there's anything wrong with reading the entire file into a
single string. If your processing needed to treat the file as a
sequence of lines then a container<string> might be appropriate, but in
this case I think it is an unnecessary complication.

--Phil.
 
J

Jason Heyes

Robbie Hatley said:
If there's more than one line of text in the file you're
reading, I don't see how that would work. I'd think you'd
need to suck the contents of the "raw" file into a container,
such as a vector or list of strings; operate on the strings;
then write the results, either to a new file or over-writing
the old one.

I read the whole file into a single string instead of a sequence. Here is
the function to do it:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
ss << in.rdbuf();
s = ss.str();
return true;
}
 
P

Panjandrum

Jason said:
I read the whole file into a single string instead of a sequence. Here is
the function to do it:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
ss << in.rdbuf();
s = ss.str();

Why do you duplicate the file contents so often?
return true;

Why no error checking?
 
J

Jason Heyes

Panjandrum said:
Why do you duplicate the file contents so often?


Why no error checking?

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
if (!(ss << in.rdbuf()))
return false;

s.swap(ss.str());
return true;
}
 
P

Panjandrum

Jason said:
bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
if (!(ss << in.rdbuf()))
return false;

s.swap(ss.str());
return true;
}

Much better! Just one unnecessary copy left (file -> stringstream ->
string).
 
A

Alan Johnson

Panjandrum said:
Much better! Just one unnecessary copy left (file -> stringstream ->
string).

This would be my preferred solution:

#include <iterator>

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

s.assign(std::istream_iterator<char>(in),
std::istream_iterator<char>()) ;

return true ;
}
 
P

Panjandrum

Alan said:
This would be my preferred solution:

#include <iterator>

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

s.assign(std::istream_iterator<char>(in),
std::istream_iterator<char>()) ;

return true ;
}

Another interesting solution! But you omit error handling and give
false respone in case of error ('return true'). I also prefer Jason
Heyes' swap-after-success idiom and I'd try to reserve enough space in
the string to avoid unnecessary reallocations.
 
P

Phil Endecott

I'm curious about the amount of copying that is really going on in the
read-from-file-to-string functions that are being posted here. My hope
is that the libraries have a smart string implementation that allows us
to write "naive" code for problems like this while still getting decent
performance. So I'm testing them.

Here's the test platform: /usr/share/dict/words is just under 1
megabyte on my old Celeron-500 Linux PC. I have plenty of RAM though,
so these tests will not be testing the I/O bandwidth after the first
run. Compiler is gcc 3.3.6, optimisation -O9. My main program does a
superficial check that the string is good, and in the process proves
that nothing has been entirely optimised out.

None of these programs should do better than dd to /dev/null, which is a
rather low-level copy program being asked to throw away its output:

$ time dd if=/usr/share/dict/words of=/dev/null bs=8192
110+1 records in
110+1 records out
906950 bytes transferred in 0.080980 seconds (11199684 bytes/sec)

real 0m0.088s
user 0m0.004s
sys 0m0.010s

As you can see the time taken is not really measurable.

OK, now for the first function, from Jason Heyes:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
ss << in.rdbuf();
s = ss.str();
return true;
}

$ time ./tst /usr/share/dict/words

real 0m0.164s
user 0m0.039s
sys 0m0.053s

Now the next one, also from Jason, using a swap:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
if (!(ss << in.rdbuf()))
return false;

s.swap(ss.str());
return true;
}

That doesn't compile. Swap needs a string&, but ss.str() is not an
lvalue. Oh dear.

Next up this, using assign and stream iterators, from Alan Johnson:

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

s.assign(std::istream_iterator<char>(in),
std::istream_iterator<char>()) ;

return true ;
}

$ time ./tst /usr/share/dict/words

real 0m2.320s
user 0m1.174s
sys 0m1.048s

Wow! A quick check with some larger files suggests this is > O(n).

OK, here's my entry. This is modified from something in my mail server,
Decimail. I don't remember why I wrote it like this, but it obviously
made sense at the time.

bool read_file(const std::string& name, std::string& s)
{
std::ifstream in(name.c_str());
if (!in.is_open()) return false;
unsigned int read_this_time;
do {
const int block_size=1024;
char buf[block_size];
read_this_time=in.readsome(buf,block_size);
s.append(buf,read_this_time);
} while(read_this_time>0 && in.good());
return true;
}

$ time ./tst /usr/share/dict/words

real 0m0.140s
user 0m0.025s
sys 0m0.034s


Any other suggestions? How does this compare with other compilers /
platforms? Does anyone have any other functions they'd like to compare?

--Phil.
 
J

Jason Heyes

Phil Endecott said:
I'm curious about the amount of copying that is really going on in the
read-from-file-to-string functions that are being posted here. My hope is
that the libraries have a smart string implementation that allows us to
write "naive" code for problems like this while still getting decent
performance. So I'm testing them.

Here's the test platform: /usr/share/dict/words is just under 1 megabyte
on my old Celeron-500 Linux PC. I have plenty of RAM though, so these
tests will not be testing the I/O bandwidth after the first run. Compiler
is gcc 3.3.6, optimisation -O9. My main program does a superficial check
that the string is good, and in the process proves that nothing has been
entirely optimised out.

None of these programs should do better than dd to /dev/null, which is a
rather low-level copy program being asked to throw away its output:

$ time dd if=/usr/share/dict/words of=/dev/null bs=8192
110+1 records in
110+1 records out
906950 bytes transferred in 0.080980 seconds (11199684 bytes/sec)

real 0m0.088s
user 0m0.004s
sys 0m0.010s

As you can see the time taken is not really measurable.

OK, now for the first function, from Jason Heyes:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
ss << in.rdbuf();
s = ss.str();
return true;
}

$ time ./tst /usr/share/dict/words

real 0m0.164s
user 0m0.039s
sys 0m0.053s

Now the next one, also from Jason, using a swap:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
if (!(ss << in.rdbuf()))
return false;

s.swap(ss.str());
return true;
}

That doesn't compile. Swap needs a string&, but ss.str() is not an
lvalue. Oh dear.

Next up this, using assign and stream iterators, from Alan Johnson:

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

s.assign(std::istream_iterator<char>(in),
std::istream_iterator<char>()) ;

return true ;
}

$ time ./tst /usr/share/dict/words

real 0m2.320s
user 0m1.174s
sys 0m1.048s

Wow! A quick check with some larger files suggests this is > O(n).

OK, here's my entry. This is modified from something in my mail server,
Decimail. I don't remember why I wrote it like this, but it obviously
made sense at the time.

bool read_file(const std::string& name, std::string& s)
{
std::ifstream in(name.c_str());
if (!in.is_open()) return false;
unsigned int read_this_time;
do {
const int block_size=1024;
char buf[block_size];
read_this_time=in.readsome(buf,block_size);
s.append(buf,read_this_time);
} while(read_this_time>0 && in.good());
return true;
}

$ time ./tst /usr/share/dict/words

real 0m0.140s
user 0m0.025s
sys 0m0.034s


Any other suggestions? How does this compare with other compilers /
platforms? Does anyone have any other functions they'd like to compare?

Very interesting results. I was surprised at how Alan Johnson's function
performed. I would like to see this next function compared. It is naively
implemented:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

s = "";
char c;
while (in.get(c))
s += c;
return true;
}

I predict a time similar to Alan Johnson's.

Your block-reading function has similar performance to my original function.
Why is that? Are both functions doing the same thing?
 
P

Panjandrum

Phil said:
Now the next one, also from Jason, using a swap:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
if (!(ss << in.rdbuf()))
return false;

s.swap(ss.str());
return true;
}

That doesn't compile. Swap needs a string&, but ss.str() is not an
lvalue. Oh dear.

std::string tmp (ss.str());
s.swap (tmp);

....
OK, here's my entry. This is modified from something in my mail server,
Decimail. I don't remember why I wrote it like this, but it obviously
made sense at the time.

bool read_file(const std::string& name, std::string& s)
{
// read file_size with stat
std::string tmp;
tmp.reserve (file_size);
std::ifstream in(name.c_str());
if (!in.is_open()) return false;
unsigned int read_this_time;
do {
const int block_size=1024;

const int block_size=BUFSIZ; // the 'right' size
char buf[block_size];
read_this_time=in.readsome(buf,block_size);
s.append(buf,read_this_time);

tmp.append (...)
} while(read_this_time>0 && in.good());

if (in.eof() {
tmp.swap (s);
return true;
} else {
return false;
}
 
J

Jason Heyes

Panjandrum said:
Phil said:
Now the next one, also from Jason, using a swap:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
if (!(ss << in.rdbuf()))
return false;

s.swap(ss.str());
return true;
}

That doesn't compile. Swap needs a string&, but ss.str() is not an
lvalue. Oh dear.

std::string tmp (ss.str());
s.swap (tmp);

...
OK, here's my entry. This is modified from something in my mail server,
Decimail. I don't remember why I wrote it like this, but it obviously
made sense at the time.

bool read_file(const std::string& name, std::string& s)
{
// read file_size with stat
std::string tmp;
tmp.reserve (file_size);
std::ifstream in(name.c_str());
if (!in.is_open()) return false;
unsigned int read_this_time;
do {
const int block_size=1024;

const int block_size=BUFSIZ; // the 'right' size
char buf[block_size];
read_this_time=in.readsome(buf,block_size);
s.append(buf,read_this_time);

tmp.append (...)
} while(read_this_time>0 && in.good());

if (in.eof() {
tmp.swap (s);
return true;
} else {
return false;
}

It is not necessary to use a temporary string. The written function is
already correct.
 
P

Panjandrum

Jason said:
It is not necessary to use a temporary string. The written function is
already correct.

Only in case of no error. Otherwise you destroy the user's string and
give back truncated contents.
 
J

Jason Heyes

Panjandrum said:
Only in case of no error. Otherwise you destroy the user's string and
give back truncated contents.

No. The loop only terminates once eof is reached. So it is guarenteed that
in.eof() is true in your if-statement. This is why you don't need to use a
temporary string.
 
P

Phil Endecott

Jason said:
bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

s = "";
char c;
while (in.get(c))
s += c;
return true;
}


$ time ./tst /usr/share/dict/words

real 0m0.670s
user 0m0.555s
sys 0m0.030s

I've tried adding s.reserve() at the start and it doesn't seem to make
much difference.

The problem with Alan Johnson's function is that it is O(n^2) in the
file size: if I run it on a double-length input it takes four times as
long. I don't know why this is, because as the function above (which is
O(n)) shows, just appending to a string is not the problem. Any
thoughts? Do other compilers / libraries do better?


Panjandrum, the changes you suggest seem mostly to do with not changing
the initial value of s when an error has occured. I think that in the
cited application it is fine for s to be undefined in this case:

bool find_replace_file(std::string name, std::string first, std::string
second)
{
std::string s;
if (!read_file(name, s))
return false;
find_replace(s, first, second); // replace first with second in s
return write_file(name, s);
}

The question of whether peak memory use is reduced by using swap()
rather than assignment is an interesting one, but I do not have a good
way of measuring peak memory use. Does anyone have any suggestions?

--Phil.
 
P

Panjandrum

Jason said:
No. The loop only terminates once eof is reached.

Nope. Actually the temination condition is probably wrong anyway (see
below). But who really knows iostream error handling.
So it is guarenteed that
in.eof() is true in your if-statement. This is why you don't need to use a
temporary string.

Nothing is 'guarenteed'. It's probably:
while (read_this_time>0 && in); // not in.good()
if (in.eof()) ... // success
else ... // error
 
J

Jason Heyes

Phil Endecott said:
$ time ./tst /usr/share/dict/words

real 0m0.670s
user 0m0.555s
sys 0m0.030s

I've tried adding s.reserve() at the start and it doesn't seem to make
much difference.

The problem with Alan Johnson's function is that it is O(n^2) in the file
size: if I run it on a double-length input it takes four times as long. I
don't know why this is, because as the function above (which is O(n))
shows, just appending to a string is not the problem. Any thoughts? Do
other compilers / libraries do better?

I know why. The assign function used by Alan Johnson does two things, which
are:

1. It computes the number of elements to copy by taking the distance between
two iterators. Since the iterators are not random access iterators, the
difference is calculated by iterating between them. To do that it must read
the entire file.

2. It copies the elements in a for-loop, reading the entire file a second
time.

Alan Johnson's function reads the whole file twice! This explains
everything, I guess.
The question of whether peak memory use is reduced by using swap() rather
than assignment is an interesting one, but I do not have a good way of
measuring peak memory use. Does anyone have any suggestions?

What is peak memory use?
 
A

Alan Johnson

Phil said:
Jason Heyes wrote:

The problem with Alan Johnson's function is that it is O(n^2) in the
file size: if I run it on a double-length input it takes four times as
long. I don't know why this is, because as the function above (which is
O(n)) shows, just appending to a string is not the problem. Any
thoughts? Do other compilers / libraries do better?

I'm not able to reproduce a quadratic running time. Here are the
results running on my machine (gcc 3.4.4 in Cygwin) on 1MB, 10MB, and
100MB files (respectively).

$ time ./a.exe testfile1

real 0m1.149s
user 0m1.171s
sys 0m0.015s

$ time ./a.exe testfile2

real 0m11.340s
user 0m11.327s
sys 0m0.046s

$ time ./a.exe testfile3

real 1m53.727s
user 1m52.905s
sys 0m0.796s


This is almost exactly linear growth in running time. You might try the
following to see if it also has quadratic time on your system.

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

std::istream_iterator<char> begin(in) ;
std::istream_iterator<char> end ;

while (begin != end) ++begin ;

return true ;
}

If not, then I would suspect that either your library's string class
poorly handles the "assign" method, or perhaps you are running into some
sort of memory swapping issues on larger files (though it seems you
would have seen the same problems with any of the other functions you
tested).

-Alan
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,436
Messages
2,571,696
Members
48,796
Latest member
Greg L.
Top