Modifying the contents of a file

P

Phil Endecott

The problem with Alan Johnson's function is that it is O(n^2) in the file
Alan Johnson's function reads the whole file twice! This explains
everything, I guess.

No, that's still linear, not quadratic.
What is peak memory use?

The maximum amount of memory that the program uses.

--Phil.
 
P

Phil Endecott

The problem with Alan Johnson's function is that it is O(n^2) in the
I'm not able to reproduce a quadratic running time.
bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

std::istream_iterator<char> begin(in) ;
std::istream_iterator<char> end ;

while (begin != end) ++begin ;

return true ;
}

That's linear.

If I add to the string inside the loop:

while (begin != end) {
++begin;
s += (*begin);
}

That's also linear.

If I replace the loop with a call to append:

s.append(begin,end);

That's quadratic.

But... if I use gcc 3.4 (rather than 3.3) it suddenly becomes linear!

Mystery solved. OK, time to upgrade to 3.4 all round and see how many
of my programs get a sudden speed increase.

Here are the results for my original test code using gcc 3.4:

Jason's original function:
real 0m0.153s
user 0m0.024s
sys 0m0.051s

Alan's original function:
real 0m0.366s
user 0m0.243s
sys 0m0.047s

My one:
real 0m0.143s
user 0m0.021s
sys 0m0.039s


--Phil.
 
J

Jason Heyes

Alan Johnson said:
I'm not able to reproduce a quadratic running time. Here are the results
running on my machine (gcc 3.4.4 in Cygwin) on 1MB, 10MB, and 100MB files
(respectively).

$ time ./a.exe testfile1

real 0m1.149s
user 0m1.171s
sys 0m0.015s

$ time ./a.exe testfile2

real 0m11.340s
user 0m11.327s
sys 0m0.046s

$ time ./a.exe testfile3

real 1m53.727s
user 1m52.905s
sys 0m0.796s


This is almost exactly linear growth in running time. You might try the
following to see if it also has quadratic time on your system.

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

std::istream_iterator<char> begin(in) ;
std::istream_iterator<char> end ;

while (begin != end) ++begin ;

return true ;
}

If not, then I would suspect that either your library's string class
poorly handles the "assign" method, or perhaps you are running into some
sort of memory swapping issues on larger files (though it seems you would
have seen the same problems with any of the other functions you tested).

Assign performs O(2n) when the iterators are not random-access iterators
(e.g., std::istream_iterator).
 
J

Jason Heyes

Phil Endecott said:
That's linear.

If I add to the string inside the loop:

while (begin != end) {
++begin;
s += (*begin);
}

That's also linear.

If I replace the loop with a call to append:

s.append(begin,end);

That's quadratic.

But... if I use gcc 3.4 (rather than 3.3) it suddenly becomes linear!

Mystery solved. OK, time to upgrade to 3.4 all round and see how many of
my programs get a sudden speed increase.

Here are the results for my original test code using gcc 3.4:

Jason's original function:
real 0m0.153s
user 0m0.024s
sys 0m0.051s

Alan's original function:
real 0m0.366s
user 0m0.243s
sys 0m0.047s

My one:
real 0m0.143s
user 0m0.021s
sys 0m0.039s

Better. I'm still curious how you get quadratic time using gcc 3.3. Perhaps
the for-loop calls distance in it's condition:

template <typename InIt>
void assign(InIt first, InIt second)
{
replace(0, size(), distance(first, second), 0);
for (iterator it = begin(); distance(first, second) > 0; it++, first++)
*it = *first;
}
 
J

Jason Heyes

Panjandrum said:
Nope. Actually the temination condition is probably wrong anyway (see
below). But who really knows iostream error handling.


Nothing is 'guarenteed'. It's probably:
while (read_this_time>0 && in); // not in.good()
if (in.eof()) ... // success
else ... // error

If this was your usual I/O that calls formatted input functions, you would
need to test eof. But here we call low-level, unformatted input functions
and the only danger is a loss of integrity in the stream buffer (i.e.,
in.bad() evaluates true). How often does that happen? Hardly seems worth the
effort and run-time required of using a temporary string. What do you think?
 
A

Alan Johnson

Jason Heyes wrote:
[snip]
Assign performs O(2n) when the iterators are not random-access iterators
(e.g., std::istream_iterator).

First, O(2n) is a fairly nonsensical expression, as O(2n) defines
precisely the same set of functions as O(n).

Second, if by O(2n) you mean to imply that the entire file is read twice
(or more precisely, that assign iterates over the range twice), then I
would ask to see where in the standard that behavior is defined. In
fact, tracing through my compiler's library, assign is implemented by
calling replace, which is implemented by constructing a new string using
the InputIterator range, which is implemented by called a function
called _S_construct that iterates over the range only once, allocating
more space as it is needed.

-Alan
 
R

Robbie Hatley

I read the whole file into a single string instead of a sequence.

Ok, if that's more convenient for the kind of processing you're
trying to do.

I'm often interesting in processing files with many short lines,
such as killfiles, dictionary files, name-list files, etc.
I'm often interesting in inserting, erasing, altering, sorting,
or de-duping lines of text, so I use std::list<std::string> a lot.
It makes that kind of processing a breeze.

But if you want to suck the whole file into a single std:string,
I can see how that could have its own advantages. (Such as
using the vast assortment of methods -- substr, find, etc. --
that come with std::string.)
Here is the function to do it:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
ss << in.rdbuf();
s = ss.str();
return true;
}

Yikes. What is "rdbuf"? Never heard of that.
Lemme look it up in Bjarne's book...
Hmmm... he doesn't say much about it. In fact, his only
explanation is the cryptic two-word comment, "get buffer".
So what is this "rdbuf" doing?

And how does this function handle newlines? Does it
insert them into the string?

Curious,
Robbie Hatley
Tustin, CA, USA
(e-mail address removed)
http://home.pacbell.net/earnur/
 
A

Alan Johnson

Jason said:
I know why. The assign function used by Alan Johnson does two things, which
are:

1. It computes the number of elements to copy by taking the distance between
two iterators. Since the iterators are not random access iterators, the
difference is calculated by iterating between them. To do that it must read
the entire file.

2. It copies the elements in a for-loop, reading the entire file a second
time.

Alan Johnson's function reads the whole file twice! This explains
everything, I guess.

From the C++ Standard, Clause 24.1.1.3: "... Algorithms on input
iterators should never attempt to pass through the same iterator twice.
They should be single pass algorithms. ..."

Therefore, in a standard compliant implementation, assign is a single
pass algorithm.

-Alan
 
P

Panjandrum

Jason said:
If this was your usual I/O that calls formatted input functions, you would
need to test eof.

You need to test for an error (if you need to test for an error).
But here we call low-level, unformatted input functions
and the only danger is a loss of integrity in the stream buffer (i.e.,
in.bad() evaluates true).

You may just lose your user's data.
How often does that happen? Hardly seems worth the
effort and run-time required of using a temporary string. What do you think?

You mean it hardly seems worth the effort to safely handle your user's
data?
BTW, the cost of a temporary string that is swapped later is close to
immeasurable.
 
J

Jason Heyes

Alan Johnson said:
From the C++ Standard, Clause 24.1.1.3: "... Algorithms on input iterators
should never attempt to pass through the same iterator twice. They should
be single pass algorithms. ..."

Therefore, in a standard compliant implementation, assign is a single pass
algorithm.

Does this mean the gcc 3.3 library is not compliant with that clause? How
else do we explain the O(2n) performance?
 
J

Jason Heyes

Alan Johnson said:
Jason Heyes wrote:
[snip]
Assign performs O(2n) when the iterators are not random-access iterators
(e.g., std::istream_iterator).

First, O(2n) is a fairly nonsensical expression, as O(2n) defines
precisely the same set of functions as O(n).

Second, if by O(2n) you mean to imply that the entire file is read twice
(or more precisely, that assign iterates over the range twice), then I
would ask to see where in the standard that behavior is defined. In fact,
tracing through my compiler's library, assign is implemented by calling
replace, which is implemented by constructing a new string using the
InputIterator range, which is implemented by called a function called
_S_construct that iterates over the range only once, allocating more space
as it is needed.

That's all fine. How do you explain your function taking twice as long to
execute when compiled with the gcc 3.3 compiler? Is the compiler's library
not compliant with the standard?
 
J

Jason Heyes

Robbie Hatley said:
Ok, if that's more convenient for the kind of processing you're
trying to do.

I'm often interesting in processing files with many short lines,
such as killfiles, dictionary files, name-list files, etc.
I'm often interesting in inserting, erasing, altering, sorting,
or de-duping lines of text, so I use std::list<std::string> a lot.
It makes that kind of processing a breeze.

But if you want to suck the whole file into a single std:string,
I can see how that could have its own advantages. (Such as
using the vast assortment of methods -- substr, find, etc. --
that come with std::string.)


Yikes. What is "rdbuf"? Never heard of that.
Lemme look it up in Bjarne's book...
Hmmm... he doesn't say much about it. In fact, his only
explanation is the cryptic two-word comment, "get buffer".
So what is this "rdbuf" doing?

And how does this function handle newlines? Does it
insert them into the string?

rdbuf() returns a stream buffer, which is like a low-level version of an I/O
stream.

All whitespace (including the newline) is inserted into the string during
extraction.
 
A

Alan Johnson

Jason said:
Does this mean the gcc 3.3 library is not compliant with that clause? How
else do we explain the O(2n) performance?

Merging another branch of this thread because it has turned into
That's all fine. How do you explain your function taking twice as long to
execute when compiled with the gcc 3.3 compiler? Is the compiler's library
not compliant with the standard?

You need to explain what you mean by O(2n). I could just as easily say
that all the other algorithms had running times in O(1000000n), and I'd
be correct, because those all define the same set of functions. (i.e.
the constants don't mean anything.)

The behavior that Phil Endecott observed was not that my algorithm took
twice as long, but that it had quadratic (i.e. O(n^2) ) performance on
his system. If in fact his implementation did (in violation of the
standard) iterate over the range twice, that still would only be linear
performance. If I had to take a wild guess at what was making the
performace quadratic, I'd guess that the memory allocation scheme used
in his library's "assign" method was allocating a constant amount of
memory each time it ran out (provably quadratic) instead of doing
something like doubling the buffer size each time it ran out of memory
(provably linear).

I don't have easy access to test on a gcc 3.3 compiler, but I would say
that his results do in fact indicate an error in his library, especially
since it suddenly was corrected when he upgraded compilers.

-Alan
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,438
Messages
2,571,699
Members
48,796
Latest member
Greg L.
Top