Efficiency and the standard library


Seebs

This may have gotten buried given that it started in a Nilges thread,
but it's actually pretty interesting.

There was some discussion of algorithms to perform the task:
Inputs: target string, replacement string, original string
Output: a newly-allocated string containing the original
string with each copy of the target string replaced with the
replacement string.

Here's a hunk of mine:

for (count = 0, t = strstr(target, in); t && *t; t = strstr(t, in)) {
    ++count;
    t += inlen;
}

Here's a hunk of another one:
ptrIndex1 = strMaster;
while (*ptrIndex1)
{
    ptrIndex0 = ptrIndex1;
    while (-1)
    {
        for (;
             *ptrIndex1 && *ptrIndex1 != *strTarget;
             ptrIndex1++);
        for (ptrIndex2 = strTarget, ptrIndex3 = ptrIndex1;
             *ptrIndex3
             &&
             *ptrIndex2
             &&
             *ptrIndex3 == *ptrIndex2;
             ptrIndex3++, ptrIndex2++);
        if (!*ptrIndex2) break;
        ptrIndex1 = ptrIndex3;
        if (!*ptrIndex3) break;
    }

These hunks are essentially performing the same part of the core loop:
find the next occurrence of the target string in the original string.

The second one was written by Edward Nilges ("spinoza1111"). He offers
in its defense the assertion that it's the "best" algorithm. (Nevermind
that it's got a bug; the bug could be fixed easily.)

My thinking:

The first is better, not because it's less buggy (that could be fixed), but
because it's simpler to understand. It may, or may not, be less efficient.
However, efficiency will vary widely with input characteristics; for some
cases, the arguable inefficiencies may be completely irrelevant, while in
others, they'd be significant.

But you should always write the simpler one first, and wait until you've
profiled the code and verified that there's a bottleneck, and you know where
it is, before trying to optimize. You may, or may not, be able to outperform
a given library's implementation of strstr(). If you have unusual inputs,
you have a pretty good shot at it... But it may not be worth it. The
complicated example has had at least one bug identified, and there may be
others. It's extremely hard to read, partially because of the poor
naming, and partially because of the counter-idiomatic usage. But mostly,
it's a lot more work to write and maintain, and for no known benefit.

It's good to think a little about efficiency, but write for clarity and
correctness first, so you have a known-working program to check your
results if you later need to modify it for speed.

-s
 

Moi

This may have gotten buried given that it started in a Nilges thread,
but it's actually pretty interesting.

There was some discussion of algorithms to perform the task:
Inputs: target string, replacement string, original string
Output: a newly-allocated string containing the original
string with each copy of the target string replaced with the
replacement string.

Here's a hunk of mine:

for (count = 0, t = strstr(target, in); t && *t; t = strstr(t, in)) {
    ++count;
    t += inlen;
}

Of course yours is more readable.
Personally, I prefer not to use one-letter variable names (three-letter names
are more greppable, also for the eyes). Also, I don't like for(;;) statements
that are too wide (and with repeated subexpressions). But it's all a matter of
taste, of course.

I'd prefer:


for (count = 0, tmp = target; (tmp = strstr(tmp, in)); ) {
    if (!*tmp) break;
    count++;
    tmp += inlen;
    ...
}


My own strstr() loop:

nhit = 0;
for (tail = haystack; (src = strstr(tail, needle)); tail = src + needle_len) {
    nhit++;
    ...
}

This uses one variable too many, but the extra one will be used later on.

AvK
 

Dann Corbit

Seebs said:
This may have gotten buried given that it started in a Nilges thread,
but it's actually pretty interesting.

There was some discussion of algorithms to perform the task:
Inputs: target string, replacement string, original string
Output: a newly-allocated string containing the original
string with each copy of the target string replaced with the
replacement string.

Here's a hunk of mine:

for (count = 0, t = strstr(target, in); t && *t; t = strstr(t, in)) {
    ++count;
    t += inlen;
}

Here's a hunk of another one:


These hunks are essentially performing the same part of the core loop;
find the next occurrence of the target string in the original string.

The second one was written by Edward Nilges ("spinoza1111"). He offers
in its defense the assertion that it's the "best" algorithm. (Nevermind
that it's got a bug; the bug could be fixed easily.)

My thinking:

The first is better, not because it's less buggy (that could be fixed), but

For a standard library, the *most* important thing is correctness. IOW,
a bsearch() function that is blazing fast but fails if there are more
than 2 billion items is not as good as one that is only half as fast but
works with anything up to size_t (-1) number of elements.

Assuming correctness, efficiency should be the next decider. An
interesting project would be to take a collection of all public C
libraries and do the following:
1. Exhaustively test for correctness
2. Analyze for efficiency
3. Choose best of breed from the above analysis. It could also be a
hybrid (e.g. a decider function chooses the best algorithm based on
conditions).

Here is a list of some compilers (most of these come with source code)
that could be used as a starting point:

Directory of c:\compiler
06/03/2009 11:38 AM <DIR> bscc
01/21/2010 04:16 PM <DIR> clang-2.6
01/21/2010 04:35 PM <DIR> commoncpp2-1.7.3
01/13/2010 01:13 PM <DIR> coreutils-8.4
10/16/2009 05:39 PM <DIR> ctool_2.12
10/16/2009 12:39 PM <DIR> fog
06/04/2009 11:52 AM <DIR> g95-0.92
06/04/2009 11:58 AM <DIR> gcc-4.4.0
10/16/2009 05:58 PM <DIR> gcc-4.4.2
01/21/2010 04:39 PM <DIR> gcc-4.4.3
10/16/2009 05:59 PM <DIR> ladsoft
10/16/2009 05:59 PM <DIR> lcc
10/16/2009 06:00 PM <DIR> llvm
01/21/2010 04:18 PM <DIR> llvm-2.6
01/21/2010 04:19 PM <DIR> llvm-gcc4.2-2.6.source
11/30/2008 08:00 AM <DIR> open64-4.2.1-0
10/16/2009 06:06 PM <DIR> OW18src
01/21/2010 04:18 PM <DIR> owdaily
06/04/2009 11:38 AM <DIR> tendra
01/21/2010 04:35 PM <DIR> ucommon-2.0.8
06/03/2009 02:49 PM <DIR> vmkit-0.25
06/03/2009 12:24 PM <DIR> watcom
01/21/2010 04:00 PM <DIR> x86_open64-4.2.3
 

Ben Pfaff

Dann Corbit said:
For a standard library, the *most* important thing is correctness. IOW,
a bsearch() function that is blazing fast but fails if there are more
than 2 billion items is not as good as one that is only half as fast but
works with anything up to size_t (-1) number of elements.

If the former bsearch() implementation is on a system that
doesn't support more than 2 billion bytes of contiguous data, I'm
not worried about it.
 

Phred Phungus

Ben said:
If the former bsearch() implementation is on a system that
doesn't support more than 2 billion bytes of contiguous data, I'm
not worried about it.

Why?
--
 

Ben Pfaff

Phred Phungus said:

Because the bsearch() function cannot exceed a limit of 2 billion
items on an implementation that limits objects to 2 billion bytes
or less.
 

Ben Bacarisse

Seebs said:
This may have gotten buried given that it started in a Nilges thread,
but it's actually pretty interesting.

There was some discussion of algorithms to perform the task:
Inputs: target string, replacement string, original string
Output: a newly-allocated string containing the original
string with each copy of the target string replaced with the
replacement string.

Here's a hunk of mine:

for (count = 0, t = strstr(target, in); t && *t; t = strstr(t, in)) {
    ++count;
    t += inlen;
}

Here's a hunk of another one:


These hunks are essentially performing the same part of the core loop;
find the next occurrence of the target string in the original string.

The second one was written by Edward Nilges ("spinoza1111"). He offers
in its defense the assertion that it's the "best" algorithm. (Nevermind
that it's got a bug; the bug could be fixed easily.)

My thinking:

The first is better, not because it's less buggy (that could be fixed), but
because it's simpler to understand. It may, or may not, be less efficient.
However, efficiency will vary widely with input characteristics; for some
cases, the arguable inefficiencies may be completely irrelevant, while in
others, they'd be significant.

I don't think you can dismiss the bugs so easily. I would argue that
it is not a coincidence that the more complex code has more bugs. I
can't write code that looks like spinoza1111's code and get it right.
I *have* to write simpler code, broken into simple functions, or I have
no chance of getting it to be correct.

For example, the latest improvement introduced another bug[1] and I
don't think that is simple carelessness. Without a re-write I suspect
almost any change is as likely to introduce a bug as fix one (I tried
to see where the error was, but I gave up). If I was paid to fix it,
I'd simply start again and it would end up as I have already posted,
though to mirror the behaviour I'd now have to add some argument
checking.

<snip>

[1] Apparently when the target string is not present in the source
string: e.g. replace("a", "x", "b").
 

Dann Corbit

Because the bsearch() function cannot exceed a limit of 2 billion
items on an implementation that limits objects to 2 billion bytes
or less.


If we are talking about constructing a best of breed library system for
the C language, then the distinction is of enormous importance.

While I agree it is not relevant for the original implementation, it is
for an implementation where we are trying to pick and choose the best
possible routines.

P.S.
The former case is still a bug waiting to happen.

Even if the former implementation runs on a toaster IC: 40 years in the
future, our toaster ICs will have 50 GB of RAM on board.
 

Seebs

I don't think you can dismiss the bugs so easily. I would argue that
it is not a coincidence that the more complex code has more bugs. I
can't write code that looks like spinoza1111's code and get it right.
I *have* to write simpler code, broken into simple functions, or I have
no chance of getting it to be correct.

You may be right on this. Certainly, this kind of thing is one of the
major reasons that I sometimes rewrite things a few times until I am
sure I can understand the logic just by looking at it. If I have to
think it through, usually that means something is wrong, or the problem
is extremely hard.
For example, the latest improvement introduced another bug[1] and I
don't think that is simple carelessness. Without a re-write I suspect
almost any change is as likely to introduce a bug as fix one (I tried
to see where the error was, but I gave up). If I was paid to fix it,
I'd simply start again and it would end up as I have already posted,
though to mirror the behaviour I'd now have to add some argument
checking.

I think it would be interesting to start from his code and see whether it
can be fixed by cleanups and refactoring. I bet it could, although it'd
certainly take longer than just writing one from scratch.

But, e.g., starting by using names a little better than "ptrIndex1" and
the like would make it a ton simpler.

-s
 

Peter Nilsson

Seebs said:
There was some discussion of algorithms to perform the task:
  Inputs:  target string, replacement string, original string
  Output:  a newly-allocated string containing the original
        string with each copy of the target string replaced
with the replacement string.

Here's a hunk of mine:

  for (count = 0, t = strstr(target, in); t && *t;
t = strstr(t, in)) {

Why are you seemingly scanning the target string for occurrences
of 'in', which I would presume to be original 'input' string?
                ++count;
                t += inlen;
        }

What if the string being scanned for is empty? If the string
to be scanned is non-empty, this will produce an infinite loop.
 

Seebs

Why are you seemingly scanning the target string for occurrences
of 'in', which I would presume to be original 'input' string?

"in", "out", "original".
What if the string being scanned for is empty? If the string
to be scanned is non-empty, this will produce an infinite loop.

This was indeed observed, which is why the error checking at the
beginning of the function is now modified to reject a 0-length
thing-to-replace.

-s
 

Eric Sosman

"in", "out", "original".


This was indeed observed, which is why the error checking at the
beginning of the function is now modified to reject a 0-length
thing-to-replace.

... which is what I imagined the `*t' test in your fragment
was supposed to catch. If zero-length targets are rejected
earlier, I'm somehow missing the purpose of the `*t'.
 

Phred Phungus

Dann said:
If we are talking about constructing a best of breed library system for
the C language, then the distinction is of enormous importance.

While I agree it is not relevant for the original implementation, it is
for an implementation where we are trying to pick and choose the best
possible routines.

P.S.
The former case is still a bug waiting to happen.

Even if the former implementation is a toaster IC, 40 years in the
future, our toaster ICs will have 50 GB RAM on board.

That's a lot of computer power for something that requires almost none.
 

Mark

Richard said:
Total nonsense.

The better one is the one that runs more efficiently and works. So
assuming the bug is fixed and it's more efficient, then the other is
better. It's called a function. You don't need to know what's in it.

The one who will maintain this and other functions, needs to know what's in
it.
 

Nick Keighley

I don't think you can dismiss the bugs so easily.  I would argue that
it is not a coincidence that the more complex code has more bugs.  I
can't write code that looks like spinoza1111's code and get it right.
I *have* to write simpler code, broken into simple functions, or I have
no chance of getting it to be correct.

"I now suggest that we confine ourselves to the design and
implementation of intellectually manageable programs." Dijkstra

For example, the latest improvement introduced another bug[1] and I
don't think that is simple carelessness.  Without a re-write I suspect
almost any change is as likely to introduce a bug as fix one (I tried
to see where the error was, but I gave up).  If I was paid to fix it,
I'd simply start again and it would end up as I have already posted,
though to mirror the behaviour I'd now have to add some argument
checking.

<snip>

[1] Apparently when the target string is not present in the source
string: e.g. replace("a", "x", "b").
 

Nick Keighley

This may have gotten buried given that it started in a Nilges thread,
but it's actually pretty interesting.

There was some discussion of algorithms to perform the task:
        Inputs:  target string, replacement string, original string
        Output:  a newly-allocated string containing the original
          string with each copy of the target string replaced with the
          replacement string.

Here's a hunk of mine:

        for (count = 0, t = strstr(target, in); t && *t; t = strstr(t, in)) {
                ++count;
                t += inlen;
        }

Here's a hunk of another one:


These hunks are essentially performing the same part of the core loop;
find the next occurrence of the target string in the original string.

The second one was written by Edward Nilges ("spinoza1111").  He offers
in its defense the assertion that it's the "best" algorithm.  (Nevermind
that it's got a bug; the bug could be fixed easily.)

My thinking:

The first is better, not because it's less buggy (that could be fixed), but
because it's simpler to understand.  It may, or may not, be less efficient.
However, efficiency will vary widely with input characteristics; for some
cases, the arguable inefficiencies may be completely irrelevant, while in
others, they'd be significant.

But you should always write the simpler one first, and wait until you've
profiled the code and verified that there's a bottleneck, and you know where
it is, before trying to optimize.  You may, or may not, be able to outperform
a given library's implementation of strstr().  If you have unusual inputs,
you have a pretty good shot at it... But it may not be worth it.  The
complicated example has had at least one bug identified, and there may be
others.  It's extremely hard to read, partially because of the poor
naming, and partially because of the counter-idiomatic usage.  But mostly,
it's a lot more work to write and maintain, and for no known benefit.

It's good to think a little about efficiency, but write for clarity and
correctness first, so you have a known-working program to check your
results if you later need to modify it for speed.

Yes. If strstr() didn't exist it would be necessary to invent it. The
use of appropriate abstractions simplifies the code.

We once had a system that manipulated messages. Messages were linked
lists of memory blocks. Queues were linked lists of messages. Since
the same pointers were used for both purposes a queue was also a list
of memory blocks.

The code needed to free (return to a free list) all the memory blocks
of the messages that came before a particular message type. Two pages
of tangled code were needed to find the block before the first block of
the target message, followed by direct manipulation of the queue head
and tail pointers and the free-list pointer. The blocks also had to be marked as
free (they had some associated data). After I found my third bug I
wrote:

while (not empty(queue) and queue_head.message_type != target)
{
    msg = remove_msg(queue)
    free_msg(msg)
}

(pseudo code)

As my version was shorter, simpler, and more correct, it was adopted.
Why deal with three things at once when it can be delegated elsewhere?
The code wanted to manipulate messages, not memory blocks.
 

Nick Keighley

There was some discussion of algorithms to perform [some] task:

[simple code]

and

[complex code]

Total nonsense.

The better one is the one that runs more efficiently and works.

how do we demonstrate that it works?
So assuming
the bug is fixed and it's more efficient, then the other is better. It's
called a function. You don't need to know what's in it.

"There are two ways of constructing a software design: One way is to
make it so simple that there are obviously no deficiencies, and the
other way is to make it so complicated that there are no obvious
deficiencies. The first method is far more difficult."
-- C.A.R. Hoare
 

John Bode

"Dann Corbit" <[email protected]> wrote in message
[snip]
For a standard library, the *most* important thing is correctness.  IOW,

for a standard library function
the *most* important thing is ease of use
and good error handling, I think

It doesn't matter how easy to it is to use if it gives you the wrong
answer.
It doesn't matter how well it handles errors if it gives you the wrong
answer.
It doesn't matter how fast it is if it gives you the wrong answer.
It doesn't matter how much memory it uses if it gives you the wrong
answer.

Correctness comes first. Then we can argue about everything else.
 
