Bug/Gross InEfficiency in HeathField's fgetline program

Tor Rustad · Oct 8, 2007

Richard said:
Tor Rustad said:

The important thing is to avoid reading it n times.

If performance were the primary consideration, that's exactly what I'd have
done. Since it wasn't, it isn't. I considered robustness in the face of
long strings to be more important.

I didn't look at the original code.

In particular, if the function in question, was supposed to interact
with a human providing input, OP's objections was nonsense.

santosh · Oct 8, 2007

Richard said:
santosh said:

Yes, it's certainly possible. So is the reverse, of course. The only
proper course is to measure, realising that the measurement will be
specific to a particular implementation on a specific machine.

I have measured the performance of the code - my code - on an Athlon
1.4 (no slouch, but hardly a state-of-the-art mean machine) under gcc
2.95.3, using an input file over 10 Megabytes in size (a typical real
world input would be a handful of kilobytes). Exact input file size:
11057780 bytes. Number of lines: 120,000 (compared to a typical real
world input of perhaps a few dozen, or maybe three or four thousand
for a fairly large system).

The profiler reports that the program took 0.1 seconds to run (on
inputs that are orders of magnitude larger than would be expected in
production). The code whose performance is in question is called ONCE,
by the way, and the profiler (which claims to measure in microseconds)
reports that the function takes zero time to run. Obviously that can't
be literally true, but it's certainly too small for my gprof
implementation to measure. It might be reasonably argued that it takes
almost a microsecond.

Until recently I used, where possible, a handwritten assembler wrapper
that used the Pentium's RDTSC instruction. I've switched to Valgrind.

The purpose of the program is to take as input a list of error
messages and identifiers, and convert these into a .h file with
#defines for the error identifiers, and a .c file with a function that
converts a number into an error message.

It's a programmer's tool, and requires as input a file that is used by
a programmer to store an intelligent identifier/message pair. To edit
such a file, for a superhuman programmer like Chris Torek, might take
as little as - what, five seconds? (Wow, watch those fingers fly) And
the next step would be to run the program (perhaps automatically, on
saving the file). If this superhuman programmer did nothing but
updates to the input file all day every day, he would cause 86400/5 =
17280 program runs per day. If the file size is as large as in my test
(deeply unlikely in the real world, but just about possible for a
really, really, really large project), the total program time taken
would be a little less than half an hour (29 minutes 28 seconds, in
fact) - remember that this is for over seventeen THOUSAND runs.

If, for the sake of argument, the OP's code is correct and if it
reduces the cost of converting dots to underscores from 1 microsecond
to *zero*, the total time saving per day will be 0.01728 seconds. Over
a thousand years of 24hrs/day of running this program 17280 times a
day, the total time saved will be about an hour and three quarters.

Compare this to the time spent on the discussion so far.

Good point. By the way, I'd have thought something like Perl would be
better suited to your task. But of course, it's best to code it
whatever language you are most familiar with.

Richard · Oct 8, 2007

Joachim Schmitz said:
Indeed. Not mine though but Antonius'...
It is definitly easier to read. Yes it could have been written more compact,
I esp. disliked the "return t;" in 2 lines and with a different brace style
one could save another couple of lines and still keep it simple.

esp. in a debugger it is much easier to follow the flow if not everything is
done in one line

You are being purposely obstructive for some reason. No one mentioned 1
line. It is however important, in my experience, to keep tight code
close together so it can be examined without excessive scrolling or
white space cluttering up the display.

see above, because it is easier to debug.

No, it isn't. by adding the s++ line I suggested you have ability to
examine data on every iteration. Set your watched variables, step. Done.

For that reason I liked your improvement to Antonius' version (moving the ++
part into the otherwise empty body of the loop).

Two reasons : because it meant you didnt have the same s++ twice. Less
code as it happens. In addition a suitable break point to examine the
flow. Contrary to a lot of opinion here I belieb debuggers are very
useful and incredibly handy for teaching new programmers how a system
works. As well, of course, as being able to set watch points and HW
breakpoints.

?? You mean RH being the source and I defend it because of that? Or what
else are youn trying to say here?

Just that. I can see no advantages whatsoever to the original code. It
is wordy, spacey and inefficient.

We're talking about one call to strcpy, which in a decent implementation is
lightning fast and higly optimized, so saving it doesn't really save
much.

So, which, as a C programmer, do you prefer?

To me there is only one contender. Sorry and all that but I feel
strongly about it. Really.

Richard · Oct 8, 2007

santosh said:
Richard wrote:

As far as the second point applies to this specific example, it could be
possible that the strcpy implementation is more efficient than a manual
coded loop. This could become noticeable for large string numbers or
sizes.

That would be a point if it had any validity at all here. As it is, I'm
not sure it does. There was simply no need to copy the entire string
using strcpy and then loop through it again doing the replacement
IMO. However, I would be interested to hear than I am wrong. I just like
the "do it once" mantra.

Minor point, but since this whole thread is nit-picking, I though I'd
mention it.

It is not nit picking in the slightest.

Reducing a 20 line lump of code to 3 or 4 lines can does have big
ramifications on

code reads
maintenance
documentation
efficiency
debugging

Richard · Oct 8, 2007

santosh said:
Richard said:

santosh said:

Richard wrote:

[ ... ] heatjfield suggested [ ... ]

What is it with people mangling Richard's surname?

Click to expand...

It's sheer ignorance. Getting someone's name right is a basic mark of
respect, which is why I *always* take care with names - I hate getting
them wrong, and have done so only rarely. "Kylheku" and "Dijkstra" are
two that I've got wrong in the past, but I always get them right now.

If someone continually gets a name wrong, it's a reasonable sign that
they consider that person to be not worth the bother of taking trouble
for. Richard Riley's inability or unwillingness to take out a second
or two to get my name right is therefore suggestive. And if he is so
short of time and care when composing Usenet articles that he *doesn't
even have time to get my name right*, I see no reason why I or anyone
should accord his hurried, careless views any weight. Furthermore, I
would venture to suggest that there *may* be a correlation between
those who write hurried, careless Usenet articles and those who write
hurried, careless C.

And if someone habitually gets a particular name wrong in precisely
the same way that a violence-threatening troll habitually does (not
Richard Riley, I hasten to add, but this Antoninus Twink character
that everyone is treating so seriously), then it's hard to avoid the
conclusion that we're dealing with a sock puppet.

Click to expand...

Regarding your last paragraph, there is strong evidence that
this "Antoninus Twink" and "Paul", (having an email beginning
with "paulcr"), who posted a few months ago are the same.

<http://groups.google.com/group/comp.lang.c/msg/f2e171fc9691cd7d?dmode=source>
<http://groups.google.com/group/comp.lang.c/msg/b571a9a155facb48?dmode=source>

As you can see, the "User-Agent" and "NNTP-Posting-Host" fields in the
header are identical. Further the timezone is also the same.

Whether the "Paul" in the second message above is indeed the one who
threatened you is not so easily verifiable, but it seems very likely.

Why would you post that?

Off Topic for a start and NOTHING to do with C.

The poster was quite correct in his improvements of RHs code. Regardless
of who or what he is.

Richard · Oct 8, 2007

Antoninus Twink said:
Wading through 20 lines of code that do a 4-line job in a roundabout way
can also be expensive.

Rubbish. I sometimes wonder if you only program in a class room. Even
K&R have this type of thing by page 20.

The code is as simple as it can get. A malloc and a typical if then else
duplication of a string using ?: It can hardly get any simpler. It seems
you are not as open to code corrections and/or improvements as you
frequently state. Why am I not surprised?

Working out?!?!?! A glance should be enough.

So know we go from "not obvious" to "I am too self important to even
bother looking at it". Not much new there then.

It's a pair of declarations, a simple return statement, and two lines of
moderately dense but completely idiomatic C. It's not like you need to
set aside a rainy afternoon to dedicate to reading it.

Actually I suspect that you've read it closely, because you'd like
nothing more than to humiliate me by pointing out a bug in it. As it's
manifestly correct, you've turned to word games instead of admitting
that your own code might, just might, admit some improvement.

I hate to agree. But I agree.

Richard Heathfield · Oct 8, 2007

santosh said:

By the way, I'd have thought something like Perl would be
better suited to your task.

Perl is available in lots of places, it's true - but it has not yet been
ported *everywhere*.

But of course, it's best to code it
whatever language you are most familiar with.

Or indeed a language for which an implementation exists on the target
machine.

"Oh?" I hear you say. "What machine /is/ he targeting?" My answer is
simple: "how on earth should I know? Suddenly we need to know what machine
people run our code on? When did /that/ rule come in?"

Richard · Oct 8, 2007

santosh said:
The email address of the poster who wrote this threatening message:
<http://groups.google.com/group/alt.comp.lang.learn.c-c++/msg/82c5c4b2e59984e0?dmode=source>

And this message:
<http://groups.google.com/group/comp.lang.c/msg/b571a9a155facb48?dmode=source
&utoken=0qjyGSsAAAAoZoYpti9uhwqsYFEYUqETYYZ-k6tkIFJmSBA4M7hURISd8ZKZA3rNsODuE9WiCO4>

begins identically.

Your messages in this thread share with the second message linked above,
identical "User-Agent", "NNTP-Posting-Host" headers as well as the
timezone.

A is related to B.
B is realted to C.

So

A is related to C.

Santosh, you are not Heathfield's lapdog. No one cares. Can we stick to
the issue at hand - namely efficient coding in C and the possible
ramifications of wordy, overly engineered solution compared to small
precise, elegant solutions which utilise the features of the language.

Richard · Oct 8, 2007

Joachim Schmitz said:
Try to resist the temptation, it's hard, I know...

Bye, Jojo

To reiterate why : it makes it nigh on impossible to debug. And people
do use debuggers in the real world. A lot.

Richard Heathfield · Oct 8, 2007

Richard said:

Rubbish.

Precisely. And I don't intend to waste my time looking at rubbish posted by
one troll and lauded by another.

Joachim Schmitz · Oct 8, 2007

Richard said:
You are being purposely obstructive for some reason. No one mentioned 1
line.

The real work in Antonius' version is done in one line:
while(*u++=(*s=='.' ? s++, '_' : *s++));

It is however important, in my experience, to keep tight code
close together so it can be examined without excessive scrolling or
white space cluttering up the display.

Excessive scrolling and 20 lines (that could easily be deduced to some 15)
is not the same

No, it isn't. by adding the s++ line I suggested you have ability to
examine data on every iteration. Set your watched variables, step. Done.

Which is exactly my point. Yes, with your change this is possibnle, with
Antonius' it is not.

Two reasons : because it meant you didnt have the same s++ twice. Less
code as it happens. In addition a suitable break point to examine the
flow. Agreed
Contrary to a lot of opinion here I belieb debuggers are very
useful and incredibly handy for teaching new programmers how a system
works. As well, of course, as being able to set watch points and HW
breakpoints.

Just that. I can see no advantages whatsoever to the original code. It
is wordy, spacey and inefficient.

It could be condensed by using a different bracing style, without making it
more difficult to read.
The OP claimed a Bug and/or gross inefficiency. This is nonsense. There's no
bug and the inefficiency is minor and outwheight by the simplicity of the
code. Even if I see room for improvment.

So, which, as a C programmer, do you prefer?

I liked your version, for it's elegance and Richard's for it's simplicity.
Antonius' version is too terse, good for teaching C maybe, but not good for
production code.

Bye, Jojo

Richard Heathfield · Oct 8, 2007

Richard said:

Can we stick to
the issue at hand - namely efficient coding in C and the possible
ramifications of wordy, overly engineered solution compared to small
precise, elegant solutions which utilise the features of the language.

The code as written took about a minute to write, tops. EVEN IF the
suggested replacement (which, no matter what you may choose to believe, I
haven't bothered to analyse or even read) were correct and even if it were
able to do the same job in zero time, the total time it is likely to save
me over the rest of my expected lifetime is significantly lower than the
time it would take to do the necessary edit, compilation, and testing of
the replacement code.

The OP is suggesting micro-optimisation with a vengeance. He is suggesting
that I replace working code that is called *once* per program invocation
and which does its job so fast that gprof can't keep up. This is silly
beyond words. That you cannot understand this does not surprise me - after
all, you even struggle to get people's names right. Mind you, I've noticed
that you're coming along nicely on that front - another ten or fifteen
years at this rate of learning and you might just be ready to take your
place in this group as someone whose articles are actually worth reading
other than for mere rebuttal.

Keith Thompson · Oct 8, 2007

Richard said:
It was perfectly obvious what you meant. Word games don't even begin to
cover it.

Yes, it's perfectly obvious what he meant. It's also perfectly
obvious that if he had meant that he believes that it doesn't
correctly replace the code he wrote, the could and would have said so.
He didn't.

John Bode · Oct 8, 2007

Joachim Schmitz said:
Joachim Schmitz said:

news:[email protected]...

Click to expand...

[snip]

?? You mean RH being the source and I defend it because of that? Or what
else are youn trying to say here?

Click to expand...

Just that. I can see no advantages whatsoever to the original code. It
is wordy,

Translation: easy to read and understand.

spacey

Translation: easy to follow.

and inefficient.

Facts not in evidence. As others have pointed out, strcpy() may be
implemented in such a way that is faster than copying individual
characters in a loop. And elsethread RH has pointed out that this
code gets called *once*, that he has benchmarked it for obnoxiously
long inputs (tens of megabytes), and that the performance is perfectly
acceptable (< 1 us according to his profiler).

Antonius' replacement is terse to the point of being painful to read
(at least for these 42-year-old eyes). Some judicious use of
whitespace would help a *lot* (I almost missed the multiple assignment
the first time through); even so, the conditional expression is
especially jarring, and it took me a couple of minutes to convince
myself it was doing what I thought it was (as opposed to taking almost
no time to understand the original). Yes, it uses common C idioms,
but not all of us use all idioms equally (I almost never use a
conditional expression in my day-to-day work, nor do I commonly use
the comma operator, and I've used the two in combination maybe twice
in the last 15 years). Until someone profiles that code vs.
Richard's, it's an open question as to whether it's really more
efficient, or if that gain in efficiency is worth the cost of
readability.

Less so than readability and maintainability*. I will happily
sacrifice screen real estate for code that's easy to understand.

This makes no sense to me; if anything in your code is going to work
reasonably well, I would expect it to be the library calls.

So, which, as a C programmer, do you prefer?

RH's. It's easier to grok at first glance. It would be easier to fix
if a bug was found. It's easier to explain how it works. It's easier
to verify that it's correct.

To me there is only one contender. Sorry and all that but I feel
strongly about it. Really.

As do I. I've maintained code that read like RH's and code that read
like Antonius'. I strongly prefer working with the former, if for no
other reason than obvious flaws are easier to spot and fix. I *used*
to write code like Antonius' version, and it *always* bit me in the
ass six to nine months later when I'd forgotten about it and had to
figure out how it worked again.

* Production code must be, in order: 1. Correct (it doesn't matter
how fast your code is if it does the wrong thing); 2. Maintainable (it
doesn't matter how fast your code is if it can't be fixed or
upgraded); 3. Robust (it doesn't matter how fast your code is if it
dies at the first hint of bad input); 4. Efficient, but not at the
cost of 1, 2, or 3.

Richard · Oct 8, 2007

Joachim Schmitz said:
I liked your version, for it's elegance and Richard's for it's
simplicity.

Richard's was more complex. It used the same reference/dereference with
additional library calls.

Richard · Oct 8, 2007

Keith Thompson said:
Yes, it's perfectly obvious what he meant. It's also perfectly
obvious that if he had meant that he believes that it doesn't
correctly replace the code he wrote, the could and would have said so.
He didn't.

You don't have to defend Heathfield. His fairly unique style makes it
very clear, as did his followup, exactly what he meant.

He didn't even bother to read the code as he made clear in a later post.

Richard · Oct 8, 2007

John Bode said:
Joachim Schmitz said:

news:[email protected]...

Click to expand...

[snip]

Don't defend code just because of the source.
?? You mean RH being the source and I defend it because of that? Or what
else are youn trying to say here?

Click to expand...

Just that. I can see no advantages whatsoever to the original code. It
is wordy,

Click to expand...

Translation: easy to read and understand.

Nope. Not at all. Long winded and overly wordy. Far more difficult to
determine its use "at a glance.

Translation: easy to follow.

No. Spacey and using unnecessary real estate.

Facts not in evidence. As others have pointed out, strcpy() may be
implemented in such a way that is faster than copying individual
characters in a loop. And elsethread RH has pointed out that this
code gets called *once*, that he has benchmarked it for obnoxiously
long inputs (tens of megabytes), and that the performance is perfectly
acceptable (< 1 us according to his profiler).

Regardless, unnecessary call. Sorry.

Antonius' replacement is terse to the point of being painful to read
(at least for these 42-year-old eyes). Some judicious use of
whitespace would help a *lot* (I almost missed the multiple assignment
the first time through); even so, the conditional expression is

It was obvious to me and I suspect any C programmer who works daily with
C.

especially jarring, and it took me a couple of minutes to convince
myself it was doing what I thought it was (as opposed to taking almost
no time to understand the original). Yes, it uses common C idioms,
but not all of us use all idioms equally (I almost never use a
conditional expression in my day-to-day work, nor do I commonly use
the comma operator, and I've used the two in combination maybe twice
in the last 15 years). Until someone profiles that code vs.
Richard's, it's an open question as to whether it's really more
efficient, or if that gain in efficiency is worth the cost of
readability.

I agree about the comma. Hence my "improvement" by moving the s++ to the
body. Not only for reduced code, but for a suitable debugger point.

Less so than readability and maintainability*. I will happily
sacrifice screen real estate for code that's easy to understand.

those 4 lines are MUCH more maintainable then Heathfields.

This makes no sense to me; if anything in your code is going to work
reasonably well, I would expect it to be the library calls.

I dont know. I had to take a double look as to WHY he called
strcpy. There was simply no need.

RH's. It's easier to grok at first glance. It would be easier to fix
if a bug was found. It's easier to explain how it works. It's easier
to verify that it's correct.

I disagree on almost every point you have made. As is our right

As do I. I've maintained code that read like RH's and code that read
like Antonius'. I strongly prefer working with the former, if for no
other reason than obvious flaws are easier to spot and fix. I *used*
to write code like Antonius' version, and it *always* bit me in the
ass six to nine months later when I'd forgotten about it and had to
figure out how it worked again.

If you found those 4 lines hard to understand than, to be honest, I
wonder about your C experience. it was a basic string transfer with
character replacement. I dont mean this as an insult, but I have never
experienced code like RHs getting through code reviews on systems I have
worked on. It would be called "inefficient" and "beginner like".

* Production code must be, in order: 1. Correct (it doesn't matter
how fast your code is if it does the wrong thing); 2. Maintainable (it

Are there bugs in the replacement?

doesn't matter how fast your code is if it can't be fixed or
upgraded); 3. Robust (it doesn't matter how fast your code is if it
dies at the first hint of bad input); 4. Efficient, but not at the
cost of 1, 2, or 3.

Agreed. All of which are met by the replacement.

Less lines also tend to translate to less errors.

Antoninus Twink · Oct 8, 2007

The email address of the poster who wrote this threatening message:
<http://groups.google.com/group/alt.comp.lang.learn.c-c++/msg/82c5c4b2e59984e0?dmode=source>

And this message:
<http://groups.google.com/group/comp.lang.c/msg/b571a9a155facb48?dmode=source
&utoken=0qjyGSsAAAAoZoYpti9uhwqsYFEYUqETYYZ-k6tkIFJmSBA4M7hURISd8ZKZA3rNsODuE9WiCO4>

begins identically.

Your messages in this thread share with the second message linked above,
identical "User-Agent", "NNTP-Posting-Host" headers as well as the
timezone.

A is related to B.
B is realted to C.

So

A is related to C.

Except that both connections are, to say the least, tenuous. I mean,
people with slightly similar email addresses *must* be the same! And
after all, there's a *unique* slrn user who posts through aioe!

Put the two together and you get something completely far-fetched. Do
you honestly believe that I, Paulcr, or anyone else would disappear for
5 years then suddently reappear, still nursing a grudge against Mr
Heathfield?

On the subject of names, I've been criticized for writing HeathField as
two words (hardly unreasonable, given that Heath and Field are... well,
both words), but I notice that Joachim and John Bode have both mistaken
my pseudonym for Antonius in this thread. What's sauce for the goose...

But really, it's just amazing how ready people are to shy away from
actual technical discussion and resort to mud-slinging instead.

Richard · Oct 8, 2007

Antoninus Twink said:
Except that both connections are, to say the least, tenuous. I mean,
people with slightly similar email addresses *must* be the same! And
after all, there's a *unique* slrn user who posts through aioe!

Put the two together and you get something completely far-fetched. Do
you honestly believe that I, Paulcr, or anyone else would disappear for
5 years then suddently reappear, still nursing a grudge against Mr
Heathfield?

On the subject of names, I've been criticized for writing HeathField as
two words (hardly unreasonable, given that Heath and Field are... well,
both words), but I notice that Joachim and John Bode have both mistaken
my pseudonym for Antonius in this thread. What's sauce for the goose...

But really, it's just amazing how ready people are to shy away from
actual technical discussion and resort to mud-slinging instead.

I am astonished that people claiming to be professional programmers
could be in any way "confused" or "unsure" about your concise
replacement for Heathfield's version.

,----
| while(*u++=(*s=='.' ? '_' : *s))
| s++;
`----

I simply can not see the complication or the need to "analyse" this. Yes
if we were training in lesson 2 we might expand it out a little. but
only to the extent of removing the ?: usage to an if then else.

Willem · Oct 8, 2007

Richard wrote:
) Nope. Not at all. Long winded and overly wordy. Far more difficult to
) determine its use "at a glance.

And the proposed improvement is overly terse and compressed.
It might be that someone who writes C every day can easily understand what
it does, but you can't assume that the maintenance programmer that comes
across your code in a year or two is that well-versed in C.
This is one possible implementation:

char *dot_to_underscore(const char *src)
{
char *ret;
size_t i;

ret = malloc(strlen(src) + 1);
if (ret == NULL)
return NULL;

for (i = 0; src != '\0'; i++) {
if (src == '.') {
ret = '_';
} else {
ret = src;
}
}
ret = '\0';

return ret;
}

This is another, much more suited for reading by non-C programmers:
char *dot_to_underscore(const char *src)
{
return substitute(copy_string(src), '.', '_');
}

(Of course, you need to have suitable 'substitute' and 'copy_string'
functions.)

What I'm trying to say is: it's entirely possible that your code
will be read by someone who's not a C guru. And I can tell you, lots
of people out there maintaining C code only have passing familiarity.

By the way, the perl equivalent would be:
(ret = str) =~ s/\./_/g;
But that's off-topic.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Fibonacci	0	May 13, 2023
Adding adressing of IPv6 to program	1	Feb 16, 2023
C language. work with text	3	Dec 10, 2021
code review	26	Feb 6, 2004
Can't solve problems! please Help	0	Sep 26, 2022
compressing charatcers	35	Apr 2, 2014
Strange bug	65	Nov 19, 2010
K&R exercise 5-5	10	Feb 20, 2007

Bug/Gross InEfficiency in HeathField's fgetline program

Tor Rustad

santosh

Richard

Richard

Richard

Richard

Richard Heathfield

Richard

Richard

Richard Heathfield

Joachim Schmitz

Richard Heathfield

Keith Thompson

John Bode

Richard

Richard

Richard

Antoninus Twink

Richard

Willem

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads