Frederick said:
I thought it might be interesting to share experiences of tracking down a
subtle or mysterious bug. I myself haven't much experience with tracking
down bugs, but there's one in particular which comes to mind.
I'm sure several regulars here have more interesting stories...
Ok, I'll bite.
One of the most unusual things I've tracked down was a production-level
C program that suddenly failed with the addition of a single comment in
a single .c file. Mind you, this was an application deployed on custom
hardware and in active use at thousands of sites worldwide. It wasn't
even a new comment -- it was the addition of about 30 characters to an
existing comment.
The usual suspects were considered. Perhaps it was a build-order
issue, I thought. I've run into those before. If there are data
corruption problems in the code, changes in the order that files are
compiled and linked can cause the linker to allocate different storage
locations for different variables. This can cause a "more critical"
variable (e.g. a pointer) to be clobbered rather than a harmless one
(e.g. a statistics counter) when corruption occurs.
No dice. My original build, a clean rebuild, and a rebuild of a clean
build with several hundred random files touched all failed. Hmmm. At
best this was inconclusive.
Now this was an embedded product running the vxWorks RTOS. In the bad
old days, on the bad old MMU-less platforms, it could take a real
effort to get debugging information out of a production board if it
crashed hard. Unfortunately, I couldn't track down any boards with
debug facilities, so I had to disassemble the chassis and have a rework
tech solder a DB-9 on it so I could get some serial output from it (my
soldering isn't the greatest).
Much to my chagrin, I quickly realized that any substantial code
modifications other than that comment line caused the code to spring
back to life. "If you can't beat 'em, diff 'em", I thought. I
reverted the original change and rebuilt the file, except this time
saving the assembly and preprocessed output. I then re-added the
problematic comment and rebuilt, again saving the assembly and
preprocessed output. On each build I set the randomization seeds to
the same values, so any pseudo-random numbers used by the compiler
would have the same values on both builds.
I fired up the graphical diff program. No changes to the preprocessed
files except the line numbers, which differed by one. No changes to
the assembly files, except...
.... A single additional line of assembly code. Apparently a
preprocessor macro in a header file far, far away had decided to use
the value of __LINE__ for field debugging of production code. Since
the CPU was an ARM7, and an ARM data-processing instruction can only
encode a limited immediate (an 8-bit value rotated by an even amount,
packed into a 12-bit field), and (shockingly) the __LINE__ value had
landed exactly on the encoding boundary, the compiler had generated a
"load immediate; increment by 1" instruction sequence instead of just
a "load immediate".
But who cares!?! The instruction sequence was correct, anyways.
Perhaps an assembler bug? This did not seem to be likely, given "The
Two General Rules of Compiler and Assembler Bugs":
1. It's not a compiler or assembler bug.
2. See rule #1*.
* Unless you can prove it.
I verified with objdump and the ARM Architecture Reference Manual that
the opcodes being generated by the assembler were indeed correct.
Hmmm.
Every code modification I could think of caused things to start working
again. Things were starting to point toward a hardware bug of some
kind. But given "The Two General Rules of Hardware Bugs": ...
1. It's not a hardware bug.
2. See rule #1*.
* Unless you can prove it.
.... I now had to prove it, especially since the units were already in
the field. On a hunch I looked at the assembly code directly after
this new additional instruction. I saw a load of a value from memory
that was added to another value and used as a pointer.
I inserted a jump to a routine I wrote in assembly that moved the
freshly-loaded value to a known memory location. It then jumped to
some C code in another translation unit that dumped the value, and then
reloaded and re-dumped it.
The value of the original read was 0, and the reload of the same
location produced a valid pointer value. Obviously they should have
been equal (I had already ruled out concurrent access problems).
Turns out the memory subsystem was laid out by a junior designer at
another company who had incorrectly chained some of the RAM clock
traces (IIRC) when they should all have been of equal length. This
particular instruction sequence, combined with the peculiar cache
behavior of the application, had caused it to issue memory reads that
exposed the RAM subsystem's timing problems. Rumor had it that somebody
at the other company had done a quickie board spin to increase the
amount of RAM without properly reviewing the change. Amazingly, the
rest of the half-million lines of code seemed to run just fine. How, I
will never know.
The entire thing was so unlikely that I couldn't help but think,
"Great. I'll probably be hit with a meteorite on the way to my car
tonight." Fortunately the bad luck stopped with a new spin of the
board, and I'm still here to tell the tale.
Mark F. Haigh
(e-mail address removed)