Frederick said:
I thought it might be interesting to share experiences of tracking down a
subtle or mysterious bug. I myself haven't much experience with tracking
down bugs, but there's one in particular which comes to mind.
I'm sure several regulars here have more interesting stories...
Ok, I'll bite.
One of the most unusual things I've tracked down was a production-level
C program that suddenly failed with the addition of a single comment in
a single .c file. Mind you, this was an application deployed on custom
hardware and in active use at thousands of sites worldwide. It wasn't
even a new comment -- it was the addition of about 30 characters to an
existing comment.
The usual suspects were considered. Perhaps it was a build-order
issue, I thought. I've run into those before. If there are data
corruption problems in the code, changes in the order that files are
compiled and linked can cause the linker to allocate different storage
locations for different variables. This can cause a "more critical"
variable (e.g. a pointer) to be clobbered rather than a harmless one
(e.g. a statistics counter) when corruption occurs.
No dice. My original build, a clean rebuild, and a rebuild of a clean
build with several hundred random files touched all failed. Hmmm. At
best this was inconclusive.
Now this was an embedded product running the vxWorks RTOS. In the bad
old days, on the bad old MMU-less platforms, it could take a real
effort to get debugging information out of a production board if it
crashed hard. Unfortunately, I couldn't track down any boards with
debug facilities, so I had to disassemble the chassis and have a rework
tech solder a DB-9 on it so I could get some serial output from it (my
soldering isn't the greatest).
Much to my chagrin, I quickly realized that any substantial code
modifications other than that comment line caused the code to spring
back to life. "If you can't beat 'em, diff 'em", I thought. I
reverted the original change and rebuilt the file, except this time
saving the assembly and preprocessed output. I then re-added the
problematic comment and rebuilt, again saving the assembly and
preprocessed output. On each build I set the randomization seeds to
the same values, so any pseudo-random numbers used by the compiler
would have the same values on both builds.
I fired up the graphical diff program. No changes to the preprocessed
files except the line numbers, which differed by one. No changes to
the assembly files, except...
.... A single additional line of assembly code. Apparently a
preprocessor macro in a header file far, far away had decided to use
the value of __LINE__ for field debugging of production code. Since
the CPU was an ARM7, and an ARM data-processing instruction can only
encode a limited immediate (an 8-bit value rotated by an even amount,
packed into a 12-bit field), and (shockingly) the __LINE__ value had
landed exactly on the encoding boundary, the compiler had generated a
"load immediate; increment by 1" instruction sequence instead of just
a "load immediate".
But who cares!?! The instruction sequence was correct, anyways.
Perhaps an assembler bug? This did not seem to be likely, given "The
Two General Rules of Compiler and Assembler Bugs":
1. It's not a compiler or assembler bug.
2. See rule #1*.
* Unless you can prove it.
I verified with objdump and the ARM Architecture Reference Manual that
the opcodes being generated by the assembler were indeed correct.
Hmmm.
Every code modification I could think of caused things to start working
again. Things were starting to point toward a hardware bug of some
kind. But given "The Two General Rules of Hardware Bugs": ...
1. It's not a hardware bug.
2. See rule #1*.
* Unless you can prove it.
.... I now had to prove it, especially since the units were already in
the field. On a hunch I looked at the assembly code directly after
this new additional instruction. I saw a load of a value from memory
that was added to another value and used as a pointer.
I inserted a jump to a routine I wrote in assembly that moved the
freshly-loaded value to a known memory location. It then jumped to
some C code in another translation unit that dumped the value, and then
reloaded and re-dumped it.
The value of the original read was 0, and the reload of the same
location produced a valid pointer value. Obviously they should have
been equal (I had already ruled out concurrent access problems).
Turns out the memory subsystem was laid out by a junior designer at
another company who had incorrectly chained some of the RAM clock
traces (IIRC) when they should all have been of equal length. This
particular instruction sequence, combined with the peculiar cache
behavior of the application, had caused it to issue memory reads that
exposed the RAM subsystem's timing problems. Rumor had it that somebody
at the other company had done a quickie board spin to increase the
amount of RAM without properly reviewing the change. Amazingly, the
rest of the half-million lines of code seemed to run just fine. How, I
will never know.
The entire thing was so unlikely that I couldn't help but think,
"Great. I'll probably be hit with a meteorite on the way to my car
tonight." Fortunately the bad luck stopped with a new spin of the
board, and I'm still here to tell the tale.
Mark F. Haigh
(e-mail address removed)