Kenneth Brody said:
(Consider the case where a database index corruption is detected
150,000 records into a report, and the corruption occurred some 20,000
records earlier. I'd hate to think about tracking that one down without
a debugger. "Could" it have been done without one? Yes, which I guess
technically means the debugger wasn't "essential". Then again, you could
say that a C compiler isn't "essential" either, as an experienced
programmer could hand-compile the code without one.)
If you can pin down roughly where it's happening and look at just that
region, a single execution trace combined with a desk-check is often
enough to find the problem without a debugger.
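By "execution trace" I mean nothing fancier than a logging hook like
the one below, dropped into the processing loop of the program under
suspicion (this is a made-up sketch, not Kenneth's program; the record
layout and the window bounds are invented):

    #include <stdio.h>

    /* Invented record layout; stand-in for the report's rows. */
    struct record { long id; long index_key; };

    void process_record(const struct record *r, long recno)
    {
        /* Trace only the window where the corruption is suspected
           (roughly records 130,000-150,000 in Kenneth's scenario),
           then desk-check the output offline. */
        if (recno >= 130000L && recno <= 150000L)
            fprintf(stderr, "rec %ld: id=%ld key=%ld\n",
                    recno, r->id, r->index_key);

        /* ... normal processing ... */
    }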
If all you have is a post-mortem dump, I've yet to come across a
debugger that will let you step backwards over 20K records' worth of
execution (though I'm not going to claim that such debuggers don't
exist).
So I'm not convinced that this is the best example of where a debugger
is essential.
War story:
The last time I encountered a comparable problem, the debugging tool of
choice turned out to be Excel.
It was a program that was supposed to compare records in the outputs of
multiple processes running on the same input data and combine results
that came from the same feature in the input; on occasion it would fail
to combine results when it should have.
Using a debugger to find the problem would have required either an
independent implementation of that module (to trap when the results
disagreed; this would have been the easiest way to programmatically
detect an error, and there's a sketch of the idea below) or
single-stepping through a few hours' worth of recorded data until we
could eyeball a "should have been combined" output that wasn't
combined.
(If we had a dataset that exhibited the bug at a known point, we could
have done an offline run and interrupted it just before we got to that
point, but finding that point would have required the pre-debugger
analysis we ended up doing anyway, and once that analysis was done,
setting up the test run under the debugger (never mind actually running
it) would have taken at least as long as the route we actually took
from there.)
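For concreteness, that "trap the disagreement" approach would have
amounted to something like the following. Everything here (the record
layout and both toy "implementations") is invented for the sketch; a
real reference would be a genuine brute-force reimplementation of the
module rather than the one-liner below:

    #include <stdio.h>
    #include <stdlib.h>

    #define BUCKETS 1024L

    struct record { long feature_id; double value; };

    /* Stand-in for the module under test. */
    static long bucket_fast(const struct record *r)
    {
        return r->feature_id % BUCKETS;
    }

    /* Independent "reference" implementation. (In C, a % b is
       defined as a - (a / b) * b, so these two agree by
       construction; it's only here to show the shape of the
       cross-check.) */
    static long bucket_ref(const struct record *r)
    {
        return r->feature_id - (r->feature_id / BUCKETS) * BUCKETS;
    }

    int main(void)
    {
        struct record r;
        long recno = 0;

        /* Stream records through both implementations and stop
           dead at the first disagreement, so a debugger or core
           dump lands on the offending record. */
        while (fread(&r, sizeof r, 1, stdin) == 1) {
            recno++;
            if (bucket_fast(&r) != bucket_ref(&r)) {
                fprintf(stderr, "mismatch at record %ld (id %ld)\n",
                        recno, r.feature_id);
                abort();
            }
        }
        return 0;
    }

Run that over the recorded data and the abort() leaves you sitting
right on the first bad record instead of hours downstream of it.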
Then somebody decided to take an inputs-and-outputs dump from a large
run, and plot that with Excel; looking at the graphs let us find a
relatively isolated data point that exhibited the problem. Filtering
the (same) execution trace data to only look at what that data point
was doing turned up an anomaly that led to finding and fixing the bug
within about five minutes. (Total elapsed time between importing the
data into Excel and arranging a test run for the fix was under two
hours.)
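For anyone who wants to try the same trick: the dump side is nothing
more than writing one CSV row per record from inside the combine path
and importing the file into Excel. A rough sketch (the layout and
field names are invented; ours were specific to the module):

    #include <stdio.h>

    /* Invented layout: the inputs we combined on, plus the
       module's decision for that record. */
    struct trace_row {
        long feature_id;
        double x, y;      /* input coordinates worth plotting */
        int combined;     /* did the module merge this record? */
    };

    static void dump_row(FILE *csv, long recno,
                         const struct trace_row *t)
    {
        fprintf(csv, "%ld,%ld,%g,%g,%d\n",
                recno, t->feature_id, t->x, t->y, t->combined);
    }

    int main(void)
    {
        FILE *csv = fopen("trace.csv", "w");
        if (!csv)
            return 1;
        fprintf(csv, "recno,feature_id,x,y,combined\n");

        /* In the real program, dump_row() would be called from
           the combine path; one demo row here so the file isn't
           empty. */
        struct trace_row demo = { 42L, 1.5, 2.5, 0 };
        dump_row(csv, 1L, &demo);

        fclose(csv);
        return 0;
    }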
(This particular bug involved a global property of the interactions
between the data structures the module used internally and the
distribution of the input data, so running on a reduced dataset (which
would have made an interactive debugging session rather less painful)
would've hidden the bug, even though it was deterministic for a given
input stream. It also took a while to come up with a unit test that
would have caught it, given how much that depended on internal tuning
thresholds.)
dave