The Ariane 5 incident doesn't tell us anything about exception handling,
data typing, scope creep or unit testing. None of those was the
culprit. It _does_ tell us a few things about requirements /
specification (mis-)management.
It tells us about all those things, which is why i mentioned them. And
more - i should also have mentioned process management.
The fundamental failure was about requirements, absolutely. That's what i
referred to as scope creep - the scope of the inertial navigation system
was originally defined as being Ariane 4, but crept to include Ariane 5,
without this being properly addressed.
But that was not the only failure. There were several points at which
something could have been done differently which would have saved the
rocket. Off the top of my head:
1. The module that failed was a pre-launch calibration daemon in the
inertial navigation system; it had no use at all after launch. If it had
been shut down at launch, the failure would not have occurred.
2. IIRC, the pre-launch procedure had changed such that the daemon was not
needed anyway. If it had been removed, the failure would not have
occurred.
3. The failure involved a cast from (in Java terms) a double (used to
capture an instrument reading) to a short (used for calculations) which
overflowed. If doubles had been used for calculation, the failure would
not have occurred.
4. The cast was not protected by a suitable exception handler. If it had
been (although i'm not sure what the handler would actually do), the
failure would not have occurred.
5. The inertial navigation system's top-level exception handler responded
to a crash by writing diagnostic information to the same data bus used for
output, without any metadata indicating that it was diagnostics rather
than data; the guidance computer interpreted it as data, and went wild. If
the diagnostic information had been written elsewhere, or had been marked
and subsequently recognised by the guidance computer as being such rather
than data, the failure would not have occurred.
6. The combination of a real inertial navigation system and a real
guidance computer was never tested with real sensor inputs. The guidance
computer was tested with a mock inertial navigation system, which did not
accurately reproduce the real system's faulty behaviour. It was a unit
test rather than an integration test. If the test had been an integration
test, the fault would have been detected long before launch, and the
failure would not have occurred.
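To make points 3 to 5 concrete, here is a rough sketch in the same "Java terms". Everything in it is hypothetical: the class, method and enum names are invented for illustration and have nothing to do with the real flight software, and the policies chosen (saturate on overflow, keep the last good value when a diagnostic frame arrives) are assumptions, not a claim about what the right handler would have been. Note also that a plain Java cast never throws; it wraps silently, so any protection has to be an explicit check.

```java
// Hypothetical illustration only; none of these names come from the real system.
final class InertialSketch {

    // Point 3: an unprotected narrowing cast. Java converts the double to an
    // int first, then keeps only the low 16 bits, so large values wrap.
    static short uncheckedToShort(double reading) {
        return (short) reading;
    }

    // Point 4: an explicit range check. Saturating is one assumed policy;
    // NaN is arbitrarily mapped to 0 here.
    static short saturatingToShort(double reading) {
        if (Double.isNaN(reading)) {
            return 0;
        }
        if (reading >= Short.MAX_VALUE) {
            return Short.MAX_VALUE;
        }
        if (reading <= Short.MIN_VALUE) {
            return Short.MIN_VALUE;
        }
        return (short) reading; // in range: truncates toward zero
    }

    // Point 5: tag every frame on the shared bus so that the consumer can
    // tell diagnostics from data.
    enum FrameKind { DATA, DIAGNOSTIC }

    static final class Frame {
        final FrameKind kind;
        final short payload;

        Frame(FrameKind kind, short payload) {
            this.kind = kind;
            this.payload = payload;
        }
    }

    // Consumer side: only DATA frames update the value. Keeping the last good
    // value for anything else is an assumed policy.
    static short nextGuidanceInput(Frame frame, short lastGood) {
        return frame.kind == FrameKind.DATA ? frame.payload : lastGood;
    }

    public static void main(String[] args) {
        System.out.println(uncheckedToShort(40000.0));  // -25536 (wrapped)
        System.out.println(saturatingToShort(40000.0)); // 32767 (clamped)

        Frame dump = new Frame(FrameKind.DIAGNOSTIC, (short) -25536);
        System.out.println(nextGuidanceInput(dump, (short) 100)); // 100
    }
}
```

Neither half is clever. The point is only that each is a few lines, and either would have broken the chain on its own.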
Yes, you can identify a root cause, in the form of a mistake in the
requirements process. But you can also identify a series of other mistakes
which enabled that mistake to cause the failure. To pay attention only to
the root cause and discard the other mistakes is foolish.
tom