By Jack Ganssle
Nuclear Exception Handler
Published 10/19/2005
Sources (http://www.azcentral.com/business/articles/1014paloverde14-ON.html) report that the Palo Verde nuclear power plant in Arizona has been running for 19 years with a latent defect in its emergency coolant system. Sketchy technical details leave much to the imagination. I imagine the operators run periodic tests of the safety gear. But it's likely those tests don't simulate every possible failure mode.
Kudos to the Nuclear Regulatory Commission for finding the bug, and let's breathe a sigh of relief that the worst didn't happen.
There's an intriguing parallel to software development. Exception handlers - the code's safety system - are just as difficult to implement correctly and fully test.
Exception handlers are those bits of firmware we hope never execute. They're stuffed into the system to deal with the unexpected, just as a reactor's emergency cooling system is only there for the unlikely catastrophic fault. But the unexpected does happen from time to time. Hardware fails, unanticipated inputs crash the code, or an out-and-out design flaw - a bug - causes the code to veer out of control, with luck throwing an exception.
That's when the safety systems engage. An exception handler detects the failure and takes appropriate action. If the code controls a nuke plant, it may very well dump emergency coolant into the core. An avionics unit probably switches over to backup equipment. Consumer devices might simply initiate a safe reset.
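Here's the shape of that last case in C. It's a minimal sketch only; the three calls (outputs_to_safe_state, log_fault, system_reset) are hypothetical stand-ins for whatever the real hardware provides:

```c
#include <stdint.h>

/* Placeholder implementations; a real build talks to the hardware. */
static void outputs_to_safe_state(void) { /* de-energize the actuators */ }
static void log_fault(uint32_t code)    { (void)code; /* persist the fault code */ }
static void system_reset(void)          { /* pull the reset line */ }

/* Invoked by the CPU on a hard fault, illegal instruction, and the like. */
void fault_handler(uint32_t fault_code)
{
    outputs_to_safe_state();   /* first, make the hardware safe */
    log_fault(fault_code);     /* then record evidence for the post-mortem */
    system_reset();            /* restart into a known-good state */

    for (;;)                   /* never fall back into corrupted code */
        ;
}
```

The ordering matters: make the hardware safe before doing anything as leisurely as logging, and never return into code that may be corrupt.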
But exception handlers are notoriously difficult to perfect. It's hard to invoke those snippets of code since the developer must perfectly simulate something that should never happen. It's even harder to design a handler that deals with every possible failure mode.
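One partial answer is to design the test points in. If, say, every allocation funnels through a wrapper, a bench test can force the "impossible" failure on demand and watch the handler respond. This is an illustrative sketch; the names xmalloc and fail_next_alloc aren't from any library:

```c
#include <stddef.h>
#include <stdlib.h>

static int fail_next_alloc;   /* test hook: set nonzero to simulate exhaustion */

/* All production code allocates through this wrapper, never raw malloc. */
void *xmalloc(size_t n)
{
    if (fail_next_alloc) {
        fail_next_alloc = 0;
        return NULL;          /* simulated out-of-memory */
    }
    return malloc(n);
}
```

A unit test sets the hook, calls the code under test, and asserts that it degrades gracefully instead of dereferencing NULL.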
My collection of firmware disasters is rife with system failures directly traceable to faulty exception handlers.
The NEAR spacecraft was, ah, "near"ly lost when an accelerometer transient invoked an error script... but the wrong, untested version had been shipped, causing thrusters to dump most of the fuel.
Four years earlier Clementine was lost when an unhandled floating point exception crashed the code and spewed all of the fuel into space.
The first Mars Exploration Rover, one of the wonderful twin robots today roaming the planet, crashed when the filesystem filled, and the exception handlers repeatedly tried to recover by allocating more file space.
Ariane 5 rained half a billion dollars of aluminum over the Atlantic when the inertial navigation units shut down due to an overflow. Yet they continued to send data to the main steering computer, asserting diagnostic bits meaning "this data is no good." The computer ignored the diagnostics, accepted the bad data, and swiveled the nozzles all the way to one side.
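The Ariane lesson, in miniature, is a habit more than a technology: never consume a measurement without checking its validity bits. The struct and flag below are invented for illustration, not taken from the Ariane code:

```c
#include <stdint.h>
#include <stdbool.h>

#define DATA_INVALID  (1u << 0)   /* "this data is no good" */

struct ins_sample {
    uint16_t status;    /* diagnostic bits from the inertial unit */
    int32_t  attitude;  /* meaningless when DATA_INVALID is set */
};

/* Returns true and writes *out only when the sample is trustworthy. */
bool read_attitude(const struct ins_sample *s, int32_t *out)
{
    if (s->status & DATA_INVALID)
        return false;   /* reject it; fall back to the redundant unit */
    *out = s->attitude;
    return true;
}
```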
In the case of firmware I believe a lot of the problem stems from overstressed engineers putting too little time into analyzing potential failures. But that's certainly not the case at the Arizona reactor. With an entire city at risk from a core meltdown, surely the engineers used every tool possible, like Failure Mode and Effects Analysis, to ensure the plant responds correctly to every possible insult.
Perhaps we human engineers of all stripes - software, mechanical, and nuclear - are just not good at anticipating failure modes.
That's a scary thought.
What's your take on safety systems and exception handlers?