By Jack Ganssle

Just Reset It

In 2003 a Boeing 747-400 aircraft lost all engine and flight displays. Pilots flew on backup instruments for 45 minutes before ground technicians radioed back a fix. In 2001 the same type of aircraft experienced the same problem, which in this case wasn't repaired till the plane landed.

In both cases the fix was the same: cycle the circuit breakers. Punch reset, hit control-alt-delete, cycle power.

Though various authorities continue to look for the problem's root cause, Boeing has issued an interim solution: cycle the breakers. Just reset it.

A software problem locked up the Pathfinder spacecraft's computer as it descended to a landing on Mars. The watchdog timer brought the system back to life. It just reset the CPU.

The Clementine lunar mapper dumped all of its fuel when the software ran amok. There was no watchdog. The mission was lost because ground controllers couldn't just reset it.

One reader wrote that his stove's oven fan went whacko; apparently computer-controlled, its health was restored when he cycled power. He just reset it. My 4-year-old niece offered a helpful suggestion while I was in the middle of resolving a LAN routing problem: "Turn it off and turn it on, Uncle Jack; that always works for me." And we all know how to fix a Windows machine that's low on resources.

We just reset it.

(I'm fortunate; in this neighborhood the electric company provides regular power cycling as a customer service.)

I wrote a series of articles about watchdog timers in ESP (http://embedded.com/story/OEG20021211S0032, http://embedded.com/story/OEG20030115S0042, and http://embedded.com/story/OEG20030220S0037), and then condensed them to a single piece, adding drawings and more thoughts (https://www.ganssle.com/watchdogs.pdf). In two months on-line it has been downloaded over 4000 times. Developers are apparently aching for a device that brings crashed systems back to life. Something that just resets it.

The culprit is buggy code, of course. The software revolution gave us tremendous functionality in the most mundane products. And it takes those capabilities away, randomly, usually at the most inconvenient moments. Sometimes, for inexplicable reasons, doing things in exactly the same manner we've used for months leads to a crash. But that's not a problem: cycle power or yank the batteries for 5 minutes. Just reset it.

We've created new life strategies to cope with the problems. My family calls me the "save it" czar, since anytime I notice anyone working on a document, spreadsheet, or other data-manipulating tool, and the task bar is labeled "new document" instead of some filename that indicates at least one save took place, I slap them around. Metaphorically, of course. When using Word my left hand subconsciously rolls through the alt-f-s save mantra every 10 or 15 seconds. Mostly that's a habit induced by past years of struggles with Windows 3.1, 95, and 98. XP has been astonishingly reliable.

One friend adamantly insists that our code should be perfect. Once I agreed. Small apps sporting 10k lines of code or so are truly tractable.

Old-timers will recall the debates that raged 30 years ago about reset switches. Were they needed? Desirable? Or perhaps a red flag to customers that even the vendor didn't trust the code they were shipping?

Today there are no reset switches, and products are huge, often employing hundreds of thousands of lines of code. Fact is, humans write this stuff. Humans are, last I checked, imperfect. Our work products always reflect ourselves, for better or for worse. Bugs abound.

I think the next great change in firmware development will be self-recovering code: software that detects failures and initiates a graceful recovery transparent to the user. One approach is to carefully segment the code into tasks, each protected by an MMU, coupled with exception handlers smart enough to restart the flawed thread. Now that transistors are free, why not stick an MMU even in a cheap 8-bitter?
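To make the restart-the-flawed-thread idea concrete, here is a minimal sketch, written in Python on a host rather than as firmware on a target. It is an illustration under loud assumptions, not anyone's shipping design: the names `Task`, `run`, `max_restarts`, `counter`, and `flaky` are all hypothetical, an ordinary Python exception stands in for the MMU fault, and a per-task dictionary stands in for the memory region the MMU would protect.

```python
class Task:
    """One isolated unit of work with private state (hypothetical model)."""

    def __init__(self, name, step, max_restarts=3):
        self.name = name
        self.step = step                  # callable(state): one slice of work
        self.max_restarts = max_restarts
        self.restarts = 0
        self.state = {}                   # stands in for MMU-protected memory
        self.dead = False

    def restart(self):
        """Throw away the task's private state and start it from scratch."""
        self.state = {}
        self.restarts += 1


def run(tasks, ticks):
    """Round-robin over the tasks; a fault restarts only the faulting task."""
    for _ in range(ticks):
        for task in tasks:
            if task.dead:
                continue
            try:
                task.step(task.state)
            except Exception:
                # Stand-in for the exception handler an MMU fault would invoke.
                if task.restarts >= task.max_restarts:
                    # Out of retries: a real system would escalate here,
                    # perhaps to the old standby of resetting everything.
                    task.dead = True
                else:
                    task.restart()


def counter(state):
    """A well-behaved task: just counts its own ticks."""
    state["n"] = state.get("n", 0) + 1


def flaky(state):
    """A buggy task: crashes on every third tick since its last restart."""
    state["n"] = state.get("n", 0) + 1
    if state["n"] == 3:
        raise RuntimeError("simulated wild pointer")


if __name__ == "__main__":
    good = Task("counter", counter)
    bad = Task("flaky", flaky)
    run([good, bad], ticks=7)
    print(good.name, good.state, "restarts:", good.restarts)
    print(bad.name, bad.state, "restarts:", bad.restarts)
```

The point of the sketch is the asymmetry: the healthy task never notices that its neighbor crashed twice, and the retry limit keeps a hopelessly broken task from being restarted forever. What a real handler should do once retries run out is a design decision the sketch deliberately leaves open.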

But until then we'll use the same old technique.

We'll just reset it.