By Jack Ganssle
Topsy Turvey
Published 1/08/2004
Computers are amazing things. My desktop executes billions of instructions per second, even when sitting more or less idle. It reliably - perfectly, actually - moves data to and from a billion bytes of DRAM in nanoseconds while sucking huge files from 400 GB of hard disk space. Days and at times even weeks go by without a single glitch, yet every second of each of those days the machine slams vast amounts of data around. A single bit error will crash the machine. Yet the occasional problems all arise from imperfect software; the hardware operates flawlessly.
The engineers who design computers work with imperfect components. No one really knows the characteristics of the ICs, capacitors and every other element they use. That 1k resistor might actually be 1024 ohms, and though it's rated at ¼ watt odds are it'll handle a bit more than that, if used in a reasonably-ventilated space in a non-extreme environment. The 5 volt power supply might be putting out more like 4.95 volts, and that figure is likely to change as components age over the years and the mains vary due to summer air-conditioning demands. The track on the PC board doesn't really act like a wire at all; it's a complex transmission line which reflects, attenuates and distorts the digital signal it carries.
But the PC operates reliably because engineers realize their components are less than perfect. They design margin into every aspect of the system. In an ideal world perhaps just a single electron is enough to distinguish a zero from a one, but in the grimy reality of engineering EEs push thousands or millions of electrons down each wire. Capacitors are hard to make precisely; some come with wild tolerances like -20/+80%. We're not really sure how many farads the vendor will provide, so we design in plenty of margin to ensure success.
Margin is the essence of reliable engineering. Civil engineers don't know exactly how strong a bridge beam will be, especially when it's a concrete structure poured on-site by bored and careless laborers. The bridge stands because the beam is two or three times stronger than absolutely needed. When building the Brooklyn Bridge, Roebling discovered that the wire used to suspend the entire bridge was of inferior quality (The Great Bridge by David McCullough is a wonderful book on the subject). Yet his design had so much margin that the bridge still stands a century later, still held up by some of that bad wire.
In the firmware world we work with perfect components. A one is a one is a one. But there's no margin in the world of firmware engineering. One uninitialized variable, a single miscomputed bit, or just one mismatched push/pop pair causes the application to crash. If the user types in unexpected data or operates knobs out of sequence our code might kill someone; the Therac-25 is an example (see http://sunnyday.mit.edu/papers/therac.pdf).
Our programs are like bridges without margin, likely to collapse under the most feeble stress.
In truth, there are some techniques that add a bit of margin to our code. Build the system around an MMU and individual tasks might fail, but a very smart OS can save the system. Pepper the code with assert() calls to find error conditions and then take corrective action. Add redundant execution streams to identify and repair software errors. Include stack monitors, malloc() traps, and plenty of other instrumentation to build fault-tolerant code. Of course, few of us actually do any of this.
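To make two of those ideas concrete, here is a minimal sketch in C of a check that recovers instead of aborting, plus a simple stack monitor. Every name in it (CHECK, fault_count, set_motor_pwm, MOTOR_PWM_MAX, stack_paint, stack_unused_words) is hypothetical and invented for illustration. It also assumes that "motor off" is the safe state and that the stack grows downward from the top of the array. Real firmware would log faults somewhere more useful than stderr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* All names below are hypothetical, chosen only for this sketch. */

unsigned fault_count = 0;   /* how many runtime checks have failed so far */

/* Record a failed check and return 0 so the caller can recover. */
int check_failed(const char *expr, const char *file, int line)
{
    fault_count++;
    fprintf(stderr, "CHECK failed: %s (%s:%d)\n", expr, file, line);
    return 0;
}

/* Like assert(), but evaluates to 1 (ok) or 0 (failed) instead of aborting. */
#define CHECK(cond) ((cond) ? 1 : check_failed(#cond, __FILE__, __LINE__))

#define MOTOR_PWM_MAX 1000  /* assumed full-scale duty cycle */

/* Returns the duty cycle that would actually be applied. */
int set_motor_pwm(int requested)
{
    if (!CHECK(requested >= 0 && requested <= MOTOR_PWM_MAX)) {
        return 0;   /* corrective action: assumed safe state is "motor off" */
    }
    return requested;
}

#define STACK_FILL 0xDEADBEEFu

/* Paint a stack region with a known pattern before the task starts. */
void stack_paint(uint32_t *stack, size_t words)
{
    assert(stack != NULL);  /* programmer error, not a runtime fault */
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Count words never written, assuming the stack grows down from the
   highest index toward index 0. A result near 0 means almost no margin. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t unused = 0;
    while (unused < words && stack[unused] == STACK_FILL) {
        unused++;
    }
    return unused;
}
```

A periodic housekeeping task could call stack_unused_words() on each task's stack and raise an alarm when the count drops below some threshold, which is about as close as firmware gets to measuring its own margin. The cost, as with every technique in this list, is design effort rather than parts.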
Yet these strategies reveal that firmware is topsy turvey compared to any other form of engineering. Software margins come at the expense of vastly increased design costs, with, in these days of cheap transistors, little in the way of increased production expenses. The Shuttle's code is probably the best software ever written. Price tag: $1000 per line.
Bridge-building, though, is all about materials cost. A mere stroke of the designer's pen increases the size of a beam, but perhaps doubles production costs. The wise EE who's worried about pushing 0.2 watts through a ¼ watt resistor simply uses the next size up. There's zero design effort but substantial recurring costs.
Perhaps software engineering is somewhat akin to automotive design. The car industry must minimize recurring costs - which include the costs of recalls and repairing defects. So they spend billions engineering a new product. Or maybe it's like building a spacecraft; of the $800 million spent on the Mars Exploration Rovers, only a pittance is in materials cost. Engineering sucked up the largest chunk, because failure is intolerable.
The automotive and spaceflight industries share the same philosophy: perfection is worth plenty of engineering dollars. Sadly, that's a far cry from the approach of most firmware engineering projects, where nothing is more important than minimizing NRE and the schedule.
What do you think? Can we put more margin into our firmware? Is it worth the extra time and effort?