Built-in Diagnostics
No system is useful unless it can be built in production. Add simple diagnostics.
Published in Embedded Systems Programming, April, 1996
By Jack Ganssle
I love small companies. Nimble operations running with a lean staff operate less by email and memo than by yelling across the room: "Joe, what version of the compiler are you using?" Workers in the smallest of outfits wear many hats, from that of software developer to production supervisor to chief maintenance engineer.
Compare this to the corporate behemoth. Employees packed like sardines into a sea of cubicles barely know their neighbors. Each is focused on one part of the company's activities - perhaps getting one particular subroutine to run. Few engineers see their product over its entire development cycle. Fewer still have the opportunity to see the widget in production, or to work with the technicians and assemblers who sweat 8 hours a day building the engineers' wonderful creation.
All employees are hopefully working towards common corporate goals, yet each has a different vision of the company's needs and problems. Too many developers, never exposed to the harsh realities of the production floor, fail to understand that small changes in the system's design can greatly reduce the time required to build, test, or repair a product. Engineering is only done once (well, OK, we do tend to iterate, sometimes forever, getting bugs out, but you get the idea), yet production goes on day after day, for the life of the product. Remove a few minutes of hassle from manufacturing and you've leveraged that few minutes by the number of units built.
Hardware designers know that pushing timing margins or ignoring electrical specs may result in a system that works on the bench, but that will be unreliable in production - and thus expensive to manufacture. In high volume situations they know how to bring test points out or otherwise expose the circuits so automated test systems can exercise every node.
Firmware folks largely haven't learned this lesson.
To a programmer the word "testing" conjures images of correctness proofs, exhaustive software trials, and code coverage analysis. A production person probably has never heard of any of these concepts; he looks at testing as the daily routine of ensuring each and every unit works correctly before being shipped. Very complex products are tested and repaired by technicians with little formal computer training. The best are usually culled and assigned to work in engineering support, leaving production with workers who may be skilled but who are certainly not rocket scientists. As software engineers it is our responsibility to the company to give the techs the tools they need to ship product. As software managers, it is our responsibility to convince management that this is an important and desirable goal.
Peripherals
I wrote about embedded monitors recently (February, 1996). If you've got a spare serial port available, by all means include a simple monitor in your code. By its nature firmware is an invisible black hole, barely understandable or even visible except to the developer. A monitor is one very cheap, very simple way to leave a back door in the code, to give access to the code itself, and to the CPU and it resources to future engineers and production people.
You can buy, borrow, or steal a monitor from numerous sources - the costs approach zero. Even a simple monitor lets you change and examine memory and I/O. Giving the hardware troubleshooter access to I/O can save him hours of work - entering an input command to see what a port does is much simpler than trying to capture the event on a logic analyzer. If you feel really generous with your time, display the status of all system I/O in a table, converting cryptic hex statuses to meaningful keywords. "Data ready" is a lot easier to understand than "02".
A monitor by itself, however, it almost worthless without backup documentation. In these days of high integration ASICS, PALs, FPGAs and the like, I/O is often buried inside an impossibly complex circuit whose operation is far from obvious. Don't toss in the monitor and tell the user "The I command reads a port." You who have written this firmware must surely know, and must surely have documented somewhere, what each port is and what each bit does. Pass this information to the poor technician.
I have written Visual Basic applications that drive a monitor, and that displays the values of ports in plain text. For example, Intel's 188 processor has dozens of internal I/O ports. A bit of Basic code lets the novice technician select "UART Status Port", and then see the setting of each bit in English (i.e., "data ready set" instead of "01"). If the production people have access to a PC a simple application like this can shield them from the bits and bytes of the machine, yet tell them the status of every peripheral in a system.
Bear in mind that the technicians will use a scope to isolate most problems. Design your monitor to ease problem diagnosis with this tool. You need to do two things to make scoping easy: allow every monitor command to be run repeatedly (scopes are particularly good at looking at repetitive signals), and generate a trigger pulse that syncs the scope to the command. This is a bit you toggle a bit simultaneously with, say, the Input and Output commands.
With these two resources any competent engineer or technician can find most common board problems by exercising I/O ports and tracking the signals throughout the board with the scope.
Power On Self Tests
Some systems include Power On Self Tests (POSTs) as part of the product's ROM to give a "go/no-go" indication without using other test equipment. The unit's own display or status lamps show test results. On the PC we see a RAM test at boot time, and a sequence of beeps that tell us nothing, as every vendor uses their own non-standard codes which are documented in that manual we lost exactly a year before the silly thing broke.
Internal diagnostics are worthwhile, though, because they do give the test technician some ability to track down problems. They're also an effective marketing tool, giving the customer a (possibly false) feeling of confidence in the integrity of the product each time he turns it on.
Though internal diagnostics are often viewed as a universal solution to finding system problems, their value lies more in giving a crude test of limited system functions. Not that this isn't valuable. Internal diagnostics can test quite a bit of the unit's I/O and some of the "kernel", or the CPU, RAM, and ROM areas.
The computer's kernel frequently defies standalone testing, since so much of it must be functional for the diagnostics to run at all. Most systems couple at least the main ROM and RAM closely to the processor. The result - a single address, data, or control line short prevents the program from running at all.
It's easy to waste a lot of time coding internal diagnostics that will never provide useful information. They may satisfy vague marketing promises of self-testing capability, but why write dishonest code? Realize that internal diagnostics have intrinsic limitations, but if carefully designed can yield some valuable information. Apply your engineering expertise to the diagnostic problem; carefully analyze the tradeoffs and identify reasonable ways to find at least some of the common hardware failures.
What portions of the kernel should be tested? Some programmers have a tendency to test the CPU chip itself, running through a sequence of instructions "guaranteed" to prove that this essential chip is operating properly. Witness the ubiquitous PC's BIOS CPU tests. I wonder just how often failures are detected. On the PC an instruction test failure makes the code execute a HALT, causing the CPU to look just as dead as if the it never started. More extensive error reporting with a defective CPU is a fool's dream. Instruction tests stem from minicomputer days, where hundreds of discrete ICs were needed to implement the CPU; a single chip failure might only shut down a small portion of the processor. Today's highly integrated parts tend to either work or not; partial failures are pretty rare.
Similarly, memory tests rely on operating memory to run - a high tech oxymoron that makes one question their value. Obviously, if the ROM is not functioning, then the program won't start and the diagnostic will not be invoked. Or, if the diagnostic is coded with subroutine calls, a RAM (read "stack") failure will prematurely crash the test, before providing any useful information.
The moral is to design diagnostics so to ensure that each test uses as few unproven components as possible.
Do test things that may reasonably fail, yet that will not crash the CPU. Running an RTOS? Some sort of timer will generate ticks for context switching. If the timer is an external component write a bit of code that checks the interrupt it generates. Does an external controller (like and 8259) sequence other interrupts? Certainly test the device - few technicians have the knowledge needed to diagnose a problem in such a complex part of the circuit.
Do check out my August, 1995 column for ideas about testing ROM and RAM.
Seeding Memory
Few embedded systems use every last byte of ROM and RAM. Unused areas are, well, unused - who cares what is burned into the last few bytes of a ROM?
You should. Unless you're sure your code is perfect - absolutely, positively bug free - use these empty areas wisely. Has your code ever wandered into an used section of memory? Has a hardware fault (e.g., bad address line) caused the code to crash after executing a handful of instructions? Seed unused ROM with instructions that permit you to trap these faults. You'll save yourself some time by catching the software bugs, and the production folks will love the robust, fault-tolerant nature of the code.
On the Z180/Z80 processors the RST7 instruction (a one byte call to location 0038) is cleverly encoded as 0xff. Unburned ROM defaults to just this value. Write an error trap handler at 0038 that toggles an LED or otherwise signals the failure. Wandering code, for whatever reason, will flash the LED and immediately indicate something is wrong.
Most Z180's systems with an electronics failure that disables ROM will go into an infinite loop executing 0xff instructions, which is instantly visible on a scope. The characteristic double writes from the RST7's pushes tell an experienced scoper in a second what is going on. Teach your production folks this simple trick.
The x86 family has a similar single byte call - INT3 - which vectors through an interrupt vector at 000C. Again, use this instruction as a default fill value for ROM, and write an error processor.
Every embedded system starts with a loop to set up initialized values in RAM. I like to precede this loop with one that sets all of RAM to the INT3, RST7, or whatever. Then, your odds of having a wandering program encounter the instruction are much higher.
Some of the hardest problems to diagnose, either in development or in production test, are erratic interrupts. If an interrupt controller fails or is misprogrammed it may assert an incorrect vector on the bus. Again, preempt the problem by using a complete interrupt table with entries for all possible interrupts, not just the ones you are using. Vector unused, unexpected, interrupts to an error handler.
Obviously there's no guarantee that wandering code will hit these unused locations. However, seeding the code in this manner is a nice way to arm yourself to possibly catch a latent problem. Teach your production people what to expect if such an error takes place.
Watchdog timers are another type of preventative medicine we often employ to detect failures. Too many, though, are designed for the convenience of the programmer or customer, with no thought to the poor electronics technician trying to repair things. A watchdog timer that resets and restarts the code invisibly is a nightmare - how can you tell if a system is operating marginally?
At the very least be sure the watchdog toggles an external bit that a technician can scope. Give the poor guy a chance to see that the system crashes occasionally!
Of course, a better solution is to include the ROM monitor discussed above, with an error word that logs watchdog timeouts and other problems. The repair folks can then log onto the monitor and read a record of problems.
Modern processors often have various sleep modes that power-down all or part of the CPU at various intervals. Neat idea, but take the time to educate your people that the processor may be idle at different intervals. Just last week I spent an entire day trying to find a problem with a system that seemed to crash erratically. Totally baffled, I started timing the interval between crashes and found it to be exactly 127 seconds, every time. The chip was powering itself down, as the code never disabled the default sleep mode.
Conclusion
As a company, we're all in this together - right? Use your expert knowledge, and your knowledge of everyone else's job (after all, we all should strive to be high tech Renaissance Persons), to make the job of the techs in production test and repair simply possible.
Be creative. I wonder if there's a way we can put a TCP/IP link to a modem in our embedded systems, to allow technicians a thousand miles from the customers' sites to diagnose problems via the Internet.