The Tao of Diagnostics
Part 2 of a series about embedded diagnostics.
Published in ESP, July, 1990.
By Jack Ganssle
A few weeks ago I had our broken microwave apart in my workshop. It says something about our business that one of my greatest fears is repairing a microprocessor-based product; a simple chip failure often consigns an appliance to the landfill. Fortunately, this was a simple case of the door microswitches not engaging properly. After a bit of study, I discovered that they had to close in one particular processor-monitored sequence, no doubt to prevent backyard mechanics from bypassing them and getting fried. The correct sequence was (of course) undocumented and difficult to adjust.
This is an all too common example of poor embedded design. Every system should have some provision for in-the-field repair. Software engineers have a responsibility to make these adjustments easier. Why didn't the designers include a little code to show what sequence the switches engaged in?
As I mentioned last month, it's really impossible to say intelligent things about the huge range of I/O used in embedded systems. But, for God's sake, let's use our brains! Be sympathetic to the user's needs. Remember that the system will fail, either in the field or in production test. Make it easy to isolate the problem.
Like the microwave oven example, most embedded systems interface to mechanical and electronic sensors and actuators. Certainly the mechanical portions are prone to failure; just as certainly analog I/O is subject to drift, noise, and other effects that we digital people hate to acknowledge. The tests described last month (and those you've invented in your never-ending quest to build a reliable product) will help get the system to boot. The next step is to give the test technician and end user a "back door" into a diagnostics suite.
Consider the system's analog circuits. These components all exhibit slightly different characteristics, so potentiometers are used to tune offsets and gains. Sometimes, lots of pots are used. It's interesting to watch a test group calibrate these sorts of instruments; frequently special test equipment is needed to monitor the voltages during pot twiddling. Without this equipment these adjustments simply cannot be made in the field. In most cases a bit of clever software can take advantage of a panel display to replace the test gear. Write a bit of code to show raw voltage, or whatever is being monitored, on the system's own output device. Certainly you've already written low level routines to get the data (for use in the main program); spend an afternoon writing a simple diagnostic that calls this subroutine and formats the output.
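As a rough sketch of such an afternoon's work, the C fragment below formats one raw A/D reading for a panel display. The 10-bit converter, the 5.000 V reference, and the names adc_counts_to_mv and format_adc_line are all assumptions made for illustration; substitute whatever your own low level routine returns.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical converter parameters; substitute your hardware's values. */
#define ADC_FULL_SCALE_COUNTS 1023u   /* assumed 10-bit converter      */
#define ADC_REFERENCE_MV      5000u   /* assumed 5.000 V reference     */

/* Convert raw counts to millivolts with integer math only (rounded). */
uint32_t adc_counts_to_mv(uint16_t counts)
{
    if (counts > ADC_FULL_SCALE_COUNTS)
        counts = ADC_FULL_SCALE_COUNTS;
    return ((uint32_t)counts * ADC_REFERENCE_MV + ADC_FULL_SCALE_COUNTS / 2u)
           / ADC_FULL_SCALE_COUNTS;
}

/* Build one display line, for example "CH2 0512 2.502V".
   Returns whatever snprintf returns (the length the line needs). */
int format_adc_line(char *buf, size_t size, unsigned channel, uint16_t counts)
{
    uint32_t mv = adc_counts_to_mv(counts);

    return snprintf(buf, size, "CH%u %04u %lu.%03luV",
                    channel, (unsigned)counts,
                    (unsigned long)(mv / 1000u), (unsigned long)(mv % 1000u));
}
```

Showing the raw counts next to the scaled voltage lets the technician see both what the converter reports and what the software believes it means while the pot is being turned.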
Pots are a continuous source of frustration to users. Think - can you come up with a better, self-calibrating design? Try writing code that removes offset, gain, and other errors mathematically.
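One common way to remove offset and gain errors in software is a two-point calibration: read the input with two known references applied, then correct every later reading by straight-line interpolation. The sketch below assumes the error is linear; the structure and function names are hypothetical.

```c
#include <stdint.h>

/* Two calibration points: what the converter read, and what was really there. */
typedef struct {
    int32_t raw_lo;    /* reading with the low reference applied   */
    int32_t raw_hi;    /* reading with the high reference applied  */
    int32_t true_lo;   /* known value of the low reference (mV)    */
    int32_t true_hi;   /* known value of the high reference (mV)   */
} cal_points_t;

/* Correct one raw reading. Returns 0 on success, or -1 if the two
   calibration readings are identical (no slope can be computed). */
int cal_apply(const cal_points_t *cal, int32_t raw, int32_t *corrected)
{
    int64_t span_raw = (int64_t)cal->raw_hi - cal->raw_lo;
    int64_t num;

    if (span_raw == 0)
        return -1;

    /* 64-bit intermediate so the multiply cannot overflow. */
    num = ((int64_t)raw - cal->raw_lo) * ((int64_t)cal->true_hi - cal->true_lo);
    *corrected = cal->true_lo + (int32_t)(num / span_raw);
    return 0;
}
```

With calibration readings of 12 and 1012 counts taken at 0 and 5000 mV, a raw reading of 512 corrects to 2500 mV. The division truncates toward zero, which is usually acceptable for a display but worth remembering if you need the last count.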
The Scope
When you design diagnostics for field or in-house use, be sure to bear in mind the sorts of tools users will have available. One of the most useful troubleshooting tools is the venerable oscilloscope. Most test and repair technicians rely on the scope almost exclusively. Logic analyzers, emulators, and the other tools used in engineering are not nearly so ubiquitous in the test environment. Remember this when writing diagnostic code.
Yin and yang. While the scope is the universally accepted troubleshooting tool, computer-based systems are not really well suited to scope diagnosis. Digital events tend to be wide (requiring many channels - like for addresses and data), or very intermittent (a 1 microsecond event once per second). Even the most sophisticated scope can't capture these signals without some help from the code. The solution? Write diagnostics that run in repetitive loops, and be sure to toggle a bit (say, an I/O port) at the start of each loop. The technician can trigger the scope's sweep (i.e., start the trace at the left side of the screen) each time the bit is asserted. This "scope trigger point" gives an essential reference to the sequencing of events, in many cases making the scope as useful as a logic analyzer.
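A looping diagnostic with a scope trigger point might look something like the sketch below. The port here is an ordinary variable standing in for a real output port, the bit assignment is made up, and the edge counter exists only so the idea can be tried on a desktop machine.

```c
#include <stdint.h>

/* Stand-in for a memory-mapped output port (hypothetical). On real
   hardware this would be a volatile pointer to the port's address. */
static volatile uint8_t diag_port;

/* Bookkeeping for host-side experiments only: counts rising edges. */
static unsigned long trigger_edges;

#define SCOPE_TRIGGER_BIT 0x01u   /* assumed spare output bit */

static void port_write(uint8_t value)
{
    if ((value & SCOPE_TRIGGER_BIT) && !(diag_port & SCOPE_TRIGGER_BIT))
        trigger_edges++;
    diag_port = value;
}

/* Run the test body repeatedly, pulsing the trigger bit at the start
   of every pass so the scope sweep always begins at the same point. */
void run_test_loop(unsigned long passes, void (*test_body)(void))
{
    while (passes--) {
        port_write((uint8_t)(diag_port | SCOPE_TRIGGER_BIT));    /* rising edge  */
        port_write((uint8_t)(diag_port & ~SCOPE_TRIGGER_BIT));   /* falling edge */
        if (test_body)
            test_body();
    }
}
```

The technician triggers on the rising edge of the bit; everything the test body does then appears at a fixed position on the trace, pass after pass.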
The best software engineers regularly make use of scopes during initial code debugging. It's amazing just how much information you can extract from the code by watching event synchronization, port assertions, or even chip selects on a scope's display. If you are not familiar with this valuable tool, have a hardware guru give you a lesson in its use. Debugging embedded code is hard - take advantage of every tool you can find.
Reporting Failures
I've always hated the annoying beep my Macintosh makes on reset. Until this year, that is, when the computer died with a dramatic belch of smoke. Where, exactly, was the failure? The screen went blank - was the CPU dead? Could the power supply have failed? But wait - on reset the computer still beeps! Power must be OK, and the CPU is probably working. Indeed, it turned out the problem was localized to the video circuits, and a $100 mail order board brought the computer back to life. The once annoying beep saved an expensive trip to the Mac man.
Years ago Computer Automation installed Go/Nogo LEDs on every board in their "Naked Mini" (their name, not mine) computers. Like the Mac's beep, these simple indicators save users a lot of grief. Nothing this simple is foolproof, but even an 80% success rate is worthwhile.
Certainly systems with CRTs or other alphanumeric displays can easily show lots of useful error information. Working in C makes formatting output especially easy. Use these resources, but don't depend on them. An awful lot of hardware and software must work before even a single character can be displayed on a CRT; self test routines should depend on the absolute minimum of functioning hardware.
Learn from the automotive companies. Cars have a lot of sensors, all wired to an under-hood computer. Dozens (at least) of potential failure nodes exist. Ford, GM, and others let the mechanic put the computer into a self-test mode, and flag errors by toggling one bit very slowly. The engineers cleverly realized that a voltmeter is about all you can count on a mechanic having and understanding, so their software drives the bit up and down so slowly that even a meter needle can show the transitions. Error 51 might mean "failed PCV valve", and is indicated by 5 needle deflections, a pause, followed by one more. What could be simpler?
A LED is just as effective and even easier to use. If the product is too cost sensitive to include even a 50 cent LED, provide a place to clip one on. If you use a LED rather than a voltmeter, then the flashes can be quite a bit faster. A subroutine to show one digit of a code is simple, and typically takes the following form:
Pseudocode:
    Set COUNT = number of flashes wanted
    LOOP:
        turn LED on
        delay for 1/4 second
        turn LED off
        delay for 1/2 second
        COUNT = COUNT - 1
        go to LOOP as long as COUNT is non-zero
Avoid using zeroes as part of an error code. While zero might correspond to "no flash", it is visually very confusing.
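In C, the one-digit flasher and a two-digit error code built on top of it could be sketched as follows. The led_set and delay_ms routines are stand-ins for real hardware (here they only count flashes and add up time), and the 1.5 second pause between digits is an assumed value, not a standard.

```c
/* Host-side stand-ins for the hardware (hypothetical). On a real board
   led_set would write a port bit and delay_ms would burn time. */
static unsigned      led_on_count;   /* how many times the LED was lit */
static unsigned long elapsed_ms;     /* total simulated delay          */

static void led_set(int on)        { if (on) led_on_count++; }
static void delay_ms(unsigned ms)  { elapsed_ms += ms; }

#define FLASH_ON_MS      250u   /* 1/4 second on                         */
#define FLASH_OFF_MS     500u   /* 1/2 second off                        */
#define DIGIT_PAUSE_MS  1500u   /* assumed extra gap between the digits  */

/* Flash the LED 'count' times; a count of zero produces no flashes. */
void flash_digit(unsigned count)
{
    while (count > 0) {
        led_set(1);
        delay_ms(FLASH_ON_MS);
        led_set(0);
        delay_ms(FLASH_OFF_MS);
        count--;
    }
}

/* Show a two-digit code, tens digit first. Codes containing a zero
   digit are refused (returns -1) because "no flash" is hard to read. */
int flash_error_code(unsigned code)
{
    unsigned tens = code / 10u;
    unsigned ones = code % 10u;

    if (code > 99u || tens == 0u || ones == 0u)
        return -1;

    flash_digit(tens);
    delay_ms(DIGIT_PAUSE_MS);
    flash_digit(ones);
    return 0;
}
```

Calling flash_error_code(51) gives five flashes, a pause, then one more. Unlike the pseudocode, the loop tests the count before the first flash, so a count of zero cannot run away.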
Showing error codes to a single LED is arguably better than showing the complete code in a conventional 7 segment or ASCII display. The single bit approach is more robust; not much hardware support is needed. If the system has a number of LEDs, consider sending the same pattern to all of them. A single LED (or port) failure will then be obvious, and the remaining LEDs will still show the error code.
ROM Monitors
Let's not forget the sophisticated troubleshooter. We've all had the unpleasant experience of being called in to find and fix design flaws. Build in tools to make this sort of work easier for you and your associates.
If the embedded system includes some sort of terminal interface, then including a monitor (or "remote debugger") is a nice way to give the high-end user access to the system's internals. A ROM monitor may not be as powerful as an emulator or logic analyzer, but it is easy to invoke. A built-in monitor is like a sleeping giant, dormant, waiting to be called into action by entering a secret command. But be careful - I once failed to check for keyboard overflow in a product, and a user called to complain about the weird mode (the monitor) that the product entered when his cat sat on the keyboard.
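The cat problem comes down to a missing bounds check. A minimal sketch of a safer keystroke collector is shown below; the buffer size, the secret string "~MON", and all of the names are invented for the example.

```c
#include <string.h>
#include <stddef.h>

#define CMD_BUF_SIZE   16        /* assumed maximum command length   */
#define MONITOR_SECRET "~MON"    /* hypothetical secret command      */

typedef struct {
    char   buf[CMD_BUF_SIZE];
    size_t len;
    int    overflowed;           /* set when a line was too long     */
} key_line_t;

/* Feed one keystroke. Returns 1 only when a complete line exactly
   matches the secret command; returns 0 for everything else. */
int key_line_feed(key_line_t *kl, char c)
{
    if (c == '\r' || c == '\n') {
        int match = !kl->overflowed &&
                    kl->len == strlen(MONITOR_SECRET) &&
                    memcmp(kl->buf, MONITOR_SECRET, kl->len) == 0;
        kl->len = 0;             /* start a fresh line either way    */
        kl->overflowed = 0;
        return match;
    }

    if (kl->len < sizeof kl->buf)
        kl->buf[kl->len++] = c;
    else
        kl->overflowed = 1;      /* too long: never write past buf   */

    return 0;
}
```

An overlong line is remembered as overflowed and thrown away at the next carriage return, so no amount of random typing can overwrite memory or slide into the monitor by accident.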
Even a simple monitor lets you change and examine memory and I/O. Giving the hardware troubleshooter access to I/O can save him hours of work - entering an input command to see what a port does is much simpler than trying to capture the event on a logic analyzer. If you feel really generous with your time, display the status of all system I/O in a table, converting cryptic hex statuses to meaningful keywords. "Data ready" is a lot easier to understand than "02".
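Turning a status byte into keywords takes little more than a table. In the sketch below the bit assignments belong to an imaginary port and are assumptions; only the "Data ready" meaning of 02 comes from the example above.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint8_t     mask;
    const char *name;
} status_bit_t;

/* Hypothetical bit assignments for an imaginary status port. */
static const status_bit_t status_bits[] = {
    { 0x01, "Transmitter empty" },
    { 0x02, "Data ready"        },
    { 0x04, "Overrun"           },
    { 0x80, "Fault"             },
};

/* Write a comma-separated list of the names of all set bits into buf.
   Returns the number of bits named; stops early if buf fills up. */
int decode_status(uint8_t status, char *buf, size_t size)
{
    size_t used = 0;
    int named = 0;

    if (size == 0)
        return 0;
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof status_bits / sizeof status_bits[0]; i++) {
        int n;

        if (!(status & status_bits[i].mask))
            continue;
        n = snprintf(buf + used, size - used, "%s%s",
                     named ? ", " : "", status_bits[i].name);
        if (n < 0 || (size_t)n >= size - used)
            break;               /* out of room: output is truncated */
        used += (size_t)n;
        named++;
    }
    return named;
}
```

Keeping the names in a table means that supporting a new port is a matter of adding rows rather than writing more code.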
A disassembler, assembler, and simple breakpoints are a lot more work to add, but if you go to the trouble you can then patch small test routines into the product's RAM. At the very least have a GO command that starts a program at any address. Then, you can patch in instruction hex codes and start simple test loops that perhaps cycle a particular port. The scope-happy technicians will love you for it. Is a port very occasionally intermittent? A few bytes of code can monitor this much more effectively than any other means.
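To give a feel for how small the core of a monitor can be, here is a sketch of a command interpreter with examine (E), deposit (D), and GO (G) commands. It runs against a 256-byte array standing in for target memory, and G only records the address instead of jumping to it; the command letters, reply format, and names are all assumptions.

```c
#include <stdio.h>
#include <stdint.h>

#define SIM_MEM_SIZE 256u

static uint8_t  sim_mem[SIM_MEM_SIZE];  /* stand-in for target RAM        */
static unsigned last_go_address;        /* where G was last asked to go   */
static int      go_requested;           /* set when a G command is seen   */

typedef enum { MON_OK, MON_BAD_COMMAND, MON_BAD_ADDRESS } mon_result_t;

/* Execute one command line, with all numbers in hex:
     E addr        examine a byte
     D addr value  deposit a byte
     G addr        start execution (only recorded in this sketch) */
mon_result_t monitor_execute(const char *line, char *reply, size_t size)
{
    char cmd = 0;
    unsigned addr = 0, value = 0;
    int fields = sscanf(line, " %c %x %x", &cmd, &addr, &value);

    if (fields < 2)
        return MON_BAD_COMMAND;
    if (addr >= SIM_MEM_SIZE)
        return MON_BAD_ADDRESS;

    switch (cmd) {
    case 'E':
        snprintf(reply, size, "%04X: %02X", addr, (unsigned)sim_mem[addr]);
        return MON_OK;
    case 'D':
        if (fields != 3 || value > 0xFFu)
            return MON_BAD_COMMAND;
        sim_mem[addr] = (uint8_t)value;
        snprintf(reply, size, "%04X: %02X", addr, value);
        return MON_OK;
    case 'G':
        /* A real monitor would transfer control to addr here. */
        last_go_address = addr;
        go_requested = 1;
        snprintf(reply, size, "GO %04X", addr);
        return MON_OK;
    default:
        return MON_BAD_COMMAND;
    }
}
```

Patching in a test loop is then a handful of D commands followed by a G. In a real product the array would give way to direct memory and port access, with whatever address checking the hardware demands.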
A monitor can serve as a diagnostics platform. It is an easy way to invoke complex test routines, and gives the basis of a nice interface for communicating test results. Like Microsoft's new Programmer's Work Bench, it is a sort of software bus to hang diagnostics and other utilities from.
All of my company's products include such a monitor. Our customers are not aware of it, but in our lab we regularly invoke it to diagnose all sorts of problems.
A number of companies sell commercial ROM monitors. First Systems, Microtec, and Intermetrics all provide quite sophisticated products that can be included in a design.
Diagnostics Tricks
I could go on at great length about using powerful troubleshooting aids like emulators and Fluke's Microsystem Troubleshooter. These sorts of tools quickly find bus shorts and other problems that prevent the computer from coming up at all. If it doesn't boot, then all the internal diagnostics in the world are useless. If the techs don't have decent tools, then they will be reduced to "shotgunning" - replacing components at random and hoping for success.
You can make their job a bit easier during the product's hardware design. (Yes, programmers should be involved in hardware design, at least to the extent of contributing their expert knowledge to make the system as close to perfect as possible). A nice way of finding bus shorts, memory failures, and the like is to execute a looping program, letting the technician examine each address and data line with a scope to find the source of the trouble. Of course, if the memories don't work, or if the address bus is shorted, how can we run a program?
On the Z80 and 8085 family the RST 7 instruction is a one byte CALL to location 38 hex. Was Intel clairvoyant, or was it just luck that caused them to use opcode FF for this instruction? As a result, if you add pullup resistors to the bus, then simply removing all memory chips will make the processor execute CALLs to 38 all day long. The stack pointer will decrement through the processor's entire address space, so the technician can look at address lines and check that they cycle properly. The data bus will show the FF opcode fetch followed by the return address being pushed each time RST 7 executes, so the data lines toggle as well. This trivial test gives the repetitive signal needed to effectively use a scope to check out the hardest parts of the system.
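If you want to convince yourself that every address line really does move, the free-running processor is easy to model on a desktop machine. The sketch below is a deliberately simplified model (one fetch and two stack writes per RST 7, with no refresh or other bus cycles), and its names are hypothetical. In this model the value pushed settles at 0039 hex once the first restart has run.

```c
#include <stdint.h>

/* Simplified model of a CPU that reads FF (RST 7) from a pulled-up bus. */
typedef struct {
    uint16_t pc;
    uint16_t sp;
    uint16_t addr_seen_high;   /* address lines observed at logic 1 */
    uint16_t addr_seen_low;    /* address lines observed at logic 0 */
    uint16_t last_pushed;      /* most recent return address pushed */
} freerun_cpu_t;

/* Record which address lines were high and low during one bus cycle. */
static void bus_cycle(freerun_cpu_t *cpu, uint16_t address)
{
    cpu->addr_seen_high |= address;
    cpu->addr_seen_low  |= (uint16_t)~address;
}

/* Execute one RST 7: fetch the opcode, push the return address,
   then continue at location 38 hex. */
void freerun_step(freerun_cpu_t *cpu)
{
    bus_cycle(cpu, cpu->pc);       /* opcode fetch: bus reads FF       */
    cpu->pc++;

    cpu->sp--;
    bus_cycle(cpu, cpu->sp);       /* write high byte of return addr   */
    cpu->sp--;
    bus_cycle(cpu, cpu->sp);       /* write low byte of return addr    */

    cpu->last_pushed = cpu->pc;
    cpu->pc = 0x0038u;
}
```

Each step moves the stack pointer down by two, so 32768 steps walk it through all 65536 addresses, and every address line will have been seen both high and low.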
Other CPUs usually have a similar instruction. On the 8088 family the INT 3 instruction is a similar one byte opcode. A one byte PUSH might even be better. Since these instructions are not FF opcodes, pull up the bus and add a jumper field so the technician can set the proper opcode.
Be sure that at least some of the diagnostics can run with the absolute minimum amount of the system working, and a minimal number of boards plugged in. Think about the example set by the Naked Mini - diagnostics were limited to each card, reducing potential for interaction between system components.
OK, so you say this is a one-off unit that will never be reproduced, and that has a design life of only a few weeks. Why spend time writing diagnostics? This is a valid point, but even in these extreme cases be sure the system has at least an "easy mode". That is, be sure that on power up (or by installing a jumper or setting a switch) a dramatic event occurs - say, a lamp lights. This way you can tell in a second if the computer is running and power is applied. You don't want to spend time chasing timing problems in a complex system when the computer hasn't even started.
As a company, we're all in this together - right? Use your expert knowledge, and your knowledge of everyone else's job (after all, we all should strive to be high tech Renaissance Persons), to make the job of the techs in production test and repair easier (or even possible).
The microwave oven is fixed, but my workbench is still littered with broken electronics. Now, if I could only get my car radio's FM section to work...