Troubleshooting

Troubleshooting skills are critical to debugging embedded systems. Sure, a lot of these skills come only from experience. Many, though, are more of a philosophical nature, as described in the following.

Published in EDN in October, 1994.

By Jack Ganssle

There comes a time in any project when your new design is finally assembled, awaiting your special expertise to "make it work". Sometimes it seems like the design end of this business is the easy part; troubleshooting prototype hardware can make even the toughest engineer a Maalox and Rogaine addict.

You can't fix any embedded system without the right world view: a zeitgeist of suspicion tempered by trust in the laws of physics, curiosity dulled only by the determination to stay focused on a single problem, and a zealot's regard for the scientific method.

Perhaps these are successful characteristics of all who pursue the truth. In a world where we are surrounded by complexity, where we deal daily with equipment and systems only half-understood, it seems wise to follow understanding by an iterative loop of focus, hypothesis, and experiment.

Too many engineers fall in love with their creations only to be continually blindsided by the design's faults. They are quick to overtly or subconsciously assume the problem being chased is due to the software, the lousy chips, or the power company, when simple experience teaches us that any new design is rife with bugs.

Assume it's broken. Never figure anything is working right until proven by repeated experiment; even then, continue to view the "fact" that it seems to work with suspicion. Bugs are not bad; they're merely a test of your troubleshooting ability.

Armed with a healthy skeptical attitude, the basic philosophy of debugging any system is to follow these steps:

for (i = 0; i < number of findable bugs; i++)
  {
  while (bug(i))
    {
    Observe the behavior to find the apparent bug;
    Observe collateral behavior to gain as much information as possible
        about the bug;
    Round up the usual suspects;
    Generate a hypothesis;
    Generate an experiment to test the hypothesis;
    Fix the bug;
    }
  }

Now you're ready to start troubleshooting, right? Wrong! Stop a minute and make sure you have good access to the system. No matter how minor the problem seems to be, troubleshooting is like a bog we all get trapped in for far too long. Take a minute to ease your access to the system.

Do you have extender cards if they're needed to scope any point on the board(s)? How about special long cables to reach the boards once they are extended?

If there's no convenient point to reliably clip on the scope's ground lead, solder a resistor lead onto the board so you're not fumbling with leads that keep popping off.

Some systems have signals that regulate major operating modes. Solder a resistor lead on these points as well, as you'll surely be scoping them at some point. This small investment in time up front will pay off in spades later.

Following the Loop

Let's cover each step of the troubleshooting sequence in detail.

Step 1: Observe the behavior to find the apparent bug.

In other words, determine the bug's symptoms. Remember always that many problems are subtle and exhibit themselves via a confusing set of symptoms. The fact that the first digit of the LCD fails to display may not be a useful symptom -- but the fact that none of the digits work may mean a lot.

Step 2: Observe collateral behavior to gain as much information as possible about the bug.

Does the LCD's problem correlate to a relay clicking in? Try to avoid studying a bug in isolation, but at the same time be wary of trying to fix too many bugs at the same time. When ROM accesses are unreliable and the front panel display is not bright enough, address one of these problems at a time. No one is smart enough to deal with multiple bugs all at once - unless they are all manifestations of something more fundamental.

Step 3: Round up the usual suspects.

Lots of computer problems stem from the same few sources. Clocks must be stable and must meet very specific timing and electrical specs... or all bets are off. Reset too often has unusual timing parameters. When things are just "weird", take a minute to scope all critical inputs to the microprocessor, like clock, HOLD, READY, RESET, and the like.

Never, never, never forget to check Vcc. Time and time again here at Softaid we see systems that don't run right because the 5 volt supply is really only putting out 4.5... or 5.6... or 5 volts with lots of ripple. The systems come in after their designers spent weeks sweating over some obscure problem that in fact never existed, but was simply the ghostly incarnation of the more profound power supply issue.
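The Vcc check above is easy to turn into a pass/fail test. Here is a minimal sketch in C, assuming a rail spec of nominal voltage, a percentage tolerance, and a peak-to-peak ripple limit. The struct, the function name, and the example limits (5 percent, 100 mV) are illustrative assumptions, not figures from any datasheet; use the limits published for the parts actually on your board.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative rail specification. Real limits come from the datasheets
   of the parts on the board, not from this sketch. */
typedef struct {
    int nominal_mv;     /* e.g. 5000 for a 5 volt rail */
    int tolerance_pct;  /* allowed deviation, percent of nominal */
    int max_ripple_mv;  /* allowed peak-to-peak ripple */
} rail_spec;

/* Returns true only if both the DC level and the ripple are in spec. */
bool rail_in_spec(const rail_spec *spec, int measured_mv, int ripple_mvpp)
{
    assert(spec && spec->nominal_mv > 0 && ripple_mvpp >= 0);

    int window_mv = spec->nominal_mv * spec->tolerance_pct / 100;
    int low_mv = spec->nominal_mv - window_mv;
    int high_mv = spec->nominal_mv + window_mv;

    return measured_mv >= low_mv &&
           measured_mv <= high_mv &&
           ripple_mvpp <= spec->max_ripple_mv;
}
```

With the assumed 5 percent window on a 5 volt rail, the acceptable range is 4.75 to 5.25 volts, so both the 4.5 and 5.6 volt supplies mentioned above fail, as does a 5.0 volt supply carrying more ripple than the limit allows.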

Step 4: Generate a hypothesis.

"Shotgunners" are those poor fools who address problems by simply changing things - ICs, designs, PAL equations - without having a rationale for the changes. Shotgunning is for amateurs. It has no place in a professional engineering lab.

Instead, develop a hypothesis about the cause of the bug. You probably don't have the information to do this without gathering more data. Use a scope, emulator, or logic analyzer to see exactly what is going on; compare that to what you think should happen. Generate a theory about the cause of the bug from the difference in these.

Sometimes you'll have no clue what the problem might be. Scoping the logical places might not generate much information. Or, a grand failure like an inability to boot is so systemic that it's hard to tell where to start looking. Sometimes, when the pangs of desperation set in, it's worthwhile to scope around the board practically at random. You might find a floating line, an unconnected ground pin, or something unexpected. Scope around, but be always on the prowl for a working hypothesis.

Step 5: Generate an experiment to test the hypothesis.

Construct an experiment to prove or disprove your hypothesis. Most of the time this gets resolved in the process of gathering data to come up with the theory in the first place. For example, if the emulator reads all ones from a programmed ROM, a reasonable hypothesis is that CS or OE is not toggling. Scoping the pins will prove this one way or the other, though now you'll need another hypothesis and experiment to figure out why the selects are not where you expect to see them.
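The all-ones symptom in the ROM example can be confirmed mechanically before reaching for the scope. This is a small sketch in C, assuming the ROM contents have already been dumped into a byte buffer by whatever tool is at hand; the function name and the buffer are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if every byte in the dump is 0xFF, the all-ones symptom that
   suggests the ROM's selects never went active during the reads. */
bool dump_is_all_ones(const uint8_t *dump, size_t len)
{
    assert(dump != NULL && len > 0);

    for (size_t i = 0; i < len; i++) {
        if (dump[i] != 0xFF) {
            return false;   /* real data seen: the ROM is driving the bus */
        }
    }
    return true;
}
```

A true result only supports the hypothesis; scoping CS and OE is still the experiment that proves or disproves it.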

Sometimes, though, the hypothesis-experiment model should be much less casually applied. When Intel started shipping the XL version of the 186 (supposedly compatible with the older series), we found that none of our systems worked. Scoping around showed the processor to be stuck in a weird tri-state, though all of its inputs seemed reasonable. One hypothesis was that the 186XL was not coming out of reset properly, an awfully hard thing to capture since reset is a basically non-scopable one-time event. We finally built a system to reset the processor repeatedly, to give us something to scope. The experiment proved the hypothesis, and a fix was easy to design.

Note that an alternative would have been to glue in a new reset circuit from the start to see if the problem went away. Problems that mysteriously go away tend to mysteriously come back; unless you can prove that the change really fixed the problem, there may still be a lurking time bomb in the system.
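The repeated-reset idea generalizes: turn a one-time event into a repetitive one so the scope has something to trigger on. The rig used for the 186XL is not described here, so what follows is only a sketch of one way to do it, assuming a second, known-good controller drives the target's reset line. The two callback types and every name below are hypothetical; the "fake" functions are a stand-in fixture for trying the loop on a PC.

```c
#include <assert.h>

/* Hypothetical hooks: how reset is driven and how delays are produced
   depends entirely on the test fixture. */
typedef void (*set_reset_fn)(int asserted);
typedef void (*delay_fn)(unsigned microseconds);

/* Pulse reset `cycles` times so a scope can trigger on every release. */
void pulse_reset(set_reset_fn set_reset, delay_fn delay,
                 unsigned cycles, unsigned hold_us, unsigned run_us)
{
    assert(set_reset && delay);

    for (unsigned i = 0; i < cycles; i++) {
        set_reset(1);     /* hold the processor in reset */
        delay(hold_us);
        set_reset(0);     /* release: this edge is the scope trigger */
        delay(run_us);    /* let it run (or hang) before the next cycle */
    }
}

/* Stand-in fixture for a PC: counts reset releases instead of driving a pin. */
unsigned release_count;
void fake_set_reset(int asserted) { if (!asserted) release_count++; }
void fake_delay(unsigned microseconds) { (void)microseconds; }
```

On real hardware the two hooks would toggle an output pin and wait on a timer; the hold and run times are whatever the target's reset spec and the scope's sweep require.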

Occasionally the bug will be too complicated to yield to such casual troubleshooting. If the timing of a PAL has to be adjusted, visualize the new timing in your mind or on a sheet of graph paper before wildly making changes. Will it work? It's much faster to think out the change than to actually implement it... and perhaps troubleshoot it all over again.
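Thinking out a timing change on paper usually comes down to a margin calculation. The sketch below uses the usual setup check for a registered path: clock period minus the sum of the source's clock-to-output delay, the PAL's propagation delay, and the destination's setup time. The function name and all the numbers in the usage note are assumptions for illustration; take the real figures from the datasheets.

```c
#include <assert.h>

/* Setup margin in nanoseconds for a registered path through a PAL.
   A negative result means the proposed timing violates setup. */
int setup_margin_ns(int clock_period_ns, int clk_to_out_ns,
                    int pal_tpd_ns, int setup_ns)
{
    assert(clock_period_ns > 0);
    assert(clk_to_out_ns >= 0 && pal_tpd_ns >= 0 && setup_ns >= 0);

    return clock_period_ns - (clk_to_out_ns + pal_tpd_ns + setup_ns);
}
```

For example, with an assumed 50 ns clock period, 15 ns clock-to-output, and 5 ns setup, a 25 ns PAL leaves 5 ns of margin while a 35 ns PAL is 5 ns short. That is the kind of answer worth having before the soldering iron comes out.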

Rapid troubleshooting is as important as accurate troubleshooting. Decide what your experiment will be, and then stop and think it through once again. What will this test really prove? I like experiments with binary results - the signal is there or it is not, or it meets specified timing or it does not - since either result gives me a direction to proceed. Binary results have another benefit: sometimes they let you skip the experiment altogether! Always think through the actions you'll take after the experiment is complete, since sometimes you'll find yourself taking the same path regardless of the result, making the experiment superfluous.
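The "same path regardless of the result" test can be written down explicitly. This is a deliberately tiny sketch in C: record what you would do next for each outcome of a binary experiment, and run the experiment only if the two differ. The enum values are hypothetical next steps invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical follow-up actions, named only for illustration. */
typedef enum {
    CHECK_CHIP_SELECT_DECODER,
    CHECK_ADDRESS_BUS,
    REPLACE_ROM
} next_step;

typedef struct {
    next_step if_present;  /* what you would do if the signal is there */
    next_step if_absent;   /* what you would do if it is not */
} binary_experiment;

/* An experiment is worth running only if its outcome changes what
   you do next. */
bool experiment_worth_running(const binary_experiment *e)
{
    assert(e);
    return e->if_present != e->if_absent;
}
```

If both outcomes send you to the chip-select decoder, skip the measurement and go look at the decoder.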

If the experiment is a nuisance to set up, is there a simpler approach? Hooking up 50 logic analyzer probes is rather painful if you can get the same information in some easier way. I'd hate to be in a lab without a logic analyzer, since they are so useful for so many things... but I try to relegate it to the tool of last resort, since most often it's possible to construct an easier experiment that is complete in a fraction of the time it takes to connect the LA.

Don't be so enamored of your new grand hypothesis that you miss data that might disprove it! The purpose of a hypothesis is simply to crystallize your thinking - if it is right, you'll know what step to take next. If it's wrong, collect more data to formulate yet another theory.

Step 6: Fix the bug.

There's more than one way to fix a problem. Hanging a capacitor on a PAL output to skew it a few nanoseconds is one way; another is to adjust the design to avoid the race condition entirely.

Sometimes a quick and dirty fix might be worthwhile to avoid getting hung up on one little point if you are after bigger game. Always, always, revisit the kludge and re-engineer it properly. Electronics has an unfortunate tendency to work in the engineering lab and not go wrong until the 5,000th unit is built. If a fix feels bad, or if you have to furtively look over your shoulder and glue it in when no one is looking, then it is bad.

Finally, never, ever, fix the bug and assume it's OK because the symptom has disappeared. Apply a little common sense and scope the signals to make sure you haven't serendipitously fixed the problem by creating a lurking new one.

Other Ideas

Constantly apply sanity checks. Twenty years ago the Firesign Theatre put out an album called "Everything You Know Is Wrong". Use that as your guiding philosophy in troubleshooting an embedded system. For example, just because you checked Vcc last night and it was fine does not mean that it's OK now. Prototype systems fail in wondrous ways, so always be on the lookout.

Another example: suppose your system runs fine at 10 MHz but never at 20. Obviously you'd put a 20 MHz clock source in and pursue the problem. Every once in a while go back to 10 MHz just to be sure the symptom has not changed. You could spend a lot of time developing hypotheses about 20 vs. 10 MHz operation, when the 10 MHz test results might actually be a fluke.

It's a good idea to be on the lookout for excessive heat, especially now that so many components are surface mounted and tough to change when you blow them up.

All semiconductor devices generate some heat; big CPUs can produce quite a bit. A really hot device, one that you can't keep your finger on, is usually screaming for help. Excessive heat may indicate an SCR latch up condition due to outrageous ground bounce or a floating input.

Less dramatic overheating, much harder to detect without a lot of practice, often indicates a design flaw. Your finger can give important clues about the design. If two devices try to drive the bus at the same time, they'll overheat.
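When a warm pair of bus drivers makes you suspect contention, a captured trace of their output enables can confirm it. Here is a sketch in C, assuming each sample packs one bit per driver with 1 meaning "output enabled" (active-low enables would need inverting first). The sample format and function names are assumptions, not the export format of any particular logic analyzer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count how many drivers have their outputs enabled in one sample. */
static int drivers_enabled(uint8_t sample)
{
    int count = 0;
    while (sample) {
        count += sample & 1u;
        sample >>= 1;
    }
    return count;
}

/* Index of the first sample where more than one driver is enabled,
   or -1 if the trace shows no overlap. */
long first_contention(const uint8_t *samples, size_t len)
{
    assert(samples != NULL);

    for (size_t i = 0; i < len; i++) {
        if (drivers_enabled(samples[i]) > 1) {
            return (long)i;
        }
    }
    return -1;
}
```

A clean result only covers the captured window and the sample rate used; brief overlaps between samples can still slip through, so treat it as evidence rather than proof.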

Be careful how you apply your personal temperature sensor. I've found that my callused forefinger is insulated enough to protect me from bad burns when a part is unexpectedly frying. Thus, I gingerly touch each part; if it seems reasonably cool I'll then use the much-more-sensitive back of my hand to try to determine if the chip is running hotter than it should. It's surprising how much information you can get with a little experience.

Final Words

At 3:00 AM when the problems seem intractable and you're ready to give up engineering, remember that the system is only a computer. Never panic - you are smarter than it is.