By Jack Ganssle
Note: My definitive take on watchdog timers is at https://www.ganssle.com/watchdogs.htm.
Watchdogs Redux
Nearly a decade ago I wrote a series of three articles (http://www.eetimes.com/discussion/break-points/4024495/Born-to-Fail, slightly updated at https://www.ganssle.com/watchdogs.htm) about watchdog timers (WDTs). It was my contention at the time that most WDTs were poorly designed. Too many still are.
I won't repeat my arguments here since they are available on Embedded.com's website. However, I stated that a WDT is the last line of defense against product failure. Designed correctly, a WDT guarantees that the system will recover from a crash; anything less may result in a ticked-off customer. Or loss of an expensive mission. Or, for some products, injury and death.
Remember that when a program crashes, it generally runs off and starts executing code from some random address. Rarely does the application actually stop; if it does stop, that's usually only after executing a lot of incorrect instructions.
So what's the state of the art today? A complete survey is impossible, but here are a few data points.
TI's MSP430 family comprises a wide range of nifty, very low power 16-bit microcontrollers. The documentation (http://focus.ti.com/lit/ug/slau208h/slau208h.pdf) shows a very impressive-looking WDT block diagram (figure 13-1), but the reality is less thrilling. At any time the code can turn the protection mechanism off. So a crashed program running rogue code can issue an instruction that disables the WDT. The system crashes and never recovers.
The MSP430's instruction set is refreshingly simple, generally using a single word to represent, for instance, a MOV instruction. A nice feature is that to change any WDT setting one writes to the WDTCTL control register with the upper byte set to 0x5a; the lower 8 bits contain the command. Try to write to it with anything else in the MSB and the system will reset. The lower bits hold a variety of configuration information that disables/enables the WDT, selects the clock source, or even switches the WDT to act as a simple timer. What are the odds that crashed code will write exactly 0x5a to the upper byte? Pretty slim, of course. But not zero, and probably higher than you'd think. There will be a lot of move instructions in the code, and some will be followed by an ADD, whose encoding can put exactly 0x5a in the upper byte. Much better would be a design that lets one write the WDT configuration only once. Or perhaps only once after resuming from a low power mode.
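For concreteness, here's what servicing looks like in C, using the symbol names from TI's standard msp430.h header (WDTPW is defined as 0x5A00). A minimal sketch, not production code:

#include <msp430.h>   /* TI's header defines WDTCTL, WDTPW (0x5A00), etc. */

void wdt_kick(void)
{
    /* Any write to WDTCTL without 0x5A in the upper byte resets the part */
    WDTCTL = WDTPW | WDTCNTCL;   /* clear the watchdog counter */
}

void wdt_disable(void)
{
    WDTCTL = WDTPW | WDTHOLD;    /* the very write rogue code might stumble into */
}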
Then there's Freescale's newish 32-bit ColdFire+ line, like the MCF51Qx (manual here: http://cache.freescale.com/files/32bit/doc/ref_manual/MCF51QM128RM.pdf?fpsp=1&WT_TYPE=Reference%20Manuals&WT_VENDOR=FREESCALE&WT_FILE_FORMAT=pdf&WT_ASSET=Documentation). Instead of "watchdog" Freescale prefers the awkward phrase "Computer Operating Properly" (COP). But it does offer a very intriguing feature. In general, one pets the watchdog, uh, COP, by writing 0x55 and then 0xaa to the control register. But in one mode that sequence must be sent in the last 25% of the COP timeout period; a premature write results in a reset. Odds of an errant program getting the timing Goldilocks-correct (not too often, nor too infrequently) are tiny.
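A hedged sketch of the service sequence (the SIM_SRVCOP register name is an assumption taken from Freescale's headers for these parts, and the address shown is a placeholder, not gospel):

#include <stdint.h>

#define SIM_SRVCOP (*(volatile uint8_t *)0x40048100u)  /* placeholder address */

void kick_cop(void)
{
    /* In windowed mode this pair must arrive in the last 25% of the
       timeout period; writing it any earlier forces a reset */
    SIM_SRVCOP = 0x55;
    SIM_SRVCOP = 0xAA;
}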
The part also generates a reset if any attempt is made to execute an illegal instruction. That's somewhat different from most CPUs, which issue an illegal op code interrupt. I rather like Freescale's approach, since interrupt handlers are not guaranteed to work if the code crashes: a blown stack, a corrupt PC (on some CPUs a fault is taken if the PC is odd), or a changed vector base register can leave the handlers useless. This also suggests that it's a good idea to fill unused flash at link time with an illegal op code, and on power-up fill all of RAM with a similar instruction, so that errant code waltzing through memory is likely to generate a reset.
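A sketch of the power-up fill, assuming GCC-style linker symbols (__unused_ram_start and __unused_ram_end are hypothetical names your linker script would have to provide); 0x4AFC is the ILLEGAL opcode on ColdFire/68K parts:

extern unsigned short __unused_ram_start[], __unused_ram_end[];

void fill_unused_ram(void)
{
    /* Seed RAM with illegal instructions so runaway execution
       faults - and on this part, resets - almost immediately */
    for (unsigned short *p = __unused_ram_start; p < __unused_ram_end; p++)
        *p = 0x4AFC;   /* ILLEGAL opcode on ColdFire/68K */
}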
Another nice touch is that the reset pin is open drain and is asserted when any of these errors occurs. Tie it to the peripheral reset inputs. Even if wandering code issues output instructions, the peripherals' potentially scrambled little brains will be straightened out.
STMicroelectronics has a line of Cortex-M3 devices. The M3 has become extremely popular for lower-end embedded devices, and ST's STM32F is representative of these parts (though the WDT is an ST add-on, and does not necessarily mirror other vendors' implementations). The STM32F has two different protection mechanisms. An "Independent Watchdog" is a pretty vanilla design that has little going for it other than ease of use. But the Window Watchdog offers more robust protection. When a countdown timer expires, a reset is generated; the reset can be prevented by reloading the timer. Nothing special there. But if the reload happens too quickly, the system will also reset. In this case "too quickly" is determined by a value one programs into a control register.
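To make the window concrete, here's a minimal sketch using the register and bit names from ST's CMSIS headers for the STM32F1; the prescaler and window values are illustrative only:

#include "stm32f10x.h"   /* CMSIS device header */

void wwdg_init(void)
{
    RCC->APB1ENR |= RCC_APB1ENR_WWDGEN;   /* clock the window watchdog */
    WWDG->CFR = WWDG_CFR_WDGTB1 | 0x60;   /* prescaler; window value = 0x60 */
    WWDG->CR  = WWDG_CR_WDGA | 0x7F;      /* enable, downcounter = 0x7F */
}

void wwdg_kick(void)
{
    /* Legal only after the counter has fallen below the window (0x60);
       reload any sooner and the part resets */
    WWDG->CR = WWDG_CR_WDGA | 0x7F;
}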
Another cool feature: it can generate an interrupt just before resetting. Write a bit of code to snag the interrupt and you can take some action to, for instance, put the system in a safe state or snapshot data for debugging purposes. ST suggests using the ISR to reload the watchdog - that is, kick the dog so a reset does not occur. Don't take their advice. If the program crashes, the interrupt handlers may very well continue to function normally. And using an ISR to reload the WDT invalidates the entire reason for a window watchdog.
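A sketch of that early-warning handler (CMSIS names again; it assumes the EWI bit was set in WWDG->CFR and the interrupt enabled in the NVIC, and log_crash_info() is a hypothetical routine). Note that it deliberately does not reload the counter:

void WWDG_IRQHandler(void)
{
    WWDG->SR = 0;        /* clear the early-wakeup flag */
    log_crash_info();    /* hypothetical: snapshot registers, task state */
    /* no reload here - let the reset proceed */
}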
The WDT cannot be disabled once enabled - good thinking, folks! But oddly, the other configuration registers can be changed at will, which can make the watchdog behave incorrectly.
A Novel Approach
The latest issue of IEEE Embedded Systems Letters (December 2010, Volume 2, Number 4) has an article that practically grabbed me by the throat. Titled "Control Focused Soft Error Detection for Embedded Applications" by Karthik Shankar and Roman Lysecky, it's a bit academic and rather a slog to get through. But the authors have come up with a fascinating twist on the concept of watchdog timers. In fact, they don't use the word "watchdog," and no timers are involved.
The idea is quite simple and at first glance not particularly novel: monitor the addresses the processor issues and compare those against a profile obtained during development. If the CPU goes to an unexpected location, take some sort of remedial action. Do the same if it doesn't issue an expected address. The authors go further and compare against the number of expected loop iterations and the like.
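In software terms the check might look something like this - purely illustrative, since the real monitor lives in hardware beside the CPU, and the target list (hypothetical values here) comes from the development-time profile:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Branch targets captured during profiling (hypothetical values) */
static const uint32_t legal_targets[] = { 0x1000, 0x1400, 0x2000, 0x2400 };

bool flow_ok(uint32_t observed_target)
{
    for (size_t i = 0; i < sizeof legal_targets / sizeof legal_targets[0]; i++)
        if (legal_targets[i] == observed_target)
            return true;
    return false;   /* unexpected flow: take remedial action */
}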
But what got my attention is how they monitor addresses. They simply suck in compressed trace data from the processor's serial trace data port. The paper talks about using an ARM CPU, but other parts also support various kinds of serial trace data, and there's even a standard named Nexus-5001 (http://www.nexus5001.org/), which is IEEE-ISTO 5001™-2003.
ARM supplies a bewildering array of debugging IP, including a macrocell that sends trace data out just two pins. The so-called Program Flow Trace (PFT) is described at http://infocenter.arm.com/help/topic/com.arm.doc.ihi0035a/IHI0035A_coresight_pft_architecture_spec.pdf. It can be set up in a zillion different configurations, but will at the very least emit compressed information that lets one track changes in program flow. For instance, the execution of a branch instruction pushes address data out. So does Branch with Link, which is ARM's way of calling a subroutine. Similar instructions in Thumb mode do the same.
There's a lot to like about the authors' approach. First, modern processors are barricaded behind caches, pipelines and speculative execution logic. Even if you're using a CPU that has address pins (as opposed to a self-contained microcontroller or IP on an FPGA) those address lines simply don't match anything that is going on in the processor's grey cells.
Second, pins are valuable. By squeezing addresses through a couple of debug pins, few of these scarce resources are consumed. And logic is cheaper than pins; adding circuitry to decompress the trace stream and make sense of it may cost very little. In the paper the authors state that the overhead for their particular approach was only 3% of the silicon area of an ARM1156T.
Third, most watchdogs use ad hoc approaches that just are not reliable indicators of system operation. This new approach lets the designer decide just how fine-grained the monitoring should be: check an occasional call, or watch every instance of every branch and call.
Fourth, there's no overhead in the system's firmware. Zilch. Unlike traditional watchdog approaches, which require one to seed the code with instructions to periodically kick the dog to keep it from resetting the system, address profiling is transparent to the software.
There are a few caveats, of course. The logic needed to make sense of the address information is substantial, and is probably impractical unless implemented in an FPGA. Building such a monitor would be a lot of work. But it need be done only once, and can ever after be used in a succession of products. I can see this as packaged IP sold by a third party, or as an open source project.
Remember, though, there's no guarantee that your ARM CPU will have the trace logic needed. Every vendor is free to include the IP or not.
Going Deeper
I haven't thought out all of the implications or possible ways to actually use this idea in a real embedded system, but here are a few ideas.
One problem with the proposed approach is that every recompile will shift all of the system's addresses, requiring the developer to re-profile the code to determine where the branches are. An alternative is to monitor just function calls, but use a level of indirection. Build a table of jump instructions that lives at a fixed address in memory; each entry corresponds to a particular function. Make calls through that table. Then monitor those jumps. One would have to be careful that the compiler didn't optimize the jumps away.
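One way to get the same effect in C is a table of function pointers rather than literal jump instructions - a sketch using GCC-style section attributes (the .calltab section and its fixed address are assumptions your linker script would have to honor). The volatile qualifier keeps the compiler from optimizing the indirection away:

typedef void (*fn_t)(void);

void task_a(void);   /* the functions being monitored */
void task_b(void);

/* Pin this table at a fixed address via the linker script */
__attribute__((section(".calltab")))
fn_t volatile const call_table[] = { task_a, task_b };

void run_task(int idx)
{
    call_table[idx]();   /* the indirect jump the monitor watches */
}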
This does mean the software changes a bit, of course. But more than a few of us use these jump tables anyway. They can aid debugging and sometimes simplify on-the-go firmware updating.
ARM's program flow trace also very intriguingly sends address information out when a DSB, DMB, or ISB instruction is executed. DSB (Data Synchronization Barrier) holds up program flow until all of the instructions before it complete; DMB (Data Memory Barrier) ensures that memory accesses before it complete before any after it start; and ISB (Instruction Synchronization Barrier) flushes the instruction pipeline, ensuring that the instructions after the ISB are fetched from memory or the cache.
Why are these interesting instructions? ARM mandates that at the very least a DMB be issued before locking or unlocking a mutex, and before incrementing or decrementing a semaphore. One could monitor these, just like watching branches, to ensure that the code is properly going through all of the activities mediated by the RTOS. In fact, monitoring only these actions may be simpler than monitoring addresses, because program flow can shift quite a lot from even a simple code change (requiring re-profiling), while resource management tends to change much less frequently.
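For instance, a mutex release with the barrier ARM calls for might look like this (GCC-style inline assembly; a bare-bones sketch, not a complete lock implementation):

static inline void dmb(void)
{
    __asm volatile ("dmb" ::: "memory");
}

void mutex_unlock(volatile int *lock)
{
    dmb();       /* make sure protected-region accesses complete first */
    *lock = 0;   /* release - the monitor sees a DMB at every unlock */
}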
As for the DSB and ISB instructions, I'm not quite sure how they could offer useful watchdog information, but something makes me think they could provide some interesting options.
Parting Thoughts
If you're building an electronic toothbrush, watchdogs are probably not terribly important. But even there an automated reset helps boost consumer confidence in our products' quality. Everyone hates the "remove batteries and wait 30 seconds" dance.
Many vendors are putting more thought into their WDT designs; some are doing a pretty good job. But we have a long way to go, and the wise developer will apply sound engineering practices to this often-neglected part of the system.
The article I cited shows that some ingenious approaches are being used. Consider adding a bit of hardware support if robustness is an important requirement.
Published February 1, 2011