Interrupt Predictability
Will your ISRs be fast enough? How do you know?
Published in Embedded Systems Programming, May 1995
By Jack Ganssle
"There are strange things done in the midnight sun by the men who moil for gold .
The arctic trails have their secret tales that would make your blood run cold;
The northern lights have seen queer sights, but the queerest they ever did see Was that night on the marge of Lake Lebarge, when I cremated Sam Mcgee."
Switch a few words in these lines from Robert Service's ode to the Yukon and you'd have a description of the weird and mysterious ways software developers make their real-time systems run properly. For it seems that when interrupts start coming fast and furious, like a shower of arrows, no one really knows how to ensure that each interrupt gets serviced on time, every time.
How do you know that your code handles every interrupt in a timely manner? Though it's possible to watch for a system crash, some failures occur slowly and insidiously, like a Windows application that erratically leaks resources. A few missed interrupts may only cause your system to lose track of time, at first, or to miscount an encoder by just a few pulses. These cancerous infections may lurk for months or years before manifesting themselves as noticeable bugs.
Components of Disaster
A simple embedded system with a single interrupt clearly will run correctly as long as the ISR never takes longer to execute than the interrupt's period. Correctness is easy to prove: measure the ISR's maximum execution time and compare it to the interrupt's minimum interval.
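As a concrete illustration, here's a minimal sketch of instrumenting an ISR to record its own worst-case execution time; read_cycle_counter() is a hypothetical stand-in for whatever free-running timer or cycle counter your processor provides.

    #include <stdint.h>

    extern uint32_t read_cycle_counter(void);   /* assumed: free-running up-counter */

    volatile uint32_t isr_max_cycles;            /* inspect with a debugger or log it */

    void timer_isr(void)
    {
        uint32_t start = read_cycle_counter();

        /* ...the normal interrupt service work goes here... */

        uint32_t elapsed = read_cycle_counter() - start;
        if (elapsed > isr_max_cycles)
            isr_max_cycles = elapsed;            /* remember the worst case seen so far */
    }

Run it for a long time under realistic load, then compare isr_max_cycles against the interrupt's minimum interval.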
Unless you have very smart hardware that can stack up backlogged interrupts, the software must service each interrupt before two get backed up. Two? One interrupt can go pending while another is being serviced; though the processor ignores the backlogged request while the first is in process, after the ISR completes the CPU will go ahead and respond to the still-asserted request.
If yet another interrupt were to occur (say, from that rotating encoder), then one of the two backlogged requests will be lost. After all, the interrupt comes into the CPU on but a single pin; it can express only "pending" or "not pending"; there just aren't enough bits to indicate "hey, now I've got two pending!"
The obvious moral is to make sure interrupts are never disabled for so long that one can be missed. It's not easy - perhaps sometimes not even possible - to guarantee that the code will satisfy this condition. I contend that any reasonably complex system will probably not have an interrupt structure that is "practically" provably correct. "Practically" is the operative word - I have yet to speak to any embedded designer using any formal method of proving code correctness for any application.
If the academics have a solution, we're not using it!
Crummy hardware design will create significant interrupt service problems as well. Most processors have level-sensitive interrupt inputs. Any device requesting an interrupt must assert the request until the processor acknowledges it. You can't just blip the input and expect the CPU to catch it.
Design your hardware to assert the input until the CPU responds with an interrupt acknowledge cycle. Most modern processors will require this, as you'll have to drop a vector on the bus at the same time. Others, though, include default vectoring that sorely tempts a chip-limited designer to just assume the software will always be in an interrupt-ready state. Your code could be off doing something, with interrupts disabled, and miss that oh-so-short input signal.
Reentrancy
Well-designed interrupt handlers are largely reentrant. Reentrant functions, AKA "pure code," are often falsely assumed to be any code that does not modify itself. Too many programmers feel that if they simply avoid self-modifying code, their routines are guaranteed to be reentrant, and thus interrupt-safe. Nothing could be further from the truth.
A function is reentrant if, while it is being executed, it can be re-invoked by itself, or by any other routine. Reentrancy was originally invented for mainframes, in the days when memory was a valuable commodity. System operators noticed that dozens or hundreds of identical copies of a few big programs would be in the computer's memory at any time. At the University of Maryland, my old hacking grounds, the monster Univac 1108 had one of the early reentrant FORTRAN compilers. It burned up a (for those days) breathtaking 32K words of system memory, but, being reentrant, it needed only that single 32K copy even if 50 users were running it. Each user executed the same code, from the same set of addresses.
A routine must satisfy the following conditions to be reentrant:
1) It never modifies itself. That is, the instructions of the program are never changed. Period. Under any circumstances. Far too many embedded systems still violate this cardinal rule.
2) All variables changed by the routine must be allocated to a particular "instance" of the function's invocation. Thus, if reentrant function FOO is called by three different functions, FOO's data must live in three different areas of RAM. The C language makes this trivial, assuming you are clever enough to use automatic variables in your code. Automatics are stored on the stack; each incarnation of a reentrant routine brings its own stack frame, and thus its own set of automatics.
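To make the distinction concrete, here's a minimal sketch, with illustrative names only, contrasting a routine that fails condition 2 with one that satisfies it.

    static int scratch;               /* one copy shared by every caller: trouble */

    int sum_bad(int a, int b)
    {
        scratch = a;                  /* an interrupt that re-invokes sum_bad here
                                         clobbers scratch for the first caller */
        scratch += b;
        return scratch;
    }

    int sum_good(int a, int b)
    {
        int total = a + b;            /* automatic: lives in this invocation's own
                                         stack frame, so every caller gets a copy */
        return total;
    }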
This is not the only reentrancy issue, though. Suppose your main line routine and the ISRs are all coded in C. The compiler will certainly invoke runtime functions to support floating point math, I/O, string manipulations, and the like. If the runtime package is only partially reentrant, then your ISRs may very well corrupt the execution of the main line code. This problem is common, but is virtually impossible to troubleshoot since the symptoms appear only occasionally and erratically. Can you imagine the difficulty of isolating a bug which manifests itself only occasionally, and with totally different characteristics each time?
Be sure your compiler has a pure runtime package.
Now, sometimes we're tempted to cheat and write a nearly-pure routine. If your ISR merely increments a global 32 bit value, say, to maintain time, it would seem legal to produce code that does nothing more than a quick and dirty increment. Beware! Especially when writing code on an 8 or 16 bit processor, remember that the C compiler will surely generate several instructions to do the increment, something like:
    mov  ax,[j]
    add  ax,1        ; increment low part of j
    mov  [j],ax
    mov  ax,[j+1]
    adc  ax,0        ; propagate carry to high part of j
    mov  [j+1],ax
An interrupt in the middle of this code will leave j just partially changed; if the ISR is reincarnated with j in transition, its value will surely be corrupt.
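The usual cure is a brief critical section whenever the main line touches the shared value. Here's a minimal sketch, assuming hypothetical disable_interrupts() and enable_interrupts() stand-ins for whatever intrinsic your compiler offers:

    #include <stdint.h>

    extern void disable_interrupts(void);   /* hypothetical compiler intrinsic */
    extern void enable_interrupts(void);

    volatile uint32_t j;                     /* tick count bumped by the timer ISR */

    uint32_t read_ticks(void)
    {
        uint32_t copy;

        disable_interrupts();                /* keep the ISR out while we copy j */
        copy = j;
        enable_interrupts();

        return copy;
    }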
Even the perfectly coded reentrant ISR leads to problems. If such a routine runs so slowly that interrupts keep giving birth to additional copies of it, eventually the stack will fill. Once the stack bangs into your variables the program is on its way to oblivion. You must ensure that the average interrupt rate is such that the routine will return more often than it is invoked.
Measuring Interrupt Response
Though predicting a system's interrupt response is probably impossible, you can use a few tricks to get typical performance numbers.
Typical numbers are the best we can get. There's no assurance that measurements taken over a second, year or century will represent worst case system performance. Perhaps one day users select an unusual combination of inputs; the temperature is running a bit hot; interrupts are bunched up by a faster than usual serial stream, whose data for some reason consists of once-in-a-lifetime numbers that are tough to compute, burning more CPU time. One interrupt runs just a shade too long, causing another to back up till a third gets missed. It's a chaotic situation that we hope never occurs, but our hopes are based on nothing more than a nervous prayer. Thankfully the occupants of the aircraft whose autopilot your system controls don't understand just how poorly we know what we're doing!
Branch analyzers are the rage in larger systems. These devices, akin to emulators, monitor your code's execution to verify that every possible branch in the code is taken. A branch analyzer ensures that the code has at least been completely exercised, though correctness is more difficult to monitor. Though a branch analyzer will prove that each ISR has executed at least once, it simply can't ensure that interrupts will never be missed.
A scope can measure interrupt latency and response very effectively in a single-interrupt system, but when more than one device can interrupt the processor, the scope is generally unsatisfactory. Too much is going on, too fast, in too many dimensions, to monitor on even a fast digital scope. Similarly, logic analyzers do a poor job of finding crummy interrupt response.
Probably the best hardware tool you can use is a decent performance analyzer. Be sure to get one that measures more than average response to an interrupt; it must log the worst case, or maximum time, in each ISR. Make sure it can monitor all of the ISRs simultaneously. Run your tests for weeks over every possible condition - and then cross your fingers and hope things don't degenerate after the product starts to ship.
Personally, I think the best way to measure interrupt predictability is to instrument the code to fault when an error occurs. Plan for failure. If your system can at least alert the user that things have gone to hell, you'll avert a crash and will have the option of failing gracefully. In a life-critical application add a little hardware to indicate "lost interrupt"... but don't tie the output of this circuit to the CPU's normal interrupt pin! Use NMI, as this situation is as catastrophic as a power failure.
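In software, the same "plan for failure" idea can be as simple as an overrun check in the ISR itself. The sketch below is illustrative only; read_adc() and panic() stand in for your own hardware access and error handler.

    #include <stdint.h>
    #include <stdbool.h>

    extern uint16_t read_adc(void);          /* hypothetical hardware access */
    extern void panic(const char *msg);      /* hypothetical graceful-failure hook */

    volatile uint16_t sample;
    volatile bool     sample_ready;

    void adc_isr(void)
    {
        if (sample_ready)                    /* previous sample never consumed;
                                                the system has fallen behind */
            panic("overrun: data lost in adc_isr");

        sample = read_adc();
        sample_ready = true;                 /* main loop clears this when it reads */
    }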
Beware of reentrant routines. Add a bit of code in the system's main loop to monitor the stack pointer. If the SP bottoms out, you've clearly got a problem that could be related to getting interrupts faster than the system can process them. Any sort of creeping SP is a deadly problem that is easy to detect.
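One way to implement that check is sketched below, under the assumption that startup code pre-fills the stack region with a known pattern and that the stack grows downward; stack_bottom and STACK_FILL are hypothetical names.

    #include <stdint.h>

    #define STACK_FILL 0xAA55u

    extern uint16_t stack_bottom[];          /* lowest words of the stack region */
    extern void panic(const char *msg);

    void check_stack(void)                   /* call this from the main loop */
    {
        /* If the sentinel words nearest the bottom have been overwritten, the
           stack has crept dangerously deep, perhaps because ISRs are nesting
           faster than they return. */
        if (stack_bottom[0] != STACK_FILL || stack_bottom[1] != STACK_FILL)
            panic("stack nearly exhausted");
    }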
Common Sense Coding
Poorly coded interrupt service routines are the bane of our industry. Most ISRs are hastily thrown together, tuned at debug time to work, and tossed in the "oh my god it works" pile and forgotten. A few simple rules can alleviate many of the common problems.
First, don't even consider writing a line of code for your new embedded system until you lay out an interrupt map. List each one, and give an English description of what the routine should do. Include your estimate of the interrupt's frequency.
Now approximate the complexity of each ISR. Given the interrupt rate, with some idea of how long it'll take to service each, you can assign priorities (assuming your hardware includes some sort of interrupt controller). Some developers assign the highest priority to things that must get done; remember that in any embedded system every interrupt must be serviced sooner or later. Give the highest priority to things that must be done in staggeringly short times to satisfy the hardware or the system's mission (like, to accept data coming in from a 1 Mb/sec source).
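One convenient place to keep the map is right in the source, so the plan lives next to the implementation. The sources, rates, and priorities below are purely illustrative:

    struct irq_map_entry {
        const char *source;       /* what generates the interrupt */
        const char *action;       /* plain-English description of the ISR's job */
        unsigned    max_rate_hz;  /* estimated worst-case interrupt rate */
        unsigned    priority;     /* 0 = highest */
    };

    static const struct irq_map_entry irq_map[] = {
        { "1 Mb/sec serial link", "copy byte to ring buffer",     125000, 0 },
        { "shaft encoder",        "bump position count",            5000, 1 },
        { "10 ms timer tick",     "update time, kick scheduler",     100, 2 },
        { "front-panel keypad",   "queue keypress for main loop",     20, 3 },
    };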
The cardinal rule of interrupt handling is to keep the handlers short. A long ISR simply reduces the odds you'll be able to handle all time-critical events in a timely fashion. If the interrupt starts something truly complex, have the ISR spawn off a task that can run independently. This is an area where an RTOS is a real asset, as task management requires nothing more than a call from the application code.
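A minimal sketch of the idea follows, assuming a hypothetical rtos_queue_put_from_isr() primitive in place of whatever your kernel actually provides:

    #include <stdint.h>

    extern uint8_t read_uart(void);                      /* hypothetical hardware access */
    extern void rtos_queue_put_from_isr(uint8_t byte);   /* assumed RTOS primitive; a
                                                            waiting task wakes on the queue */

    void uart_isr(void)
    {
        uint8_t c = read_uart();          /* satisfy the hardware quickly */
        rtos_queue_put_from_isr(c);       /* hand the byte to a task */
        /* The lengthy protocol processing happens in that task, not here. */
    }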
Reenable interrupts as soon as practical in the ISR. Do the hardware-critical and non-reentrant things up front, then execute the interrupt enable instruction. Give other ISRs a fighting chance to do their thing.
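For example, a minimal sketch, again with hypothetical names (read_capture_register(), enable_interrupts(), log_measurement()):

    #include <stdint.h>

    extern uint16_t read_capture_register(void);   /* hypothetical hardware access */
    extern void enable_interrupts(void);            /* hypothetical compiler intrinsic */
    extern void log_measurement(uint16_t raw);

    void capture_isr(void)
    {
        /* Hardware-critical, non-reentrant work first: grab the latched value
           before the next edge can overwrite it. */
        uint16_t raw = read_capture_register();

        enable_interrupts();              /* give other ISRs a fighting chance */

        /* The slower, reentrant part can now be preempted safely. */
        log_measurement(raw);
    }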
Use reentrant code! Write your ISRs in C if at all possible, and use C's wonderful local variable scoping. Globals are an abomination in any programming environment; never more so than in interrupt handlers. Reentrant C code is orders of magnitude easier to write than reentrant assembly code.
Don't use NMI for anything other than catastrophic events. Power-fail, system shutdown, interrupt loss, and the apocalypse are all good things to monitor with NMI. Timer or UART interrupts are not.
When I see an embedded system with the timer tied to NMI, I know, for sure, that the developers found themselves missing interrupts. NMI may alleviate the symptoms, but only masks deeper problems in the code that most certainly should be cured.
NMI will break a reentrant interrupt handler, since most ISRs are non-reentrant during the first few lines of code where the hardware is serviced. NMI will thwart your stack management efforts as well.
Conclusion
Start your interrupt planning before writing a single line of code. Work out the details, priorities, and maximum execution times. Plan for problems: include code that looks for failures. In a really busy system try desperately to get time allocated for lots of testing, though we all know that when the system works at all, management will usually yell their mantra: "ship it!"
References:
The Cremation of Sam McGee, by Robert Service, from Collected Poems of Robert Service, 1907, G.P. Putnam's Sons, New York.