Metastability and Firmware
By Jack Ganssle
Published in Embedded Systems Programming, July 2001
Last month I discussed the general problem of making software that reads asynchronous hardware reliable. Some very simple situations - like a timer that uses an interrupt service routine - can result in rare but quite serious faults. Whenever a physical input to the computer requires more than one I/O read, and keeps changing while those reads take place, there's a chance the data will be corrupt.
Suppose a robot uses a 10 bit encoder to monitor the angular location of a wrist joint. As the wrist rotates the encoder sends back a binary code, 10 bits wide, representing the joint's current position. An 8 bit processor requires two distinct I/O instructions - two byte-wide reads - to get the data. No matter how fast the computer might be there's a finite time between the reads during which the encoder data may change.
The wrist is rotating. A "get_position" routine reads 0xff from the low part of the position data. Then, before the next instruction, the encoder rolls over to 0x100. "get_position" reads the high part of the data - now 0x1 - and returns a position of 0x1ff, clearly in error and perhaps even impossible.
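In C the hazard looks something like this (a sketch; the port addresses are invented for illustration):

    /* Hypothetical port addresses for the encoder's two bytes */
    #define ENCODER_LOW  (*(volatile unsigned char *)0x40)
    #define ENCODER_HIGH (*(volatile unsigned char *)0x41)

    unsigned int get_position(void)
    {
        unsigned char low, high;
        low  = ENCODER_LOW;      /* reads 0xff...                     */
                                 /* ...encoder rolls over to 0x100... */
        high = ENCODER_HIGH;     /* ...reads 0x01                     */
        return ((unsigned int)high << 8) | low;  /* 0x1ff - garbage! */
    }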
This is a common problem. Handling input from a two axis controller? If the hardware continues to move during our reads, then the X and Y data will be slightly uncorrelated, perhaps yielding impossible results. One friend tracked a rare autopilot failure to the way the code read a flux-gate compass, whose output is a pair of related quadrature signals. Reading them at disparate times, while the vessel continued to move, yielded impossible heading data.
Input Capture Register
Hardware folks have dealt with similar problems for decades. Their usual solution is to add an input capture register between the I/O device and the processor. The register is nothing more than a parallel latch, as wide as the input data. The 10 bit encoder has a 10 bit register; the encoder's output goes to the register's inputs. A single clock line drives each flip-flop in the latch; when strobed it locks the data into the register. The output is fed to a pair of processor input ports.
When it's time to read a safe, unchanging value the code issues a "hold the data now" command which strobes encoder values into the latch. So all 10 bits are stored and can be read by the software at any time, with no fear of things changing between reads.
Some designers tie the register's clock input to one of the port control lines. The I/O read instruction then automatically strobes data into the latch, assuming one is wise enough to ensure the register latches data on the leading edge of the clock.
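In code, the sequence becomes: strobe, then read at leisure. A sketch, again with invented addresses (here the strobe is a separate write; if the latch clock is wired to the port's control line as just described, the first input instruction does the job instead):

    /* Hypothetical ports: a write to LATCH_STROBE clocks the register */
    #define LATCH_STROBE (*(volatile unsigned char *)0x42)
    #define LATCH_LOW    (*(volatile unsigned char *)0x40)
    #define LATCH_HIGH   (*(volatile unsigned char *)0x41)

    unsigned int get_position_latched(void)
    {
        unsigned char low, high;
        LATCH_STROBE = 1;        /* freeze all 10 bits at one instant */
        low  = LATCH_LOW;        /* both reads now see the same       */
        high = LATCH_HIGH;       /* frozen, self-consistent value     */
        return ((unsigned int)high << 8) | low;
    }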
The input capture register is a very simple way to hold moving data steady for the duration of a couple of reads. At first glance it seems perfectly safe. But a bit of analysis shows that for asynchronous inputs it is not reliable. We're using hardware to fix a software problem, so we must be aware of the limitations of physical logic devices.
To simplify things for a minute, let's zoom in on that input capture register and examine just one of its bits. Each bit is stored in a flip-flop, a logic element that might have only three connections: data in, data out, and clock. When the input is a one, strobing clock puts a one at the output.
But suppose the input changes at about the same time clock cycles? What happens? The short answer is that no one knows.
Metastable States
Every flip-flop has two critical specifications we violate at our peril. "Set-up time" is the minimum number of nanoseconds that input data must be stable before clock comes. "Hold time" tells us how long to keep the data present after clock transitions. These specs vary depending on the logic device. Some might require tens of nanoseconds of set-up and/or hold time; others need an order of magnitude less.
If we tend to our knitting we'll respect these parameters and the flip-flop will always be totally predictable. But when things are asynchronous - say, the wrist rotates at its own rate and the software does a read whenever it needs data - there's a chance we'll violate set-up or hold time.
Suppose the flip-flop requires 3 nanoseconds of set-up time. Our data changes within that window, flipping state perhaps a single nanosecond before clock transitions. The device will go into a metastable state where the output gets very strange indeed.
By violating the spec the device really doesn't know if we presented a zero or a one. Its output goes, not to a logic state, but to either a half-level (in between the digital norms) or an oscillation, toggling wildly between states. The flip-flop is metastable.
This craziness doesn't last long; typically after a few to 50 nanoseconds the oscillations damp out or the half-state disappears, leaving the output at a valid one or zero. But which one is it? This is a digital system, and we expect ones to be ones, and zeroes zeroes.
The output is random. Bummer, that. You cannot predict which level it will assume. That sure makes it hard to design predictable digital systems!
Hardware folks feel that the random output isn't a problem. Since the input changed at almost exactly the same time the clock strobed, either a zero or a one is reasonable. If we had clocked just a hair ahead or behind we'd have gotten a different value, anyway. Philosophically, who knows which state we measured? Is this really a big deal? Maybe not to the EEs, but this impacts our software in a big way, as we'll see shortly.
Metastability occurs only when clock and data arrive almost simultaneously; the odds increase as clock rates soar. An equally important factor is the type of logic component used; slower logic (like 74HCxx) has a much wider metastable window than faster devices (say, 74FCTxx). Clearly at reasonable rates the odds of the two asynchronous signals arriving closely enough in time to cause a metastable situation are low; measurable, yes; important, certainly. With a 10 MHz clock and 10 kHz data rate, using typical but not terribly speedy logic, metastable errors occur about once a minute.
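If you want to put numbers on those odds, the standard model is an exponential. Here's a sketch of it in C (the parameter names are mine; the constants tau and t0 characterize the logic family and come from measured data, such as that in the TI report cited at the end of this article):

    #include <math.h>

    /* Mean time between metastable failures, in seconds. tr is the
       settling time allowed before the flop's output is sampled;
       tau and t0 are device constants from characterization data;
       fclk and fdata are the clock and data rates in Hz. */
    double metastable_mtbf(double tr, double tau, double t0,
                           double fclk, double fdata)
    {
        return exp(tr / tau) / (t0 * fclk * fdata);
    }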
The classic metastable fix uses two flip-flops connected in series. Data goes to the first; its output feeds the data input of the second. Both use the same clock input. The second flop's output will be "correct" after two clocks, since the odds of two metastable events occurring back-to-back are almost nil. With two flip-flops, at reasonable data rates errors occur millions or even billions of years apart. Good enough for most systems.
But "correct" means the second stage's output will not be metastable: it's not oscillating, nor is it at an illegal voltage level. There's still an equal chance the value will be in either legal logic state.
Firmware, not Hardware
To my knowledge there's no literature about how metastability affects software, yet it poses very real threats to building a reliable system.
Hardware designers smugly cure their metastability problem using the two-stage flops described. Their domain is that of a single bit, whose input changed at just about the same time as the clock transition. Thinking in such narrow terms it's indeed reasonable to accept the inherent random output the flops generate.
But we software folks are reading parallel I/O ports, each perhaps 8 bits wide. That means there are 8 flip-flops in the input capture register, all driven by the same clock pulse.
Let's look at what might happen. The encoder changes from 0xff to 0x100. This small difference might represent just a tiny change in angle. We request a read at just about the same time the data changes; our input operation strobes the capture register's clock, creating a violation of set-up or hold time. Every input bit changes; each of the flip-flops inside the register goes metastable. After a short time the oscillations die out, but now every bit in the register is random. Though the hardware folks might shrug and complain that no one knows what the right value was, since everything changed as clock arrived, in fact the data was around 0xff or 0x100. A random result of, say, 0x12 is absurd and totally unacceptable, and may lead to crazy system behavior.
The case where data goes from 0xff to 0x100 is pathological since every bit changes at once. The system faces the same peril whenever lots of bits change. 0x0f to 0x10. 0x1f to 0x20. The upper, unchanging data bits will always latch correctly; but every changing bit is at risk.
Why not use the multiple flip-flop solution? Connect two input capture registers in series, both driven by the same clock. Though this will eliminate the illegal logic states and oscillations, the second stage's output will be random as well.
One option is to ignore metastability and hope for the best. Or use very fast logic with very narrow set-up/hold time windows to reduce the odds of failure. If the code samples the inputs infrequently it's possible to reduce the chance of a metastable error to one in millions or even billions. Building a safety critical system? Feeling lucky?
It is possible to build a synchronizer circuit that takes a request for a read from the processor, combines it with a data-available bit from the I/O device, and responds with a data-OK signal back to the CPU. This is non-trivial and prone to errors.
An alternative is to use a different coding scheme for the I/O device. Buy an encoder with Gray Code output, for example (if you can find one). Gray Code is a counting scheme where only a single bit changes between numbers, as follows:
    0   000
    1   001
    2   011
    3   010
    4   110
    5   111
    6   101
    7   100
Gray code makes sense if, and only if, your code reads the device faster than it's likely to change, and if the changes happen in a fairly predictable fashion - like counting up. Then there's no real chance of more than a single bit changing between reads; if the inputs go metastable only one bit will be wrong. The result will still be reasonable.
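The firmware's job is then to convert the Gray value back to binary. The classic shift-and-XOR conversion takes just a few lines (a sketch for values up to 8 bits wide):

    /* Convert a Gray code value to straight binary: each binary bit
       is the XOR of all Gray bits at or above that position. */
    unsigned char gray_to_binary(unsigned char gray)
    {
        unsigned char binary = gray;
        while (gray >>= 1)     /* fold each higher Gray bit down... */
            binary ^= gray;    /* ...XORing as we go                */
        return binary;
    }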
Another solution is to compute a parity or checksum of the input data before the capture register. Latch that, as well, into the register. Have the code compute parity and compare it to that read; if there's an error do another read.
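In firmware that check becomes a short retry loop. A sketch, reusing the hypothetical port macros from the earlier listing and assuming the hardware parks its parity bit in the top bit of the high port:

    /* Hypothetical layout: 10 data bits across two ports, with a
       hardware-computed parity bit in bit 7 of the high port. */
    static unsigned char parity_of(unsigned int v)
    {
        unsigned char p = 0;
        while (v) {
            p ^= (unsigned char)(v & 1);
            v >>= 1;
        }
        return p;
    }

    unsigned int get_position_checked(void)
    {
        unsigned char low, high;
        unsigned int pos;
        do {
            LATCH_STROBE = 1;                  /* freeze a new sample */
            low  = LATCH_LOW;
            high = LATCH_HIGH;
            pos  = ((unsigned int)(high & 0x03) << 8) | low;
        } while (parity_of(pos) != ((high >> 7) & 1)); /* bad? retry */
        return pos;
    }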
Though I've discussed adding an input capture register, please don't think that this is the root cause of the problem. Without that register - if you just feed the asynchronous inputs directly into the CPU - it's quite possible to violate the processor's innate set-up/hold times. There's no free lunch; all logic has physical constraints we must honor.
Don't Panic!
Some designs will never have a metastability problem. It always stems from violating set-up or hold times, which in turn comes from either poor design or asynchronous inputs.
All of this discussion has revolved around asynchronous inputs, when the clock and data are unrelated in time. Be wary of anything not slaved to the processor's clock. Interrupts are a notorious source of problems. If caused by, say, someone pressing a button, be sure that the interrupt itself, and the vector-generating logic, don't violate the processor's set-up and hold times.
But in computer systems most things do happen synchronously. If you're reading a timer that operates from the CPU's clock, it is inherently synchronous to the code. From a metastability standpoint it's totally safe.
Bad design, though, can plague any electronic system. Every logic component takes time to propagate data; when a signal traverses many devices the delays can add up significantly. If the data then goes to a latch it's quite possible that the delays may cause the input to transition at the same time as the clock. Instant metastability.
Designers are pretty careful to avoid these situations, though. Do be wary of FPGAs and other components where the delays vary depending on how the software routes the device. And when latching data or clocking a counter it's not hard to create a metastability problem by using the wrong clock edge. Pick the edge that gives the device time to settle before it's read.
What about analog inputs? Connect a 12 bit A/D converter to two 8 bit ports and we'd seem to have a similar problem: the analog data can wiggle all over, changing during the time we read the two ports. However, there's no need for an input capture register because the converter itself generally includes a "sample and hold" block, which stores the analog signal while the A/D digitizes. Most A/Ds then store the digital value till we start the next conversion.
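A typical read sequence, then, is just: start a conversion, wait for completion, and read both bytes at leisure (a sketch with invented register addresses for some hypothetical converter):

    #define ADC_START  (*(volatile unsigned char *)0x50)
    #define ADC_STATUS (*(volatile unsigned char *)0x51)
    #define ADC_LOW    (*(volatile unsigned char *)0x52)
    #define ADC_HIGH   (*(volatile unsigned char *)0x53)
    #define ADC_BUSY   0x01

    unsigned int read_adc(void)
    {
        unsigned char low, high;
        ADC_START = 1;                 /* sample-and-hold freezes the */
        while (ADC_STATUS & ADC_BUSY)  /* analog input; wait for the  */
            ;                          /* conversion to complete      */
        low  = ADC_LOW;                /* result stays put until the  */
        high = ADC_HIGH;               /* next conversion starts, so  */
        return ((unsigned int)(high & 0x0f) << 8) | low; /* safe read */
    }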
There's a lot of published information about metastability in circuits. One of the best sources is a Texas Instruments report (number SDYA006) named "Metastable Response in 5-V Logic Circuits". The formulas and empirical data included will help you quantitatively calculate the risks in your designs.