Non-Volatile RAM
Originally in Embedded Systems Programming, April, 1999.
By Jack Ganssle
Many of the embedded systems that run our lives try to remember a little bit about us, or about their application domain, despite cycling power, brownouts, and all of the other perils of fixed and mobile operation. In the bad old days before microprocessors we had core memory, a magnetic medium that preserved its data whether powered or not.
Today we face a wide range of choices. Sometimes Flash or EEPROM is the natural choice for non-volatile applications. Always remember, though, that these devices have limited numbers of write cycles. Worse, in some cases writes can be very slow.
Battery-backed up RAMs still account for a large percentage of non-volatile systems. With robust hardware and software support they'll satisfy the most demanding of reliability fanatics; a little less design care is sure to result in occasional lost data.
Supervisory Circuits
In the early embedded days we were mostly blissfully unaware of the perils of losing power. Virtually all reset circuits were nothing more than a resistor/capacitor time constant. As Vcc ramped from 0 to 5 volts, the time constant held the CPU's reset input low - or lowish - long enough for the system's power supply to stabilize at 5 volts.
Though an elegantly simple design, the RC time constant was flawed on the back end, when power goes away. Turn the wall switch off, and the 5 volt supply quickly decays to zero. Quickly only in human terms, of course, as many milliseconds went by while the CPU was powered by something between 0 and 5. The RC circuit is, of course, at this point at a logic one (not-reset), so it allows the processor to run.
And run they do! With Vcc down to 3 or 4 volts most processors execute instructions like mad. Just not the ones you'd like to see. Run a CPU with out-of-spec power and expect random operation. There's a good chance the machine is going wild, maybe pushing and calling and writing and generally destroying the contents of your battery backed up RAM.
Worse, brown-outs, the plague of summer air conditioning, often cause small dips in voltage. If the AC mains decline to 80 volts for a few seconds a power supply might still crank out a few volts. When AC returns to full rated values the CPU is still running, back at 5 volts, but now horribly confused. The RC circuit never notices the dip from 5 to 3 or so volts, so the poor CPU continues running in its mentally unbalanced state. Again, your RAM is at risk.
Motorola, Maxim, and others developed many ICs designed specifically to combat these problems. Though features and specs vary, these supervisory circuits typically manage the processor's reset line, battery power to the RAM, and the RAM's chip selects.
Given that no processor will run reliably outside of its rated Vcc range, the first function of these chips is to assert reset whenever Vcc falls below about 4.7 volts (on 5 volt logic). Unlike an RC circuit which limply drools down as power fails, supervisory devices provide a snappy switch between a logic zero and one, bringing the processor to a sure, safe stopped condition.
They also manage the RAM's power, a tricky problem since it's provided from the system's Vcc when power is available, and from a small battery during quiescent periods. The switchover is instantaneous to keep data intact.
With RAM safely provided with backup power and the CPU driven into a reset state, a decent supervisory IC will also disable all chip selects to the RAM. The reason? At some point after Vcc collapses you can't even be sure the processor, and your decoding logic, will not create rogue RAM chip selects. Supervisory ICs are analog beasts, conceived outside of the domain of discrete ones and zeroes, and will maintain safe reset and chip select outputs even when Vcc is gone.
But check the specs on the IC. Some disable chip selects at exactly the same time they assert reset, asynchronously to what the processor is actually doing. If the processor initiates a write to RAM, and a nanosecond later the supervisory chip asserts reset and disables chip select, that write cycle will be one nanosecond long. You cannot play with write timing and expect predictable results. Allow any write in progress to complete before doing something as catastrophic as a reset.
Some of these chips also assert an NMI output when power starts going down. Use this to invoke your "oh_my_god_we're_dying" routine.
(Since processors usually offer but a single NMI input, when using a supervisory circuit never have any other NMI source. You'll need to combine the two signals somehow; doing so with logic is a disaster, since the gates will surely go brain dead due to Vcc starvation).
Check the specs on the parts, though, to ensure that NMI occurs before the reset clamp fires. Give the processor a handful of microseconds to respond to the interrupt before it enters the idle state.
There's a subtle reason why it makes sense to have an NMI power-loss handler: you want to get the CPU away from RAM. Stop it from doing RAM writes before reset occurs. If reset happens in the middle of a write cycle, there's no telling what will happen to your carefully protected RAM array. Hitting NMI first causes the CPU to take an interrupt exception, first finishing the current write cycle if any. This also, of course, eliminates troubles caused by chip selects that disappear synchronous to reset.
Every battery-backed up system should use a decent supervisory circuit; you just cannot expect reliable data retention otherwise. Yet, these parts are no panacea. The firmware itself is almost certainly doing things destined to defeat any bit of external logic.
Multi-byte Writes
Steve Lund wrote recently about a very subtle failure mode that afflicts all too many battery-backed up systems. He observed that in a kinder, gentler world than the one we inhabit all memory transactions would require exactly one machine cycle, but here on Earth 8 and 16 bit machines constantly manipulate large data items. Floating point variables are typically 32 bits, so any store operation requires two or four distinct memory writes. Ditto for long integers.
The use of high-level languages accentuates the size of memory stores. Setting a character array, or defining a big structure, means that the simple act of assignment might require tens or hundreds of writes.
Consider the simple statement:
a=0x12345678;
An x86 compiler will typically generate code like:
mov [bx], 5678
mov [bx+2], 1234
which is perfectly reasonable and seemingly robust.
In a system with a heavy interrupt burden it's likely that sooner or later an interrupt will switch CPU contexts between the two instructions, leaving the variable "a" half-changed, in what is possibly an illegal state. This serious problem is easily defeated by avoiding global variables - as long as "a" is a local, no other task will ever try to use it in the half-changed state.
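The other classic defense against the context-switch hazard is to make the store atomic by shutting off interrupts around it. A minimal sketch in C, where disable_interrupts() and enable_interrupts() are hypothetical stand-ins for whatever intrinsic your compiler or RTOS provides (CLI/STI on an x86, for instance):

extern void disable_interrupts(void);   /* assumed compiler/RTOS intrinsics */
extern void enable_interrupts(void);

static long shared_value;               /* multi-byte item an ISR also reads */

void set_shared_value(long new_value)
{
    disable_interrupts();               /* no ISR sees a half-changed value  */
    shared_value = new_value;           /* compiler emits two or more writes */
    enable_interrupts();
}

Note that this only protects against context switches; as the next paragraphs show, it does nothing for the power-down case.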
Power-down concerns twist the problem in a more intractable manner. As Vcc dies off a seemingly well designed system will generate NMI while the processor can still think clearly. If that interrupt comes between the two writes - and, given the perversity of nature, sooner or later it will - your device will enter the power-shutdown code with data now corrupt. It's quite likely (especially if the data is transferred via CPU registers to RAM) that there's no reasonable way to reconstruct the lost data.
The simple expedient of eliminating global variables has no benefit to the power-down scenario.
Can you imagine the difficulty of finding a problem of this nature? One that occurs maybe once every several thousand power cycles, or less? In many systems it may be entirely reasonable to conclude that the frequency of failure is so low the problem might be safely ignored. This assumes you're not working on a safety-critical device, or one with mandated minimal MTBF numbers.
Before succumbing to the temptation to let things slide, though, consider implications of such a failure. Surely once in a while a critical data item will go bonkers. Does this mean your instrument might then exhibit an accuracy problem (for example, when the numbers are calibration coefficients)? Is there any chance things might go to an unsafe state? Does the loss of a critical communication parameter mean the device is dead until the user takes some presumably drastic action?
If the only downside is that the user's TV set occasionally - and rarely - forgets the last channel selected, perhaps there's no reason to worry much about losing multi-byte data. Other systems are not so forgiving.
Steve suggested implementing a data integrity check on power-up, to ensure that no partial writes left big structures partially changed. I see two different directions this approach might take.
The first is a simple power-up check of RAM to make sure all data is intact. Every time a truly critical bit of data changes, update the CRC, so the boot-up check can see if data is intact. If not, at least let the user know that the unit is sick, data was lost, and some action might be required.
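A minimal sketch of that boot-time check, assuming the critical variables live in one contiguous battery-backed structure; the field names and the simple additive checksum are mine, purely for illustration:

#include <stddef.h>
#include <stdint.h>

struct nv_data {                     /* everything that must survive power loss */
    long     last_channel;
    long     cal_offset;
    uint16_t checksum;               /* covers every byte above it */
};

static struct nv_data nv;            /* placed in battery-backed RAM by the linker */

static uint16_t nv_checksum(const struct nv_data *p)
{
    const uint8_t *b = (const uint8_t *)p;
    uint16_t sum = 0;
    size_t i;
    for (i = 0; i < offsetof(struct nv_data, checksum); i++)
        sum += b[i];
    return sum;
}

void nv_update_checksum(void)        /* call after every critical change */
{
    nv.checksum = nv_checksum(&nv);
}

int nv_data_intact(void)             /* call once at power-up; nonzero means OK */
{
    return nv.checksum == nv_checksum(&nv);
}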
A second, and more robust, approach is to complete every data item write with a checksum or CRC of just that variable. Power-up checks of each item's CRC then reveals which variable was destroyed. Recovery software might, depending on the application, be able to fix the data, or at least force it to a reasonable value while warning the user that, whilst all is not well, the system has indeed made a recovery.
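The per-item flavor might look something like the following sketch, where each critical variable is wrapped in a small record carrying its own check word. All of the names, and the recover-to-a-default policy, are illustrative assumptions rather than a prescription:

#include <stdint.h>

extern void warn_user(void);         /* hypothetical "all is not well" notice */

struct nv_item {
    long     value;                  /* lives in battery-backed RAM */
    uint16_t check;                  /* checksum of 'value' alone   */
};

static uint16_t checksum_of(long v)
{
    const uint8_t *b = (const uint8_t *)&v;
    uint16_t sum = 0;
    unsigned i;
    for (i = 0; i < sizeof v; i++)
        sum += b[i];
    return sum;
}

void nv_item_write(struct nv_item *item, long v)
{
    item->value = v;
    item->check = checksum_of(v);
}

void nv_item_recover(struct nv_item *item, long safe_default)
{
    /* Power-up scan: repair anything caught in mid-write. */
    if (item->check != checksum_of(item->value)) {
        item->value = safe_default;
        item->check = checksum_of(item->value);
        warn_user();
    }
}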
Though CRCs are an intriguing and seductive solution I'm not so sanguine about their usefulness. Philosophically it is important to warn the user rather than to crash or use bad data. But it's much better to never crash at all.
We can learn from the OOP community and change the way we write data to RAM (or, at least the critical items for which battery back-up is so important).
First, hide critical data items behind drivers. The best part of the OOP triptych mantra "encapsulation, inheritance, polymorphism" is "encapsulation". Bind the data items with the code that uses them. Avoid globals; change data by invoking a routine, a method, that does the actual work. Debugging the code becomes much easier, and reentrancy problems diminish.
Second, add a "flush_writes" routine to every device driver that handles a critical variable. "Flush_writes" finishes any interrupted write transaction, and relies on the fact that only one routine - the driver - ever sets the variable.
Next, enhance the NMI power-down code to invoke all of the flush_write routines. Part of the power-down sequence then finishes all pending transactions, so the system's state will be intact when power comes back.
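Here's one way the pieces might fit together, sketched for a single critical variable. The shadow-copy trick and all of the names (set_setpoint, flush_setpoint, power_fail_nmi) are my illustration of the idea, not the only way to do it:

static long          setpoint;           /* the critical, battery-backed variable */
static long          setpoint_pending;   /* shadow copy, written first            */
static volatile char setpoint_dirty;     /* nonzero while a change is in flight   */

void set_setpoint(long v)                /* the only code that ever writes 'setpoint' */
{
    setpoint_pending = v;                /* complete the shadow copy first */
    setpoint_dirty   = 1;                /* transaction open               */
    setpoint         = v;
    setpoint_dirty   = 0;                /* transaction closed             */
}

void flush_setpoint(void)                /* finish any interrupted write */
{
    if (setpoint_dirty) {
        setpoint       = setpoint_pending;
        setpoint_dirty = 0;
    }
}

void power_fail_nmi(void)                /* hooked to the supervisor's NMI output */
{
    flush_setpoint();                    /* ...one such call per critical driver  */
    for (;;)                             /* park here until reset clamps the CPU  */
        ;
}

Whichever way the NMI lands relative to the driver, the variable ends up either wholly old or wholly new when power returns.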
The downside to this approach is that you'll need a reasonable amount of time between detecting that power is going away and the point where Vcc is no longer stable enough to support reliable processor operation. Depending on the number of variables that need flushing, this might mean hundreds of microseconds.
Firmware people are often treated as the scum of the earth, as they inevitably get the hardware (late) and are still required to get the product to market on time. Worse, too many hardware groups don't listen to, or even solicit, requirements from the coding folks before cranking out PCBs. This, though, is a case where the firmware requirements clearly drive the hardware design. If the two groups don't speak, problems will result.
Some supervisory chips do provide advanced warning of imminent power-down. Maxim's (www.maxim-ic.com) MAX691, for example, detects Vcc falling below some value before shutting down RAM chip selects and slamming the system into a reset state. It also includes a separate voltage threshold detector designed to drive the CPU's NMI input when Vcc falls below some value you select (typically by selecting resistors). It's important to set this threshold above the point where the part goes into reset. Just as critical is understanding how power fails in your system. The capacitors, inductors, and other power supply components determine how much "alive" time your NMI routine will have before reset occurs. Make sure it's enough.
I mentioned the problem of power failure corrupting variables to Scott Rosenthal, one of the smartest embedded guys I know. His casual "yeah, sure, I see that all the time" got me interested. It seems that one of his projects, an FDA-approved medical device, uses hundreds of calibration variables stored in RAM. Losing any one means the instrument has to go back for readjustment. Power problems are just not acceptable.
His solution is a hybrid between the two approaches just described. The firmware maintains two separate RAM areas, with critical variables duplicated in each. Each variable has its own driver.
When it's time to change a variable, the driver sets a bit that indicates "change in process". The variable is updated, and a CRC is computed for that data item and stored with it. The driver un-asserts the bit, and then performs the exact same sequence on the variable stored in the duplicate RAM area.
On power-up the code checks to ensure that the CRCs are intact. If not, that indicates the variable was in the process of being changed, and is not correct, so data from the mirrored address is used. If both CRCs are OK, but the "being changed" bit is asserted, then the data protected by that bit is invalid, and correct information is extracted from the mirror site.
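Pulling that description together into code might look roughly like this; the field names, the CRC flavor, and the single-variable simplification are all mine (Scott's real code, covering hundreds of variables, is surely more elaborate):

#include <stdint.h>

struct cal_var {
    volatile uint8_t changing;           /* the "change in process" bit   */
    long             value;
    uint16_t         crc;                /* CRC over 'value' alone        */
};

static struct cal_var primary;           /* two battery-backed copies...  */
static struct cal_var mirror;            /* ...in separate RAM areas      */

static uint16_t crc16(const void *p, unsigned len)   /* CRC-16/CCITT, bit by bit */
{
    const uint8_t *b = (const uint8_t *)p;
    uint16_t crc = 0xFFFF;
    while (len--) {
        int i;
        crc ^= (uint16_t)(*b++) << 8;
        for (i = 0; i < 8; i++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

static void update_copy(struct cal_var *c, long v)
{
    c->changing = 1;                     /* flag the transaction          */
    c->value    = v;
    c->crc      = crc16(&c->value, sizeof c->value);
    c->changing = 0;                     /* transaction complete          */
}

void cal_var_write(long v)               /* the variable's driver */
{
    update_copy(&primary, v);
    update_copy(&mirror, v);             /* repeat on the duplicate area  */
}

long cal_var_read_at_boot(void)          /* power-up recovery */
{
    int ok = !primary.changing &&
             primary.crc == crc16(&primary.value, sizeof primary.value);
    return ok ? primary.value : mirror.value;
}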
The result? With thousands of instruments in the field, over many years, not one has ever lost RAM.
Testing
Good hardware and firmware design leads to reliable systems. You won't know for sure, though, if your device really meets design goals without an extensive test program. Modern embedded systems are just too complex, with too much hard-to-model hardware/firmware interaction, to expect reliability without realistic testing.
This means you've got to pound on the product, and look for every possible failure mode. If you've written code to preserve variables around brown-outs and loss of Vcc, and don't conduct a meaningful test of that code, you'll probably ship a subtly broken product.
In the past I've hired teenagers to mindlessly and endlessly flip the power switch on and off, logging the number of cycles and the number of times the system properly comes to life. Though I do believe in bringing youngsters into the engineering labs to expose them to the cool parts of our profession, sentencing them to mindless work is a sure way to convince them to become lawyers rather than techies.
Better, automate the tests. The Poc-It, from Microtools (www.microtoolsinc.com/products.htm) is an indispensable $250 device for testing power-fail circuits and code. It's also a pretty fine way to find uninitialized variables, as well as isolating those awfully hard-to-initialize hardware devices like some FPGAs.
The Poc-It brainlessly turns your system on and off, counting the number of cycles. Another counter logs the number of times a logic signal asserts after power comes on. So, add a bit of test code to your firmware to drive a bit up when (and if) the system properly comes to life. Set the Poc-It up to run for a day or a month; come back and see if the number of power cycles is exactly equal to the number of successful assertions of the logic bit. Anything other than equality means something is dreadfully wrong.
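The firmware side of that can be as simple as the following sketch, which raises an output line only after the power-up integrity check passes. The port address, the bit, and the check routine are, of course, made up for the example:

#define TEST_PORT (*(volatile unsigned char *)0x8000)   /* assumed memory-mapped output port */
#define TEST_BIT  0x01                                  /* the line the Poc-It counts        */

extern int nv_data_intact(void);         /* the power-up check sketched earlier */

void signal_good_boot(void)              /* call at the end of initialization */
{
    if (nv_data_intact())
        TEST_PORT |= TEST_BIT;           /* one assertion per good power cycle */
}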
Conclusion
When embedded processing was relatively rare, the occasional weird failure meant little. Hit the reset button and start over. That's less of a viable option now. We're surrounded by hundreds of CPUs, each doing its thing, each affecting our lives in different ways. Reliability will probably be the watchword of the next decade as our customers refuse to put up with the quirks that are all too common now.
The current drive is to add the maximum number of features possible to each product. I see cell phones that include games. Features are swell - if they work, if the product always fulfills its intended use. Cheat the customer out of reliability and your company is going to lose. Power cycling is something every product does, and it's too important to ignore.
Thanks to Steve Lund for his thoughts and concerns, and to Scott Rosenthal (www.sltf.com) for his ideas.