DMA

DMA is an important part of many embedded systems, yet far too many of us don't really understand it. Read on...

Published in Embedded Systems Programming, October 1994

By Jack Ganssle

Engineers love to torture the English language, using words in convoluted ways, verbizing the most passive of nouns, and inventing strange new words that might embarrass your mother (the word "dikes", referring to diagonal cutters - wire cutters - comes to mind).

Acronyms are our special bane. Every three-word noun phrase is immediately shortened to its initials, even when this may make an acronym of an acronym. The passage of time dulls one's memory until the original words referred to by the letters slip away, so the acronym becomes its own word. For example, though "CRT" really refers only to a large tube, it has come to stand for a complete monitor, electronics, tube, and all. Even names of corporations reflect our reliance on verbal shorthand. IBM, GE, SAIC - an alphabet soup of letters dances in front of our eyes. CACI even reincorporated themselves some years back to make the acronym the new, real corporate name, consigning the words the letters stood for to eternal obscurity.

Many embedded systems make use of DMA controllers. How many of us remember that DMA stands for Direct Memory Access? What idiot invented this meaningless phrase? I figure any CPU cycle directly accesses memory. This is a case of an acronym conveniently sweeping an embarrassing piece of verbal pomposity under the rug where it belongs.

Regardless, DMA is nothing more than a way to bypass the CPU to get to system memory and/or I/O. DMA is usually associated with an I/O device that needs very rapid access to large chunks of RAM. For example - a data logger may need to save a massive burst of data when some event occurs.

DMA requires an extensive amount of special hardware to manage the data transfers and to arbitrate access to the system bus. This might seem to violate our desire to use software wherever possible. However, DMA makes sense when the transfer rates exceed anything possible with software. Even the fastest loop in assembly language comes burdened with lots of baggage. A short code fragment that reads from a port, stores to memory, increments pointers, decrements a loop counter, and then repeats based on the value of the counter takes quite a few clock cycles per byte copied. A hardware DMA controller can do the same with no wasted cycles and no CPU intervention.
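Just to make the baggage concrete, here's the sort of loop we're talking about, in C. The inport() routine and DATA_PORT address are placeholders for whatever port-read mechanism your compiler and hardware actually provide:

    extern unsigned char inport(unsigned int port); /* hypothetical port read    */
    #define DATA_PORT 0x300                         /* hypothetical port address */

    /* Read count bytes from an I/O port into a buffer. Every pass
       through the loop pays for the port read, the store, the pointer
       increment, and the loop test - overhead a DMA controller avoids. */
    void copy_from_port(unsigned char *dest, unsigned int count)
    {
        while (count-- > 0)
            *dest++ = inport(DATA_PORT);
    }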

Admittedly, modern processors often have blindingly fast looping instructions. The 386's REP MOVS (repeated string move) shifts data much faster than most applications will ever need. However, the latency between a hardware event occurring and the code being ready to execute the REP MOVS will surely be many microseconds even in the most carefully crafted program - far too much time in many applications.

How it Works

Processors provide one or two levels of DMA support. Since the dawn of the micro age, just about every CPU has had basic bus exchange support. This is quite simple, and usually consists of just a pair of pins.

"Bus Request" (AKA "Hold" on Intel CPUs) is an input that, when asserted by some external device, causes the CPU to tri-state it's pins at the completion of the next instruction. "Bus Grant" (AKA "Bus Acknowledge" or "Hold Acknowledge") signals that the processor is indeed tristated. This means any other device can put addresses, data, and control signals on the bus. The idea is that a DMA controller can cause the CPU to yield control, at which point the controller takes over the bus and initiates bus cycles. Obviously, the DMA controller must be pretty intelligent to properly handle the timing and to drive external devices through the bus.

Modern high integration processors often include DMA controllers built right on the processor's silicon. This is part of the vendors' never-ending quest to move more and more of the support silicon to the processor itself, greatly reducing the cost and complexity of building an embedded system. In this case the Bus Request and Bus Grant pins are connected to the onboard controller inside the CPU package, though they usually come out to pins as well, so really complex systems can run multiple DMA controllers. It's a scary thought....

Every DMA transfer starts with the software programming the DMA controller, the device (either on board a high-integration CPU or a discrete component) that manages these transactions. The code must typically set up destination and source pointers to tell the controller where the data is coming from, and where it is going to. A counter must be programmed to track the number of bytes in the transfer. Finally, numerous mode and type bits must be set. These may include source or destination type (I/O or memory), the action to take on completion of the transaction (generate an interrupt, restart the controller, etc.), wait states for each DMA cycle, etc.
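The setup code winds up looking something like this sketch. The register layout, addresses, and mode bits here are pure invention - every controller defines its own map, so check the data sheet:

    /* Hypothetical memory-mapped DMA controller registers. Treat this
       as a pattern, not any real device's register map. */
    typedef struct {
        volatile unsigned long  src;    /* source address               */
        volatile unsigned long  dst;    /* destination address          */
        volatile unsigned short count;  /* bytes to move                */
        volatile unsigned short mode;   /* direction, I/O vs. memory,
                                           end-of-transfer action, etc. */
    } dma_regs;

    #define DMA             ((dma_regs *)0xFF40)  /* hypothetical base */
    #define DMA_IO_TO_MEM   0x0001                /* hypothetical bits */
    #define DMA_IRQ_ON_DONE 0x0080

    /* Arm the controller for an I/O-to-memory transfer; the transfer
       itself starts later, when the trigger condition occurs. */
    void dma_setup(unsigned long src, unsigned long dst,
                   unsigned short count)
    {
        DMA->src   = src;
        DMA->dst   = dst;
        DMA->count = count;
        DMA->mode  = DMA_IO_TO_MEM | DMA_IRQ_ON_DONE;
    }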

Now the DMA controller waits for some action to start the transfer. Perhaps an external I/O device signals it is ready by toggling a bit. Sometimes the software simply sets a "start now" flag. Regardless, the controller takes over the bus and starts making transfers.

Each DMA transfer looks just like a pair of normal CPU cycles. A memory or I/O read from the DMA source address is followed by a corresponding write to the destination. The source and destination devices cannot tell if the CPU is doing the accesses or if the DMA controller is doing them.

During each DMA cycle the CPU is dead - it's waiting patiently for access to the bus, but cannot perform any useful computation during the transfer. The DMA controller is the bus master. It has control until it chooses to release the bus back to the CPU.

Depending on the type of DMA transfer and the characteristics of the controller, the bus may return to the CPU after a single pair of cycles so the processor can run for a while, or a complete block of data may be moved without bringing the processor back from the land of the idle.

Once the entire transfer is complete the DMA controller may quietly go to sleep, or it may restart the transfer when the I/O is ready for another block, or it may signal the processor that the action is complete. In my experience this is always signaled via an interrupt. The controller interrupts the CPU so the firmware can take the appropriate action.
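A completion handler, then, might look like the following sketch. The status register and done bit are as hypothetical as the registers above - real controllers have their own acknowledgment protocols:

    #define DMA_STATUS (*(volatile unsigned char *)0xFF48) /* hypothetical */
    #define DMA_DONE   0x01                                /* hypothetical */

    volatile int transfer_complete = 0;

    /* Hook this to the DMA interrupt vector. It acknowledges the
       controller, then flags the main loop that the buffer is ready. */
    void dma_isr(void)
    {
        if (DMA_STATUS & DMA_DONE) {
            DMA_STATUS = DMA_DONE;     /* clear/acknowledge the flag */
            transfer_complete = 1;     /* tell the rest of the code  */
        }
    }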

To summarize, the processor programs the DMA controller with parameters about the transfers, the controller tri-states the CPU and moves data over the CPU bus, and then when the entire block is done the controller signals completion via an interrupt.

DMA controllers are wondrous and scary things, each with dozens of registers you must program just right to get any sort of response. I think these things are designed by committee, with each member throwing every possible feature into the chip. Though it's nice to have so much capability, writing the code can be a trial.

You must first have a very clear idea of exactly what sort of DMA transfers your system needs. DMA was invented to move data between I/O and memory, but now people use it for a variety of other reasons as well.

Traditional Synchronous DMA moves a byte or word at a time between system memory and a peripheral, handshaking with the I/O port for each transfer. This sort of transfer recognizes that the port may not always be in a ready condition; the handshaking is a hardware mechanism to throttle the transactions.

With this sort of transfer, the program sets up the controller and then carries on, oblivious to the state of the DMA transaction. The hardware moves one byte or word between memory and I/O each time the I/O port signals it is ready for another transaction. On each ready indication, the DMA controller asserts Bus Request, waits for a Bus Acknowledge in response, and then takes over the bus for a single transfer. Then, the DMA controller goes idle again, waiting for another ready signal from the port. Thus, the program and the DMA controller share the bus, with the controller winning any contest for control. Sometimes this is called "Cycle Stealing".

Burst Mode DMA, in contrast, generally assumes that the destination and source addresses can take transfers as fast as the controller can generate them. The program sets up the controller, and then (perhaps after a single ready indication from a port occurs), the entire source block is copied to the destination. The DMA controller gains exclusive access to the bus for the duration of the transfer, during which time the program is effectively shut down. Burst mode DMA can transfer data very rapidly indeed.

Flyby DMA, something not supported on many controllers, is a beast of a different color. The DMA controller gains access to the bus and puts the source or destination address out. Then, it initiates what is in effect a read and a write cycle simultaneously. The data is read from the source address and written to the destination at the same time. This implies that either the source or destination does not require an address, since it is very unlikely that both would use the same one. An example might be copying data from memory to a FIFO port - the source address (a pointer to memory) increments on each transfer, while the destination is always the same FIFO.

Flyby transactions are very fast since the read/write cycle pair is reduced to a single cycle. Both burst and synchronous types of transfers can be supported.

Typical Uses

The original IBM PC, that 8088-based monstrosity we all once yearned for but now snicker at, used a DMA controller to generate dynamic RAM refresh addresses. It simply ran a null transfer every 15 microseconds or so to generate the addresses needed by the DRAMs.

This was a very clever design - a normal refresh controller is a mess of logic. The only down side was that the PC's RAM was non-functional until the power-up code properly programmed the DMA controller.

Both floppy and hard disk controllers often use DMA to transfer data to and from the drive. This is a natural and perfect application. The software arms the controller and then carries on. The hardware waits for the drive to get to the correct position and then performs the transfer without further reliance on the system software.

If I'm working on a microprocessor with a "free" DMA controller (one built onto the chip), I'll sometimes use it for large memory transfers. This is especially useful on processors with segmented address spaces, like the 80188 or Z180 (the Z180's space is not actually segmented, but is limited to 64k without software intervention to program the MMU).

Both of these CPUs include on-board DMA controllers that support transfers over the entire 1 MB address space of the part. Judicious programming of the controller lets you do a simple memory copy of any size to any address - all without worrying about segment registers or the MMU. This is yet another argument for encapsulation: write a DMA routine once, debug it thoroughly, and then reuse it even for mundane tasks.
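Here's a sketch of what that encapsulated routine might look like, again with invented register addresses and mode bits, and assuming the controller offers a memory-to-memory burst mode:

    #define DMA_SRC   (*(volatile unsigned long *)0xFF40)  /* hypothetical */
    #define DMA_DST   (*(volatile unsigned long *)0xFF44)
    #define DMA_COUNT (*(volatile unsigned short *)0xFF48)
    #define DMA_CTRL  (*(volatile unsigned short *)0xFF4A)

    #define DMA_MEM_TO_MEM 0x0002    /* hypothetical mode bit  */
    #define DMA_START      0x8000    /* hypothetical start bit */
    #define DMA_BUSY       0x4000    /* hypothetical busy flag */

    /* Copy count bytes between two physical addresses. The controller
       takes 20-bit physical addresses directly, so no segment register
       or MMU fiddling is needed. */
    void dma_memcpy(unsigned long dst, unsigned long src,
                    unsigned short count)
    {
        DMA_SRC   = src;
        DMA_DST   = dst;
        DMA_COUNT = count;
        DMA_CTRL  = DMA_MEM_TO_MEM | DMA_START;
        while (DMA_CTRL & DMA_BUSY)   /* spin until the block is done */
            ;
    }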

Over the years I've profiled a lot of embedded code, and in many instances have found that execution time seems to be really burned up by string copy and move routines inside the C runtime library. If you have a spare DMA controller channel, why not recode portions of the library to use a DMA channel to do the moves? Depending on the processor a tremendous speed improvement can result.
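Building on the dma_memcpy() sketch above, one reasonable approach is to keep a plain loop for short blocks - where programming the controller costs more than it saves - and hand anything bigger to the DMA channel. The threshold here is a guess, to be tuned by profiling:

    #define DMA_THRESHOLD 64    /* guess; tune by profiling */

    void fast_memcpy(unsigned char *dst, unsigned char *src,
                     unsigned short count)
    {
        if (count < DMA_THRESHOLD) {
            while (count-- > 0)          /* small block: plain loop */
                *dst++ = *src++;
        } else {
            /* On a segmented part, convert the pointers to physical
               addresses here rather than using a flat cast. */
            dma_memcpy((unsigned long)dst, (unsigned long)src, count);
        }
    }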

I'll never forget the one time I should have used DMA, but didn't. As a consultant, rushed to get a job done, I carelessly threw together a hardware design figuring somehow I could make things work by tuning the software. For some inexplicable reason I did not put a DMA controller on the system, and suffered for weeks, tuning a few instructions to move data off a tape drive without missing bytes. A little more forethought would have made a big difference.

Troubleshooting DMA

This is a case where a thorough knowledge of the hardware is essential to making the software work. DMA is almost impossible to troubleshoot without using a logic analyzer.

No matter what mode the transfers will ultimately use, and no matter what the source and destination devices are, I always first write a routine to do a memory to memory DMA transfer. This is much easier to troubleshoot than DMA to a complex I/O port. You can use your ICE to see if the transfer happened (by looking at the destination block), and to see if exactly the right number of bytes were transferred.
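Such a first test might look like the sketch below, using a routine like the dma_memcpy() above: fill the source with a known pattern, clear the destination, run one transfer, and verify every byte landed:

    #define BLOCK_SIZE 256

    static unsigned char src_buf[BLOCK_SIZE];
    static unsigned char dst_buf[BLOCK_SIZE];

    /* Returns 0 if the memory-to-memory transfer worked, -1 if not.
       As before, a segmented part needs a real pointer-to-physical
       conversion instead of the flat casts. */
    int dma_selftest(void)
    {
        unsigned short i;

        for (i = 0; i < BLOCK_SIZE; i++) {
            src_buf[i] = (unsigned char)i;   /* known pattern */
            dst_buf[i] = 0;
        }

        dma_memcpy((unsigned long)dst_buf, (unsigned long)src_buf,
                   BLOCK_SIZE);

        for (i = 0; i < BLOCK_SIZE; i++)
            if (dst_buf[i] != (unsigned char)i)
                return -1;
        return 0;
    }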

At some point you'll have to recode to direct the transfer to your device. Hook up a logic analyzer to the DMA signals on the chip to be sure that the addresses and byte count are correct. Check this even if things seem to work - a slight mistake might trash part of your stack or data space.

Some high integration CPUs with internal DMA controllers do not produce any sort of cycle that you can flag as being associated with DMA. This drives me nuts - one lousy extra pin would greatly ease debugging. The only way to track these transfers is to trigger the logic analyzer on address ranges associated with the transfer, but unfortunately these ranges may also have non-DMA activity in them.

Be aware that DMA will destroy your timing calculations. Bit banging UARTs will not be reliable; carefully crafted timing loops will run slower than expected. In the old days we all counted T-states to figure how long a loop ran, but DMA, prefetchers, cache, and all sorts of modern exoticness makes it almost impossible to calculate real execution time.

Scopes

On another subject, some time ago I wrote about using oscilloscopes to debug software. The subject is really too big to do justice in a couple of short magazine pieces. However, I just received a booklet from Tektronix that does justice to the subject. It's called "Basic Concepts - XYZ of Analog and Digital Oscilloscopes", and is their publication number 070-8690-01. Highly recommended. Get the book, borrow a scope, and play around for a while. It's fun and tremendously worthwhile.