Perform or Perish

How do you deal with software performance problems?

Published in Embedded Systems Programming, March 1991

By Jack Ganssle

What processor should you use in your next product? High tech marketers push dozens of common CPU families, some with yet more dozens of variants within the family. Do you need 8, 16, or 32 bits? How big of an address space? How much on-board I/O? It's hard enough to pick a processor for a simple controller when cost is the only concern. What about more complex applications that make substantially more demands on the computer's raw computational capability?

Processor performance seems to be the Holy Grail of our business. The silicon wizards give us ever more MIPS for less and less money. But just how much is enough for the project at hand? Who can afford to select a chip, design the hardware, write tens of thousands of lines of code, and then discover that the system just can't keep up with real time events?

I don't know how to prove that a particular processor will be adequate for a non-trivial task. It's awfully hard to equate a less-than-adequate design specification with a MIPS rating. This is even tougher when programming in a high level language, or when using a canned real time operating system.

In real life it seems that the hardware designers often pick the CPU based on cost and peripheral compatibility, and then tell the programmers to "make it work". Somehow. Or, when software folks get involved in the decision, the choice sometimes winds up being made purely on the basis of experience. "The 6501 is in all of our products; surely we'll have no trouble with it in this one".

In fact, a lot of projects eventually get into trouble by overloading the processor. This is always discovered late in the development, during debugging or final integration, when the cost of correcting the problem is at its maximum. Then a mad scramble to remove machine cycles begins.

We all know the old adage that 90% of the processor burden lies in 10% of the code. It's important to find and optimize that 10%, not some other section that will have little impact on the system's overall performance. Nothing is worse than spending a week optimizing the wrong routine!

Fortunately, performance analysis tools and techniques abound. These range from a head-scratching session to $20k+ special purpose instruments. Do use the right tools (whether brainpower or technology) to find speed problems, and then optimize only where absolutely necessary. Otherwise, you'll be like a blind man hunting for a contact lens.

Performance Tools

Dozens of vendors stand ready to supply you with hardware-based performance analysis tools. Some come as add-ons to emulators; others are stand-alone products dedicated to the sole function of identifying performance bottlenecks. All offer easy, real time capture of your program's activity. The only (!) drawback is cost, which ranges from a few thousand dollars for an ICE add-on to over $20k for a dedicated instrument.

Hardware performance analyzers come in two very different flavors. Statistical units sample the target system's address lines every so often, logging the address found at each sample. Presumably, after some number of iterations a representative picture of the target's operation will emerge. Non-statistical instruments capture every single machine cycle issued by your program. These are more expensive than their statistical counterparts, but give an accurate execution profile of even single-shot events (like an infrequently-invoked subroutine).

These performance analyzers are particularly good at prying unexpected information from your program. For example, many 8 bit C compilers generate vast quantities of code to handle automatic variables, since the stack capabilities of the CPUs leave lots to be desired. A hardware performance analyzer will instantly reveal this overzealous stack use as a peak in processor activity.

Of course, not everyone can afford the cost of such a special purpose instrument; it's surprising just how many programmers still make do with primitive "burn and hope" strategies, eschewing all development aids. Since most people spend relatively little time solving performance problems, it often makes sense to use cheaper, more readily available tools.

Several companies promote software based performance analysis. Paradigm's Inside and Borland's Turbo Profiler are possibly the best known. Both operate by instrumenting compiled code to track its execution.

Software based performance analyzers are a great compromise between price and capability. They offer high accuracy and simple operation at an absurdly low cost. Unfortunately, the products mentioned operate only on 80x86-series processors, and both expect DOS services to be available. This does limit their usefulness. Fortunately, more and more embedded systems use Intel-series CPUs or their NEC near-equivalents. The wide availability of inexpensive, high quality compilers and assemblers is certainly a compelling argument in their favor.

You can often take advantage of software performance analyzers even if your embedded environment does not have DOS resources. Most of the code of any system is tied up in mathematical algorithms, screen handlers, and the like. It makes sense to prototype and measure the performance of all of these under MS-DOS before switching to the less stable development hardware.

Any software performance analyzer will interfere to some extent with the code's speed. It's a little like Heisenberg's Uncertainty Principle, except that instead of not knowing both the position and momentum of a particle, the measurement of routine speed skews the system's real time behavior. Interrupt response, in particular, will be degraded. Still, you can generally measure enough of a system's response to get a pretty good idea where the biggest problems are.

Tricks and Techniques

Part of the fun of embedded programming is creating clever windows into a system's operation. It's awfully hard to extract any sort of information from most embedded systems; performance data is especially challenging.

Beyond the commercial tools of performance measurement, it makes sense to understand and use other, perhaps less sophisticated, ways to get the same or similar data. Always strive to expand your arsenal of tools!

Pure software types always think I'm a little daft whenever I say this, but the oscilloscope, colloquially known as the scope, is one of the most useful debugging tools. After all, in the embedded business our code ultimately just pushes bits out I/O ports; scopes are a natural choice for measuring the timing of these signals. A scope is the ideal low-cost performance measurement tool.

Any decent scope shows the state of two independent signals simultaneously, as a graph of amplitude (i.e., voltage) versus time. Software people don't really care about the vertical axis calibration since any signal, regardless of voltage, is (for our purposes) as good as any other. The horizontal axis shows time, the crucial component needed in effective performance measurements.

Suppose you wrote an interrupt service routine in C. If it's more than a handful of lines long, you'll probably have no idea how long it runs. Yet an ISR that is too slow will never keep up with real time events. Modify the routine to set a spare parallel bit high when invoked, and drive it low just before exiting. Then connect a scope to that bit and measure the duration of the pulse. With only a few seconds of work you'll know just how long the ISR executes.
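
Here's a minimal sketch of that instrumentation in C. The port address, bit assignment, and interrupt hookup are all assumptions for illustration; substitute whatever your hardware and compiler actually provide.

    /* Spare output port and bit - the address and bit choice are hypothetical */
    #define DEBUG_PORT   (*(volatile unsigned char *) 0x4000)
    #define ISR_BUSY_BIT 0x01      /* unused bit wired to one scope channel */

    void timer_isr(void)           /* attach with whatever interrupt keyword your compiler uses */
    {
        DEBUG_PORT |= ISR_BUSY_BIT;                   /* bit high: scope pulse starts */

        /* ... the real interrupt service work goes here ... */

        DEBUG_PORT &= (unsigned char) ~ISR_BUSY_BIT;  /* bit low: pulse width = ISR run time */
    }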

Also measure the time between pulses. This is the time available for all of the other code. If the ISR runs for 50 out of every 100 milliseconds, then it uses 50% of all CPU resources, a rather scary amount.

If, in this case, you happen to know the interrupt occurs every 25 milliseconds, then this simple measurement immediately indicates that the ISR takes so long to run that the system is missing interrupts. This is profound, as a missed interrupt usually spells disaster in real time systems. A scope gives the information in a heartbeat; few other tools will.

You can time any routine this way. Of course, if (gasp!) your routines use multiple entry and return points then you'll have to add a lot more bit setting and clearing statements. This is yet one more argument in favor of a single return at the end of each routine.

A variation on this theme is using the scope to measure "delta time" - the period between two events. Suppose you are concerned about how much time elapses between an interrupt being asserted and some other action happening. For example, if the interrupt signals the receipt of a "conversion complete" signal from an A/D converter, how long does it take the code to read data from the A/D? If the analog input varies continuously, fast A/D reads are probably crucial. Put one scope channel on the interrupt signal and the other on the A/D chip select pin. Read the difference in time between the two scope traces.

These techniques measure execution times of individual routines, or the latency of a single interrupt. With only two channels and no memory, the scope doesn't identify the time spent in individual lines of C within a routine. Be clever! Once again, add code to drive a bit high only during the code fragment's execution. Connect one scope channel to the bit, and use the other to probe addresses generated by the CPU. Look at one address bit at a time, making a paper record of when addresses change relative to the routine's start pulse. Given a short subroutine, it's not too hard to make a map of its execution.

Logic analyzers are almost as useful for performance measurements. Suppose the system runs a "main loop" in 100 milliseconds or so. Connect the analyzer to the system's address bus and command it to acquire one sample every 100 microseconds. 1000 acquisitions after starting, the logic analyzer's internal memory will contain a sort of statistical profile of where the code runs. It might miss very short routines, but it will give you a high level picture very quickly.

You can get a more refined look at the operation of any section of code by telling the analyzer to "trigger", or start collecting, when a certain address is sensed. Crank up the acquisition rate for finer details. If the analyzer can store 1000 cycles, then with a 10 microsecond rate you'll get a pretty good profile over 10 milliseconds of execution. In the example of monitoring an interrupt service routine, trigger collection on the ISR's start address.

If a scope or logic analyzer isn't your cup of tea, consider connecting a PC to your system via a fast serial or parallel port. You can get some crude idea of performance levels by having every routine transmit a unique code when invoked. Then, write a program to sift through the data and extract performance parameters. Real time response might be sacrificed, but again, you'll get an idea of what the system is doing.
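
As a rough sketch (all the names, ID codes, and the port address below are made up), each routine announces itself by writing a one-byte ID to a spare port as it starts, and the PC timestamps the bytes as they arrive:

    #define TRACE_PORT  (*(volatile unsigned char *) 0x4001)  /* spare parallel port - assumed address */

    #define ID_READ_ADC    0x01    /* one unique code per routine */
    #define ID_FILTER      0x02
    #define ID_UPDATE_LCD  0x03

    #define TRACE(id)   (TRACE_PORT = (unsigned char) (id))

    void filter_sample(void)
    {
        TRACE(ID_FILTER);          /* announce entry to the listening PC */
        /* ... the routine's real work ... */
    }

A small program on the PC then counts how often each ID shows up and how much time separates them - a poor man's statistical profile.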

What Then?

Usually performance problems fall into one of three categories: excessive compiler overhead, crummy code, or slow algorithms.

Modern C compilers generate high quality code. Most overhead problems come from weaknesses in the processor's architecture, like poor stack handling abilities. Recently I saw this graphically when comparing real time trace data on a Z80 to the original C source: ++I (with I a 16 bit signed integer) generated about 25 instructions. Changing I from an automatic to a static variable cut out 75% of the overhead.
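
The effect is easy to see in a fragment like the following. The savings are purely illustrative - the actual code generated depends entirely on your compiler and CPU, and the routine names are invented.

    extern void do_work(void);

    void with_auto(void)
    {
        int i;                          /* automatic: lives on the stack */
        for (i = 0; i < 1000; ++i)      /* each ++i may expand to many instructions */
            do_work();
    }

    void with_static(void)
    {
        static int i;                   /* static: lives at a fixed address */
        for (i = 0; i < 1000; ++i)      /* direct addressing cuts the increment overhead */
            do_work();
    }

Remember the tradeoff: a static variable keeps its value between calls and makes the routine non-reentrant, so don't use the trick in code that can interrupt itself.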

Crummy code problems come mostly from programming without thinking. Don't force the compiler to optimize your program for you. Fold constants yourself, remove unneeded code from loops, and minimize conversions between types. This is just good programming style. Keep algorithms and routines to a single page, so you can understand what is really going on.
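
A contrived before-and-after shows the idea; the names and conversion factor are invented for illustration:

    void scale_before(const double *in, int *out, int n, double gain)
    {
        int i;
        for (i = 0; i < n; i++)                    /* the constant product is recomputed every pass */
            out[i] = (int) (in[i] * gain * 3.14159 / 180.0);
    }

    void scale_after(const double *in, int *out, int n, double gain)
    {
        int i;
        double k = gain * 3.14159 / 180.0;         /* fold the constants once, outside the loop */
        for (i = 0; i < n; i++)
            out[i] = (int) (in[i] * k);
    }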

Given reasonable code, don't expect huge performance improvements from tuning a line here or there. Consider replacing a slow algorithm with one better suited to the application. For example, never use a Taylor series to compute transcendental functions. Many reference books (such as Computer Approximations by Hart) contain much faster converging series expansions. Or, when precision is not a problem but speed is, do a fast table lookup.
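
One common pattern for the table lookup, sketched here under assumed requirements (table size and scaling are arbitrary choices): build a quarter-wave sine table once at startup with the slow library routine, then satisfy every run-time call with a lookup.

    #include <math.h>

    #define SINE_STEPS 91                   /* one entry per degree, 0 through 90 */
    static float sine_table[SINE_STEPS];

    void sine_table_init(void)              /* call once during initialization */
    {
        int deg;
        for (deg = 0; deg < SINE_STEPS; deg++)
            sine_table[deg] = (float) sin(deg * 3.14159265 / 180.0);
    }

    float fast_sin_deg(int deg)             /* 0..90 degrees; use symmetry for the rest */
    {
        return sine_table[deg];
    }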

Close the Loop

How do you know if a particular processor will be adequate for a project? Many of us work for companies that provide a standard product. New versions often use the old code, or are based on algorithms and techniques developed on a current product. Consider instrumenting the current code to measure exactly how much free processor time exists. Use a real performance analyzer or one of the methods I've described.
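
One simple way to make that measurement - a sketch only, with invented names and a 1 Hz timer tick assumed - is to let the lowest-priority background loop count its own iterations and snapshot the count once a second:

    static volatile unsigned long idle_count;       /* bumped by the background loop */
    static volatile unsigned long idle_per_second;  /* snapshot taken once a second */

    void background_loop(void)                      /* lowest-priority "do nothing" loop */
    {
        for (;;)
            idle_count++;
    }

    void one_second_tick(void)                      /* called from a 1 Hz timer interrupt */
    {
        idle_per_second = idle_count;
        idle_count = 0;
    }

Calibrate by running the same loop with the application load disabled; the ratio of the two counts is the fraction of the CPU still free.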

Close the feedback loop! When any project is complete, spend a day learning about what you did. Measure the performance of the system to see just how accurate your processor utilization estimates were. The results are always interesting and sometimes terrifying. If, as is often the case, the numbers bear little resemblance to the original goals, then figure out what happened, and use this information to improve your estimating ability. Without feedback, you work forever in the dark. Strive to learn from your successes as well as your failures.

Track your system's performance all during the project's development, so you're not presented with a disaster two weeks before the scheduled delivery. It's not a bad idea to assign CPU utilization specifications to major routines during overall design, and then track these targets like you do the schedule. Avoid surprises with careful planning.