By Jack Ganssle

Taming Complexity

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."

Folk wisdom, as reflected in Brian Kernighan's clever riff on clever code, is that insight we all know but generally ignore. But how can we measure "cleverness"? Supreme Court Justice Potter Stewart, trying to characterize pornography, memorably lamented that it somewhat defies definition, but "I know it when I see it." Likewise, most of us can sense excessively clever code, but may disagree when examining specific chunks of it.

We do know complicated functions are tough to understand, an agony to debug, and usually require excessive amounts of maintenance.

One of the tragedies of software engineering is that, in industry at least, we've successfully avoided the metrics-based quality revolution that transformed manufacturing. Few companies know their pretest bug rates in, say, defects per line of code. Or productivity in lines of code per hour. Or which are the worst functions in a project. In an experiment I conducted surreptitiously last year, only a third of correspondents could answer the seemingly simple question "how big is your code" in any metric they wanted to use (MB, lines of code, lines they developed, value in dollars, or, heck, the weight of the listings).

A management mantra states "if you can't measure it you can't manage it." Though surely true for building widgets, it's a lot harder to apply to engineering. But there are some objective measures that provide valuable insight into the art of developing firmware. Since we know complex functions are problematic, it seems logical that we measure and control complexity in functions.

And it turns out that's pretty easy to do.

But first, there are two kinds of complexity: essential, which is inherent in the problem being solved, and incidental, which is introduced by our engineering approaches. Here we're talking about incidental complexity, as that's the only kind we as developers can manage.

31 years ago Thomas McCabe proposed a metric called "cyclomatic complexity" to attach a quantitative measure to a function's complexity. It is one of many such measures, but is by far the most widely used. It counts the "edges" and "nodes" in a function, where a node is a computational statement and an edge represents the transfer of control between nodes. Cyclomatic complexity, v(G), is then:

v(G) = edges - nodes + 2

Note that there are many different formulations of this equation in the literature; this is the simplified calculation that's usually used and comes from McCabe's original paper ("A Complexity Measure," Thomas J. McCabe, IEEE Transactions on Software Engineering, Volume SE-2, No. 4, December 1976, pp. 308-320).

Also note that "edges" include the flow between two sequential statements. Thus adding non-branching/non-looping statements does not change v(G), which is essentially the number of paths through a program. Hopefully that last statement illuminated a light bulb in your head with regards to testing. More on this later.

Limits

It's reasonable to assume that functions with high cyclomatic complexities will be buggier than simpler functions, and experience has proven this to be true. In fact, HP found a correlation coefficient of 0.8 between v(G) and defect density (http://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1989-04.pdf). But it's impossible to define a single value that separates good functions from bad ones. Think of cyclomatic complexity as the canary in a mine: an early warning that the function may be error-prone and difficult to maintain.

Long ago George A. Miller ("The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information," Psychological Review, 63 (1956), 81-97) showed that humans can typically keep a maximum of five to nine things in short-term memory. That's a pretty good place to start: why write functions that are more complex than we can understand?

McCabe suggested that 10 is probably a good limiting value for v(G), and that number is certainly the most widely used. It's comfortingly close to Miller's 7±2 as well. McCabe pointed out that there's an important exception: switch statements can generate very high complexity metrics even in cases where the logic is pretty simple, so he suggested exempting functions with switches from complexity analysis. That rule has been interpreted in an awful lot of mostly dysfunctional ways, and is often used to excuse many other sorts of constructs. It's better to say: "No function can exceed 10, except in cases where the resulting code would be excessively hard to understand or maintain."
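To see why McCabe carved out the exception, consider this sketch of a command dispatcher (the command names and actions are invented for illustration). Typical tools score it at about 6, and a realistic protocol handler with dozens of commands lands well past the limit, yet the logic is about as simple as code gets: one decision, many flat outcomes.

#include <stdio.h>

/* Hypothetical command set, for illustration only. */
enum { CMD_START, CMD_STOP, CMD_RESET, CMD_STATUS, CMD_PING };

/* Each case adds one to v(G), so even this trivial dispatcher scores
 * around 6 with common counting rules, though it's really just a
 * table lookup in disguise. */
void dispatch(int cmd)
{
    switch (cmd) {
    case CMD_START:  printf("start motor\n");          break;
    case CMD_STOP:   printf("stop motor\n");           break;
    case CMD_RESET:  printf("reset system\n");         break;
    case CMD_STATUS: printf("send status\n");          break;
    case CMD_PING:   printf("send ack\n");             break;
    default:         printf("unknown cmd %d\n", cmd);  break;
    }
}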

Over the years the following rules of thumb have been observed:

1-10 A simple function without much risk
11-20 More complex, moderate risk
21-50 Complex, high risk
51+ Unstable, very high risk

And these are the rough odds, often quoted in the complexity literature, of injecting a new bug while making a change to a function (the dreaded "bad fix"):

1-10 5%
20-30 20%
>50 40%
Around 60 60%

For a variety of reasons few companies routinely practice code inspections, which are the best tool we have for cheaply and effectively eliminating bugs. But one can use the complexity metric to flag risky routines that should get an inspection. I recommend sending, at the very least, any function that scores over 10 through an inspection.

Validated Software (validatedsoftware.com) certifies firmware to various safety-critical standards. Scott Nowell there tells me that on one recent project the code had an average v(G) of 4.7, suggesting that great code is composed of simple functions.

So how do typical packages stack up? Here are some results from my lab using the RSM tool (described later):

Apache 2.2.8, the open source web server, is composed of 5752 functions with an average cyclomatic complexity of 6.04. Though 61 functions exceed 50, just under 5000 are below 10. One scored a terrifying 725! That function – a single function, mind you – is 2700 lines long with 145 cases and 353 if statements. If a v(G) of 100 indicates an untestable chunk of code, it's hard to imagine what 725 implies.

What about Linux? Kernel version 2.6.23 comprises 161k functions. RSM, which was nearly instantaneous in most of the other experiments, needed 3 hours to digest the over 3 million lines of code. It generated a 200 MB HTML file that crashed Internet Explorer and Firefox, though trusty UltraEdit handled it, albeit slowly. Functions averaged a very respectable complexity of 4.94, with one peaking at 352. That 1800 line function is practically devoid of comments and is unacceptable software.

Over 54,000 scored 1, suggesting that a third of all Linux functions are tremendously simple. 144k were under 10, but 764 are in the scary over-50 category, with 125 falling into the "untestable" over-100 bin.

Linux scored far better than the handful of GNU tools I checked. Grep, for instance, averages 13.52. The worst function, with a v(G) of 53 in 128 lines of code, is lightly commented and almost exuberantly sprinkled with gotos, continues, and breaks.

On the embedded front, the open source RTOS eCos's 175 functions average a remarkably low 2.54, with only six over 10.

With only 2495 lines of code and 75 functions, FreeRTOS, another open source real-time OS, is much smaller than Linux. Its average v(G) is just 3.44. One function scored 11; all the rest are under 10.

uC/OS-II averaged a very respectable 5.85 for its 118 functions, with one outlier at 25. That function consists of exactly the sort of switch structure McCabe says shouldn't be counted against the metric.

Testing

A lot of people ask me "how much testing is enough?" That's a tough question, but, since v(G) is exactly the number of linearly independent paths through a function, it's also the minimum number of tests needed to exercise each of those paths. If your regimen has fewer than v(G) tests, the code is not completely tested.

Cyclomatic complexity doesn't tell us how many tests are required; the number could be much higher than v(G). And the number by itself tells us nothing about whether the correct tests have been run. But it is a quick sanity check that lets us know if we have conducted the necessary minimum number of tests.

So that 2700 line Apache function needs - for one function - at least 725 tests. Ouch!
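Here's what that minimum looks like on a tiny scale, using an invented clamping routine rather than anything from Apache. Two decisions give a v(G) of 3, and three test cases, one per independent path, are the floor for exercising it.

#include <assert.h>

/* Hypothetical example: clamp a reading to a range.
 * Two decisions, so v(G) = 3. */
int clamp(int value, int lo, int hi)
{
    if (value < lo)    /* decision 1 */
        return lo;
    if (value > hi)    /* decision 2 */
        return hi;
    return value;
}

int main(void)
{
    assert(clamp(-5, 0, 10) ==  0);   /* path 1: below the range */
    assert(clamp(99, 0, 10) == 10);   /* path 2: above the range */
    assert(clamp( 7, 0, 10) ==  7);   /* path 3: falls through   */
    return 0;
}

Whether those three tests are the right three is another matter entirely; the metric only tells us that two can't possibly be enough.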

Of the three RTOSes I looked at (uC/OS-II, eCos, and FreeRTOS) total cyclomatic complexity (i.e., the sum of all of the individual v(G)s) is 690, 445, and 258 respectively. Note that these numbers say nothing about the comparative quality of the code as those that offer more functionality will have more functions and higher total complexity. But to those of you who think you can write an OS in a weekend, well, it's clear that just writing the tests will consume far more time than that.

Apache's total complexity is 34,764; Linux sums to a whopping 798,092. If each of those Linux tests requires merely 4 lines of code, the test code will roughly equal the size of Linux itself.

And that says something about the difficulty and expense of creating comprehensive tests.

McCabe and others have developed ways of using the complexity metric to create more meaningful tests. For instance, see the paper "Structured Testing: A Testing Methodology Using the Cyclomatic Complexity Metric" by Arthur Watson and McCabe (NIST Special Publication 500-235).

Conclusion

Plenty of vendors sell tools to measure v(G). McCabe himself has highly-regarded, though pricey, products (see http://www.mccabe.com/); his is one of the few tools that will draw a graph of each function to clearly illustrate the possible paths.

There's a freebie Eclipse plug-in available from http://eclipse-metrics.sourceforge.net/ that looks interesting. Other open source products are available on Chris Lott's site (http://www.chris-lott.org/resources/cmetrics/).

I used RSM from M Squared Technologies (http://www.msquaredtechnologies.com) for gathering the numbers here. It's an inexpensive tool that acquires lots of useful metrics from source files. An evaluation version limited to ten files is free. It's fast (except on insanely huge code bases like Linux) and easy to use, with practically no learning curve. Highly recommended.

v(G) is not the only measure of complexity, and a lot of researchers prefer alternatives. But it's easy to interpret and plenty of tools exist. Like a lot of things it's imperfect, but in life, just as in school, a 90% is often an A grade. It's a step in the right direction of being more quantitative about software engineering.

Finally, cyclomatic complexity says nothing about the "goodness" of the code. I ran this truly horrible program (http://www0.us.ioccc.org/2001/williams.c), an entry from the International Obfuscated C Code Contest, through the tool and found its average v(G) is about half of Grep's!