By Jack Ganssle
Department of Redundancy Department
Summary: Redundancy does not necessarily lead to reliability.
It seems that when Virginia's Information Technologies Agency let a $2.3 billion contract to Northrop Grumman to modernize the State's information infrastructure, they neglected to include any requirements for redundancy. The story is here: http://spectrum.ieee.org/blog/computing/it/riskfactor/virginia-information-technologies-agency-believes-in-the-perfect-network-fairy, but the upshot is that the system experiences frequent outages that leave the State's employees unable to work. Of course, at the Department of Motor Vehicles it may be impossible for the casual observer to notice any difference between when the employees are working and when they're not.
The article has some frightening stats: in the first six months of the year the Department of Transportation experienced 4677 hours of downtime. Since a half-year is only 4380 hours, either the auditors graduated from one of Virginia's troubled public schools (http://www.nciea.org/publications/RILS_LS05.pdf) or multiple simultaneous failures hobbled the system.
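For the curious, here is that arithmetic as a small Python sketch. The function name is mine, and the interpretation is an assumption: the source does not say how the downtime was tallied, so this simply supposes the hours were summed across more than one system or site.

```python
# Sanity check on the reported downtime figure.
# Assumption: downtime hours are summed across several systems or sites;
# the article does not say how the auditors counted them.

HOURS_PER_YEAR = 365 * 24          # 8760 hours in a non-leap year
HALF_YEAR_HOURS = HOURS_PER_YEAR / 2


def average_concurrent_outages(downtime_hours: float, period_hours: float) -> float:
    """Average number of outages in progress at any instant during the period."""
    return downtime_hours / period_hours


ratio = average_concurrent_outages(4677, HALF_YEAR_HOURS)
print(f"Half-year length: {HALF_YEAR_HOURS:.0f} hours")
print(f"Average concurrent outages: {ratio:.2f}")
```

A ratio above 1.0 cannot come from a single system, since one system cannot be down for more hours than the calendar holds. Under the assumption above, slightly more than one outage was in progress, on average, at every moment of those six months.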
According to http://jlarc.state.va.us/meetings/October09/VITA.pdf the contract consists of a 151-page agreement, 51 amendments, 29 schedules, 17 appendices, 17 addendums, and 6 attachments. Further digging suggests the vendor was expected to discern requirements not explicitly stated, a practice that highlights the difficulty of eliciting requirements but is fraught with contractual peril.
It sounds like, as is usual in these situations, there are plenty of targets for blame. Ultimately this will be fertile ground for the lawyers, who will probably come out as the only winners.
But I find it interesting that so many pundits are jumping on the system's lack of redundancy. Reliability is what is important; redundancy is merely one means to achieve that end. Reliability stems from many sources, not redundancy alone. Consider the Internet: it is composed of redundant networks. But it is reliable because of a protocol that knows how to exploit those connections when parts of the system fail.
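A toy model makes the point concrete. The sketch below is not drawn from the Virginia system or from the article; it is a textbook-style simplification that assumes two identical units with independent failures, plus a hypothetical "failover coverage" figure: the probability that the system actually detects a failed primary and switches to the backup. All names and numbers are illustrative.

```python
# Toy availability model for a primary unit with one backup.
# Assumptions (mine, not the article's): the two units fail independently,
# have the same availability, and the switchover succeeds with a fixed
# probability called failover_coverage.


def duplex_availability(unit_availability: float, failover_coverage: float) -> float:
    """Probability the pair delivers service.

    The system is up if the primary is up, or if the primary is down,
    the failover mechanism works, and the backup is up.
    """
    a, c = unit_availability, failover_coverage
    return a + (1 - a) * c * a


# Good units, perfect failover: redundancy pays off.
print(f"{duplex_availability(0.99, 1.0):.4f}")

# Good units, but failover works only half the time.
print(f"{duplex_availability(0.99, 0.5):.4f}")

# Good units, no working failover: the spare buys nothing.
print(f"{duplex_availability(0.99, 0.0):.4f}")

# Unreliable units with perfect failover.
print(f"{duplex_availability(0.90, 1.0):.4f}")
```

In this model a spare box with no dependable way to switch over to it adds nothing, and a pair of flaky units with flawless failover only matches one good unit running alone. The redundant hardware is the easy part; the mechanism that exploits it is what delivers the reliability.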
Shoehorning redundant but unreliable extra equipment and software into the system will simply lead to another gravy train for both sides' legal counsel.
Published November 25, 2009