Embedded Muse 220 Copyright 2012 TGG February 20, 2012
You may redistribute this newsletter for noncommercial purposes. For commercial use contact jack@ganssle.com. To subscribe or unsubscribe go to https://www.ganssle.com/tem-subunsub.html or drop Jack an email at jack@ganssle.com.
EDITOR: Jack Ganssle, jack@ganssle.com
Contents:
- Editor's Notes
- Quotes and Thoughts
- 2012 Salary Survey
- Tools and Tips
- Responses to Just Reset It
- Joke for the Week
- About The Embedded Muse
Editor's Notes
Are you happy with your bug rates? If not, what are you doing about it? Are you asked to do more with less? Deliver faster, with more features? What action are you taking to achieve those goals?
In fact it IS possible to accurately schedule a project, meet the deadline, and drastically reduce bugs. Learn how at my Better Firmware Faster class, presented at your facility. See https://www.ganssle.com/onsite.htm .
Quotes and Thoughts
100% test coverage is insufficient. 35% of the faults are missing logic paths. Robert Glass
Tools and Tips
The tool tips keep pouring in! Keep 'em coming.
Gonzalo Sanchez likes Pico Technology's products: "Years ago, I had the chance to purchase (for a project that later went awry) a PC-based scope from Pico Technology (http://www.picotech.com/). It was a very nice gadget, and the software was really user friendly and feature-rich.
"Pico now has a very wide range of products, notably scopes (up to 100MHz BW; 8-, 12- and 16-bit resolutions; and buffer depths from kS to MS); some of these are quite affordable. Some of the models seem to include arbitrary waveform generators, which avoids the purchase of a separate signal generator.
"I guess TEM readers would probably like to take a look at the 2205 Mixed-Signal Oscilloscope; the description states it is a USB-powered scope with 2 analogue channels (up to 25MHz) and 16 digital channels (up to 100MHz), capable of 200 MS/s mixed-signal sampling; it also includes a built-in function generator and arbitrary waveform generator. The price tag is around US$580 (which is not very cheap, though it does not seem too expensive to me given the specifications). The accompanying PicoScope software gives you an oscilloscope, spectrum analyzer and signal/waveform generator all in one; it can continuously stream data, so the scope can be used for data acquisition, too. A very appealing feature: free updates. There seems to be an SDK that makes it easier to connect the scope software to other software tools such as LabView, Matlab and even Excel (for those who like it). Another nice feature is that you can download a demo version with 'built-in data' to try it out. The software is for Windows only, though drivers for Linux are available in case anyone wants to develop his or her own software.
"Please note I have not used them in years, but at the time their scopes were among the best price-performance compromises we could find, and the software at the time was really, really very good. We did give the Linux drivers a try at the time, and they worked for us, but we did not develop anything useful."
Michael S. Wilk also weighed in: "In response to Charles' comment about a scope, I did almost the exact same thing investigating USB scopes. Since I do almost entirely software/firmware, I had a hard time justifying an expensive scope (especially since most of what I do is not high speed). I just needed to be able to see/debug a variety of slower signals, mostly serial communications and/or GPIOs. I ended up with the Rigol DS1052E (for which Davy Jones demonstrated a special modification that brings it to 100MHz, like the DS1102E). It has more than met my needs and more than paid for itself. A good friend recommended this scope after I was looking for the end-all, be-all device that could do as much as possible in a single package. He gave me the (good) advice of buying decent tools that did single things well rather than a single device that did multiple things 'just OK'. I could always sell these entry-level instruments and upgrade. Or, better yet, keep them as 'spares' should the day come when I need something more sophisticated. Following that scope, I have purchased (and gotten every dollar's worth from) a Rigol DM3058 Multimeter and a Protek 3003L Power Supply. My GW Instek function generator hasn't justified itself yet but I expect it will someday (I got it at the tail end of a debugging session and solved the problems before I needed it). Another couple of very handy devices I've purchased are the Microchip Serial Analyzer (about $50) and the WAV, Inc. USBee SX (~$170). These devices do similar things and are also assets in my debugging toolbox.
"We've all heard the saying about spending time sharpening your axe before you start chopping down the tree. However, if you don't even have an axe, that tree will never come down. Even if you're just a hobbyist, invest in some basic tools! They will serve you well!"
Responses to Just Reset It
Several readers had comments about Just Reset It in the last Muse. Brian St. Pierre wrote: "This is a good follow up to the discussion on design margins.
"The approach you mention above -- restarting flawed threads -- can add
additional complexity, which is the enemy we're trying to slay. This
is especially true if you're trying to catch "stuck" threads.
"I worked on a networking product that gave every task/thread a message
queue. Every task responded to a heartbeat message. One task monitored
heartbeats. Tasks that didn't respond were killed and restarted. The
monitor watched for exceptions -- any task throwing an OS-level
exception was restarted. The monitor was responsible for kicking the
dog -- so if the monitor died, the whole system restarted. It was a
robust design that handled various failure mechanisms.
"The two biggest problems: (1) The restart code is tricky -- especially
in an embedded OS that doesn't reclaim resources used by a particular
task/thread. E.g. how do you know that the task didn't have a mutex
locked? The monitor needed a fair bit of metadata about each task to
try to tear everything down cleanly on a kill/restart. (2) Since a
task could be killed and restarted at any time, its clients
might block waiting for a response that would never arrive. This
required every client to be able to handle unreliable "servers".
Either of these issues would be helped by good OS support -- but we
didn't have it and invested quite a bit of effort rolling our own. I
wouldn't do it again: much of that effort would have been better
applied to improving the quality of the code of the other tasks so
that they were less likely to fail.
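(An illustrative aside, not part of Brian's letter.) For readers who want to see the shape of such a monitor, here is a minimal sketch in C of the heartbeat scheme described above. The names (hb_register, hb_reply, hb_monitor_tick), the fixed-size task table, and the restart hook are all assumptions made for illustration; the message queues and OS calls are left out. It also sidesteps the hard part Brian points out: cleaning up mutexes and other resources a killed task still holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS 8

/* Hypothetical per-task record kept by the monitor. */
typedef struct {
    bool     in_use;
    bool     replied;         /* set when the task answers the current heartbeat */
    uint32_t restart_count;   /* how often the monitor has restarted this task */
    void   (*restart)(void);  /* hypothetical hook that kills and restarts the task */
} hb_task_t;

static hb_task_t hb_tasks[MAX_TASKS];

/* Register a task with the monitor. Returns its id, or -1 if the table is full. */
int hb_register(void (*restart)(void))
{
    for (int i = 0; i < MAX_TASKS; i++) {
        if (!hb_tasks[i].in_use) {
            hb_tasks[i] = (hb_task_t){ .in_use = true, .replied = true,
                                       .restart = restart };
            return i;
        }
    }
    return -1;
}

/* Called by a task when it handles the heartbeat message. */
void hb_reply(int id)
{
    assert(id >= 0 && id < MAX_TASKS && hb_tasks[id].in_use);
    hb_tasks[id].replied = true;
}

/* Called by the monitor once per heartbeat period. Restarts every task that
 * did not answer the previous heartbeat, then opens a new round.
 * Returns the number of tasks restarted. */
int hb_monitor_tick(void)
{
    int restarted = 0;
    for (int i = 0; i < MAX_TASKS; i++) {
        hb_task_t *t = &hb_tasks[i];
        if (!t->in_use) {
            continue;
        }
        if (!t->replied) {
            t->restart_count++;
            if (t->restart != NULL) {
                t->restart();
            }
            restarted++;
        }
        t->replied = false;   /* the task must answer again before the next tick */
    }
    return restarted;
}
```

In a design like the one described, the monitor task would call hb_monitor_tick() once per period and kick the hardware watchdog afterwards, so that a dead monitor still takes the whole system down.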
"I had *much* better experience on a different product that just reset
whenever anything went wrong -- after first saving a bunch of debug
info into NVRAM. It was easier to (a) see failures and (b) figure out
what went wrong since the NVRAM reflected a frozen moment in time when
the system crashed. (And you can't ignore bug reports where the entire
system crashes.)
"On a couple of Linux-based products I worked on, the processes were
compartmentalized enough that the monitor approach worked reasonably
well. There weren't many inter-process dependencies, and those that
did exist were based on reliable socket communications so that a
restarted process didn't lock up the entire system. One also used the
crash-file approach -- saving and managing core dumps of crashed
processes so that we could debug failures more easily."
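The "save debug info, then reset" approach Brian prefers can be sketched in a few lines of C. This is only an illustration under stated assumptions: the record layout, the magic number, the XOR checksum, and the names crash_save and crash_load are invented here, and a plain static variable stands in for real NVRAM (battery-backed RAM or a no-init section that survives reset).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC 0xC0DEDEADu   /* arbitrary marker for "record is present" */

/* Hypothetical crash record; a real one would hold whatever helps debugging. */
typedef struct {
    uint32_t magic;
    uint32_t reason;      /* application-defined fault code */
    uint32_t pc;          /* program counter at the time of the fault */
    uint32_t uptime_ms;
    uint32_t checksum;
} crash_record_t;

/* Stand-in for NVRAM. On real hardware this must survive a reset. */
static crash_record_t nvram_crash_record;

/* Simple XOR checksum, enough to reject random garbage after power-up. */
static uint32_t crash_checksum(const crash_record_t *r)
{
    return r->magic ^ r->reason ^ r->pc ^ r->uptime_ms ^ 0xA5A5A5A5u;
}

/* Called from the fault handler, just before forcing a reset. */
void crash_save(uint32_t reason, uint32_t pc, uint32_t uptime_ms)
{
    crash_record_t r = { CRASH_MAGIC, reason, pc, uptime_ms, 0 };
    r.checksum = crash_checksum(&r);
    nvram_crash_record = r;
    /* A real system would trigger the reset here. */
}

/* Called early at boot. Returns true and copies the record out if a valid
 * one exists, then clears it so the same crash is not reported twice. */
bool crash_load(crash_record_t *out)
{
    assert(out != NULL);
    const crash_record_t *r = &nvram_crash_record;
    if (r->magic != CRASH_MAGIC || r->checksum != crash_checksum(r)) {
        return false;
    }
    *out = *r;
    memset(&nvram_crash_record, 0, sizeof nvram_crash_record);
    return true;
}
```

The point of the design is the one Brian makes: the record is a frozen snapshot of the moment of failure, so it is still there to be read and reported after the reset.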
John Carter said: "But seriously, I have long concluded that Crash Only software is effectively what everyone in the software industry writes.... our industry managers are just not bold enough to admit it and then take the appropriate design steps to make it properly Crash Only. See http://lwn.net/Articles/191059/ ."
Charles Manning wrote: "As you mention, single-threaded watchdogging is ineffective because it does not monitor the full system state and we really need some sort of multi-channel watchdog which allows multiple tasks (and ISRs etc) to all be monitored. Since you don't get multi-channel hw watchdogs, the only way to build them is as a watchdog manager layer that makes virtual software watchdogs.
"As you note, this is easy to do for periodic tasks with a predictable schedule (eg. the foo_manager must check for foos every 20msec), but it is more difficult to monitor for other more sporadic tasks such as UI or other non-periodic tasks.
"One way to handle those is to extend the watchdog manager (for want of a better word) to handle one-shot watchdogs. For example, if one task (or ISR) generates an event then we expect some other task to process that within a specified period. For example, for the UI task.... Have the keyboard interrupt raise a one-shot saying the UI must respond within 100msec. The UI task must then pat the dog within that time or it is considered locked up.
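(An illustrative aside, not part of Charles's letter.) A watchdog manager with both periodic and one-shot virtual watchdogs might look roughly like the C sketch below. All names (wdm_start_periodic, wdm_arm_oneshot, wdm_pat, wdm_tick) and the channel table are assumptions made for illustration. A real implementation would also need critical sections around the table, since ISRs arm and pat channels too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WDM_CHANNELS 8

/* One virtual watchdog channel. */
typedef struct {
    bool     armed;         /* channel is currently being timed */
    bool     periodic;      /* a pat restarts the countdown instead of disarming */
    uint32_t timeout_ms;
    uint32_t remaining_ms;
} wdm_channel_t;

static wdm_channel_t wdm[WDM_CHANNELS];

static bool wdm_valid(int ch)
{
    return ch >= 0 && ch < WDM_CHANNELS;
}

/* Periodic channel, e.g. "foo_manager must check for foos every 20 msec". */
void wdm_start_periodic(int ch, uint32_t timeout_ms)
{
    assert(wdm_valid(ch) && timeout_ms > 0);
    wdm[ch] = (wdm_channel_t){ true, true, timeout_ms, timeout_ms };
}

/* One-shot channel, e.g. armed by the keyboard interrupt to demand that
 * the UI task respond within 100 msec. */
void wdm_arm_oneshot(int ch, uint32_t timeout_ms)
{
    assert(wdm_valid(ch) && timeout_ms > 0);
    wdm[ch] = (wdm_channel_t){ true, false, timeout_ms, timeout_ms };
}

/* Pat the dog: periodic channels restart their countdown, one-shots disarm. */
void wdm_pat(int ch)
{
    assert(wdm_valid(ch));
    if (wdm[ch].periodic) {
        wdm[ch].remaining_ms = wdm[ch].timeout_ms;
    } else {
        wdm[ch].armed = false;
    }
}

/* Advance time by elapsed_ms. Returns true only if every armed channel is
 * still inside its deadline. The caller kicks the hardware watchdog only
 * when this returns true. */
bool wdm_tick(uint32_t elapsed_ms)
{
    bool healthy = true;
    for (int i = 0; i < WDM_CHANNELS; i++) {
        if (!wdm[i].armed) {
            continue;
        }
        if (wdm[i].remaining_ms <= elapsed_ms) {
            wdm[i].remaining_ms = 0;   /* deadline missed; stays expired */
            healthy = false;
        } else {
            wdm[i].remaining_ms -= elapsed_ms;
        }
    }
    return healthy;
}
```

Because the single hardware watchdog is kicked only while wdm_tick() reports that all channels are healthy, one stuck task or one unanswered event is enough to bring on the reset.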
"As for self healing systems, that is a major selling point of microkernel OSs. If one part of the system starts behaving erratically then only that process needs to be restarted. No need to bring down the entire system. These are typically coupled with system monitoring tasks (ie. watchdog manager on steroids) which monitor task behaviour and respawn those tasks when they misbehave.
"I have also been experimenting with the ideas in behaviour based robotics (subsumption architecture in particular: http://en.wikipedia.org/wiki/Subsumption_architecture ) in embedded programming outside of robotics. Essentially, behaviour based robotics breaks functionality into packages (behaviours). These behaviours are stacked so that the various events/states cause one behaviour to override another.
"What is really cool about this is that it reduces the state space. Instead of having to keep a huge, complex finite state machine managing the entire system state, we can break the state machine into these behaviour packages that can be run and tested independently.
"When we couple this with watchdogging it gives us some interesting system design options. We can make a backup mode into a low-level behaviour. If a high level behaviour loses its marbles and needs to be reset, we can fall back to the low level behaviour.
"If a sensor fails, we can have a fallback behaviour that models the parameter from other sensors.
"This gives us some interesting options for constructing fallback systems where a blind system reset is just not an option (eg. a car engine control or a plane in flight).
"We cannot win the war on software reliability by making the software more complex, only by breaking it into smaller robust units."
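To make the stacked-behaviour idea concrete, here is a small C sketch of fixed-priority arbitration with fallbacks. It is a simplification in the spirit of subsumption, not a full implementation of the architecture and not Charles's code. The three behaviours, the inputs_t fields, and the numbers they compute are invented placeholders; the healthy flag is assumed to be cleared by something like a watchdog manager when a behaviour misbehaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs: one primary sensor plus another reading that can be
 * used to model the same parameter when the primary sensor fails. */
typedef struct {
    bool primary_sensor_ok;
    int  primary_reading;
    int  secondary_reading;
} inputs_t;

/* A behaviour returns true if it wants control and writes its command. */
typedef bool (*behaviour_fn)(const inputs_t *in, int *command);

typedef struct {
    const char  *name;
    bool         healthy;   /* cleared when the behaviour has to be reset */
    behaviour_fn run;
} behaviour_t;

/* High level: uses the real measurement, and declines if the sensor is dead. */
static bool from_primary_sensor(const inputs_t *in, int *command)
{
    if (!in->primary_sensor_ok) {
        return false;
    }
    *command = in->primary_reading * 2;      /* placeholder arithmetic */
    return true;
}

/* Fallback: models the parameter from the other sensor. */
static bool from_model(const inputs_t *in, int *command)
{
    *command = in->secondary_reading / 10;   /* placeholder arithmetic */
    return true;
}

/* Lowest level: a fixed safe command that is always available. */
static bool safe_default(const inputs_t *in, int *command)
{
    (void)in;
    *command = 50;                           /* placeholder value */
    return true;
}

/* Ordered from highest priority to lowest. */
behaviour_t behaviours[] = {
    { "primary", true, from_primary_sensor },
    { "model",   true, from_model },
    { "safe",    true, safe_default },
};

/* The first healthy behaviour that wants control wins; lower ones are
 * overridden. Returns the winner, or NULL if none took control. */
const behaviour_t *arbitrate(const behaviour_t *list, size_t n,
                             const inputs_t *in, int *command)
{
    assert(list != NULL && in != NULL && command != NULL);
    for (size_t i = 0; i < n; i++) {
        if (list[i].healthy && list[i].run(in, command)) {
            return &list[i];
        }
    }
    return NULL;
}
```

Each behaviour can be tested on its own, which is the reduction in state space Charles describes, and marking a high-level behaviour unhealthy drops the system to the next level down instead of forcing a blind reset.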
John Taylor said this: "This is one of the hardest things to get people to internalize: It is OK to have an expectation that both Software (Firmware) and Hardware should work 100% of the time and to do what is expected 100% of the time. Any time that something happens that you didn't expect, it's not good enough that the problem went away on its own or hitting the reset pin did the trick. If something happened once that you don't understand, the probability that it can happen again goes to 100% by going back in time before the impossible happened the first time! One of the most popular ways to sweep problems under the rug is to call them 'corner cases', i.e. things that won't happen in the real world of the end user; in what fairy tale is the real world more predictable than the lab?
"I have a drawer in my shop that holds the things from the past 35 years that I have never understood what went wrong. Every once in a while I open up the drawer and see what I've learned since the last time. You can't get out of the drawer until you're solved: A reset switch to solve software and hardware glitches is no better than using a shredder to empty my "Drawer of mystery"!"
Luca Matteini wrote: "I think it's an interesting subject, that one on reset buttons, that you talk about in EM219.
"Watchdogs have both helped and messed up the firmware world, depending on how you look at their use. Their help is clear when a runaway application would otherwise drive your system crazy, but at the same time they have given an excuse for even less careful designs, when someone starts thinking 'oh well, if need be, the WDG will reset it...' Moreover, every time I read someone insisting that a WDG should not be reset in interrupt handlers (...) I realize that chances are *many* others really do it!
"So a WDG is like the 'universal tool' description that I use for a hammer: used properly, it /fixes/ everything, just in its own way...
"I don't agree that an MMU and segmentation of firmware are a panacea. You were mentioning Windows, and it shows how even a complex multitasking environment can't save you from hang-ups. Even any Linux/*x system can become completely unstable if the error is in a low-level driver.
"So a multiprocessing system can give a sense of more reliability only if you code your tasks properly, so that the controller can't be fooled by the controlled. Frozen systems aren't always in a catastrophic state; sometimes a wrong variable value can set off a cascade of events, each of which, taken by itself, looks normal.
"Then we must not forget, especially in noisy embedded environments, that external events can daze even correct firmware. Glitches on supplies and data lines, dirt, humidity, strong EM fields, thermal or analog drifts (including EEPROMs, of course), can all add a temporary or permanent perturbation to the system. Maybe the system could recover from them, if they are in steady state on power up, with just a system check (like I/O status, initial ADC values, etc.), and, if necessary, signal the user that something is wrong. In this case a WDG restart could help.
"Another story is 'reset button syndrome', which has taught many people that cycling power can 'fix' things. Today, in many easy-to-diagnose lock-ups, you see people affected by that syndrome. The mouse doesn't work? Someone turns the computer off, and retries wiggling the device, even before checking it's plugged in.
"Many times I've seen people closing all the applications on a computer desktop before doing a new task: the reset paradigm works for them because, when it comes to computers, they know how to do things from a clean start, not when a desktop is full of windows!
"Sometimes erratic behavior in firmware is reported to me, but when I ask for more details, like 'was the red LED still blinking?' the answer is 'it's blinking, I restarted it', and any evidence has been erased before you can diagnose anything."
Joke for the Week
Fred Strathmann sent this:
A math and engineering convention was being held. On the train to the convention, there were both math majors and engineering majors. Each of the math majors had his/her own train ticket. But the Engineers had only ONE ticket for all of them. The math majors started laughing and snickering. The engineers ignored the laughter.
Then, one of the engineers said, "Here comes the conductor". All of the engineers piled into the bathroom. The math majors were puzzled. The conductor came aboard and collected tickets from all the math majors. He went to the bathroom, knocked on the door, and said, "Tickets Please". An engineer stuck their only ticket under the door. The conductor took the ticket and left. A few minutes later, the engineers emerged from the bathroom. The math majors felt really stupid.
On the way back from the convention, the group of math majors had ONE ticket for their group. They started snickering at the engineers, who had NO tickets amongst them.
When the engineer lookout shouted, "Conductor coming!", all the engineers again piled into a bathroom. All of the math majors went into another bathroom. Then, before the conductor came on board, one of the engineers left the bathroom, knocked on the other bathroom, and said, "Ticket please."
About The Embedded Muse
The Embedded Muse is a newsletter sent via email by Jack Ganssle. Send complaints, comments, and contributions to me at jack@ganssle.com.
The Embedded Muse is supported by The Ganssle Group, whose mission is to help embedded folks get better products to market faster.