MTBF and Software

Like many people, I sometimes need a bonk on the noggin to remember some essential bit of wisdom that I shouldn’t have forgotten in the first place. Such is the case with the relationship between hardware and software. Many developers have lost their connection with the hardware. Even though it seems quite obvious that software provides instructions that change the state of the hardware, developers don’t always keep that connection in mind. Once you remember the hardware connection, it also makes sense that any aberration in the behavior of that hardware will show up in the reliability of the software. In short, the Mean Time Between Failures (MTBF) of the hardware also affects the software that runs on it and directs it to perform specific tasks.
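To put a rough number on what an MTBF rating implies, here is a minimal Python sketch. It assumes a constant failure rate (the usual exponential model), which real drives only approximate, and the function name and the MTBF figures in the usage lines are hypothetical, not taken from any vendor's data sheet.

```python
import math


def failure_probability(hours_in_service: float, mtbf_hours: float) -> float:
    """Chance of at least one failure within hours_in_service.

    Assumes a constant failure rate (exponential model), which is a
    simplification: real hardware often fails more when new or worn out.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if hours_in_service < 0:
        raise ValueError("hours_in_service must not be negative")
    return 1.0 - math.exp(-hours_in_service / mtbf_hours)


# Hypothetical comparison: one year of continuous use (8,760 hours)
# on a budget drive versus a higher-rated drive.
hours_per_year = 24 * 365
budget = failure_probability(hours_per_year, mtbf_hours=300_000)
premium = failure_probability(hours_per_year, mtbf_hours=1_000_000)
print(f"budget drive:  {budget:.2%}")
print(f"premium drive: {premium:.2%}")
```

The point of the comparison is only that a lower MTBF means a higher chance of seeing a hardware glitch during any given stretch of development, and therefore a higher chance of mistaking it for a software bug.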

The issue that drove the point home for me is a simple hard drive. This particular hard drive came with the system, and the vendor used a lower cost drive to keep prices low (normally I get really high quality hardware simply to avoid problems). What this means is that the MTBF of the drive is also quite low. Unfortunately, I ran into the consequences late last week in the form of a glitch that made me think there was a problem with my software. The software was just fine; the glitch with the hard drive was the source of the problem. I only realized this fact after testing the software on another system. (Unfortunately, the hard drive got worse and took some of my system configuration with it, but I maintain backups, so the loss was minimal.)

However, the partial failure of the drive caused me to realize yet again that software can only operate correctly when the underlying hardware also operates correctly. I can’t remember the last time I read anything that even broached the topic of hardware as a potential source of software problems. It makes me think that there are probably developers out there right now hunting for a bug that doesn’t exist in their software at all, because the real cause is some hardware glitch.

It’s important to realize that hardware doesn’t always fail in a predictable manner either. For example, a glitch can occur when a hairline fracture develops in one of the runs (traces) of a board. This sort of error makes its appearance when you start the system. When the board heats up, the failure goes away because the metal expands and closes the break in the run. I’ve actually encountered a host of incredibly odd hardware problems over the years, many of which could appear as an isolated software issue given the right circumstances.

The lesson relearned in this case is to always test software on multiple systems. It’s essential that these systems use different components. Doing so will eliminate a number of non-software issues as the source of a problem. For example, testing on mismatched systems can help you recognize when an error is due to a particular device driver. The point is that you need to avoid shooting yourself in the foot by failing to consider all the possibilities. Complex software interacts with the hardware in a complex way, which makes it all the more likely that some seemingly insignificant hardware or firmware issue will cause you woe as a developer.
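One way to make multi-system testing more useful is to record what each test machine looks like alongside its results, so that a failure seen on only one system can be matched against what is different about it. The following is a minimal Python sketch of that idea using the standard platform module. The function names are hypothetical, and the fingerprint is deliberately coarse: it does not identify the drive model, firmware, or driver versions, which you would have to collect with tools specific to your operating system.

```python
import platform


def environment_fingerprint() -> dict:
    """Collect a coarse description of the machine running the tests."""
    return {
        "system": platform.system(),        # e.g. operating system name
        "release": platform.release(),      # operating system release
        "machine": platform.machine(),      # CPU architecture
        "processor": platform.processor(),  # may be empty on some systems
        "python": platform.python_version(),
    }


def differing_fields(a: dict, b: dict) -> dict:
    """Return the fields whose values differ between two fingerprints."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    # Save this output with each test run, then compare the fingerprint
    # of a failing machine against one where the same tests pass.
    print(environment_fingerprint())
```

If the same test passes on one machine and fails on another, the fields returned by differing_fields are a reasonable first place to look before assuming the fault lies in the code.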

What are your experiences with odd hardware- or firmware-related behaviors? Have you ever encountered such behaviors? Let me know at [email protected].