Software Is Less Reliable Than Hardware


The number of hardware faults I've had over the past 10 years can be counted on the fingers of one hand. And that's counting the time I dropped something on an exposed running hard drive, the drive that failed only after running well past its mean time between failures, and the wireless router that was defective out of the box. The software failures I've had in the past 6 months are too numerous to count. -- RK

The only software faults I've had in the last 6 months have been from using software I knew to be in development and unstable - not that my experience is any more useful. Data is not the plural of anecdote. If you applied the same standard to hardware (*all* hardware) as you do to software, the count would be higher anyway. The toaster in our cafeteria at work needs to be on the highest setting to even brown toast. That's a fault. The window in my room is slightly off-true and requires extra effort to shove closed. Fault. The handicap door opener on the men's bathroom forces the door to open too slowly, even when you're pushing it manually. Fault. I've worked briefly in construction - the number of faults and bodge jobs in your house is probably uncountable, especially if it's a cookie-cutter condo type. Perhaps the real problem with software engineering is that we aren't good enough at hiding our errors. -- CM

Now you're applying a double standard. You're using "failure to meet expectations" for hardware and "catastrophic failure" for software. If we start counting software's failures to meet expectations, we'll never stop counting.

The most critical hardware component of any computer is the hard drive. Its mean time between failures is about 5 years. It can be trivially RAIDed to increase its MTBF to millions of years. The only other hardware components that are nearly as critical are the CPU and memory. Everything else can simply be replaced as it breaks down, in the general case.

In the general case, the MTBF for software is measured in weeks. And it cannot be easily replaced nor can its reliability ever be enhanced by a person of normal means. -- RK
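For concreteness, the mirroring claim above can be put on a back-of-the-envelope footing with the standard two-disk approximation. A minimal sketch in C; the 24-hour repair window is an assumption, not a measurement, and taking the five-year MTBF figure at face value yields thousands of years for a plain mirror (each additional mirror multiplies the figure again):

  #include <stdio.h>

  int main(void) {
      /* Assumed numbers: the ~5-year drive MTBF quoted above, and a
         hypothetical 24-hour window to swap the dead disk and rebuild. */
      double mtbf_hours = 5.0 * 8766.0;   /* ~43,830 hours */
      double mttr_hours = 24.0;

      /* Standard approximation for a two-disk mirror (RAID 1): data is
         lost only if the survivor dies during the rebuild window, so
         MTTDL ~= MTBF^2 / (2 * MTTR). */
      double mttdl = mtbf_hours * mtbf_hours / (2.0 * mttr_hours);
      printf("mirror MTTDL: %.2e hours (~%.0f years)\n",
             mttdl, mttdl / 8766.0);
      return 0;
  }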

Richard, I've run into all kinds of hardware problems. It happens with startling regularity. You may not realize it, but in a typical x86 Windows/Linux box (or even an IBM/Apple PPC970fx or Apple/Motorola PPC60x) there are literally dozens of critical faults that the OS and drivers gloss over. Motherboard controllers with design errors, plain-old-faulty processors (and this is in the design, mind you, not just one chip coming out bad), and the RAM failure rate for the average machine is surprisingly high unless you buy expensive stuff.

There is a lot of effort in kernels and drivers to handle this kind of thing. The real difference between software and hardware here is one of acceptance. Driver folks just accept that the world of hardware is a Babel of noise and flaws, and roll with it. You don't hear people up in arms because the PPC970's branch prediction instructions often cause a pipeline flush (so instead of a performance boost, they cause a major performance penalty), but it's still one heck of an awful bug. Compiler writers just shrug and change their optimization profiles.
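That shrug is visible at the source level, too. A minimal sketch of the GCC/Clang branch-hint idiom; likely/unlikely are just conventional macro names, and whether the hint becomes an actual hint instruction, a block-layout decision, or nothing at all is entirely up to the compiler's per-target profile - which is exactly the knob that gets turned when a chip's hint instructions backfire:

  #include <stdio.h>

  /* Conventional wrappers around the GCC/Clang builtin. */
  #define likely(x)   __builtin_expect(!!(x), 1)
  #define unlikely(x) __builtin_expect(!!(x), 0)

  static int parse_byte(int c) {
      if (unlikely(c < 0)) {          /* error path, assumed rare */
          fprintf(stderr, "bad input\n");
          return -1;
      }
      return c * 2;                   /* hot path laid out fall-through */
  }

  int main(void) {
      printf("%d\n", parse_byte(21));
      return 0;
  }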

As software gets more friendly, people begin to use it more. As people begin to use it more, they begin to notice more problems. If a hardware bug causes a relatively rare condition that one application stumbles across often (let's say Photoshop eats up so much RAM that it lands on a bad patch of memory, which crashes the program), it's not Photoshop's fault. But the user would probably say, "Photoshop is buggy." Because, well, only Photoshop eats that much RAM and thus exercises the bug. -- DaveFayram

That's an OS fault. And I know all about hardware bugs, I just don't care.

The conclusion that DougMerritt came up with on SoftwareIndustrialRevolution is that software people deal with all the problems that hardware people have abandoned as too difficult to solve. Well guess what, I just don't care.

Software programmers are in the business of dealing with difficult problems, in the business of dealing with complexity, and in the business of dealing with material things that fail. The profession has been in that business for decades now. And I really don't care to hear any more fucking excuses about why it fails at its job.

-- RK

(A brief interjection in RK's and DaveFayram's dialogue) RK, you are full of it. You demand that software be as reliable as hardware, claiming that there's almost no fault in hardware, but as soon as DaveFayram describes the situation in hardware--that it isn't *nearly* as flawless as you assume, but is glossed over by software--you declare that "you don't care", software should just be reliable, darnit!, because software programmers are in the business of dealing with difficult problems, and complexity, and all that! Do you know who else is in this kind of business--the business of solving difficult problems? Hardware engineers! To claim otherwise is to ignore that chip manufacturers have to use quantum mechanics to make sure that electrons don't "tunnel" into places where they aren't supposed to be. When was the last time you heard of a user interface designer being required to learn quantum mechanics to make sure that their cute little UI is easy to use? This isn't to say that a software designer's job is necessarily easy, but the fact is that we *all* are working on complex problems, and as a result, we *all* produce mistakes. And I don't care if you're not willing to face up to that; indeed, to the degree that I do care, it is only to remember that you have no sense of proportion when it comes to fixing these mistakes. --Alpheus.

As an aside, it's not an OS fault. If your core pages of memory go bad, there is nothing you can do about it. No OS, no matter how well written, can keep the application from crashing. Software can only fake so much reliability on top of unreliable hardware. Anyways...

Are you willing to pay 20x to 30x the cost for this perfect software and hardware, wait 2 to 3 times longer for it, and accept that there will be "good enough" products for much less that other people will be enjoying? That's what it would take to even approach your quality standard.

And if you're willing to do so, can you convince everyone else as well? -- DaveFayram

"If your HD blocks go bad, there is nothing you can do about it. No OS, no matter how well written, can keep the application from crashing."<

You realize that this is quite stupid, right? If specific pages of RAM go bad, it would be possible to detect and mark them as unusable. Sure, it would add complexity. Sure, it might not be worth doing. BUT it still makes it an OS responsibility and not an application responsibility.

Software wouldn't be nearly as expensive as it is if it used up-to-date techniques and tools. The use of state-of-the-art techniques compensates for the added cost of more reliable software. Similarly, the elimination of useless features compensates for the addition of entire new concepts. -- RK
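Mechanically, the detect-and-mark idea above looks something like the following toy allocator; real kernels do a version of it (Linux's hwpoison page offlining, for instance). All names here are made up, and a single test pattern obviously can't catch every failure mode - which is the substance of the reply below:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Toy physical-page pool that quarantines frames failing a test
     pattern instead of ever handing them out again. */
  #define NPAGES   16
  #define PAGESIZE 4096

  static uint8_t memory[NPAGES][PAGESIZE];
  static int     quarantined[NPAGES];
  static int     used[NPAGES];

  static int page_looks_bad(int i) {
      memset(memory[i], 0xA5, PAGESIZE);           /* write pattern */
      for (int j = 0; j < PAGESIZE; j++)
          if (memory[i][j] != 0xA5) return 1;      /* stuck/flipped bit */
      return 0;
  }

  void *alloc_page(void) {
      for (int i = 0; i < NPAGES; i++) {
          if (used[i] || quarantined[i]) continue;
          if (page_looks_bad(i)) {                 /* mark, never reuse */
              quarantined[i] = 1;
              continue;
          }
          used[i] = 1;
          return memory[i];
      }
      return NULL;                                 /* out of good pages */
  }

  int main(void) {
      printf("first good page at %p\n", alloc_page());
      return 0;
  }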

It's impossible to detect bad RAM in the general case - it devolves to the HaltingProblem. It's possible to reduce errors at the cost of varying amounts of performance and money - a lot of mainframe systems have mechanisms in place to catch this. And, amusingly, you cannot deal with this sort of problem using "up to date techniques and tools", which were created to isolate you from these details. And it's not your call whether a feature is useless unless you're someone's client, so your opinion there is unneeded and unwanted.

Richard, where did we get into HD blocks? A "page" of memory is a unit of memory that the OS deals with. It's a concept built in to nearly every virtual memory-based system since ye olden days. We're talking about RAM here. We're talking about what happens if your stick of RAM fritzes and the whole stick loses data integrity. Where do you think all your OS's tables, instructions and queues go? Into RAM. If those specific pages go, the OS goes. Period. You could try to reduce the failure rate by placing the instructions in ROM, but this means your OS can't be upgraded (without reintroducing the same crash flaw). For someone who claims that he does not understand the technical complexities of this issue, you're very quick to throw insults about. -- DaveFayram

Until recently, computers had several sticks of RAM. If one of them blew up, you could quarantine it and use the others. Requiring that a computer have several small sticks of RAM instead of one large one for redundancy is hardly onerous.

As for the OS, if it suffers an immediate failure then it can shut down and you've lost a grand total of a few minutes of your work. Because the OS has been checkpointing everything on a regular basis. And if it hasn't then that's because you're dealing with a broken OS written by incompetent idiots. Which would be an OS failure.

And if the OS doesn't suffer an immediate failure then corruption of the OS means nothing. RAM is only cache after all. Everything in the OS can be pulled out of real storage (hard drives) or recalculated. And if it can't then that's because you're dealing with a broken OS written by incompetent idiots. Which would again be an OS failure.

You keep saying that "nothing can be done by the OS" and I just keep getting the impression that you have no idea what can be done. Ignorance isn't the same thing as absence. -- RK
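The checkpointing described above is, at bottom, an old and simple idiom: write the new state to a temporary file, force it to disk, and atomically rename it over the old checkpoint, so that a crash at any instant leaves either the old state or the new one on disk, never a mix. A minimal POSIX sketch with illustrative filenames:

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  static int checkpoint(const char *state, size_t len) {
      FILE *f = fopen("state.tmp", "wb");
      if (!f) return -1;
      if (fwrite(state, 1, len, f) != len) { fclose(f); return -1; }
      fflush(f);                  /* push stdio's buffer to the kernel */
      fsync(fileno(f));           /* push the kernel's cache to disk */
      fclose(f);
      return rename("state.tmp", "state.dat");    /* atomic on POSIX */
  }

  int main(void) {
      const char doc[] = "user's work so far";
      return checkpoint(doc, sizeof doc) ? EXIT_FAILURE : EXIT_SUCCESS;
  }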

Richard, you just don't understand what I'm saying. Let me break it down for you. You have an instruction pointer into memory. If that pointer leads to corrupted memory, and that's where the OS instructions are, you're toast. History. Finished. There is no more code. You have no recovery strategy in the OS, because there is no more OS. There would be no way for the computer to know how to restore from disk, or what to get. Your OS is toast.

As for the disk, what magical hard drive pixies would let you sync everything to disk at a reasonable speed? If you go through a big computation, are you going to speed limit it to the hard drive? Maybe in 20 years we'll have persistent RAM-like storage. But we sure don't now.

Now, if you have hardware to assist you, then you might be able to put error correction into your RAM. But this would not be in the OS, this would be in the hardware (unless of course you explored the OS-in-ROM-as-hardware approach, which technology has ditched as inflexible).
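For what it's worth, the usual answer to "there is no more code to run" is also hardware-assisted: a watchdog timer that hard-resets the machine unless healthy software keeps checking in. A minimal sketch of the Linux /dev/watchdog interface (it assumes a watchdog driver is loaded; the 10-second period is arbitrary):

  #include <fcntl.h>
  #include <unistd.h>

  int main(void) {
      int fd = open("/dev/watchdog", O_WRONLY);
      if (fd < 0) return 1;              /* no watchdog driver present */
      for (;;) {
          write(fd, "\0", 1);            /* pet the dog */
          sleep(10);                     /* miss a deadline -> hardware reset */
          /* ... real health checks would go here ... */
      }
  }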

I get the impression that you had a 20 minute talk with someone knowledgeable, got a lot of mistaken impressions, and now in true RK style can't go back on what you said, so you're looking for some way to transform this into a LaynesLaw instance so that you can claim you were right/misunderstood all along. -- DaveFayram

I get the impression that you don't have any idea what you're talking about, that you're projecting your general state of ignorance upon me, and that you're not making any effort at all to understand what it is that I'm saying. This is tiresome and boring.

I also get the impression that everything you know or think about me is some kind of projection or generalization based on lesser people. For instance, I always know when I'm arguing about definitions. And either my definition is clearly superior or I give up the fight, sometimes both. I don't have any room to "save my ego" by moving something into a definitional argument because I understand what's going on in any argument no matter how many different levels it operates at. So your ad hominem attack ("can't go back on what you said so you're looking .... LaynesLaw instance") is both sleazy and wrong. -- RK


''You generally refer to software that fails to meet expectations as flawed (at best). I was following your lead. *Catastrophic* software failure is quite rare, for varying values of catastrophic. Hardware failures in both drives and RAM are in fact quite common, and it's software compensating for those failures that keeps you from noticing. And software which fails in such a way that it cannot be easily replaced or its reliability enhanced is practically non-existent (bad firmware flashes, I suppose). The reliability of most software can be *trivially* enhanced simply by following good practices with regards to software usage, like performing regular backups. This is no different from turning your lights on when you drive at night. I have the feeling that when you talk about software failures you're mostly talking about failure to meet expectations in commercial, user-facing, off-the-shelf (or off-the-Internet) software. This is the type of software which is a) the least funded and b) the most difficult to properly certify. It's the absolute worst case of reliable software creation, and the fact that user-facing software works as well as it does now is actually pretty amazing.

Note that creating a RAID array is beyond the capacity of most (90%?) of computer users, and that recovering one from a disk fault is beyond the capacity of most of those who remain. In the case where it works without intervention, it's *software* doing that for you.

You can pretty trivially dismiss almost all software glitches with "so don't do that" explanations. I don't think this is fair, useful, or honest. But I don't think comparing the quality and reliability of software to that of mechanical objects, whose faults and limitations we dismiss with exactly that, is fair, useful, or honest either. -- ChrisMellon?''

"Good software usage practices" my ass. Backups are an outrageous requirement to heap on users! It's nothing less than intellectual dishonesty when you know that any well-designed OS won't require backups at all. And if you don't know this then you're incompetent, which would make you a standard programmer.

All software regularly used by everyday people is horrible, whether "user facing" or not. Everything available on Unidows, from the kernel and filesystem on up, is utter crap.

Changing a hard drive is beyond most computer users. I don't give a fuck. Why do you keep bringing up irrelevancies?

Hardware failures in hard drives are astonishingly low considering that you're dealing with components with moving parts. The fact that they have moving parts makes them about 1000 times more susceptible to failure than solid-state components. If their actual failure rates are on par with software's, then software is 1000 times more buggy than it should be.

I don't accept complexity as an excuse for increased failure. Programming is about dealing with complexity; to say that complexity is inherently buggy is to say that programming is hopeless. Which is false. -- RichardKulisz


Anybody here know what PRML stands for?

[Answer: See http://en.wikipedia.org/wiki/PRML]

The fact that you can get data back from your hard drive is very nearly science fiction. It's not the hardware that assures you that retrieval is possible; it's the software/firmware that does statistical magic (partial-response signaling with maximum-likelihood detection) to reconstruct the data.

The hardware guys I work with can recount any number of scenarios where software/firmware has to deal with bits of flakiness that physics deals them. Trivial examples include such commonplace things as debounce routines.
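Debounce is a nice example because it fits in a dozen lines: a mechanical switch "bounces" for a few milliseconds on each press, so firmware only believes a level once it has held steady for several consecutive samples. A toy sketch with a simulated pin (the bounce pattern and sample count are made up):

  #include <stdio.h>

  #define STABLE_SAMPLES 4

  /* Simulated raw pin: a press preceded by a burst of contact bounce. */
  static const int raw_samples[16] = {0,0,1,0,1,1,0,1,1,1,1,1,1,1,1,1};
  static int read_raw_pin(int t) { return raw_samples[t % 16]; }

  /* Accept a new level only after it holds for STABLE_SAMPLES ticks. */
  static int debounced_read(int t) {
      static int state = 0, count = 0;
      int raw = read_raw_pin(t);
      if (raw == state)                   count = 0;
      else if (++count >= STABLE_SAMPLES) { state = raw; count = 0; }
      return state;
  }

  int main(void) {
      for (int t = 0; t < 16; t++)
          printf("t=%2d raw=%d debounced=%d\n",
                 t, read_raw_pin(t), debounced_read(t));
      return 0;
  }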

We don't argue amongst ourselves over who has the harder job. They don't want my job; I don't want theirs. I'm grateful they're as diligent as they are; they're happy we're able to work around some of the component shortcomings.

I'm not sure why this "brand X is better than brand Y" argument is even here. Oh, that's right, it's here to support the assertion that programmers are bad/stupid/lazy/incompetent/dishonest/bigoted/evil/redundant/smelly and probably communist.

Oh well. -- GarryHamilton

In addition, combined hardware and software solutions for replacing faulty memory modules, faulty CPUs, etc., have been available from time immemorial. They haven't (yet) made it to RK's desktop because they are expensive, and not even the grand InteractionDesign work or the source code for BlueAbyss and company (if such a thing exists at all) is worth the money involved. But in 2005 any decently paid software engineer can, in principle, afford one of those systems, and hopefully, by 2010 even self-appointed interaction designers will be able to buy such a computer system.

But this discussion has never been about technical facts, anyways.



The page title does describe a common perception, especially in this era of complex operating systems requiring weekly updates. IMHO, there are some nuggets in the diatribe above, and the following two sections should be kept. -- IanOsgood

Actually, error correcting RAM is pretty common. And it's a good thing, given how much RAM we stuff into the boxes these days.

See... http://www.pcguide.com/ref/ram/errECC-c.html and/or http://oak.cats.ohiou.edu/~piccard/mis300/eccram.htm
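The idea behind ECC RAM fits in a page. Real modules use a wider SECDED code over each 64-bit word, but the toy Hamming(7,4) sketch below shows the whole trick: parity bits placed so that the "syndrome" of a damaged word spells out the position of the flipped bit. Positions are 1-based, and the bit-packing is just one illustrative layout:

  #include <stdio.h>

  /* Encode 4 data bits into a 7-bit codeword; parity at positions 1,2,4. */
  static unsigned encode(unsigned d) {
      unsigned b[8] = {0};
      b[3] = (d >> 3) & 1;  b[5] = (d >> 2) & 1;
      b[6] = (d >> 1) & 1;  b[7] = d & 1;
      b[1] = b[3] ^ b[5] ^ b[7];        /* covers positions 1,3,5,7 */
      b[2] = b[3] ^ b[6] ^ b[7];        /* covers positions 2,3,6,7 */
      b[4] = b[5] ^ b[6] ^ b[7];        /* covers positions 4,5,6,7 */
      unsigned w = 0;
      for (int i = 1; i <= 7; i++) w |= b[i] << (7 - i);
      return w;
  }

  /* Recompute parity; a nonzero syndrome names the flipped position. */
  static unsigned correct(unsigned w) {
      unsigned b[8];
      for (int i = 1; i <= 7; i++) b[i] = (w >> (7 - i)) & 1;
      unsigned syndrome = ((b[4] ^ b[5] ^ b[6] ^ b[7]) << 2)
                        | ((b[2] ^ b[3] ^ b[6] ^ b[7]) << 1)
                        |  (b[1] ^ b[3] ^ b[5] ^ b[7]);
      if (syndrome) w ^= 1u << (7 - syndrome);
      return w;
  }

  int main(void) {
      unsigned sent = encode(0xB);          /* data bits 1011 */
      unsigned got  = sent ^ (1u << 3);     /* a stray bit-flip */
      printf("sent %02x, got %02x, fixed %02x\n", sent, got, correct(got));
      return 0;
  }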


Could it be that hardware, especially hardware with moving parts, is so unreliable that we take care not to do really complicated things with it? And that we put layers of error checking and correcting hardware and software on top of it, to make it more reliable?

Could it be that software is, or can be, so much more "reliable" for a given level of complexity that we demand complexity so many orders of magnitude higher that reliability becomes a problem again?
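One of those layers in miniature: CRC-32, the checksum used by Ethernet, ZIP, and PNG among others. It corrects nothing by itself, but it lets each layer notice corruption and retry or fail loudly rather than silently pass bad bits upward. A self-contained sketch of the bitwise form (real implementations are table- or hardware-driven):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Reflected CRC-32 (polynomial 0xEDB88320), bit-at-a-time. */
  static uint32_t crc32(const uint8_t *p, size_t n) {
      uint32_t crc = 0xFFFFFFFFu;
      for (size_t i = 0; i < n; i++) {
          crc ^= p[i];
          for (int k = 0; k < 8; k++)
              crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
      }
      return ~crc;
  }

  int main(void) {
      const char *msg = "123456789";
      /* The standard check value for "123456789" is cbf43926. */
      printf("%08x\n", crc32((const uint8_t *)msg, strlen(msg)));
      return 0;
  }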


Complex ICs (CPUs etc.) have very high rejection rates. I am out of that area now, but I seem to recall that 40% - 60% are immediately defective, and there is also a substantial fall-out after a 24-hour "burn in" session. Hard disks regularly ship with bad blocks that are flagged and that the OS must work around. RF bleed-over is a major issue. My cell phone regularly causes speakers to buzz and is not permitted to be used on an airplane. Yes, hardware does have its own set of problems as well.


I can't help but be curious: RK is so certain that software can be more reliable than it is that I can't help but wonder: has he tried to design a computer from scratch? Or an operating system? I, for one, have--at least, I've tried to write the op codes for a ternary computer, and have attempted to find the time to figure out how to translate those op codes into the transistors and the logic gates that make up a computer. (I haven't yet tried to figure out how I'd get a design onto silicon. Yeah, a minor detail, I know, but still...) When I was in college, I tried to develop the "perfect" OS, using Forth, because it's an amazing language. I didn't get too far--pesky homework and all that--but I also stumbled onto Linux, and gradually discovered that it had what I wanted: a multi-process-friendly DOS-like environment. That hasn't quashed my OS-development bug, because I'm always looking to experiment and improve things, but the biggest limiting factor is realizing just how *big* in scope it is to create a modern operating system.

Indeed, I have thought about giving Common Lisp enough underpinnings that it could run on its own. I would need a filesystem, a scheduler (particularly on a multi-core machine), and drivers for a monitor, a keyboard, storage, and probably a mouse, just to get to a REPL (aka a command line)! And once I got there, what would I do with it? I wouldn't be able to run anything on it--certainly not the hundreds of programs that are available for Linux or for Windows.

I have seen laments about how there's so little research with regards to operating systems. The reason for this is that operating systems are far from trivial. I would also propose that there's far more research going on than we realize: the Linux kernel, for example, is evolving all the time, but because it's "volunteer" work that's "under the hood", no one is really aware of the advances that are happening there.

But all this lamentation that software is unreliable, and that it should be just like hardware (or worse, just like engineering), ignores the fact that people can (and do) die just as readily from a hardware problem, whether in computing or in physical engineering, as they can from a software bug; and these deaths happen more often than RK is willing to acknowledge. He's also on the record that he thinks certification of software engineers will fix this; I happen to have a friend who is a competent Civil Engineer, and the stories he tells me of engineers who are either certified or well on their way to certification don't increase my confidence in the profession. For that matter, he pointed out that a very large percentage of all those "specifications" we have in the Real World--things like "this barrier should be rated for collisions at 1 ton"--DO NOT have any sort of study or even engineering analysis behind them. They are pretty much created "off the cuff", using "intuition". Which is pretty much how software is designed, is it not?

Sure, things can be better, all around. It isn't pleasant to discover a new principle in materials science by having a building collapse on you (particularly since such a principle may already be known to some engineers, and may have been discoverable by good Finite Element Analysis), just as it's gut-wrenching to discover that a bit of software allowed patients to get fried by a machine. But the truth is, when we're building things that involve high amounts of energy (and in buildings, the structure itself represents a huge amount of potential energy, released when it collapses), and something goes wrong, people are going to die! And we have only so much time, energy, and patience to devote to preventing these deaths. At some point, we have to say "The risk is small enough. It's time to move on to the next project." and then pick up the pieces when things do go wrong. --Alpheus

