Software Is Less Reliable Than Hardware


The number of hardware faults I've had over the past 10 years can be counted on the fingers of one hand. And that's counting the time I dropped something on an exposed running hard drive, the drive that failed only after running well past its mean time between failures, and the wireless router that was defective out of the box. The software failures I've had in the past 6 months are too numerous to count. -- RK

The only software faults I've had in the last 6 months have been from using software I knew to be in development and unstable - not that my experience is any more useful. Data is not the plural of anecdote. If you applied the same standard to hardware (*all* hardware) as you do to software, the count would be higher anyway. The toaster in our cafeteria at work needs to be on the highest setting to even brown toast. That's a fault. The window in my room is slightly off-true and requires extra effort to shove closed. Fault. The handicap door opener on the men's bathroom forces the door to open too slowly, even when you're pushing it manually. Fault. I've worked briefly in construction - the number of faults and bodge jobs in your house is probably uncountable, especially if it's a cookie-cutter condo type. Perhaps the real problem with software engineering is that we aren't good enough at hiding our errors. -- CM

Now you're applying a double standard. You're using "failure to meet expectations" for hardware and "catastrophic failure" for software. If we start counting software's failures to meet expectations, we'll never stop counting.

The most critical hardware component of any computer is the hard drive. Its mean time between failures is about 5 years. It can be trivially RAIDed to increase its MTBF to millions of years. The only other hardware components that are nearly as critical are the CPU and memory. Everything else can simply be replaced as it breaks down, in the general case.

In the general case, the MTBF for software is measured in weeks. And it cannot be easily replaced nor can its reliability ever be enhanced by a person of normal means. -- RK
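For concreteness, the mirroring claim above can be put on a back-of-the-envelope footing with the standard two-disk approximation. A minimal sketch in C; the 24-hour repair window is an assumption, not a measurement, and taking the five-year MTBF figure at face value yields thousands of years for a plain mirror (each additional mirror multiplies the figure again):

  #include <stdio.h>

  int main(void) {
      /* Assumed numbers: the ~5-year drive MTBF quoted above, and a
         hypothetical 24-hour window to swap the dead disk and rebuild. */
      double mtbf_hours = 5.0 * 8766.0;   /* ~43,830 hours */
      double mttr_hours = 24.0;

      /* Standard approximation for a two-disk mirror (RAID 1): data is
         lost only if the survivor dies during the rebuild window, so
         MTTDL ~= MTBF^2 / (2 * MTTR). */
      double mttdl = mtbf_hours * mtbf_hours / (2.0 * mttr_hours);
      printf("mirror MTTDL: %.2e hours (~%.0f years)\n",
             mttdl, mttdl / 8766.0);
      return 0;
  }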

Richard, I've run into all kinds of hardware problems. It happens with startling regularity. You may not realize it, but in a typical x86 Windows/Linux box (or even an IBM/Apple PPC970fx or Apple/Motorola PPC60x) there are literally dozens of critical faults that the OS and drivers gloss over. Motherboard controllers with design errors, plain-old-faulty processors (and this is in the design, mind you, not just one chip coming out bad), and the RAM failure rate for the average machine is surprisingly high unless you buy expensive stuff.

There is a lot of effort in kernels and drivers to handle this kind of thing. The real difference between software and hardware here is one of acceptance. Driver folks just accept that the world of hardware is a Babel of noise and flaws, and roll with it. You don't hear people up in arms because the PPC970's branch prediction instructions often cause a pipeline flush (so instead of a performance boost, they cause a major performance penalty), but it's still one heck of an awful bug. Compiler writers just shrug and change their optimization profiles.
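That shrug is visible at the source level, too. A minimal sketch of the GCC/Clang branch-hint idiom; likely/unlikely are just conventional macro names, and whether the hint becomes an actual hint instruction, a block-layout decision, or nothing at all is entirely up to the compiler's per-target profile - which is exactly the knob that gets turned when a chip's hint instructions backfire:

  #include <stdio.h>

  /* Conventional wrappers around the GCC/Clang builtin. */
  #define likely(x)   __builtin_expect(!!(x), 1)
  #define unlikely(x) __builtin_expect(!!(x), 0)

  static int parse_byte(int c) {
      if (unlikely(c < 0)) {          /* error path, assumed rare */
          fprintf(stderr, "bad input\n");
          return -1;
      }
      return c * 2;                   /* hot path laid out fall-through */
  }

  int main(void) {
      printf("%d\n", parse_byte(21));
      return 0;
  }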

As software gets more friendly, people begin to use it more. As people begin to use it more, they begin to notice more problems. If a hardware bug causes a relatively rare condition that one application stumbles across often (let's say Photoshop eats up so much RAM that it lands on a bad patch of memory, which crashes the program), it's not Photoshop's fault. But the user would probably say, "Photoshop is buggy." Because, well, only Photoshop eats that much RAM and thus exercises the bug. -- DaveFayram

That's an OS fault. And I know all about hardware bugs, I just don't care.

The conclusion that DougMerritt came up with on SoftwareIndustrialRevolution is that software people deal with all the problems that hardware people have abandoned as too difficult to solve. Well guess what, I just don't care.

Software programmers are in the business of dealing with difficult problems, in the business of dealing with complexity, and in the business of dealing with material things that fail. The profession has been in that business for decades now. And I really don't care to hear any more fucking excuses about why it fails at its job.

-- RK

(A brief interjection in RK's and DaveFayram's dialogue) RK, you are full of it. You demand that software be as reliable as hardware, claiming that there's almost no fault in hardware, but as soon as DaveFayram describes the situation in hardware--that it isn't *nearly* as flawless as you assume, but is glossed over by software--you declare that "you don't care", software should just be reliable, darnit!, because software programmers are in the business of dealing with difficult problems, and complexity, and all that! Do you know who else is in this kind of business--the business of solving difficult problems? Hardware engineers! To claim otherwise is to ignore that chip manufacturers have to use quantum mechanics to make sure that electrons don't "tunnel" into places where they aren't supposed to be. When was the last time you heard of a user interface designer being required to learn quantum mechanics to make sure that their cute little UI is easy to use? This isn't to say that a software designer's job is necessarily easy, but the fact is that we *all* are working on complex problems, and as a result, we *all* produce mistakes. And I don't care if you're not willing to face up to that; indeed, to the degree that I do care, it is only to remember that you have no sense of proportion when it comes to fixing these mistakes. --Alpheus.

As an aside, it's not an OS fault. If your core pages of memory go bad, there is nothing you can do about it. No OS, no matter how well written, can keep the application from crashing. Software can only fake so much reliability on top of unreliable hardware. Anyways...

Are you willing to pay 20x to 30x the cost for this perfect software and hardware, wait 2 to 3 times longer for it, and accept that there will be "good enough" products for much less that other people will be enjoying? That's what it would take to even approach your quality standard.

And if you're willing to do so, can you convince everyone else as well? -- DaveFayram

"If your HD blocks go bad, there is nothing you can do about it. No OS, no matter how well written, can keep the application from crashing."<

You realize that this is quite stupid, right? If specific pages of RAM go bad, it would be possible to detect and mark them as unusable. Sure, it would add complexity. Sure, it might not be worth doing. BUT it still makes it an OS responsibility and not an application responsibility.

Software wouldn't be nearly as expensive as it is if it used up-to-date techniques and tools. The use of state-of-the-art techniques compensates for the added cost of more reliable software. Similarly, the elimination of useless features compensates for the addition of entire new concepts. -- RK
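Mechanically, the detect-and-mark idea above looks something like the following toy allocator; real kernels do a version of it (Linux's hwpoison page offlining, for instance). All names here are made up, and a single test pattern obviously can't catch every failure mode - which is the substance of the reply below:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Toy physical-page pool that quarantines frames failing a test
     pattern instead of ever handing them out again. */
  #define NPAGES   16
  #define PAGESIZE 4096

  static uint8_t memory[NPAGES][PAGESIZE];
  static int     quarantined[NPAGES];
  static int     used[NPAGES];

  static int page_looks_bad(int i) {
      memset(memory[i], 0xA5, PAGESIZE);           /* write pattern */
      for (int j = 0; j < PAGESIZE; j++)
          if (memory[i][j] != 0xA5) return 1;      /* stuck/flipped bit */
      return 0;
  }

  void *alloc_page(void) {
      for (int i = 0; i < NPAGES; i++) {
          if (used[i] || quarantined[i]) continue;
          if (page_looks_bad(i)) {                 /* mark, never reuse */
              quarantined[i] = 1;
              continue;
          }
          used[i] = 1;
          return memory[i];
      }
      return NULL;                                 /* out of good pages */
  }

  int main(void) {
      printf("first good page at %p\n", alloc_page());
      return 0;
  }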

It's impossible to detect bad RAM in the general case - it devolves to the HaltingProblem. It's possible to reduce errors at the cost of varying amounts of performance and money - a lot of mainframe systems have mechanisms in place to catch this. And, amusingly, you cannot deal with this sort of problem using "up to date techniques and tools", which were created to isolate you from these details. And it's not your call whether a feature is useless unless you're someone's client, so your opinion there is unneeded and unwanted.

Richard, where did we get into HD blocks? A "page" of memory is a unit of memory that the OS deals with. It's a concept built in to nearly every virtual memory-based system since ye olden days. We're talking about RAM here. We're talking about what happens if your stick of RAM fritzes and the whole stick loses data integrity. Where do you think all your OS's tables, instructions and queues go? Into RAM. If those specific pages go, the OS goes. Period. You could try to reduce the failure rate by placing the instructions in ROM, but this means your OS can't be upgraded (without reintroducing the same crash flaw). For someone who claims that he does not understand the technical complexities of this issue, you're very quick to throw insults about. -- DaveFayram

Until recently, computers had several sticks of RAM. If one of them blew up, you could quarantine it and use the others. Requiring that a computer have several small sticks of RAM instead of one large one for redundancy is hardly onerous.

As for the OS, if it suffers an immediate failure then it can shut down and you've lost a grand total of a few minutes of your work. Because the OS has been checkpointing everything on a regular basis. And if it hasn't then that's because you're dealing with a broken OS written by incompetent idiots. Which would be an OS failure.

And if the OS doesn't suffer an immediate failure then corruption of the OS means nothing. RAM is only cache after all. Everything in the OS can be pulled out of real storage (hard drives) or recalculated. And if it can't then that's because you're dealing with a broken OS written by incompetent idiots. Which would again be an OS failure.

You keep saying that "nothing can be done by the OS" and I just keep getting the impression that you have no idea what can be done. Ignorance isn't the same thing as absence. -- RK
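The checkpointing described above is, at bottom, an old and simple idiom: write the new state to a temporary file, force it to disk, and atomically rename it over the old checkpoint, so that a crash at any instant leaves either the old state or the new one on disk, never a mix. A minimal POSIX sketch with illustrative filenames:

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  static int checkpoint(const char *state, size_t len) {
      FILE *f = fopen("state.tmp", "wb");
      if (!f) return -1;
      if (fwrite(state, 1, len, f) != len) { fclose(f); return -1; }
      fflush(f);                  /* push stdio's buffer to the kernel */
      fsync(fileno(f));           /* push the kernel's cache to disk */
      fclose(f);
      return rename("state.tmp", "state.dat");    /* atomic on POSIX */
  }

  int main(void) {
      const char doc[] = "user's work so far";
      return checkpoint(doc, sizeof doc) ? EXIT_FAILURE : EXIT_SUCCESS;
  }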

Richard, you just don't understand what I'm saying. Let me break it down for you. You have an instruction pointer into memory. If that pointer leads to corrupted memory, and that's where the OS instructions are, you're toast. History. Finished. There is no more code. You have no recovery strategy in the OS, because there is no more OS. There would be no way for the computer to know how to restore from disk, or what to get. Your OS is toast.

As for the disk, what magical hard drive pixies would let you sync everything to disk at a reasonable speed? If you go through a big computation, are you going to speed limit it to the hard drive? Maybe in 20 years we'll have persistent RAM-like storage. But we sure don't now.

Now, if you have hardware to assist you, then you might be able to put error correction into your RAM. But this would not be in the OS, this would be in the hardware (unless of course you explored the OS-in-ROM-as-hardware approach, which technology has ditched as inflexible).
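For what it's worth, the usual answer to "there is no more code to run" is also hardware-assisted: a watchdog timer that hard-resets the machine unless healthy software keeps checking in. A minimal sketch of the Linux /dev/watchdog interface (it assumes a watchdog driver is loaded; the 10-second period is arbitrary):

  #include <fcntl.h>
  #include <unistd.h>

  int main(void) {
      int fd = open("/dev/watchdog", O_WRONLY);
      if (fd < 0) return 1;              /* no watchdog driver present */
      for (;;) {
          write(fd, "\0", 1);            /* pet the dog */
          sleep(10);                     /* miss a deadline -> hardware reset */
          /* ... real health checks would go here ... */
      }
  }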

I get the impression that you had a 20 minute talk with someone knowledgeable, got a lot of mistaken impressions, and now in true RK style can't go back on what you said, so you're looking for some way to transform this into a LaynesLaw instance so that you can claim you were right/misunderstood all along. -- DaveFayram

I get the impression that you don't have any idea what you're talking about, that you're projecting your general state of ignorance upon me, and that you're not making any effort at all to understand what it is that I'm saying. This is tiresome and boring.

I also get the impression that everything you know or think about me is some kind of projection or generalization based on lesser people. For instance, I always know when I'm arguing about definitions. And either my definition is clearly superior or I give up the fight, sometimes both. I don't have any room to "save my ego" by moving something into a definitional argument because I understand what's going on in any argument no matter how many different levels it operates at. So your ad hominem attack ("can't go back on what you said so you're looking .... LaynesLaw instance") is both sleazy and wrong. -- RK


''You generally refer to software that fails to meet expectations as flawed (at best). I was following your lead. *Catastrophic* software failure is quite rare, for varying values of catastrophic. Hardware failures in both drives and RAM are in fact quite common, and it's software compensating for those failures that keeps you from noticing. And software which fails in such a way that it cannot be easily replaced or its reliability enhanced is practically non-existent (bad firmware flashes, I suppose). The reliability of most software can be *trivially* enhanced simply by following good practices with regards to software usage, like performing regular backups. This is no different from turning your lights on when you drive at night. I have the feeling that when you talk about software failures you're mostly talking about failure to meet expectations in commercial, user-facing, off-the-shelf (or off-the-Internet) software. This is the type of software which is a) the least funded and b) the most difficult to properly certify. It's the absolute worst case of reliable software creation, and the fact that user-facing software works as well as it does now is actually pretty amazing.

Note that creating a RAID array is beyond the capacity of most (90%?) of computer users, and that recovering one from a disk fault is beyond the capacity of most of those who remain. In the case where it works without intervention, it's *software* doing that for you.

You can pretty trivially dismiss almost all software glitches with "so don't do that" explanations. I don't think this is fair, useful, or honest. But I don't think comparing the quality and reliability of software to that of mechanical objects, whose faults and limitations we dismiss with exactly that, is fair, useful, or honest either. -- ChrisMellon?''

"Good software usage practices" my ass. Backups are an outrageous requirement to heap on users! It's nothing less than intellectual dishonesty when you know that any well-designed OS won't require backups at all. And if you don't know this then you're incompetent, which would make you a standard programmer.

All software regularly used by everyday people is horrible, whether "user facing" or not. Everything available on Unidows, from the kernel and filesystem on up, is utter crap.

Changing a hard drive is beyond most computer users. I don't give a fuck. Why do you keep bringing up irrelevancies?

Hardware failures in hard drives are astonishingly low considering that you're dealing with components with moving parts. The fact that they have moving parts makes them about 1000 times more susceptible to failure than solid-state components. If their actual failure rates are on par with software's, then software is 1000 times more buggy than it should be.

I don't accept complexity as an excuse for increased failure. Programming is about dealing with complexity; to say that complexity is inherently buggy is to say that programming is hopeless. Which is false. -- RichardKulisz


Anybody here know what PRML stands for?

[Answer: See http://en.wikipedia.org/wiki/PRML]

The fact that you can get data back from your hard drive is very nearly science fiction. It's not the hardware that assures you that retrieval is possible; it's the software/firmware that does statistical magic (partial-response signaling with maximum-likelihood detection) to reconstruct the data.

The hardware guys I work with can recount any number of scenarios where software/firmware has to deal with bits of flakiness that physics deals them. Trivial examples include such commonplace things as debounce routines.
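Debounce is a nice example because it fits in a dozen lines: a mechanical switch "bounces" for a few milliseconds on each press, so firmware only believes a level once it has held steady for several consecutive samples. A toy sketch with a simulated pin (the bounce pattern and sample count are made up):

  #include <stdio.h>

  #define STABLE_SAMPLES 4

  /* Simulated raw pin: a press preceded by a burst of contact bounce. */
  static const int raw_samples[16] = {0,0,1,0,1,1,0,1,1,1,1,1,1,1,1,1};
  static int read_raw_pin(int t) { return raw_samples[t % 16]; }

  /* Accept a new level only after it holds for STABLE_SAMPLES ticks. */
  static int debounced_read(int t) {
      static int state = 0, count = 0;
      int raw = read_raw_pin(t);
      if (raw == state)                   count = 0;
      else if (++count >= STABLE_SAMPLES) { state = raw; count = 0; }
      return state;
  }

  int main(void) {
      for (int t = 0; t < 16; t++)
          printf("t=%2d raw=%d debounced=%d\n",
                 t, read_raw_pin(t), debounced_read(t));
      return 0;
  }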

We don't argue amongst ourselves over who has the harder job. They don't want my job; I don't want theirs. I'm grateful they're as diligent as they are; they're happy we're able to work around some of the component shortcomings.

I'm not sure why this "brand X is better than brand Y" argument is even here. Oh, that's right, it's here to support the assertion that programmers are bad/stupid/lazy/incompetent/dishonest/bigoted/evil/redundant/smelly and probably communist.

Oh well. -- GarryHamilton

In addition, combined hardware and software solutions for replacing faulty memory modules, faulty CPUs, etc., have been available from time immemorial. They haven't (yet) made it to RK's desktop because they are expensive, and not even the grand InteractionDesign work or the source code for BlueAbyss and company (if such a thing exists at all) is worth the money involved. But in 2005 any decently paid software engineer can, in principle, afford one of those systems, and hopefully, by 2010 even self-appointed interaction designers will be able to buy such a computer system.

But this discussion has never been about technical facts, anyways.



The page title does describe a common perception, especially in this era of complex operating systems requiring weekly updates. IMHO, there are some nuggets in the diatribe above, and the following two sections should be kept. -- IanOsgood

Actually, error correcting RAM is pretty common. And it's a good thing, given how much RAM we stuff into the boxes these days.

See... http://www.pcguide.com/ref/ram/errECC-c.html and/or http://oak.cats.ohiou.edu/~piccard/mis300/eccram.htm
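The idea behind ECC RAM fits in a page. Real modules use a wider SECDED code over each 64-bit word, but the toy Hamming(7,4) sketch below shows the whole trick: parity bits placed so that the "syndrome" of a damaged word spells out the position of the flipped bit. Positions are 1-based, and the bit-packing is just one illustrative layout:

  #include <stdio.h>

  /* Encode 4 data bits into a 7-bit codeword; parity at positions 1,2,4. */
  static unsigned encode(unsigned d) {
      unsigned b[8] = {0};
      b[3] = (d >> 3) & 1;  b[5] = (d >> 2) & 1;
      b[6] = (d >> 1) & 1;  b[7] = d & 1;
      b[1] = b[3] ^ b[5] ^ b[7];        /* covers positions 1,3,5,7 */
      b[2] = b[3] ^ b[6] ^ b[7];        /* covers positions 2,3,6,7 */
      b[4] = b[5] ^ b[6] ^ b[7];        /* covers positions 4,5,6,7 */
      unsigned w = 0;
      for (int i = 1; i <= 7; i++) w |= b[i] << (7 - i);
      return w;
  }

  /* Recompute parity; a nonzero syndrome names the flipped position. */
  static unsigned correct(unsigned w) {
      unsigned b[8];
      for (int i = 1; i <= 7; i++) b[i] = (w >> (7 - i)) & 1;
      unsigned syndrome = ((b[4] ^ b[5] ^ b[6] ^ b[7]) << 2)
                        | ((b[2] ^ b[3] ^ b[6] ^ b[7]) << 1)
                        |  (b[1] ^ b[3] ^ b[5] ^ b[7]);
      if (syndrome) w ^= 1u << (7 - syndrome);
      return w;
  }

  int main(void) {
      unsigned sent = encode(0xB);          /* data bits 1011 */
      unsigned got  = sent ^ (1u << 3);     /* a stray bit-flip */
      printf("sent %02x, got %02x, fixed %02x\n", sent, got, correct(got));
      return 0;
  }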


Could it be that hardware, especially hardware with moving parts, is so unreliable that we take care not to do really complicated things with it? And that we put layers of error checking and correcting hardware and software on top of it, to make it more reliable?

Could it be that software is, or can be, so much more "reliable" for a given level of complexity that we demand complexity so many orders of magnitude higher that reliability becomes a problem again?
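One of those layers in miniature: CRC-32, the checksum used by Ethernet, ZIP, and PNG among others. It corrects nothing by itself, but it lets each layer notice corruption and retry or fail loudly rather than silently pass bad bits upward. A self-contained sketch of the bitwise form (real implementations are table- or hardware-driven):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Reflected CRC-32 (polynomial 0xEDB88320), bit-at-a-time. */
  static uint32_t crc32(const uint8_t *p, size_t n) {
      uint32_t crc = 0xFFFFFFFFu;
      for (size_t i = 0; i < n; i++) {
          crc ^= p[i];
          for (int k = 0; k < 8; k++)
              crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
      }
      return ~crc;
  }

  int main(void) {
      const char *msg = "123456789";
      /* The standard check value for "123456789" is cbf43926. */
      printf("%08x\n", crc32((const uint8_t *)msg, strlen(msg)));
      return 0;
  }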


Complex ICs (CPUs etc.) have very high rejection rates. I am out of that area now, but I seem to recall that 40% - 60% are immediately defective, and there is also a substantial fall-out after a 24-hour "burn in" session. Hard disks regularly ship with bad blocks that are flagged and that the OS must work around. RF bleed-over is a major issue. My cell phone regularly causes speakers to buzz and is not permitted to be used on an airplane. Yes, hardware does have its own set of problems as well.


I can't help but be curious: RK is so certain that software can be more reliable than it is that I can't help but wonder: has he tried to design a computer from scratch? Or an operating system? I, for one, have--at least, I've tried to write the op codes for a ternary computer, and have attempted to find the time to figure out how to translate those op codes into the transistors and the logic gates that make up a computer. (I haven't yet tried to figure out how I'd get a design onto silicon. Yeah, a minor detail, I know, but still...) When I was in college, I tried to develop the "perfect" OS, using Forth, because it's an amazing language. I didn't get too far--pesky homework and all that--but I also stumbled onto Linux, and gradually discovered that it had what I wanted: a multi-process-friendly DOS-like environment. That hasn't quashed my OS-development bug, because I'm always looking to experiment and improve things, but the biggest limiting factor is realizing just how *big* in scope it is to create a modern operating system.

Indeed, I have thought about giving Common Lisp enough underpinnings that it could run on its own. I would need a filesystem, a scheduler (particularly on a multi-core machine), and drivers for a monitor, a keyboard, storage, and probably a mouse, just to get to a REPL (aka a command line)! And once I got there, what would I do with it? I wouldn't be able to run anything on it--certainly not the hundreds of programs that are available for Linux or for Windows.

I have seen laments about how there's so little research with regards to operating systems. The reason for this is that operating systems are far from trivial. I would also propose that there's far more research going on than we realize: the Linux kernel, for example, is evolving all the time, but because it's "volunteer" work that's "under the hood", no one is really aware of the advances that are happening there.

But all this lamentation that software is unreliable, and that it should be just like hardware (or worse, just like engineering), ignores the fact that people can (and do) die just as readily from a hardware problem, whether in computing or in physical engineering, as they can from a software bug; and these deaths happen more often than RK is willing to acknowledge. He's also on the record that he thinks certification of software engineers will fix this; I happen to have a friend who is a competent Civil Engineer, and the stories he tells me of engineers who are either certified or well on their way to certification don't increase my confidence in the profession. For that matter, he pointed out that a very large percentage of all those "specifications" we have in the Real World--things like "this barrier should be rated for collisions at 1 ton"--DO NOT have any sort of study or even engineering analysis behind them. They are pretty much created "off the cuff", using "intuition". Which is pretty much how software is designed, is it not?

Sure, things can be better, all around. It isn't pleasant to discover a new principle in materials science by having a building collapse on you (particularly since such a principle may already be known to some engineers, and may have been discoverable by good Finite Element Analysis), just as it's gut-wrenching to discover that a bit of software allowed patients to get fried by a machine. But the truth is, when we're building things that involve high amounts of energy (and in buildings, the structure itself represents a huge amount of potential energy, released when it collapses), and something goes wrong, people are going to die! And we have only so much time, energy, and patience to devote to preventing these deaths. At some point, we have to say "The risk is small enough. It's time to move on to the next project." and then pick up the pieces when things do go wrong. --Alpheus

