A computer industry writer once remarked that when animals are wounded, they can often still limp around - enough to find shelter and heal, or to evade a predator long enough for the herd to help out. When computer programs are "wounded", however, they often die a complete death. One binary digit gets flipped, and the result is often chaos or total, instant failure.
Related to this: some colleagues and I once had a fierce debate about whether report software that showed averages should halt processing when it encountered a zero divisor, or continue on, assuming an average of zero. Whether the divisor could ever be zero under "normal" circumstances was hard to determine from the requirements. Officially it couldn't, but I suspected a zero might slip through on occasion. Letting the report continue with a zero average at least keeps the report available; sometimes an imperfect report is better than none when the user is in a hurry and there is no time to trace bad divisors. Ideas such as stamping a large warning message on the report were also considered.
Some report software (such as Cognos' products) expects divide-by-zero situations to occur. Cognos products continue processing but show the result as "/0", telling the user that a divide-by-zero was encountered.
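A minimal sketch of that "keep going, but flag it" approach, in Python; the report rows and column layout here are made up for illustration, and the "/0" marker simply mirrors the Cognos behavior described above:

    def safe_average(total, count):
        """Return the average, or a "/0" marker when the divisor is zero."""
        if count == 0:
            return "/0"        # flag the bad divisor instead of halting the whole report
        return total / count

    # Made-up report rows: (region, total_sales, number_of_orders)
    rows = [("East", 120.0, 4), ("West", 75.0, 0)]
    for region, total, count in rows:
        print(region, safe_average(total, count))
    # East 30.0
    # West /0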
Suppose software ran your pacemaker, and you were out hiking far from civilization when the pacemaker hit a fatal floating-point error of some kind. Would you rather the software just stop, or push through a zero and try to continue?
That's a huge leap from generating a report or assessing CPU speed for gameplay to life-critical embedded software. No single solution applies across the range of software.
I did not mean that it applies across the board. It is a question to ask, not an answer.
Hmm. This seems to be a discussion on one facet of graceful failure recovery. Are there pages here already dedicated to that? Let's take a gander around; perhaps this page can be merged with one already started on this topic.
Good idea: FailFast, FaultTolerance, FaultIsolation.
This page reeks of ArgumentByAnalogy. A computer with a bad bit is not a wounded animal, nor does that have any real bearing on a pacemaker. Mission critical systems in the real world are made of independent redundant individually-FailFast systems, not a single fault tolerant system.
To correct the analogy: if a bone breaks in an animal, the animal can limp away, but not because any single system is fault-tolerant. The animal has lost that bone, that system. It doesn't limp away on the broken leg; it uses its other three legs. It abandons the broken system and relies on backups. The animal has a large amount of redundancy across systems to accomplish this. It's an organic system - horribly inefficient, but very resilient. Most computer systems are not organic with many interrelated backup systems; they are generally made of a very small set of horribly brittle parts with little or no redundancy whatsoever. Just like an animal's, each individual system is brittle, but unlike an animal there are few or no backup systems.
If you want to take cues from nature, then for a mission-critical system bring along three computers, each with its own hardware and an independently written algorithm, plus a vote-counting machine that acts on the majority result and reboots whichever machine cast the dissenting vote.
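A toy sketch of that voting idea, assuming three independently written replicas of the same calculation; the replicas and the "reboot" step are invented for illustration:

    from collections import Counter

    def vote(results):
        """Return the majority result plus the indexes of any dissenting replicas."""
        winner, _ = Counter(results).most_common(1)[0]
        dissenters = [i for i, r in enumerate(results) if r != winner]
        return winner, dissenters

    # Three "independent" implementations of the same computation; the third is faulty.
    replicas = [lambda x: x * 2, lambda x: x + x, lambda x: x * 3]
    results = [f(21) for f in replicas]        # [42, 42, 63]
    answer, to_reboot = vote(results)
    print(answer, to_reboot)                   # 42 [2] -> restart replica 2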
A resilient system in the real world is not a system made resilient all the way through. A resilient system is a collection of brittle systems with many backups. That's not to say you can't make a single system resilient the whole way through; it's just much more expensive, both in nature and in programming. It's simpler to abandon a broken system and rely on backups until the "main" system can be fixed. Trying to make a system resilient all the way through requires an exorbitant amount of error-checking code, most of which is nigh impossible to test, thereby actually increasing bugs and brittleness.
While redundancy is part of it, it's not the whole story. Another strategy in organic systems is the ability to adapt. If an animal loses some of its vision, it may learn to rely on its nose more, and vice versa.
I usually hear the terms used as follows:
Whether you claim that a "system" is fault-tolerant to some class of faults, vs. "made of fail-fast parts", really depends on how you circumscribe the "system". A fault-tolerant mission-critical system might be constructed of fault-tolerant parts just as easily as of fail-fast parts. Animals with a broken limb will still use that limb unless it has taken too much damage. (There's a difference between a hairline fracture and extensive shattering.)
Computer systems use many mechanisms for fault tolerance: RAID, bad-block remapping on HDDs and SSDs, CRCs and error-correcting codes to detect and correct faults, watchdogs, heartbeats, etc.
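As one small illustration of the "detect and fall back" flavor of those mechanisms, here is a CRC check over a stored block using only the Python standard library; the storage scheme and the fallback policy are placeholders for illustration:

    import zlib

    def store(block: bytes):
        """Keep a checksum alongside the data so later corruption can be detected."""
        return block, zlib.crc32(block)

    def load(block: bytes, crc: int) -> bytes:
        """Verify the checksum before trusting the block; fail fast on a mismatch."""
        if zlib.crc32(block) != crc:
            raise IOError("corrupt block - fall back to a redundant copy or retry")
        return block

    data, crc = store(b"payload")
    try:
        load(b"paxload", crc)                  # simulate a flipped bit in storage
    except IOError as err:
        print(err)                             # the caller can now use the mirror copy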
The LimpVersusDie opening seems to be remarking on the StateOfTheArt of GeneralPurposeProgrammingLanguages at the time. TypeSafety has pursued 'robustness' to ever greater degrees, but it would be nice if our languages also made it easy to achieve graceful degradation and high levels of resilience: languages that help us declare fallback systems, that implicitly perform caching to protect against disruption, that have advanced mechanisms to restore communications after loss, and so on. PersistentLanguage covers resilience for a small but relevant subset of faults, namely disruption, but doesn't by itself imply GracefulDegradation.
One reason for the weakness of languages today is that they're running "locally" by default. That was a good default assumption twenty years ago, perhaps, but the world is now wired together, and "program systems" tend to cross multiple architectures and systems that can fail independently. We need programming languages that recognize this, continue providing service even as individual components fail, that help identify failure conditions to turn what would otherwise be 'subtle' failures into FailFast subsystems that can be rebuilt after full failure, and that resiliently regenerate after temporary failures or after the faulty nodes are replaced.
Redundancy is an important ingredient in achieving these properties, I'm sure. Data redundancy is needed in case disks or nodes fail, for example, and is required to regenerate systems when they recover. Redundancy also supports load-balancing and increased capacity while running normally, so serves as a basis for GracefulDegradation.
A financial transaction either happened or didn't happen. There may be a myriad of reasons why it didn't happen (application error, bad data, communications error, gateway error, bank rejection, etc.), but there's no room for fuzziness or limping along when it comes to whether or not it did happen. Sometimes it just makes sense to 'die'. Often, if not usually, in fact, it's better to fail than to proceed with invalid data. If you don't shoot that horse, you're going to be in for a lot of trouble.
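A sketch of that all-or-nothing ideal using SQLite transactions from the Python standard library; the table and account names are invented for illustration:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()

    def transfer(conn, src, dst, amount):
        """Move funds atomically: either both updates commit, or neither happens."""
        with conn:   # commits on success, rolls back if anything below raises
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))

    transfer(conn, "alice", "bob", 40)         # the transaction happened, completely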
That may be the ideal, but in general it ain't true. Fraud, failure, comms loss, shipping loss, etc. have all in the past been serious partial-failures for financial transactions. ACID properties for financial transactions attempt to achieve something better than what earlier systems ever offered, but even those are rare between banks, where all sorts of compensatory protocols are involved that don't achieve ACID properties.
All of this limping along is still relying on ArgumentByAnalogy. A computer program cannot limp.
Limping is one of two things in an animal: 1- Not using the broken leg and using the 3 backup systems, aka the other legs. 2- Being gentle with a leg with a hairline fracture, being careful to not put a lot of weight on it.
A computer program is not 1. It is not composed of independent backup systems. For some programming languages, the slightest misbehaving component can kill everything.
More importantly, a computer program is not 2. A bone when fractured decays in properties and performance in a linear fashion. When a bone has a hairline fracture, it still has the properties of a bone. It can still support weight. Computer programs are vastly more complex and show chaotic behavior. If you have the slightest logic flaw in your program, or a cosmic ray twiddles a bit, then your program will not degrade in this linear fashion. The slightest unexpected "butterfly wing flap" will drastically change the characteristics of your program, very much unlike a bone.
A computer program is not an animal! Stop arguing by analogy! (ArgumentByAnalogy)
However, a robust and reliable computer system is like an animal in that it is a collection of individually brittle systems with backups. Each process is protected from the others, so while each individual program is quite brittle, the collection of these systems can be made robust and reliable.
I think your appeal on this subject is lacking perspective. Computer programs CAN and DO achieve a variety of behaviors that are very much analogous to "limping". Exponential backoff in TCP. Thrashing of vmem during large GCs or when too many apps are open. Continuing in the face of an unremedied exception. Undetected bit errors in UDP transport.
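For a concrete example of the first item, here is a bare-bones exponential backoff loop of the kind TCP and most network clients use; the send argument is a stand-in for whatever operation might fail:

    import random, time

    def send_with_backoff(send, max_attempts=5, base_delay=0.1):
        """Retry a flaky operation, doubling the wait (plus jitter) after each failure."""
        for attempt in range(max_attempts):
            try:
                return send()
            except IOError:
                if attempt == max_attempts - 1:
                    raise                      # out of patience: fail fast after limping
                delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                time.sleep(delay)              # degrade throughput instead of dying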
Computer programs are often composed of subsystems that achieve self-healing and backups, or that detect and recover after bit errors, or that can simply tolerate strange observations that occur as a result of errors. Sure, certain bit errors might result in critical and catastrophic breakdown or wild misbehavior. But animals and humans are not so different from computers in that regard. A heart palpitation or blood clot can kill a human or animal. A small breach or bit of damage to the spine can kill or paralyze permanently. One can choke on a small bone and die. Sometimes people even become allergic to themselves due to a tiny error by one t-cell somewhere in their body - a condition known as 'auto-immune disease', and reasonably common.
As far as the distinction between "computer program" and "computer system" goes - that's weak and artificial. At best you can distinguish between hardware and software, and otherwise within a specific software model (such as the Unix software model vs. the Win3.1 software model). Consider: ErlangLanguage processes are also protected from one another, but a program often consists of more than one process. The same is true for Unix processes, once you start working with PipesAndFilters. Programs can also be made robust, reliable, even resilient. Windows 3.1 software could and often did thrash the memory of other programs, and even later (Windows 95, 98, ME) it was quite common to see the "blue screen of death" - a point I make only to reinforce that your "multi-process" and "each process is protected from the others" distinctions are at best incomplete and narrow-minded. In the broad view, it's all one software system. In the broad view, that extends even to the Internet.
We're arguing over the validity of the analogy; I'm not sure I should continue. However, exponential backoff is not limping to me. That's par for the course. Limping, to me, is the result of an unexpected really bad thing (tm), not normal operation like exponential backoff. You seem to disagree with my assessment that computers are vastly different from animals. Computer programs and systems are many magnitudes more complex than animal body parts, and they lack enough similarity on the relevant properties. Any analogy from a bone to a complicated computer program is automatically bunk as the basis of an argument. (Again, see ArgumentByAnalogy.) That was my entire point. I'm not saying you should kill a process on every error. I'm merely noting that most of this page is BS. A computer cannot have a hairline fracture. Exponential backoff is not a hairline fracture. The animal is not built with a hairline fracture in mind to solve a real-world problem; a hairline fracture is never expected, nor part of the normal course of operation, while exponential backoff is an expected situation in the normal course of operation. Animals and computers (as they exist today, and as commonly programmed today) are sufficiently different to make any argument based on analogy entirely null and void.
I suspect you'll find most large long-lived animals, including humans, suffer hairline fractures in at least some bones during the normal course of their lifetimes. In that sense, these sorts of partial failures are "expected situations in the normal course of an animal's operation." That they can heal from these failures indicates that they are accounted for. Are you assuming that failures that are anticipated and accounted for don't qualify towards resilience and robustness of a system? That is, are you assuming resilience and robustness can't be achieved by design?
Sigh. Regardless, your conclusion is an utter non sequitur. The LimpVersusDie page doesn't actually contain an ArgumentByAnalogy. That is, it contains no argument of the form "X is true for animals, and programs/computers/whatever are like animals, therefore X is true for programs/computers/whatever". Since there is no ArgumentByAnalogy, what is it you've been railing against? Oh, yes - what you've actually been objecting to is the use of analogy to explain a property we might wish to achieve to a greater degree in computing. Explanation by analogy is not ArgumentByAnalogy. But even if this page did contain an ArgumentByAnalogy, your objection to the analogy is constructed of various invalid premises. For example, you just said "computer programs / systems are many magnitudes more complex than animal body parts", but any programmer with a comprehension of biology who has studied cells, DNA, ribosomes, enzymes, ATP, protein construction, etc. can tell you that even a single cell is more complex than many computer programs.
I suggest at least reading ArgumentByAnalogy. I'm not suggesting the page contains any literal argument by analogy a la Plato. What it does contain is explanation and reasoning rooted in an analogy with whose applicability I very much disagree.
I suggest you stop patronizing me by assuming I have not read that page. If you don't like an explanation, but you understood it, then you have no room for complaint. Explanations don't need to be "liked". The only purpose of an explanation is to help an audience achieve understanding.
Computer systems when faced with an unexpected problem generally either die asap or behave in a chaotic fashion.
That's simply untrue. I suspect your belief is rooted in confirmation bias (http://en.wikipedia.org/wiki/Confirmation_bias). Fact is, computer systems - when faced with unexpected problems - often behave quite well. But you don't notice when the computer system behaves well. You only notice when the computer system dies or behaves in a chaotic fashion. Therefore, you fail to notice when the computer system behaves well in the face of the unexpected problem. Indeed, you're unlikely to even know that the problem exists until after it is severe enough to muck things up.
Animals sometimes display this when presented with a problem - e.g., your DNA example: a small error has chaotic effects on the system. However, the limping analogy applies to the other aspect of animal systems: they tend to degrade in a linear way under unexpected or abnormal conditions. I firmly believe this is a property which computer systems do not have in the same way animals do. An injured limb degrades gracefully, whereas a computer program with a single bit twiddled, or a single error case unhandled, can behave quite chaotically. (Expected errors, however, can be dealt with gracefully.)
Your firm belief... is wrong. Computer systems often degrade gracefully under abnormal conditions. Animals and computers are quite similar in this regard: each can fail catastrophically under some failure conditions, each can degrade gracefully under others. Whether it's "expected" or not shouldn't be part of the analysis or analogy unless you wish to consider whether the failure modes of animals were subject to 'expectations' of evolution or intelligent design. Expectations certainly weren't part of the analogy you're railing against.
An analogy is useful only insofar as everyone understands and agrees with it. If someone disagrees with the analogy, do not persist with it. Persisting only brings more confusion and argument over the applicability of the analogy instead of discussion of the actual technical issue.
No, an analogy in explanation is useful if everyone merely understands it. Agreement is not required. It's hard to build an explanation or analogy that resonates with the audience. But even an utterly stupid, offensive, disagreeable analogy can go a long way towards establishing comprehension. Anyhow, if your wish is to focus on the actual technical issue, then your decision to focus on applicability of an analogy smells of hypocrisy. Choosing to not "persist with" the analogy is fine, but railing against it - especially after it has successfully served its purpose - is utterly pointless.
I also believe as a matter of empirical fact that it's "better" to make computer systems robust by making the individual components fail fast and brittle in the face of unexpected errors, like sanity check asserts of internal class invariants.
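A small sketch of what such a sanity-check assert can look like; the class and its invariant are invented for illustration:

    class Account:
        """Toy class whose internal invariant is 'the balance is never negative'."""

        def __init__(self, balance=0):
            self.balance = balance
            self._check()

        def withdraw(self, amount):
            self.balance -= amount
            self._check()    # fail fast, close to the bug, instead of limping onward

        def _check(self):
            # If this fires, the internal state is already wrong; dying here is cheaper
            # than letting the corruption spread and produce chaotic behavior later.
            assert self.balance >= 0, "invariant violated: negative balance"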
What does "individual" component mean when talking about computer systems? A bit? An expression? A function? A library? A module? A language object? An OS process? A full OS with all processes? A full machine (CPU + local memory)? A domain of connected machines on a LAN? Of what elements are "individual" components constructed? You need a proper definition for empirical analysis, which makes me doubt your assertions about empirical fact. Add that to the confirmation bias, described above.
I certainly agree that FailFast has its place. The LimpVersusDie page does not suggest the contrary; indeed, it opens by discussing a debate about which design results in a more robust system overall. But the ideal is not "robust the whole way through". Robustness itself is a means to an end, like reliability, predictability under composition, and survivability. Resilience (self-healing) can pick up where robustness is missing, and is somewhat more flexible in its ability to handle hardware failures. FailFast for predictable subsets of the system can improve resilience by making it easier for other parts of the system (including humans) to detect and repair or replace the failing subsystem. It can also help limit damage - like a fuse burning out to avoid damage to the finer electronics.
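The software analogue of that fuse is often called a circuit breaker; here is a minimal sketch, with an arbitrary failure threshold and invented names:

    class CircuitBreaker:
        """Stop calling a failing subsystem after too many consecutive errors."""

        def __init__(self, func, max_failures=3):
            self.func = func
            self.max_failures = max_failures
            self.failures = 0

        def call(self, *args, **kwargs):
            if self.failures >= self.max_failures:
                raise RuntimeError("circuit open: subsystem taken offline")  # the blown fuse
            try:
                result = self.func(*args, **kwargs)
            except Exception:
                self.failures += 1             # another strike against the subsystem
                raise
            self.failures = 0                  # success resets the count
            return result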
I think we're agreeing about how to write programs (mostly) but disagreeing over analogy. I'll leave it at that.
I'm only playing DevilsAdvocate on behalf of the analogy; I don't care for it one way or the other. My objection is with your objection to the analogy - in particular, that you cried fallacy where there was none, that you've repeatedly made untrue assertions (likely from cherry-picking) about the relative nature of animals and computers, that you've let your beliefs about what 'should' be shape your beliefs about what 'is', etc. If you had provided a valid objection to the analogy, I'd let it stand.
My problem with the analogy is and always has been that computer systems do not react gracefully to unexpected errors. Animals can.
The examples you're conveniently ignoring are things like "malformed input data causes interpretation failure and triggers a fallback" vs. "damage to a leg causes a blood clot which rushes to the brain and causes death". To be logical, you need to consider all the data, not just the data that supports your conclusion. There is a phrase for your behavior: cherry picking (http://en.wikipedia.org/wiki/Cherry_picking). It's an easy trap to fall into.
[However, one difference between animals and computer programs is that programs do not have physical parts that can wear out; the computer does. This is one reason some people object to the term "computer science", since computers are physical and programming is about "computing". Another difference between animals and programs is that we can directly control a program down to the most exact level - one can change every little bit of the program at the lowest layers. We cannot directly control a running animal; we can only influence it by teaching and training it. That may change when we are able to hack DNA, but DNA is still something we are tinkering with, not something we fully control or can create from thin air (programs can be created from thin air; animals cannot). A computer program can be changed without being "taught", whereas animals must be taught or influenced in order to change, and slowly. A program could be designed so that it could be taught, but that is not a requirement for changing its behavior - one could just edit the source directly. With animals, editing the source cannot easily be done unless we advance our knowledge of the DNA "source code" and figure out how to program it (which might be so dangerous that we'd better stick with computer programs and forget I even mentioned it). But it does raise the question of whether evolution and DNA have some programmer or source code behind them, even if it is all accidental and not a God.]
Cancer is an example of the horrible things that can happen when a system does not fail fast and instead decides to "limp". Cancerous cells are nothing but cells that are worn out (or "limping") after many divisions and/or damage by carcinogens, and that, instead of committing suicide like well-behaved cells, decide to live and reproduce without control. The only reason we believe biological systems are more resilient than our software systems is their extreme redundancy (a typical biological system has millions of cells): the system seems to heal, when in reality it "micro-dies" or "micro-fails-fast" and then "micro-replaces" the damaged elements with new ones. If defective elements fail to "micro-die", the whole system is compromised and eventually destroyed. The same applies to software: sometimes the best solution to a problem is just to fail and start over, instead of continuing to run in failure mode and eventually poisoning the whole system.
Although it's still a nascent field, from what I've read there are many protections against runaway cells, and many, if not most, division-related mutations are eventually dealt with. But some slip through the cracks. Too much complexity aimed at preventing all possible division-related mistakes may itself contribute to problems. It's kind of like government auditing: at what point does the cost of auditing exceed the savings from prevented problems? Biology doesn't "like" cancer and puts up a strong fight against it, including slowing metabolism and cell replenishment later in an animal's life - which is why Michael Jordan eventually had to retire. In the end, entropy wins. Past a certain point, biology decided it's best to let natural selection, instead of more self-auditing, "fix" the problem. Billions of species over billions of years couldn't find a work-around for aging, only compromises. I suspect software also has a similar complexity limit where the cost to manage the complexity exceeds the benefit of having high complexity, and alternatives such as dividing an app up are the better bet even if that costs some duplication. -t
I suggest reading The Greatest Show On Earth by Dawkins to dispel some myths you hold about evolution; I spotted no fewer than two flat-out inaccuracies in your description of evolution by natural selection. "In the end, entropy wins." Yes, the second law of thermodynamics is pretty solid. However, Dawkins makes a great insight (one of the things I loved most when reading the book): evolution by natural selection is the only known natural process (excluding intelligent design) that increases "information". (This is not quite the technical definition of information - a counter-example is a body radiating energy, which is decreasing its entropy and thus increasing its information as defined by information theory - so let's use a slightly more colloquial definition of information in this context.) I strongly disagree with your characterization that entropy wins in the end in this context - that evolution cannot find a way around a problem because "entropy always wins in the end" - which leads me to my second point. "Past a certain point, biology decided that it's best to let natural selection, instead of more self-auditing, "fix" the problem. Billions of species over billions of years couldn't find a work-around for aging, only compromises." I will forgive the personification of biology and evolution; it's standard prose and we all do it. However, do specifically note that there is no intelligent design behind evolution. No agent decides what's best for the organism. Your observation that they "couldn't find a work-around for aging, only compromises" misses the point of evolution. Evolution will not drive a species towards "ageless" or any other Platonic ideal. Instead, evolution drives a species towards being the best replicator, and so it proves pretty well that agelessness is not a good quality for a replicator. It does not prove that agelessness is impossible or even impractical.
"In the end" refers to the scope of an individual only, not life in general.
It varies with species. Mayflies are an excellent example of your thesis - one day full of sex, and poof! However, many reptiles continue to breed throughout their life (post sexual maturity), which contradicts your thesis. It probably depends on the amount of resources in the current environment - a world with a steady surplus means that the next generations will not be limited by their parents continuing to breed, while a resource-poor environment might favor parents dying after breeding.
Of course there are exceptions to the rule. But also note that long-living reptiles also tend to have very slow metabolism.
Of course. Evolution tends to find local maxima, as opposed to global ones. Some species are a solution to the problem in one fashion, and others in a different fashion.
As a simple example, perhaps it requires less complex DNA to have an individual reach sexual maturity quickly, but as a side effect that causes the individual to die of old age (local vs. global maxima). Another example might be that longer-living individuals slow down the rate of evolution: if your great-x20-grandfather were still alive, then your population is changing at a much slower rate than if your ancestors were already dead, and populations which evolve quicker tend to be better replicators. Perhaps there are other factors in the optimization problem which we don't know about. So my original point stands, based on my original two arguments: 1 - Evolution favors local, not global, maxima, so the lack of ageless individuals does not prove that it would be difficult to engineer such a thing. 2 - Evolution favors the better replicator, and I have yet to be convinced that ageless replicators are better. If anything, I have made a much more compelling argument in this paragraph that ageless replicators would be worse, not better.
Better replicator means more offspring, period. The factors affecting that are a) onset of sexual maturity, b) length of reproductive span, and c) success of offspring. "Rate of evolution" is meaningless - if the individuals continue to live, they are fit, and their genes are good, and therefore will persist in the population. We know that most animals live longer in captivity, and breed for longer as well, which shows that the genes for long life are not selected against, just less visible in the wild.
First, you are ignoring many other selection pressures besides "onset of sexual maturity, length of reproductive span, and success of offspring". The biggest glaring omission is sexual selection, aka being attractive to a mate. There's also in-group selection, which, depending on whom you talk to, matters / exists: helping closely related family members increases the chances of your genes being passed on - otherwise known as a potential source of kindness, compassion, teamwork, and self-sacrifice. (Optionally we can use Game Theory to explain some of that, but not self-sacrifice.) I might be missing a couple of other selective pressures, but I think I've made my point that it's not as simple as you make it out to be. Also, what do you mean that genes for long life are not selected against in the wild, when animals live longer after a couple of generations in captivity? That seems like a contradiction within a single sentence. If these traits don't manifest in the wild, but we can breed for them, then nature does select against them, at least to the degree that they're not as strong as they are under selective breeding. Selecting against doesn't mean the trait is bad in some Platonic sense; it means that in this particular case the trait is too expensive in the optimization equation of evolution, and it is not chosen. A classic example: why do some animals have renewing teeth and some don't? Surely renewing teeth are always better? No, apparently they are not. Non-renewing teeth may be a local-maximum problem or, more likely, an expense problem: renewing teeth require more calcium, for example, which may be better spent elsewhere, or they might raise the food-intake requirements, making that individual actually less successful. Either way, it is a complex thing that one cannot simply armchair-reason about. Finally, your pithy attack "rate of evolution is meaningless" is both ignorant and lacking content: you state your position as fact, as though that defeats my argument, without actually addressing it. There's a reason sexual reproduction is a bit more common among "higher order" organisms than asexual cloning. Sexual reproduction leads to sexual selection, which is almost always a negative selection pressure on the fitness of the species, and it's also much less efficient than simple asexual cloning. Why do so many animals use sexual reproduction, then? The increased rate of mixing of genes - the higher rate of evolution among species which reproduce sexually - allows them to evolve at a faster rate, and thus be more fit and successful.
[Let's not forget that having more offspring isn't necessarily better in the first place. Overpopulation can lead to a catastrophic drop in population.]
Again, this is a misunderstanding of evolution. Evolution by natural selection does not 'care' about Platonic or utilitarian ideals. It doesn't 'care' if the population dies. If some behavior or quality is self-destructive to the species but beneficial to an individual, then evolution will favor that quality, up until the very last member of the species dies. There is no such thing as a prudent predator. Lions and tigers and bears, oh my, will hunt themselves to extinction if allowed. If a better-hunting lion came along and killed more prey, then evolution would favor it, resulting in less prey, until eventually all the remaining lions would starve. Luckily, evolution also tends to make the prey more resilient, and generally, in large enough systems, the numbers of prey and predators go round in a cyclic relationship, so those better-hunting lions become fewer in number because less food is available, until the food stocks rebound in X years. Aka, a large system can handle small shocks.
Maybe it is just too complex a problem, even for something as capable of slowly creating amazingly complex systems as evolution is. Maybe the problem is the explosion of communication paths. Imagine a biological body as a team: the formula for communication paths says that if you have n people on your team, there are (n^2 - n)/2 connections. If we take each cell in our body as a team member, the number of communication paths is huge - too huge to deal with - so evolution attacks it the best way we know how: by creating specialized teams that know a lot about how a task is done, but do not really know what the whole body is doing (your liver cannot do what your brain does). But while that specialization and division of work makes it possible to create amazingly complex (and, for a time, apparently perfectly working) systems, it also means that communication is compromised, and some parts of the body are simply unable to prevent others (and themselves) from making small mistakes that eventually add up and end up destroying the whole organism. --LuxSpes
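To make that growth concrete (my numbers, just plugging into the formula above): for n = 10 team members there are (10^2 - 10)/2 = 45 paths; for n = 100 there are 4,950; for n = 10,000 there are roughly 50 million. The number of paths grows with the square of n while the number of parts grows only linearly, which is why the communication problem explodes.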
"Don't make copy errors" is not necessarily a system-wide problem or cause. What does the spleen have to do with copy errors in the big toe? A summary of possible reasons for the "allowance" of copy errors is as follows:
So maybe the death (failure) of individuals is the price of evolution? If a replicator achieves perfect copying, it stops evolving and is outcompeted by others, and so the advantages of never getting cancer are negated by the fact that it is now unable to evolve?
[Can we maybe talk about fault tolerance, please?]
See: FailFast, FaultTolerance, FailureIsInevitable, FaultIsolation, PersistentLanguage, GracefulDegradation, GatedCommunity, AssertionsAsDefensiveProgramming