The Recovery-Oriented Computing (ROC) project is a joint CalBerkeley/Stanford research project that is investigating novel techniques for building highly-dependable Internet services.
See http://roc.cs.berkeley.edu/
From the web site (http://roc.cs.berkeley.edu/roc_overview.html):
The ROC approach takes the following three assumptions as its basic tenets:
These are similar to the assumptions of the ErlangLanguage designers, as described in the 1990 paper Erlang - An Experimental Telephony Programming Language, http://www.ericsson.com/cslab/publications/iss90.ps
The third assumption, which to me reads as "user error is a major source of system failures" seems completely incorrect. I would say that programmer error is a major source of system failures... i.e., bugs. Perhaps I'm misreading the statement, or maybe there's a lot fewer bugs in "modern production Internet service environments." -- JimShore
Anecdotal evidence: The system in use in my shop is incredibly buggy. However, the incidents which are severe enough that we have to shut it down and go back to paper while it's being serviced are quite often due to human error (I'd put it at about 75%): stuff like kicking out the server plug, managers incorrectly entering inventory for weeks at a time, stuff like that. -- WilliamUnderwood
Okay, I misinterpreted "failure" as "defect," when what they were talking about was something so severe that it means shutting the system down for a time. That makes more sense. -- JimShore
See also: FailFast, FaultTolerance