Recovery Oriented Computing

The Recovery-Oriented Computing (ROC) project is a joint CalBerkeley/Stanford research project that is investigating novel techniques for building highly-dependable Internet services.

See http://roc.cs.berkeley.edu/

From the web site (http://roc.cs.berkeley.edu/roc_overview.html):

The ROC approach takes the following three assumptions as its basic tenets:

failure rates of both software and hardware are non-negligible and increasing
systems cannot be completely modeled for reliability analysis, and thus their failure modes cannot be predicted in advance
human error by system operators and during system maintenance is a major source of system failures

These assumptions, while running counter to most existing work in dependable and fault-tolerant systems, are all strongly supported by field evidence from modern production Internet service environments.

These are similar to the assumptions of the ErlangLanguage designers, as described in the 1990 paper Erlang - An Experimental Telephony Programming Language, http://www.ericsson.com/cslab/publications/iss90.ps

The third assumption, which to me reads as "user error is a major source of system failures" seems completely incorrect. I would say that programmer error is a major source of system failures... i.e., bugs. Perhaps I'm misreading the statement, or maybe there's a lot fewer bugs in "modern production Internet service environments." -- JimShore

Anecdotal evidence: The system in use in my shop is incredibly buggy. However, the incidents which are severe enough that we have to shut it down and go back to paper while it's being serviced are quite often due to human error (I'd put it at about 75%): stuff like kicking out the server plug, managers incorrectly entering inventory for weeks at a time, stuff like that. -- WilliamUnderwood

Okay, I misinterpreted "failure" as "defect," when what they were talking about was something so severe that it means shutting the system down for a time. That makes more sense. -- JimShore