Recovery Oriented Computing

The Recovery-Oriented Computing (ROC) project is a joint CalBerkeley/Stanford research project investigating novel techniques for building highly dependable Internet services.

See http://roc.cs.berkeley.edu/


From the web site (http://roc.cs.berkeley.edu/roc_overview.html):

The ROC approach takes the following three assumptions as its basic tenets:

These assumptions, while running counter to most existing work in dependable and fault-tolerant systems, are all strongly supported by field evidence from modern production Internet service environments.

These are similar to the assumptions of the ErlangLanguage designers, as described in the 1990 paper "Erlang - An Experimental Telephony Programming Language": http://www.ericsson.com/cslab/publications/iss90.ps


The third assumption, which to me reads as "user error is a major source of system failures," seems completely incorrect. I would say that programmer error (i.e., bugs) is a major source of system failures. Perhaps I'm misreading the statement, or maybe there are a lot fewer bugs in "modern production Internet service environments." -- JimShore

Anecdotal evidence: The system in use in my shop is incredibly buggy. However, the incidents severe enough that we have to shut it down and go back to paper while it's being serviced are quite often due to human error (I'd put it at about 75%): kicking out the server plug, managers incorrectly entering inventory for weeks at a time, and the like. -- WilliamUnderwood

Okay, I misinterpreted "failure" as "defect," when what they were talking about was something so severe that it means shutting the system down for a time. That makes more sense. -- JimShore


See also: FailFast, FaultTolerance


Last edited November 15, 2004.