Crash Only Software

Crash-only software was described in a nice paper in the 9th Workshop on Hot Topics in Operating Systems (HotOS-IX), May 2003;

Crash-only programs crash safely and recover quickly. There is only one way to stop such software—by crashing it—and only one way to bring it up—by initiating recovery. Crash-only systems are built from crash-only components, and the use of transparent component-level retries hides intra-system component crashes from end users.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.9585&rep=rep1&type=pdf


I've only seen Crash-only computing described in the paper above. There are no references in the Wiki. The idea of only having one code path for starting and one for stopping seems like a really good example of OnceAndOnlyOnce.

But:


Crash only computing seems a bit similar to Erlang's "LetItCrash".

(1) I used Crash-Only design to coordinate distributed workers and manager in a continuous integration test farm at a Large Well Known Search Engine Company. Setting time-outs correctly was a tuning point for the system. - JeffreyMiller


I've seen transactional databases essentially run in this way. Bring it up and it only goes down on a crash, whereupon you recover from the logs. There was a procedure for a clean shutdown, but it never got used. And the clean startup only got used once.


Please note that allowing a system to crash may be acceptable in handling data transactions for, say, DVD rentals out of a Redbox, but this is not an acceptable behavior for, say, a medication delivery cabinet. Let's set some limits here, folk.


This is how the Universe is coded


CategoryReallyReallyBadIdea


EditText of this page (last edited March 17, 2014) or FindPage with title or text search