Crash Only Software

Crash-only software was described in a nice paper in the 9th Workshop on Hot Topics in Operating Systems (HotOS-IX), May 2003;

: Crash-only programs crash safely and recover quickly. There is only one way to stop such software—by crashing it—and only one way to bring it up—by initiating recovery. Crash-only systems are built from crash-only components, and the use of transparent component-level retries hides intra-system component crashes from end users.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.9585&rep=rep1&type=pdf

I've only seen Crash-only computing described in the paper above. There are no references in the Wiki. The idea of only having one code path for starting and one for stopping seems like a really good example of OnceAndOnlyOnce.

But:

has anyone really tried this? (1)
aren't "recursive microreboots" a truly horrible idea?

Crash only computing seems a bit similar to Erlang's "LetItCrash".

(1) I used Crash-Only design to coordinate distributed workers and manager in a continuous integration test farm at a Large Well Known Search Engine Company. Setting time-outs correctly was a tuning point for the system. - JeffreyMiller

I've seen transactional databases essentially run in this way. Bring it up and it only goes down on a crash, whereupon you recover from the logs. There was a procedure for a clean shutdown, but it never got used. And the clean startup only got used once.

Please note that allowing a system to crash may be acceptable in handling data transactions for, say, DVD rentals out of a Redbox, but this is not an acceptable behavior for, say, a medication delivery cabinet. Let's set some limits here, folk.

This is how the Universe is coded

CategoryReallyReallyBadIdea