Let It Crash

Positive view of a CoreDump.

See FailFast and DontCatchExceptions, maybe also ReportAndDie, but not LimpVersusDie.

This is also a coding philosophy of the ErlangLanguage: you don't need to program defensively. If a process hits an error it is terminated automatically, and the termination is reported to any processes that were monitoring it. In fact, defensive programming in Erlang is frowned upon.

This explains Erlang's popularity in the commercial world.

"Crash" is perhaps a bad term here. Processes terminate for various reasons, including the process intentionally exiting. But a process also terminates if there's an execution error, and the reason for the termination is reported. The monitoring process can then restart the process or take other action.

That seems only to shift the burden up a level, doesn't it? -- PeteHardie

It is hoped that it shifts the burden up to a level that can manage it intelligently.

One could as much hope that the intelligence could be built into the lower level and avoid the crash. If code is handling the crash, it's really not a crash. -- PeteHardie

Uh, no. Erlang/OTP always handles crashes - errors are orthogonal to stability (an Erlang/OTP system remains stable even if all of its worker processes crash all the time). Using exceptions and in-code 'error handling' adds complexity to the code and creates a lower bound below which you do not catch bugs. -- GordonGuthrie


There is an even more extreme idea, CrashOnlySoftware, which seems related to this.


Just by the way, this whole idea is completely anathema to those of us who create EmbeddedSystems. Crashing means a piece of equipment stops operating. In my case, it means a piece of factory automation, a scientific instrument, or, in the absolute worst case, a Class III patient intrusive medical device stops operating. No good, my man.

Errors need to be handled at the lowest level at which an informed decision can be made as to their disposition. Only when there is no recourse is the error passed all the way up to the user (Abort, Retry, Fail?). The idea of crashing as a problem handling method gives me the willies.


Having worked on successful safety-critical embedded systems for a decade, I disagree with the comment above. There is nothing wrong with following FailFast practices in embedded systems; I would even argue that it is a superior approach. This is not to say that you shouldn't handle errors at all. What we're talking about is the program crashing due to programming errors, in which case it probably isn't in a state where it can safely/reliably perform its function anyway. Ideally, you should not be shipping your embedded system with programming errors, and if you do, then there should be an outside monitor to handle this contingency. The point of FailFast is for the program to crash as soon as possible during testing, so that you can quickly catch, diagnose, and fix the errors before deployment. Of course, this assumes that you are thoroughly testing your code and achieving an appropriate level of statement and condition coverage for the application.

Uh, oh. Mission-critical applications simply cannot be allowed to die during operation just because something is outside of allowable parameters. Is such a failure acceptable in your automotive anti-lock braking system? How about your commercial aircraft approach vector mapper? Or how about that embedded defibrillator in your chest?

All of those systems can fail fast, so long as they also recover fast (and automatically) and do not enter a repeating failure-recovery loop.

There is no "fail fast" when the system simply must continue to operate. Error detection, trapping, and correction need to occur at the lowest possible level, as close to the source of error as can be had. The lowest level of decision making that can accommodate an error needs to do so in order to keep the system operational despite problems. So it goes.

There are many ways to keep a system operational despite problems. One could LetItCrash but have backup or fallback processes ready to go, for a form of GracefulDegradation. I've heard of helicopters with seven different kinds of AI cooperating; it wouldn't be a big deal if a couple of them crashed and recovered. There is more than one way to achieve robustness, resilience, and system survivability.

Ahh. Now we're getting somewhere. You are suggesting that a single subsystem in a redundant control scheme be allowed to crash whilst another, parallel backup takes over? Then, perhaps, the original error can be corrected in the primary system before the primary is allowed to regain control? That sounds like a plan.


What y'all are missing is that Erlang was designed to be fault-tolerant. To handle faults you need two computers: one to fail, and another to handle the failure. This is why Erlang lets processes crash and expects other processes to do the error recovery. It allows things like power outages, net splits, and machine hardware failures to be handled.
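A hedged sketch of what that looks like (the node and the 'my_worker' module are placeholders): a process on one machine spawns and monitors a worker on another; if the remote machine loses power or the network splits, the monitor fires with reason 'noconnection' and a local fallback can be started.

 -module(remote_watch).
 -export([watch/1]).

 %% Monitor a worker running on another Erlang node. If that node crashes,
 %% loses power, or is partitioned away, the monitor fires with reason
 %% 'noconnection' and a local fallback is started instead.
 %% 'my_worker:start/0' is a stand-in for the real worker entry point.
 watch(Node) ->
     Pid = spawn(Node, my_worker, start, []),
     Ref = erlang:monitor(process, Pid),
     receive
         {'DOWN', Ref, process, Pid, noconnection} ->
             io:format("lost contact with ~p; starting local fallback~n", [Node]),
             spawn(my_worker, start, []);
         {'DOWN', Ref, process, Pid, Reason} ->
             io:format("remote worker ended: ~p~n", [Reason])
     end.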

That's hot standby. It has nothing to do with the language of choice and everything to do with the layout of supporting hardware. If your system architecture calls for two or more processors (think Class IIB and above medical devices) then the choice of language is completely secondary to the structure of error handling. If you are running two CPUs then you're already planning ahead to handle a raft of failure modalities, not just programmatic ones.


I think some people are missing the point and attaching the wrong meaning to the word "crash", which isn't such a terrible concept in the Erlang/OTP world. In Erlang/OTP, a well-designed application is a hierarchy of supervisor and worker processes, with a top-level supervisor process. (A "process" here is an Erlang process, not an OS process or thread.)

Supervisors start workers and lower-level supervisors and, should a worker crash, can be configured to restart the worker (or some subset of its subordinate processes). This relieves the worker process of having to detect and handle every single potential error condition. It handles what makes sense, and allows itself to terminate with an error ("crash") in other cases.

The philosophy is not just to write code without error checks and expect higher-level supervisors to save the day; it is to avoid defensive coding, namely to detect and correct what is appropriate, and allow other errors to cause process termination and propagate up to the supervisor. If the supervision hierarchy is well-designed, you can end up with an extremely fault-tolerant system. That's in addition to other fault-tolerance techniques such as hardware redundancy.
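For illustration, a minimal OTP supervisor along those lines (the child module 'worker' and its start_link/0 are assumed to exist elsewhere): the supervisor restarts any child that crashes, and the intensity/period limits mean a child that keeps crashing eventually takes the supervisor itself down, escalating the failure instead of looping forever.

 -module(worker_sup).
 -behaviour(supervisor).
 -export([start_link/0, init/1]).

 start_link() ->
     supervisor:start_link({local, ?MODULE}, ?MODULE, []).

 init([]) ->
     %% one_for_one: restart only the child that crashed.
     %% intensity/period: allow at most 5 restarts within 10 seconds before
     %% giving up and escalating to this supervisor's own supervisor, which
     %% prevents an endless crash-and-restart loop.
     SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
     ChildSpecs = [#{id => worker,
                     start => {worker, start_link, []},   %% assumed worker module
                     restart => permanent}],
     {ok, {SupFlags, ChildSpecs}}.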

