Cars Break Sitting In The Garage

[Note: As of May 2011 this page is pretty useless. With server virtualization and such things the use of hardware and load leveling is in a whole new ball park.]

Unless you're like me and have abandoned VMs as largely useless technology after declaring "The old is better." --JoshuaHudson


If you want your hot-spare server (car) to be useful (start when you turn the key), you must fail-over to it (start the car) on a scheduled basis. Otherwise it is guaranteed not to run when you need it the most.

What this really means is that if you don't use it, you don't know that it works. If you don't use it regularly, you won't have the incentive to develop foolproof ways to keep your active server and backup server synchronized, or to make it easy to direct the rest of the system to the active server. Over time, the unused fail-over server falls out of sync with the active server, and people forget how to do a failover.

The best way to work a hot-spare server is not to call your servers "main" and "hot spare." Instead, call them "A" and "B", and designate that at any given time, one of them will be "prime" and the other will be "backup." Now switch which one is prime and which one is backup on a regular basis (such as weekly). The "A" and "B" names are important -- if you call one of them "hot spare," some fool will decide that you shouldn't run on it for a long period of time, and one of the computers (cars) will soon end up parked in the garage again, becoming untrustworthy (leaking oil from dry seals) and falling out of sync (getting flat spots on the tires).
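A minimal sketch of the weekly rotation described above, in Python. The function name, the start date, and the seven-day period are all assumptions made for illustration; nothing here comes from a real system. The point is only that "prime" and "backup" are computed roles, not properties of a particular machine.

```python
from datetime import date

# Neutral names: neither server is permanently "the spare".
SERVERS = ("A", "B")


def roles_for(day: date, start: date = date(2011, 5, 2)) -> dict:
    """Return which server is prime and which is backup on a given day.

    Roles swap every 7 days, counted from a hypothetical start date.
    """
    weeks = (day - start).days // 7
    prime = SERVERS[weeks % 2]
    backup = SERVERS[(weeks + 1) % 2]
    return {"prime": prime, "backup": backup}


if __name__ == "__main__":
    print(roles_for(date.today()))
```

A scheduled job could call something like roles_for once a day and trigger the failover procedure whenever the answer changes, so the switch is routine rather than an emergency.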

According to WayneConrad, a distributed control system that controls the pipeline supplying water to San Jose, California has been working with no unscheduled downtime for over 10 years. The failover was used routinely, and worked every time. The reason? Failovers were defined as a normal system operation and implemented accordingly.


Yes they do. If you have a classic, get it out and drive it. I hate the idea of garage queens. When I strike it rich, I am going to buy a Porsche 959 and drive it every day, as it was meant to be driven.


Similarly, if you want to be able to restore from backups when you really need to, you have to run through restore scenarios regularly and make sure that you can get back what you need. Otherwise you're almost certain to find (at the worst possible time) that the backups that have apparently been running flawlessly for three years haven't really been running after all. Somebody added something to the system without adding it to the backup schedule (and proceeded to put the most critical data there). Or the schedule that was good enough five years ago no longer fits your needs, so you can't roll back to the point in time that you need. Or something was misconfigured and each backup overwrote the last. Or you have to restore the entire system to get one file.
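One way to make a restore drill concrete is to restore into a scratch directory and compare it against the live data. The sketch below assumes both trees are reachable as local directories; the names verify_restore and file_digest are made up for this example, and a real drill would also have to allow for files that changed after the backup was taken.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(source: Path, restored: Path) -> dict:
    """Compare every file under source with its counterpart under restored.

    Reports files that never made it into the restore and files whose
    contents differ from the source.
    """
    missing, mismatched = [], []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = restored / rel
        if not dst.is_file():
            missing.append(str(rel))
        elif file_digest(src) != file_digest(dst):
            mismatched.append(str(rel))
    return {"missing": missing, "mismatched": mismatched}
```

An empty "missing" list is the check that would catch the "somebody added something without adding it to the backup schedule" case; an empty "mismatched" list would catch stale or overwritten backups.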

Untested contingencies give a false sense of security. If you haven't had any fire drills, you might find out too late that the emergency exit doors were locked two years ago by someone who didn't know any better and is now long gone.


This is a good example of "SuccessOrientation".


Last edited November 10, 2014