The System as Organism

Building fault-tolerant, high-availability (FTHA) systems has a dark side: an engineering trade-off between reliability and predictability. This should be obvious to anyone who has studied biological organisms. I define an organism as a complex system partitioned into highly specialized and tightly interrelated parts. This can describe a cell, a human being, or an FTHA system, whether it supports an e-commerce site, vital infrastructure control, process control, or command and control.

Every time a system is designed with another layer of virtualization, failover, clustering, parallelization, etc., the system becomes that much harder to diagnose when trouble does strike. A visit to the doctor will tell you as much, as will the history of physical failover systems like dikes and levees. Construct the most secure, reliable system you want, and you will create an even bigger Black Swan scenario: one that shouldn't have happened, couldn't have happened, and yet did or will.

Therefore, designing for manageable degradation is much more predictable over the long term than the vain attempt to create a perfect system that never crashes, except of course when it does, and it usually will at the worst possible moment.