Systems Engineering and Redundancy

I posted this to ServerFault.com today. Folks lose sight of requirements and systems engineering, and it drives me batty. Here was my response to the guy's question about redundancy and RAID/COOP/etc.

——–

Every design and architecture should be requirements driven. Good systems engineering calls for defining the constraints of the design and implementing a solution that meets them. If you have an SLA with your customers that calls for 99.999% availability, then your N+N redundancy solution should account for every LRU (line-replaceable unit) that could fail. RAID, power supplies (PS), and COOP planning should all account for that. In addition, your SLAs with vendors should be of the 4-hour-response type, or you should keep a large number of spares on site.
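To put a number on what an availability target in an SLA actually allows, here is a minimal Python sketch that converts a target into a yearly downtime budget. The function name and the 365.25-day year are my own assumptions for illustration; a real SLA will define its own measurement window and exclusions.

```python
# Hypothetical helper: convert an availability target into allowed downtime.
# Assumes a 365.25-day year and that all downtime counts against the SLA.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Return the minutes of downtime per year allowed by the target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999, 0.99999):
        minutes = downtime_budget_minutes(target)
        print(f"{target:.5f} -> {minutes:10.2f} minutes/year")
```

Under these assumptions, five nines leaves roughly 5.26 minutes of downtime per year, which is why a 4-hour vendor response alone cannot meet it without redundancy or onsite spares.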

Availability (Ao from here on) is the study of exactly that. If you are doing all these things because it seems like the right thing to do, then you are wasting your time and your customer's money. If pressed, everyone would want five nines, but few can afford it. Have an honest discussion about the availability of the data and the system in light of what it costs.
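As a rough illustration of how Ao and redundancy interact, the sketch below uses the common textbook forms: Ao as MTBF / (MTBF + mean downtime), availability of units in series as a product, and availability of redundant copies as one minus the chance that every copy is down. The function names and the sample figures are hypothetical, and the redundancy formula assumes failures are independent, which shared power, shared firmware, or a shared building can easily violate.

```python
from math import prod


def operational_availability(mtbf_hours: float, mdt_hours: float) -> float:
    """Ao = MTBF / (MTBF + mean downtime); downtime includes waiting on parts."""
    return mtbf_hours / (mtbf_hours + mdt_hours)


def series(availabilities) -> float:
    """Every unit must be up, so the availabilities multiply."""
    return prod(availabilities)


def redundant(unit_availability: float, copies: int) -> float:
    """Up if at least one copy is up. ASSUMES independent failures."""
    return 1.0 - (1.0 - unit_availability) ** copies


if __name__ == "__main__":
    # Hypothetical LRU: fails every 10,000 hours on average, 8 hours to restore.
    single = operational_availability(10_000, 8)
    print(f"single unit:  {single:.6f}")
    print(f"two copies:   {redundant(single, 2):.8f}")
    # A chain of three such units with no redundancy is worse than any one unit.
    print(f"three in row: {series([single] * 3):.6f}")
```

The point of running numbers like these is to compare the cost of each extra copy or faster vendor response against the availability the requirement actually demands.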

The questions and answers posed thus far do not take the requirements into account. The thread assumes that N+N redundancy with hardware and policies is the key. Rather, I would say let the requirements from your customers and your SLA drive the design. Maybe your mom’s flat and your old laptop will suffice.

We geeks sometimes go looking for a problem just so we can implement a cool solution.

Updated: