This morning I woke up thinking that if I were to write a system emulator like PearPC or Bochs, I'd make it emulate reliable hardware like the Tandem NonStop architecture. I only have a passing familiarity with it, but those systems are designed so that they can continue operating in the event of a single failure. For example, much like hard drive RAID-1, multiple CPUs can be run in lock-step and the results checked. In the event of a discrepancy, the CPU cluster will be taken offline and the application restarted at its last checkpoint on another CPU.
I did some searching around on Tandem and found an article titled A Census of Tandem System Availability Between 1985 and 1990. Being a compulsive geek who has always been interested in resilient computing, I had to read it.
One of the things mentioned in the report is that over the 5-year period from 1985 to 1990, hard drive failure rates dropped significantly from roughly one failure per drive per year. That, combined with much simpler interconnects to the hard drives, had a huge impact on outages. At the earlier failure rate, a system with 100 discs would on average require disc maintenance about twice per week.
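Those two figures are consistent with each other. A quick back-of-the-envelope check, assuming the roughly one-failure-per-drive-per-year rate mentioned above:

```python
# Sanity check of the disc maintenance figure: 100 drives, each
# failing about once per year (the pre-1990 rate from the census).
drives = 100
failures_per_drive_per_year = 1.0
weeks_per_year = 52

events_per_week = drives * failures_per_drive_per_year / weeks_per_year
print(round(events_per_week, 1))  # about 1.9 -- roughly twice per week
```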
While hard drives were the biggest contributor to hardware down-time, most other hardware got significantly more reliable during this time as well. In fact, they say that the majority of hardware problems related to the newer hardware were firmware-related. In other words, software.
It's now 15 years since the end of the period that study covered, and we now have commodity hardware that is incredibly reliable. We offer managed hosting services, and have racks of computers that hardly ever fail. Some of our equipment is 3 years old or older, and even that hardware rarely has problems. We've been able to select hardware which just runs and runs. Of course, there wasn't much choice for us: besides upsetting customers when hardware goes down, failures require maintenance from us. If we used less reliable hardware, we'd have higher costs not only in hardware replacement, but also in maintenance and operation.
I'm not sure that hard drives have increased in reliability over the last 15 years. It seems like about 5 years is the sweet spot for the lifetime of a hard drive. Of course, during this time drives have gotten faster, cheaper, and higher in capacity, but reliability has stayed about the same.
As hardware got more reliable, software, operations, and facilities problems became larger contributors to outages. In fact, software outages tended to increase, because of the use of more complex software. I won't harp on software outages, though – we all know that software isn't built to the standards of most hardware, for a variety of reasons.
It was very interesting to hear of the significance of operations problems, though. It's something we've found with High Availability deployments we've done. An HA cluster can continue operating through certain problems, or at least return to service quickly and without manual intervention. That's its biggest benefit. If the power supply in one machine goes down, in seconds or minutes your applications can be back up and running on the spare.
This is why I like to think of services like Heartbeat as “downtime shifting” as opposed to thinking in terms of increasing availability. A cluster is a much more complicated system to set up, maintain, and operate than a single system. Because of the increased complexity, you really need to plan to test the fail-over after various kinds of maintenance to ensure that the fail-over still works.
While we often recommend reboot testing to ensure that a system comes back up unattended after a power outage or maintenance task, the instances where you should do fail-over testing of a cluster are much more frequent than when you should do a reboot test of a single server. So in that way, running a cluster can lead to more scheduled outages. The trick is that scheduled outages don't tend to be counted against system availability, even though they may have the same result – users can't access the system.
A cluster's biggest benefit is that it reduces the down-time associated with unscheduled outages. Even with a 1 hour average resolution time for hardware problems, an HA cluster has it beat hands down. Depending on the applications involved, a cluster fail-over may take tens of seconds to a few minutes to complete. Usually less time than a normal reboot would have taken.
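To put rough numbers on that, here's a quick comparison. The failure rate and the repair and fail-over times below are illustrative assumptions of mine, not figures from the Tandem census:

```python
# Illustrative annual unscheduled down-time: single server vs. cluster.
# Assumptions (mine): two hardware failures per year, 1 hour average
# resolution on a standalone server, 2 minutes per cluster fail-over.
failures_per_year = 2

single_server_downtime_min = failures_per_year * 60  # 1 hour per failure
cluster_downtime_min = failures_per_year * 2         # 2 min per fail-over

print(single_server_downtime_min)  # 120 minutes of down-time per year
print(cluster_downtime_min)        # 4 minutes of down-time per year
```

Under those assumptions the cluster cuts unscheduled down-time by a factor of 30 – though, as noted above, the scheduled fail-over testing a cluster demands eats into that advantage.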
Tandem found that sometimes the increased operator complexity resulted in outages. For example, an operator tries restarting a failed CPU set after a replacement was done, but instead does the restart on the CPUs which are running the application. More complex systems have more places where they can go wrong. On a Linux cluster, you have to realize that some operations, such as firewall rule modifications, killing processes, and reboots of networking gear, can lead to fail-overs or split-brain operation. In general, Heartbeat isn't too bad in that respect, but it's still one extra thing that has to be accounted for.
There are procedures that can be followed to reduce the impact of mistakes, of course. For example, last week when we had 2 different RAID arrays on our managed hosting machines die, before doing the RAID recovery onto a new disc we made sure we had a good backup of the latest data. We intentionally ran the system in single-user mode for the duration of the backup, increasing down-time in exchange for reducing the impact of a number of problems that might have occurred during recovery. On this system it was an easy decision to make, since the system is a development server and is rarely used after hours – the impact of the down-time was minimal.
The other RAID failure shows a place where a cluster shines, though. In this case, the RAID failure occurred on the secondary machine in the cluster. Unlike the RAID recovery above, this system had serious problems that eventually left it non-bootable. I resigned myself to having to do a full recovery of the system and decided to just pull the machine and take it home to spend some quality time on it. It wasn't urgent because the application was up on the other system.
I ended up being able to manually recover the RAID array without having to do a full re-load. I re-created the RAID arrays and the system just happened to mirror from the drive with the correct data on it to the new drive, but it just as easily could have been the other way around. So in this case the system was down for around 4 hours, but the application wasn't down at all during this time.
Another case study is the routers at our server facility. We run a pair of routers in a hot/hot configuration. Both routers are running in exactly the same way, but through a few tricks incoming and outgoing traffic only tends to go to one of the machines at a time. In the event of a hardware failure on the primary, more powerful router, traffic will automatically switch to the secondary with no intervention.
This morning I decided to do a software upgrade on the secondary router, which required a reboot. Because of the cluster setup, there was no down-time involved. I'm going to let it run a few days and then upgrade the primary as well. That one also will result in little if any noticeable down-time because of the cluster.
The way the system is configured, a scheduled shut down of the primary router results in no noticeable outage as the switch over occurs. The primary continues routing packets as normal, but notifies our up-stream ISPs to start sending traffic to the other router. At the same time, local machines are notified to start sending outgoing traffic to the other router. Only then is routing shut down.
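The ordering is the important part: traffic is drained in both directions before routing stops. A minimal sketch of that sequence – the function names and messages are hypothetical placeholders, not the actual mechanisms on our routers:

```python
# Sketch of a graceful router shutdown: drain traffic first, stop last.
# All three steps are hypothetical stand-ins for the real mechanisms
# (e.g. withdrawing routes from upstreams, redirecting local hosts).

def notify_upstreams(log):
    log.append("upstreams send inbound traffic to the secondary")

def notify_local_hosts(log):
    log.append("local machines send outbound traffic to the secondary")

def stop_routing(log):
    log.append("routing stopped on the primary")

def graceful_shutdown():
    log = []
    notify_upstreams(log)    # drain inbound traffic first
    notify_local_hosts(log)  # then drain outbound traffic
    stop_routing(log)        # only then stop routing
    return log

for step in graceful_shutdown():
    print(step)
```

Reversing that order would drop packets during the switch-over; doing it this way, the primary keeps forwarding until nothing is being sent to it.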
In this way, the cluster is a huge win, not only in the event of a hardware failure on the primary router, but also for maintenance. However, it's very important that the higher operational complexity be weighed against the benefits of a cluster before deciding if it makes sense to go that route. That was true in 1990, just as it's true now.