Over the last few weeks I've been fairly quiet on the journal front. It's been quite busy with the Big Move™. I'm pleased to announce that as of 3am this morning, the tummy.com hosting division has completed the move to the new facility. Things went incredibly smoothly, with only one system experiencing downtime beyond what was planned for.
It was, as you can imagine, quite a big project. The move involved all sorts of routing changes, building of an appropriate power and networking infrastructure, and more.
Even when you pick an existing facility to put equipment in, there's quite a lot of architecture that has to go on around that choice to ensure that you are getting the most out of it. For example, our facility is totally five nines (99.999%), but if power is fed from only one PDU over one circuit and one breaker, you won't get five nines out of the power system. Ditto for the network: if you aren't using two connections to two upstream routers with dynamic routing, you won't get five nines.
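To put rough numbers on that, here is a minimal sketch of the availability arithmetic. It assumes failures are independent, and the 99.99% figure for a single power feed is a made-up number for illustration, not a measurement from our facility. Components that must all be up multiply their availabilities; redundant components multiply their unavailabilities.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def series(*availabilities):
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)


def redundant(*availabilities):
    """Any one component is enough, so the unavailabilities multiply."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def downtime_minutes_per_year(availability):
    """Expected downtime per (non-leap) year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Hypothetical figures, for illustration only.
facility = 0.99999  # the "five nines" facility
feed = 0.9999       # one PDU + circuit + breaker path (assumed value)

single_feed = series(facility, feed)
dual_feed = series(facility, redundant(feed, feed))

if __name__ == "__main__":
    for label, a in [("single feed", single_feed), ("dual feed", dual_feed)]:
        print(f"{label}: {a:.7f} "
              f"({downtime_minutes_per_year(a):.1f} min/year of downtime)")
```

Under these assumptions the single feed drags the whole system well below five nines, while two independent feeds bring it back to roughly what the facility itself provides.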
Even things as simple as cable management play a part. Using the wrong cables can significantly impact the availability of the facility. Messy cables can result in unintended outages when the wrong cable is pulled or a cable is snagged while other work is being done.
Of course, part of this complexity is simply because we want a facility that can live up to hosting High Availability clusters. We are set up so that single dedicated systems basically only have a single point of failure: the power to an individual cabinet (which is fed from two different PDUs). HA clusters have absolutely no single point of failure, because the nodes exist in different cabinets. A failure would require some combination of 3 generators, 3 UPSs, 2 electric substations, 4 circuits, 2 PDUs, or two routers/switches/network circuits. Or, two backhoe accidents on different sides of the building…
But, I digress… As you can probably tell, I'm pretty happy with our new hosting facility. I'm not easy to impress, either. I used to work in bomb-proof facilities with federally mandated availability requirements for the Emergency 911 system.
Lots of planning and testing went into making the move fairly uneventful. One of the tricks we used was a long-haul bridge between the network at the new facility and the one at the old facility. Because we have portable IP address space, we didn't have to renumber as part of the move. However, in order to be able to do the move incrementally, in small, testable chunks with a clear back-out procedure, we needed to have machines work with the same IP space in either location.
A long-haul bridge over a tunnel allowed this to happen. We could literally take a machine down at one location, move it to the new location and bring it up, with absolutely no network changes. Tricks could be done with static routing, of course, but that requires a fair bit of work and testing for every machine moved. When you're moving a whole pile of machines, the bridging is a clear and significant win. Taking the whole network down and moving it just isn't an option when you have more than a few machines.
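To give a feel for the per-machine work the static routing approach implies, here is a small sketch that generates one host route per moved machine. The addresses are documentation-range placeholders and the gateway is a hypothetical tunnel endpoint, not our real configuration; a real setup would need more than these routes. With the bridge, the equivalent list is empty.

```python
import ipaddress


def host_routes(moved_hosts, tunnel_gateway):
    """Return one 'ip route add' command (a /32 host route) per moved host.

    moved_hosts and tunnel_gateway are illustrative inputs; the addresses
    are validated with the standard ipaddress module.
    """
    gateway = ipaddress.ip_address(tunnel_gateway)
    commands = []
    for host in moved_hosts:
        addr = ipaddress.ip_address(host)
        commands.append(f"ip route add {addr}/32 via {gateway}")
    return commands


if __name__ == "__main__":
    # Placeholder addresses from the documentation ranges (RFC 5737).
    moved = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
    for command in host_routes(moved, "198.51.100.1"):
        print(command)
```

Every machine moved adds another route to create, test, and later remove, which is exactly the bookkeeping the bridge let us skip.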
So, what about the one machine that had problems? The problems were totally unrelated to the move. It was not running the stock kernel, because the root file-system was JFS and the stock kernel did not support that for the root file-system. The kernel on the system had worked in the past, but had also sporadically produced problems after disabling SELinux. Scott came up with the fix that ended up working: copy the file-system off to the backup host, re-format the root file-system as ext3, and copy the file-system back. Then switch to the stock kernel.
In conclusion, I can say that if you are given enough time to do design, architecture, and testing, you can successfully move racks and racks of gear with little or no trouble.