Last weekend I upgraded my laptop to Fedora Core 3, and switched from JFS to XFS. I've found it may not be the best file-system for a laptop. In other news, it's been an interesting week for software RAID under Linux. You know, in that “Chinese Curse” sort of way.
I've noticed over the last week that XFS seems to hold a lot of dirty buffers without flushing them for a long time, even when the disk subsystem (and indeed the whole system) is otherwise idle. I also haven't been able to find a way to enable data journaling like you can in ext3.
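For comparison, ext3 exposes full data journaling as a mount option, while XFS only journals metadata with no equivalent knob that I can find. A sketch of what that looks like in fstab (the device and mount point are just examples):

```
# /etc/fstab: ext3 (unlike XFS) can journal file data as well as
# metadata, via the data=journal mount option
/dev/hda2  /home  ext3  data=journal  0  2
```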
The time I really noticed it was this weekend, while we were out and about. I had to use my CDMA card for net access because the net at the Saxby's coffee shop was down, probably because their upstream is wireless and the snow had whacked the equipment either on the roof or at the head end. Anyway, I had built and installed a new xchat RPM, and was in the process of copying down a new software suspend kernel from Kevin. This was scheduled to take about an hour, because I was only getting around 10KB/sec (not bad for CDMA).
So, I left the system running, put it in my bag, and we went and did some shopping. After the shopping, we decided to try out another coffee shop we had noticed. This one had wireless, so I decided to use it. I popped out the CDMA card, and the system froze. This was about an hour after I had built the xchat RPM.
On reboot, I found that the xchat files, while present, were all messed up. The RPMs and installed binaries were toast. Oddly, my .spec file and the patch I made for the xchat RPM were both OK. Of course, the kernel image I'd spent an hour downloading was also toast. However, I now had a faster net connection, and was able to download the new kernel in 15 minutes.
I've noticed that “sync”ing (the first part of a software suspend) will often take a very long time: 5, 10, even 15 seconds on a machine that's basically idle. I'm thinking this may have something to do with my RPMs getting messed up.
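A quick way to see how much dirty data the kernel is sitting on is to time a manual flush; on a truly idle system this should return almost instantly, so multi-second times point to a big backlog of unwritten buffers:

```shell
# Time a full flush of dirty buffers to disk. "sync" does not
# return until writeback completes, so the elapsed time is a
# rough measure of how much data the filesystem was holding back.
time sync
```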
In other news… In our hosting facility, we have made use of both software and hardware RAID. In general the software RAID has worked really well, as long as you set up processes to notify you when the RAID array loses a drive. Earlier this week we got a report that one of the systems we host had a drive drop out of the array. I decided to try re-adding the drive to the array and see if it dropped out again. It did within a day.
We scheduled some downtime and replaced the drive. One of the disadvantages of software RAID is that it usually isn't hot-swappable (depending on the card, the cabling, and the carriers): the software can handle it, it's the hardware that can't. We did a backup, swapped the drive, and used “raidhotadd” to add the new partitions back into the array. Everything was happy. Lastly, we updated the boot blocks so that if the primary drive failed we could still boot.
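For the record, a degraded array is easy to spot in /proc/mdstat, and the re-add is a raidtools one-liner. The md and partition device names below are examples, not the real ones from that box:

```shell
# A failed member shows up as "_" inside the status brackets in
# /proc/mdstat, e.g. "[U_]" instead of "[UU]":
grep -E '\[U*_+U*\]' /proc/mdstat && echo "degraded array found"

# Re-add the replacement partition to the degraded array
# (raidtools syntax; mdadm would be "mdadm /dev/md0 --add ..."):
raidhotadd /dev/md0 /dev/hdc1
```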
Today I had some other maintenance scheduled. This involved a lot of stuff, including adding a network card, rebooting into a new kernel, and upgrading pretty much everything in the system except for the hard drives. First I decided to just reboot and see if the new kernel was happy. It was most decidedly not. It looked like one of the drives had failed, and it seemed to hang while bringing up the array.
At this point I decided to try adding the new drive, but that just made things worse. Now it couldn't mount the root partition on /dev/md* at all. I finally gave up and decided to re-install it. We have good backups of these machines, and it's a standby node in a cluster so it wasn't a huge deal. I could mount up the root partition on the underlying /dev/hda, but /dev/md was not happy.
I decided to partition the new drive (copying the current partition table over to the new drive, actually), then force a re-initialization of the RAID sub-system. I let the RAID sync finish, and decided to check the partition again. It seemed to have all the data (so I was lucky and it synced the right way). I decided to try a reboot, and the system came up fine.
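The partition-table copy was the standard sfdisk dump-and-restore trick. The device names below are examples, and the second command overwrites the target's partition table, so it pays to double-check them:

```shell
# Dump the good disk's partition table as plain text, then write
# it to the replacement. RAID member partitions show up in the
# dump with "Id=fd" (Linux raid autodetect):
sfdisk -d /dev/hda > table.dump
sfdisk /dev/hdc < table.dump
```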
So, we had one really easy RAID repair, and one really hard one. That's the way it goes sometimes.