
Dear Lazyweb, I would like a device mapper module, kind of like the device snapshot module, which would memorize disc access patterns and then store the accessed sectors sequentially. In the future, this disc access pattern could be replayed, streaming from the disc instead of having to seek all over the place.

The reasoning for this is that if a disc sector is, say, 1KB, and a disc can seek 120 times per second (8.5ms average seek time), worst case performance when booting is around 120KB/sec on the disc. This same disc, when streaming, can run around 55MB/sec.
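
A quick back-of-the-envelope check of those numbers, as a sketch in Python. The constant and function names are hypothetical, chosen for illustration, and it assumes 1KB means 1024 bytes, 55MB means 55*1024*1024 bytes, and exactly one sector is transferred per seek.

```python
# Rough model: when every read needs a seek, throughput is simply
# (seeks per second) * (bytes transferred per seek).
SEEKS_PER_SEC = 120                      # ~8.5ms average seek, rounded as in the text
SECTOR_BYTES = 1024                      # assumed: 1KB = 1024 bytes
STREAM_BYTES_PER_SEC = 55 * 1024 * 1024  # assumed: 55MB/sec = 55 MiB/sec


def seek_bound_throughput(seeks_per_sec, bytes_per_seek):
    """Bytes per second when each read costs one full seek."""
    return seeks_per_sec * bytes_per_seek


random_bps = seek_bound_throughput(SEEKS_PER_SEC, SECTOR_BYTES)
ratio = STREAM_BYTES_PER_SEC / random_bps

print(f"seek-bound: {random_bps / 1024:.0f} KB/sec")
print(f"streaming is about {ratio:.0f}x faster")
```

Under these assumptions the streaming case comes out several hundred times faster than the seek-bound case, which is the gap the replay idea is trying to close.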

If you memorize a common disc pattern, like the pattern that happens when you are booting, and later replay those blocks into the buffer cache by streaming instead of seeking, it could be a huge win. Thanks for getting right on this, Lazyweb. More discussion follows.

One problem with disc storage is that data throughput goes way down if you have to seek. Say the average seek time on a hard drive is 8.5ms (roughly one 120th of a second), the rotational speed is 7200RPM (120RPS), and the file-system has a block size of 1KB. Then accessing a bunch of small pieces of data will be dominated by seek time, not by transfer speed.

Now, the file-system will try to cluster related data close together, but even small reads where the disc arm doesn't have to move very far can take a lot of time. Depending on where the data is on a track, even a short seek may need to wait up to a 120th of a second for the disc to rotate around to where the required data is. This is one reason that 15K RPM discs tend to have much lower (roughly half) average access times than 7200RPM discs.
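
To put numbers on the rotational wait, here is a small sketch. The function names are hypothetical, and it assumes the required data is equally likely to be anywhere on the track, so the average wait is half a rotation and the worst case is a full one.

```python
def rotation_time_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def avg_rotational_latency_ms(rpm):
    """Average wait for the data to come around: half a rotation (assumed uniform)."""
    return rotation_time_ms(rpm) / 2


for rpm in (7200, 15000):
    print(f"{rpm} RPM: full rotation {rotation_time_ms(rpm):.2f}ms, "
          f"average wait {avg_rotational_latency_ms(rpm):.2f}ms")
```

At 7200RPM a full rotation is about 8.3ms (the 120th of a second above) and the average wait is about 4.2ms; at 15K RPM those drop to 4ms and 2ms, roughly half.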

The Linux disc I/O system takes advantage of this fact by over-reading data. When it has a request for one sector of the disc, it will actually read a bit more, called “read ahead”. See the “-a” option of “hdparm” for more information.

This is what I refer to as a data locality issue. It's particularly noticeable in databases. For example, I once had a database that I loaded from data sorted by one key, and then was trying to access it based on another key. The data was about twice the size of RAM, so I couldn't cache it. Even with an index, the record size was around 20 bytes and I was limited to 120-ish seeks per second, so I was getting about 120*20 bytes, or 2.4KB, of real data per second.

It was actually much faster for me to have two databases, one populated based on one key and one populated based on the other key.

I imagine it wouldn't be too hard to implement a device mapper similar to how the snapshot mapper works. At boot time, the mapper could replay its history into the buffer cache. Some sort of control mechanism could be used to tell the mapper that recording is done, and cause the memorized read pattern to be streamed onto the history device.
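
The record-then-replay idea can be sketched as a toy user-space simulation. This is not a kernel device mapper target, just an illustration of the data flow. Every name here (SeekCountingDisc, ReplayMapper, finish_recording, replay) is hypothetical, a plain dict stands in for the buffer cache, a list stands in for the history device, and writes are ignored entirely.

```python
class SeekCountingDisc:
    """Toy disc: counts a seek whenever a read does not follow the previous one."""

    def __init__(self, sectors):
        self.sectors = sectors
        self.seeks = 0
        self._next_sector = None

    def read(self, n):
        if n != self._next_sector:
            self.seeks += 1
        self._next_sector = n + 1
        return self.sectors[n]


class ReplayMapper:
    """Toy mapper: records first-access order, later replays it into a cache."""

    def __init__(self, disc, history=None):
        self.disc = disc
        self.history = list(history or [])  # stands in for the history device
        self.recording = history is None    # record only when no history exists yet
        self.cache = {}                     # stands in for the buffer cache
        self._seen = set()

    def read(self, n):
        if n in self.cache:
            return self.cache[n]            # served from cache, no disc access
        data = self.disc.read(n)
        if self.recording and n not in self._seen:
            self._seen.add(n)
            self.history.append((n, data))  # stored in access order
        return data

    def finish_recording(self):
        """The 'control mechanism': stop recording and hand back the history."""
        self.recording = False
        return self.history

    def replay(self):
        """One sequential pass over the history fills the cache."""
        for n, data in self.history:
            self.cache[n] = data


boot_pattern = [900, 12, 512, 13, 7, 640]   # made-up scattered boot reads
sectors = [bytes([i % 256]) * 4 for i in range(1024)]

# First boot: every scattered read costs a seek, and the pattern is recorded.
disc = SeekCountingDisc(sectors)
first_boot = ReplayMapper(disc)
first_data = [first_boot.read(n) for n in boot_pattern]
history = first_boot.finish_recording()
seeks_first_boot = disc.seeks

# Second boot: replay the history first, then the same reads hit the cache.
disc = SeekCountingDisc(sectors)
second_boot = ReplayMapper(disc, history=history)
second_boot.replay()
second_data = [second_boot.read(n) for n in boot_pattern]
seeks_second_boot = disc.seeks

print(seeks_first_boot, seeks_second_boot)
```

In this toy run the first boot pays one seek per read and the replayed boot pays none on the main disc, since the cost moves to a single streaming read of the history. A real implementation would also have to deal with sectors that were written after being recorded, which this sketch does not attempt.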

The history device could be another partition on the same device, probably smaller in size than physical RAM, or it could instead be another device: say, a solid state disc, possibly even something like an SD or CF card on a laptop (efm's laptop has an SD slot built-in).

This would provide a general-purpose way of making use of the new discs that include a small solid-state component and a larger traditional spinning disc.

I don't expect that, as a mapper, it would be particularly hard to implement. I've had this idea kicking around for 6 months or so, but I just won't have the time to implement it. So, I'm posting it for Lazyweb to implement. :-)
