Optimizing Disk Subsystems for Random I/O
by Gian-Paolo D. Musumeci, coauthor of System Performance Tuning, 2nd Edition
In my last article I discussed the 307 Corporation's ethanol production-monitoring environment, and how a little bit of Perl and the Solaris kernel-tracing facility can give us meaningful information about the load placed on a disk subsystem. To recap, let's summarize what we know about the workload in a sentence:
307 Corporation's production vessel monitoring system produces randomly distributed disk writes of, on average, 512 bytes, and requires at present approximately 66 I/Os per second.
We are trying to support 307 Corporation's goal of scaling up their production line by a factor of four. Therefore, our performance goal is to get the I/O subsystem to a point where it can sustain a minimum of 264 (66 I/Os per second * a scaling factor of 4) random 512-byte write operations per second.
A good first step is figuring out how many 512-byte random writes a single disk can absorb. There are lots of publicly available benchmark applications that specialize in exactly this sort of thing, but in the interest of clarity, let's write our own. We'll do this in C, because it is both fast and portable (we never know when we might need this sort of tool again). You can view the source code here.
Now that we have our benchmark, we can use it to figure out exactly how fast a
single disk is for what we're trying to accomplish. However, we also need to
find out how many concurrent write threads the 307 Corporation's application
is using. One approach to doing this is to use
ps -eL and make an educated
guess. After asking around, it looks like they're using four threads for
writes, so we'll run four copies of the benchmark and sum the answers. It
looks like one disk can do about 80 I/O operations per second.
So we're going to need about four disks, which we would expect
to give us about 320 I/O operations per second. Three disks would only give us
240 I/Os per second, which is awfully close -- however, we really want an additional
20-30 percent (50 to 80 I/Os) to provide some margin for overhead.
(As a reminder, the 307 Corporation is using a Sun Ultra Enterprise 3500 --
the four disks are Fibre Channel and are installed internally as devices
c1t4d0 through c1t7d0.)
In order to use those four disks effectively, we'll build a striped disk array
(RAID 0) with the Solaris DiskSuite tool, which can be found on the second
Solaris install CD. While using a striped volume increases our chances of data
loss -- losing any one of our four disks will cost us all the information
stored on the array -- it's a necessary tradeoff here.
The 307 Corporation doesn't want to spend the money to install eight disks, which is how many are necessary for RAID 0+1 (striped mirrors). Unfortunately, our workload is also a classic example of the weakness of RAID 5 (parity-protected striping) arrays, which perform small writes very poorly. Without much of a choice, then, we'll set up a striped array, and be sure that our final report to the 307 Corporation emphasizes the importance of backing up data.
The first step in building the disk array is creating "metadevice database replicas," used to store information about what disk arrays are configured:
    # metadb -c 2 -a -f \
        /dev/dsk/c1t4d0s7 /dev/dsk/c1t5d0s7 /dev/dsk/c1t6d0s7 /dev/dsk/c1t7d0s7
    # metadb
         flags        first blk    block count
      a     u         16           1034         /dev/dsk/c1t4d0s7
      a     u         1050         1034         /dev/dsk/c1t4d0s7
      a     u         16           1034         /dev/dsk/c1t5d0s7
      a     u         1050         1034         /dev/dsk/c1t5d0s7
      a     u         16           1034         /dev/dsk/c1t6d0s7
      a     u         1050         1034         /dev/dsk/c1t6d0s7
      a     u         16           1034         /dev/dsk/c1t7d0s7
      a     u         1050         1034         /dev/dsk/c1t7d0s7
Now that we have a working metadevice database, we can construct the stripe itself:
    # metainit d0 1 4 \
        /dev/dsk/c1t4d0s0 /dev/dsk/c1t5d0s0 /dev/dsk/c1t6d0s0 /dev/dsk/c1t7d0s0 \
        -i 128k
    d0: Concat/Stripe is setup
(For more information on Solaris DiskSuite, consult Sun's online documentation, or check out Chapter 6 of System Performance Tuning, which covers disk array design in much more detail.)
We can run the same benchmark again on the new metadevice to confirm that
we've met our goal, and stop there. But that's sort of like getting a
reservation at Le Cirque and then only ordering a soufflé -- it's perfectly
fine and you get what you wanted out of it, but wouldn't you rather sit down
and have an entire meal while you're there?
So there are two nice opportunities for algorithmic change here: we can try to move away from a random disk access pattern, and we can work to aggregate our writes into big chunks instead of 512-byte pieces.
First, what if we modify their application to do sequential, rather than
random, writes to the disk? We can quickly change the benchmark we developed
above to test that exact situation, simply by removing the
lseek64() system call and the surrounding code. If we run our modified benchmark, we sustain
about 409 I/O operations per second.
This is great -- we've far exceeded our performance goal with just one disk! This nicely illustrates just how much opportunity there is in transforming random disk operations to sequential ones.
Another opportunity is apparent: 512 bytes isn't much data to write to disk
-- maybe we could write some software to aggregate all those little writes
into big writes. Let's say we think we can gather 16 records into a single
write. We can change the I/O size by using the
-s flag to our benchmark and get an idea of how fast we'll go. A quick experiment with an 8K I/O size (16 records * 512 bytes) reports that we can now sustain 70 I/Os per second -- but each I/O is
actually 16 records, so we're writing 1,120 records to disk per second!
That's pretty darn good. Note that if we combine both of these changes, we get quite an impressive performance jump: up to 218 I/Os (3,488 records) per second.
This about wraps up our treatment of the 307 Corporation and their need to understand and optimize disk writes. Let's summarize the most important things we've touched on:
The first step is always understanding. We used Perl and the Solaris kernel-tracing facility (see last month's article) to improve our picture of what sort of work the system was doing. If you don't know what's happening, it's very hard to figure out what's broken, and that in turn makes it very difficult to fix.
We wrote a benchmark, based on our understanding, to try and simulate the workload.
We met our performance goal by selecting an appropriate disk array technology, deploying it, and validating our performance improvements via our disk benchmark.
We looked at the possibility of algorithmic change, and found that with the right underlying modifications, we could completely obviate the need for the disk array. This illustrates a very important principle: you can usually buy your way out of a problem (deploying a disk array), but it's often cheaper and faster to think your way out of it (by changing an algorithm or fixing a poorly performing implementation of a good algorithm).
Gian-Paolo D. Musumeci is a research engineer in the Performance and Availability Engineering group at Sun Microsystems, where he focuses on network performance.
Copyright © 2009 O'Reilly Media, Inc.