

System Performance Tuning, 2nd Edition

Optimizing Disk Subsystems for Random I/O
A Case Study in Performance Analysis

by Gian-Paolo D. Musumeci, coauthor of System Performance Tuning, 2nd Edition

In my last article I discussed the 307 Corporation's ethanol production-monitoring environment, and how a little bit of Perl and the Solaris kernel-tracing facility can give us meaningful information about the load placed on a disk subsystem. To recap, let's summarize what we know about the workload in a sentence:

307 Corporation's production vessel monitoring system produces randomly distributed disk writes of, on average, 512 bytes, and requires at present approximately 66 I/Os per second.

We are trying to support 307 Corporation's goal of scaling up their production line by a factor of four. Therefore, our performance goal is to get the I/O subsystem to the point where it can sustain a minimum of about 265 random 512-byte write operations per second (roughly 66 I/Os per second, scaled up by a factor of four).

A good first step is figuring out how many 512-byte random writes a single disk can absorb. There are lots of publicly available benchmark applications that specialize in exactly this sort of thing, but in the interest of clarity, let's write our own. We'll do this in C, because it is both fast and portable (we never know when we might need this sort of tool again). You can view the source code here.
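The benchmark source itself is not reproduced here, so the following is only a minimal sketch of what such a tool might look like, not the article's actual program. The function name random_write_iops and all of its parameters are assumptions made for illustration. It seeks to a random aligned offset and writes one fixed-size block, over and over, then reports writes per second.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical sketch, not the article's benchmark: issue `count` writes of
 * `io_size` bytes at random io_size-aligned offsets within the first `span`
 * bytes of `path`, and return the observed writes per second (-1.0 on error).
 */
double random_write_iops(const char *path, size_t io_size, off_t span,
                         long count, unsigned seed)
{
    assert(io_size > 0 && count > 0 && span >= (off_t)io_size);

    /* O_DSYNC asks that each write reach stable storage before returning. */
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);
    if (fd < 0)
        return -1.0;

    char *buf = malloc(io_size);
    if (buf == NULL) {
        close(fd);
        return -1.0;
    }
    memset(buf, 0xA5, io_size);

    /* Number of io_size-sized slots we may land on. */
    long slots = (long)(span / (off_t)io_size);
    struct timespec t0, t1;

    srand(seed);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < count; i++) {
        /* rand() is coarse (RAND_MAX may be small) but fine for a sketch. */
        off_t off = (off_t)(rand() % slots) * (off_t)io_size;
        if (lseek(fd, off, SEEK_SET) == (off_t)-1 ||
            write(fd, buf, io_size) != (ssize_t)io_size) {
            free(buf);
            close(fd);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(buf);
    close(fd);

    double secs = (double)(t1.tv_sec - t0.tv_sec) +
                  (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return secs > 0.0 ? (double)count / secs : -1.0;
}
```

A call such as random_write_iops(path, 512, span, 1000, 1) would approximate the workload described above, with path and span chosen to match the device under test. Note that writing to a disk device this way destroys whatever is stored on it.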

Now that we have our benchmark, we can use it to figure out exactly how fast a single disk is for what we're trying to accomplish. However, we also need to find out how many concurrent write threads the 307 Corporation's application is using. One approach to doing this is to use ps -eL and make an educated guess. After asking around, it looks like they're using four threads for writes, so we'll run four copies of the benchmark and sum the answers. It looks like one disk can do about 80 I/O operations per second.

So we're going to need about four disks, which we would expect to give us about 320 I/O operations per second. Three disks would give us only 240 I/Os per second, which is awfully close to our goal of 265 but still short of it -- and we really want an additional 20-30 percent (50 to 80 I/Os) to provide some margin for overhead. (As a reminder, the 307 Corporation is using a Sun Ultra Enterprise 3500 -- the four disks are Fibre Channel and are installed internally as devices c1t4, c1t5, c1t6, and c1t7.)
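The sizing arithmetic above can be written down as a small helper. This is only a sketch: the name disks_needed and its parameters are made up for illustration, and it assumes throughput scales linearly with the number of disks, which is what we expect here but which a real array may not quite deliver.

```c
#include <assert.h>

/*
 * Hypothetical helper: smallest number of disks whose combined throughput
 * covers target_iops plus headroom_pct percent of margin, assuming each
 * disk contributes per_disk_iops and scaling is linear.
 */
int disks_needed(int target_iops, int per_disk_iops, int headroom_pct)
{
    assert(target_iops > 0 && per_disk_iops > 0 && headroom_pct >= 0);

    /* Work in hundredths so the percentage stays in integer arithmetic. */
    long required = (long)target_iops * (100 + headroom_pct);
    long per_disk = (long)per_disk_iops * 100;

    /* Integer ceiling division. */
    return (int)((required + per_disk - 1) / per_disk);
}
```

With a target of 265 I/Os per second, 80 I/Os per second per disk, and 20 percent headroom, this yields four disks. Asking for the full 30 percent would push the answer to five, so four disks sit at the low end of the margin we wanted.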

In order to use those four disks effectively, we'll build a striped disk array (RAID 0) with the Solaris DiskSuite tool, which can be found in the EA directory on the second Solaris install CD. While using a striped volume increases our chances of data loss, as losing one of our four disks will entail losing all the information stored on the array, it's a necessary tradeoff here.

The 307 Corporation doesn't want to spend the money to install eight disks, which is how many are necessary for RAID 0+1 (striped mirrors). Unfortunately, our workload is also a classic example of the weakness of RAID 5 (parity-protected striping) arrays, which perform small writes very poorly. Without much of a choice, then, we'll set up a striped array, and be sure that our final report to the 307 Corporation emphasizes the importance of backing up data.
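To make the comparison concrete, here is a rough sketch using a commonly quoted rule of thumb for small random writes: each logical write costs one physical I/O on a stripe, two on a mirrored stripe, and four on RAID 5 (read old data, read old parity, write new data, write new parity). These penalty factors are an assumption brought in for illustration, not measurements from this system, and they ignore controller caches and full-stripe writes. The names below are hypothetical.

```c
#include <assert.h>

enum raid_level { RAID_0, RAID_0_PLUS_1, RAID_5 };

/* Rule-of-thumb physical I/Os per small logical write (an assumption). */
int write_penalty(enum raid_level level)
{
    switch (level) {
    case RAID_0:        return 1;  /* one write to one disk */
    case RAID_0_PLUS_1: return 2;  /* the write goes to both mirror halves */
    case RAID_5:        return 4;  /* read data + parity, write data + parity */
    }
    return 1;
}

/* Estimated small random writes per second the whole array can absorb. */
int small_write_iops(enum raid_level level, int disks, int per_disk_iops)
{
    assert(disks > 0 && per_disk_iops > 0);
    return disks * per_disk_iops / write_penalty(level);
}
```

Under these assumptions, four 80-I/O disks give an estimated 320 writes per second as a stripe but only about 80 as RAID 5, and it takes eight disks in RAID 0+1 to get back to 320.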

The first step in building the disk array is creating "metadevice database replicas," used to store information about what disk arrays are configured:

# metadb -c 2 -a -f \
/dev/dsk/c1t4d0s7 /dev/dsk/c1t5d0s7 /dev/dsk/c1t6d0s7 /dev/dsk/c1t7d0s7
# metadb
        flags           first blk       block count
     a    u             16              1034            /dev/dsk/c1t4d0s7
     a    u             1050            1034            /dev/dsk/c1t4d0s7
     a    u             16              1034            /dev/dsk/c1t5d0s7
     a    u             1050            1034            /dev/dsk/c1t5d0s7
     a    u             16              1034            /dev/dsk/c1t6d0s7
     a    u             1050            1034            /dev/dsk/c1t6d0s7
     a    u             16              1034            /dev/dsk/c1t7d0s7
     a    u             1050            1034            /dev/dsk/c1t7d0s7

Now that we have a working metadevice database, we can construct the stripe itself:

# metainit d0 1 4 \
/dev/dsk/c1t4d0s0 /dev/dsk/c1t5d0s0 /dev/dsk/c1t6d0s0 /dev/dsk/c1t7d0s0 \
-i 128k
d0: Concat/Stripe is setup

(For more information on Solaris’ DiskSuite, consult Sun's online documentation, or check out Chapter 6 of System Performance Tuning, which covers disk array design in much more detail.)
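To see why a stripe helps a random workload, it can be useful to picture how a logical offset on the metadevice maps onto the component disks. The sketch below shows the textbook RAID 0 mapping for a given stripe width and interlace size; it is a conceptual illustration with hypothetical names, and DiskSuite's actual on-disk layout may differ in its details.

```c
#include <assert.h>
#include <stdint.h>

struct stripe_loc {
    unsigned disk;    /* index of the component disk, 0 .. ndisks-1 */
    uint64_t offset;  /* byte offset within that component */
};

/*
 * Textbook RAID 0 mapping: the logical address space is cut into
 * interlace-sized chunks that are dealt out to the disks round-robin.
 */
struct stripe_loc stripe_map(uint64_t logical, unsigned ndisks,
                             uint64_t interlace)
{
    assert(ndisks > 0 && interlace > 0);

    uint64_t chunk = logical / interlace;  /* which chunk overall */
    struct stripe_loc loc;

    loc.disk = (unsigned)(chunk % ndisks);
    /* Full rows of chunks already on this disk, plus position in the chunk. */
    loc.offset = (chunk / ndisks) * interlace + logical % interlace;
    return loc;
}
```

With four disks and a 128 KB interlace, an aligned 512-byte write always falls inside a single chunk, so each write touches exactly one disk, and randomly distributed writes spread across all four.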


We can run the same benchmark again on the new metadevice, /dev/md/rds, and then only write a little bit of data. This is sort of like getting a reservation at Le Cirque and then only ordering a soufflé -- it's perfectly fine and you get what you wanted out of it, but wouldn't you rather sit down and have an entire meal while you're there?

So there is definitely a pair of nice opportunities for algorithmic change here: we can try to move away from a random disk access pattern, and we can work to aggregate our writes into big chunks instead of 512-byte pieces.

First, what if we modify their application to do sequential, rather than random, writes to the disk? We can quickly change the benchmark we developed above to test that exact situation, simply by removing the lseek64() system call and the surrounding code. If we run our modified benchmark, we sustain about 409 I/O operations per second.

This is great -- we've far exceeded our performance goal with just one disk! This nicely illustrates just how much opportunity there is in transforming random disk operations to sequential ones.

Another opportunity is apparent: 512 bytes isn't much data to write to disk -- maybe we could write some software to aggregate all those little writes into big writes. Let's say we think we can gather 16 records into a single write. We can change the I/O size by using the -s flag to our benchmark and get an idea of how fast we'll go. A quick experiment with the -s 8192 option reports that we can now sustain 70 I/Os per second -- but each I/O is actually 16 records, so we're writing 1,120 records to disk per second!
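The aggregation idea could be sketched along these lines. This is an illustration only, not code from 307 Corporation's application: the type and function names are hypothetical, and the record and batch sizes simply mirror the 512-byte and 16-record figures used above.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define RECORD_SIZE       512
#define RECORDS_PER_BATCH 16
#define BATCH_BYTES       (RECORD_SIZE * RECORDS_PER_BATCH)  /* 8192 */

/* Hypothetical in-memory buffer that turns 16 small writes into one. */
struct aggregator {
    int    fd;                /* descriptor the batches are written to */
    size_t used;              /* bytes currently buffered */
    char   buf[BATCH_BYTES];
};

void agg_init(struct aggregator *a, int fd)
{
    assert(a != NULL);
    a->fd = fd;
    a->used = 0;
}

/* Write out whatever is buffered: bytes written, 0 if empty, -1 on error. */
ssize_t agg_flush(struct aggregator *a)
{
    assert(a != NULL);
    if (a->used == 0)
        return 0;
    ssize_t n = write(a->fd, a->buf, a->used);
    if (n != (ssize_t)a->used)
        return -1;            /* buffered data is kept for a later retry */
    a->used = 0;
    return n;
}

/*
 * Buffer one RECORD_SIZE-byte record. Returns 0 while the batch is still
 * filling, the number of bytes written once it fills, or -1 on error.
 */
ssize_t agg_add(struct aggregator *a, const void *record)
{
    assert(a != NULL && record != NULL);

    /* A previous flush may have failed and left the buffer full. */
    if (a->used == BATCH_BYTES && agg_flush(a) < 0)
        return -1;

    memcpy(a->buf + a->used, record, RECORD_SIZE);
    a->used += RECORD_SIZE;
    return a->used == BATCH_BYTES ? agg_flush(a) : 0;
}
```

The design trade-off is that up to 15 records sit in memory between writes, so a crash can lose them; a real implementation would probably call agg_flush() on a timer or at shutdown as well.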

That's pretty darn good. Note that if we make both of these changes, we have quite an impressive performance jump: up to 218 I/Os (3,488 records) per second.

This about wraps up our treatment of the 307 Corporation and their need to understand and optimize disk writes. Let's summarize the most important things we've touched on:

  1. The first step is always understanding. We used Perl and the Solaris kernel-tracing facility (see last month's article) to improve our conception of what sort of work the system was doing. If you don't know what's happening, it's very hard to figure out what's broken, and that makes it in turn very difficult to fix.

  2. We wrote a benchmark, based on our understanding, to try to simulate the workload.

  3. We met our performance goal by selecting an appropriate disk array technology, deploying it, and validating our performance improvements via our disk benchmark.

  4. We looked at the possibility for algorithmic change, and found that with the right sort of underlying modifications, we could completely obviate the necessity for the disk array. This illustrates a very important principle: you can usually buy your way out of a problem (deploying a disk array), but it's often cheaper and faster to think your way out of it (by changing an algorithm or fixing a poorly performing implementation of a good algorithm).

Gian-Paolo D. Musumeci is a research engineer in the Performance and Availability Engineering group at Sun Microsystems, where he focuses on network performance.
