Improving Compilation Times: A Case Study in Performance Analysis
by Gian-Paolo D. Musumeci, coauthor of System Performance Tuning, 2nd Edition
02/21/2002
The software engineers at e-Gizmo, a firm developing B2C applications for the emerging, high-performance canine consumables market, are not particularly happy--a library critical to their environment is taking longer to compile than they'd like.
Their development server, polaris, is a Sun Enterprise 2 with a pair of 250MHz UltraSPARC II processors, 512MB of memory, and two 18GB disks--one for 1GB of swap and the root file system, and the other for /export/src, where all their compilation work is done. The system is running Solaris 7 with all the current patches, and the developers are using the Forte Developer 6 update 2 compiler suite. They haven't done any tuning work.
The complaint of the software developers is that compiling one of their products, which consists of about 100,000 lines of C, takes too long--they've timed it at about 18 minutes. They want to see compilation times around 12 minutes, about a one-third reduction, without any change to the compiler output. Furthermore, management has established that there are no funds available for this project, so they'll have to make do with the hardware they have.
Because this piece of software is pretty critical to the performance of their entire application suite, it's compiled with a very high level of optimization: -fast -xarch=v8plusa.
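For reference, a representative compile line using these flags might look like the following; the source file name here is hypothetical:

polaris% cc -fast -xarch=v8plusa -c -o chewtoy.o chewtoy.c

The -fast macro-flag turns on an aggressive set of optimizations tuned for the build machine, and -xarch=v8plusa targets the UltraSPARC's V8plus instruction set with the VIS extensions.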
Before we start looking at the system's workload, let's figure out what we might suspect to be bottlenecks and plan a tuning strategy. Based on how we think compilation stresses the system, we suspect that we will need to look at memory consumption, CPU usage, and disk I/O. Since NFS isn't involved, networking isn't likely to be a problem. So, we can formulate three questions:
1. Are we running short of memory? Our best evidence for a memory shortage is going to come from the sr field in the output of vmstat. This is discussed in more detail in Chapter 4 of System Performance Tuning, 2nd Edition.

2. Are we running into disk I/O problems? Compilation is often very disk-intensive. We'll look at this through the iostat command. Disk I/O is covered in Chapter 5.

3. Are we saturating the CPUs? We are in a flat-out race to compile this library as fast as possible, so we want both CPUs to be working 100% of the time. For a more in-depth discussion of processor performance, see Chapter 3.
Let's take a look at the data that we actually get during our first trial run. I'll summarize representative output from vmstat 5, iostat -xnP 5, and mpstat 5 in terms of the three questions we posed above.
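One convenient way to gather all three at once is to run the collectors in the background while the build proceeds, logging to files; a minimal sketch (the log file locations are arbitrary):

polaris% vmstat 5 > /var/tmp/vmstat.log &
polaris% iostat -xnP 5 > /var/tmp/iostat.log &
polaris% mpstat 5 > /var/tmp/mpstat.log &
polaris% make
polaris% kill %1 %2 %3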
Is there a memory shortage? Let's take a look at the vmstat output:
polaris% vmstat 5
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s1 s1 s2 -- in sy cs us sy id
0 0 0 1425960 396952 0 727 4 1 1 0 0 0 3 0 0 141 2220 121 38 12 50
0 0 0 1426912 397856 0 1207 8 0 0 0 0 0 4 0 0 148 3513 153 29 20 51
0 0 0 1426072 396896 0 628 6 0 0 0 0 0 2 0 0 330 7017 930 40 14 46
Our top indicator of a memory shortage is the sr column, but it's consistently zero. If we were short of memory, the first thing we'd try is turning on priority paging. One thing that does look suspicious is the fact that we have some idle time, as shown by the id column; we'll have to take a look at that.
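For reference, enabling priority paging on Solaris 7 amounts to adding a single tunable to /etc/system and rebooting. We don't need it here, since sr stays at zero, but the change would look like this:

* In /etc/system: have the page scanner reclaim file system
* cache pages before application pages
set priority_paging=1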
Are we disk I/O bound? iostat -xnP 5 output seems to indicate
otherwise pretty conclusively:
polaris% iostat -xnP 5
...
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0s0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0s1
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0s2
0.2 2.6 0.6 8.6 0.0 0.0 0.0 14.0 0 2 c0t1d0s0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t1d0s2
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 polaris:vold(pid208)
...
We really aren't doing much disk activity at all, and it's all being done to /export/src (c0t1d0s0); note that c0t0d0s0 is swap, and c0t0d0s1 is /.
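If the slice-to-file-system mapping isn't fresh in your mind, it's easy to confirm: df reports the device behind a mount point, and swap -l lists the slices in use as swap devices:

polaris% df -k /export/src
polaris% swap -l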
Finally, let's take a look at mpstat and see what's going on with our CPUs:
polaris% mpstat 5
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 294 0 115 214 13 92 9 6 0 0 599 9 5 1 85
1 489 0 67 28 14 39 13 8 1 0 1875 63 22 1 14
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 147 0 155 216 12 96 12 5 1 0 871 5 9 0 85
1 1098 1 71 32 17 60 14 10 1 0 2890 51 32 3 14
It looks like only one of our processors is doing much useful work. On a multiprocessor system like the one that e-Gizmo is using for software development, we can use dmake, which parallelizes the compilation process. Let's see if we can use that to get the CPU utilization up.
To test this, we should start by running a make clean to clean out all the old object files. Then, we'll unmount and remount /export/src in order to clear out any file system cache entries. This way, we'll be running the same test again, but this time with a parallel make.
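Concretely, the preparation steps look something like the following; this is a sketch, run as root for the remount, and the product directory name is hypothetical:

polaris# cd /export/src/product && make clean
polaris# cd /
polaris# umount /export/src
polaris# mount /export/src

With the file system cache cold, we'll start by trying two threads of compilation, since we have two processors: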
polaris% /bin/time -p dmake -j 2
...
real 614.14
user 867.14
sys 219.82
Our performance target is the wall-clock compilation time, which /bin/time reports as "real." At 614.14 seconds--about 10 minutes, 14 seconds--we've already beaten our 12-minute target by almost two minutes. A quick look at the output of mpstat shows us that we have indeed gotten rid of almost all the idle time.
polaris% mpstat 5
...
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 448 0 80 221 12 83 33 9 1 0 1372 76 15 1 8
1 440 0 105 39 14 48 24 8 1 0 1218 84 13 1 2
...
We might try "oversubscribing" our processors by adding more parallel execution threads, so we'll run a few more tests, increasing the value of -j each time:
-j   real time (m:s)   user time (m:s)   sys time (m:s)
 1   18:03.03          14:14.34          3:27.34
 2   10:14.14          14:27.14          3:39.82
 3    9:41.86          14:28.37          3:39.45
 4    9:34.66          14:33.45          3:39.22
At this point, we're starting to see diminishing returns. We could probably experiment some more and drive the compilation time down by another few seconds, but the time that e-Gizmo is willing to give us for testing is starting to run awfully short.
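For what it's worth, a sweep like this is easy to script; a minimal Bourne shell sketch, run as root and again assuming the hypothetical /export/src/product tree:

#!/bin/sh
# Time dmake at several parallelism levels; logs land in /var/tmp.
for j in 1 2 3 4; do
    cd /export/src/product && make clean
    cd / && umount /export/src && mount /export/src
    cd /export/src/product
    /bin/time -p dmake -j $j > /var/tmp/dmake-j$j.log 2>&1
done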
So, it looks like -j 4 is going to be our best bet. Finally, we'll check the output of iostat and vmstat to make sure that we haven't got another easy tuning target elsewhere in the system, but it looks like we're out of luck:
polaris% vmstat 5
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s1 s1 s2 -- in sy cs us sy id
1 0 0 1412432 368488 0 1583 11 0 0 0 0 0 5 0 0 512 5491 1135 71 28 0
2 0 0 1409768 365920 0 1552 11 0 0 0 0 0 10 0 0 530 5417 812 73 27 0
2 0 0 1412144 367768 0 1460 1 0 0 0 0 0 4 0 0 561 4954 518 73 27 0
polaris% iostat -xnP 5
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0s0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0s1
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0s2
1.2 5.0 8.6 229.0 0.0 0.2 0.0 32.3 0 3 c0t1d0s0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t1d0s2
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 polaris:vold(pid208)
Nope, no easy performance pickings left, and we're about out of time on the client system, so we'll call it a day. The e-Gizmo engineers are pretty happy with their performance boost.
As we look back on this, it was a pretty straightforward problem with a pretty straightforward solution, but we know they aren't all this easy. The important thing to take away from this case study isn't how much speedup we achieved by using dmake to split the compilation across multiple CPUs, and especially not the exact value of -j that we found to be optimal, but rather how we approached the problem: we figured out what the performance requirements were, formulated some hypotheses about where we might look for tuning opportunities, gathered data while the system was running, and then acted on that data.
O'Reilly & Associates released System Performance Tuning, 2nd Edition in February 2002. Sample Chapter 4, Memory, is available free online.
Gian-Paolo D. Musumeci is a research engineer in the Performance and Availability Engineering group at Sun Microsystems, where he focuses on network performance.