 Published on ONLamp.com (http://www.onlamp.com/)


SVG Essentials

An SVG Histogram

by J. David Eisenberg, author of SVG Essentials
02/07/2002

SVG is an XML markup language for describing scalable vector graphics.

The XML.com article "An Introduction to Scalable Vector Graphics" gives a brief overview of SVG. The example in that article, an advertisement for a camera store, was drawn by hand. In this article, we'll generate a graphic from existing data. Specifically, we'll write a Perl program that draws a graph of the distribution of file sizes in a directory and its subdirectories.

Acquiring the Data

Before we can start the graph, we must collect the data. We'll use the File::Find module to help us search the directory. The program starts off with variable definitions:

use warnings;
use strict;

use File::Find;

my @file_sizes;
   # array of file sizes
my $total_files;
   # number of files in directory
   # and its subdirectories
my $max_file_size;
   # maximum file size

It then tests to see if you've provided the name of a directory for it to analyze:

if (scalar @ARGV != 1)
{
    print "Create SVG diagram showing file size ",
        "distribution for a directory\n";
    print "Usage: $0 directory\n";
    exit(1);   # non-zero status: called incorrectly
}

Next, the program sets up the counter variables and scans the directory:

$total_files = 0;
$max_file_size = 0;
find(\&accumulate, $ARGV[0]);

Here's the function that accumulates the data. All the file sizes go into the @file_sizes array, while the function keeps track of the total number of files and the size of the largest file.

sub accumulate
{
   my @info;

   if (-f $_)
   {
      @info = stat($_);
      push @file_sizes, $info[7];
      if ($info[7] > $max_file_size)
      {
         $max_file_size = $info[7];
      }
      $total_files++;
   }
}
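To see how File::Find drives a callback like this one, here is a minimal, self-contained sketch. It builds a throwaway directory with File::Temp (the file names and sizes are made up for the example) and then scans it. It assumes File::Find's default behavior of changing into each directory and setting $_ to the bare file name before calling the function.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a throwaway directory tree with files of known sizes.
# (The layout and file names are made up for this example.)
my $dir = tempdir(CLEANUP => 1);
mkdir "$dir/sub" or die "mkdir failed: $!";
my %sizes = (
    "$dir/a.txt"     => 10,
    "$dir/b.txt"     => 250,
    "$dir/sub/c.txt" => 40,
);
for my $path (keys %sizes) {
    open my $fh, '>', $path or die "cannot write $path: $!";
    binmode $fh;
    print {$fh} 'x' x $sizes{$path};
    close $fh or die "cannot close $path: $!";
}

my @file_sizes;
my $total_files   = 0;
my $max_file_size = 0;

# Called once per directory entry; $_ holds the entry's name.
sub accumulate {
    return unless -f $_;           # skip directories and other non-files
    my $size = (stat $_)[7];       # element 7 of stat is the size in bytes
    push @file_sizes, $size;
    $max_file_size = $size if $size > $max_file_size;
    $total_files++;
}

find(\&accumulate, $dir);
print "$total_files files, largest is $max_file_size bytes\n";
```

With the three sample files above, the scan should report 3 files and a largest size of 250 bytes.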

Creating the Graphic

Now that the data is available, we are prepared to create the graphic. We could create a histogram with the file-size range on the x-axis and the number of files in each range on the y-axis, but that would be ordinary and boring. Instead, we'll create a graph that shows a horizontal bar divided into 100 sections. Each section represents 1 percent of the range of file sizes. Instead of showing the number of files in a range by a bar height, we'll show it by the color of that section of the bar. The more intense the color, the more files there are in that range. Here's what the graph will look like. (The image has been cropped to fit nicely on this page.)

[Figure: The Histogram Graph.]


First, let's start with the standard SVG header:

print <<"SVG_HEAD";
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.0//EN"
    "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">

Then we need to establish the dimensions of the drawing. To make our lives easier, we'll set the viewBox so that one unit in our SVG coordinate system will be equivalent to one pixel. The preserveAspectRatio attribute is set so that the drawing always appears at the upper left of the viewing area.

<svg width="700px" height="80px"
    xmlns="http://www.w3.org/2000/svg"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    viewBox="0 0 700 80"
    preserveAspectRatio="xMinYMin meet">

The root element is followed by a title and description:

    <title>File sizes for $ARGV[0]</title>
    <desc>
        Distribution of file sizes
        of $ARGV[0] and its  subdirectories
    </desc>

Next, we'll define an empty bar (which we will subdivide) and some gray dividing lines that mark every 10 percent of the range of file sizes. To keep life easy, we'll make the bar 600 units long and 20 units tall. The empty bar's rectangle is outset one unit from this 600-by-20 area. Its one-unit-wide stroke is centered on the rectangle's edge, so the stroke reaches only half a unit inward and won't overlap the area that depicts the data. Similarly, the gray dividing lines have a stroke width of one half unit.

<defs>
<g id="empty-bar">
   <rect x="-1" y="19" width="602" height="22"
      style="fill: none; stroke: black;"/>
</g>

<g id="gray-dividers"
   style="stroke: gray; fill: none;
   stroke-width: 0.5;">
   <path d="M 60 20 v 20
      M 120 20 v 20
      M 180 20 v 20
      M 240 20 v 20
      M 300 20 v 20
      M 360 20 v 20
      M 420 20 v 20
      M 480 20 v 20
      M 540 20 v 20"/>
</g>
</defs>
SVG_HEAD
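The nine divider positions in the path above were typed by hand. As a sketch of an alternative, the same path data could be generated with a loop. The variable names here are assumptions made for the example; the geometry (a 600-unit bar whose data area runs from y=20 to y=40) is taken from the listing.

```perl
use strict;
use warnings;

# Geometry assumed from the hand-written path in the article.
my $bar_width = 600;
my $top       = 20;
my $height    = 20;

# One "move to, then vertical line" pair per 10 percent boundary,
# skipping 0 and 100 percent, which the outline already covers.
my $path_data = join ' ',
    map { 'M ' . ($bar_width * $_ / 10) . " $top v $height" } 1 .. 9;

print qq!<path d="$path_data"/>\n!;
```

This keeps the dividers in step with the bar width if the width ever changes, at the cost of building the string before printing the header.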

Now we're ready to draw the part of the graphic that is data-dependent. We'll need some variables, as described here:

my $size;       # size of an individual file
my $i;          # the ubiquitous loop counter
my $x;          # x-offset of a size range
my $pct_count;  # % of files in a range
my $hue;        # red and green % for a range
                # equal to 95 - file percentage
my @slots;      # number of files in each range

The first step is to look at all the file sizes and distribute them through the 1 percent slots of the file range. We'll cheat a bit with the 100 percent range, and shoehorn it in with the 99 percent range, since there are 101 numbers in the range zero to one hundred.

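foreach $size (@file_sizes)
{
   # if every file is empty, $max_file_size is 0;
   # use slot 0 rather than divide by zero
   $pct_count = ($max_file_size == 0) ? 0 :
      int(100 * $size/$max_file_size);
   if ($pct_count == 100)
   {
      $pct_count = 99;
   }
   $slots[$pct_count]++;
}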
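To make the slot arithmetic concrete, here is a small sketch that wraps the calculation in a function and runs it on made-up sizes. The name slot_for is hypothetical and not part of the original program, and the guard for a zero maximum is an addition for the case where every file is empty.

```perl
use strict;
use warnings;

# slot_for is a hypothetical helper name, not part of the original program.
sub slot_for {
    my ($size, $max_size) = @_;
    return 0 unless $max_size;           # every file empty: avoid dividing by zero
    my $slot = int(100 * $size / $max_size);
    return $slot == 100 ? 99 : $slot;    # fold the largest file into the last slot
}

# Sample data (made up): the largest file is 1000 bytes.
my @sizes = (0, 5, 9, 10, 499, 500, 999, 1000);
my @slots;
$slots[ slot_for($_, 1000) ]++ for @sizes;

# Report only the slots that received at least one file.
for my $i (grep { $slots[$_] } 0 .. $#slots) {
    print "slot $i: $slots[$i] file(s)\n";
}
```

Here the 999-byte and 1000-byte files both land in slot 99, which is the "shoehorning" described above.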

Next, we'll offset the graph away from the origin. The SVG is produced by this Perl statement:

print qq!<g transform="translate(10, 15)">\n!;

The next step is to display the directory name with the total number of files. Here's what the SVG code should look like when it's produced:

   <text x="0" y="15"
      style="font-size: 12px;">
   directory ($total_files files)
   </text>

And here's the Perl that produces it:

print qq!\t<text x="0" y="15" !,
      qq!style="font-size: 12px;">\n!;
print qq!\t\t$ARGV[0] ($total_files files)\n!;
print qq!\t</text>\n!;
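One caveat: a directory name that contains an ampersand or an angle bracket would produce malformed XML if it were printed as-is. A minimal sketch of one way to guard against that follows. The xml_escape helper is hypothetical (it is not in the original program), and the directory name and file count are made up for the example.

```perl
use strict;
use warnings;

# xml_escape is a hypothetical helper, not part of the original program.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first, or it would re-escape the others
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

my $dir_name    = 'docs/R&D <draft>';   # made-up directory name
my $total_files = 42;                   # made-up count

print qq!\t<text x="0" y="15" style="font-size: 12px;">\n!;
print qq!\t\t!, xml_escape($dir_name), qq! ($total_files files)\n!;
print qq!\t</text>\n!;
```

The same helper could be applied to the title and desc elements in the header, which also interpolate the directory name.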

Now we'll go through all the slots, and if there are files in that slot, we'll draw a blue rectangle of the appropriate intensity. We do this by setting the blue to 100 percent, and desaturating it in inverse proportion to the number of files. When calculating the red and green amounts, we subtract from 95 rather than 100, thus ensuring that a range with a small number of files will be visibly different from white (which signifies a range with no files).

For example, if the seventh slot of the file-size range (index 6, so x is 6 × 6 = 36) contains 20 percent of the total number of files, we want to produce this SVG:

<rect x="36" y="20" width="6" height="20"
   style="fill: rgb(75%, 75%, 100%);"/>

Here's the Perl code that generates all the rectangles:

for ($i=0; $i < 100; $i++)
{
   next if (!$slots[$i]);
   $hue =
      95 - int(100 * $slots[$i]/$total_files);
   if ($hue < 0 )
   {
      $hue = 0;
   }
   $x = $i * 6;

   # display the rectangle  
   print qq!\t<rect x="$x" y="20" !,
         qq!width="6" height="20"\n!;
   print qq!\t\tstyle="fill: !,
         qq!rgb(${hue}%, ${hue}%, 100%);"/>\n!;
}
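The color arithmetic is easier to follow with a few numbers plugged in. This sketch wraps the calculation in a function named hue_for (a hypothetical name, not part of the original program) and prints the fill for some made-up counts out of 200 files.

```perl
use strict;
use warnings;

# hue_for is a hypothetical helper name wrapping the calculation above.
sub hue_for {
    my ($count, $total) = @_;
    my $hue = 95 - int(100 * $count / $total);
    return $hue < 0 ? 0 : $hue;    # more than 95 percent of the files: clamp to 0
}

# Made-up counts out of 200 files.
for my $count (1, 40, 100, 200) {
    my $hue = hue_for($count, 200);
    print "$count of 200 files -> rgb(${hue}%, ${hue}%, 100%)\n";
}
```

A single file out of 200 still yields 95 percent red and green, which is what keeps a sparsely populated slot distinguishable from an empty white one.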

Now, we'll draw the enclosing bar and dividers:

print qq!\t<use xlink:href="#empty-bar"/>\n!;
print qq!\t<use xlink:href="#gray-dividers"/>\n!;

Finally, we need to label the bar with file sizes in kilobytes. We'll label the left, middle, and end only.

print qq!\t<g style="font-size: 10px; !,
   qq!text-anchor:middle;">\n!;

print qq!\t\t<text x="0" y="50">!;
print "0";
print qq!</text>\n!;

print qq!\t\t<text x="300" y="50">!;
print int($max_file_size/2048), "k";
print qq!</text>\n!;

print qq!\t\t<text x="600" y="50">!;
print int($max_file_size/1024), "k";
print qq!</text>\n!;

print qq!\t</g>\n!;
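As a rough check of the label arithmetic, here is a sketch that computes the three labels for a made-up maximum file size of 204,800 bytes. The table of positions and fractions is an assumption introduced for the example; the original program writes each label out separately.

```perl
use strict;
use warnings;

my $max_file_size = 204_800;   # made-up value: 200 kilobytes

# Each entry pairs an x position on the 600-unit bar with the
# fraction of the maximum file size that the position represents.
my @positions = ([0, 0], [300, 0.5], [600, 1]);

my @labels;
for my $pos (@positions) {
    my ($x, $fraction) = @$pos;
    my $kb = int($max_file_size * $fraction / 1024);
    # the left end is always labeled with a bare zero
    push @labels, [ $x, $fraction ? "${kb}k" : "0" ];
}

for my $label (@labels) {
    print qq!\t\t<text x="$label->[0]" y="50">$label->[1]</text>\n!;
}
```

Dividing half the maximum by 1024 gives the same result as dividing the maximum by 2048, which is how the middle label is computed in the listing above.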

The next bit of Perl ends the program by closing the SVG elements that are still open.

print qq!</g>\n!;
print qq!</svg>\n!;

Here is the result of our program when it is applied to the /usr/include directory.

[Figure: The Histogram Graph.]


Adding Numbers to the Graphic

At this point, you may be thinking, "So? I could create a similar graphic in a JPEG or a PNG format with many of the tools available in Perl. What has SVG added to the party?"

Well, the graphic gives you a good idea of the distribution, but doesn't give you as much information as you might want. Your eye can't easily distinguish between a segment holding 15 percent of the files and 20 percent of the files. On a normal histogram, we could get this information from the y-axis. On this graph, we're going to put the number of files for each section inside the rectangle. If you're about to object that the numbers will be illegibly small, please hold the objection--and that thought--for a little while.

The code for doing this has to make a few decisions. First, if the rectangle's color is fairly intense, we want white text instead of black text so that it shows up clearly. Second, if there are fewer than ten files in a range, we'll center the digit in the rectangle. Otherwise, we'll squeeze all the digits into a five-unit space by using the textLength attribute. We'll set the lengthAdjust attribute to have the value spacingAndGlyphs, which means that SVG will shrink both the spacing between characters and the width of the characters themselves in order to fit them into the given textLength.

As an example, here's the SVG for two rectangles, one containing the number 437 and the other containing the number 5:

<rect x="96" y="20" width="6" height="20"
   style="fill: rgb(68%, 68%, 100%);"/>
<text x="96" y="34" textLength="5"
   lengthAdjust="spacingAndGlyphs"
   style="fill: black;">
   437
</text>

<rect x="102" y="20" width="6" height="20"
    style="fill: rgb(95%, 95%, 100%);"/>
<text x="105" y="34"
    style="text-anchor:middle; fill: black;">
    5
</text>

Here's the additional Perl code needed to bring this about; it goes inside the for loop, and results in the following graphic.

if ($slots[$i] < 10)
{
   print
   qq!\t<text x="!, $x+3, qq!" y="34" !,
      qq!style="text-anchor:middle; fill: !,
      ($hue < 30) ? "white" : "black",
      qq!;">$slots[$i]</text>\n\n!;
}
else
{
   print
   qq!\t<text x="!, $x, qq!" y="34" !,
      qq!textLength="5" !,
      qq!lengthAdjust="spacingAndGlyphs"\n!,
      qq!\t\t!,
      qq!style="fill: !,
      ($hue < 30) ? "white" : "black",
      qq!;">$slots[$i]</text>\n\n!;
}
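The two branches differ only in how the text is positioned, so the decisions can be sketched as a single function that returns the element as a string. The name text_label is hypothetical and not part of the original program. Returning a string instead of printing makes it easy to check which choices were made for a given count and color.

```perl
use strict;
use warnings;

# text_label is a hypothetical helper, not part of the original program.
sub text_label {
    my ($x, $count, $hue) = @_;
    # intense (dark) rectangles get white text
    my $fill = ($hue < 30) ? 'white' : 'black';
    if ($count < 10) {
        # single digit: center it in the 6-unit-wide rectangle
        my $mid = $x + 3;
        return qq!<text x="$mid" y="34" !
             . qq!style="text-anchor:middle; fill: $fill;">$count</text>!;
    }
    # several digits: squeeze them into 5 units
    return qq!<text x="$x" y="34" textLength="5" !
         . qq!lengthAdjust="spacingAndGlyphs" !
         . qq!style="fill: $fill;">$count</text>!;
}

print text_label(102, 5, 95), "\n";     # one digit on a pale rectangle
print text_label(96, 437, 68), "\n";    # three digits, black text
print text_label(0, 1200, 10), "\n";    # four digits on a dark rectangle
```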

[Figure: Histogram with number of files in each section.]

As you can see, the numbers are indeed too small to read. (If you keep reading it, you'll go blind.) If we were creating a PNG or a JPEG graphic, we'd be stuck with a useless graph. Zooming in to see a larger image would give us something like this:

[Figure: What zooming in on a PNG or JPEG would look like.]

However, SVG creates scalable graphics drawn by vectors (lines), not by individual pixels. This means we can load the SVG file into our viewer program, zoom in on it, and see the numbers clearly. This technique, by the way, could be used for map data--you could download a map containing all the information for an area of a city, and zoom in to the details without having to make repeated trips to the server.

[Figure: Zooming in on the SVG.]

And there you have it. The file sizegraph.txt shows the entire program.

We've used a little bit of Perl, a little bit of imagination, and the scalability of vector graphics to come up with a compact display of file information.

Further References

Of course, the authoritative source for all things SVG is the World Wide Web Consortium's SVG page. For more details about SVG, you can read my book, SVG Essentials.

If you need an SVG viewer program, you can obtain a plug-in from Adobe. A viewer program and SVG tools are also available from the Apache Batik project.

If you're looking for some other way to use Perl to generate SVG, you'll find a module for that purpose at the RO IT Systems site. This site also has many interesting examples that showcase the interactive aspects of SVG.

J. David Eisenberg


O'Reilly & Associates recently released (February 2002) SVG Essentials.


Copyright © 2009 O'Reilly Media, Inc.