ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


Generating Database Server-Side Cross Tabulations

by Giuseppe Maxia
12/04/2003

If you are a DBA or a database programmer, your manager may have asked you to prepare a breakdown of employees by location and department, or a breakdown of products by category and provider. If your boss is a techie, instead of a breakdown she might have mentioned a pivot table, or, if she likes to use "the right words," a cross-tabulation.

Cross-tabulations, also known as "pivot tables" or crosstabs, are statistical reports that group data by one field, creating one column for each distinct value of another field. In colloquial terms, this way of representing data is called "breaking down the data by X and Y," where X and Y are the names of two columns in the dataset.

In the good old days before computers were on every desk, you would take the list of employees and manually count how many males and females were in each department. After spreadsheets became a common working tool, you would use some cute formulas to count the values. Advanced spreadsheets available since the early 1990s could do the whole report for you from a bare list.

In its basic form, a crosstab is as simple as this: starting from a list of values, we want to group them by the values of field A and create a column for each distinct value of field B. The desired result is a table with one column for field A and several columns for each value of field B.

Now, you would think that a database engine, being more powerful than a spreadsheet, would solve this problem quite easily. When your dataset is so large that it won't fit into a spreadsheet, you start digging for an SQL solution. Unfortunately, in standard SQL there is no predefined function to accomplish this task. Nonetheless, you can get the job done with some creativity. Let's assume that your bare data resembles this table.

+------------------------------+
|            person            |
+----+--------+--------+-------+
| id | name   | gender | dept  |
+----+--------+--------+-------+
|  1 | John   | m      | pers  |
|  2 | Mario  | m      | pers  |
|  7 | Mary   | f      | pers  |
|  8 | Bill   | m      | pers  |
|  3 | Frank  | m      | sales |
|  5 | Susan  | f      | sales |
|  6 | Martin | m      | sales |
|  4 | Otto   | m      | dev   |
|  9 | June   | f      | dev   |
+----+--------+--------+-------+

Given the column gender, having distinct values m and f, a crosstab between dept and gender would generate a row for each dept, with a column for the dept, one for m, and one for f.

The task is not trivial, even when the distinct values are known in advance and are relatively scarce. A solution that should work for most DBMS engines from this simple table is:

SELECT dept,
    COUNT(CASE WHEN gender ='m' THEN id ELSE NULL END) AS 'm',
    COUNT(CASE WHEN gender ='f' THEN id ELSE NULL END) AS 'f'
 FROM
    person 
 GROUP BY
    dept

It's easy to get discouraged when the possible values increase by a dozen, especially if they may change.

Easy solutions

The easy way of performing a cross-tabulation, as I mentioned before, is to export your data locally, load it into a spreadsheet, and use its pivot table capabilities (also called "data pilot," depending on the flavor of your spreadsheet). This solution, affordable when you have to perform the task just once and with a limited dataset, becomes intolerable when you can count your records by the millions. In such cases, a spreadsheet is not the proper tool, and you should turn your attention to DBMS calculation skills.

A hybrid solution would be to export only aggregated data and finish the job with a spreadsheet. Again, this is acceptable when you have to do it once or twice, but if you need those tables as support for your data warehouse, then you are putting too much effort on the client side, and perhaps devoting too much time dealing with macros and supporting code, to accomplish your task in a manner that looks smooth to the final user. There are, of course, specialized tools that take the burden off of your shoulders to do the dirty work for you. The good ones are on the server side and give you the final product without sweating. The bad ones are often colorful and rich in features but they shift the workload to the client machine, resulting in a slow process that makes everybody unhappy.

A viable alternative, then, is a server-side cross-tabulation calculated by the database server. For programmers unaccustomed to SQL, it may seem a Herculean task, and even the smart programmer may look at the complexity ahead with a wary eye. What is needed is a tool that reduces the complexity of the task — building a sometimes-huge SQL statement — to a few parameters that will be translated into the actual query.

DBIx::SQLCrosstab is a Perl module that can do exactly that. But before plunging into details, it would be better to analyze what a crosstab report exactly is, how complex it can get, and what you can ask the DBMS to do.

Cross-Tabulation Basics

Using the same raw data seen earlier, let's look at some trivial possibilities.

A simple crosstab between dept and gender gives us:

+-------------------------+
|    dept by gender       |
+-------+----+----+-------+
| dept  | m  | f  | total |
+-------+----+----+-------+
| dev   |  1 |  1 |     2 |
| pers  |  3 |  1 |     4 |
| sales |  2 |  1 |     3 |
+-------+----+----+-------+

The query to create this result is:

SELECT
    dept,
    COUNT(CASE WHEN gender = 'm' THEN id ELSE NULL END) AS m,
    COUNT(CASE WHEN gender = 'f' THEN id ELSE NULL END) AS f,
    COUNT(*) AS total
FROM
    person
GROUP BY
    dept

This crosstab is trivial for several reasons:

Before going on to explore further reasons for complexity, I want to explore the basic method of cross-tabulating with SQL, assisted by Perl. Suppose that we don't know in advance which values are stored for gender. In that case, we need to scan the dataset first.

SELECT DISTINCT
    gender
FROM
    person

Then we can take the result values to create the COUNT lines in the query. Here is when a supporting high-level programming language comes in handy. If your DBMS supports stored procedures, you can do that directly on the server, but its portability is questionable, given the variety of dialects in that field.

A Perl implementation of this task should be:

my $columns = $dbh->selectcol_arrayref(
	"SELECT DISTINCT gender FROM person"
) or die "query failed: $DBI::errstr\n";

my $query = "SELECT dept\n";

for (@$columns) {
    $query .= ",COUNT(CASE WHEN gender = '$_' THEN id ELSE NULL END) AS '$_'\n";
}
$query .= ", COUNT(*) AS total\n"
$query .= "FROM person\n"
        . "GROUP BY dept";

The resulting query would be:

SELECT dept
,COUNT(CASE WHEN gender = 'm' THEN id ELSE NULL END) AS 'm'
,COUNT(CASE WHEN gender = 'f' THEN id ELSE NULL END) AS 'f'
,COUNT(*) AS total
FROM person
GROUP BY dept

This is almost the same as the one we did manually — minus some stylistic tidbits perhaps, but it gets the job done.

You must have realized that there are some issues with this already. For example, the header column is hardcoded, as is the total line and the GROUP BY clause. This code, while saving you some keystrokes, is not useful in a general case. You may be tempted to make a template out of it, but wait! There is more to be seen before coding a solution.

Introducing Complex Crosstabs

Cross-tabulations can get complex in several ways, especially when the level of headers increases, and when several operations are performed in the same result. The level of headers can be further complicated when the values in the column headers have internal dependencies. Consider the case where you have a first level of countries and a second level of locations, where a given location can only belong to a single country. I will explain each issue in detail.

Multi-Level Cross-Tabulations

If single-level cross-tabulations haven't worried you, multiple level tables should give you something to ponder. In addition to everything already mentioned, multiple-level crosstabs have:

A B C1 C2 Total column header 1
D1 D2 Total D1 D2 Total column header 2
E1 E2 Total E1 E2 Total E1 E2 Total E1 E2 Total column header 3
A1 B1 - - - - - - - - - - - - - - -  
B2 - - - - - - - - - - - - - - -  
Total - - - - - - - - - - - - - - - row sub total
A2 B1 - - - - - - - - - - - - - - -  
B2 - - - - - - - - - - - - - - -  
Total - - - - - - - - - - - - - - - row sub total
Total --- - - - - - - - - - - - - - - - row total
row
header
1
row
header
2
    col
sub
total
    col
sub
total
col
sub
total
    col
sub
total
    col
sub
total
col
sub
total
col
total

Solving this type of cross-tabulation follows the general guidelines for single-level tabulations. First we collect the values for each column header, and then we produce the lines by combining the distinct values for each row.

Consider the following test database. (A sample DBIx::SQLCrosstab database schema is also available.)

+-------------------+--------------------+
| categories        | countries          |
| ================= | ================== |
| cat_id category   | country_id country |
| ------ ---------- | ---------- ------- |
|      1 employee   |          1      UK |
|      2 contractor |          2 Germany |
|      3 consultant |          3   Italy |
+-------------------+--------------------+
+-------------------+--------------------------+
| depts             | locs                     |
| ================= | ======================== |
| dept_id  dept     | loc_id loc    country_id |
| -------  -------- | ------ ------ ---------- |
|       1  pers     |      1 Rome            3 |
|       2  sales    |      2 London          1 |
|       3  dev      |      3 Munich          2 |
|       4  research |      4 Berlin          2 |
|                   |      5 Bonn            2 |
+-------------------+--------------------------+
+----------------------------------------------+
| person                                       |
| ============================================ |
| id name  dept_id loc_id gender salary cat_id |
| -- ----- ------- ------ ------ ------ ------ |
|  1 John        1      2      m   5000      2 |
|  2 Mario       1      1      m   6000      1 |
|  3 Frank       2      5      m   5000      1 |
|  4 Otto        3      4      m   6000      1 |
|  5 Susan       2      3      f   5500      3 |
|  6 Martin      2      2      m   5500      2 |
|  7 Mary        1      4      f   5500      1 |
|  8 Bill        1      3      m   5000      1 |
|  9 June        3      1      f   6000      3 |
+----------------------------------------------+

We want to get the number of persons by crossing country and location with department and gender.

Step 1. Collect the Values

SELECT DISTINCT dept FROM depts;    # results in 'pers', 'sales, 'dev'
SELECT DISTINCT gender FROM person; # results in 'm', 'f'

Step 2. Assemble the Query

SELECT 
    country, loc AS location,
    COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'pers-f',
    COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'pers-m',
    COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'sales-f',
    COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'sales-m',
    COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'dev-f',
    COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'dev-m',
    COUNT(*) AS total
FROM
    person
INNER JOIN
	depts ON (person.dept_id=depts.dept_id)
INNER JOIN
	locs ON (locs.loc_id=person.loc_id)
INNER JOIN
	countries ON (locs.country_id=countries.country_id)
GROUP BY
    country, loc

You can see that making this task automatic is a bit trickier than the previous one. Notice that now the column names can be composite, so we need to add a character to separate their components. The choice of such a character can be difficult if the column values already contain non-alphabetic letters. Notice I said "values," not "names," since the values from a column become column names in the crosstab. In addition, if you want subtotals by column, you need to add a count by department in the appropriate places, and our query would become:

SELECT 
    country, loc AS location,
    COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'pers-f',
    COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'pers-m',
    COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END)
	    AS 'pers',
    COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'sales-f',
    COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'sales-m',
    COUNT(CASE WHEN dept = 'sales' THEN id ELSE NULL END)
	    AS 'sales',
    COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'dev-f',
    COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'dev-m',
    COUNT(CASE WHEN dept = 'dev'  THEN id ELSE NULL END)
	    AS 'dev',
    COUNT(*) AS total
FROM
    person
INNER JOIN
	depts ON (person.dept_id=depts.dept_id)
INNER JOIN
	locs ON (locs.loc_id=person.loc_id)
INNER JOIN
	countries ON (locs.country_id=countries.country_id)
GROUP BY
    country, loc

The result is here (don't worry about the formatting, for now. We'll come to it soon).

country location pers sales dev total
f m total f m total f m total
Germany Berlin 1 0 1 0 0 0 0 1 1 2
Bonn 0 0 0 0 1 1 0 0 0 1
Munich 0 1 1 1 0 1 0 0 0 2
Italy Rome 0 1 1 0 0 0 1 0 1 2
UK London 0 1 1 0 1 1 0 0 0 2

What about row subtotals and total? You can calculate them in the client, but assuming that you want them all at once, you can use a UNION query, such as this monster:

SELECT 
    country, loc AS location,
    COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'pers-f',
    COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'pers-m',
    COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END)
	    AS 'pers',
    COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'sales-f',
    COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'sales-m',
    COUNT(CASE WHEN dept = 'sales' THEN id ELSE NULL END)
	    AS 'sales',
    COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'dev-f',
    COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'dev-m',
    COUNT(CASE WHEN dept = 'dev'  THEN id ELSE NULL END)
	    AS 'dev',
    COUNT(*) AS total
FROM
    person
INNER JOIN
	depts ON (person.dept_id=depts.dept_id)
INNER JOIN
	locs ON (locs.loc_id=person.loc_id)
INNER JOIN
	countries ON (locs.country_id=countries.country_id)
GROUP BY
    country, location
UNION
SELECT 
    country, 'zzzz' AS location,
    COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'pers-f',
    COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'pers-m',
    COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END)
	    AS 'pers',
    COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'sales-f',
    COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'sales-m',
    COUNT(CASE WHEN dept = 'sales' THEN id ELSE NULL END)
	    AS 'sales',
    COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'dev-f',
    COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'dev-m',
    COUNT(CASE WHEN dept = 'dev'  THEN id ELSE NULL END)
	    AS 'dev',
    COUNT(*) AS total
FROM
    person
INNER JOIN
	depts ON (person.dept_id=depts.dept_id)
INNER JOIN
	locs ON (locs.loc_id=person.loc_id)
INNER JOIN
	countries ON (locs.country_id=countries.country_id)
GROUP BY
    country, location
UNION 
SELECT 
    'zzzz' AS country, 'zzzz' AS location,
    COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'pers-f',
    COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'pers-m',
    COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END)
	    AS 'pers',
    COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'sales-f',
    COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'sales-m',
    COUNT(CASE WHEN dept = 'sales' THEN id ELSE NULL END)
	    AS 'sales',
    COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
	    AS 'dev-f',
    COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
	    AS 'dev-m',
    COUNT(CASE WHEN dept = 'dev'  THEN id ELSE NULL END)
	    AS 'dev',
    COUNT(*) AS total
FROM
    person
INNER JOIN
	depts ON (person.dept_id=depts.dept_id)
ORDER BY
    country, location

The first UNION creates the subtotals by country. I replaced the location name with zzzz to be sure that it will get sorted at the end of the normal rows for each country. Similarly, the second UNION has both country and location replaced by zzzz, so that they will be listed at the very end of the result set. The first UNION subquery will create one row for each country. The last subquery will create only one row, for the grand total. Notice that the last UNION subquery doesn't have a GROUP BY clause. The ORDER BY clause at the end of the query it will affect the whole result set.

The result, nicely formatted, with the zzzz translated into a more readable total, is this:

country location pers sales dev total
f m total f m total f m total
Germany Berlin 1 0 1 0 0 0 0 1 1 2
Bonn 0 0 0 0 1 1 0 0 0 1
Munich 0 1 1 1 0 1 0 0 0 2
Total 1 1 2 1 1 2 0 1 1 5
Italy Rome 0 1 1 0 0 0 1 0 1 2
Total 0 1 1 0 0 0 1 0 1 2
UK London 0 1 1 0 1 1 0 0 0 2
Total 0 1 1 0 1 1 0 0 0 2
Total Total 1 3 4 1 2 3 1 1 2 9

Automating this task is much harder than a simple crosstab, and the full explanation is beyond the scope of this article. Stay tuned to Perl Monks if you are interested in the implementation details.

Side-by-side, Multiple Operation Cross-Tabulations

A crosstab can do much more. When I worked for a large international organization, I was often asked to provide counts, summaries, and averages in the same table. If you are doing this manually, it just means cutting and pasting the column calculation lines in the query, replacing COUNT with SUM or whichever function is suitable (MIN, MAX, AVG, STD) and adding an appropriate item to the column name.

Here is an example of such a table.

country location count sum
pers sales dev total pers sales dev total
f m total f m total f m total f m total f m total f m total
Germany Berlin 1 0 1 0 0 0 0 1 1 2 5,500 0 5,500 0 0 0 0 6,000 6,000 11,500
Bonn 0 0 0 0 1 1 0 0 0 1 0 0 0 0 5,000 5,000 0 0 0 5,000
Munich 0 1 1 1 0 1 0 0 0 2 0 5,000 5,000 5,500 0 5,500 0 0 0 10,500
Total 1 1 2 1 1 2 0 1 1 5 5,500 5,000 10,500 5,500 5,000 10,500 0 6,000 6,000 27,000
Italy Rome 0 1 1 0 0 0 1 0 1 2 0 6,000 6,000 0 0 0 6,000 0 6,000 12,000
Total 0 1 1 0 0 0 1 0 1 2 0 6,000 6,000 0 0 0 6,000 0 6,000 12,000
UK London 0 1 1 0 1 1 0 0 0 2 0 5,000 5,000 0 5,500 5,500 0 0 0 10,500
Total 0 1 1 0 1 1 0 0 0 2 0 5,000 5,000 0 5,500 5,500 0 0 0 10,500
Total Total 1 3 4 1 2 3 1 1 2 9 5,500 16,000 21,500 5,500 10,500 16,000 6,000 6,000 12,000 49,500

Of course, we keep adding more and more lines to our query to achieve our goal. Even for our simple example database, using a three-level column header with all the available functions creates a query with more than 200 columns. That's not something you'd like to do manually.

Column-Value Interdependency

There is more to the crosstab complexity. The method we've seen for combining multiple-level column values won't work well if two or more levels are linked together by some dependency. Let's suppose we want to partially transpose one of the examples above, using the department and category as row headers, and country and location as column headers. A brute-force permutation of the values would produce impossible combinations, such as Germany#Rome, Italy#London, or UK#Munich, which are both a waste of calculation resources and a nuisance in the result set.

For such cases, a different strategy is necessary. Instead of gathering the columns separately, we need to collect them all at once:

SELECT DISTINCT
    country, loc AS location, gender
FROM
    person
    INNER JOIN locs ON (person.loc_id=locs.loc_id)
    INNER JOIN countries ON (locs.country_id=countries.country_id) 
ORDER BY
    country, location, gender

+---------+----------+--------+
| country | location | gender |
+---------+----------+--------+
| Germany |  Munich  | f      | 
| Germany |  Munich  | m      | 
| Germany |  Bonn    | f      | 
| Germany |  Bonn    | m      | 
| Germany |  Berlin  | f      | 
| Germany |  Berlin  | m      | 
| Italy   |  Rome    | f      | 
| Italy   |  Rome    | m      | 
| UK      |  London  | m      | 
+---------+----------+--------+

Each row in the result represents a combination of values to create a column in our query. If subtotals are required, then the result is first loaded into a tree, which is the main mechanism used by DBIx::SQLCrosstab to produce the necessary permutations with subtotals as needed.

With so many issues to consider, you can still do your queries manually. Perhaps, having good organizational skills and a powerful editor, you may be able to create your queries in a few minutes. However, if your application needs crosstabs on demand, as a data warehouse would, then you should consider using a specialized tool that can create queries at runtime.

Using DBIx::SQLCrosstab

To use DBIx::SQLCrosstab, you need to download it from CPAN and install it in your system. Its only prerequisite is Tree::DAG_Node. If you want some specialized output, then you may also optionally install YAML and Spreadsheet::WriteExcel.

A DBIx::SQLCrosstab object requires an argument containing the parameters to build the query, passed as a hash reference.

my $xtab = DBIx::SQLCrosstab->new($params)
    or die "error building object ($DBIx::SQLCrosstab::errstr)\n";

For every method in the module that is returning something, you can check the variable $DBIx::SQLCrosstab::errstr, which will contain the reason for failure.

Templating Your Query

A few parameters are required to build our query. Let's create a set of them to build the crosstab seen in the above "Multi-level cross-tabulations" section.

my $params = {
    dbh        => $dbh,     # a database handler
    op         => 'COUNT',  # the operation to perform
    op_col     => 'id',     # which column to count
    from       =>  qq{ 
        person 
        inner join locs ON (locss.loc_id=person.loc_id)
        inner join countries ON (countries.country_id=locs.country_id)
    },
    # ... 
};

The database handler identifies the database where we operate. op is the operation to perform, such as COUNT, SUM, or AVG. op_col is the column to count. It could be any non-null column for a count, or an appropriate numeric column for summaries and averages. The from item is the same thing you would indicate in the FROM clause in a SQL statement. You need to indicate any INNER or OUTER join as necessary, so that all of the values you later indicate for columns or row headers are properly referenced. Next comes the definition of rows and column headers. Each one is an array reference, as follows:

my $params = {
    # ... 
    rows   => [
                { col => 'country', alias => 'country' },
                { col => 'loc',     alias => 'location' }
    ],
};

For each header, only one item (col) is mandatory, while alias is just for readability purposes. Let me remind you that such columns are related to the from item, which must have the appropriate JOIN parts to include the table containing such columns.

$params = {
    # ... 
    cols   => [
                { id => 'dept_id', value => 'dept',     from => 'depts' },
                { id => 'cat_id',  value => 'category', from => 'categories' },
                {
                    id       => 'gender',
                    from     => 'person'
                    col_list => [
                        { id => 'f', value => 'f' },
                        { id => 'm', value => 'm' }
                    ],
                }
    ],
};

For column headers, the mandatory fields are id and from. Any value is used to create the final column name. You may notice that while depts and categories are in the column list, they are not in the main from description. This is an optimization that is further described in "Inner JOINs" later. The description of gender has another peculiarity. It provides a list of values to be used by the engine without querying the database. This is also an optimization, useful for such cases when you know the values in advance (for example, when a CONSTRAINT in your query limits the values to a given set).

Now, to get the exact result as our example below, we need to add a few optional parameters:

$params = {
    # ...
    col_total      => 1,   # adds a total column at the right end of the table
    col_sub_total  => 1,   # adds sub-totals columns where appropriate
    row_total      => 1,   # adds a grand total at the bottom end of the table
    row_sub_total  => 1,   # adds sub-totals rows where appropriate
    remove_if_zero => 1,   # remove any column with all zeroes
    add_colors     => 1,   # display the table with default colors
    commify        => 1,   # adds thousand-separating commas to numbers
};

Column-Header Choice Strategies

Spending a few minutes at the drawing desk before actually writing your query parameters could spare you time and headaches afterwards. If you are at liberty to choose, try to designate those columns with larger numbers of values as row headers, since the DBMS will handle them better. Assigning them as row headers will increase the number of final columns to generate, requiring more calculation effort and making the table less readable. It's also worth it to assign columns as row headers when the column depends on other columns. Although the module can handle these cases just fine, the strain on the DBMS is higher, because in several cases, you need to query the main data set twice: once to find the value and once to get the final results.

Sometimes, though, you need a JOINed query to find the appropriate values and save resources. Consider the case of a lookup table for categories, where you have a few hundred values. If you only need to use a dozen, it would be better to define the column header with a from clause, such as:

$params =  {
    # ...
    cols => [
                {
                    id    => 'cat_id',
                    value => 'category',
                    from  => qq{
                                categories 
                                INNER JOIN person 
                                ON (categories.cat_id=person.cat_id')
                                }
                },
            # ...
    ],
};

This way, only the relevant values are chosen, sparing the DBMS engine the useless calculation of a few hundred unused categories.

When the values of a column header come from the main data set, you may consider running the query once, finding the distinct values, and then providing them as a col_list for the subsequent runs. This practice is also useful when you only need to crosstab a specific subset.

Fine-Tuning Your Query

There are many parameters that can be passed to your DBIx::SQLCrosstab object, and you can use them to keep control of what is going to happen.

The most important options are where and having. They will limit the recordset to the records you want. There is no limitation to what you can require in these options, provided that they are legal SQL syntax that apply to the tables being queried. You may want, for instance, to limit your query for employees to the ones hired in the last month. If there is a field identifying the start of contract date, and your database engine supports date calculation, you can add a where clause saying "start_of_contract >= now() - interval 30 days" or simply "start_of_contract >= 'YYYY-MM-DD' ", depending on the particular SQL dialect of your DBMS. With having clauses, you can limit the result set by stating which criteria the aggregated values must have, so you can say something like "having => pers > 10000" to define that you want only those rows where a sum of salaries for pers is bigger than 10000.

$params =  {
    # ...
    where   => 'start_of_contract >= now() - interval 30 days',
    having  => 'pers > 10000',
};

When defining the column headers, we have already seen that we can explicitly define a list of values. Or we can explicitly exclude one or more values, to instruct the engine to retrieve all of them from the database, minus the ones that we list:

$params = {
    # ...
    cols => [
                { id => 'dept_id', value=> 'dept',
                  exclude_values => [ 'dev', 'pers' ] 
                },
                #...
     ]
};

If we modify the previous request with this definition for the dept header, the engine will only retrieve values for sales department. Another method of excluding columns is to use the general option col_exclude, which removes one column from the query. You need to figure out how the query names would be created, and then you may say:

$params = {
    # ...
    col_exclude => [ 'pers#contractor#f', "sales#consultant#m" ]
    # ...
};

These rows were present in our example query, but their results were thrown out because all the values were zeroes. If we know that for sure, then we can improve the query execution speed by removing them in the first place.

To exclude a value for a row header, use a where clause. For example, "where => country != 'Germany'".

Keeping an Eye on Performance Issues

When using a wrapper module to develop database applications, you usually trade ease of use for performance. DBIx::SQLCrosstab is not a true wrapper, because it is not a replacement for the whole database interface. Instead, it's a query builder — a very specific one, but a builder nonetheless. Most builders are peculiar in that they create queries according to some general rules, and the resulting statements often perform poorly. This module is no exception. You can use all of the options offered, but you need to understand the possible drawbacks.

A sensitive issue surrounds subtotals. Regarding column headers, a subtotal is not much of a problem, since it is only a few more calculated fields in the query. If your query grows to several hundred columns, though, it may become a problem. Keep this in mind. Also consider the purpose of your result. If you need to feed the result to a chart generator, you probably don't need subtotals. They could be misleading, anyway, if they are treated as normal columns and therefore generate wrong sums. The same principle applies if you want to export your results to a spreadsheet to perform further calculations.

Row totals and subtotals are a different case. While the same caveats apply when you want to export to a chart, the performance penalty of calculating subtotals can be huge, because DBIx::SQLCrosstab generates a UNION query for each row header (if there are more than one), plus another UNION query for the grand total. Any decent database server can handle this task for a few hundred thousand records. When it comes to millions, however, you may be kept waiting for much longer than you expected. If you have very large databases, make some measurements (even better, do it when normal users are disconnected), and decide if you want to trade waiting time for the additional hassle of dealing with the subtotals outside of the server. I have successfully tested subtotals in a four-million-record database, and the results came in acceptable times. It could be that the design of my database is better than average, or my DBMS is inherently optimized for these tasks. Either way, I can't predict the effects of using subtotal queries on a huge database with ten or twenty million records.

I'd like giving a piece of general advice for dealing with large crosstabs, and this is to index all of the columns involved in the operation -- both the headers and the one being calculated. The optimization in most database engines can take advantage of indexes in most of the operations related to cross-tabulation: DISTINCT queries, GROUP BY clauses, summaries, and JOINs.

Hidden JOINs

In the example given when explaining the parameters, I mentioned an internal optimization in DBIx::SQLCrosstab to minimize the number of JOINs necessary to perform the query. This feature is related to column headers whose values come from a lookup table. If you indicate a key column that is in both the main table and the lookup table, then you don't need to join the table in the final query.

Let me explain better. In our example, we used a column header description for departments including:

$params = {
    # ...
    cols => [
                {id => 'dept_id', value ='dept' },
                # ...
    ],
};

The resulting line in the query is:

COUNT(CASE WHEN dept_id = 1 THEN id ELSE NULL END) AS 'pers',

Since dept_id is in both person and depts, you don't need to add depts to your from clause, because the values for depts are already stored in the crosstab engine. I call this an "hidden join," because it performs the same link as a JOIN but without its burden.

On the contrary, if your definition was:

$params = {
    # ...
    cols => [
                {id => 'dept' },
                # ...
    ],
};

Then the resulting SQL would have been:

COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END) AS 'pers',

Since dept is not a column of person, the you should have been forced to include depts in your from clause, thus querying the table twice. Therefore, your final query, the expensive one, would be slower. Think about this possibility whenever circumstances allow it.

Building Multiple Column Headers at Once

When talking about column interdependency, I said that DBIx::SQLCrosstab can deal with these cases appropriately. You can instruct the module to use this method by adding a group option to the first column definition. When this option is selected, the crosstab engine uses the from clause of the first column definition to build the entire headers tree.

$params = {
    # ...
    cols  => [
                { 
                  id      => 'country', 
                  from    =>  q{
                                person INNER JOIN locs  
                                ON (person.loc_id=locs.loc_id) 
                                INNER JOIN countries  
                                ON (countries.country_id = locs.country_id)
                                },
                  group   =>  1,
                  orderby =>  'country, loc, gender',
                },
                { id      => 'loc',     from    => '1' },
                { id      => 'gender',  from    => '1' },
    ],
};

When the crosstab engine finds this definition, instead of trying to retrieve the column values separately, it will create a single query. Notice that the from clause in the second and third column definitions is just 1, as a placeholder, because only the first definition is used for that purpose.

SELECT DISTINCT
    country, loc, gender
FROM
    person
    INNER JOIN locs ON (person.loc_id=locs.loc_id)
    INNER JOIN countries ON (locs.country_id=countries.country_id) 
ORDER BY
    country, loc, gender

It will use the result as a list of permutations, from which it will create the different combinations.

The drawback of this method is that you might possibly query the main data set twice. You need to balance the benefits of having just the right column calculation against calculating all of the combinations and then removing the unwanted ones using the option remove_if_zero.

You can also use this method when there is no dependency between column headers, but you just found out that the engine is creating more combinations than necessary.

Exporting Results

The only useful results you can get from DBIx::SQLCrosstab are the query and the result set.

my $query = $xtab->get_query() 
    or die "Error $DBIx::SQLCrosstab::errstr\n";

my $records = $xtab->get_recs()
    or die "Error $DBIx::SQLCrosstab::errstr\n";

Sometimes, this is just what you need. If you plan to use your query in a static environment, all you want is to create the query and then paste it into your application code. Or you may want to do some checking in the records and display what you want in a custom routine.

In most cases, though, you may need more than that. Therefore, instead of using DBIx::SQLCrosstab, you can use a derived class, DBIx::SQLCrosstab::Format, which has exactly the same interface, with additional capability to format your results in many different ways.

Thus, instead of creating the object as in the previous example, you'd do it using the child class:

my $xtab = DBIx::SQLCrosstab::Format->new( $params )
    or die "...";

Then, you have the choice about the format of your results.

# first, be sure that everything is OK
if ( $xtab->get_query() and $xtab->get_recs() )
{
    my $html = $xtab->as_html() or die "error ...";
    print $html;
}

The method as_html() returns a well-formatted HTML table, properly enriched with column and row spans, optionally colored with the colors of your choice (see the module documentation for details).

HTML is not the only format. This module can also handle XML, YAML, XLS (Excel spreadsheet), and CSV formats, with the relative methods:

Being a Perl module, it is also quite easy to export Perl data structure, so you can use the method as_perl_struct() with one of the following parameters:

The actual list of formats is over, but the possibilities are much wider. The module provides some handles that a skilled programmer can use to create more specialized views. I will describe these issues in another article.

To get an idea of the module potential, you can play with an interactive example, which can show you the parameters for complex crosstabs and some interesting output formats.

Then it's your turn. Do your own testing and put DBIx::SQLCrosstab to good use!

Thanks to Stefano Rodighiero (a.k.a. larsen) for proofreading this article and for his constructive criticism. Thanks also to chromatic for making a true article from a difficult draft.

Giuseppe Maxia is a QA developer in MySQL community team. A system analyst with 20 years of IT experience, he has worked as a database consultant and designer for several years. He is a frequent speaker at open source events and he's the author of many articles. He lives in Sardinia (Italy).


Return to ONLamp.com.

Copyright © 2009 O'Reilly Media, Inc.