ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


Hierarchical SQL

by Joe Celko
08/05/2004

In the early days of SQL, one of the objections to the relational approach to data was that it could not handle hierarchies and tree structures. In those days, IMS from IBM was the major storage tool in large corporations. It still is.

IMS is a pure tree structure that the user navigates. This was the basis for all serious, large-scale DP work, so people assumed that if SQL could not handle such structures, it would never be a serious production tool. You might use it for a reporting tool or something off to the side, but not as the main database for the enterprise.

Well, IMS is still out there and it holds a lot of data, but SQL has moved into serious production databases. As it turned out, the problem was not that SQL could not model hierarchies. The problem was programmers had to learn how to model hierarchies. New tools are like that; you surprise yourself when you finally get the feel for them.

There are many different ways to represent trees in SQL and this short article discusses one of them. This material is taken, slightly modified, from Chapter 3 of my new book, Trees & Hierarchies in SQL for Smarties from Morgan-Kaufmann Publishers.

Path Enumeration Models

Related Reading

SQL in a Nutshell
A Desktop Quick Reference
By Kevin Kline

One of the properties of trees is that there is one and only one path from the root to every node in the tree. The path enumeration model stores that path as a string by concatenating either the edges or the keys of the nodes in the path. Searches are done with string functions and predicates on those path strings.

There are two methods for enumerating the paths, edge enumeration and node enumeration. The node enumeration is the most common, and there is little difference in the basic string operations on either model. However, the edge enumeration model has some useful numeric properties.

It is probably a good idea to give the nodes a CHAR(n) identifier of a known size and format to make the path concatenations easier to handle. The other alternative is to use VARCHAR(n) strings, putting a separator character between each node identifier in the concatenation — a character that does not appear in the identifier itself.

To keep the examples as simple as possible, let's use my five person Personnel_OrgChart table and a CHAR(1) identifier column to build a Path Enumeration model.

-- path is a reserved word in SQL-99
-- CHECK() constraint prevents separator in the column.)

CREATE TABLE Personnel_OrgChart
(emp_name CHAR(10) NOT NULL,
 emp_id CHAR(1) NOT NULL PRIMARY KEY
        CHECK(REPLACE (emp_id, '/', '') = emp_id)),
 path_string VARCHAR(500) NOT NULL);


CREATE TABLE OrgChart 
(emp CHAR(10) NOT NULL PRIMARY KEY, 
  boss CHAR(10) DEFAULT NULL REFERENCES OrgChart(emp), 
  salary DECIMAL(6,2) NOT NULL DEFAULT 100.00);

OrgChart 
emp       boss      salary 
===========================
'Albert'  NULL      1000.00
'Bert'    'Albert'   900.00
'Chuck'   'Albert'   900.00
'Donna'   'Chuck'    800.00
'Eddie'   'Chuck'    700.00
'Fred'    'Chuck'    600.00

Another way of representing trees is to show them as nested sets. Since SQL is a set-oriented language, this is a better model than the usual adjacency-list approach you see in most textbooks. Let us define a simple Personnel-Orgchart like this. It would look like this as a directed graph:

        Albert
         /   \
        /     \
    Bert     Chuck 
            /  |  \
           /   |   \
          /    |    \
      Donna  Eddie Fred

Personnel_OrgChart
 emp_name  emp_id   path_string
===============================
'Albert'   'A'     'A'
'Bert'     'B'     'AB'
'Chuck'    'C'     'AC'
'Donna'    'D'     'ACD'
'Eddie'    'E'     'ACE'
'Fred'     'F'     'ACF'

Note that I have not broken the sample table into Personnel (emp_id, path_string) and OrgChart (emp_id, emp_name) tables. This would be a better design, but allow me this bit of sloppiness to make the code simpler to read. REPLACE (<str_exp_1>, <str_exp_2>, <str_exp_3>) is a common vendor string function. It searches the first string expression for all occurrences of the second string expression. If it finds any, it replaces them with the third string expression. The third string expression can be the empty string as in the CHECK () constraint just given.

Another problem is how to prevent cycles in the graph. A cycle would be represented as a path string in which at least one emp_id string appears twice, such as ABCA in my sample table. The following constraint with a subquery can fix this:

 CHECK (NOT EXISTS
        (SELECT *
           FROM Personnel_OrgChart AS D1,
                Personnel_OrgChart AS P1
          WHERE CHAR_LENGTH (REPLACE (D1.emp_id, P1.path_string, ''))
                < (CHAR_LENGTH(P1.path_string)
                   - 1) -- size of one emp_id string
        ))

Another fact about such a tree is that no path can be longer than the number of nodes in the tree.

CHECK ((SELECT MAX(CHAR_LENGTH(path_string))
           FROM Personnel_OrgChart AS P1)
         <= (SELECT COUNT(emp_id) * CHAR_LENGTH(emp_id)
               FROM Personnel_OrgChart AS P2))

This assumes that the emp_id has a fixed length and there are no separators between them in the path_string. Unfortunately, the SQL-92 feature of a subquery in a constraint is not widely implemented yet.

Finding the Depth of the Tree

If you have used a fixed-length emp_id string, then the depth is the length of the path divided by the length of the emp_id string, CHAR_LENGTH(emp_id).

CHAR_LENGTH(path_string)/ CHAR_LENGTH(emp_id)

I have made it easy to compute by using a single-character emp_id code. This is not usually possible in a real tree with several hundred nodes.

If you used a varying length emp_id, then the depth is:

CHAR_LENGTH(path_string) - CHAR_LENGTH (REPLACE (path_string, '/', '')) +1

As explained earlier, the REPLACE() function is not a Standard SQL string function, but it is quite common in actual SQL products. This approach counts the separators.

Searching for Subordinates

Given a parent, find all of the subtrees under it. The immediate solution is this:

SELECT *
   FROM Personnel_OrgChart
  WHERE path_string LIKE '%' || :parent_emp_id || '%';

The problem is that searches with LIKE predicates, whose pattern begins with a % wildcard, are slow, because they usually generate a table scan. Also, note that using _% in the front of the LIKE predicate pattern will exclude the root of the subtree from the answer. Another approach is this query:

SELECT *
   FROM Personnel_OrgChart
  WHERE path_string LIKE (SELECT path_string FROM Personnel_OrgChart
                    WHERE emp_id = :parent_emp_id) || '%';

The subquery will use the indexing on the emp_id column to find the "front part" of the path string from the root to the given parent.

Traveling down the tree is easy. Instead of the % wildcard, use a string of underscore (_) wildcards of the right length. For example, this will find the immediate children of a given parent emp_id:

SELECT *
   FROM Personnel_OrgChart
  WHERE path_string LIKE (SELECT path_string FROM Personnel_OrgChart
                    WHERE emp_id = :parent_emp_id) ||'_';

Many SQL products have a function that will pad a string with repeated copies of an input string or return a string of repeated copies of an input string. For example, SQL Server has a REPLICATE (<character exp>, <integer exp>) and Oracle has LPAD() and RPAD(). This can be useful for generating a search pattern of underscores on the fly.

SELECT *
   FROM Personnel_OrgChart
  WHERE path_string LIKE (SELECT path_string FROM Personnel_OrgChart
                    WHERE emp_id = :parent_emp_id)
                  || REPLICATE ('_', :n);

To find the immediate subordinates, assuming a numeric path string using periods, like the structure of a numeric outline:

SELECT * 
  FROM Personnel_OrgChart
 WHERE path_string LIKE '01.02.01.%'
   AND path_string NOT LIKE '01.02.01.%.%';

The second search condition is there to prevent a table scan and to restrict the results to the immediate subordinates.

Searching for Superiors

Given a node, find all of its superiors. This requires disassembling the path back into the identifiers that constructed it. We can use a table of sequential integers to find the required substrings:

SELECT SUBSTRING (P1.path_string 
            FROM (seq * CHAR_LENGTH(P1.emp_id)) 
             FOR CHAR_LENGTH(P1.emp_id)) AS emp_id
   FROM Personnel_OrgChart AS P1,
        Sequence AS S1  
  WHERE P1.emp_id = :search_emp_id
    AND S1.seq <= CHAR_LENGTH(path_string)/CHAR_LENGTH(emp_id);

The problem is that this does not tell you the relationships among the superiors, only who they are. Those relationships are actually easier to report.

SELECT P2.*
   FROM Personnel_OrgChart AS P1,
        Personnel_OrgChart AS P2
  WHERE P1.emp_id = :search_emp_id
    AND POSITION (P2.path_string IN P1.path_string) = 1;

Deleting a Subtree

Given a node, delete the subtree rooted at that node. This can use the same predicate as for finding the subordinates.

DELETE FROM Personnel_OrgChart
 WHERE path_string LIKE (SELECT path_string FROM Personnel_OrgChart
                   WHERE emp_id = :dead_guy) || '%';

Deleting a Single Node

Once more we have to face the problem that removing a non-leaf node from a tree means that it is no longer a tree. We need to have rules for changing the structure.

Assuming that we simply move everyone up a level in the tree, we can first remove that node emp_id from the Personnel_OrgChart table and then remove that emp_id from the paths of the other nodes.

BEGIN ATOMIC
DELETE FROM Personnel_OrgChart
 WHERE emp_id = :dead_guy;
UPDATE Personnel_OrgChart
   SET path_string = REPLACE (path_string, :dead_guy, '');
END;

There are other methods of rebuilding the tree structure after deleting a node: promoting a subordinate based on some criteria to the newly vacant position or leaving a vacancy (a dummy employee named {{vacant}}) in the organizational chart are options. These options usually require some combination of node deletions and insertions.

Inserting a New Node

When inserting a new node, simply concatenate the new emp_id to the end of the path of its parent node.

INSERT INTO Personnel_OrgChart
VALUES (:new_guy, :new_emp_id,
        (SELECT path_string 
           FROM Personnel_OrgChart 
          WHERE emp_id = :new_guy_boss)
         || :new_emp_id);

This basic statement can work for insertion of a subtree with some modification:

INSERT INTO Personnel_OrgChart
SELECT N1.emp, N1.emp_id,
        (SELECT path_string 
           FROM Personnel_OrgChart 
          WHERE emp_id = :new_tree_boss)
         || N1.emp_id
  FROM NewTree AS N1;

Splitting Up a Path String

The path string contains information about the nodes in the path it represents, so you will often want to split it back into the nodes it represents. This is easier to do if the path string uses a separator character, such as a comma or slash. I will use a slash so this will look like a directory path in UNIX. You will also need a table called Sequence that contains a set of integers from 1 to (n).

CharIndex(<search string>, <target string>, <starting position>) is a vendor version of the Standard SQL function POSITION(<search string> IN <target string>). It begins the search at a position in the target string, thus when the <starting position< = 1, the two are equivalent. One definition is:

CREATE FUNCTION CharIndex (IN search_str VARCHAR(1000), 
    IN target VARCHAR(1000), IN start_point INTEGER) RETURNS INTEGER 
RETURN
   (POSITION (search_str 
              IN SUBSTRING (target FROM start_point)) + start_point -1);

Version 1

SELECT CASE WHEN SUBSTRING('/' || P1.path_string || 
                                      '/' FROM S1.seq FOR 1) = '/'
             THEN SUBSTRING('/' || P1.path_string || '/' FROM (S1.seq +1) 
                FOR CharIndex('/', '/' || P1.path_string || '/', S1.seq +1) 
                   - S1.seq - 1)
             ELSE NULL END AS emp_id
   FROM Sequence AS S1, Personnel_OrgChart AS P1
  WHERE S1.seq BETWEEN 1 AND CHAR_LENGTH('/' || P1.path_string || '/') - 1
    AND SUBSTRING('/' || P1.path_string || '/' FROM S1.seq FOR 1) = '/'

Version 2

This uses the same idea, but with two sequence numbers to bracket the emp_id embedded in the path string. It also returns the position of the subordinate emp_id in the path.

CREATE VIEW Breakdown (emp_id, step_nbr, subordinate_emp_id)
AS
SELECT emp_id, 
       COUNT(S2.seq),
       SUBSTRING ('/' || P1.path_string || '/', MAX(S1.seq || 1)
                   FROM (S2.seq - MAX(S1.seq || 1))
  FROM Personnel_OrgChart AS P1, Sequence AS S1, Sequence AS S2
 WHERE SUBSTRING ('/' || P1.path_string || '/', S1.seq, 1) = '/'
   AND SUBSTRING ('/' || P1.path_string || '/', S2.seq, 1) = '/'
   AND S1.seq < S2.seq
   AND S2.seq <= CHAR_LENGTH(P1.path_string) +1
 GROUP BY P1.emp_id, P1.path_string, S2.seq;

The S1 and S2 copies of Sequence help to locate bracketing pairs of separators. The entire set of substrings located between them is extracted in one step. The trick is to be sure that the left-hand separator of the bracketing pair is the closest one to the second separator. The step_nbr column tells you the relative position of the subordinate employee to the employee in the path.

Version 3

This version is the same as above, but more concise and easy to comprehend.

SELECT SUBSTRING('/' || P1.path_string || '/' 
                  FROM S1.seq +1
                   FOR CharIndex('/', 
                                 '/' || P1.path_string || '/', 
                                 S1.seq +1)- S1.seq - 1) AS node
   FROM Sequence AS S1, Personnel_OrgChart AS P1
  WHERE SUBSTRING('/' || P1.path_string || '/'
                   FROM S1.seq FOR 1) = '/'
    AND seq < CHAR_LENGTH('/' || P1.path_string || '/');

Version 4

Here's another way, using the LIKE predicate:

SELECT SUBSTRING(P1.path_string 
                  FROM seq +1 
                   FOR CharIndex('/', P1.path_string, S1.seq +1) - (S1.seq +1))
   FROM Sequence AS S1
        INNER JOIN 
        (SELECT '/' || path_string || '/' 
           FROM Personnel_OrgChart) AS P1.(path_string)
        ON S1.seq <= CHAR_LENGTH(P1.path_string)
           AND SUBSTRING(P1.path_string 
                         FROM S1.seq 
                          FOR CHAR_LENGTH(P1.path_string)) 
               LIKE '/_%';

The Edge Enumeration Model

So far, we have seen the node enumeration version of the path enumeration model. In the edge enumeration model, the "driving directions" for following the path from the root to each node are integers. The path column contains a string of the edges that make up a path from the root (Albert) to each node, numbering them from left to right at each level in the tree.

Personnel_OrgChart
emp_name edge_path  
==================
'Albert'   '1.'
'Bert'     '1.1.'
'Chuck'    '1.2.' 
'Donna'    '1.2.1.' 
'Eddie'    '1.2.2.' 
'Fred'     '1.2.3.'

For example, Donna is the second child of the first child (Chuck) of the root (Albert). This assigns a partial ordering to the nodes of the trees. The main advantage of this notation is that you do not have to worry about long strings, but there is no real difference in the manipulations. The numbering does give an implied ordering to siblings that might have meaning.

Summary

Path enumeration models have problems with deeper trees and with trees that do not have a natural way of naming the nodes or edges. Maintaining proper constraints can be difficult in actual SQL products because of the lack of support for Full SQL-92 features.

On the plus side, path enumeration is a very fast and natural technique for trees without great depth and frequent changes to the structure. If you perform most of your searches and aggregations from the root down, it can handle surprisingly wide tree structures.

Joe Celko was a member of the ANSI X3H2 Database Standards Committee from 1987 to 1997 and helped write the ANSI/ISO SQL-89 and SQL-92 standards.


Return to ONLamp.com.

Copyright © 2009 O'Reilly Media, Inc.