Hierarchical SQL
by Joe Celko08/05/2004
In the early days of SQL, one of the objections to the relational approach to data was that it could not handle hierarchies and tree structures. In those days, IMS from IBM was the major storage tool in large corporations. It still is.
IMS is a pure tree structure that the user navigates. This was the basis for all serious, large-scale DP work, so people assumed that if SQL could not handle such structures, it would never be a serious production tool. You might use it for a reporting tool or something off to the side, but not as the main database for the enterprise.
Well, IMS is still out there and it holds a lot of data, but SQL has moved into serious production databases. As it turned out, the problem was not that SQL could not model hierarchies. The problem was programmers had to learn how to model hierarchies. New tools are like that; you surprise yourself when you finally get the feel for them.
There are many different ways to represent trees in SQL and this short article discusses one of them. This material is taken, slightly modified, from Chapter 3 of my new book, Trees & Hierarchies in SQL for Smarties from Morgan-Kaufmann Publishers.
Path Enumeration Models
|
Related Reading
SQL in a Nutshell |
One of the properties of trees is that there is one and only one path from the root to every node in the tree. The path enumeration model stores that path as a string by concatenating either the edges or the keys of the nodes in the path. Searches are done with string functions and predicates on those path strings.
There are two methods for enumerating the paths, edge enumeration and node enumeration. The node enumeration is the most common, and there is little difference in the basic string operations on either model. However, the edge enumeration model has some useful numeric properties.
It is probably a good idea to give the nodes a CHAR(n)
identifier of a known size and format to make the path concatenations easier to
handle. The other alternative is to use VARCHAR(n)
strings, putting a separator character between each node identifier in the
concatenation — a character that does not appear in the identifier
itself.
To keep the examples as simple as possible, let's use my five person
Personnel_OrgChart table and a CHAR(1) identifier
column to build a Path Enumeration model.
-- path is a reserved word in SQL-99
-- CHECK() constraint prevents separator in the column.)
CREATE TABLE Personnel_OrgChart
(emp_name CHAR(10) NOT NULL,
emp_id CHAR(1) NOT NULL PRIMARY KEY
CHECK(REPLACE (emp_id, '/', '') = emp_id)),
path_string VARCHAR(500) NOT NULL);
CREATE TABLE OrgChart
(emp CHAR(10) NOT NULL PRIMARY KEY,
boss CHAR(10) DEFAULT NULL REFERENCES OrgChart(emp),
salary DECIMAL(6,2) NOT NULL DEFAULT 100.00);
OrgChart
emp boss salary
===========================
'Albert' NULL 1000.00
'Bert' 'Albert' 900.00
'Chuck' 'Albert' 900.00
'Donna' 'Chuck' 800.00
'Eddie' 'Chuck' 700.00
'Fred' 'Chuck' 600.00
Another way of representing trees is to show them as nested sets. Since SQL is a set-oriented language, this is a better model than the usual adjacency-list approach you see in most textbooks. Let us define a simple Personnel-Orgchart like this. It would look like this as a directed graph:
Albert
/ \
/ \
Bert Chuck
/ | \
/ | \
/ | \
Donna Eddie Fred
Personnel_OrgChart
emp_name emp_id path_string
===============================
'Albert' 'A' 'A'
'Bert' 'B' 'AB'
'Chuck' 'C' 'AC'
'Donna' 'D' 'ACD'
'Eddie' 'E' 'ACE'
'Fred' 'F' 'ACF'
Note that I have not broken the sample table into Personnel (emp_id,
path_string) and OrgChart (emp_id, emp_name) tables. This
would be a better design, but allow me this bit of sloppiness to make the code
simpler to read. REPLACE (<str_exp_1>,
<str_exp_2>, <str_exp_3>) is a common vendor
string function. It searches the first string expression for all occurrences
of the second string expression. If it finds any, it replaces them with the
third string expression. The third string expression can be the empty string
as in the CHECK () constraint just given.
Another problem is how to prevent cycles in the graph. A cycle would be
represented as a path string in which at least one emp_id string
appears twice, such as ABCA in my sample table. The following
constraint with a subquery can fix this:
CHECK (NOT EXISTS
(SELECT *
FROM Personnel_OrgChart AS D1,
Personnel_OrgChart AS P1
WHERE CHAR_LENGTH (REPLACE (D1.emp_id, P1.path_string, ''))
< (CHAR_LENGTH(P1.path_string)
- 1) -- size of one emp_id string
))
Another fact about such a tree is that no path can be longer than the number of nodes in the tree.
CHECK ((SELECT MAX(CHAR_LENGTH(path_string))
FROM Personnel_OrgChart AS P1)
<= (SELECT COUNT(emp_id) * CHAR_LENGTH(emp_id)
FROM Personnel_OrgChart AS P2))
This assumes that the emp_id has a fixed length and there are
no separators between them in the path_string. Unfortunately, the
SQL-92 feature of a subquery in a constraint is not widely implemented yet.
Finding the Depth of the Tree
If you have used a fixed-length emp_id string, then the depth
is the length of the path divided by the length of the emp_id
string, CHAR_LENGTH(emp_id).
CHAR_LENGTH(path_string)/ CHAR_LENGTH(emp_id)
I have made it easy to compute by using a single-character
emp_id code. This is not usually possible in a real tree with
several hundred nodes.
If you used a varying length emp_id, then the depth is:
CHAR_LENGTH(path_string) - CHAR_LENGTH (REPLACE (path_string, '/', '')) +1
As explained earlier, the REPLACE() function is not a Standard
SQL string function, but it is quite common in actual SQL products. This
approach counts the separators.
Searching for Subordinates
Given a parent, find all of the subtrees under it. The immediate solution is this:
SELECT *
FROM Personnel_OrgChart
WHERE path_string LIKE '%' || :parent_emp_id || '%';
The problem is that searches with LIKE predicates, whose pattern
begins with a % wildcard, are slow, because they usually generate a
table scan. Also, note that using _% in the front of the
LIKE predicate pattern will exclude the root of the subtree from
the answer. Another approach is this query:
SELECT *
FROM Personnel_OrgChart
WHERE path_string LIKE (SELECT path_string FROM Personnel_OrgChart
WHERE emp_id = :parent_emp_id) || '%';
The subquery will use the indexing on the emp_id column to find
the "front part" of the path string from the root to the given parent.
Traveling down the tree is easy. Instead of the % wildcard,
use a string of underscore (_) wildcards of the right length. For
example, this will find the immediate children of a given parent
emp_id:
SELECT *
FROM Personnel_OrgChart
WHERE path_string LIKE (SELECT path_string FROM Personnel_OrgChart
WHERE emp_id = :parent_emp_id) ||'_';
Many SQL products have a function that will pad a string with repeated
copies of an input string or return a string of repeated copies of an input
string. For example, SQL Server has a REPLICATE (<character
exp>, <integer exp>) and Oracle has
LPAD() and RPAD(). This can be useful for generating
a search pattern of underscores on the fly.
SELECT *
FROM Personnel_OrgChart
WHERE path_string LIKE (SELECT path_string FROM Personnel_OrgChart
WHERE emp_id = :parent_emp_id)
|| REPLICATE ('_', :n);
To find the immediate subordinates, assuming a numeric path string using periods, like the structure of a numeric outline:
SELECT *
FROM Personnel_OrgChart
WHERE path_string LIKE '01.02.01.%'
AND path_string NOT LIKE '01.02.01.%.%';
The second search condition is there to prevent a table scan and to restrict the results to the immediate subordinates.
Pages: 1, 2 |



