Published on ONLamp.com (http://www.onlamp.com/)


Managing Many-to-Many Relationships with PL/pgSQL

by David Wheeler
06/29/2006

A common pattern when managing the relationship between object-oriented applications and databases is the many-to-many relationship. Object-relational mappers usually manage these relationships as collections of objects, wherein one class has an accessor that returns a collection of related objects. For example, imagine that you're creating (yet another) blogging application. You want to associate your blog entries with tags. Tags can be used over and over again for different blog entries, and each blog entry can, of course, have more than one tag. In this scenario, the blog entry class might have a method that returns a collection of tag objects. (Whether the tag class has a method that returns a collection of blog entry objects is an implementation-specific issue that I'll just leave up to you.)

As an added twist, you might want an ordered relationship. That is to say, for a given blog entry, the order of the tags might be meaningful. Perhaps the first tag is the most important, and the last tag is the least important. The collection of tags is then an ordered set. To manage this relationship in a database, you typically create a join table to manage the relationship:

-- Create a table for blog entries.
CREATE TABLE entry (
  id      SERIAL PRIMARY KEY,
  title   TEXT,
  content TEXT
);

-- Create a table for tags.
CREATE TABLE tag (
  id   SERIAL PRIMARY KEY,
  name TEXT
);

-- Create a join table to collect tags for entries.
CREATE TABLE entry_coll_tag (
  entry_id  INTEGER REFERENCES entry(id)
                    ON UPDATE CASCADE
                    ON DELETE CASCADE,
  tag_id    INTEGER REFERENCES tag(id)
                    ON UPDATE CASCADE
                    ON DELETE CASCADE,
  tag_order SMALLINT,
  PRIMARY KEY (entry_id, tag_id)
);

-- Keep things orderly.
CREATE UNIQUE INDEX idx_entry_coll_tag_order
ON entry_coll_tag (entry_id, tag_order);

So far so good. Blog entries live in the entry table and tags in the tag table, and the entry_coll_tag table manages the many-to-many relationship. Note that in the entry_coll_tag table, the combination of entry_id and tag_id forms the primary key, as each entry can be associated with a given tag only once, and vice versa. There is also a unique index on entry_id and tag_order, so that a given blog entry has its tag associations ordered, but each order number can appear only once per blog entry. (Otherwise they wouldn't really be ordered, would they?)

To select all the tags in the collection for a given blog entry, execute a query like:

SELECT tag.id, tag.name
FROM   tag, entry_coll_tag
WHERE  tag.id = entry_coll_tag.tag_id
       AND entry_coll_tag.entry_id = ?
ORDER BY entry_coll_tag.tag_order;
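
If you prefer explicit ANSI join syntax, the same query can be written like this:

SELECT tag.id, tag.name
FROM   tag
JOIN   entry_coll_tag ON tag.id = entry_coll_tag.tag_id
WHERE  entry_coll_tag.entry_id = ?
ORDER BY entry_coll_tag.tag_order;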

To insert the data for a new blog entry, you might write in the entry class something like this pseudo-code (error handling elided for clarity):

# Use prepared statements.
insert = dbh.prepare('INSERT INTO entry (title, content) VALUES (?, ?)');
sel_id = dbh.prepare("SELECT CURRVAL('entry_id_seq')");

ins_coll = dbh.prepare('
    INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order)
    VALUES (?, ?, ?)
');

# Do everything inside a transaction.
dbh.begin;

# Insert the new entry.
insert.execute(entry.title, entry.content);
sel_id.execute;
entry.id = sel_id.fetch;

# Associate the tags with the entry.
i = 0;
foreach tag in (tag_array) {
    ins_coll.execute(entry.id, tag.id, ++i);
}

# Make it so!
dbh.commit;

To reorder the associated tags for an existing blog entry, just execute the foreach loop with the new order. If you want to add new tags to an existing collection, it gets a bit more complicated:

# Transactions make us safer!
dbh.begin;

# Get the current highest value of tag_order.
max_ord = dbh.selectcol(
    'SELECT COALESCE(MAX(tag_order), 0) FROM entry_coll_tag WHERE entry_id = ?',
    entry.id
);

# Add the new tags to the entry.
i = 0;
foreach tag in (new_tag_array) {
    ins_coll.execute(entry.id, tag.id, max_ord + ++i);
}

dbh.commit;

Indeed, this is the approach taken by most object-relational mapping applications to manage ordered many-to-many relationships. There's a major problem with this approach, however: a race condition. What happens if two different processes update the collection of tags on the same blog entry at the same time? Both might grab the same value for the previous maximum of the tag_order column, and then one of them would fail with a unique-constraint violation when it tried to insert. While it's good and proper to maintain the integrity of the data in this way, it's not a great idea to expose customers to an unexpected error screen indicating a constraint violation.
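
To make the race concrete, here is a sketch of the failure with two hypothetical sessions appending tags to entry 24, whose highest tag_order is 3:

-- Both sessions read the current maximum and get the same answer:
SELECT COALESCE(MAX(tag_order), 0) FROM entry_coll_tag WHERE entry_id = 24;
-- both return 3

-- Session A inserts order 4 and commits:
INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order) VALUES (24, 9, 4);

-- Session B then tries to use order 4 as well:
INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order) VALUES (24, 10, 4);
-- ERROR:  duplicate key violates unique constraint "idx_entry_coll_tag_order"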

What's more, this code is slow, because it makes a lot of round trips to the database. What if there were 100 tags associated with a blog entry? That's 100 separate calls, one for each insert into the join table. It also lengthens the duration of the transaction, increasing the likelihood of an error due to the race condition.

Wouldn't it be a lot cleaner if there were a way to tell the database to manage the collection associations? What if there were functions to call that eliminated the race condition and updated the collection relationships in a single database call?

Fortunately for PostgreSQL users, there is. You can move the collection relationship maintenance code out of the application layer and into database functions. In "Writing PostgreSQL Functions with PL/pgSQL," I introduced PL/pgSQL with some impractical examples. Now I want to get practical by using PL/pgSQL to solve this object-relational mapping problem.

Clearing a Collection

Given that the example deals with collections of objects, I've decided to create four functions to manage them: one to clear the collection (empty it of all tags for a given entry), one to remove specific tags from the collection, one to set all of the tags in the collection at once, and one to add new tags to the collection without changing the existing collection members. I'll demonstrate how to create each in that order, from the simplest to the most complex. Here's the clear function:

1  CREATE OR REPLACE FUNCTION entry_coll_tag_clear (
2      obj_id  integer
3  ) RETURNS VOID AS $$
4  BEGIN
5      DELETE FROM entry_coll_tag WHERE entry_id = obj_id;
6  END;
7  $$ LANGUAGE plpgsql;

If you've read my previous article or are already familiar with PL/pgSQL, this function should be pretty easy to figure out. The body of the function constitutes all of one line, the one with the DELETE statement. Of course, you could simply execute that DELETE statement from client code, but doing so would mean that there were essentially two different interfaces for managing collections: this DELETE statement and some of the more complex functions. I'll explain those in a moment. I provide this function for the sake of interface consistency.

To use this function to clear a blog entry with the ID 24 of all tags, call it like this:

SELECT entry_coll_tag_clear(24);

Deleting Select Tags from a Collection

The delete function can remove any number of specific tags from the collection at once. Here's the function:

1  CREATE OR REPLACE FUNCTION entry_coll_tag_del (
2      obj_id   integer,
3      coll_ids integer[]
4  ) RETURNS VOID AS $$
5  BEGIN
6      DELETE FROM entry_coll_tag
7      WHERE  entry_id = obj_id
8             AND tag_id = ANY(coll_ids);
9  END;
10 $$ LANGUAGE plpgsql;

There are only a few differences here. First, there is a second argument in the signature, coll_ids, an array of integers, which allows you to pass multiple tag IDs to the function at once. In client code, you'll likely have to use the stringified representation of an array. Thus, to remove tags with the IDs 5, 7, 10, and 14 from the collection associated with entry 24, call the function as:

SELECT entry_coll_tag_del(24, '{5,7,10,14}');

Here, '{5,7,10,14}' represents an array with the IDs 5, 7, 10, and 14. See the PostgreSQL documentation on PostgreSQL arrays, PostgreSQL array functions, and PostgreSQL array constructors for more details.
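
Alternatively, if you're writing the call in SQL yourself, the ARRAY constructor syntax builds the same value without the string form:

SELECT entry_coll_tag_del(24, ARRAY[5,7,10,14]);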

The other thing that's different about this function is the use of the ANY operator in the WHERE clause. ANY is similar to the familiar SQL IN expression, except that it acts on an array of values instead of a list. So the last part of the WHERE expression deletes all rows that have a value in the tag_id column equal to any of the values in the array. Note that ANY expressions may not be as fast as their IN counterparts when comparing a lot of rows, although the PostgreSQL 8.2 release slated for later this year should address this issue.
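
For comparison, here's the equivalent deletion written with IN, as a client application might issue it directly:

DELETE FROM entry_coll_tag
WHERE  entry_id = 24
       AND tag_id IN (5, 7, 10, 14);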

As with entry_coll_tag_clear(), the client application could easily handle this code, but again, I provide it for the sake of interface consistency. It's time to make things a little more interesting.

Setting All Tags in a Collection

Here's a function to set all of the objects in a collection at once:

1  CREATE OR REPLACE FUNCTION entry_coll_tag_set (
2      obj_id   integer,
3      coll_ids integer[]
4  ) RETURNS VOID AS $$
5  BEGIN
6      PERFORM true FROM entry WHERE id = obj_id FOR UPDATE;
7  
8      UPDATE entry_coll_tag
9      SET    tag_order = -tag_order
10     WHERE  entry_id = obj_id;
11
12     FOR iloop IN 1..array_upper(coll_ids, 1) LOOP
13         IF coll_ids[iloop] IS NULL THEN
14             CONTINUE;
15         END IF;
16 
17         UPDATE entry_coll_tag
18         SET    tag_order = iloop
19         WHERE  entry_id = obj_id
20                AND tag_id = coll_ids[iloop];
21 
22         IF FOUND IS false THEN
23             INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order)
24             VALUES (obj_id, coll_ids[iloop], iloop);
25         END IF;
26     END LOOP;
27 
28     DELETE FROM entry_coll_tag
29     WHERE  entry_id = obj_id AND tag_order < 0;
30 END;
31 $$ LANGUAGE plpgsql;

The idea behind this function is that you can call it to assign all of the tags to the collection for a given entry at once:

SELECT entry_coll_tag_set(24, '{67,43,1,104}');

When the function exits, all of the tags with the IDs 67, 43, 1, and 104 will be associated in that order with entry 24. As you might guess, this single call to the database incurs a lot less overhead than calling INSERT or UPDATE for each tag ID--especially if you associate a lot of tags with an entry!
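
You can verify the result with the same SELECT query from the beginning of the article; the tags come back in exactly the order they held in the array:

SELECT tag.id, tag.name
FROM   tag, entry_coll_tag
WHERE  tag.id = entry_coll_tag.tag_id
       AND entry_coll_tag.entry_id = 24
ORDER BY entry_coll_tag.tag_order;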

This function introduces quite a bit more syntax. The first new thing is the PERFORM statement in line 6. This is a special PL/pgSQL statement that you can use in lieu of SELECT when you want to ignore the rows the query returns. The line also uses the FOR UPDATE clause. Essentially, FOR UPDATE tells PostgreSQL to lock the row in the entry table associated with the tags, and that lock is what overcomes the race condition. Because the code only needs to take the lock, it can use PERFORM rather than fetching the values a SELECT statement would return.

You might reasonably ask why it's necessary to lock the entry row even though the code makes no modifications to it. The answer is that, while you could lock all of the existing rows associated with the entry ID in the entry_coll_tag table, and thus prevent a race on existing rows, that wouldn't stop another process from inserting new rows associated with the entry ID. The race condition would still exist.

This lock prevents any rows with a foreign key relationship to entry from being inserted or updated. Why? Because for all PostgreSQL knows, you might be about to change the primary key value in the entry table. (But you never, ever actually do that, right?) The entry_coll_tag table, of course, has a foreign key relationship with the entry table. Therefore, locking the entry row effectively prevents any other process from inserting or updating rows in entry_coll_tag that reference that entry. The race condition is thus eliminated.
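
You can watch the lock at work from two psql sessions (a sketch; the entry ID is hypothetical):

-- Session A takes the lock and leaves the transaction open:
BEGIN;
SELECT true FROM entry WHERE id = 24 FOR UPDATE;

-- Session B now blocks on its foreign key check until session A
-- commits or rolls back:
INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order)
VALUES (24, 99, 1);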

After getting the lock (a step that may itself block until another transaction holding a lock on the same row finishes), the code executes an UPDATE statement (lines 8-10) that flips the tag_order value of every existing row associated with the entry ID to its negative. Doing so avoids problems with the unique constraint: the collection you're setting may share rows with the existing collection, so a newly assigned tag_order value could collide with a value already stored in an existing row. Because the new tag_order values are always positive, updating the existing records so that their tag_order columns are negative moves them safely out of the way.
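
To see the collision the negation avoids, suppose that entry 24 already has tags 67 and 43 at orders 1 and 2, and you call entry_coll_tag_set(24, '{43,67}') to swap them (hypothetical data). Without the negation step, the loop's first UPDATE would fail:

-- Tag 67 still holds tag_order 1 for entry 24, so this collides:
UPDATE entry_coll_tag
SET    tag_order = 1
WHERE  entry_id = 24 AND tag_id = 43;
-- fails with the same duplicate key error on idx_entry_coll_tag_order

With the old rows parked at orders -1 and -2, the new positive values can be assigned freely.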

Why not just delete the existing rows? Perhaps there are other dependencies on them, such as foreign key constraints or an ON DELETE trigger that you don't want to set off unless you're actually eliminating the relationship. Furthermore, the same tag might still be in the collection, just with a different tag_order value, so why move it to a new row? This technique keeps existing records around until you know that you no longer need them. Note that the simple application code in the first section of this article failed to take this subtlety into account.

Next up, line 12 uses a FOR loop to iterate over each of the values in the coll_ids array passed to the function. It starts with index 1 (because SQL arrays start at 1, not 0 as in most programming languages with which you might be familiar), and continues up to the last index in the array, which is returned by the call to array_upper(coll_ids, 1). The FOR loop implicitly creates the iloop variable to store each index while iterating over the array.

As the code iterates over each value in the array, it first checks to see if it is NULL, and if it is, executes the CONTINUE statement to restart the loop with the next value. Now, PostgreSQL does not currently allow NULLs in arrays, but that is another change slated for the 8.2 release, due out later this year. This is some defensive coding.

There's a better reason to check for NULLs even in earlier versions of PostgreSQL. A reviewer reading an earlier draft of this article pointed out a problem with iterating over the array starting at index 1: in PostgreSQL arrays, the lower bound isn't necessarily 1! Consider this example:

try=% select array_lower('[3:5]={1,2,3}'::int[], 1);
 array_lower 
-------------
           3
(1 row)

This feature actually violates the SQL standard, and so will likely change in a future version of PostgreSQL. In the meantime, the solution is either to start with the return value of array_lower() instead of 1, or to check each value while iterating over the array. What is the value stored in index 1 when the lower bound is not index 1? Look:

try=% select coalesce(foo.bar[1], 0)
try-% from (select '[3:5]={1,2,3}'::int[]) as foo(bar);
 coalesce 
----------
        0
(1 row)

The standard COALESCE() function returns the first non-NULL value passed to it. Here it returns 0, which means that the value stored at index 1 is NULL. The defensive NULL check added with future versions of PostgreSQL in mind thus covers this issue as well.
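
If you'd rather not lean on the NULL check at all, a bounds-safe loop header is another option. This is just a sketch, not what the functions in this article use:

FOR iloop IN COALESCE(array_lower(coll_ids, 1), 1)
          .. COALESCE(array_upper(coll_ids, 1), 0) LOOP
    -- process coll_ids[iloop] as before
END LOOP;

The COALESCE() calls turn an empty or NULL array into the range 1..0, so the loop body simply never executes.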

For each non-NULL value in coll_ids, the code executes a couple of SQL statements. Lines 17-20 attempt to update an existing row with the relevant entry ID and tag ID to set the tag_order column to the current value of iloop. This avoids breaking any foreign key dependencies or triggering any ON DELETE (or even ON INSERT) triggers on the row by not deleting it and inserting a new row.

Of course, a given tag might not have been previously associated with the entry, so lines 22-25 check to see if the update actually updated a row, and if not, insert a new one. First, the IF-THEN block checks the FOUND Boolean variable, which is always available in PL/pgSQL functions, and is set to true when an UPDATE or INSERT statement affects a row, among other events (see the PL/pgSQL result status documentation). For the purpose here, if it's false, the UPDATE statement updated no rows. It's safe to insert a new row.

Having updated or inserted all of the rows to properly represent the tags associated with the given entry and in the proper order, the code must clean up any extras. Perhaps earlier in life the collection had a tag that has since been discarded, or the collection used to have more tags than it currently has. The DELETE FROM statement at lines 28-29 deletes any rows associated with the entry with a tag_order less than 0 (because the code earlier may have set it to a negative value and nothing else has updated it with a new value).

As you can see, this function does quite a lot of processing in order to avoid race conditions and to try to be as careful as possible in setting up the collection. Old rows are preserved, new ones are created, and discarded rows are properly deleted. The application code in the first section hadn't caught all of these conditions. Even if it had, it would have been even slower and more difficult to maintain. By factoring the collection management into the database, your application code becomes much simpler (just a single database function call) and consumes far fewer network resources.

Adding New Tags

We're not quite done yet. Sometimes you might want to just add new tags to an existing collection, and it seems silly to take the time and resources to load all of the existing tags in the collection just to append to the list. Here's a PL/pgSQL function that takes on the responsibility for adding new tags to the collection:

1  CREATE OR REPLACE FUNCTION entry_coll_tag_add (
2      obj_id   integer,
3      coll_ids integer[]
4  ) RETURNS VOID AS $$
5  DECLARE
6      last_ord smallint;
7      got_ord  boolean;
8      iord     integer := 1;
9  BEGIN
10     PERFORM true FROM entry WHERE id = obj_id FOR UPDATE;
11 
12     SELECT INTO last_ord COALESCE(max(tag_order), 0)
13     FROM   entry_coll_tag
14     WHERE  entry_id = obj_id;
15 
16     FOR iloop IN 1..array_upper(coll_ids, 1) LOOP
17         IF coll_ids[iloop] IS NULL THEN
18             CONTINUE;
19         END IF;
20 
21         SELECT INTO got_ord true 
22         FROM   entry_coll_tag
23         WHERE  entry_id = obj_id
24                AND tag_id = coll_ids[iloop];
25 
26         IF got_ord IS NULL THEN
27             INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order)
28             VALUES (obj_id, coll_ids[iloop], last_ord + iord);
29             iord := iord + 1;
30         END IF;
31     END LOOP;
32 END;
33 $$ LANGUAGE plpgsql;

The syntax for using this function is identical to that of entry_coll_tag_set():

SELECT entry_coll_tag_add(24, '{223,12,52}');

This call will add the tags with the IDs 223, 12, and 52 to the collection, without molesting the previously existing tags in the collection. If any of the tag IDs in the array are already in the collection, entry_coll_tag_add() will simply leave them where they are.

This function declares several variables, and again, the first thing it does is to get a lock on the entry object so as to add tags to the collection with impunity. Then, lines 12-14 use the special PL/pgSQL SELECT INTO expression to collect the current maximum value for the tag_order column for the blog entry. Just in case no row exists in the entry_coll_tag table already (in which case MAX() will return NULL), the COALESCE() function forces the result to default to 0.

Line 16 starts looping through the IDs in the coll_ids array, just as in entry_coll_tag_set(), and lines 17-19 again skip NULL values. Now, it could be that a call is made to entry_coll_tag_add() with one or more tag IDs that already exist in the collection. So lines 21-24 once again use the SELECT INTO expression, this time to fetch true into the got_ord variable if the row exists. If it doesn't exist, got_ord will be NULL. Indeed, this is the purpose of the check at line 26: if got_ord is NULL, the code inserts the new value, using last_ord + iord to set the value of the tag_order column.
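
As a design aside, the existence check could equally rely on the FOUND variable, which PERFORM also sets, instead of fetching into got_ord. A sketch of the alternative loop body:

PERFORM true
FROM   entry_coll_tag
WHERE  entry_id = obj_id
       AND tag_id = coll_ids[iloop];

IF NOT FOUND THEN
    INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order)
    VALUES (obj_id, coll_ids[iloop], last_ord + iord);
    iord := iord + 1;
END IF;

Either way works; SELECT INTO just makes the fetched value available should you later need it.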

Once again, creating a function that can be called once from application code to safely add any number of tags to the collection at once saves a lot of overhead. If you were to do it all in application space, you'd have to send many more queries to the server--two for each tag, plus one for the lock and one to determine the previous maximum value of the tag_order column.

Benchmarks

How much time do these functions really save in the performance department? I hope that I've established that the functions can take quite a few precautions to ensure that collection management works as smoothly as possible, avoiding race conditions and duplicate-record errors. That's all well and good, but of course you can get the same results by executing the same queries in application space, viz. the entry table row lock, the UPDATE statements to check for existing records, and so on. It's a lot of SQL to maintain in application space, but certainly doable.

The real advantage of the functions versus multiple database calls comes in the form of performance improvements. To test the difference, I wrote a Perl script using the extremely fast DBI database interface to test both approaches and compare their performance. The script inserts 100,000 entry records and about 500,000 tag records (a random number of tags between 1 and 10 for each entry) before running the benchmarks, as an empty database is always a fast database, and therefore would not provide accurate benchmark results. The script also runs the PostgreSQL VACUUM and ANALYZE commands before each test, to ensure that the database is as clean and the statistics as up-to-date as possible. Each approach to collection management runs the following code 300 times (this is the function-call version; the Perl version does the equivalent, but much more verbosely, of course):

BEGIN;
SELECT entry_coll_tag_set(100, '{1,2,3,4,5,6,7,8}');
COMMIT;

BEGIN;
SELECT entry_coll_tag_del(100, '{3,5,7}');
COMMIT;

BEGIN;
SELECT entry_coll_tag_add(100, '{9,10,11}');
COMMIT;

BEGIN;
SELECT entry_coll_tag_set(100, '{11,10,9,8,7,6,5,4,3,2,1}');
COMMIT;

BEGIN;
SELECT entry_coll_tag_clear(100);
COMMIT;

(That certainly is very clean code to maintain in the application space, no?)

The idea is to have a variety of typical collection management queries execute in order to measure the overall performance of collection management. I ran the benchmark on a single-processor Intel Pentium 2.26 GHz box with 512MB of RAM running Fedora Core release 4 (Stentz), Perl 5.8.8, DBI 1.50, and PostgreSQL 8.1.3, and with no other non-essential processes running. The results were stunning, to say the least:

func: 13.52 wallclock seconds (0.13 usr + 1.79 sys = 1.92 CPU) @ 22.19/s
perl: 42.39 wallclock seconds (0.29 usr + 7.09 sys = 7.38 CPU) @  7.08/s

Yes, the use of the PL/pgSQL functions was over three times faster than executing the same code from application space. Furthermore, the application approach consumed about 3.85 times as much CPU time. Not only is there a huge performance boost in terms of overall wallclock seconds, but with the functional approach the application uses far less CPU time, freeing valuable processor time for other tasks.

Curiously, after rebooting my server, I saw somewhat different results as I was putting the final tweaks on this article:

func:  5.71 wallclock seconds (0.02 usr + 0.58 sys = 0.60 CPU) @ 52.53/s
perl: 14.78 wallclock seconds (0.10 usr + 2.05 sys = 2.15 CPU) @ 20.29/s

The use of the PL/pgSQL functions is still nearly 2.6 times faster than doing the work in application space, so it's still a huge win. Not sure about these results? Download the test for yourself and see what results you get.

Conclusion

Writing database functions in languages like PL/pgSQL offers the potential for huge performance boosts for your applications, facilitates the practice of maintaining database integrity, and keeps application code much cleaner and thus easier to maintain. By using PL/pgSQL functions, you can rest easy in the knowledge that the integrity of your data will be maintained, and all applications that access the database will hum along happily. Give it a shot! You won't regret it.

Acknowledgments

Many thanks to Josh Berkus for reading a draft of this article and providing feedback and PL/pgSQL implementation hints. This article would not have been possible without his generous help! I'm also grateful to AndrewSN for bringing up a number of issues with the functions as they were originally written. The functions are much better, and the article therefore also improved, thanks to his feedback.

David Wheeler is a developer at Portland, Oregon-based Values of n, where he writes the code that makes Stikkit's little yellow notes think.



Copyright © 2009 O'Reilly Media, Inc.