ONLamp.com
oreilly.comSafari Books Online.Conferences.

advertisement


Managing Many-to-Many Relationships with PL/pgSQL
Pages: 1, 2, 3

Clearing a Collection

Given that the example deals with collections of objects, I've decided to create four functions to manage them: one to clear the collection (empty it of all tags for a given entry), one to remove specific tags from the collection, one to set all of the tags in the collection at once, and one to add new tags to the collection without changing the existing collection members. I want to demonstrate how to create each in that order, because I listed them in order from the simplest to the most complex. Here's the clear function:



1  CREATE OR REPLACE FUNCTION entry_coll_tag_clear (
2      obj_id  integer
3  ) RETURNS VOID AS $$
4  BEGIN
5      DELETE FROM entry_coll_tag WHERE entry_id = obj_id;
6  END;
7  $$ LANGUAGE plpgsql;

If you've read my previous article or are already familiar with PL/pgSQL, this function should be pretty easy to figure out. The body of the function constitutes all of one line, the one with the DELETE statement. Of course, you could simply execute that DELETE statement from client code, but doing so would mean that there were essentially two different interfaces for managing collections: this DELETE statement and some of the more complex functions. I'll explain those in a moment. I provide this function for the sake of interface consistency.

To use this function to clear a blog entry with the ID 24 of all tags, call it like this:

SELECT entry_coll_tag_clear(24);

Deleting Select Tags from a Collection

The delete function can remove any number of specific tags from the collection at once. Here's the function:

1  CREATE OR REPLACE FUNCTION entry_coll_tag_del (
2      obj_id   integer,
3      coll_ids integer[]
4  ) RETURNS VOID AS $$
5  BEGIN
6      DELETE FROM entry_coll_tag
7      WHERE  entry_id = obj_id
8             AND tag_id = ANY(coll_ids);
9  END;
10 $$ LANGUAGE plpgsql;

There are only a few differences here. First, there is a second argument in the signature, coll_ids, an array of integers, which allows you to pass multiple tag IDs to the function at once. In client code, you'll likely have to use the stringified representation of an array. Thus, to remove tags with the IDs 5, 7, 10, and 14 from the collection associated with entry 24, call the function as:

SELECT entry_coll_tag_del(24, '{5,7,10,14}');

Here, '{5,7,10,14}' represents an array with the IDs 5, 7, 10, and 14. See the PostgreSQL documentation on PostgreSQL arrays, PostgreSQL array functions, and PostgreSQL array constructors for more details.

The other thing that's different about this function is the use of the ANY hyper operator in the WHERE clause. ANY is similar to the familiar SQL IN expression, except that it acts on an array of values instead of a list. So that last part of the WHERE expression deletes all rows that have a value in the tag_id column equal to any of the values in the array. Note that ANY expressions may not be as fast as their IN counterparts when comparing a lot of rows, although the PostgreSQL 8.2 release slated for later this year has addressed this issue.

As with entry_coll_tag_clear(), the client application could easily handle this code, but again, I provide it for the sake of interface consistency. It's time to make things a little more interesting.

Setting All Tags in a Collection

Here's a function to set all of the objects in a collection at once:

1  CREATE OR REPLACE FUNCTION entry_coll_tag_set (
2      obj_id   integer,
3      coll_ids integer[]
4  ) RETURNS VOID AS $$
5  BEGIN
6      PERFORM true FROM entry WHERE id = obj_id FOR UPDATE;
7  
8      UPDATE entry_coll_tag
9      SET    tag_order = -tag_order
10     WHERE  entry_id = obj_id
11
12     FOR iloop IN 1..array_upper(coll_ids, 1) LOOP
13         IF coll_ids[iloop] IS NULL THEN
14             CONTINUE;
15         END IF;
16 
17         UPDATE entry_coll_tag
18         SET    tag_order = iloop
19         WHERE  entry_id = obj_id
20                AND tag_id = coll_ids[iloop];
21 
22         IF FOUND IS false THEN
23             INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order)
24             VALUES (obj_id, coll_ids[iloop], iloop);
25         END IF;
26     END LOOP;
27 
28     DELETE FROM entry_coll_tag
29     WHERE  entry_id = obj_id AND tag_order < 0;
30 END;
31 $$ LANGUAGE plpgsql;

The idea behind this function is that you can call it to assign all of the tags to the collection for a given entry at once:

SELECT entry_coll_tag_set(24, '{67,43,1,104}');

When the function exits, all of the tags with the IDs 67, 43, 1, and 104 will be associated in that order with entry 24. As you might guess, this single call to the database incurs a lot less overhead than calling INSERT or UPDATE for each tag ID--especially if you associate a lot of tags with an entry!

This function introduces quite a bit more syntax. Looking it over, the first line with something new is the unfamiliar PERFORM statement in line 6. This is a special PL/pgSQL statement that you can use in lieu of a SELECT statement when you want to ignore the rows returned by the SELECT statement. This line also uses the SELECT FOR UPDATE statement. What is the FOR UPDATE part? Essentially, this statement tells PostgreSQL to lock the row in the entry table associated with the tags. This lock overcomes the race condition. Because the code only needs to create the lock, it's possible to use PERFORM rather than fetching values returned by a SELECT statement.

You might reasonably ask why it's necessary to lock the entry object though the code makes no modifications to it. The problem with the race condition is that, while you could lock all of the existing rows associated with the entry ID in the entry_coll_tag table and thus prevent a race condition on existing rows, that wouldn't stop another process from inserting new rows associated with the entry ID. Thus the race condition still exists.

This lock prevents any rows with a foreign key relationship with entry from being inserted or updated. Why not? Because for all PostgreSQL knows, you might be changing the primary key value in the entry table. (But you never, ever actually do that, right?) Of course, the entry_coll_tag table has a foreign key relationship with the entry table. Therefore, locking the entry row effectively prevents any other process from inserting or updating rows in entry_coll_tag that reference that entry. The race condition is thus eliminated.

After getting the lock (which may be blocked until another transaction that has a lock on the same row finishes), the code executes an UPDATE statement (lines 8-10) to set all existing rows associated with the entry ID so that the value of the tag_order column is negative. Doing so avoids problems with the unique constraint. This is necessary because you might be updating a collection with some existing records, and thus inserting the new tag_order values as in an existing row. Because the new tag_order values are always positive, the code can avoid the issue by updating the existing records so that their tag_order columns are negative.

Why not just delete it? Perhaps there are other dependencies on this row, such as foreign key constraints or an ON DELETE trigger that you don't want to set off unless you're actually eliminating the relationship. Furthermore, the same object might still be in the collection with a different tag_order value, so why move it to a new row? This technique allows you to keep existing records around until you know that you no longer need them. Note that the simple application code in the first section of this article failed to take this subtlety into account.

Next up, line 12 uses a FOR loop to iterate over each of the values in the coll_ids array passed to the function. It starts with index 1 (because SQL arrays start at 1, not 0 as in most programming languages with which you might be familiar), and continues up to the last index in the array, which is returned by the call to array_upper(coll_ids, 1). The FOR loop implicitly creates the iloop variable to store each index while iterating over the array.

As the code iterates over each value in the array, it first checks to see if it is NULL, and if it is, executes the CONTINUE statement to restart the loop with the next value. Now, PostgreSQL does not currently allow NULLs in arrays, but that is another change slated for the 8.2 release, due out later this year. This is some defensive coding.

There's a better reason to check for NULLs even in earlier versions of PostgreSQL. A reviewer reading an earlier draft of this article pointed out a problem with iterating over the array starting at index 1: in PostgreSQL arrays, the lower bound isn't necessarily 1! Consider this example:

try=% select array_lower('[3:5]={1,2,3}'::int[], 1);
 array_lower 
-------------
           3
(1 row)

This feature actually violates the SQL standard, and so will likely change in a future version of PostgreSQL. In the meantime, the solution is either to start with the return value of array_lower() instead of 1, or to check each value while iterating over the array. What is the value stored in index 1 when the lower bound is not index 1? Look:

try=% select coalesce(foo.bar[1], 0)
try-% from (select '[3:5]={1,2,3}'::int[]) as foo(bar);
 coalesce 
----------
        0
(1 row)

The standard COALESCE() function returns the first non-NULL value passed to it. It here returns 0, so the value stored in index 1 is NULL. The defensive coding against future improvements to arrays in PostgreSQL has covered this issue, as well.

For each non-NULL value in coll_ids, the code executes a couple of SQL statements. Lines 17-20 attempt to update an existing row with the relevant entry ID and tag ID to set the tag_order column to the current value of iloop. This avoids breaking any foreign key dependencies or triggering any ON DELETE (or even ON INSERT) triggers on the row by not deleting it and inserting a new row.

Of course, a given tag might not have been previously associated with the entry, so lines 22-25 check to see if the update actually updated a row, and if not, insert a new one. First, the IF-THEN block checks the FOUND Boolean variable, which is always available in PL/pgSQL functions, and is set to true when an UPDATE or INSERT statement affects a row, among other events (see the PL/pgSQL result status documentation). For the purpose here, if it's false, the UPDATE statement updated no rows. It's safe to insert a new row.

Having updated or inserted all of the rows to properly represent the tags associated with the given entry and in the proper order, the code must clean up any extras. Perhaps earlier in life the collection had a tag that has since been discarded, or the collection used to have more tags than it currently has. The DELETE FROM statement at lines 28-29 deletes any rows associated with the entry with a tag_order less than 0 (because the code earlier may have set it to a negative value and nothing else has updated it with a new value).

As you can see, this function does quite a lot of processing in order to avoid race conditions and to try to be as careful as possible in setting up the collection. Old rows are preserved, new ones are created, and discarded rows are properly deleted. The application code in the first section hadn't caught all of these conditions. Even if it had, it would have been even slower and more difficult to maintain. By factoring the collection management into the database, your application code becomes much simpler (just a single database function call) and consumes far fewer network resources.

Pages: 1, 2, 3

Next Pagearrow





Sponsored by: