oreilly.comSafari Books Online.Conferences.


Managing Many-to-Many Relationships with PL/pgSQL
Pages: 1, 2, 3

Adding New Tags

This isn't everything quite done yet. Sometimes you might want to just add new tags to an existing collection. It seems silly to take the time and resources to load all of the existing tags in the collection just to append to that list. Here's a PL/pgSQL function that takes on the responsibility for adding the new tags to the collection:

1  CREATE OR REPLACE FUNCTION entry_coll_tag_add (
2      obj_id   integer,
3      coll_ids integer[]
6      last_ord smallint;
7      got_ord  boolean;
8      iord     integer := 1;
10     PERFORM true FROM entry WHERE id = obj_id FOR UPDATE;
12     SELECT INTO last_ord COALESCE(max(tag_order), 0)
13     FROM   entry_coll_tag
14     WHERE  entry_id = obj_id;
16     FOR iloop IN 1..array_upper(coll_ids, 1) LOOP
17         IF coll_ids[iloop] IS NULL THEN
18             CONTINUE;
19         END IF;
21         SELECT INTO got_ord true 
22         FROM   entry_coll_tag
23         WHERE  entry_id = obj_id
24                AND tag_id = coll_ids[iloop];
26         IF got_ord IS NULL THEN
27             INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order)
28             VALUES (obj_id, coll_ids[iloop], last_ord + iord);
29             iord := iord + 1;
30         END IF;
31     END LOOP;
32 END;
33 $$ LANGUAGE plpgsql;

The syntax for using this function is identical to that of entry_coll_tag_set():

SELECT entry_coll_tag_add(24, '{223,12,52}');

This call will add the tags with the IDs 223, 12, and 52 to the collection, without molesting the previously existing tags in the collection. If any of the tag IDs in the array are already in the collection, entry_coll_tag_add() will simply leave them where they are.

This function declares several variables, and again, the first thing it does is to get a lock on the entry object so as to add tags to the collection with impunity. Then, lines 12-14 use the special PL/pgSQL SELECT INTO expression to collect the current maximum value for the tag_order column for the blog entry. Just in case no row exists in the entry_coll_tag table already (in which case MAX() will return NULL), the COALESCE() function forces the result to default to 0.

Line 16 starts looping through the IDs in the coll_ids array, just as in person_coll_tag_set(). Lines 17-19 again skip NULL values. Now, it could be that a call is made to person_coll_tag_add() with one or more tag IDs that already exist in the collection. So lines 21-24 once again use the SELECT INTO expression, this time to fetch true into the got_ord variable if the row exists. If it doesn't exist, got_ord will be NULL. Indeed, this is the purpose of the check at line 26. If it's NULL, the code inserts the new value, using iord + last_ord to set the value of the tag_order column.

Once again, creating a function that can be called once from application code to safely add any number of tags to the collection at once saves a lot of overhead. If you were to do it all in application space, you'd have to send many more queries to the server--two for each tag, plus one for the lock and one to determine the previous maximum value of the tag_order column.


How much time do these functions really save in the performance department? I hope that I've established that you can take quite a few precautions in the functions to ensure that they work as smoothly as possible, avoiding race conditions and duplicate-record errors. That's all well and good, but of course you can get the same results by executing the same queries in the application space, viz. the entry table row lock, the UPDATE statements to check for existing records, etc. Its a lot of SQL to maintain in application space, but of course doable.

The real advantage of the functions versus multiple database calls comes in the form of performance improvements. To test the difference, I wrote a Perl script using the extremely fast DBI database interface to test both approaches and compare their performance. The script inserts 100,000 entry records and about 500,000 tag records (a random number of tags between 1 and 10 for each entry) before running the benchmarks, as an empty database is always a fast database, and therefore would not provide accurate benchmark results. The script also runs the PostgreSQL VACUUM and ANALYZE commands before each test, to ensure that the database is as clean and the statistics as up-to-date as possible. Each approach to collection management runs the following code 300 times (this is the function-call version; the Perl version does the equivalent, but much more verbosely, of course):

SELECT entry_coll_tag_set(100, '{1,2,3,4,5,6,7,8}');

SELECT entry_coll_tag_del(100, '{3,5,7}');

SELECT entry_coll_tag_add(100, '{9,10,11}');

SELECT entry_coll_tag_set( 100, '{11,10,9,8,7,6,5,4,3,2,1}');

SELECT entry_coll_tag_clear(100);

(That certainly is very clean code to maintain in the application space, no?)

The idea is to have a variety of typical collection management queries execute in order to measure the overall performance of collection management. I ran the benchmark on a single-processor Intel Pentium 2.26 GHz box with 512MB of RAM running Fedora Core release 4 (Stentz), Perl 5.8.8, DBI 1.50, and PostgreSQL 8.1.3, and with no other non-essential processes running. The results were stunning, to say the least:

func: 13.52 wallclock seconds (0.13 usr + 1.79 sys = 1.92 CPU) @ 22.19/s
perl: 42.39 wallclock seconds (0.29 usr + 7.09 sys = 7.38 CPU) @  7.08/s

Yes, the use of the PL/pgSQL functions was over three times faster than the execution of the same code from the application space. Furthermore, the application approach used nearly 3.85 times more CPU time. Not only is there a huge performance boost in terms of overall wallclock seconds, but with the functional approach, the application uses far less CPU time, freeing valuable processor time for other tasks.

Curiously, after rebooting my server, I saw somewhat different results as I was putting the final tweaks on this article:

func:  5.71 wallclock seconds (0.02 usr + 0.58 sys = 0.60 CPU) @ 52.53/s
perl: 14.78 wallclock seconds (0.10 usr + 2.05 sys = 2.15 CPU) @ 20.29/s

The use of PL/pgSQL function is still nearly 2.6 times faster than just doing the work in application space, so it's still a huge win. Not sure about these results? Download the test for yourself and see what results you get.


Writing database functions in languages like PL/pgSQL offers the potential for huge performance boosts for your applications, facilitates the the practice of maintaining database integrity, and keeps application code much cleaner and thus easier to maintain. By using PL/pgSQL functions, you can rest easy in the knowledge that the integrity of your data will be maintained, and all applications that access the database will hum along happily. Give it a shot! You won't regret it.


Many thanks to Josh Berkus for reading a draft of this article and providing feedback and PL/pgSQL implementation hints. This article would not have been possible without his generous help! I'm also grateful to AndrewSN for bringing up a number of issues with the functions as they were originally written. The functions are much better, and the article therefore also improved, thanks to his feedback.

David Wheeler is a developer at Portland, Oregon-based Values of n, where he writes the code that makes Stikkit's little yellow notes think.

Return to

Sponsored by: