

Automating PostgreSQL Tasks

by Manni Wood
12/09/2004

You can and should automate routine database tasks, to remove the possibility of user error and to let yourself focus on other, more interesting work. If you use PostgreSQL on Unix or Linux, you can combine the shell with PostgreSQL's psql client to automate database chores.

Automating Database Creation

If you haven't yet created your database and you believe that you will be iteratively destroying and re-creating it as the project's needs become clearer, the first chore you should automate is the creation of the database itself.

I have the habit of creating a db_creation_scripts directory for my PostgreSQL projects, numbering the scripts inside according to the order in which they should execute. Here's a listing for the sample database I created to write this article:

prospero:db_creation_scripts$ ls
0_create_database_and_user.postgresql  1_create_main_tables.postgresql

A real project might have a few more scripts for pre-populating some of the tables or creating stored procedures, but our toy example is simple and needs just two scripts.

The 0_create_database_and_user.postgresql script is really a set of steps to follow manually. Take a look at it if you're interested in learning how I created the toy database used for the rest of this article.

The fun automation stuff begins with 1_create_main_tables.postgresql:

/* $Revision: 1.24 $ $Date: 2004/04/30 20:17:41 $
 
To create the bookstore database, run this file through psql like so:

prompt> psql -f 1_create_main_tables.postgresql bookstore bookstore_user
 
*/
 
drop sequence sales_row_num_seq;
 
create sequence sales_row_num_seq cycle;
 
drop table sales;
 
create table sales (
  row_num int not null default nextval('sales_row_num_seq'),
  first_name varchar(35) not null,
  middle_name varchar(35) null,
  last_name varchar(35) not null,
  city varchar(35) not null,
  state char(2) not null,
  zip varchar(5) not null,
  email_address varchar(50) not null,
  product_id int not null,
  product_name varchar(50) not null,
  product_sale_date date not null);
 
comment on table sales is 'Contains records of books sold from the bookstore.';
comment on column sales.row_num is 'For de-duping purposes.';

As the comment at the top of the file instructs, cd to the directory holding the file and run

$ psql -f 1_create_main_tables.postgresql bookstore bookstore_user

to create your database table.

If you decide that you should have broken the sales table into customer, product, and sales tables—certainly better relational design—just alter the file above and rerun the psql command to rebuild your database. This sort of automation is a real boon to the iterative design process early in a project. It also documents your database, and it makes it easy to re-create your database on a different host.

If you use CVS, you'll notice the revision and date tags at the top of the file above. I like to check my database creation scripts into CVS or, lately, Subversion. It's a great way of keeping track of my database schema as it evolves, and it makes me more willing to experiment, knowing that I can recall earlier versions of the schema. Which version control system you use doesn't matter, of course; what matters is that you use one.

Note that because 0_create_database_and_user.postgresql suggests allowing bookstore_user to log on without a password, the psql command runs without prompting for one. Obviously, I have traded security for ease of use, mainly for illustrative purposes in this article. You will need to decide where to strike that balance.
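
If you're wondering how the password-less login is arranged: the usual mechanism (and this is my assumption about what the manual steps in 0_create_database_and_user.postgresql boil down to) is a "trust" entry in pg_hba.conf along these lines:

# In pg_hba.conf: let bookstore_user reach the bookstore database over
# local (Unix-socket) connections with no password. A hypothetical example;
# tighten it to match your own installation.
#
# TYPE  DATABASE   USER            METHOD
local   bookstore  bookstore_user  trust

PostgreSQL rereads pg_hba.conf when you reload the server (pg_ctl reload), so do that after editing the file.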

Report Generation

When your database project is mature, you will likely be generating reports, often in formats that your clients or bosses can open in their spreadsheet programs. (The populate.postgresql file contains sample data that you might find useful while following along.)
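
If you don't have that file handy, a couple of rows shaped like the following will do. I've invented them purely for illustration (they are not the actual contents of populate.postgresql), but they fall in October 2004, so the queries below will return something:

insert into sales (first_name, middle_name, last_name, city, state, zip,
                   email_address, product_id, product_name, product_sale_date)
values ('Jane', null, 'Doe', 'Boston', 'MA', '02134',
        'jane.doe@example.com', 1001, 'Some Great Novel', '2004-10-05');

insert into sales (first_name, middle_name, last_name, city, state, zip,
                   email_address, product_id, product_name, product_sale_date)
values ('John', 'Q', 'Public', 'Cambridge', 'MA', '02139',
        'john.q.public@example.com', 1002, 'Another Fine Book', '2004-10-12');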

If you want to capture the results to a file, use psql commands (they start with a backslash) in conjunction with SQL commands. You can log on to your database with psql bookstore bookstore_user and then execute the following commands:

\o report.txt
select product_id, product_name, product_sale_date from sales where 
product_sale_date >= '2004-10-01' and product_sale_date <= '2004-10-31';
\o

The resulting report.txt file, though easy on the eyes, is difficult to import into a spreadsheet, and you know a spreadsheet is what your bosses really want.

Again, run psql bookstore bookstore_user. Then execute these additional formatting commands to produce something that will open in Excel or OpenOffice.org:

\o report.csv
\a
\f ,
select product_id, product_name, product_sale_date from sales where 
product_sale_date >= '2004-10-01' and product_sale_date <= '2004-10-31';
\o

If you want to study more of psql's formatting commands, read the psql man page. There's even support for HTML table output!
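
For instance, the \H command toggles HTML table output. A quick sketch, reusing the October query from above:

\o report.html
\H
select product_id, product_name, product_sale_date from sales where 
product_sale_date >= '2004-10-01' and product_sale_date <= '2004-10-31';
\H
\o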

Who wants to type all these commands every month just to generate a monthly sales report, though? Put all of the above commands into a file called generate_monthly_sales_report.postgresql:

-- generate_monthly_sales_report.postgresql
-- 
-- Outputs one month's worth of sales to a CSV file that
-- is easily imported into spreadsheet software.

\o report.csv
\a
\f ,
select product_id, product_name, product_sale_date from sales where 
product_sale_date >= '2004-10-01' and product_sale_date <= '2004-10-31';
\o

Now you just have to remember to run generate_monthly_sales_report.postgresql using psql's -f flag. You could even just make a nice wrapper script for that:

#!/bin/sh

# gen_sales_report.sh
#
# Wrapper script for generate_monthly_sales_report.postgresql

psql -f generate_monthly_sales_report.postgresql sales_db sales_user

Of course, there's still the issue of having to edit the dates in generate_monthly_sales_report.postgresql, and wouldn't it be nice to be able to name the output file as well?

As it happens, psql supports the setting of named variables from the command line. These appear as :varname in SQL statements. Modify generate_monthly_sales_report.postgresql like this:

-- generate_monthly_sales_report.postgresql
-- 
-- Outputs one month's worth of sales to a CSV file that
-- is easily imported into spreadsheet software.

\o report.csv
\a
\f ,
select product_id, product_name, product_sale_date from sales where 
product_sale_date >= :start_date and product_sale_date <= :end_date;
\o

Then modify gen_sales_report.sh:

#!/bin/sh

# gen_sales_report.sh
#
# Wrapper script for generate_monthly_sales_report.postgresql

psql --set start_date=\'2004-10-01\' \
     --set end_date=\'2004-10-31\' \
     -f generate_monthly_sales_report.postgresql sales_db sales_user

Note the escaping of single quotes to preserve them for the SQL query, where the dates will need them.

Of course, we've really just moved the problem into the wrapper shell script: we still have to edit the dates; we're just doing it in gen_sales_report.sh instead of generate_monthly_sales_report.postgresql. The real solution is to grab the dates from the command line so that we can reuse both scripts without ever having to edit them again.

#!/bin/sh

# gen_sales_report.sh
#
# Wrapper script for generate_monthly_sales_report.postgresql

psql --set start_date=\'$1\' \
     --set end_date=\'$2\' \
     -f generate_monthly_sales_report.postgresql sales_db sales_user
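
The other wish from earlier, naming the output file, can be handled the same way. psql also substitutes :variables inside meta-command arguments, so (a sketch I haven't verified against every psql version) the first line of generate_monthly_sales_report.postgresql could become

\o :outfile

with the wrapper passing a third argument through:

psql --set start_date=\'$1\' \
     --set end_date=\'$2\' \
     --set outfile=$3 \
     -f generate_monthly_sales_report.postgresql sales_db sales_user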

If you want to make your shell scripts really robust (and I encourage you to do so), you can write some validation code for $1 and $2 instead of trusting that the user (a) entered both command-line arguments and (b) entered correctly formatted dates for use in the SQL command. There's a fantastic book from No Starch Press called Wicked Cool Shell Scripts, by Dave Taylor, that has some great examples of argument validation using the shell. The book has raised my standards for shell script robustness. I strongly recommend it.
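
As a starting point, here is one minimal sketch of what that validation might look like; it simply insists on exactly two arguments shaped like YYYY-MM-DD, and you can tighten or extend the checks to suit your own standards:

#!/bin/sh

# gen_sales_report.sh -- the same wrapper, now with basic argument checks.

if [ $# -ne 2 ]; then
    echo "usage: $0 start_date end_date   (dates in YYYY-MM-DD format)" >&2
    exit 1
fi

for d in "$1" "$2"; do
    case "$d" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])
            ;;  # shaped like a date; good enough for this sketch
        *)
            echo "error: '$d' is not in YYYY-MM-DD format" >&2
            exit 1
            ;;
    esac
done

psql --set start_date=\'$1\' \
     --set end_date=\'$2\' \
     -f generate_monthly_sales_report.postgresql sales_db sales_user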

Data Feeds

Let us pretend that our toy database has a products table with the following definition:

drop sequence products_row_num_seq;
 
create sequence products_row_num_seq cycle;
 
drop table products;
 
create table products (
  row_num int not null default nextval('products_row_num_seq'),
  id int primary key,
  name varchar(35) not null,
  description varchar(1024) null,
  price numeric(8, 2) not null);
 
comment on table products is 'Records of products sold at the bookstore.';
comment on column products.row_num is 'For de-duping purposes.';

Let us also pretend that for each new product that we sell, the marketing department provides us with a file in a standard format. We use that file to insert new products into the products table.

There are multiple approaches for inserting new data into a database.

One approach is to write software to parse the input file and then use a connection library such as Perl's DBI or Java's JDBC to update the database. Many people do it this way, but as a matter of preference I don't much enjoy it. Configuring DBI or JDBC always takes more time than I expect, and when the inevitable errors appear in the data feed, it's time-consuming to track down the source of the problem. (Think nested exceptions and hunting for the offending line number in the input file.)

My favorite solution is to turn the data into a series of SQL insert or update statements that I put into a file. In turn, I feed that file to psql using its -f argument. This proves to be a very robust way of doing data input: psql already knows how to connect to PostgreSQL (so you don't have to tinker with Perl's DBI or Java's JDBC), and psql will report the exact line number of your SQL input file whenever an error occurs.

Also, it is trivial to add BEGIN/COMMIT SQL commands at the top and bottom of your input file to make the entire batch of inserts or updates a single transaction. This way, whenever an error occurs on line 3000 of your 5,000-line SQL command file, the whole transaction aborts (with a helpful error message, no less!), allowing you to fix the offending line and simply run the entire file again, confident that the last, successful run will be the only one that actually affects the database.

Let's pretend that the marketing department types up new product descriptions in a spreadsheet and saves them to comma-separated value (.csv) files, which you must upload to the database. Here's a small sample file, products.csv, consisting of new products to add to the products table:

"id","name","description","price"
1001,"paperclips","box of 200",0.99
1002,"pencils","box of 10",1.99
1003,"birthday card",,2.99

The goal is to turn the .csv file into a file called runme.postgresql containing commands like:

\set ON_ERROR_STOP 1
begin;
insert into products (id, name, description, price) values 
     (1001, 'paperclips', 'box of 200', 0.99);
insert into products (id, name, description, price) values 
     (1002, 'pencils', 'box of 10', 1.99);
insert into products (id, name, description, price) values 
     (1003, 'birthday card', null, 2.99);
commit;

I created the above file using the create_input_sql.pl Perl script, which you can use as inspiration for your own data upload scripts. The program is not as important as the idea that we can create a file of commands to feed to psql.
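
If Perl isn't your thing, the same idea can be roughed out with the shell and awk. The script below (create_input_sql.sh, a hypothetical stand-in for the Perl version) assumes the simple quoting of the sample file: no commas or double quotes embedded inside fields.

#!/bin/sh

# create_input_sql.sh -- turn products.csv into runme.postgresql.
# A rough sketch only; it assumes no commas or quotes inside fields.

csv_file=${1:-products.csv}

{
    printf '%s\n' '\set ON_ERROR_STOP 1'
    echo 'begin;'
    # Skip the header row, strip the surrounding double quotes, and emit
    # one insert statement per data row. An empty description becomes null.
    tail -n +2 "$csv_file" | awk -F, -v q="'" '{
        gsub(/"/, "")
        desc = ($3 == "") ? "null" : (q $3 q)
        printf "insert into products (id, name, description, price) values\n"
        printf "     (%s, %s%s%s, %s, %s);\n", $1, q, $2, q, desc, $4
    }'
    echo 'commit;'
} > runme.postgresql

Run it as ./create_input_sql.sh products.csv and it writes runme.postgresql in the current directory, ready to feed to psql.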

Our command file has two useful additions beyond the insert statements. The first sets ON_ERROR_STOP to 1 (making it true), so that if a bad insert statement appears in the command file, psql stops executing the rest of the file. The second is the begin/commit pair wrapping all of the insert statements, which makes the whole update run as a single transaction, as discussed earlier.

Naturally, a shell script seems like a good way to pull this all together:

#!/bin/sh

# load_new_products.sh
#
# Wrapper script that transforms products.csv into runme.postgresql
# and feeds runme.postgresql to psql.

./create_input_sql.pl products.csv  # outputs runme.postgresql

psql -f runme.postgresql sales_db sales_user

One enhancement I recommend, when the script runs in the real world on files of a thousand lines or more, is to redirect all of psql's output to a log file so that you have a record of what happened. Also, I've noticed while running my own rather large input files that psql runs much faster when its output goes to a file rather than to a terminal.

#!/bin/sh

# load_new_products.sh
#
# Wrapper script that transforms products.csv into runme.postgresql
# and feeds runme.postgresql to psql.

./create_input_sql.pl products.csv  # outputs runme.postgresql

psql -f runme.postgresql sales_db sales_user 1> load_new_products.log 2>&1

"Can This Be Automated?"

I hope you found these automation samples useful. If this article has done its job, next time you have to do a repetitive task with PostgreSQL, you will ask yourself, "Can I automate this?" and you will have these samples to help you start.

Manni Wood leads teams at a Boston advertising company in building Java-based, database-backed web sites for clients like General Motors and FedEx.

