Automating PostgreSQL Tasksby Manni Wood
You can and should automate routine database tasks, to remove the possibility of user error and to let yourself focus on other, more interesting work. If you use PostgreSQL on Unix or Linux, you can combine the shell with PostgreSQL's psql client to automate database chores.
Automating Database Creation
If you haven't yet created your database and you believe that you will be iteratively destroying and re-creating it as the project's needs become clearer, the first chore you should automate is the creation of the database itself.
I have the habit of creating a db_creation_scripts directory for my PostgreSQL projects, numbering the scripts inside according to the order in which they should execute. Here's a listing for the sample database I created to write this article:
prospero:db_creation_scripts$ ls 0_create_database_and_user.postgresql 1_create_main_tables.postgresql
A real project might have a few more scripts for pre-populating some of the tables or creating stored procedures, but our toy example is simple and needs just two scripts.
script is really a set of steps to follow manually. Take a look at it if
you're interested in learning how I created the toy database used for the rest of this
The fun automation stuff begins with
/* $Revision: 1.24 $ $Date: 2004/04/30 20:17:41 $ To create the bookstore database, run this file through psql like so: prompt> psql -f 5_create_main_tables.postgresql bookstore bookstore_user */ drop sequence sales_row_num_seq; create sequence sales_row_num_seq cycle; drop table sales; create table sales ( row_num int not null default nextval('sales_row_num_seq'), first_name varchar(35) not null, middle_name varchar(35) null, last_name varchar(35) not null, city varchar(35) not null, state char(2) not null, zip varchar(5) not null, email_address varchar(50) not null, product_id int not null, product_name varchar(50) not null, product_sale_date date not null); comment on table sales is 'Contains records of books sold from the bookstore.'; comment on column sales.row_num is 'For de-duping purposes.';
As the comment at the top of the file instructs,
cd to the
directory holding the file and run
$ psql -f 1_create_main_tables.postgresql bookstore bookstore_user
to create your database table.
If you decide that you should have broken the
sales table into
sales tables—certainly better relational design—just alter the file above
and rerun the
psql command to rebuild your database. This sort
of automation is a real boon to the iterative design process early in a
project. It also documents your database, and it makes it easy to re-create
your database on a different host.
If you use CVS, you'll notice revision and date tags at the top of
0_create_database_and_user.postgresql. This is because I like to check
my database creation scripts into CVS, or, lately, Subversion. It's a great way
of keeping track of my database schema as it evolves, and it makes me more
willing to experiment, knowing that I can recall earlier versions of the
schema. It doesn't matter which version control system you use, of course. What
matters is that you can.
Note that because I suggest in
0_create_database_and_user.postgresql to allow
bookstore_user to log on without a password, the
command runs without prompting for a password. Obviously, I have traded
security for ease of use, especially for illustrative purposes in this article.
You need to decide which balance you will strike.
When your database project is mature, you will likely be generating reports, often in formats that your clients or bosses can open in their spreadsheet programs. (The populate.postgresql file contains sample data that you might find useful while following along.)
If you want to capture the results to a file, use
(they start with a backslash) in conjunction with SQL commands. You can log
on to your database with
psql bookstore bookstore_user and then
execute the following commands:
\o report.txt select product_id, product_name, product_sale_date from sales where product_sale_date >= '2004-10-01' and product_sale_date <= '2004-10-31'; \o
The resulting report.txt file, though easy on the eyes, is difficult to import into spreadsheets. You know that's what your bosses want.
psql bookstore bookstore_user. Then execute these
additional formatting commands to produce something that will open in Excel or
\o report.csv \a \f , select product_id, product_name, product_sale_date from sales where product_sale_date >= '2004-10-01' and product_sale_date <= '2004-10-31'; \o
If you want to study more of
psql's formatting commands, read
psql man page. There's even support for HTML table output!
Who wants to type all these commands every month just to generate a monthly
sales report, though? Put all of the above commands into a file called
-- generate_monthly_sales_report.postgresql -- -- Outputs one month's-worth of sales to a CSV file that -- is easily imported into spreadsheet software. \o report.csv \a \f , select product_id, product_name, product_sale_date from sales where product_sale_date *gt;= '2004-10-01' and product_sale_date <= '2004-10-31'; \o
Now you just have to remember to run
-f flag. You could even just make a nice wrapper script for that:
#!/bin/sh # gen_sales_report.sh # # Wrapper script for generate_monthly_sales_report.postgresql psql -f generate_monthly_sales_report.postgresql sales_db sales_user
Of course, there's still the issue of having to edit the dates in
generate_monthly_sales_report.postgresql, and wouldn't it be nice to
be able to name the output file as well?
As it happens,
psql supports the setting of named variables from the
command line. These appear as
:varname in SQL statements.
generate_monthly_sales_report.postgresql like this:
-- generate_monthly_sales_report.postgresql -- -- Outputs one month's-worth of sales to a CSV file that -- is easily imported into spreadsheet software. \o report.csv \a \f , select product_id, product_name, product_sale_date from sales where product_sale_date >= :start_date and product_sale_date <= :end_date; \o
#!/bin/sh # gen_sales_report.sh # # Wrapper script for generate_monthly_sales_report.postgresql psql --set start_date=\'2004-10-01\' \ --set end_date=\'2003-10-31\' \ -f generate_monthly_sales_report.postgresql sales_db sales_user
Note the escaping of single quotes to preserve them for the SQL query, where the dates will need them.
Of course, really we've just moved the problem to the wrapper shell script:
we still have to edit the dates, it's just that now we're doing it in
gen_sales_report.sh instead of
generate_monthly_sales_report.postgresql. The true solution is to grab
arguments from the command-line to reuse both scripts without ever having to
edit them again.
#!/bin/sh # gen_sales_report.sh # # Wrapper script for generate_monthly_sales_report.postgresql psql --set start_date=\'$1\' \ --set end_date=\'$2\' \ -f generate_monthly_sales_report.postgresql sales_db sales_user
If you want to make your shell scripts really robust (and I encourage you to do so), you can write some validation code for
$2 instead of trusting that the user (a) entered both command-line
arguments and (b) entered correctly formatted dates for use in the SQL command.
There's a fantastic book from No Starch Press called Wicked Cool Shell
Scripts, by Dave Taylor, that has some great examples of argument
validation using the shell. The book has raised my standards for shell script
robustness. I strongly recommend it.
Let us pretend that our toy database has a
products table that
has the following definition:
drop sequence products_row_num_seq; create sequence products_row_num_seq cycle; drop table products; create table products ( row_num int not null default nextval('products_row_num_seq'), id int primary key, name varchar(35) not null, description varchar(1024) null, price numeric(8, 2) not null); comment on table products is 'Records of products sold at the bookstore.'; comment on column products.row_num is 'For de-duping purposes.';
Let us also pretend that for each new product that we sell, the marketing
department provides us with a file in a standard format. We use that file to
insert new products into the
There are multiple approaches for inserting new data into a database.
One approach is to write software to parse the input file, and then use a connection library like Perl DB or JDBC to update the database. Many people do it this way, but as a matter of preference I don't much enjoy it. Configuring Perl DB or Java's JDBC always takes more time than I expect, and when the inevitable errors appear in the data feed, it's time-consuming to track down the source of the problem. (Think nested exceptions and tracking the line number of the input file.)
My favorite solution is to turn the data into a series of SQL insert or
update statements that I put into a file. In turn, I feed that file to
psql using its
-f argument. This proves to be a very
robust way of doing data input:
psql already knows how to connect
to PostgreSQL (so you don't have to tinker with Perl's DBI or Java's JDBC), and
psql will report the exact line number of your SQL input file
whenever an error occurs.
Also, it is trivial to add
commands at the top and bottom of your input file to make the entire batch of
inserts or updates a single transaction. This way, whenever an error occurs on
line 3000 of your 5,000-line SQL command file, the whole transaction aborts
(with a helpful error message, no less!), allowing you to fix the offending
line and simply run the entire file again, confident that the last, successful run will be the only one that actually affects the database.
Let's pretend that the marketing department types up new product
descriptions in a spreadsheet and saves to them to comma-separated value
(.csv) files, which you must upload to the database. Here's a
small sample file, products.csv, consisting of new
products to add to the
"id","name","description","price" 1001,"paperclips","box of 200",0.99 1002,"pencils","box of 10",1.99 1003,"birthday card",,2.99
The goal is to turn the .csv file into file called runme.postgresql containing commands like:
\set ON_ERROR_STOP 1 begin; insert into products (id, name, description, price) values (1001, 'paperclips', 'box of 200', 0.99); insert into products (id, name, description, price) values (1002, 'pencils', 'box of 10', 1.99); insert into products (id, name, description, price) values (1003, 'birthday card', null, 2.99); commit;
I created the above file using the
create_input_sql.pl Perl script, which
you can use as inspiration for your own data upload scripts. The program is not
as important as the idea that we can create a file of commands to feed to
Our command file has two useful commands in addition to the insert
statements. The first is where I set
1 (making it
true), so that if there is a bad insert statement in the command file,
psql will stop executing the rest of the file. The second is the
set of begin/commit statements that wrap all the insert statements. This allows
the update to execute as a single transaction, as discussed earlier.
Naturally, a shell script seems like a good way to pull this all together:
#!/bin/sh # load_new_products.sh # # Wrapper script that transforms products.csv into runme.postgresql # and feeds runme.postgresql to psql. ./create_input_sql.pl products.csv # outputs runme.postgresql psql -f runme.postgresql sales_db sales_user
One enhancement I recommend to the script, when used in the real world on
thousand-plus-line files, is to pipe all of
psql's output to a log
file so that you have a record of what happened. Also, I've noticed while
running my own considerably large input files that
much faster when its output goes to a file rather than to a
#!/bin/sh # load_new_products.sh # # Wrapper script that transforms products.csv into runme.postgresql # and feeds runme.postgresql to psql. ./create_input_sql.pl products.csv # outputs runme.postgresql psql -f runme.postgresql sales_db sales_user 1> load_new_products.log 2>&1
"Can This Be Automated?"
I hope you found these automation samples useful. If this article has done its job, next time you have to do a repetitive task with PostgreSQL, you will ask yourself, "Can I automate this?" and you will have these samples to help you start.
Manni Wood leads teams at a Boston advertising company in building Java-based, database-backed web sites for clients like General Motors and FedEx.
Return to ONLamp.com.