You can and should automate routine database tasks, to remove the possibility of user error and to let yourself focus on other, more interesting work. If you use PostgreSQL on Unix or Linux, you can combine the shell with PostgreSQL's psql client to automate database chores.
If you haven't yet created your database and you believe that you will be iteratively destroying and re-creating it as the project's needs become clearer, the first chore you should automate is the creation of the database itself.
I have the habit of creating a db_creation_scripts directory for my PostgreSQL projects, numbering the scripts inside according to the order in which they should execute. Here's a listing for the sample database I created to write this article:
prospero:db_creation_scripts$ ls
0_create_database_and_user.postgresql 1_create_main_tables.postgresql
A real project might have a few more scripts for pre-populating some of the tables or creating stored procedures, but our toy example is simple and needs just two scripts.
The 0_create_database_and_user.postgresql
script is really a set of steps to follow manually. Take a look at it if
you're interested in learning how I created the toy database used for the rest of this
article.
The fun automation stuff begins with 1_create_main_tables.postgresql:
/* $Revision: 1.24 $ $Date: 2004/04/30 20:17:41 $
To create the bookstore database, run this file through psql like so:
prompt> psql -f 5_create_main_tables.postgresql bookstore bookstore_user
*/
drop sequence sales_row_num_seq;
create sequence sales_row_num_seq cycle;
drop table sales;
create table sales (
row_num int not null default nextval('sales_row_num_seq'),
first_name varchar(35) not null,
middle_name varchar(35) null,
last_name varchar(35) not null,
city varchar(35) not null,
state char(2) not null,
zip varchar(5) not null,
email_address varchar(50) not null,
product_id int not null,
product_name varchar(50) not null,
product_sale_date date not null);
comment on table sales is 'Contains records of books sold from the bookstore.';
comment on column sales.row_num is 'For de-duping purposes.';
As the comment at the top of the file instructs, cd to the
directory holding the file and run
$ psql -f 1_create_main_tables.postgresql bookstore bookstore_user
to create your database table.
|
Related Reading
|
If you decide that you should have broken the sales table into
customer, product, and sales tables—certainly better relational design—just alter the file above
and rerun the psql command to rebuild your database. This sort
of automation is a real boon to the iterative design process early in a
project. It also documents your database, and it makes it easy to re-create
your database on a different host.
If you use CVS, you'll notice revision and date tags at the top of
0_create_database_and_user.postgresql. This is because I like to check
my database creation scripts into CVS, or, lately, Subversion. It's a great way
of keeping track of my database schema as it evolves, and it makes me more
willing to experiment, knowing that I can recall earlier versions of the
schema. It doesn't matter which version control system you use, of course. What
matters is that you can.
Note that because I suggest in
0_create_database_and_user.postgresql to allow
bookstore_user to log on without a password, the psql
command runs without prompting for a password. Obviously, I have traded
security for ease of use, especially for illustrative purposes in this article.
You need to decide which balance you will strike.
When your database project is mature, you will likely be generating reports, often in formats that your clients or bosses can open in their spreadsheet programs. (The populate.postgresql file contains sample data that you might find useful while following along.)
If you want to capture the results to a file, use psql commands
(they start with a backslash) in conjunction with SQL commands. You can log
on to your database with psql bookstore bookstore_user and then
execute the following commands:
\o report.txt
select product_id, product_name, product_sale_date from sales where
product_sale_date >= '2004-10-01' and product_sale_date <= '2004-10-31';
\o
The resulting report.txt file, though easy on the eyes, is difficult to import into spreadsheets. You know that's what your bosses want.
Again, run psql bookstore bookstore_user. Then execute these
additional formatting commands to produce something that will open in Excel or
OpenOffice.org:
\o report.csv
\a
\f ,
select product_id, product_name, product_sale_date from sales where
product_sale_date >= '2004-10-01' and product_sale_date <= '2004-10-31';
\o
If you want to study more of psql's formatting commands, read
the psql man page. There's even support for HTML table output!
Who wants to type all these commands every month just to generate a monthly
sales report, though? Put all of the above commands into a file called
generate_monthly_sales_report.postgresql:
-- generate_monthly_sales_report.postgresql
--
-- Outputs one month's-worth of sales to a CSV file that
-- is easily imported into spreadsheet software.
\o report.csv
\a
\f ,
select product_id, product_name, product_sale_date from sales where
product_sale_date *gt;= '2004-10-01' and product_sale_date <= '2004-10-31';
\o
Now you just have to remember to run
generate_monthly_sales_report.postgresql using psql's
-f flag. You could even just make a nice wrapper script for that:
#!/bin/sh
# gen_sales_report.sh
#
# Wrapper script for generate_monthly_sales_report.postgresql
psql -f generate_monthly_sales_report.postgresql sales_db sales_user
Of course, there's still the issue of having to edit the dates in
generate_monthly_sales_report.postgresql, and wouldn't it be nice to
be able to name the output file as well?
As it happens, psql supports the setting of named variables from the
command line. These appear as :varname in SQL statements.
Modify generate_monthly_sales_report.postgresql like this:
-- generate_monthly_sales_report.postgresql
--
-- Outputs one month's-worth of sales to a CSV file that
-- is easily imported into spreadsheet software.
\o report.csv
\a
\f ,
select product_id, product_name, product_sale_date from sales where
product_sale_date >= :start_date and product_sale_date <= :end_date;
\o
Then modify gen_sales_report.sh:
#!/bin/sh
# gen_sales_report.sh
#
# Wrapper script for generate_monthly_sales_report.postgresql
psql --set start_date=\'2004-10-01\' \
--set end_date=\'2003-10-31\' \
-f generate_monthly_sales_report.postgresql sales_db sales_user
Note the escaping of single quotes to preserve them for the SQL query, where the dates will need them.
Of course, really we've just moved the problem to the wrapper shell script:
we still have to edit the dates, it's just that now we're doing it in
gen_sales_report.sh instead of
generate_monthly_sales_report.postgresql. The true solution is to grab
arguments from the command-line to reuse both scripts without ever having to
edit them again.
#!/bin/sh
# gen_sales_report.sh
#
# Wrapper script for generate_monthly_sales_report.postgresql
psql --set start_date=\'$1\' \
--set end_date=\'$2\' \
-f generate_monthly_sales_report.postgresql sales_db sales_user
If you want to make your shell scripts really robust (and I encourage you to do so), you can write some validation code for $1 and
$2 instead of trusting that the user (a) entered both command-line
arguments and (b) entered correctly formatted dates for use in the SQL command.
There's a fantastic book from No Starch Press called Wicked Cool Shell
Scripts, by Dave Taylor, that has some great examples of argument
validation using the shell. The book has raised my standards for shell script
robustness. I strongly recommend it.
Let us pretend that our toy database has a products table that
has the following definition:
drop sequence products_row_num_seq;
create sequence products_row_num_seq cycle;
drop table products;
create table products (
row_num int not null default nextval('products_row_num_seq'),
id int primary key,
name varchar(35) not null,
description varchar(1024) null,
price numeric(8, 2) not null);
comment on table products is 'Records of products sold at the bookstore.';
comment on column products.row_num is 'For de-duping purposes.';
Let us also pretend that for each new product that we sell, the marketing
department provides us with a file in a standard format. We use that file to
insert new products into the products table.
There are multiple approaches for inserting new data into a database.
One approach is to write software to parse the input file, and then use a connection library like Perl DB or JDBC to update the database. Many people do it this way, but as a matter of preference I don't much enjoy it. Configuring Perl DB or Java's JDBC always takes more time than I expect, and when the inevitable errors appear in the data feed, it's time-consuming to track down the source of the problem. (Think nested exceptions and tracking the line number of the input file.)
My favorite solution is to turn the data into a series of SQL insert or
update statements that I put into a file. In turn, I feed that file to
psql using its -f argument. This proves to be a very
robust way of doing data input: psql already knows how to connect
to PostgreSQL (so you don't have to tinker with Perl's DBI or Java's JDBC), and
psql will report the exact line number of your SQL input file
whenever an error occurs.
Also, it is trivial to add BEGIN/COMMIT SQL
commands at the top and bottom of your input file to make the entire batch of
inserts or updates a single transaction. This way, whenever an error occurs on
line 3000 of your 5,000-line SQL command file, the whole transaction aborts
(with a helpful error message, no less!), allowing you to fix the offending
line and simply run the entire file again, confident that the last, successful run will be the only one that actually affects the database.
Let's pretend that the marketing department types up new product
descriptions in a spreadsheet and saves to them to comma-separated value
(.csv) files, which you must upload to the database. Here's a
small sample file, products.csv, consisting of new
products to add to the products table:
"id","name","description","price"
1001,"paperclips","box of 200",0.99
1002,"pencils","box of 10",1.99
1003,"birthday card",,2.99
The goal is to turn the .csv file into file called runme.postgresql containing commands like:
\set ON_ERROR_STOP 1
begin;
insert into products (id, name, description, price) values
(1001, 'paperclips', 'box of 200', 0.99);
insert into products (id, name, description, price) values
(1002, 'pencils', 'box of 10', 1.99);
insert into products (id, name, description, price) values
(1003, 'birthday card', null, 2.99);
commit;
I created the above file using the create_input_sql.pl Perl script, which
you can use as inspiration for your own data upload scripts. The program is not
as important as the idea that we can create a file of commands to feed to
psql.
Our command file has two useful commands in addition to the insert
statements. The first is where I set ON_ERROR_STOP to 1 (making it
true), so that if there is a bad insert statement in the command file,
psql will stop executing the rest of the file. The second is the
set of begin/commit statements that wrap all the insert statements. This allows
the update to execute as a single transaction, as discussed earlier.
Naturally, a shell script seems like a good way to pull this all together:
#!/bin/sh
# load_new_products.sh
#
# Wrapper script that transforms products.csv into runme.postgresql
# and feeds runme.postgresql to psql.
./create_input_sql.pl products.csv # outputs runme.postgresql
psql -f runme.postgresql sales_db sales_user
One enhancement I recommend to the script, when used in the real world on
thousand-plus-line files, is to pipe all of psql's output to a log
file so that you have a record of what happened. Also, I've noticed while
running my own considerably large input files that psql runs
much faster when its output goes to a file rather than to a
terminal.
#!/bin/sh
# load_new_products.sh
#
# Wrapper script that transforms products.csv into runme.postgresql
# and feeds runme.postgresql to psql.
./create_input_sql.pl products.csv # outputs runme.postgresql
psql -f runme.postgresql sales_db sales_user 1> load_new_products.log 2>&1
I hope you found these automation samples useful. If this article has done its job, next time you have to do a repetitive task with PostgreSQL, you will ask yourself, "Can I automate this?" and you will have these samples to help you start.
Manni Wood leads teams at a Boston advertising company in building Java-based, database-backed web sites for clients like General Motors and FedEx.
Return to ONLamp.com.
Copyright © 2007 O'Reilly Media, Inc.