End Point

News

Welcome to End Point's blog

Ongoing observations by End Point people.

PostgreSQL template databases to restore to a known state

Someone asked on the mailing lists recently about restoring a PostgreSQL database to a known state for testing purposes. How to do this depends a little bit on what one means by "known state", so let's explore a few scenarios and their solutions.

First, let's assume you have a Postgres cluster with one or more databases that you create for developers or QA people to mess around with. At some point, you want to "reset" the database to the pristine state it was in before people starting making changes to it.

The first situation is that people have made both DDL changes (such as ALTER TABLE ... ADD COLUMN) and DML changes (such as INSERT/UPDATE/DELETE). In this case, what you want is a complete snapshot of the database at a point in time, which you can then restore from. The easiest way to do this is to use the TEMPLATE feature of the CREATE DATABASE command.

Every time you run CREATE DATABASE, it uses an already existing database as the "template". Basically, it creates a copy of the template database you specify. If no template is specified, it uses "template1" by default, so that these two commands are equivalent:

CREATE DATABASE foobar;
CREATE DATABASE foobar TEMPLATE template1;

Thus, if we want to create a complete copy of an existing database, we simply use it as a template for our copy:

CREATE DATABASE mydb_template TEMPLATE mydb;

Thus, when we want to restore the mydb database to the exact same state as it was when we ran the above command, we simply do:

DROP DATABASE mydb;
CREATE DATABASE mydb TEMPLATE mydb_template;

You may want to make sure that nobody changes your new template database. One way to do this is to not allow any non-superusers to connect to the database by setting the user limit to zero. This can be done either at creation time, or afterwards, like so:

CREATE DATABASE mydb_template TEMPLATE mydb CONNECTION LIMIT 0;

ALTER DATABASE mydb_template CONNECTION LIMIT 0;

You may want to go further by granting the database official "template" status by adjusting the datistemplate column in the pg_database table:

UPDATE pg_database SET datistemplate = TRUE WHERE datname = 'mydb_template';

This will allow anyone to use the database as a template, as long as they have the CREATEDB privilege. You can also restrict *all* connections to the database, even superusers, by adjusting the datallowconn column:

UPDATE pg_database SET datallowconn = FALSE WHERE datname = 'mydb_template';

Another way to restore the database to a known state is to use the pg_dump utility to create a file, then use psql to restore that database. In this case, the command to save a copy would be:

pg_dump mydb --create > mydb.template.pg

The --create option tells pg_dump to create the database itself as the first command in the file. If you look at the generated file, you'll see that it is using template0 as the template database in this case. Why does Postgres have template0 and template1? The template1 database is meant as a user configurable template that you can make changes to that will be picked up by all future CREATE DATABASE commands (a common example is a CREATE LANGUAGE command). The template0 database on the other hand is meant as a "hands off, don't ever change it" stable database that can always safely be used as a template, with no changes from when the cluster was first created. To that end, you are not even allowed to connect to the template0 database (thanks to the datallowconn column metioned earlier).

Now that we have a file (mydb.template.pg), the procedure to recreate the database becomes:

psql -X -c 'DROP DATABASE mydb'

psql -X --set ON_ERROR_STOP=on --quiet --file mydb.template.pg

We use the -X argument to ensure we don't have any surprises lurking inside of psqlrc files. The --set ON_ERROR_STOP=on option tells psql to stop processing the moment it encounters an error, and the --quiet tells psql to not be verbose and only let us know about very important things. (While I normally advocate using the --single-transaction option as well, we cannot in this case as our file contains a CREATE DATABASE line).


What if (as someone posited in the thread) the original poster really wanted only the *data* to be cleaned out, and not the schema (e.g. DDL)?. In this case, what we want to do is remove all rows from all tables. The easiest way to do this is with the TRUNCATE command of course. Because we don't want to worry about which tables need to be deleted before other ones because of foreign key constraints, we'll also use the CASCADE option to TRUNCATE. We'll query the system catalogs for a list of all user tables, generate truncate commands for them, and then play back the commands we just created. First, we create a simple text file containing commands to truncate all the tables:

SELECT 'TRUNCATE TABLE '
 || quote_ident(nspname)
 || '.'
 || quote_ident(relname)
 || ' CASCADE;'
FROM pg_class
JOIN pg_namespace n ON (n.oid = relnamespace)
WHERE nspname !~ '^pg'
AND nspname <> 'information_schema'
AND relkind = 'r';

Once that's saved as truncate_all_tables.pg, resetting the database by removing all rows from all tables becomes as simple as:

psql mydb -X -t -f truncate_all_tables.pg | psql mydb --quiet

We again use the --quiet option to limit the output, as we don't need to see a string of "TRUNCATE TABLE" strings scroll by. The -t option (also written as --tuples-only) prevents the headers and footers from being output, as we don't want to pipe those back in.

It's most likely you'd also want the sequences to be reset to their starting point as well. While sequences generally start at "1", we'll take out the guesswork by using the "ALTER SEQUENCE seqname RESTART" syntax. We'll append the following SQL to the text file we created earlier:

SELECT 'ALTER SEQUENCE '
 || quote_ident(nspname)
 || '.'
 || quote_ident(relname)
 || ' RESTART;'
FROM pg_class
JOIN pg_namespace n ON (n.oid = relnamespace)
WHERE nspname !~ '^pg'
AND nspname <> 'information_schema'
AND relkind = 'S';

The command is run the same as before, but now in addition to table truncation, the sequences are all reset to their starting values.


A final way to restore the database to a known state is a variation on the previous pg_dump command. Rather than save the schema *and* data, we simply want to restore the database without any data:

## Create the template file:
pg_dump mydb --schema-only --create > mydb.template.schemaonly.pg

## Restore it:
psql -X -c 'DROP DATABASE mydb'
psql -X --set ON_ERROR_STOP=on --file mydb.template.schemaonly.pg

Those are a few basic ideas on how to reset your database. There are a few limitations that got glossed over, such as that nobody can be connected to the database that is being used as a template for another one when the CREATE DATABASE command is being run, but this should be enough to get you started.

Learn more about End Point's Postgres Support, Development, and Consulting.

5 comments:

Philippe Beaudoin said...

Hi,
The use of the template database is a nice trick.
For cases when a relatively small part of the database is updated and no DDL operation is performed, I am developing a tool named E-Maj (a first version being available on pgfoundry) with a presentation. The concept is an extension of table_log contrib. Some supplied functions allow to create group of tables on which it is possible to set marks at predefined point in time and then, if needed, rollback all row insertions, updates and deletions up to one of these marks.
I intend to improve this tool. But if it can already help someone... Of course, any feedback is welcome.

Bob said...

I think that all objects should be stored in source control just like any code should be. From there you can have labels and branches and a simple build script to build a pristine database from source passing in a label or branch that you want to build from.

Greg Sabino Mullane said...

Philippe: Thanks for the links, looks interesting.

Andres Olarte said...

There's another approach that might work depending on what you do. I run a lot of unit tests against my stored procedures. To be able to roll back to a known good state after each run, I throw and catch an exception inside my test stored procedure. Something like this:

--- RUN MY TESTS HERE ---
RAISE EXCEPTION 'ROLLBACK';
EXCEPTION WHEN raise_exception THEN
RAISE NOTICE 'ROLLED BACK';

This causes any changes to be rolled back, and it's incredibly fast! There's functions I wrote to facilitate this type of testing: PLUnit

This can also be done using save points, if operating outside of a stored procedure. Something like this:

SAVEPOINT my_savepoint;
--- RUN MY TESTS HERE ---
ROLLBACK TO SAVEPOINT my_savepoint;

This solution is obviously not workable for the manual testing of a full application, but more for automated (unit) testing

Ethan Rowe said...

Andres: I've used exactly that strategy myself in an attempt to apply the test-driven development methodology to Postgres-specific work. It ended up working out nicely, and you end up with a DDL script that will back itself out automatically if it fails in foreseeable ways when applied in production, which is pretty nice.