Splitting Postgres pg_dump into pre and post data files

I've just released a small Perl script that has helped me solve a specific problem with Postgres dump files. When you use pg_dump or pg_dumpall, it outputs things in the following order, per database:

  1. schema creation commands (e.g. CREATE TABLE)
  2. data loading commands (e.g. COPY tablename FROM STDIN)
  3. post-data schema commands (e.g. CREATE INDEX)
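
As a rough illustration, a plain-text dump of a single hypothetical table contains all three sections in order, something like this:

  -- section one: schema creation
  CREATE TABLE mytable (id integer, name text);

  -- section two: data loading
  COPY mytable (id, name) FROM stdin;
  1	Alice
  \.

  -- section three: post-data
  CREATE INDEX mytable_name_idx ON mytable (name);
  ALTER TABLE mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);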

The problem is that using the --schema-only flag outputs the first and third sections into a single file. Hence, if you load that file and then load a separate --data-only dump, the data load can be very slow, as all the constraints, indexes, and triggers are already in place. The split_postgres_dump script breaks the dump file into two segments, a "pre" and a "post". (It doesn't yet handle a file with a data section, only a --schema-only version.)

Why would you need to do this instead of just using a full dump? Some reasons I've found include:

  • When you need to load the data more than once, such as debugging a data load error.
  • When you want to stop after the data load step (which you can't do with a full dump).
  • When you need to make adjustments to the schema before the data is loaded (seen quite a bit on major version upgrades).

Usage is simply ./split_postgres_dump.pl yourdumpfile.pg, which will then create two new files, yourdumpfile.pg.pre and yourdumpfile.pg.post. It doesn't produce perfectly formatted files, but it gets the job done!
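
For example, a complete dump-and-reload cycle using the split files might look like this (with hypothetical database names mydb and newdb):

  pg_dump --schema-only mydb > mydb.schema.pg
  ./split_postgres_dump.pl mydb.schema.pg      # creates mydb.schema.pg.pre and .post
  psql -d newdb -f mydb.schema.pg.pre          # tables exist; no indexes or constraints yet
  pg_dump --data-only mydb | psql -d newdb     # data loads without index or trigger overhead
  psql -d newdb -f mydb.schema.pg.post         # now build the indexes, constraints, and triggers

The gap between the pre and post steps is where you can rerun a failed data load or adjust the schema, per the reasons above.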

It's a small script, so it has no bug tracker, git repo, etc., but it does have a small wiki page at http://bucardo.org/wiki/Split_postgres_dump, from which you can download the latest version.

Future versions of pg_dump will allow you to break things into pre- and post-data sections with flags, but until then, I hope somebody finds this script useful.
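
When those flags arrive, usage should look something like this (again with a hypothetical mydb):

  pg_dump --section=pre-data mydb > mydb.pre
  pg_dump --section=post-data mydb > mydb.post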

Update: There is now a git repo:
git clone git://bucardo.org/split_postgres_dump.git

10 comments:

Richard (Yen) said...

This is very useful, especially for setting up Slony log shipping. The slony1_dump.sql script does a data dump, which needs to be applied against a schema dump. However, `pg_dump --schema-only` produces the pre- and post-data commands in one file, so the foreign keys are applied immediately, causing the re-load of data to fail. I've been doing the schema split by hand up until now (splitting the schema file at the first "ALTER TABLE .* ADD CONSTRAINT" command), but I think this script will make things go a lot faster.
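
For reference, the by-hand split boils down to something like:

  csplit schema.sql '/^ALTER TABLE .* ADD CONSTRAINT/'   # writes xx00 (pre) and xx01 (post)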

Jeff Davis said...

That's very useful.

Other things that might be helpful would be to:
* Optionally restore indexes using CREATE INDEX CONCURRENTLY.
* Optionally break constraints out into their own phase. This may be useful when you want to restart the application ASAP and need the constraints to be active, but can live without non-constraint indexes for a while.
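
Roughly, the post file would then become two files along these lines (table and index names invented):

  -- phase one: constraints the application needs right away
  ALTER TABLE orders ADD CONSTRAINT orders_pkey PRIMARY KEY (id);

  -- phase two: non-constraint indexes, built without blocking writes
  CREATE INDEX CONCURRENTLY orders_created_at_idx ON orders (created_at);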

Grzegorz J said...

Just dump it in tar format, and you'll get everything chopped apart.

Greg Sabino Mullane said...

Jeff Davis-

Good ideas. I guess maybe I should put this script into git after all. :)

Greg Sabino Mullane said...

Grzegorz J-

That doesn't really help for the use cases the script was written for.

Greg Sabino Mullane said...

Okay, now in git at github:

http://github.com/bucardo/Split_postgres_dump

Richard Huxton said...

I've got a similar gadget I use here, but it's a schema exploder. You run a --schema-only dump through it and get a directory structure:
[schema]/[object-type]/[name]
e.g.
lookups/table/country_names

Each file has creation / adjustment sections. Hadn't thought about having an option to separate them.

I mainly use it so I have something diff/grep friendly to play with when looking at schema changes.

Andrew said...

Splitting a dump is totally trivial with a custom format dump, with or without data members. You use the --list and --use-list pieces of pg_restore plus a tiny bit of perl (or sed) glue. You're not using custom format dumps? Well, quite apart from this, you're also depriving yourself of the possibility of using parallel pg_restore.
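
Concretely, the glue is something like this (database names invented, and the grep pattern is only a rough first cut):

  pg_dump -Fc mydb > mydb.dump
  pg_restore --list mydb.dump > full.toc
  grep -vE 'INDEX|FK CONSTRAINT|TRIGGER' full.toc > predata.toc
  pg_restore --use-list=predata.toc -d newdb mydb.dump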

Greg Sabino Mullane said...

Andrew:

Not to minimize my own efforts, but this script *is* "a tiny bit of perl (or sed) glue" :) I'm all for custom formats: the process just becomes pg_restore --schema-only dumpfile > newdumpfile; ./split_postgres_dump.pl newdumpfile. Of course, pg_dumpall doesn't do custom formats...

Tobias said...

@Richard Huxton:
Richard, can you give some more details of the scripts you are using to do that kind of 'schema explode'? It sounds like this is actually something I've been looking for for years to integrate PostgreSQL development into SVN.