News

Welcome to End Point’s blog

Ongoing observations by End Point people

Restoring individual table data from a Postgres dump

Recently, one of our clients needed to restore the data in a specific table from the previous night's PostgreSQL dump file. Basically, there was a UPDATE query that did not do what it was supposed to, and some of the columns in the table were irreversibly changed. So, the challenge was to quickly restore the contents of that table.

The SQL dump file was generated by the pg_dumpall command, and thus there was no easy way to extract individual tables. If you are using the pg_dump command, you can specify a "custom" dump format by adding the -Fc option. Then, pulling out the data from a single table becomes as simple as adding a few flags to the pg_restore command like so:

$ pg_restore --data-only --table=alpha large.custom.dumpfile.pg > alpha.data.pg

One of the drawbacks of using the custom format is that it is only available on a per-database basis; you cannot use it with pg_dumpall. That was the case here, so we needed to extract the data of that one table from within the large dump file. If you know me well, you might suspect at this point that I've written yet another handy perl script to tackle the problem. As tempting as that may have been, time was of the essence, and the wonderful array of Unix command line tools already provided me with everything I needed.

Our goal at this point was to pull the data from a single table ("alpha") from a very large dump file ("large.dumpfile.pg") into a separate and smaller file that we could use to import directly into the database.

The first step was to find exactly where in the file the data was. We knew the name of the table, and we also know that a dump file inserts data by using the COPY command, so there should be a line like this in the dump file:

COPY alpha (a,b,c,d) FROM stdin;

Because all the COPYs are done together, we can be pretty sure that the command after "COPY alpha" is another copy. So the first thing to try is:

$ grep -n COPY large.dumpfile.pg | grep -A1 'COPY alpha '

This uses grep's handy -n option (aka --line-number) to output the line number that each match appears on. Then we pipe that back to grep, search for our table name, and print the line after it with the -A option (aka --after-context). The output looked like this:

$ grep -n COPY large.dumpfile.pg | grep -A1 'COPY alpha '
1233889:COPY alpha (cdate, who, state, add, remove) FROM stdin;
12182851:COPY alpha_sequence (sname, value) FROM stdin;

Note that many of the options here are GNU specific. If you are using an operating system that doesn't support the common GNU tools, you are going to have a much harder time doing this (and many other shell tasks)!

We now have a pretty good guess at the starting and ending lines for our data: 1233889 to lines 12182850 (we subtract 1 as we don't want the next COPY). We can now use head and tail to extract the lines we want, once we figure out how many lines our data spans:

$ echo 12182851 - 1233889 | bc
10948962
$ head -1233889 large.dumpfile.pg | tail -10948962 > alpha.data.pg

However, what if the next command was not a COPY? We'll have to scan forward for the end of the COPY section, which is always a backslash and a single dot at the start of a new line. The new command becomes (all one line, but broken down for readability):

$ grep -n COPY large.dumpfile.pg \
    | grep -m1 'COPY alpha' \
    | cut -d: -f1 \
    | xargs -Ix tail --lines=+x large.dumpfile.pg \
    | grep -n -m1 '^\\\.'

That's a lot, but in the spirit of Unix tools doing one thing and one thing well, it's easy to break down. First, we grab the line numbers where COPY occurs in our file, then we find the first occurrence of our table (using the -m aka --max-count option). We cut out the first field from that output, using a colon as the delimiter. This gives is the line number where the COPY begins. We pass this to xargs, and tail the file with a --lines=+x argument, which outputs all lines from that file *starting* at the given line number. Finally, we pipe that output to grep and look for the end of copy indicator, stopping at the first one, and also outputting the line number. Here's what we get:

$ grep -n COPY large.dumpfile.pg \
    | grep -m1 'COPY alpha' \
    | cut -d: -f1 \
    | xargs -Ix tail --lines=+x large.dumpfile.pg \
    | grep -n -m1 '^\\\.'

148956:\.
xargs: tail: terminated by signal 13

This tells us that 148956 lines after the COPY, we encountered the string "\.". (The complaint from xargs can be ignored). Now we can create our data file:

$ grep -n COPY large.dumpfile.pg \
    | grep -m1 'COPY alpha' \
    | cut -d: -f1 \
    | xargs -Ix tail --lines=+x large.dumpfile.pg \
    | head -148956 > alpha.data.pg

Now that the file is there, we should do a quick sanity check on it. If the file is small enough, we could simply call it up in your favorite editor or run it through less or more. You can also check things out by knowing that a Postgres dump file separates the data in columns by a tab character when using the COPY command. So we can view all lines that don't have a tab, and make sure there is nothing except comments and the COPY and \. lines:

$ grep -v -P '\t' alpha.data.pg

The grep option -P (aka --perl-regexp) instructs grep to interpret the argument ("backslash t" in this case) as a Perl regular expression. You could also simply input a literal tab there: on most systems this can be done with the <ctrl-v><TAB> key combination.

It's time to replace that bad data. We'll need to truncate the existing table, then COPY our data back in. To do this, we'll create a file that we'll feed to psql -X -f. Here's the top of the file:

$ cat > alpha.restore.pg

\set ON_ERROR_STOP on
\timing

\c mydatabase someuser

BEGIN;

CREATE SCHEMA backup;

CREATE TABLE backup.alpha AS SELECT * FROM public.alpha;

TRUNCATE TABLE alpha;

From the top: we tell psql to stop right away if it encounters any problems, and then turn on the timing of all queries. We explicitly connect to the correct database as the correct user. Putting it here in the script is a safety feature. Then we start a new transaction, create a backup schema, and make a copy of the existing data into a backup table before truncating the original table. The next step is to add in the data, then wrap things up:

$ cat alpha.data.pg >> alpha.restore.pg

Now we run it and check for any errors. We use the -X argument to ensure control of exactly which psql options are in effect, bypassing any psqlrc files that may be in use.

$ psql -X -f alpha.restore.pg

If everything looks good, the final step is to add a COMMIT and run the file again:

$ echo "COMMIT;" >> alpha.restore.pg
$ psql -X -f alpha.restore.pg

And we are done! All of this is a little simplified, as in real life there was actually more than one table to be restored, and each had some foreign key dependencies that had to be worked around, but the basic idea remains the same. (and yes, I know you can do the extraction in a Perl one-liner)

6 comments:

Moltonel said...

This approach using just sed is much nicer IMHO :

sed -n "/^COPY alpha/,/^\\\.$/p" dump.sql

And for extra efficiency, exit ASAP :

sed -n "/^COPY alpha/,/^\\\.$/{p;/^\\\.$/q} dump.sql

Anonymous said...

I just used grep to get lines, computed the size of copy chunk, then head x.dmp -n ??? | tail -n ??? :) Simple and stupid

jacek said...

Moltonel:
It should be
COPY alpha /
not
COPY alpha/
because otherwise all tables named like:
alpha1, alphax etc. would be output.

Anonymous said...

using sed -n '100,400p' input.sql > output.sql will work if you've got a lot of rows - using head|tail fails if your data won't fit into memory.

David said...

Nice blog post.

I'm curious how you handled foreign key dependencies.

IIRC truncate cascade will also truncate tables referencing this one with foreign keys and truncate without cascade won't run if there are foreign keys.

Anonymous said...

Was really usefull.. Helped me so much.

Thanks :)