News

Welcome to End Point’s blog

Ongoing observations by End Point people

How fast is pg_upgrade anyway?

Back in the old days, upgrading Postgres required doing a pg_dump and loading the resulting logical SQL into the new database. This could be a very slow, very painful process, requiring a lot of downtime. While there were other solutions (such as Bucardo) that allowed little (or even zero) downtime, setting them up was a large complex task. Enter the pg_upgrade program, which attempts to upgrade a cluster with minimal downtime. Just how fast is it? I grew tired of answering this question from clients with vague answers such as "it depends" and "really, really fast" and decided to generate some data for ballpark answers.

Spoiler: it's either about 3.5 times as fast as pg_dump, or insanely fast at a flat 15 seconds or so. Before going further, let's discuss the methodology used.

I used the venerable pgbench program to generate some sample tables and data, and then upgraded the resulting database, going from Postgres version 9.3 to 9.4. The pgbench program comes with Postgres, and simply requires an --initialize argument to create the test tables. There is also a --scale argument you can provide to increase the amount of initial data - each increment increases the number of rows in the largest table, pgbench_accounts, by one hundred thousand rows. Here are the scale runs I did, along with the number of rows and overall database size for each level:

Effect of --scale
--scaleRows in pgbench_accountsDatabase size
10010,000,0001418 MB
15015,000,0002123 MB
20020,000,0002829 MB
25025,000,0003535 MB
30030,000,0004241 MB
35035,000,0004947 MB
40040,000,0005652 MB
45045,000,0006358 MB
50050,000,0007064 MB
55055,000,0007770 MB
60060,000,0008476 MB

To test the speed of the pg_dump program, I used this simple command:

$ pg_dump postgres | psql postgres -q -p 5433 -f -

I did make one important optimization, which was to set fsync off on the target database (version 9.4). Although this setting should never be turned off in production - or anytime you cannot replace all your data, upgrades like this are an excellent time to disable fsync. Just make sure you flip it back on again right away! There are some other minor optimizations one could make (especially boosting maintenance_work_mem), but for the purposes of this test, I decided that the fsync was enough.

For testing the speed of pg_upgrade, I used the following command:

$ pg_upgrade -b $BIN1 -B $BIN2 -d $DATA1 -D $DATA2 -P 5433

The speed difference can be understood because pg_dump rewrites the entire database, table by table, row by row, and then recreates all the indexes from scratch. The pg_upgrade program simply copies the data files, making the minimum changes needed to support the new version. Because of this, it will always be faster. How much faster depends on a lot of variables, e.g. the number and size of your indexes. The chart below shows a nice linear slope for both methods, and yielding on average a 3.48 increase in speed of pg_upgrade versus pg_dump:

pg_dump versus pg_upgrade
--scaleDatabase sizepg_dump
(seconds)
pg_upgrade
(seconds)
Speedup
1001.4 GB210.074.72.82
1502.1 GB305.079.43.86
2002.8 GB457.6122.23.75
2503.5 GB636.1172.13.70
3004.2 GB832.2215.13.87
3504.9 GB1098.8320.73.43
4005.7 GB1172.7361.43.25
4506.4 GB1340.2426.73.15
5007.1 GB1509.6476.33.17
5507.8 GB1664.0480.03.47
6008.5 GB1927.06073.17

If you graph it out, you can see both of them having a similar slope, but with pg_upgrade as the clear winner:

I mentioned earlier that there were some other optimizations that could be done to make the pg_dump slightly faster. As it turns out, pg_upgrade can also be made faster. Absolutely, beautifully, insanely faster. All we have to do is add the --link argument. What this does is rather than copying the data files, it simply links them via the filesystem. Thus, each large data file that makes up the majority of a database's size takes a fraction of a second to link to the new version. Here are the new numbers, generated simply by adding a --link to the pg_upgrade command from above:

pg_upgrade --link is crazy fast
--scaleDatabase sizepg_upgrade --link
(seconds)
1001.4 GB12.9
1502.1 GB13.4
2002.8 GB13.5
2503.5 GB13.2
3004.2 GB13.6
3504.9 GB14.4
4005.7 GB13.1
4506.4 GB13.0
5007.1 GB13.2
5507.8 GB13.1
6008.5 GB12.9

No, those are not typos - an average of thirteen seconds despite the size of the database! The only downside to this method is that you cannot access the old system once the new system starts up, but that's a very small price to pay, as you can easily backup the old system first. There is no point in graphing these numbers out - just look at the graph above and imagine a nearly flat line traveling across the bottom of the graph :)

Are there any other options that can affect the time? While pgbench has a handy --foreign-keys argument I often use to generate a more "realistic" test database, both pg_dump and pg_upgrade are unaffected by any numbers of foreign keys. One limitation of pg_upgrade is that it cannot change the --checksum attribute of a database. In other words, if you want to go from a non-checksummed version of Postgres to a checksummed version, you need to use pg_dump or some other method. On the plus side, my testing found negligible difference between upgrading a checksummed versus a non-checksummed version.

Another limitation of the pg_upgrade method is that all internal stats are blown away by the upgrade, so the database starts out in a completely unanalyzed state. This is not as much an issue as it used to be, as pg_upgrade will generate a script to regenerate these stats, using the handy --analyze-in-stages argument to vacuum. There are a few other minor limitations to pg_upgrade: read the documentation for a complete list. In the end, pg_upgrade is extraordinarily fast and should be your preferred method for upgrading. Here is a final chart showing the strengths and weaknesses of the major upgrade methods.

Postgres upgrade methods compared
MethodStrengthsWeaknesses
pg_dump
  • Always works
  • Battle tested
  • Slowest method
  • Maximum downtime
  • Requires lots of disk space
pg_upgrade
  • Very fast
  • --link mode super fast
  • Cannot always be used (finicky)
  • Stats are lost
  • Minimal but non-zero downtime
Bucardo
  • Handles complex cases
  • Zero-downtime possible
  • Complex to setup
  • Requires primary keys on large tables
  • Requires lots of disk space

(As an addendum of sorts, pg_upgrade is fantastic, but the Holy Grail is still out of sight: true in-place upgrades. This would mean dropping in a new major version (similar to the way revisions can be dropped in now), and this new version would be able to read both old and new data file formats, and doing an update-on-write as needed. Someday!)

9 comments:

Denish Patel said...

Very good article but it will be useful with large db size. For small db size, it's very easy to tune pg_dump/restore with parallel options and get away with smaller outage window. The problem arises with dbsize greater than 20 or over 100GB.

Anyway you can show difference in timing between 10GB vs 100GB?

Greg Sabino Mullane said...

I ran out of space on my test box, but I will find another place and run some larger tests. I suspect the speedup will grow quickly past 3.5 as the sizes get larger.

Tuxie said...

In my experience with 7+ TB databases on fast storage (SSD/SAN) and 24 core servers with 256+ GB RAM, parallel pg_dump+pg_restore is not much slower than pg_upgrade because pg_restore (like pg_upgrade) will spend the majority of the total time reindexing and analyzing, not loading data. Going forward, now that we have --checksum in place and pg_upgrade has had time to mature, we will use pg_upgrade to go from 9.4 to 9.5 though.

Joe said...

Doesn't the Bucardo approach also require lots of disk space (I assume it means replicating from old to new)?

Greg Sabino Mullane said...

Joe: Good point! I've added "disk space" to the list of Bucardo drawbacks.

Greg Sabino Mullane said...

Tuxie: Not sure I follow: pg_upgrade does not reindex (but does need to analyze). It really depends on your DB if most of the pg_restore time is spent on COPY vs CREATE INDEX. I do generally recommend that clients suck up the time hit in order to get to --checksums, as it is a one-time pain, and after that you can use pg_upgrade -k :)

David Fetter said...

Nice summary.

You've really made the case for the logical replication (Bucardo/Slony/Londiste) version because "zero downtime" is frequently a hard requirement and very few situations make storage so expensive that an in-place upgrade is required.

Greg Sabino Mullane said...

David Fetter: Yes, it's hard to beat zero downtime, but I have found that most clients usually have *some* wiggle room despite a stated hard requirement of "zero", and thus --link becomes a great option. I don't think storage costs are as much a factor as process complexity as far as doing logical replication versus in-place.

Greg Sabino Mullane said...

Denish: I ran some more tests on a better server, and found the ratio fairly consistent across a wide range of --steps. In this case, the ratio was around 2.8x. At a step of 10000, and a 146GB database, pg_dump took 150 minutes, while pg_upgrade took 52 minutes. Using --link took 9 seconds :)