Welcome to End Point’s blog

Ongoing observations by End Point people

PostgreSQL EC2/EBS/RAID 0 snapshot backup

One of our clients uses Amazon Web Services to host their production application and database servers on EC2 with EBS (Elastic Block Store) storage volumes. Their main database is PostgreSQL.

A big benefit of Amazon's cloud services is that you can easily add and remove virtual server instances, storage space, etc. and pay as you go. One known problem with Amazon's EBS storage is that it is much more I/O limited than, say, a nice SAN.

To partially mitigate the I/O limitations, they're using 4 EBS volumes to back a Linux software RAID 0 block device. On top of that is an XFS filesystem. This gives roughly 4x the I/O throughput and has been effective so far.

They ship WAL files to a secondary server that serves as warm standby in case the primary server fails. That's working fine.

They also do nightly backups using pg_dumpall on the master so that there's a separate portable (SQL) backup not dependent on the server architecture. The problem that led to this article is that the extra I/O caused by pg_dumpall pushes the system beyond its I/O limits. It adds both reads (from the PostgreSQL database) and writes (to the SQL output file).

There are several solutions we are considering so that we can keep both binary backups of the database and SQL backups, since both types are valuable. In this article I'm not discussing all the options or trying to decide which is best in this case. Instead, I want to consider just one of the tried and true methods of backing up the binary database files on another host to offload the I/O:

  1. Create an atomic snapshot of the block devices
  2. Spin up another virtual server
  3. Mount the backup volume
  4. Start Postgres and allow it to recover from the apparent "crash" the server had (since there wasn't a clean shutdown of the database before the snapshot)
  5. Do whatever pg_dump or other backups are desired
  6. Make throwaway copies of the snapshot for QA or other testing

The benefit of such snapshots is that you get an exact backup of the database, with whatever table bloat, indexes, statistics, etc. exactly as they are in production. That's a big difference from a freshly created database and import from pg_dump.

The difference here is that we're using 4 EBS volumes with RAID 0 striped across them, and there isn't currently a way to do an atomic snapshot of all 4 volumes at the same time. So it's no longer "atomic" and who knows what state the filesystem metadata and the file data itself would be in?

Well, why not try it anyway? Filesystem metadata doesn't change that often, especially in the controlled environment of a Postgres data volume. Snapshotting within a relatively short timeframe would be pretty close to atomic, and probably look to the software (operating system and database) like some kind of strange crash since some EBS volumes would have slightly newer writes than others. But aren't all crashes a little unpredictable? Why shouldn't the software be able to deal with that? Especially if we have Postgres make a checkpoint right before we snapshot.

I wanted to know if it was crazy or not, so I tried it on a new set of services in a separate AWS account. Here are the notes and some details of what I did:

  1. Created one EC2 instance:
    Amazon EC2 Debian 5.0 lenny AMI built by Eric Hammond
    Debian AMI ID ami-4ffe1926 (x86_64)
    Instance Type: High-CPU Extra Large (c1.xlarge) - 7 GB RAM, 8 CPU cores
  2. Created 4 x 10 GB EBS volumes
  3. Attached volumes to the instance
  4. Created software RAID 0 device:
    mdadm -C /dev/md0 -n 4 -l 0 -z max /dev/sdf /dev/sdg /dev/sdh /dev/sdi
  5. Created XFS filesystem on top of RAID 0 device:
    mkfs -t xfs -L /pgdata /dev/md0
  6. Set up in /etc/fstab and mounted:
    mkdir /pgdata
    # edit /etc/fstab, with noatime
    mount /pgdata
  7. Installed PostgreSQL 8.3
  8. Configured postgresql.conf to be similar to primary production database server
  9. Created empty new database cluster with data directory in /pgdata
  10. Started Postgres and imported a play database (from public domain census name data and Project Gutenberg texts), resulting in about 820 MB in data directory
  11. Ran some bulk inserts to grow database to around 5 GB
  12. Rebooted EC2 instance to confirm everything came back up correctly on its own
  13. Set up two concurrent data-insertion processes:
    • 50 million row insert based on another local table (INSERT INTO ... SELECT ...), in a single transaction (hits disk hard, but nothing should be visible in the snapshot because the transaction won't have committed before the snapshot is taken)
    • Repeated single inserts in autocommit mode (Python script writing INSERT statements using random data from /usr/share/dict/words piped into psql), to verify that new inserts made it into the snapshot, and no partial row garbage leaked through
  14. Started those "beater" jobs, which mostly consumed 2-3 CPU cores
  15. Manually inserted a known test row and created a known view that should appear in the snapshot
  16. Started Postgres's backup mode that allows for copying binary data files in a non-atomic manner, which also does a CHECKPOINT and thus also a filesystem sync:
    SELECT pg_start_backup('raid_backup');
  17. Manually inserted a 2nd known test row & 2nd known test view that I don't want to appear in the snapshot after recovery
  18. Ran snapshot script, which calls ec2-create-snapshot on each of the 4 EBS volumes. On the first run the calls ran serially and quite slowly, taking about 1 minute total; on the second run they ran in parallel, so that the snapshot point was within 1 second for all 4 volumes
  19. Told Postgres the backup was over:
    SELECT pg_stop_backup();
  20. Ran script to create new EBS volumes derived from the 4 snapshots (which aren't directly usable and always go into S3), using ec2-create-volume --snapshot
  21. Ran script to attach new EBS volumes to devices on the new EC2 instance using ec2-attach-volume
  22. Then, on the new EC2 instance for doing backups:
    • mdadm --assemble --scan
    • mount /pgdata
    • Start Postgres
    • Count rows on the 2 volatile tables; confirm that the table with the in-process transaction doesn't show any new rows, and that the table receiving individually committed rows reads correctly
    • VACUUM VERBOSE -- and confirm no errors or inconsistencies detected
    • pg_dumpall # confirmed no errors and data looks sound

It worked! No errors or problems, and pretty straightforward to do.
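
As a rough illustration of the second "beater" job from step 13, here is a minimal shell sketch that prints single-row INSERT statements built from random dictionary words. It is not the Python script used in the experiment: the function name gen_inserts, the table beater_words, and its column word are all hypothetical, chosen only for this example.

```shell
#!/bin/sh
# Sketch only: "beater_words" and its text column "word" are hypothetical,
# and this is shell + awk rather than the Python script from the experiment.

# Print COUNT single-row INSERT statements, each using one random line
# from WORDFILE (default: /usr/share/dict/words).
gen_inserts() {
    awk -v count="$1" -v q="'" '
        BEGIN { srand() }
        {
            # Double any single quote so words like "don'\''t" stay valid SQL.
            gsub(q, q q)
            words[NR] = $0
        }
        END {
            if (NR == 0) exit 1
            for (i = 0; i < count; i++) {
                w = words[int(rand() * NR) + 1]
                printf "INSERT INTO beater_words (word) VALUES (%s%s%s);\n", q, w, q
            }
        }
    ' "${2:-/usr/share/dict/words}"
}
```

Piping the output into psql, for example `gen_inserts 100000 | psql -X -q some_db` (database name hypothetical), runs each statement in psql's default autocommit mode, so every row is committed on its own.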

Actually before doing all the above I first did a simpler trial run with no active database writes happening, and didn't make any attempt for the 4 EBS snapshots to happen simultaneously. They were actually spread out over almost a minute, and it worked fine. With the confidence that the whole thing wasn't a fool's errand, I then put together the scripts to do lots of writes during the snapshot and made the snapshots run in parallel so they'd be close to atomic.
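
The parallel run wrapped in Postgres backup mode (steps 16 through 19) could be sketched roughly as follows. This is not the script used above, only an illustration of the idea: the function names and the SNAPSHOT_CMD and PSQL_CMD override variables are hypothetical, the volume IDs are whatever you pass in, and psql is assumed to connect as a superuser without further options.

```shell
#!/bin/sh
# Sketch only: SNAPSHOT_CMD and PSQL_CMD are hypothetical override hooks
# (handy for a dry run with "echo"); volume IDs are supplied by the caller.
SNAPSHOT_CMD="${SNAPSHOT_CMD:-ec2-create-snapshot}"
PSQL_CMD="${PSQL_CMD:-psql}"

# Start one snapshot request per volume in the background, then wait for
# all of them. Returns non-zero if any request failed.
snapshot_volumes_parallel() {
    pids=""
    for vol in "$@"; do
        "$SNAPSHOT_CMD" "$vol" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Enter backup mode, request the snapshots, and always leave backup mode
# again, even if a snapshot request failed.
backup_with_snapshots() {
    "$PSQL_CMD" -X -q -c "SELECT pg_start_backup('raid_backup');" || return 1
    snapshot_volumes_parallel "$@"
    status=$?
    "$PSQL_CMD" -X -q -c "SELECT pg_stop_backup();" || status=1
    return "$status"
}
```

It would be called with the four volume IDs as arguments, e.g. `backup_with_snapshots vol-aaaa vol-bbbb vol-cccc vol-dddd` (placeholder IDs). Backgrounding the requests only narrows the window between snapshot points; it does not make the set of snapshots atomic.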

There are lots of caveats to note here:

  • This is an experiment in progress, not a how-to for the general public.
  • The data set that was snapshotted was fairly small.
  • Two successful runs, even with no failures, is not a very big sample set. :)
  • I didn't use Postgres's point-in-time recovery (PITR) here at all -- I just started up the database and let Postgres recover from an apparent crash. Shipping over the few WAL files that the master archived between pg_start_backup and pg_stop_backup, once the snapshot copying is complete, would allow a theoretically fully reliable recovery to be made, not just a practically non-failing recovery as I did above.
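
For the last caveat, the missing PITR step on the backup instance would amount to placing a recovery.conf in the data directory before starting Postgres, so that it replays the shipped WAL files instead of doing plain crash recovery. The sketch below assumes the PostgreSQL 8.3 setup described here (recovery.conf was removed in PostgreSQL 12); the function name and the archive directory are hypothetical.

```shell
#!/bin/sh
# Sketch only: the archive directory is a hypothetical location holding the
# WAL files shipped over from the master.

# Write a recovery.conf into DATADIR that replays WAL files from ARCHIVEDIR.
write_recovery_conf() {
    datadir="$1"
    archivedir="$2"
    # %f and %p are placeholders that Postgres fills in itself, so they are
    # passed to printf as data rather than inside its format string.
    printf '%s\n' "restore_command = 'cp $archivedir/%f \"%p\"'" \
        > "$datadir/recovery.conf"
}
```

Assuming the data directory is /pgdata itself and the WAL files were copied to /wal_archive (both assumptions), this would be run as `write_recovery_conf /pgdata /wal_archive` after mounting the volume and before starting Postgres.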

So there's more work to be done to prove this technique viable in production for a mission-critical database, but it's a promising start worth further investigation. It shows that there is a way to back up a database across multiple EBS volumes without adding noticeably to its I/O load by utilizing the Amazon EBS data store's snapshotting and letting a separate EC2 server offload the I/O of backups or anything else we want to do with the data.


Ethan Rowe said...

Thanks for writing it up, Jon. I'm psyched you dug into this so much (as you well know). :)

One of the things we had wondered about was how RAID would respond to inconsistencies in the underlying volumes owing to the lack of atomicity inherent in snapshots of independent EBS devices.

The choice to give it a try is informed, at least in my view, by the principle that a RAID controller that cannot deal with inconsistencies in the array members is a RAID controller that can't work in production anyway.

Thanks again.
- Ethan

Jon Jensen said...

Thanks, Ethan.

Not only should the RAID controller be able to deal with it, I'm not sure there's any "it" to deal with. Though of course this is software RAID, so the "controller" is just another block device layer, not an actual hardware controller.

Why would the RAID metadata ever change unless the administrator specifically does something to change it? The on-disk state of anything having to do with RAID shouldn't be volatile at all.

The data within the RAID "container" is volatile, of course, but that's operating system block device-level stuff that is a matter for the filesystem.

So the race condition, such as it is, revolves primarily around filesystem metadata (very little, if atime updates are off and no files are being created or unlinked).

ajaya said...

Have you looked at ec2-consistent-snapshot from eric hammond that can be adapted to postgres?

I saw a link around

Anonymous said...

I have been meaning to push up some code to that launchpad project. My guess is that it would be safer to 1) keep WAL on a separate fs, 2) checkpoint or start_backup, 3) xfs_freeze the data fs, 4) snapshot all EBS devices, and 5) unfreeze the data fs with xfs_freeze -u. Only lightly tested, though.
- adler

jason said...

Why not use LVM on top of the RAID and use an LVM snapshot? That would be consistent.

Jon Jensen said...

Jason, we thought about using LVM and yes, it would be consistent, but it would have to be done on the host in question, and won't help offload any I/O from the already I/O-saturated host.

Unless there's some way to share the same LVM block devices from multiple hosts that I don't know about?

Jon Jensen said...

Ajaya, that project looks interesting so I'll check it out. Thanks for the link.

Adler: Doesn't xfs_freeze block all filesystem writes? That still may be better than shutting down Postgres altogether during the snapshot, but it's going to add at least a little downtime. I'd like to try it.

Anonymous said...

Yes, that will block all writes, but it should only take a few seconds to initiate the snapshots and since the WAL is not blocked, the backends will continue to accept most queries in the meantime. This way you are guaranteed that your snapshots are consistent.

ec2-consistent-snapshot will time out if any one snapshot request does not initiate within 10 seconds (by default). So that gives you a ceiling on what pauses may happen if EBS is not acting as expected.


Robert Treat said...

We've been doing these kinds of experiments at OmniTI, using ZFS snapshots, for years, so it's nice to see others getting into the game.

If you're going to run these in production, you really should build it on top of the pitr facility, and your thinking is on track wrt using the pg_start/stop backup facilities, and to grab the xlogs last during that time.

We normally build these on top of running PITR instances anyway, but a simpler solution for standalone systems might be to just use /bin/false and grab the xlog dir. I'd probably need to do some experiments on that before recommending it.

Anyway, nice write up!

Jon Jensen said...

Thanks for the note, Robert.

I guess I didn't clearly state that we've been doing this in production for years using LVM2 and NetApp snapshots, depending on the client's hardware.

What I was doing here was trying the same thing on EBS with 4 atomic snapshots that together make for a non-atomic RAID 0 snapshot, which isn't theoretically pure but did work anyway.

Greg Smith said...

Matching Robert's suggestion, just because you've been doing this successfully for a while doesn't make me cringe less. If you're using pg_start_backup, you really should be saving the archive segments it generates while doing the snapshot shuffle and getting a completely clean copy that goes through recovery properly. I'm sure the database comes up fine anyway most of the time. Murphy's law says the one time it doesn't will inevitably be the time you actually need that backup functional the most.

Just providing a minimal archive_command and saving its output avoids all this concern about whether your snapshot is perfectly atomic. That sidesteps concerns that might make you think you want LVM (which is never a good thing to introduce into an already working system due to its overhead), or want to freeze XFS (always scary and disruptive).

Jon Jensen said...

Greg: Yes, as I noted in my conclusion, I would definitely use the stored WAL files during the backup if this were in production.

Our production snapshots (for other clients) with LVM have worked out very well and are fully consistent thanks to being on a single snapshottable block device, so we're very happy with those but just can't do that with the 4-device EBS setup here.

Josh Berkus said...


The problem we've had with EBS wasn't average throughput, it was minimum throughput; that is, even with RAIDed EBS, sometimes I/O would drop to nothing due to competing users of the cloud. Only for brief periods, but they were still sufficient to make database requests time out. Have you not had this experience?

Ethan Rowe said...

Josh (Berkus):

If memory serves, we have experienced the throughput issues you've described, but when using the stock local storage such as it is on an EC2 instance. I do not think it's been a real problem for us since going to the 4-EBS-volume RAID0 configuration. However, that doesn't mean it can't or won't happen.

To that end, we had the RAID0 volume simply stop responding at one point, necessitating a failover to the warm standby. The source of the failure was a mystery, and could have been XFS, the RAID software, or EBS itself.

- Ethan

Log Buffer said...

[...]Jon Jensen of End Point’s Blog posts a HOWTO on PostgreSQL EC2/EBS/RAID 0 snapshot backup.[...]

Log Buffer #180

Anonymous said...

For what it's worth, my experience is that the xfs fs will often not be recoverable if it is not frozen (or "quiesced" in file system parlance) during the window when multiple underlying devices are snapshotted. You were able to get this down to ~1 second by initiating the snapshots in parallel, but there is no such guarantee that it is good enough (that I know of). You seem to understand this, but it's not spelled out in your post.

I imagine that Linux multi-disk has some tolerance for recovery from non-atomic situations, but it may just involve some luck.

It would be nice if amazon would provide enhancements to the Linux file-systems and/or the RAID drivers to help deal with this issue. Or even better, they could provide a layer on top of EBS that manages all this for you.


Cloud computing said...

I guess, as with all mainstream emerging technologies there are still bugs to iron out. Yet, as you've demonstrated there's always a clever work around. Thanks for posting, it's a nice, insightful bit of reading.


syrnick said...

Awesome research!

Do you have any follow-ups to this? Was 9.x better with these snapshots? Has it been working fine ever since? Were there insurmountable/unforeseen challenges?

Jon Jensen said...

syrnick: Thanks for the note. Nope, no follow-ups right now. I usually try to steer people running mission-critical Postgres setups away from AWS regardless of the type of storage, since it almost always has worse I/O than a nonvirtualized system with standard direct-attached drives, or a SAN. People are still using AWS for Postgres, but I think it's more work and less reliable than makes sense as a default.

syrnick said...

We're on AWS already, but I'd love to set up the backups exactly as you described. In fact, we already have a Chef-based snapshotter ready for this (with EBS snapshots and S3 WAL archiving), but I haven't fully tested the recovery. So, your post was quite inspiring.

Jon Jensen said...

Ah, cool. Well, I'd say test the recovery of a snapshot, say, once a week for a month to build up your confidence in both the snapshotting and the recovery strategy, and you should be good!