PostgreSQL EC2/EBS/RAID 0 snapshot backup

One of our clients uses Amazon Web Services to host their production application and database servers on EC2 with EBS (Elastic Block Store) storage volumes. Their main database is PostgreSQL.

A big benefit of Amazon's cloud services is that you can easily add and remove virtual server instances, storage space, etc. and pay as you go. One known problem with Amazon's EBS storage is that it is much more I/O limited than, say, a nice SAN.

To partially mitigate the I/O limitations, they're using 4 EBS volumes to back a Linux software RAID 0 block device. On top of that is the xfs filesystem. This gives roughly 4x the I/O throughput and has been effective so far.

They ship WAL files to a secondary server that serves as warm standby in case the primary server fails. That's working fine.

They also do nightly backups using pg_dumpall on the master so that there's a separate portable (SQL) backup not dependent on the server architecture. The problem that led to this article is that extra I/O caused by pg_dumpall pushes the system beyond its I/O limits. It adds both reads (from the PostgreSQL database) and writes (to the SQL output file).

There are several solutions we are considering so that we can keep both binary backups of the database and SQL backups, since both types are valuable. In this article I'm not discussing all the options or trying to decide which is best in this case. Instead, I want to consider just one of the tried and true methods of backing up the binary database files on another host to offload the I/O:

  1. Create an atomic snapshot of the block devices
  2. Spin up another virtual server
  3. Mount the backup volume
  4. Start Postgres and allow it to recover from the apparent "crash" the server had (since there wasn't a clean shutdown of the database before the snapshot)
  5. Do whatever pg_dump or other backups are desired
  6. Make throwaway copies of the snapshot for QA or other testing

The benefit of such snapshots is that you get an exact backup of the database, with whatever table bloat, indexes, statistics, etc. exactly as they are in production. That's a big difference from a freshly created database and import from pg_dump.

The difference here is that we're using 4 EBS volumes with RAID 0 striped across them, and there isn't currently a way to do an atomic snapshot of all 4 volumes at the same time. So it's no longer "atomic" and who knows what state the filesystem metadata and the file data itself would be in?

Well, why not try it anyway? Filesystem metadata doesn't change that often, especially in the controlled environment of a Postgres data volume. Snapshotting within a relatively short timeframe would be pretty close to atomic, and probably look to the software (operating system and database) like some kind of strange crash since some EBS volumes would have slightly newer writes than others. But aren't all crashes a little unpredictable? Why shouldn't the software be able to deal with that? Especially if we have Postgres make a checkpoint right before we snapshot.

I wanted to know if it was crazy or not, so I tried it on a new set of services in a separate AWS account. Here are the notes and some details of what I did:

  1. Created one EC2 image:
    Amazon EC2 Debian 5.0 lenny AMI built by Eric Hammond
    Debian AMI ID ami-4ffe1926 (x86_64)
    Instance Type: High-CPU Extra Large (c1.xlarge) - 7 GB RAM, 8 CPU cores
  2. Created 4 x 10 GB EBS volumes
  3. Attached volumes to the image
  4. Created software RAID 0 device:
    mdadm -C /dev/md0 -n 4 -l 0 -z max /dev/sdf /dev/sdg /dev/sdh /dev/sdi
  5. Created XFS filesystem on top of RAID 0 device:
    mkfs -t xfs -L /pgdata /dev/md0
  6. Set up in /etc/fstab and mounted:
    mkdir /pgdata
    # edit /etc/fstab, with noatime
    mount /pgdata
  7. Installed PostgreSQL 8.3
  8. Configured postgresql.conf to be similar to primary production database server
  9. Created empty new database cluster with data directory in /pgdata
  10. Started Postgres and imported a play database (from public domain census name data and Project Gutenberg texts), resulting in about 820 MB in data directory
  11. Ran some bulk inserts to grow database to around 5 GB
  12. Rebooted EC2 instance to confirm everything came back up correctly on its own
  13. Set up two concurrent data-insertion processes:
    • 50 million row insert based on another local table (INSERT INTO ... SELECT ...), in a single transaction (hits disk hard, but nothing should be visible in the snapshot because the transaction won't have committed before the snapshot is taken)
    • Repeated single inserts in autocommit mode (Python script writing INSERT statements using random data from /usr/share/dict/words piped into psql), to verify that new inserts made it into the snapshot, and no partial row garbage leaked through
  14. Started those "beater" jobs, which mostly consumed 2-3 CPU cores
  15. Manually inserted a known test row and created a known view that should appear in the snapshot
  16. Started Postgres's backup mode that allows for copying binary data files in a non-atomic manner, which also does a CHECKPOINT and thus also a filesystem sync:
    SELECT pg_start_backup('raid_backup');
  17. Manually inserted a 2nd known test row & 2nd known test view that I don't want to appear in the snapshot after recovery
  18. Ran snapshot script which calls ec2-create-snapshot on each of the 4 EBS volumes -- during the first run the snapshots were taken serially and quite slowly, taking about 1 minute total; during the second run they were taken in parallel, so the snapshot points for all 4 volumes fell within 1 second of each other (a sketch of this scripting follows the list)
  19. Told Postgres the backup was over:
    SELECT pg_stop_backup();
  20. Ran script to create new EBS volumes derived from the 4 snapshots (which aren't directly usable and always go into S3), using ec2-create-volume --snapshot
  21. Ran script to attach the new EBS volumes to devices on the new EC2 instance using ec2-attach-volume
  22. Then, on the new EC2 instance for doing backups:
    • mdadm --assemble --scan
    • mount /pgdata
    • Start Postgres
    • Count rows on the 2 volatile tables; confirm that the table with the in-process transaction doesn't show any new rows, and that the table getting individual rows committed to reads correctly
    • VACUUM VERBOSE -- and confirm no errors or inconsistencies detected
    • pg_dumpall # confirmed no errors and data looks sound
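
For reference, here is a rough sketch of the kind of scripting behind steps 18, 20, and 21, using the standard EC2 API tools; the volume, snapshot, instance, and zone identifiers are hypothetical placeholders, not the actual values used:

# take snapshots of all 4 EBS volumes as close to simultaneously as possible
for vol in vol-11111111 vol-22222222 vol-33333333 vol-44444444; do
    ec2-create-snapshot "$vol" &
done
wait

# later: create new volumes from the snapshots and attach them to the backup instance
ec2-create-volume --snapshot snap-aaaaaaaa --availability-zone us-east-1a
ec2-attach-volume vol-55555555 -i i-99999999 -d /dev/sdf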

It worked! No errors or problems, and pretty straightforward to do.

Actually before doing all the above I first did a simpler trial run with no active database writes happening, and didn't make any attempt for the 4 EBS snapshots to happen simultaneously. They were actually spread out over almost a minute, and it worked fine. With the confidence that the whole thing wasn't a fool's errand, I then put together the scripts to do lots of writes during the snapshot and made the snapshots run in parallel so they'd be close to atomic.

There are lots of caveats to note here:

  • This is an experiment in progress, not a how-to for the general public.
  • The data set that was snapshotted was fairly small.
  • Two successful runs, even with no failures, is not a very big sample set. :)
  • I didn't use Postgres's point-in-time recovery (PITR) here at all -- I just started up the database and let Postgres recover from an apparent crash. Shipping over the few WAL logs collected from the master during the backup window (between pg_start_backup and pg_stop_backup), after the snapshot copying is complete, would allow a theoretically fully reliable recovery to be made, not just the practically non-failing recovery I did above.
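
As a rough sketch of that last point: assuming the WAL segments archived from the master during the backup window were copied to, say, /wal_archive on the backup instance, a minimal recovery.conf in the data directory would have Postgres replay them before coming up:

# minimal sketch; /wal_archive is a hypothetical location for the shipped WAL segments
cat > /pgdata/recovery.conf <<'EOF'
restore_command = 'cp /wal_archive/%f %p'
EOF
# then start Postgres; it replays the archived WAL through the end of the backup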

So there's more work to be done to prove this technique viable in production for a mission-critical database, but it's a promising start worth further investigation. It shows that there is a way to back up a database across multiple EBS volumes without adding noticeably to its I/O load by utilizing the Amazon EBS data store's snapshotting and letting a separate EC2 server offload the I/O of backups or anything else we want to do with the data.

More Code and SEO with the Google Analytics API

My latest blog article inspiration came from an SEOmoz pro webinar on Actionable Analytics. This time around, the article I wrote was published on SEOmoz's YOUmoz Blog, and I thought I'd summarize and extend it here with some technical details more appealing to our audience. The article is titled Visualizing Keyword Data with the Google Analytics API.

In the article, I discuss and show examples of how the number of unique keywords receiving search traffic has diversified or expanded over time and that our SEO efforts (including writing blog articles) are likely resulting in this diversification of keywords. Some snapshots from the articles:

The unique keyword (keywords receiving at least one search visit) count per month (top) compared to the number of articles available on our blog at that time (bottom).

I also briefly examined how unique keywords receiving at least one visit overlapped between each month and saw about 10-20% of overlapping keywords (likely the short-tail of SEO).


The keyword overlap per month, where the keywords receiving at least one visit in consecutive months are shown in the overlap section.

Now, on to things that End Point's audience may find more interesting. Something that might appeal more to our developer-types is the code written to use the Google Analytics API to generate the data used for this article. I researched a bit and tried writing my own ruby code (gem-less) to pull from the Google API, followed by using the Gattica gem, and finally the garb gem. After wrestling with the former two options, I settled on the garb gem, which had decent documentation here to get me up and running with a Google Analytics report quickly. Here's an example of the code required to create your first Google Analytics API report:

#!/usr/bin/ruby

require 'rubygems'
require 'garb'
require 'date'   # for Date.today below

# fill in your own Google Analytics credentials and profile id (placeholder values)
email      = 'you@example.com'
password   = 'secret'
profile_id = '1234567'

Garb::Session.login(email, password)
profile = Garb::Profile.first(profile_id)

report = Garb::Report.new(profile,
        :limit => 100,
        :start_date => Date.today - 30,
        :end_date => Date.today)
report.dimensions :keyword
report.metrics :visits
report.results.each do |result|
  puts "#{result.keyword}:#{result.visits}"
end

If you aren't familiar with the Google Analytics API, possible dimensions and metrics are documented here. There are some Google Analytics API limitations on metric and dimension combinations, but I think if you get creative you'd be able to overcome most of those limitations (assuming you won't be exceeding the limit of 1,000 API requests per day).

Why should you care about the Google Analytics API? Well, the API allowed me to programmatically aggregate the keyword counts in monthly increments for the SEOmoz article. One thing I consider to be pretty lame is the inability to select more than 3 custom segments and exclude the "All Visits" segment to allow a better visual comparison of the segments. In the data below, I have 3 defined custom segments. I would prefer to compare about 10 custom segments of End Point's blog keyword groupings (e.g., "Rails Keywords", "Postgres Keywords"), but Google Analytics limits the selected segments and includes "All Visits" when you select more than one custom segment.

Another thing I consider to be lame is the inability to merge Google Analytics profiles. Recently, End Point combined its corporate blog GA profile with its main website GA profile to better track conversion between the sites:

Dead metrics from migrated profile.

With the Google Analytics API, we could compute different aggregates of data, compare more than a few custom data segments, and combine two Google profiles if they have merged. Of course, these things wouldn't necessarily be easy, but working with the gem proved to be simple, so in theory this all could be done; in the meantime, we'll keep our dead profile around.

Again, please read the original article here if you are interested :)

PostgreSQL tip: dump objects into a new schema

Sometimes the need arises to export a PostgreSQL database and put its contents into its own schema; say you've been busy developing things in the public schema. Sometimes people suggest manipulating the pg_dump output, either manually or using a tool such as sed or perl, to explicitly schema-qualify all table objects, etc., but this is error-prone depending on your table names, and can be more trouble than it's worth.

One trick that may work for you if your current database is not in use by anyone else is to rename the default public schema to your desired schema name before dumping, and then optionally changing it back to public afterward. This has the benefit that all objects will be properly dumped in the new schema (sequences, etc) and not just tables, plus you don't have to worry about trying to parse SQL with regexes to modify this explicitly.

$ psql -c "ALTER SCHEMA public RENAME new_name"
$ pg_dump --schema=new_name > new_name_dump.sql
$ psql -c "ALTER SCHEMA new_name RENAME public"
$ # load new_name_dump.sql elsewhere
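
To finish the round trip, the dump can then be loaded into another database, where everything lands in the new schema (the target database name below is just a placeholder):

$ createdb target_db
$ psql -d target_db -f new_name_dump.sql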

Cheers!

GNU diff: changing the output filenames

I was working on a script to monitor remote versions of a file and compare them against our local mirror; part of this work involved fetching the remote file to a temporary location and doing a diff -u against the local file to see if there were any changes. This worked fine, but the temporary filename was less than descriptive.

The man page for diff was somewhat cryptic when it came to changing the displayed output filenames themselves; however, with some figuring out, you can pass the -L (--label) flag to diff. You need to pass it twice: the first -L replaces the filename in the --- output line, and the second -L replaces the filename in the +++ line.

$ wget -qO /tmp/grocerylist http://mirrorsite.com/grocerylist
$ diff -u /path/to/local/grocerylist -L /path/to/local/grocerylist /tmp/grocerylist -L http://mirrorsite.com/grocerylist

--- /path/to/local/grocerylist
+++ http://mirrorsite.com/grocerylist
@@ -11,7 +11,7 @@
 potatoes
 bread
 eggs
-milk
+soda
 oranges
 apples
 celery

An obvious shortcoming of this approach is that you need to specify a redundant -L label for the first file; in my case, this was all handled programmatically, so I just substituted the same parameter in both places. Also, you lose the default output label, which shows the current modification date/time for each file. In my case, I didn't care about when, just whether there were differences and what they were.
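
For illustration, a small wrapper along these lines does the fetch-and-compare in one step (a sketch, not the actual monitoring script; the function name and temp-file handling are mine):

# fetch a remote file and diff it against the local mirror, labeling both sides readably
compare_mirror() {
    local_file=$1
    remote_url=$2
    tmp=$(mktemp)
    wget -qO "$tmp" "$remote_url" || { rm -f "$tmp"; return 1; }
    diff -u -L "$local_file" -L "$remote_url" "$local_file" "$tmp"
    status=$?
    rm -f "$tmp"
    return $status
}

compare_mirror /path/to/local/grocerylist http://mirrorsite.com/grocerylist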

MRM: The power of 'random', program maintenance, weave and electricity

Time for another installment of Monday Reading Material!

I'm in New Zealand (and across the dateline!) so this is appearing a day early for many of you :)

Reading material from last week:

Safari 4 Top Sites feature skews analytics

Safari version 4 has a new "Top Sites" feature that shows thumbnail images of the sites the user most frequently visits (or, until enough history is collected, just generally popular sites).

Martin Sutherland describes this feature in detail and shows how to detect these requests, which set the X-Purpose HTTP header to "preview".

The reason this matters is that Safari uses its normal browsing engine to fetch not just the HTML, but all embedded JavaScript and images, and runs in-page client JavaScript code. And these preview thumbnails are refreshed fairly frequently -- possibly several times per day per user.

Thus every preview request looks just like a regular user visit, and this skews analytics which see a much higher than average number of views from Safari 4 users, with lower time-on-site averages and higher bounce rates since no subsequent visits are registered (at least as part of the preview function).

The solution is to simply not output any analytics code when the X-Purpose header is set to "preview". In Interchange this is easily done if you have an include file for your analytics code, by wrapping the file with an [if] block such as this:

[tmp x_purpose][env HTTP_X_PURPOSE][/tmp]
[if scratch x_purpose eq 'preview']
<!-- skip analytics for browser previews -->
[else]
(normal Google Analytics, Omniture SiteCatalyst, or other analytics code)
[/else]
[/if]

In Ruby on Rails you'd check request.env["HTTP_X_PURPOSE"].

In PHP you'd check $_SERVER["HTTP_X_PURPOSE"].

In Django you'd check request.META["HTTP_X_PURPOSE"] or the equivalent request.META.get("HTTP_X_PURPOSE") (from the HttpRequest class).

And so on.

I confirmed the analytics tracking code was omitted by waiting for Safari to make its preview request and inspecting the response with the Fiddler proxy on Windows. The same can be done for Safari on Mac OS X with a suitable HTTP proxy.
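
If you'd rather not set up a proxy, another way to verify the behavior is to send the X-Purpose header yourself from the command line and check that the analytics snippet disappears; this curl sketch assumes a placeholder URL and that the tracking code contains the string "google-analytics":

curl -s -H "X-Purpose: preview" http://www.example.com/ | grep -c google-analytics
# a count of 0 for the preview request (vs. 1 or more for a normal request) means the snippet was skipped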

MySQL Ruby Gem CentOS RHEL 5 Installation Error Troubleshooting

Building and installing the Ruby mysql gem on freshly-installed Red Hat based systems sometimes produces the frustratingly ambiguous error below:

# gem install mysql
/usr/bin/ruby extconf.rb
checking for mysql_ssl_set()... no
checking for rb_str_set_len()... no
checking for rb_thread_start_timer()... no
checking for mysql.h... no
checking for mysql/mysql.h... no
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of
necessary libraries and/or headers.  Check the mkmf.log file for more
details.  You may need configuration options.

Searching the web for info on this error yields two basic solutions:

  1. Install the mysql-devel package (this provides the mysql.h file in /usr/include/mysql/).
  2. Run gem install mysql -- --with-mysql-config=/usr/bin/mysql_config or some other additional options.

These are correct but not sufficient. Because this gem compiles a library to interface with MySQL's C API, the gcc and make packages are also required to create the build environment:

# yum install mysql-devel gcc make
# gem install mysql -- --with-mysql-config=/usr/bin/mysql_config
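
If the gem build still fails after that, it's worth confirming the prerequisite packages actually got installed before digging into mkmf.log:

# rpm -q gcc make mysql-devel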

Alternatively, if you're using your distro's ruby (not a custom build like Ruby Enterprise Edition), you can install EPEL's ruby-mysql package along with their rubygem-rails and other packages.

On Linux, noatime includes nodiratime

Note to performance-tweaking Linux sysadmins, pointed out to me by Selena Deckelmann: On Linux, the filesystem attribute noatime includes nodiratime, so there's no need to say both "noatime,nodiratime" as I once did. (See this article on atime for details if you're not familiar with it.)
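
For example, a single noatime in the mount options is all that's needed; the device and mount point below are placeholders:

# /etc/fstab entry -- "noatime" alone is enough; adding "nodiratime" is redundant
/dev/md0  /pgdata  xfs  noatime  0  0

# to apply the option to an already-mounted filesystem:
mount -o remount,noatime /pgdata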

Apparently the nodiratime attribute was added later as a subset of noatime applying only to directories to still offer a bit of performance boost in situations where noatime on files would cause trouble (as with mutt and a few other applications that care about atimes).

See also the related newer relatime attribute in the mount(8) manpage.

Monday Reading Material: 2/8/2010

Just a few links from the past week that are worth checking out:
  • "Spices: the internet of the ancient world!" - Planet Money Podcast. Great storytelling about the ancient spice trade and how information about where certain spices came from eventually leaked out and popped the spice trading bubble/monopoly.
  • Enterprise software is entirely bereft of soul. Reading this reminded me of antifeatures and the competitive advantages of open source software.
  • Emulating Empathy. Nice summary of how interacting with users of software (customers) on non-technical issues, or over high-level requirements, provokes creativity. Also, that good customer communication is a *skill*, not an innate talent - meaning it can be taught and learned. :)
  • Interaxon. Other than the cute name, this is a fascinating company and concept based in Vancouver, BC. Thought controlled computing! Looking forward to seeing what comes out of their "Bright Ideas" exhibit during the Winter Olympics.

PostgreSQL version 9.0 release date prediction

So when will PostgreSQL version 9.0 come out? I decided to "run the numbers" and take a look at how the Postgres project has done historically. Here's a quick graph showing the approximate number of days each major release since version 6.0 took:

Some interesting things can be seen here: there is a rough correlation between the complexity of a new release and the time it takes, major releases take longer, and the trend is gradually towards more days per release. Overall the project is doing great, releasing on average every 288 days since version 6. If we only look at version 7 and onwards, the releases are on average 367 days apart. If we look at *just* version 7, the average is 324 days. If we look at *just* version 8, the average is 410. Since the last major version that came out was on July 1, 2009, the numbers predict 9.0 will be released on July 3, 2010, based on the version 7 and 8 averages, and on August 15, 2010, based on just the version 8 averages. However, this upcoming version has two very major features, streaming replication (SR) and hot standby (HS). How those will affect the release schedule remains to be seen, but I suspect the 9.0 to 9.1 window will be short indeed.

As a recap, the Postgres project only bumps the first part of the version number for major changes (although many, myself included, would argue that 7.4 was such a major jump it should have been called 8.0). The second number is bumped anytime a "new release" happens, and means new features and enhancements. The final number, the revision, is only incremented for security and bug fixes, and is almost always a 100% binary-compatible drop-in replacement for the previous revision in the branch. (What's the average (mean) number of days between revisions? 84 days since version 6, and 88 days since version 7. The medians are 84 and 87 respectively.)

How busy were those periods? Here's the number of commits per release period. Note that I said release period, not release, as commits are still being made to old branches, although this is a very small minority of the commits, so I did not bother to break it down at that level.

There is a strong correlation with the previous chart. Of note is version 8.1, which had few commits and was released relatively quickly. Also note that version 8.0 is still winning as far as the sheer number of commits, most likely due to the fact that native Windows support was added in that version.

Some other items of interest from the data:

  • There have been roughly 140,000 commits from version 6.0 to 8.4.2.
  • There have been 32 CVS committers since the start of the project (and of course, many hundreds of others whose work was funnelled through those committers)
  • The mean number of commits per person is 4383, but the distribution is very skewed: Bruce, Peter, and Tom account for 80% of all commits, with a mean among the three of them of about 37,000 commits.
  • Commits changed about 40 lines on average.

Alright, two final charts: commits per time period. I'll let the data speak for itself this time. Stay tuned for future blog posts exploring this data further!

LCA2010: Postgres represent!

I had the pleasure of attending and presenting at LinuxConf.AU this year in Wellington, NZ. LinuxConf.AU is an institution whose friendliness and focus on the practical business of creating and sustaining open source projects were truly inspirational.

My talk this year was "A Survey of Open Source Databases", where I actually created a survey and asked over 35 open source database projects to respond. I have received about 15 responses so far, and also did my own research on the over 50 projects I identified. I created a place-holder site for my research at: ossdbsurvey.org. I'm hoping to revise the survey (make it shorter!!) and get more projects to provide information.

Ultimately, I'd like the site to be a central location for finding information and comparing different projects. Performance of each is a huge issue, and there are a lot of individuals constructing good (and bad) systems for comparing. I don't think I want to dive into that pool, yet. But I would like to start collecting the work others have done in a central place. Right now it is really far too difficult to find all of this information.

Part of the talk was also a foray into the dangerous world of classification. I tried to put together basic categories, based on conversations with individual developers and some fine-tuning with Josh Berkus. Josh gave a short overview of database models during "Relational vs Non-relational" in the Data Storage mini-conf, and we collaborated some on category definition. I also saw Devdas Bhagat give a use case talk on using Postgres, yet again confirming how wonderful transactional DDL is for developers. I also gave a lightning talk (WITHOUT SLIDES!) on Bucardo at the tail end of the Data Storage mini-conf.

Josh Berkus, during "PostgreSQL Development Today", announced to the world that the new version of Postgres would be version 9.0! And he did a live demonstration of streaming replication and hot standby. The audience seemed pleased.

I was delighted to see representatives from the Postgres community on the main stage of the conference three times during LCA!

And finally, I had the pleasure of participating in the Friday keynote lightning talks. I kicked things off by telling the story of the elections in Ondo State, Nigeria, in 5 minutes. I saw that one of the IT people I met while in Akure was now helping Osun state investigate and correct election fraud in January. So glad to see that their good work continues!

Automatic migration from Slony to Bucardo

About a month ago, Bucardo added an interesting set of features in the form of a new script called slony_migrator.pl. In this post I'll describe slony_migrator.pl and its three major functions.

The Setup

For these examples, I'm using the pagila sample database along with a set of scripts I wrote and made available here. These scripts build two different Slony clusters. The first is a simple one, which replicates this database from a database called "pagila1" on one host to a database "pagila2" on another host. The second is more complex. Its one master node replicates the pagila database to two slave nodes, one of which replicates it again to a fourth slave using Slony's FORWARD function as described here. I implemented this setup on two FreeBSD virtual machines, known as myfreebsd and myfreebsd2. The reset-simple.sh and reset-complex.sh scripts in the script package I've linked to will build all the necessary databases from one pagila database and do all the Slony configuration.

Slony Synopsis

The slony_migrator.pl script has three possible actions, the first of which is to connect to a running Slony cluster and print a synopsis of the Slony setup it discovers. You can do this safely against a running, production Slony cluster; it gathers all its necessary information from a few simple Slony queries. Here's the synopsis the script writes for the simple configuration I described above:

josh@eddie:~/devel/bucardo/scripts$ ./slony_migrator.pl -db pagila1 -H myfreebsd
Slony version: 1.2.16
psql version: 8.3
Postgres version: 8.3.7
Slony schema: _pagila
Local node: 1
SET 1: All pagila tables
* Master node: 1  Active: Yes  PID: 3309  Comment: "Cluster node 1"
  (dbname=pagila1 host=myfreebsd user=postgres)
  ** Slave node:  2  Active: Yes  Forward: Yes  Provider:  1  Comment: "Node 2"
     (dbname=pagila2 host=myfreebsd2 user=postgres)

The script has reported the Slony, PostgreSQL, and psql versions, the Slony schema name, and shows that there's only one set, replicated from the master node to one slave node, including connection information for each node. Here is the output of the same action, run against the complex Slony setup. Notice that node 3 has node 2 as its provider, not node 1:

josh@eddie:~/devel/bucardo/scripts$ ./slony_migrator.pl -db pagila1 -H myfreebsd
Slony version: 1.2.16
psql version: 8.3
Postgres version: 8.3.7
Slony schema: _pagila
Local node: 1
SET 1: All pagila tables
* Master node: 1  Active: Yes  PID: 3764  Comment: "Cluster node 1"
  (dbname=pagila1 host=myfreebsd  user=postgres)
  ** Slave node:  2  Active: Yes  Forward: Yes  Provider:  1  Comment: "Cluster node 2"
     (dbname=pagila2 host=myfreebsd2 user=postgres)
  ** Slave node:  3  Active:  No  Forward: Yes  Provider:  2  Comment: "Cluster node 3"
     (dbname=pagila3 host=myfreebsd2 user=postgres)
  ** Slave node:  4  Active: Yes  Forward: Yes  Provider:  1  Comment: "Cluster node 4"
     (dbname=pagila4 host=myfreebsd  user=postgres)

This is a simple way to get an idea of how a Slony cluster is organized. Again, we can get all this without downtime or any impact on the Slony cluster.

Creating Slonik Scripts Automatically

Slony gets its configuration entirely through scripts passed to an application called Slonik, which writes configuration entries into a Slony schema within a replicated database. At least as far as I know, however, Slony doesn't provide a way to regenerate those scripts based on the contents of that schema. The slony_migrator.pl script will do that for you with the --slonik option. For example, here is the Slonik script it generates for the simple configuration:

josh@eddie:~/devel/bucardo/scripts$ ./slony_migrator.pl -db pagila1 -H myfreebsd --slonik
CLUSTER NAME = pagila;
NODE 1 ADMIN CONNINFO = 'dbname=pagila1 host=myfreebsd user=postgres';
NODE 2 ADMIN CONNINFO = 'dbname=pagila2 host=myfreebsd2 user=postgres';
INIT CLUSTER (ID = 1, COMMENT = 'Cluster node 1');
STORE NODE (ID = 2, EVENT NODE = 1, COMMENT = 'Node 2');
STORE PATH (SERVER = 1, CLIENT = 2, CONNINFO = 'dbname=pagila1 host=myfreebsd user=postgres', CONNRETRY = 10);
STORE PATH (SERVER = 2, CLIENT = 1, CONNINFO = 'dbname=pagila2 host=myfreebsd2 user=postgres', CONNRETRY = 10);
ECHO 'Please start up replication nodes here';
TRY {
    CREATE SET (ID = 1, ORIGIN = 1, COMMENT = 'All pagila tables');
} ON ERROR {
    EXIT -1;
}
SET ADD TABLE (ID = 6, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.customer', KEY = 'customer_pkey', COMMENT = 'public.customer');
SET ADD TABLE (ID = 11, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.language', KEY = 'language_pkey', COMMENT = 'public.language');
--- snip ---
SET ADD SEQUENCE (ID = 13, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.store_store_id_seq', COMMENT = 'public.store_store_id_seq');
SET ADD SEQUENCE (ID = 10, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.payment_payment_id_seq', COMMENT = 'public.payment_payment_id_seq');
SET ADD SEQUENCE (ID = 5, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.country_country_id_seq', COMMENT = 'public.country_country_id_seq');
SUBSCRIBE SET (ID = 1, PROVIDER = 1, RECEIVER = 2, FORWARD = YES);

The pagila database contains many tables and sequences, and I've removed the repetitive commands to tell Slony about all of them, for the sake of brevity, but in its original form, the code above would rebuild the simple Slony cluster exactly, and can be very useful for getting an idea of how an otherwise unknown cluster is configured. I won't promise the Slonik code is ideal, but it does recreate a working cluster. The more complex Slonik output is very similar, differing only in how the sets are subscribed. Here I'll show only the major differences, which are the commands required to create the more complex Slony subscription scheme. In the downloadable script package I mentioned above, this subscription code is somewhat more complex, specifically because Slony won't let you subscribe node 3 to updates from node 2 until node 2 is fully subscribed itself. The slony_migrator.pl script isn't smart enough on its own to add necessary WAIT FOR EVENT Slonik commands, but it does get most of the code right, and, importantly, creates the subscriptions in the proper order.

SET ADD SEQUENCE (ID = 10, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.payment_payment_id_seq', COMMENT = 'public.payment_payment_id_seq');
SET ADD SEQUENCE (ID = 5, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.country_country_id_seq', COMMENT = 'public.country_country_id_seq');
SUBSCRIBE SET (ID = 1, PROVIDER = 1, RECEIVER = 4, FORWARD = YES);
SUBSCRIBE SET (ID = 1, PROVIDER = 1, RECEIVER = 2, FORWARD = YES);
SUBSCRIBE SET (ID = 1, PROVIDER = 2, RECEIVER = 3, FORWARD = YES);

Migrating Slony Clusters to Bucardo

The final slony_migrator.pl option will create a set of bucardo_ctl commands to create a Bucardo cluster to match an existing Slony setup. Although Bucardo can be configured by directly modifying its configuration database, a great deal of work of late has gone into making configuration easier through the bucardo_ctl program. Here's the output from slony_migrator.pl on the simple Slony cluster. Note the --bucardo command-line option, which invokes this function:

josh@eddie:~/devel/bucardo/scripts$ ./slony_migrator.pl -db pagila1 -H myfreebsd --bucardo
./bucardo_ctl add db pagila_1 dbname=pagila1  host=myfreebsd user=postgres
./bucardo_ctl add db pagila_2 dbname=pagila2  host=myfreebsd2 user=postgres
./bucardo_ctl add table public.customer db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.language db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.store db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.category db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.film db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
--- snip ---
./bucardo_ctl add sequence public.city_city_id_seq db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add sequence public.store_store_id_seq db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add sequence public.payment_payment_id_seq db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add sequence public.country_country_id_seq db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add sync pagila_set1_node1_to_node2 source=pagila_node1_set1 targetdb=pagila_2 type=pushdelta

The Bucardo model of a replication system differs from Slony, but the two match fairly closely, especially for a simple scenario like this one. But slony_migrator.pl will work for the more complex Slony example I've been using, shown here:

josh@eddie:~/devel/bucardo/scripts$ ./slony_migrator.pl -db pagila1 -H myfreebsd --bucardo
./bucardo_ctl add db pagila_1 dbname=pagila1  host=myfreebsd user=postgres
./bucardo_ctl add db pagila_4 dbname=pagila4  host=myfreebsd user=postgres
./bucardo_ctl add db pagila_3 dbname=pagila3  host=myfreebsd2 user=postgres
./bucardo_ctl add db pagila_2 dbname=pagila2  host=myfreebsd2 user=postgres
./bucardo_ctl add table public.customer db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.language db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.store db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
--- snip ---
./bucardo_ctl add sequence public.payment_payment_id_seq db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add sequence public.country_country_id_seq db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add sync pagila_set1_node1_to_node4 source=pagila_node1_set1 targetdb=pagila_4 type=pushdelta
./bucardo_ctl add sync pagila_set1_node1_to_node2 source=pagila_node1_set1 targetdb=pagila_2 type=pushdelta target_makedelta=on
./bucardo_ctl add table public.customer db=pagila_2 ping=true standard_conflict=source herd=pagila_node2_set1
./bucardo_ctl add table public.language db=pagila_2 ping=true standard_conflict=source herd=pagila_node2_set1
./bucardo_ctl add table public.store db=pagila_2 ping=true standard_conflict=source herd=pagila_node2_set1
--- snip ---
./bucardo_ctl add sequence public.store_store_id_seq db=pagila_2 ping=true standard_conflict=source herd=pagila_node2_set1
./bucardo_ctl add sequence public.payment_payment_id_seq db=pagila_2 ping=true standard_conflict=source herd=pagila_node2_set1
./bucardo_ctl add sequence public.country_country_id_seq db=pagila_2 ping=true standard_conflict=source herd=pagila_node2_set1
./bucardo_ctl add sync pagila_set1_node2_to_node3 source=pagila_node2_set1 targetdb=pagila_3 type=pushdelta

I mentioned the Bucardo data model differs from that of Slony. Slony contains a set of tables and sequences in a "set", and that Slony set remains a distinct object on all databases where those objects are found. Bucardo, on the other hand, has a concept of a "sync", which is a replication job from one database to one or more slaves (here I'm talking only about master->slave syncs, and ignoring for purposes of this post Bucardo's ability to do multi-master replication). This makes the setup slightly different for the more complex Slony scenario, in that whereas Slony has one set and different subscriptions, in Bucardo I need to define the tables and sequences involved in each of three syncs: one from node 1 to node 2, one from node 1 to node 4, and one from node 2 to node 3. I also need to turn on Bucardo's "makedelta" option for the node 1 -> node 2 sync, which is the Bucardo equivalent of the Slony FORWARD subscription option.

Migrating from Slony to Bucardo

This post is getting long, but for the sake of demonstration let's show a migration from Slony to Bucardo, using the more complex Slony example. First, I'll create a blank database, and install Bucardo in it:

josh@eddie:~/devel/bucardo$ createdb bucardo
josh@eddie:~/devel/bucardo$ ./bucardo_ctl install
This will install the bucardo database into an existing Postgres cluster.
Postgres must have been compiled with Perl support,
and you must connect as a superuser

We will create a new superuser named 'bucardo',
and make it the owner of a new database named 'bucardo'

Current connection settings:
1. Host:          /tmp
2. Port:          5432
3. User:          postgres
4. Database:      postgres
5. PID directory: /var/run/bucardo
Enter a number to change it, P to proceed, or Q to quit: 

I'll make the necessary configuration changes, and run the installation by following the simple menu.

Current connection settings:
1. Host:          /tmp
2. Port:          5432
3. User:          postgres
4. Database:      bucardo
5. PID directory: /home/josh/devel/bucardo/pid
Enter a number to change it, P to proceed, or Q to quit: p

Postgres version is: 8.3
Attempting to create and populate the bucardo database and schema
Database creation is complete

Connecting to database 'bucardo' as user 'bucardo'
Updated configuration setting "piddir"
Installation is now complete.

If you see any unexpected errors above, please report them to bucardo-general@bucardo.org

You should probably check over the configuration variables next, by running:
./bucardo_ctl show all
Change any setting by using: ./bucardo_ctl set foo=bar

Now I'll use slony_migrator.pl to get a set of bucardo_ctl scripts to build my Bucardo cluster:

josh@eddie:~/devel/bucardo/scripts$ ./slony_migrator.pl -db pagila1 -H myfreebsd --bucardo > pagila-slony2bucardo.sh
josh@eddie:~/devel/bucardo/scripts$ head pagila-slony2bucardo.sh 
./bucardo_ctl add db pagila_1 dbname=pagila1  host=myfreebsd user=postgres
./bucardo_ctl add db pagila_4 dbname=pagila4  host=myfreebsd user=postgres
./bucardo_ctl add db pagila_3 dbname=pagila3  host=myfreebsd2 user=postgres
./bucardo_ctl add db pagila_2 dbname=pagila2  host=myfreebsd2 user=postgres
./bucardo_ctl add table public.customer db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.language db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.store db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.category db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.film db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.film_category db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1

I'll run the script...

josh@eddie:~/devel/bucardo$ sh scripts/pagila-slony2bucardo.sh
Added database "pagila_1"   
Added database "pagila_4"   
Added database "pagila_3"   
Added database "pagila_2"   
Created herd "pagila_node1_set1"
Added table "public.customer"
Added table "public.language"
Added table "public.store"
--- snip ---
Added sequence "public.store_store_id_seq"
Added sequence "public.payment_payment_id_seq"
Added sequence "public.country_country_id_seq"
Added sync "pagila_set1_node1_to_node4"
Added sync "pagila_set1_node1_to_node2"
Created herd "pagila_node2_set1"
Added table "public.customer"
Added table "public.language"
Added table "public.store"
--- snip ---
Added sequence "public.store_store_id_seq"
Added sequence "public.payment_payment_id_seq"
Added sequence "public.country_country_id_seq"
Added sync "pagila_set1_node2_to_node3"

Now all that's left is to shut down Slony (I just use the "pkill slon" command on each database server), start Bucardo, and, eventually, remove the Slony schemas. Note that Bucardo runs only on one machine (which in this case isn't either of the database servers I'm using for this demonstration -- Bucardo can run effectively anywhere you want).

josh@eddie:~/devel/bucardo$ ./bucardo_ctl start
Checking for existing processes
Removing /home/josh/devel/bucardo/pid/fullstopbucardo
Starting Bucardo
josh@eddie:~/devel/bucardo$ tail -f log.bucardo 
[Mon Feb  1 21:45:27 2010]  KID Setting sequence public.actor_actor_id_seq to value of 202, is_called is 1
[Mon Feb  1 21:45:27 2010]  KID Setting sequence public.city_city_id_seq to value of 600, is_called is 1
[Mon Feb  1 21:45:27 2010]  KID Setting sequence public.store_store_id_seq to value of 2, is_called is 1
[Mon Feb  1 21:45:27 2010]  KID Setting sequence public.payment_payment_id_seq to value of 32098, is_called is 1
[Mon Feb  1 21:45:27 2010]  KID Setting sequence public.country_country_id_seq to value of 109, is_called is 1
[Mon Feb  1 21:45:27 2010]  KID Total delta count: 0
[Mon Feb  1 21:45:27 2010]  CTL Got notice "bucardo_syncdone_pagila_set1_node1_to_node2_pagila_2" from 22961
[Mon Feb  1 21:45:27 2010]  CTL Sent notice "bucardo_syncdone_pagila_set1_node1_to_node2"
[Mon Feb  1 21:45:27 2010]  CTL Got notice "bucardo_syncdone_pagila_set1_node1_to_node4_pagila_4" from 22962
[Mon Feb  1 21:45:27 2010]  CTL Sent notice "bucardo_syncdone_pagila_set1_node1_to_node4"

Based on those logs, it looks like everything's running fine, but just to make sure, I'll use bucardo_ctl's "list syncs" and "status" commands:

josh@eddie:~/devel/bucardo$ ./bucardo_ctl list syncs
Sync: pagila_set1_node1_to_node2  (pushdelta)  pagila_node1_set1 =>  pagila_2  (Active)
Sync: pagila_set1_node1_to_node4  (pushdelta)  pagila_node1_set1 =>  pagila_4  (Active)
Sync: pagila_set1_node2_to_node3  (pushdelta)  pagila_node2_set1 =>  pagila_3  (Active)

josh@eddie:~/devel/bucardo$ ./bucardo_ctl status
Days back: 3  User: bucardo  Database: bucardo  Host: /tmp  PID of Bucardo MCP: 22936
Name                       Type  State PID   Last_good Time  I/U/D Last_bad Time
==========================+=====+=====+=====+=========+=====+=====+========+====
pagila_set1_node1_to_node2| P   |idle |22952|52s      |0s   |0/0/0|unknown |    
pagila_set1_node1_to_node4| P   |idle |22953|52s      |0s   |0/0/0|unknown |    
pagila_set1_node2_to_node3| P   |idle |22954|52s      |0s   |0/0/0|unknown |    

Everything looks good. Before I test that data are really replicated correctly, I'll issue a "DROP SCHEMA _pagila CASCADE" command in each database, which I can do while Bucardo is running. If this were a production system, the best strategy (to avoid things getting replicated twice) would be to stop all applications, stop Slony, start Bucardo, and then restart the applications, though because Slony and Bucardo both replicate rows using primary keys, doing otherwise wouldn't cause duplicated data.
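
That cleanup is just a psql one-liner against each node, using the same connection details shown in the synopsis output earlier:

psql -h myfreebsd  -U postgres -d pagila1 -c 'DROP SCHEMA _pagila CASCADE'
psql -h myfreebsd2 -U postgres -d pagila2 -c 'DROP SCHEMA _pagila CASCADE'
psql -h myfreebsd2 -U postgres -d pagila3 -c 'DROP SCHEMA _pagila CASCADE'
psql -h myfreebsd  -U postgres -d pagila4 -c 'DROP SCHEMA _pagila CASCADE'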

Finally, I'll tail the Bucardo logs while inserting rows in the pagila1 database, to see what happens. These rows tell me it's working:

[Mon Feb  1 21:55:42 2010]  KID Setting sequence public.payment_payment_id_seq to value of 32098, is_called is 1
[Mon Feb  1 21:55:42 2010]  KID Setting sequence public.inventory_inventory_id_seq to value of 4581, is_called is 1
[Mon Feb  1 21:55:42 2010]  KID Setting sequence public.country_country_id_seq to value of 109, is_called is 1
[Mon Feb  1 21:55:42 2010]  KID Total delta count: 1
[Mon Feb  1 21:55:42 2010]  KID Deleting rows from public.actor
[Mon Feb  1 21:55:42 2010]  KID Begin COPY to public.actor
[Mon Feb  1 21:55:42 2010]  KID End COPY to public.actor
[Mon Feb  1 21:55:42 2010]  KID Pushdelta counts: deletes=0 inserts=1
[Mon Feb  1 21:55:42 2010]  KID Updating bucardo_track for public.actor on pagila_1
...
[Mon Feb  1 21:55:43 2010]  CTL Got notice "bucardo_syncdone_pagila_set1_node1_to_node4_pagila_4" from 22962
[Mon Feb  1 21:55:43 2010]  CTL Sent notice "bucardo_syncdone_pagila_set1_node1_to_node4"
[Mon Feb  1 21:55:43 2010]  CTL Got notice "bucardo_syncdone_pagila_set1_node1_to_node2_pagila_2" from 22961
[Mon Feb  1 21:55:43 2010]  CTL Sent notice "bucardo_syncdone_pagila_set1_node1_to_node2"

In this case I need to "kick" the node 2 -> node 3 sync to get it to replicate, but I could configure the sync with a timeout so that it happened automatically. Once I do that, I get log messages for it as well.
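
The kick itself is a single bucardo_ctl command:

josh@eddie:~/devel/bucardo$ ./bucardo_ctl kick pagila_set1_node2_to_node3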

[Mon Feb  1 22:00:34 2010]  CTL Got notice "bucardo_syncdone_pagila_set1_node2_to_node3_pagila_3" from 22963
[Mon Feb  1 22:00:34 2010]  CTL Sent notice "bucardo_syncdone_pagila_set1_node2_to_node3"

Please consider giving slony_migrator.pl a try. I'd be glad to hear how it works out.