News

Welcome to End Point’s blog

Ongoing observations by End Point people

MongoDB replication from Postgres using Bucardo

One of the features of the upcoming version of Bucardo (a replication system for the PostgreSQL RDBMS) is the ability to replicate data to things other than PostgreSQL databases. One of those new targets is MongoDB, a non-relational 'document-based' database. (to be clear, we can only use MongoDB as a target, not as a source)

To see this in action, let's setup a quick example, modified from the earlier blog post on running Bucardo 5. We will create a Bucardo instance that replicates from two Postgres master databases to a Postgres database target and a MongoDB instance target. We will start by setting up the prerequisites:

sudo aptitude install postgresql-server \
perl-DBIx-Safe \
perl-DBD-Pg \
postgresql-contrib

Getting Postgres up and running is left as an exercise to the reader. If you have problems, the friendly folks at #postgresql on irc.freenode.net will be able to help you out.

Now for the MongoDB parts. First, we need the server itself. Your distro may have it already available, in which case it's as simple as:

aptitude install mongodb

For more installation information, follow the links from the MongoDB Quickstart page. For my test box, I ended up installing from source by following the directions at the Building for Linux page.

Once MongoDB is installed, we will need to start it up. First, create a place for MongoDB to store its data, and then launch the mongodb process:

$  mkdir /tmp/mongodata
$  mongod --dbpath=/tmp/mongodata --fork --logpath=/tmp/mongo.log
all output going to: /tmp/mongo.log
forked process: 428

You can perform a quick test that it is working by invoking the command-line shell for MongoDB (named "mongo" of course) Use quit() to exit:

$  mongo
MongoDB shell version: 1.8.1
Fri Jun 10 12:45:00
connecting to: test
> quit()
$ 

The other piece we need is a Perl driver so that Bucardo (which is written in Perl) can talk to the MongoDB server. Luckily, there is an excellent one available on CPAN named 'MongoDB'. We started the MongoDB server before doing this step because the driver we will install needs a running MongoDB instance to pass all of its tests. The module has very good documentation available on its CPAN page. Installation may be as easy as:

$  sudo cpan MongoDB

If that did not work for you (case matters!), there are more detailed directions on the Perl Language Center page.

Our next step is to grab the latest Bucardo, install it, and create a new Bucardo instance. See the previous blog post for more details about each step.

$ git clone git://bucardo.org/bucardo.git
Initialized empty Git repository...

$ cd bucardo
$ perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for Bucardo
$ make
cp bucardo.schema blib/share/bucardo.schema
cp Bucardo.pm blib/lib/Bucardo.pm
cp bucardo blib/script/bucardo
/usr/bin/perl -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/bucardo
Manifying blib/man1/bucardo.1pm
Manifying blib/man3/Bucardo.3pm
$ sudo make install
Installing /usr/local/lib/perl5/site_perl/5.10.0/Bucardo.pm
Installing /usr/local/share/bucardo/bucardo.schema
Installing /usr/local/bin/bucardo
Installing /usr/local/share/man/man1/bucardo.1pm
Installing /usr/local/share/man/man3/Bucardo.3pm
Appending installation info to /usr/lib/perl5/5.10.0/i386-linux-thread-multi/perllocal.pod
$ sudo mkdir /var/run/bucardo
$ sudo chown $USER /var/run/bucardo
$ bucardo install
This will install the bucardo database into an existing Postgres cluster.
...
Installation is now complete.

Now we create some test databases and populate with pgbench:

$ psql -c 'create database btest1'
CREATE DATABASE
$ pgbench -i btest1
NOTICE:  table "pgbench_branches" does not exist, skipping
...
creating tables...
10000 tuples done.
20000 tuples done.
...
100000 tuples done.
$ psql -c 'create database btest2 template btest1'
CREATE DATABASE
$ psql -c 'create database btest3 template btest1'
CREATE DATABASE
$ psql btest3 -c 'truncate table pgbench_accounts'
TRUNCATE TABLE

$ bucardo add db t1 dbname=btest1
Added database "t1"
$ bucardo add db t2 dbname=btest2
Added database "t2"
$ bucardo add db t3 dbname=btest3
Added database "t3"
$ bucardo list dbs
Database: t1  Status: active  Conn: psql -p 5432 -U bucardo -d btest1
Database: t2  Status: active  Conn: psql -p 5432 -U bucardo -d btest2
Database: t3  Status: active  Conn: psql -p 5432 -U bucardo -d btest3

$ bucardo add tables pgbench_accounts pgbench_branches pgbench_tellers herd=therd
Created herd "therd"
Added table "public.pgbench_accounts"
Added table "public.pgbench_branches"
Added table "public.pgbench_tellers"

$ bucardo list tables
Table: public.pgbench_accounts  DB: t1  PK: aid (int4)
Table: public.pgbench_branches  DB: t1  PK: bid (int4)
Table: public.pgbench_tellers   DB: t1  PK: tid (int4)

The next step is to add in our MongoDB instance. The syntax is the same as the "add db" above, but we also tell it the type of database, as it is not the default of "postgres". We will also assign an arbitrary database name, "btest1", the same as the others. Everything else (such as the port and host) is default, so all we need to say is:

$  bucardo add db m1 dbname=btest1 type=mongo
Added database "m1"
$  bucardo list dbs
Database: m1  Type: mongo     Status: active  
Database: t1  Type: postgres  Status: active  Conn: psql -p 5432 -U bucardo -d btest1
Database: t2  Type: postgres  Status: active  Conn: psql -p 5432 -U bucardo -d btest2
Database: t3  Type: postgres  Status: active  Conn: psql -p 5432 -U bucardo -d btest3

Next we group our databases together and assign them roles:

$  bucardo add dbgroup tgroup  t1:source  t2:source  t3:target  m1:target
Created database group "tgroup"
Added database "t1" to group "tgroup" as source
Added database "t2" to group "tgroup" as source
Added database "t3" to group "tgroup" as target
Added database "m1" to group "tgroup" as target

Note that "target" is the default action, so we could shorten that to:

$  bucardo add dbgroup tgroup t1:source  t2  t3  m1

However, I think it is best to be explicit, even if it does (incorrectly) hint that m1 could be anything *other* than a target. :)

We are almost ready to go. The final step is to create a sync (a basic replication event in Bucardo), then we can start up Bucardo, put some test data into the master databases, and 'kick' the sync:

$  bucardo add sync mongotest  herd=therd  dbs=tgroup  ping=false
Added sync "mongotest"

$  bucardo start
Checking for existing processes
Starting Bucardo

$  pgbench -t 10000 btest1
starting vacuum...end.
transaction type: TPC-B (sort of)
number of transactions actually processed: 10000/10000
...
tps = 503.300595 (excluding connections establishing)
$  pgbench -t 10000 btest2
number of transactions actually processed: 10000/10000
...
tps = 408.059368 (excluding connections establishing)
$  bucardo kick mongotest

We'll give it a few seconds to replicate those changes (it took 18 seconds on my test box), and then check the output of bucardo status:

$  bucardo status
PID of Bucardo MCP: 3317
 Name        State    Last good    Time    Last I/D/C    Last bad    Time  
===========+========+============+=======+=============+===========+=======
 mongotest | Good   | 21:57:47   | 11s   | 6/36234/898 | none      |

Looks good, but what about the data in MongoDB? Let's get some counts from the Postgres masters and slave, and then look at the data inside MongoDB with the mongo command-line client:

$  psql btest1 -c 'SELECT count(*) FROM pgbench_accounts'
100000
$  psql btest2 -c 'SELECT count(*) FROM pgbench_accounts'
100000
$  psql btest3 -c 'SELECT count(*) FROM pgbench_accounts'
18106
$  psql btest1 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$  psql btest2 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$  psql btest3 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010

$  mongo btest1
MongoDB shell version: 1.8.1
Fri Jun 10 12:46:00
connecting to: btest1
> show collections
bucardo_status
pgbench_accounts
pgbench_branches
pgbench_tellers
system.indexes
>  db.pgbench_accounts.count()
18106
>  db.pgbench_accounts.find().sort({abalance:1}).limit(1).next()
{
  "_id" : ObjectId("4df39bcb8795839660001de5"),
  "abalance" : -12071,
  "aid" : 84733,
  "bid" : 1,
  "filler" : "               "
}
> db.pgbench_accounts.find().sort({abalance:-1}).limit(1).next()
{
  "_id" : ObjectId("4df39bd08795839660002fb0"),
  "abalance" : 13010,
  "aid" : 45500,
  "bid" : 1,
  "filler" : "               "
}

Why the difference in counts? We only started replicating after we populated the Postgres tables on the master databases with 100,000 rows, so the eighteen thousand is the number of rows that was changed during the subsequent pgbench run. (Note that pgbench uses randomness, so your numbers will be different than the above). In the future Bucardo will support the "onetimecopy" feature for MongoDB, but until then we can fully populate the pgbench_accounts collection simply by "touching' all the records on one of the masters:

$ psql btest1 -c 'UPDATE pgbench_accounts SET aid=aid'
UPDATE 100000
$ bucardo kick mongotest
Kicked sync mongotest
$ echo 'db.pgbench_accounts.count()' | mongo btest1
MongoDB shell version: 1.8.1
Fri Jun 10 12:47:00
connecting to: btest1
> 100000
> bye

A nice feature of MongoDB is its autovivification ability (aka dynamic schemas), which means unlike Postgres you do not have to create your tables first, but can simply ask MongoDB to do an insert, and it will create the table (or, in mongospeak, the collection) automatically for you.

Because MongoDB has no concept of transactions, and because Bucardo does not update, but does deletes plus inserts (for reasons I'll not get into today), there is one more trick Bucardo does when replicating to a MongoDB instance. A collection named 'bucardo_status' is created and updated at the start and the end of a sync (a replication event). Thus, your application can pause if it sees this table has a 'started' value, and wait until it sees 'complete' or 'failed'. Not foolproof by any means, but better than nothing :) You should, of course, carefully consider the way your app and Bucardo will coordinate things.

Feedback from Postgres or MongoDB folk is much appreciated: there are probably some rough edges, but as you can see from above, the basics are there are working. Feel free to email the bucardo-general mailing list or make a feature request / bug report on the Bucardo Bugzilla page.

4 comments:

MattK said...

Any future plans to replicate from MongoDB to Postgres?

Greg Sabino Mullane said...

MattK: Not at the moment, as I don't know MongoDB at that deep a level; as it doesn't have triggers, I presume some sort of binary log reading would have to be done to pick up the changes. Also, we don't have any clients wanting to go in *that* direction, but if they do, that may alter the plans as well. :)

Kristina Chodorow said...

For Postgres->Mongo direction, there is a "human-readable" log of operations in the local.oplog.rs collection (see http://www.snailinaturtleneck.com/blog/2010/10/12/replication-internals/). I'm not sure what you could do about nested documents and things, though.

Greg Sabino Mullane said...

Thanks, that gives me a starting point. We can start small and use simple datatypes to start with.