Slony, sl_status and diagnosing a particular type of lag

During some routine checking on a Slony cluster, Greg noticed something curious. Replication was still happening between the master and a couple of slaves, but our indicator for lag inside Slony kept increasing.

To check out the status of slony replication, you will typically take a look at the view ‘sl_status’:

mydatabase=# select * from sl_status;
 st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts | st_lag_num_events |       st_lag_time
-----------+-------------+---------------+----------------------------+------------------+----------------------------+---------------------------+-------------------+-------------------------
         2 |           1 |       2697511 | 2008-04-30 02:40:06.034144 |          2565031 | 2008-04-14 15:31:32.897165 | 2008-04-14 16:24:08.81738 |            132480 | 15 days 10:16:03.060499
(1 row)

This view pulls data out of sl_event and sl_confirm, two tables that keep track of the forward progress of replication. Every time there is an event - SYNCs, DDL changes, slony administrative events - a row is added to sl_event. Slony is very chatty and so all of the slaves send events to each other, as well as the master. (That statement is a simplification, and it is possible to make some configuration changes that reduce the traffic, but in general, this is what people who set up slony will see.)

Broken down, the columns are:

st_origin: the local slony system
st_received: the slony instance that sent an event
st_last_event: the sequence number of the last event received from that origin/received pair
st_last_event_ts: the timestamp on the last event received
st_last_received: the sequence number of the last sl_event + sl_confirm pair received
st_last_received_ts: the timestamp on the sl_confirm in that pair
st_last_received_event_ts: the timestamp on the sl_event in that pair
st_lag_num_events: difference between st_last_event and st_last_received
st_lag_time: difference between st_last_event_ts and st_last_received_ts

Depending on the type of event, a row might be added to sl_confirm immediately (by the same thread that created the event), or it may be created separately by another process. The important thing here is that there is a separation between sl_event and sl_confirm, so it is possible for sl_event SYNCs (replication events) to keep coming through and being applied to the server without the corresponding sl_confirm rows ever being created.
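If you want to poke at the raw data behind sl_status yourself, you can query sl_confirm directly. A rough sketch, assuming a cluster named "mycluster" (so the Slony schema is _mycluster); adjust names to your setup:

mydatabase=# SELECT con_origin, con_received,
                    max(con_seqno)     AS last_confirmed_event,
                    max(con_timestamp) AS last_confirmed_at
             FROM _mycluster.sl_confirm
             GROUP BY con_origin, con_received
             ORDER BY con_origin, con_received;

A con_timestamp that stops advancing for a particular origin/received pair is the same symptom sl_status shows as growing lag.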

We have a monitor which checks the status of replication by looking at a recently added value on the master and comparing that to what is on the slave. This works well for workloads that are primarily append-only. So, that monitor thought replication was working fine, even though the lag was increasing steadily.

The sl_event and sl_confirm tables are periodically cleaned up by Slony's cleanupEvent() function, which typically runs every 100 seconds. When the slon process kicks it off, it checks what the newest confirmed events are and deletes the old event and confirm rows.

When confirms stop coming through, sl_event rows can’t be cleaned up on the affected server (because they haven’t been confirmed!). Depending on how active your servers are, this will eat up disk space. But you’ve got disk space monitors in place, right? :)
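A rough way to watch for this, besides the usual disk checks, is to track the size of the Slony tables themselves. A sketch, again assuming the cluster schema is _mycluster (sl_log_1 and sl_log_2 hold the actual replication data in recent Slony versions):

mydatabase=# SELECT c.relname, pg_size_pretty(pg_relation_size(c.oid)) AS size
             FROM pg_class c
             JOIN pg_namespace n ON n.oid = c.relnamespace
             WHERE n.nspname = '_mycluster'
               AND c.relname IN ('sl_event', 'sl_confirm', 'sl_log_1', 'sl_log_2');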

So, how do you fix the problem when the confirms stop coming through?

I had a look at process tables on all the slon slaves, and noticed that on the two lagged systems, there was no incoming connection from the master slony system. The fix: restart slony on the master so that it could reconnect.
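For reference, the check was nothing fancy; roughly this on each node (the slon user name and connection details will vary with your setup, so treat this as a sketch):

$ ps auxww | grep [s]lon
$ psql -c "SELECT datname, usename, client_addr FROM pg_stat_activity"

On the lagged slaves, there was simply no row (and no process) corresponding to a connection from the master's slon.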

There are a couple of things I wish Slony would have told me:

  • Notification on the slave that it no longer had its connection back to the master. We’ll set up our own monitors to detect that this connection no longer exists, but it would be much nicer for slony to warn about this. Additionally, it would be nice to be able to re-connect to a single slave without restarting slon entirely.
  • More explanation about sl_confirm and likely causes of failed confirmations. I hope I’ve shed a little light with this blog post.

The documentation for setting up slony is very good, but the troubleshooting information is lacking around events and confirmations, and how each type of event and confirmation actually happens. I’m happy to be proven wrong -- so please leave pointers in the comments!

Comparing databases with check_postgres

One of the more recent additions to check_postgres, the all-singing, all-dancing Postgres monitoring tool, is the "same_schema" action. This was necessitated by clients who wanted to make sure that their schemas were identical across different servers. The two use cases I've seen are servers that are being replicated by Bucardo or Slony, and servers that are doing horizontal sharding (e.g. the same schema and database on different servers, with the server you go to depending on, for example, your customer id). Oftentimes a new index fails to make it to one of the slaves, or some function is tweaked on one server by a developer, who then forgets to change it back or propagate it. This program allows a quick and automatable check for such problems.

The idea behind the same_schema check is simple: we walk the schema and check for any differences, then throw a warning if any are found. In this case, we're using the term "schema" in the classic sense of a description of your database objects. Thus, one of the things we check is that all the schemas (in the classic RDBMS sense of a container of other database objects) are the same, when running the "same_schema" check. Only slightly confusing. :)

Not only is this program nice for monitoring (e.g. as a Nagios check), but if you pass in a --verbose argument, you get a simple not-all-on-one-line breakdown of all the differences between the two databases. Let's do a quick example.

First, we download and install check_postgres. We'll pull straight from a git repository for check_postgres. While we have our own repo at bucardo.org, we are also keeping it in sync with a tree at github.com, so we'll use that one:

git clone git://github.com/bucardo/check_postgres.git
cd check_postgres
perl Makefile.PL
make
make test
sudo make install

Let's create a Postgres cluster with the initdb command, start it up, then create two new databases to compare to each other.

initdb -D cptest
echo port=5555 >> cptest/postgresql.conf
pg_ctl -D cptest -l cp.log start
psql -p 5555 -c 'CREATE DATABASE yin'
psql -p 5555 -c 'CREATE DATABASE yang'

We're ready to run the script. By default, it outputs things in a Nagios-friendly manner. We should see an 'OK' because the two databases are identical:

./check_postgres.pl --action=same_schema --dbport=5555 --dbname=yin --dbport2=5555 --dbname2=yang

POSTGRES_SAME_SCHEMA OK: DB "yin" (port=5555 => 5555) Both databases have identical items | time=0.01

The message could be clearer and show both database names, but the check worked and showed that things are exactly the same. Let's throw some differences in and run it again:

psql -p 5555 -d yin -c 'create table foobar(a int primary key, b text, c text)'
psql -p 5555 -d yang -c 'create table foobar(a int, b text, c varchar(99))'
psql -p 5555 -d yin -c 'create schema yinonly'
psql -p 5555 -d yang -c 'create table pineapple(id int)'

./check_postgres.pl --action=same_schema --dbport=5555 --dbname=yin --dbport2=5555 --dbname2=yang

POSTGRES_SAME_SCHEMA CRITICAL: DB "yin" (port=5555 => 5555) Databases were different. Items not matched: 5 | time=0.01
Schema in 1 but not 2: yinonly  Table in 2 but not 1: public.pineapple  Column "a" of "public.foobar": nullable is NO on 1, but YES on 2.  Column "c" of "public.foobar": type is text on 1, but character varying on 2.  Table "public.foobar" on 1 has constraint "public.foobar_pkey", but 2 does not. 

It works, but it's a little messy for human consumption. Nagios requires everything to be on a single line, but we'll add a --verbose argument to ask the script for prettier formatting:

./check_postgres.pl --action=same_schema --dbport=5555 --dbname=yin --dbport2=5555 --dbname2=yang --verbose

POSTGRES_SAME_SCHEMA CRITICAL: DB "yin" (port=5555 => 5555) Databases were different. Items not matched: 5 | time=0.01
Schema in 1 but not 2: yinonly
Table in 2 but not 1: public.pineapple
Column "a" of "public.foobar": nullable is NO on 1, but YES on 2.
Column "c" of "public.foobar": type is text on 1, but character varying on 2.
Table "public.foobar" on 1 has constraint "public.foobar_pkey", but 2 does not.

There are also ways to filter the output, for times when you have known differences. For example, to exclude any tables with the word 'bucardo' in them, you could add this argument:

--warning="notable=bucardo"

The online documentation has more details about all the filtering options.
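For example, to run the same check as above while skipping any tables with 'bucardo' in the name:

./check_postgres.pl --action=same_schema --dbport=5555 --dbname=yin --dbport2=5555 --dbname2=yang --warning="notable=bucardo"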

So what kind of things do we check for? Right now, we are checking:

  • users (existence and powers, i.e. createdb, superuser)
  • schemas
  • tables
  • sequences
  • views
  • triggers
  • constraints
  • columns
  • functions (including volatility, strictness, etc.)

Got something else we aren't covering? Send in a patch, or a quick request, to the mailing list.

SDCH: Shared Dictionary Compression over HTTP

Here's something new in HTTP land to play with: Shared Dictionary Compression over HTTP (SDCH, apparently pronounced "sandwich") is a new HTTP 1.1 extension announced by Wei-Hsin Lee of Google last September. Lee explains that with it "a user agent obtains a site-specific dictionary that then allows pages on the site that have many common elements to be transmitted much more quickly." SDCH is applied before gzip or deflate compression, and Lee notes 40% better compression than gzip alone in their tests. Access to the dictionaries stored in the client is scoped by site and path just as cookies are.
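Roughly, the negotiation happens through a few extra HTTP headers. This is only a sketch based on the draft proposal (the dictionary hash and path below are made up, and the header names may change as the spec evolves):

GET / HTTP/1.1
Accept-Encoding: sdch, gzip
Avail-Dictionary: auEl3Bde

HTTP/1.1 200 OK
Get-Dictionary: /dictionaries/search_dict
Content-Encoding: sdch, gzip

The client advertises SDCH support and any dictionary it already holds; the server can point it at a dictionary to fetch and then serve responses encoded against it, with gzip applied on top.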

The first client support was in the Google Toolbar for Internet Explorer, but it is now going to be much more widely used because it is supported in the Google Chrome browser for Windows. (It's still not in the latest Chrome developer build for Linux, or at any rate not enabled by default if the code is there.)

Only Google's web servers support it to date, as far as I know. Someone intended to start a mod_sdch project for Apache, but there's no code at all yet and no activity since September 2008.

It is interesting to consider the challenge this will pose for HTTP proxies that filter content, since the entire content would not be available to the proxy to scan during a single HTTP conversation. Sneakily split malicious payloads would then be reassembled by the browser or other client, not requiring JavaScript or other active reassembly methods. This forum thread discusses the threat and gives an example of stripping the Accept-encoding: sdch request headers to prevent SDCH from being used at all. Though the threat is real, it's hard to escape the obvious analogy with TCP filtering, which had to grow from stateless to more difficult stateful packet inspection. New features mean not just new benefits but also new complexity, but that's no reason to reflexively reject them.


OSCON so far! Filesystem information bonanza on Wednesday

Wednesday was the first official day of OSCON, and I spent it elbow deep in filesystems. The morning was kicked off by Val Aurora, who delivered a great overview of Btrfs, a new filesystem currently in development. Some of the features include:

  • Copy on write filesystem
  • Cheap, easy filesystem snapshots
  • Dynamically resizable partitions
  • Indexed directory structure
  • Very simple administration

Val demonstrated basic functionality, including creating snapshots and creating a Btrfs filesystem on top of an ext3 filesystem. Cool stuff! The filesystem is still under heavy development, but seems very promising.

Next I saw Theodore Ts'o, the primary developer behind ext4, talk about the future of filesystems and storage. He referenced a great paper that dives deep into the economics behind SSD (solid state drive) and platter hard drive manufacturing. One interesting calculation was that even if we could convert all the silicon fabs to manufacturing flash, we would only be able to cover about 12% of the worldwide capacity of hard drive production. Because of this, Theodore believes it is going to be challenging for the cost of SSDs to drop to the point where they become cost competitive with hard drives.

Other observations from Theodore concerned the slowing of innovation around hard drives, and companies like Seagate cutting back their R&D departments. He sees opportunity for software and filesystem innovation in this environment, and so far that is playing out in the rapid development of new filesystems for Linux (NILFS2, POHMELFS, and exofs being three recent examples). One open issue he brought up is the need for more and better benchmarking tools.

In the afternoon, I presented Linux Filesystem Performance for Databases. I've uploaded the slides to the conference site. I talked about the work that the Portland PostgreSQL Performance Pad team did on filesystem testing with some hardware donated by HP. I also included results from some recent DBT-2 tests Mark had run with PostgreSQL, using pgtune and then refining a few key parameters.

There were quite a few interesting questions, and I talked to one of the Wikia admins about a recent change he'd made to use SSDs instead of hard drives in some of their servers. I mentioned that it would be great to see a case study and data from his experience.

Fedora goes up to eleven

I upgraded to Fedora Core 11 this week, up from 9. What really surprised me is how fast it is. I don't recall ever having such a noticeably faster and more responsive desktop after a distro upgrade. Everything is more responsive and instantaneous. Even though FC9 didn't seem particularly slow, I realize now that I had been spending time just waiting for the software to catch up. I don't know where the credit is due. Could be anything from the kernel to XFCE to the apps. But I like it.

Normally, most SELinux issues occur when it is enabled. But after the upgrade I ran into one of the opposite variety: the gedit program threw an error on every file save. The 244605 and 477070 tickets might be the same issue I ran into. I worked around the problem by switching to 'kate -u'.

I need at least four different text editors in order to enjoy using the computer. How could anyone stand the monotony of a single text editor all day? Right now I'm doing:

  • gvim for detailed things
  • vanilla vim for when gvim is too awesome.
  • kate for vogon poetry
  • jedit for when I get that certain IDE feeling (you know the one)
  • emacs for inducing carpal tunnel syndrome

It's also important to switch out text editors at regular intervals, such as when one of them breaks in a distro upgrade.

The first few minutes after booting up a fresh distro install are disorienting. Keyboard shortcuts and other customizations are so much a part of me that I can hardly function at all without them. Thank goodness that it's only a matter of minutes before /home gets restored and everything is back to normal.

The policy I follow is to upgrade to every odd-numbered Fedora release. I don't actually know if the even-numbered releases are worse or better, but based on my experience with Star Trek movies, I'm not going to take any chances. The majority of my coworkers run non-Fedora distros, even the ones that used to work at Red Hat. Some of them do RHEL, which I did a few times when I was waiting for an odd-numbered Fedora release.

Although I considered the B-tree FS with its "i-can't-believe-it's-not-btr" kernel option, ext4 won in the end. With the new policy of updating atime only once per day, I'm leaving it enabled. I couldn't get the nvidia driver to load until I removed nouveau from initrd and downgraded to 173xx.

Overall, FC11 has been a great upgrade. Thanks to Free Software developers the world over.

Gmail Contacts Notes Converter

As I mentioned previously, I recently got a Google Ion phone running Android. I have now begun using it as my main mobile phone, and thus needed to finally migrate the contacts from my Nokia 6126 phone to Android.

This is apparently easy to do by first copying all the contacts from the Nokia 6126 internal memory to the SIM card, then moving the SIM card to the Ion and importing the contacts. But that only works if all your contacts fit on the SIM card. If not, they're truncated, and you have to delete many contacts on the Nokia to fit more, which would be a nonreversible move.

Several posts describe ways to do the export and import, such as this one that didn't really apply to my phone, and this one that involves VCF export & import which I didn't see a way to do.

Ultimately I found an article that described Nokia's PC Suite software that I'd never heard of before, which I downloaded on an old Windows machine and used to download the contacts from the phone via Bluetooth, then export to a CSV file and import into Gmail. So far, so good.

Except as this post and another post describe, then all the contact data showed up in a single Notes field, useless for dialing or emailing.

I decided it would be easiest to convert the Notes data into normal phone fields since I already had some contact information in the phone and couldn't find any other reasonable way.

I came up with this Gmail Contacts Notes Converter script to solve the problem. It takes Gmail-exported CSV as input, converts any Notes field data into standard Work or Personal contacts, and outputs CSV that can be re-imported into Gmail. It requires Perl 5.10.0 and I've only tested it on Linux. (It could be modified to work with earlier Perl versions fairly easily.)

Perhaps it will be useful to someone else as well.

pgGearman 0.1 release!

Yesterday, Brian Aker and Eric Day presented pgGearman: A distributed worker queue for PostgreSQL during the OSCON/SFPUG PgDay.

Gearman is a distributed worker queuing system that allows you to farm work out to a collection of servers, and basically run arbitrary operations. The example they presented was automating and distributing the load of image processing for Livejournal. For example, everyone loves to share pictures of their kittens, but once an image is uploaded, it may need to be scaled or cropped in different ways to display in different contexts. Gearman is a tool you can use to farm these types of jobs out.

So, in anticipation of the talk, I worked with Eric Day on a set of C-language user defined functions for Postgres that allow client connections to a Gearman server.

You can try out the pgGearman 0.1 release on Launchpad!

CSS @font-face in Firefox 3.5

This has been frequently mentioned around the web already, but it's important enough that I'll bring it up again anyway. Firefox 3.5 adds the CSS @font-face rule, which makes it possible to reference fonts not installed in the operating system of the browser, just as is done with images or other embedded content.
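The rule itself is simple. A minimal sketch (the font name and URL here are just placeholders):

@font-face {
    font-family: "My Web Font";
    src: url("/fonts/mywebfont.otf");
}

h1 {
    font-family: "My Web Font", sans-serif;
}

The browser downloads the referenced font file and uses it anywhere the declared family name appears, falling back to the next family in the list if the download fails.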

Technically this is not a complicated matter, but font foundries (almost all of whom have a proprietary software business model) have tried to hold it back, hoping for magical DRM to keep people from using fonts without paying for them, which of course isn't possible. As one of the original Netscape developers mentioned, if they had waited for such a thing for images, the web would still be plain text only.

The quickest way to get a feel for the impact this change can have is to look at Ian Lynam & Craig Mod's article demonstrating @font-face in Firefox 3.5 side-by-side with any of the other current browsers. It is exciting to finally see this ability in a mainstream browser after all these years.

Subverting PostgreSQL Aggregates for Pentaho

In a recent post I described MDX and a project I'm working on with the Mondrian MDX engine. In this post I'll describe a system I implemented to overcome one of Mondrian's limitations.

Each Mondrian measure has an associated aggregate function defined. For instance, here's a measure from the sample data that ships with Pentaho:

<Measure name="Quantity" column="QUANTITYORDERED" aggregator="sum" />

The schema defines the database connection properties and the table this cube deals with elsewhere; this line says there's a column called QUANTITYORDERED which Mondrian can meaningfully aggregate with the sum() function. Mondrian knows about six aggregates: count, avg, sum, min, max, and distinct-count. And therein lies the problem. In this case, the client wanted to use other aggregates such as median and standard deviation, but Mondrian didn't provide them[1].

Mondrian uses the aggregator attribute of the measure definition to generate SQL statements exactly as you might expect. In the case of the measure above, the SQL query involving that measure would read "sum(QUANTITYORDERED)". In our case, Mondrian is backed by a PostgreSQL database, which offers a much richer set of aggregates (such as stddev() for the standard deviation, one of the numbers we need), but Mondrian doesn't know how to get to them.

Measures can be defined in terms of SQL expressions, rather than simple column names, but this doesn't immediately help. If I wanted the standard deviation of the quantity ordered, I might try something like this:

<Measure name="Quantity">
    <KeyExpression><SQL dialect="postgres">
        stddev(quantityordered)
    </SQL></KeyExpression>
</Measure>

Here, Mondrian would complain that the measure was defined without an aggregator attribute. And if I define one, such as sum, the resulting SQL becomes "sum(stddev(quantityordered))", which is illegal and makes PostgreSQL complain about nested aggregates.

But PostgreSQL's function overloading can help here. Mondrian's generated SQL will always include a call to a "count()" function if the aggregator is defined as "count", but there's no reason we can't make PostgreSQL use some other count() function. For instance, let's define a new "count()" function that isn't an aggregate, but simply returns whatever argument it is passed. Then we can use it to wrap whatever function we want, including arbitrary aggregate functions.

Consider an attempt to get Mondrian to use the stddev() aggregate. It returns a DOUBLE PRECISION type, so our fake count function must simply accept a DOUBLE PRECISION variable and return it:

CREATE FUNCTION count(DOUBLE PRECISION) RETURNS DOUBLE PRECISION AS $$
    SELECT $1
$$ LANGUAGE SQL IMMUTABLE;

Then we define a measure like this:

<Measure name="Quantity Std. Dev" aggregator="count">
    <KeyExpression><SQL dialect="postgres">
        stddev(quantityordered)
    </SQL></KeyExpression>
</Measure>

The resulting SQL is "count(stddev(quantityordered))", but in this case PostgreSQL uses our new count() function, and we get exactly the return value we want.

There's a catch: if we have a double precision column "foo" in a table "bar", and write:

SELECT count(foo) FROM bar;

...it uses our new count function, and rather than returning the number of rows in bar, it returns the value for foo from each row in bar.

To get around this problem, we can define a new data type. We'll write a function to create that datatype from another data type, and rewrite our count function to accept only that data type, and return the original data type, like this:

CREATE TYPE dp_cust AS (
    dp DOUBLE PRECISION); 

CREATE FUNCTION make_dpcust(a DOUBLE PRECISION) RETURNS dp_cust IMMUTABLE AS $$
DECLARE
    dpc dp_cust;
BEGIN
    dpc.dp := a;
    RETURN dpc;
END;
$$ LANGUAGE plpgsql;

DROP FUNCTION count(double precision);

CREATE FUNCTION count(dp_cust) RETURNS DOUBLE PRECISION IMMUTABLE AS $$
    SELECT $1.dp
$$ LANGUAGE sql;

Now our count() function will only be called when we're dealing with the dp_cust type, and we can control precisely when that happens, because the only way we make dp_cust values will be with the make_dpcust function. Our measure now looks like this:

<Measure name="Quantity Std. Dev" aggregator="count">
    <KeyExpression><SQL dialect="postgres">
        make_dpcust(stddev(quantityordered))
    </SQL></KeyExpression>
</Measure>

With this new data type and our custom count() function we can use whatever PostgreSQL aggregate we want as a measure aggregate in Mondrian.

[1] Note that the Mondrian developers already recognize this as a shortcoming worth removing. Allowing user-defined aggregates is on the Mondrian roadmap.

Bucardo and truncate triggers

Version 8.4 of Postgres was recently released. One of the features that hasn't gotten a lot of press, but which I'm excited about, is truncate triggers. This fixes a critical hole in trigger-based PostgreSQL replication systems, and support for these new triggers is now working in the Bucardo replication program.

Truncate triggers were added to Postgres by Simon Riggs (thanks Simon!), and unlike other types of triggers (UPDATE, DELETE, and INSERT), they are statement-level only, as truncate is not a row-level action.
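For the curious, here is a minimal sketch of what such a trigger looks like in 8.4 (Bucardo creates its own, more elaborate triggers automatically; the table and function names below are just placeholders):

CREATE FUNCTION note_truncate() RETURNS trigger AS $$
BEGIN
    -- Statement-level trigger: fires once per TRUNCATE, not per row
    RAISE NOTICE 'Table % was truncated', TG_TABLE_NAME;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mytable_truncate_notice
    AFTER TRUNCATE ON mytable
    FOR EACH STATEMENT EXECUTE PROCEDURE note_truncate();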

Here's a quick demo showing off the new triggers. This is using the development version of Bucardo - a major new version is expected to be released in the next week or two that will include truncate trigger support and many other things. If you want to try this out for yourself, just run:

$ git clone http://bucardo.org/bucardo.git/

Bucardo does three types of replication; for this example, we'll be using the 'pushdelta' method, which is your basic "master to slaves" relationship. In addition to the master database (which we'll name A) and the slave database (which we'll name B), we'll create a third database for Bucardo itself.

$ initdb -D bcdata
$ initdb -D testA 
$ initdb -D testB 

(Technically, we are creating three new database clusters, and since we are doing this as the postgres user, the default database for all three will be 'postgres')

Let's give them all unique port numbers:

$ echo port=5400 >> bcdata/postgresql.conf
$ echo port=5401 >> testA/postgresql.conf 
$ echo port=5402 >> testB/postgresql.conf 

Now start them all up:

$ pg_ctl start -D bcdata -l bc.log
$ pg_ctl start -D testA -l A.log
$ pg_ctl start -D testB -l B.log

We'll create a simple test table on both sides:

$ psql -d postgres -p 5401 -c 'CREATE TABLE trtest(id int primary key)'
$ psql -d postgres -p 5402 -c 'CREATE TABLE trtest(id int primary key)'

Before we go any further, let's install Bucardo itself. Bucardo is a Perl daemon that uses a central database to store its configuration information. The first step is to create the Bucardo schema. This, like almost everything else with Bucardo, is done with the 'bucardo_ctl' script. The install process is interactive:

$ bucardo_ctl install --dbport=5400

This will install the bucardo database into an existing Postgres cluster.
Postgres must have been compiled with Perl support,
and you must connect as a superuser

We will create a new superuser named 'bucardo',
and make it the owner of a new database named 'bucardo'

Current connection settings:
1. Host:          
2. Port:          5400
3. User:          postgres
4. PID directory: /var/run/bucardo
Enter a number to change it, P to proceed, or Q to quit: P

Version is: 8.4
Attempting to create and populate the bucardo database and schema
Database creation is complete

Connecting to database 'bucardo' as user 'bucardo'
Updated configuration setting "piddir"
Installation is now complete.

If you see any unexpected errors above, please report them to bucardo-general@bucardo.org

You should probably check over the configuration variables next, by running:
bucardo_ctl show all
Change any setting by using: bucardo_ctl set foo=bar

Because we don't want to tell the bucardo_ctl program our custom port each time we call it, we'll store that info into the ~/.bucardorc file:

$ echo dbport=5400 > ~/.bucardorc

Let's double check that everything went okay by checking the list of databases that Bucardo knows about:

$ bucardo_ctl list db
There are no entries in the 'db' table.

Time to teach Bucardo about our two new databases. The format for the add commands is: bucardo_ctl add [type of thing] [name of the thing within the database] [arguments in foo=bar format]

$ bucardo_ctl add database postgres name=master port=5401
Database added: master

$ bucardo_ctl add database postgres name=slave1 port=5402
Database added: slave1

Before we go any further, let's look at our databases:

$ bucardo_ctl list dbs
Database: master   Status: active
Conn: psql -h  -p 5401 -U bucardo -d postgres

Database: slave1   Status: active
Conn: psql -h  -p 5402 -U bucardo -d postgres

Note that by default we connect as the 'bucardo' user. This is a highly recommended practice, for safety and auditing. Since that user obviously does not exist on the newly created clusters, we need to add it in:

$ psql -p 5401 -c 'create user bucardo superuser'
$ psql -p 5402 -c 'create user bucardo superuser'

Now we need to teach Bucardo about the tables we want to replicate:

$ bucardo_ctl add table trtest db=master herd=herd1
Created herd "herd1"
Table added: public.trtest

A herd is simply a named collection of tables. Typically, you put tables that are linked together by foreign keys or other logic into a herd so that they all get replicated at the same time.
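If we had more tables to replicate along with trtest, we would simply add them to the same herd (the table name below is just a placeholder):

$ bucardo_ctl add table mytable2 db=master herd=herd1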

The final setup step is to create a replication event, which in Bucardo is known as a 'sync':

$ bucardo_ctl add sync willow source=herd1 targetdb=slave1 type=pushdelta
NOTICE:  Starting validate_sync for willow
CONTEXT:  SQL statement "SELECT validate_sync('willow')"
Sync added: willow

This command actually did quite a bit of work behind the scenes, including creating all the supporting schemas, tables, functions, triggers, and indexes that Bucardo will need.

We are now ready to start Bucardo up. Simple enough:

$ bucardo_ctl start
Checking for existing processes
Starting Bucardo

Let's add a row to the master table and make sure it goes to the slave:

$ psql -p 5401 -c 'insert into trtest(id) VALUES (1)'
INSERT 0 1
$ psql -p 5402 -c 'select * from trtest'
 id
----
  1
(1 row)

Looks fine, so let's try out the truncate. On versions of Postgres before 8.4, there was no way for Bucardo (or Slony) to know that a truncate had been run, so the rows were removed from the master but not from the slave. We'll do a truncate and add a new row in a single transaction:

$ psql -p 5401 -c 'begin; truncate table trtest; insert into trtest values (2); commit'
COMMIT
$ psql -p 5402 -c 'select * from trtest'
 id
----
  2
(1 row)

It works! Let's clean up our test environment for good measure:

$ bucardo_ctl stop
$ pg_ctl stop -D bcdata
$ pg_ctl stop -D testA
$ pg_ctl stop -D testB

As mentioned, there are three types of syncs in Bucardo. The other type that can make use of truncate triggers is the 'swap' sync, aka "master to master". I've not yet decided on the behavior for such syncs, but one possibility is simply:

  • Database A gets truncated at time X
  • Bucardo truncates database B, then discards all delta rows older than X for both A and B, and all delta rows for B
  • Everything after X gets processed as normal (conflict resolution, etc.)
  • The same thing for a truncate on database B (truncate A, discard all older rows).

Second proposal:

  • Database A gets truncated at time X
  • We populate the delta table with every primary key in the table before truncation (assuming we can get at it)
  • That's it! Bucardo does its normal thing as if we just deleted a whole bunch of rows on A, and in theory deletes them from B as well.

Comments on this strategy welcome!

Update: Clarified initdb cluster vs. database per comment #1 below, and added new truncation handling scheme for multi-master replication per comment #2.

MDX

Recently I've been working with Mondrian, an open source MDX engine. MDX stands for "multi-dimensional expressions", and is a query language used in analytical databases. In MDX, data are considered in "cubes" made up of "dimensions", which are concepts analogous to "tables" and "columns", respectively, in a relational database. And in MDX, much as in SQL, queries written in a special query language tell the MDX engine to return a data set by describing filters in terms of the various dimensions.

But MDX and SQL return data sets in very different ways. Whereas a SQL query will return individual rows (unless aggregate functions are used), MDX always aggregates rows. In MDX, dimensions aren't simple fields that contain arbitrary values; they're hierarchical objects that can be queried at different levels. And finally, in MDX only certain dimensions can be returned in a query. These dimensions are known as "Measures".

Without an example this doubtless makes little sense at first glance. In my case, the underlying data come from a public health application. Among other responsibilities, public health departments are tasked with preventing the spread of disease. Some diseases, such as tuberculosis or swine flu, are of particular interest because of their virulence, their mortality, or other characteristics. Health care providers are legally required to report cases of these diseases to various public health organizations, where the data are analyzed to identify and control outbreaks. The cube in question describes cases of these reportable conditions. Dimensions include the particular disease, the patient's gender and race, the health department jurisdiction the patient lives in, and a few other characteristics. Among the available measures are the count of cases, the average age of each patient, and the average duration of the local public health department's investigation into the case.

You'll note that each of the measures describes groups of cases: a count of cases, the average from a group of values, etc. MDX will tell me the number of cases that meet a criterion, for instance, but not the names of each patient involved. As I said before, MDX only returns aggregates, not individual rows. Each measure's definition includes an aggregate function used to calculate the final value for that measure based on a group of rows in a database.

The cube also uses hierarchical dimensions. As an example, public health data categorizes cases by age group rather than by age. Groups include '< 1 year' and '1-4 years' at the young end, '85+ years' at the older end, and five year increments for everything in between. So the age dimension hierarchy would include two levels: one for the age group, and one for the specific age. In some instances, the jurisdiction dimension might also be a hierarchy, with the public health department at the top level, and subdivisions such as county, zip code, or neighborhood in levels of increasing specificity underneath.
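In a Mondrian schema, such a hierarchy might be declared something like this (a rough sketch; the table and column names are invented, and attribute details vary between Mondrian versions):

<Dimension name="Age">
    <Hierarchy hasAll="true" allMemberName="All Ages">
        <Table name="cases"/>
        <Level name="Age Group" column="age_group"/>
        <Level name="Age" column="age"/>
    </Hierarchy>
</Dimension>

Queries can then ask for figures at the "Age Group" level, or drill down to individual ages within a group.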


At this point, the SQL-oriented reader says, "Well, you can do all this in SQL," and that is perfectly true. In fact, the major duty of an MDX engine is generally to translate MDX queries into SQL queries (or more often, sets thereof). The advantage of MDX is that sometimes it's simply easier to express a particular set of dimensions and measures in MDX than in the corresponding set of SQL queries. Better still, there are nice applications that speak MDX and allow you to browse interactively through MDX cubes without knowing either MDX or SQL. And finally, when the data set gets really large, which is common in OLAP databases, the MDX engine knows about optimizations it can make to speed things up.

A simple MDX query might look like this:

SELECT
  NON EMPTY {[Measures].[Quantity]} ON COLUMNS,
  NON EMPTY
      {([Markets].[All Markets], [Customers].[All Customers],
        [Product].[All Products], [Time].[All Years],
        [Order Status].[All Status Types])} ON ROWS
FROM [SteelWheelsSales]

These data come from a cube that ships as a sample with the open source Pentaho business intelligence software suite. [SteelWheelsSales] represents the cube name; other bracketed expressions are measure and dimension names. "ON ROWS" and "ON COLUMNS" describe the "axis" on which the particular measure or dimensions should be displayed. The "ROWS" and "COLUMNS" axes exist by default, and others can be defined at will. The query above gives a result set like this one:

This image shows what a more complex Mondrian MDX session might look like. The cube describes sales data from a sample business. Users can easily "slice and dice" data to view trends over time, variations across sales regions or product lines, or mixtures thereof. In this case, the rows describe various combinations of dimension values, and each cell contains the one measure this query asks for, aggregated across the rows that match the corresponding dimensions.

For more on MDX query syntax or MDX in general, see this Microsoft library MDX reference.

MTU tweak: a fix for upload pain

While traveling and staying at Hostel Tyn in Prague's city center, I ran into a strange problem with my laptop on their wireless network.

When many people were using the network (either on the hostel's public computers or on the wireless network), sometimes things bogged down a bit. That wasn't a big deal and required merely a little patience.

But after a while I noticed that absolutely no "uploads" worked. Not via ssh, not via browser POST, nothing. They always hung. Even when only a file upload of 10 KB or so was involved. So I started to wonder what was going on.

As I considered trying some kind of rate limiting via iptables, I remembered somewhere hearing that occasionally you can run into mismatched MTU settings between the Ethernet LAN you're on and your operating system's network settings.

I checked my setup and saw something like this:

ifconfig wlan0
wlan0     Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet addr:10.x.x.x  Bcast:10.x.x.x  Mask:255.255.255.0
          inet6 addr: fe80::xxx:xxxx:xxxx:xxxx/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1239 errors:0 dropped:0 overruns:0 frame:0
          TX packets:20 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:191529 (191.5 KB)  TX bytes:4543 (4.5 KB)

The MTU 1500 stood out as being worthy of tweaking. So I tried a completely unscientific change:

sudo ifconfig wlan0 mtu 1400

Then tried the same HTTP POST that had been consistently failing, and poof! It worked fine. Every time.

I think most likely something more than 1400 bytes would've been possible, perhaps just a few short of 1500. The number 1492 rings familiar. I'll be old-fashioned and not look it up on the web. But this 1400-byte MTU worked fine and solved the problem. To my delight.
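One way to find the right value rather than guessing is to ping with the don't-fragment bit set. A rough sketch using Linux's iputils ping (the hostname is a placeholder, and 28 bytes of IP and ICMP headers get added on top of the payload size):

ping -M do -s 1472 example.com    # 1472 + 28 = 1500
ping -M do -s 1372 example.com    # 1372 + 28 = 1400

If the larger payload never gets a reply but the smaller one does, the usable path MTU is somewhere in between.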

As an interesting aside, before making the change, I found one web application where uploads did work fine anyway: Google's Picasa. I'm not sure why, but maybe it sliced & diced the upload stream into smaller chunks on its own? A mystery for another day.

Last day in Nigeria! A short summary

Today is my last day in Nigeria. I hop in a car in a couple hours and head off to visit the university in Akure, and then I will be driving to Lagos to catch a plane home.

My students are pictured above. We covered a great deal of material this week. They learned about the PostgreSQL project, basic database administration, how to develop a schema from forms and application requirements, how to write procedural code inside the database, and how to use the pgAdmin and psql interfaces.

I learned that many of the officials and IT workers I met (both in the class and outside of it) had worked very hard on the court case that led to the change in government in Ondo State three months ago. There had been systematic election fraud, and they were able to prove it in court using some clever IT and forensic analysis work. The members of SITEDEC believe very strongly in the importance of IT in increasing government accountability and transparency, a belief re-affirmed by their recent successes.

I'm looking forward to hearing about how the work progresses on their census and voter registration databases. Of course, I want to come back to Nigeria. It's a beautiful country, and I didn't have nearly enough time here to appreciate it.

Windows installer tip: passwords

Updated below!

When specifying a password for the Windows PostgreSQL one-click installer, you get this message:

Please provide a password for the database superuser and service account (postgres). If the service account already exists in Windows, you must enter the current password for the account. If the account does not exist, it will be created when you click 'Next'.

If you have already installed Postgres as a service, you will need to enter the current user's password to get past the password dialog box. Meaning, if you're logged in to Windows as 'selena', you need to enter selena's password. As a non-Windows user, this baffled me, and a few other people on this thread.

Otherwise, you can just enter a password that will be used for the 'postgres' database user. Hope this helps someone!

Update:

Further explanation from Dave Page, the maintainer of the windows package:

Selena: It's not the password for the user that you are logged in as that you need to enter, it's the password for the service account (ie. postgres).

Unlike *nix & Mac, service accounts on Windows need to have passwords so unfortunately we need to ensure we have the correct password to install the service. Hence, if there's an existing postgres account, we need the existing password, otherwise the account will be created with whatever password you specify.

In all OSs, we use the password entered on that page as the database superuser password.

In Nigeria: Weekend exploring

Yesterday, I traveled to a Michelin (yes, the tire company!) plantation for a party thrown in honor of the new Secretary to the Ondo State Government, Dr. Aderotimi Adelola.

Michelin grows rubber trees on this sprawling estate. It took nearly 20 minutes to get from the highway to the primary school deep inside the plantation where the celebration was held. Tapped rubber trees pictured below!

I was invited to a table inside the Governor's main tent, and spent most of the time just looking around at all the government officials, and chatting with the Chairman of SITEDEC, Cyril Egunlayi.

The high point of the afternoon was Dr. Olusegun Mimiko's speech welcoming Dr. Adelola to the government. He's a charismatic speaker. The people around the perimeter pressed closer, and were attentively silent for his 10 or 15 minute speech. He emphasized education -- his hometown's slogan is "Home of Education". He also said that despite Ondo State's history of leading Nigeria in educational opportunities, the state had regressed and needed to catch up again. Mimiko speaking:

The car ride out to the plantation and back took about two hours each way. I spent much of that time talking about open source options for various IT infrastructure, where something like Google Apps might fit in for them, and passing on information I'd gotten about microwave links from a Portland WiMax provider, Stephouse Wireless. I also told Cyril about feedback regarding a replacement for Exchange. My followers on Twitter universally recommended Zimbra, and that was confirmed by at least one End Point coworker, Adam Volrath.

We also stopped by the office on our way home to check in on a new wireless repeater the engineers were installing on the tower they have out behind the SITEDEC center. We still have a few details to work out for the class arrangements.

In the evening, I enjoyed some Nigerian barbecue with Deji Agbebi. Originally from Lagos, he worked in the early 90s for a Canadian firm whose goal was to provide clean drinking water to villages in Ondo State. For various reasons, including a military coup, that business failed. Now Deji works in the US. He's a friend of Cyril's, and is here in Akure hoping to help with the work the government is trying to complete before January.

Nigeria PostgreSQL Training: Day 1

I am in Lagos, Nigeria this morning, preparing for a half-day car ride to Akure in Ondo State. I'll be spending the next seven days with programmers from Ondo State, who are six months or so away from deploying a system to deliver government services using a centralized card system. They are designing their database using PostgreSQL!

Ondo state has a little over 3 million people, and plans to integrate a half-dozen government services under the centralized data system. They conducted a census in 2006, and will be using their new system to gather data yearly going forward.

Their plan is extremely ambitious, given obstacles like the lack of power in most of the rural areas, and social issues like people not wanting to give accurate information about themselves to the government. Some biometric information, like fingerprints, will be gathered electronically using special machines that they will primarily lease (instead of buying, for significant cost savings), and these machines require power. They have been specially outfitted with dry-cell batteries that operate for about 8 hours before needing to be recharged.

For the social problems around data collection, there will be a marketing campaign to explain exactly what benefits those who provide accurate information are entitled to. After I mentioned to my host the American aversion to centralized government identification cards, he explained that in Nigeria they had the same issue. In addition to the marketing on TV, radio, newspapers and even leaflets, data collection volunteers will be trained on exactly how to collect accurate information. I am looking forward to having a look at the surveys and data collection strategy.

Otherwise, I've had a lot of fun talking with people. My car trip from the airport and remaining evening was mostly spent with me making funny vocabulary errors (tshirt == vest - who knew?), and explaining that Americans were mourning and in shock just like Nigerians because of Michael Jackson's death. I made an offhand comment about the number of people walking around outside at dusk because a friend had said a similar thing about Portland, OR's nightlife, and my escort commented on how peaceful and free people are in Lagos.