
Distributed Transactions and Two-Phase Commit

The typical example of a transaction involves Alice and Bob, and their bank. Alice pays Bob $100, and the bank needs to debit Alice and credit Bob. Easy enough, provided the server doesn't crash. But what happens if the bank debits Alice, and then before crediting Bob, the server goes down? Or what if they credit Bob first, and then try to debit Alice only to find she doesn't have enough funds? A transaction allows the debit and credit operations to happen as a package ("atomically" is the word commonly used), so either both operations happen or neither happens, even if the server crashes halfway through the transaction. That way the bank never credits Bob without debiting Alice, or vice versa.
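In SQL terms, that package is just a BEGIN/COMMIT pair. Here's a sketch, assuming the two-column ledger-style schema used later in this post:

BEGIN;
INSERT INTO ledger VALUES ('Alice', -100);  -- debit Alice
INSERT INTO ledger VALUES ('Bob', 100);     -- credit Bob
COMMIT;  -- both rows become visible together, or neither does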

That's simple enough, but the situation can become more complex. What if, for instance, for buzzword-compliance purposes, the bank has "sharded" its accounts database, splitting it into pieces and putting each piece on a different server (whether this would be smart or not is outside the scope of this post)? A typical transaction covers statements issued against only one database, so if Alice's account information lives on one server and Bob's lives on another, we can't wrap the debit and credit operations in a single BEGIN/COMMIT.

Enter "distributed transactions". A distributed transaction allows applications to group multiple transaction-aware systems into a single transaction. These systems might be different databases, or they might include other systems such as message queues, in which case the transaction concept means a message would get delivered if and only if the rest of the transaction completed. So with a distributed transaction, the bank could debit Alice's account in one database and credit Bob's in another, atomically.

All this comes at some cost. First, distributed transactions require a "transaction manager", an application which handles the special semantics needed to commit a distributed transaction. Second, the systems involved must support "two-phase commit" (added to PostgreSQL in version 8.1). Distributed transactions are committed using PREPARE TRANSACTION 'foo' (phase 1), followed by COMMIT PREPARED 'foo' or ROLLBACK PREPARED 'foo' (phase 2), rather than the usual COMMIT or ROLLBACK.

The beginning of a distributed transaction looks just like any other transaction: the application issues a BEGIN statement (optional in PostgreSQL), followed by normal SQL statements. When the transaction manager is instructed to commit, it runs the first commit phase by saying "PREPARE TRANSACTION 'foo'" (where "foo" is some arbitrary identifier for this transaction) on each system involved in the distributed transaction. Each system does whatever it needs to do to determine whether or not this transaction can be committed and to make sure it can be committed even if the server crashes, and reports success or failure. If all systems succeed, the transaction manager follows up with "COMMIT PREPARED 'foo'", and if a system reports failure, the transaction manager can roll back all the other systems using either ROLLBACK (for those transactions it hasn't yet prepared), or "ROLLBACK PREPARED 'foo'". Using two-phase commit is obviously slower than committing transactions on only one database, but sometimes the data integrity it provides justifies the extra cost.
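Driven by hand in psql, PostgreSQL's half of that exchange looks something like this (a sketch, using 'foo' as the arbitrary identifier and the same prompt style as the sessions below):

5432 josh@athos# insert into ledger values ('Alice', -100);
INSERT 0 1
5432 josh@athos*# prepare transaction 'foo';
PREPARE TRANSACTION
5432 josh@athos# commit prepared 'foo';
COMMIT PREPARED

Note how the asterisk (psql's open-transaction marker) disappears after PREPARE TRANSACTION: the prepared transaction no longer belongs to the session.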

In PostgreSQL, two-phase commit is supported provided max_prepared_transactions is nonzero. A PREPARE TRANSACTION statement persists the current transaction to disk, and dissociates it from the current session. That way it can survive even if the database goes down. The current session no longer has an active transaction. However, the prepared transaction acts like any other open transaction in that all locks held by the prepared transaction remain held, and VACUUM cannot reclaim storage from that transaction. So it's not a good idea to leave prepared transactions open for a long time.
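If a transaction manager crashes and leaves a prepared transaction behind, it shows up in the pg_prepared_xacts system view, where an administrator can find it and resolve it by hand. A sketch, assuming a leftover transaction named 'foo':

5432 josh@athos# select gid, prepared, owner, database from pg_prepared_xacts;
5432 josh@athos# rollback prepared 'foo';
ROLLBACK PREPARED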

Distributed transactions are most common, it seems, in Java applications. Full J2EE application servers typically come with a transaction manager component. For my examples I'll use an open source, standalone transaction manager, called Bitronix. I'm not particularly fond of using Java for simple scripts, though, so I've used JRuby for this demonstration code.

This script uses two databases, which I've called "athos" and "porthos". Each has the same schema, a simple framework for the sharded bank example described above: a table for account names, another for ledger entries, and a trigger that raises an exception when a transaction would bring a person's balance below $0. I'll first populate athos with Alice's account information. She gets $200 to start. Bob will go in the porthos database, with no initial balance.
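The schema itself isn't reproduced here; a minimal version consistent with the sessions below might look like this (the exact table, function, and trigger names are my guesses):

create table accounts (name text primary key);
create table ledger (name text references accounts, amount numeric);

create function get_balance(text) returns numeric as $$
    select coalesce(sum(amount), 0) from ledger where name = $1;
$$ language sql;

create function check_balance() returns trigger as $$
begin
    if get_balance(new.name) + new.amount < 0 then
        raise exception 'Rejecting operation; account owner %''s balance would drop below 0', new.name;
    end if;
    return new;
end;
$$ language plpgsql;

create trigger ledger_balance_check before insert on ledger
    for each row execute procedure check_balance();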

5432 josh@athos# insert into accounts values ('Alice');
5432 josh@athos*# insert into ledger values ('Alice', 200);
5432 josh@athos*# commit;
COMMIT
5432 josh@athos# \c porthos
You are now connected to database "porthos".
5432 josh@porthos# insert into accounts values ('Bob');
5432 josh@porthos*# commit;

Use of Bitronix is pretty straightforward. After setting up a few constants for easier typing, I create a Bitronix data source for each PostgreSQL database. Here I have to use the PostgreSQL JDBC driver's org.postgresql.xa.PGXADataSource class; "XA" is the two-phase commit interface Java uses (it comes from the X/Open distributed transaction standard), and it requires support from the JDBC driver. Here's the code for setting up one data source; the other is set up the same way.

require 'java'
java_import 'bitronix.tm.resource.jdbc.PoolingDataSource'  # assumed setup: load the Bitronix classes
java_import 'bitronix.tm.TransactionManagerServices'
TxnSvc = TransactionManagerServices  # shorthand constant used below

ds1 = PoolingDataSource.new
ds1.set_class_name 'org.postgresql.xa.PGXADataSource'
ds1.set_unique_name 'pgsql1'
ds1.set_max_pool_size 3
ds1.get_driver_properties.set_property 'databaseName', 'athos'
ds1.get_driver_properties.set_property 'user', 'josh'
ds1.init  # initialize the pool before handing out connections

Then I simply get a connection from each data source, fetch the Bitronix TransactionManager object, and begin a transaction.

c1 = ds1.get_connection
c2 = ds2.get_connection
btm = TxnSvc.get_transaction_manager
btm.begin  # start the distributed transaction

Within my transaction, I just use normal JDBC commands to debit Alice and credit Bob, after which I commit the transaction through the TransactionManager object. If this transaction fails, it raises an exception, which I can capture using Ruby's begin/rescue exception handling, and roll back the transaction.

begin
  s2 = c2.prepare_statement "INSERT INTO ledger VALUES ('Bob', 100)"
  s2.execute_update
  s1 = c1.prepare_statement "INSERT INTO ledger VALUES ('Alice', -100)"
  s1.execute_update
  btm.commit  # runs both phases: PREPARE TRANSACTION, then COMMIT PREPARED
  puts "Successfully committed"
rescue
  btm.rollback  # roll back every branch of the distributed transaction
  puts "Something bad happened: #{$!}"
end

When I run this, Bitronix gives me a bunch of output, which I haven't bothered to suppress, but among it all is the "Successfully committed" string I told it to print on success. Since Alice is debited $100 each time we run this, and she started with $200, we can run it twice before hitting errors. On the third run, we get this:

Something bad happened: org.postgresql.util.PSQLException: ERROR: Rejecting operation; account owner Alice's balance would drop below 0

This is our trigger firing, to tell us that we can't debit Alice any more. If I look in the two databases, I can see that everything worked as planned:

5432 josh@athos*# select get_balance('Alice');
 get_balance
-------------
           0
(1 row)

5432 josh@athos*# \c porthos
You are now connected to database "porthos".
5432 josh@porthos# select get_balance('Bob');
 get_balance
-------------
         200
(1 row)

Remember, I've run the script three times, but Bob has been credited only $200, because that's all Alice had to start with.


Robert Young said...

(I tried to edit the first time, but that didn't work, so here goes again.)

One is better off using the federation support available in industrial strength databases (DB2, SQL Server, and Oracle, at least). I gather PG doesn't yet support federation.

Out-of-band transaction management goes back at least to CICS (which is still around). With federation, once it's configured by the DBA (who may be you), all the tables are local to the SQL you execute. With non-database transaction managers, you're beholden not only to their syntax but also to their continued existence. Federation is transparent, and happens in the address space of the databases. And, yes, real federation is multi-vendor.

The ACID requirements are the responsibility of the database(s), not client code, whether that code is embedded in the application or some tacked-on carbuncle. CICS persists only because it is 1) old, and there are a lot of old programs out there which depend on it, and 2) it knows how to ACID simple files (it pre-dates the RDBMS). There is no good reason to revert to 1965, even if the syntax is Java, Ruby, or Python.

Joshua Tolley said...

If federation support exists, using it might not be a bad idea. Federation software performs exactly the same role as the transaction manager, in exactly the same way, with the possible addition of aliasing functionality to make objects in different databases look like they're all in one database. Transaction managers such as Bitronix are extremely common in the wild; every full J2EE server ships with one, for instance.

There's no dependence on Bitronix' syntax here; database queries go directly to the database, using whatever flavor of SQL the database supports. All the transaction manager needs is a JDBC driver supporting XA. There is, as you point out, a dependence on the existence of a transaction manager, just as there would be a dependence on either federation software, or all databases combined into one, if we used federation instead.

The client code should most certainly control when transactions begin and end. Only the client software, after all, knows when it has begun and completed one unit of work.

Ethan Rowe said...

I think some of this depends on your underlying assumptions. If you have a function-heavy database, where any logically-atomic business logic is encapsulated in a function/stored proc, then you would logically expect the database to handle the distributed transaction concerns. In that respect, you're acting as if every interaction with the database is atomic at the statement level.

But it's not at all uncommon to have the case that Josh is outlining, where the app is enforcing the business logic and has to coordinate transactions itself. At which point, naturally the inclination is to manage it from the app, as you are rather stuck with that strategy to begin with.

I'm not debating between the approaches, mind you, only pointing out that the underlying assumptions differ.

Robert Young said...

Given this:

End Point's developers are experts in designing, developing, and maintaining effective and well-structured databases.

I find this:
But it's not at all uncommon to have the case that Josh is outlining, where the app is enforcing the business logic and has to coordinate transactions itself. At which point, naturally the inclination is to manage it from the app, as you are rather stuck with that strategy to begin with.

While I surely agree with your analysis (mine was the same and motivated me to comment), I find it at odds with the End Point statement. That's all. If organizations with End Point's statement (or similar) are willing to cave in to app-driven control, then we aren't making progress. We're the ones who are supposed to be smarter than those average bears.

Render unto the database that which is the database's. Keep our eyes on the prize. Fight the good fight. Do the right thing. And so on.
