Avoiding Distributed Transactions

A page for those who cannot afford a broken system. Part of some thoughts I was going to put in a humble tome called "The Perfect Computing Universe". --RichardHenderson.

Reasons for distributing a transaction:

  Keeping a distributed object synchronized = keeping related but separated objects synchronized.

Alternative:
  Make intermediate states explicit and valid.
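
In other words, model the half-done states instead of hiding them inside a transaction. A minimal Python sketch of what I mean (the Order class and the state names are invented for the example):

  from enum import Enum

  class OrderState(Enum):
      # Every intermediate state is a legal, visible state of the system,
      # so a crash at any point leaves us somewhere we can reason about.
      CREATED = "created"
      PAYMENT_REQUESTED = "payment_requested"
      PAID = "paid"
      SHIPPED = "shipped"

  class Order:
      def __init__(self, order_id):
          self.order_id = order_id
          self.state = OrderState.CREATED

      def request_payment(self):
          # Record the intent *before* talking to the remote payment system.
          self.state = OrderState.PAYMENT_REQUESTED

      def confirm_payment(self):
          self.state = OrderState.PAID

      def ship(self):
          if self.state is not OrderState.PAID:
              raise ValueError("cannot ship an unpaid order")
          self.state = OrderState.SHIPPED

Recovery then becomes "find everything stuck in PAYMENT_REQUESTED and decide what to do about it", rather than unwinding a half-committed distributed transaction.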


This is equivalent to writing your own transaction manager (by definition). Advanced transaction monitors allow the detection and recovery logic to be scripted. This is much more flexible than the default n-phase commit logic, which simply asserts that every intermediate state of a transaction is invalid.

Fat transaction controllers make this a scriptable exercise. One can describe recovery scenarios for any possible failure. Transactional systems make individual rollback a doddle (oops, errrm, walk in the park, easypeasylemonsqueezy, trivially simple) but don't remove the need for programmed responses. Some things cannot be recovered after the fact.
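
To make "scriptable" concrete, here is a rough Python sketch of the idea: pair each step with a programmed response (a compensation), and run the responses for whatever had already succeeded when a step fails. The Saga name and the whole interface are mine, not any particular product's.

  class Saga:
      # Run the steps in order; if one fails, run the compensations for
      # the steps that already succeeded, in reverse order.
      def __init__(self):
          self.steps = []        # list of (action, compensation) pairs

      def add_step(self, action, compensation):
          self.steps.append((action, compensation))

      def run(self):
          done = []
          for action, compensation in self.steps:
              try:
                  action()
                  done.append(compensation)
              except Exception:
                  # The programmed response: undo what we can, in reverse.
                  for undo in reversed(done):
                      undo()
                  raise

Note that a compensation is not a rollback. Sometimes "send the customer an apology" is the best available response, which is the sense in which some things cannot be recovered after the fact.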


Firstly, if the transaction is distributed in order to denormalise data, then it sucks. Get rid of it. Batch-oriented aggregation is considerably better and works fine. Do you really want to stop your main transactional moneymaker just because you can't add up a total? No, I didn't think so.

If you truly have two (or more) separate transactions that must act as one, consider your options.

Consider all the cases. Ignoring failure, we have all-or-nothing transactions. Sounds cool. It adds a massive dependency in the worst possible way, but it is easy. All we need is the ability to roll back changes before they are committed; the commit itself must not fail, because a failed commit is equivalent to a physical failure. 2-phase commit adds a check to make this unlikely.
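
For reference, the shape of 2-phase commit fits in a few lines of Python. The Participant interface (prepare/commit/rollback) is invented for the sketch, not any particular product's API:

  class Coordinator:
      def run(self, participants, txn):
          # Phase 1: everyone votes. Any "no" aborts the lot.
          for p in participants:
              if not p.prepare(txn):
                  for q in participants:
                      q.rollback(txn)
                  return False
          # Phase 2: everyone commits. A crash in this window can leave
          # some participants committed and others merely prepared; the
          # check only makes failure unlikely, not impossible.
          for p in participants:
              p.commit(txn)
          return True

The prepare vote is the "check" mentioned above. It makes a failed commit unlikely, but the second phase still has to complete on every participant, which is exactly where a physical failure bites.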

If we have a physical failure then we may be left with the two dependent objects in any combination of states: Old/Old, Old/New, New/Old, or New/New.

Let's assume we know that a physical failure has taken place. Standard transactions move any currently outstanding logical transactions to a consistent state: Old/Old or New/New. They do not store any information outside the scope of the transaction. So say one of the parties fails and later returns. How do we know which transactions it may have lost? And if we can identify them, how do we resynchronize the parties?

n-phase commit protocols do nothing to protect you here. They can't. You could buy a fat replication system to store the information semi-transparently, but that adds further dependencies, and tends to fail with the systems it services.

Using the standard credit/debit transaction:

Transaction: "purchase a widget"

If everything works then we are okay. Consider the failure cases: the Customer is debited but the Vendor isn't credited, or vice versa. A timestamp might help detect which, but not on its own.
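
Spelled out as two separate local transactions (the customer_db/vendor_db handles and their transaction() API are made up, just to show where the window is):

  def purchase_widget(customer_db, vendor_db, price):
      # Local transaction 1: debit the Customer.
      with customer_db.transaction() as txn:
          txn.debit("customer-account", price)
      # A crash right here leaves the Customer debited and the Vendor
      # never credited, and neither database knows anything is wrong.
      # Local transaction 2: credit the Vendor.
      with vendor_db.transaction() as txn:
          txn.credit("vendor-account", price)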

We can assume that each database will be self-consistent, but of an unknown era. We need a record of transactions, and a way of seeing if it was applied successfully. This record needs to be independent of either party if it is to be robust to failure. A transaction journal.
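
A minimal sketch of what such a journal might hold, in Python (all the field and method names are mine; a real one would be persistent, of course):

  import time
  import uuid

  class JournalEntry:
      def __init__(self, description, messages):
          self.txn_id = str(uuid.uuid4())  # globally unique id for the business transaction
          self.timestamp = time.time()
          self.description = description   # e.g. "purchase a widget"
          self.messages = messages         # peer name -> messages, so they can be resubmitted
          self.applied = {}                # peer name -> True once that peer confirms

  class Journal:
      # Independent of either party, and written *before* the peers are touched.
      def __init__(self):
          self.entries = []

      def record(self, description, messages):
          entry = JournalEntry(description, messages)
          self.entries.append(entry)
          return entry

      def mark_applied(self, entry, peer_name):
          entry.applied[peer_name] = True

      def outstanding_for(self, peer_name, since=0.0):
          # Everything recorded since the checkpoint that this peer has
          # not confirmed as applied.
          return [e for e in self.entries
                  if e.timestamp >= since and not e.applied.get(peer_name)]

A real journal would of course live in its own durable store, physically independent of the peers it covers.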

So, say a party signals failure. It has a record of the transactions it knows about. It queries the transaction journal and finds a number of transactions that were committed against it but which it may have lost. Now it needs to reapply them. These transactions may be old, so the transaction journal needs to be persistent up to some checkpoint in the past, usually a little further back than the last major backup of any of its clients.

So now we have a list of transactions. Since the peers in each transaction are known, the specific transaction information may be recovered from them. Alternatively, the transaction messages may be resubmitted from the journal itself. The latter is the more powerful technique, as it avoids complicating the peers for a rare situation. Anyone who comes from the CICS world, or has used Tuxedo, should recognise this scheme.
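
Recovery is then just a replay loop. A sketch using the Journal above, assuming each peer can apply a resubmitted message idempotently (so replaying one it already saw does no harm):

  def recover_peer(journal, peer, checkpoint):
      # Find everything committed against this peer since its last
      # known-good state, and resubmit the original messages from the
      # journal itself, so the other peers need no special recovery code.
      for entry in journal.outstanding_for(peer.name, since=checkpoint):
          for message in entry.messages.get(peer.name, []):
              peer.apply(message)
          journal.mark_applied(entry, peer.name)

Idempotency matters because the peer may have applied some of these messages just before it failed.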

Eventually all known transactions are rolled forward on all peers, and we are back in business.

In the debit and credit case, if either party fails and comes back online with old data, the transaction monitor/journal detects this and brings it back up to date. We may emulate the third-party monitor with peer-to-peer techniques, but they are all significantly less reliable and flexible, and much more intrusive. For physical reliability you need redundancy AND physical independence.
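
Detection can be as dull as comparing high-water marks when a peer reconnects (again using the sketches above, plus an invented peer interface):

  def check_peer_on_reconnect(journal, peer):
      # The peer reports the newest transaction it knows about; if the
      # journal has recorded later ones for it, it came back with old data.
      latest_known = peer.last_applied_timestamp()
      missed = [e for e in journal.entries
                if e.timestamp > latest_known and peer.name in e.messages]
      if missed:
          recover_peer(journal, peer, checkpoint=latest_known)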

For me this is a major reason to have a middle tier. You need a place where transactions are initiated, journalled and monitored. It makes sense for that place to understand transactions both at the top level and at the level of the individual steps, and it needs access to all the peer databases, etc.

The details of an implementation are left as an exercise [;)].

This is a heavyweight piece of infrastructure logic. Avoid if possible.


I do not understand what this page is saying. Is it saying that to get reliable distributed transactions everything must pass through the bottleneck of a reliable (non distributed) Journal? If your journal is distributed you are back to the original problem... Please explain. -- JamesCrook

This page is saying that distributed transactions are hard, and to be avoided. If you can't avoid them, then don't rely on 2-phase commit alone as it will not save you from physical failures.

The remainder is a short example of how reliability can be added by using third-party transaction-journal functionality. I tend to associate a transaction log with the middle tier, which is conveniently at the 'center of things' anyway.

This would be an architectural pattern I suppose, though I prefer the term 'architectural doodad' these days.

