#1755 new enhancement

2-phase commit

Reported by: daira      Owned by: daira
Priority: normal        Milestone: soon
Component: code         Version: 1.9.2
Keywords: 2pc mutable reliability consistency
Cc: zooko, warner
Launchpad Bug:

Description

[fill this in with a description of the 2-phase commit protocol, and how it improves write consistency / allows smaller writes for MDMF without regressions]

Change History (13)

comment:1 Changed at 2012-06-01T21:43:08Z by davidsarah

  • Owner set to davidsarah
  • Status changed from new to assigned

comment:2 Changed at 2012-07-05T02:52:52Z by zooko

  • Cc zooko added
  • Milestone changed from undecided to 0.2.0
  • Version changed from 1.9.1 to cloud-branch

comment:3 Changed at 2012-07-05T14:59:37Z by zooko

  • Milestone changed from 0.2.0 to soon
  • Version changed from cloud-branch to 1.9.2

comment:4 Changed at 2012-11-06T22:56:38Z by davidsarah

  • Keywords mutable reliability consistency added

See also #1845.

comment:5 follow-up: Changed at 2012-11-21T00:16:09Z by zooko

The difficulty of distributed two-phase commit in general is that if the Transaction Manager fails after telling some of the Resource Managers to prepare but before either telling them to commit or telling them to rollback, then they are stuck in this prepared state (i.e. locked).

(See Gray and Reuter's book "Transaction Processing", and see also Gray and Lamport's paper "Consensus on Transaction Commit".)

The role of Transaction Manager in this future extension of Tahoe-LAFS would be filled by the LAFS storage client (i.e. the LAFS gateway) and the roles of Resource Managers would be filled by LAFS storage servers.
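The role mapping above can be sketched as follows. This is a minimal illustration of the classic 2PC message flow, with the LAFS storage client in the Transaction Manager role and storage servers as Resource Managers; all class and function names here are hypothetical, not the real Tahoe-LAFS API.

```python
class StorageServer:
    """Resource Manager: holds a committed share, plus at most one
    prepared (staged, locked) version awaiting commit or rollback."""

    def __init__(self):
        self.committed = None   # last committed share contents
        self.prepared = None    # staged contents; non-None means "locked"

    def prepare(self, new_contents):
        if self.prepared is not None:
            return False        # already locked by another writer: vote no
        self.prepared = new_contents
        return True             # vote yes

    def commit(self):
        self.committed, self.prepared = self.prepared, None

    def rollback(self):
        self.prepared = None


def two_phase_commit(servers, new_contents):
    """Transaction Manager: phase 1 asks every server to prepare; phase 2
    commits only if all voted yes, otherwise rolls back the yes-voters."""
    votes = [server.prepare(new_contents) for server in servers]
    if all(votes):
        for server in servers:
            server.commit()
        return True
    for server, voted_yes in zip(servers, votes):
        if voted_yes:
            server.rollback()
    return False
```

The failure mode described in the previous comment is visible here: a server whose `prepare` succeeded is stuck holding `self.prepared` (the lock) until a `commit` or `rollback` message arrives, which is exactly the gap the timeout discussion below addresses.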

There is, of course, no way for a Resource Manager to tell whether its Transaction Manager has failed, is merely slow, or is temporarily disconnected from the network, other than the passage of time without a new message (either "commit" or "rollback") from the Transaction Manager.

In general, this can become intractable for large distributed systems with many resources being locked, many Transaction Managers which need to fail over to one another (using Paxos to elect a new leader, I suppose), and frequent write-contention.

But in practical terms, I expect Tahoe-LAFS will be able to use 2-phase-commit ("2PC") nicely, because typically the scope of what is locked, who is doing the locking, and how much write-contention we have to support, are all relatively narrow. That is, for the use cases that we expect to be asked to handle, only a single mutable file/dir is locked at a time, and only one or a small number of computers have the write cap to a single mutable file/dir.

I think we intend to support the use case that a small number of writers have shared write access to a mutable file/dir and may occasionally write at the same time as each other, but we do not intend to support the use case where a large or dynamic set of writers has write access to the same resources and there may be continuous write collisions that never pause long enough for the distributed system to stabilize.

(I think this is sufficient because I think people who use Tahoe-LAFS will typically use immutables and single-writer-mutables for most of their state management, and rely on shared-writer-mutables only for the sort of "last link in the chain" that can't be managed any other way.)

Another way that Tahoe-LAFS is less fragile than most distributed 2-phase-commit systems is that we've already long since accepted that inconsistency can happen (different storage servers have different versions of a mutable file), and we have mechanisms (repair) in place to recover from that.

So unlike traditional 2PC, 2PC-for-LAFS doesn't have to bear the burden of preventing inconsistency from ever occurring in the distributed system. 2PC for us is just to help multiple writers coordinate with one another more efficiently, and to help reduce the rate of inconsistency arising within a single storage server. That is, it allows upload or modification of a mutable share (which may require multiple messages from the LAFS storage client to the LAFS storage server) without opening a large time window in which a failure of either end, or of the connection between them, would leave an inconsistent share on that server.

So anyway, we have to come up with a plan for how storage servers (which play the role of Resource Manager) handle the case that the storage client (LAFS gateway, Transaction Manager) has told them to prepare, hasn't yet told them whether to commit or roll back, and then a lot of time passes. As a first strawman, I propose a simple hardcoded, fixed, long timeout. Let's say one hour. If your LAFS client hasn't told you whether to commit or to roll back within an hour of asking you to prepare, then you unilaterally roll back.
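The strawman timeout could look something like the sketch below: a prepared-but-uncommitted share on the storage server unilaterally rolls back once a fixed interval elapses without a commit or rollback message. The class and the injectable clock are hypothetical, for illustration only.

```python
import time

PREPARE_TIMEOUT = 3600.0  # the strawman's hardcoded one hour, in seconds


class PreparedShare:
    """Server-side state for a share that has been prepared (locked)
    and is awaiting a commit or rollback message from the client."""

    def __init__(self, contents, clock=time.monotonic):
        self._clock = clock          # injectable for testing
        self.contents = contents
        self.prepared_at = clock()
        self.state = "prepared"

    def poll(self):
        """Called periodically (or lazily on next access): if the client
        has gone silent past the timeout, roll back unilaterally."""
        elapsed = self._clock() - self.prepared_at
        if self.state == "prepared" and elapsed > PREPARE_TIMEOUT:
            self.state = "rolled-back"
            self.contents = None
        return self.state

    def commit(self):
        if self.state == "prepared":
            self.state = "committed"
```

A lazy check on next access avoids needing a dedicated timer per prepared share; either way, the server never stays locked forever on behalf of a vanished client.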


comment:7 in reply to: ↑ 5 Changed at 2012-11-21T03:48:20Z by davidsarah

Replying to zooko:

So anyway, we have to come up with a plan for how storage servers (which play the role of Resource Manager) handle the case that the storage client (LAFS gateway, Transaction Manager) has told them to prepare, hasn't yet told them whether to commit or roll back, and then a lot of time passes. As a first strawman, I propose a simple hardcoded, fixed, long timeout. Let's say one hour. If your LAFS client hasn't told you whether to commit or to roll back within an hour of asking you to prepare, then you unilaterally roll back.

I'd suggest a shorter timeout than this, say 5 minutes. This is assuming the variation we discussed at the summit where clients can upload their new file contents to a holding area on each server before actually taking the lock. In that case, the lock timeout only needs to be long enough to allow all the servers to receive a lock request and confirm it, then for the client to receive all those confirmations, send out commit messages, and each server to receive their commit message.

If that takes longer than 5 minutes, something is seriously wrong. (Note that as long as the number of servers responding before the timeout is at least the happiness threshold, the file will still be updated. The fact that other servers may time out does not cause any inconsistency that we can't tolerate.)
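The variant described in this comment can be sketched as below: stage the new contents in a holding area before taking the lock, so the lock window covers only the short confirm/commit exchange, and commit as long as at least the happiness threshold of servers confirmed in time. All names here are hypothetical, not the real Tahoe-LAFS API.

```python
class FakeServer:
    """Toy stand-in for a storage server in the staged-upload variant."""

    def __init__(self, responsive=True):
        self.responsive = responsive    # confirms before the timeout?
        self.holding_area = None        # staged contents, not yet live
        self.share = None               # committed share contents
        self.locked = False

    def stage(self, contents):
        # Slow bulk upload; no lock is held during this phase.
        self.holding_area = contents
        return True

    def lock(self):
        # Fast lock request; an unresponsive server's confirmation
        # never arrives, so the client treats it as not confirmed.
        if not self.responsive:
            return False
        self.locked = True
        return True

    def commit(self):
        # Promote the staged contents to the live share.
        self.share, self.holding_area = self.holding_area, None
        self.locked = False


def update_mutable(servers, contents, happiness):
    """Client-side flow: stage everywhere, lock, then commit iff at
    least `happiness` servers confirmed the lock in time."""
    staged = [s for s in servers if s.stage(contents)]
    confirmed = [s for s in staged if s.lock()]
    # Servers that never hear back will hit their prepare timeout and
    # roll back on their own, without intolerable inconsistency.
    if len(confirmed) >= happiness:
        for s in confirmed:
            s.commit()
        return True
    for s in confirmed:
        s.locked = False                # explicit unlock on abort
    return False
```

Because the bulk data transfer happens before any lock is taken, the lock timeout only has to cover the lock/confirm/commit round trips, which is why a 5-minute timeout is plausible here where an hour was proposed for the unstaged protocol.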

comment:8 Changed at 2012-11-21T03:50:48Z by davidsarah

I meant to say that the reason I don't like the longer timeout is that if, for example, a client's network connection dropped at just the wrong point, the file would be unavailable for writes by other clients for the duration of the timeout.

comment:9 Changed at 2013-02-20T18:37:02Z by zooko

I think #1920 is an example of a failure (seen in the wild) of a kind that would be prevented by 2PC.

comment:10 Changed at 2013-03-21T15:15:56Z by zooko

  • Type changed from defect to enhancement

comment:11 Changed at 2013-03-21T15:16:25Z by zooko

  • Cc warner added

Adding Cc: warner just because I want him to pay attention to this ticket. Not sure if adding Cc: warner works for that...

comment:12 Changed at 2016-10-13T15:49:36Z by daira

  • Owner changed from davidsarah to daira
  • Reporter changed from davidsarah to daira
  • Status changed from assigned to new

comment:13 Changed at 2016-10-13T15:49:59Z by daira

I'd like to work on this at the Tahoe summit in November.
