Opened at 2012-11-20T01:11:26Z
Closed at 2013-05-01T15:04:59Z
#1869 closed defect (fixed)
pluggable backends: serialize backend operations on a shareset
Reported by: | davidsarah | Owned by: | zooko |
---|---|---|---|
Priority: | normal | Milestone: | 1.10.1 |
Component: | code-storage | Version: | cloud-branch |
Keywords: | cloud-backend storage shareset cache test-needed | Cc: | |
Launchpad Bug: | | | |
Description (last modified by davidsarah)
The following issue needs to be fixed before the pluggable backends code is merged to trunk (#1819). Before pluggable backends, operations that access share storage were synchronous, and so two or more such operations could not run concurrently. This is important because clients assume that accesses to a mutable share on a given server are serializable (provided a client is the only one accessing that share, as the Prime Coordination Directive implies).
Now that backend operations are asynchronous, they can run concurrently. That needs to be fixed. Note that operations on different sharesets do not need to be serializable with respect to each other. (That would impose unnecessary delays.)
The simplest way to implement this is probably to have a weak cache mapping storage indices to ShareSet objects, so that there can only be one ShareSet object per actual shareset, and then use a deferred queue in the ShareSet object to serialize operations. I'm not sure exactly where the cache should go yet.
Hmm, but the ShareSet object returns Share objects, and accesses to those also need to be serialized, which makes it inconvenient to have the queue in the ShareSet. Maybe we need to have only one Share object at a time for a given share, but that would require Share objects to be reusable. Gahh, complicated.
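To make the weak-cache idea concrete, here is a minimal sketch (not the actual implementation; the class and method names are only placeholders): the backend keeps a WeakValueDictionary from storage index to ShareSet, and each ShareSet pushes its operations through a DeferredLock, which acts as a FIFO queue of pending operations.

```python
# Minimal sketch of the weak cache + per-shareset queue idea; names are
# illustrative, not the actual Tahoe-LAFS classes.
import weakref
from twisted.internet import defer

class ShareSet(object):
    def __init__(self, storage_index):
        self.storage_index = storage_index
        # DeferredLock queues waiters in FIFO order, so operations run one
        # at a time in the order they were requested.
        self._queue = defer.DeferredLock()

    def run_serialized(self, operation, *args, **kwargs):
        # 'operation' may return a Deferred; the next queued operation only
        # starts once that Deferred fires.
        return self._queue.run(operation, *args, **kwargs)

class Backend(object):
    def __init__(self):
        # Weak values: there is at most one live ShareSet object per actual
        # shareset, but an entry disappears once no caller holds the ShareSet.
        self._sharesets = weakref.WeakValueDictionary()

    def get_shareset(self, storage_index):
        shareset = self._sharesets.get(storage_index)
        if shareset is None:
            shareset = ShareSet(storage_index)
            self._sharesets[storage_index] = shareset
        return shareset
```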
Change History (12)
comment:1 Changed at 2012-11-20T01:12:36Z by davidsarah
- Description modified (diff)
- Status changed from new to assigned
comment:2 Changed at 2013-04-21T11:28:49Z by daira
- Owner changed from davidsarah to daira
- Status changed from assigned to new
comment:3 Changed at 2013-04-23T21:05:20Z by daira
Oh, I've been thinking about this in the wrong way. The operations that need to be serialized (or as-if serialized) are only remote operations, i.e. on RIStorageServer, RIBucketReader or RIBucketWriter. That simplifies things a lot.
comment:4 follow-ups: ↓ 7 ↓ 8 Changed at 2013-04-23T22:53:13Z by daira
- Keywords design-review-needed added
- Owner daira deleted
Okay, I think I understand how this will work now. We'll have a WeakValueDictionary of DeferredLocks, indexed by storage index, in each backend object. The lock object will be passed on to ShareSet, BucketReader, and BucketWriter objects for that SI, and will be used to lock the remote operations. (Most of this can be done generically, with only minimal support needed in the backend-specific objects.)
A subtle complication is that because the locks are associated with sharesets, the get_sharesets_for_prefix method on a backend needs to have a rather weak specification:
def get_sharesets_for_prefix(prefix):
    """
    Return a Deferred that fires with an iterable of IShareSet objects for all
    storage indices matching the given base-32 prefix, for which this backend
    holds shares.

    Separate locks are taken for each shareset, and nothing prevents sharesets
    matching the prefix from being deleted or added between listing the
    sharesets and taking these locks. Callers must be able to tolerate this.
    """
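For illustration, here is a rough sketch of that scheme (only the lock handling is meant literally; the surrounding method names and signatures are hypothetical and the real branch differs in detail): the backend keeps a WeakValueDictionary of DeferredLocks keyed by storage index, hands the lock to each ShareSet it creates, and the ShareSet wraps its remote operations in DeferredLock.run so that their Deferred chains cannot interleave.

```python
# Rough sketch of the per-shareset locking scheme; only the lock handling is
# meant literally, the surrounding names are hypothetical.
import weakref
from twisted.internet import defer

class Backend(object):
    def __init__(self):
        self._lock_table = weakref.WeakValueDictionary()

    def _get_lock(self, storage_index):
        # Every live ShareSet for this storage index shares the same lock;
        # the entry is dropped once nothing references the lock any more.
        lock = self._lock_table.get(storage_index)
        if lock is None:
            lock = defer.DeferredLock()
            self._lock_table[storage_index] = lock
        return lock

    def get_shareset(self, storage_index):
        return ShareSet(storage_index, self._get_lock(storage_index))

class ShareSet(object):
    def __init__(self, storage_index, lock):
        self.storage_index = storage_index
        self._lock = lock

    def testv_and_readv_and_writev(self, secrets, test_and_write_vectors, read_vector):
        # DeferredLock.run releases the lock only when the Deferred returned
        # by the wrapped operation fires, so remote operations on this
        # shareset are as-if serialized.
        return self._lock.run(self._locked_testv_and_readv_and_writev,
                              secrets, test_and_write_vectors, read_vector)

    def _locked_testv_and_readv_and_writev(self, secrets, test_and_write_vectors, read_vector):
        raise NotImplementedError("backend-specific; sketch only")
```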
comment:5 Changed at 2013-04-24T00:08:16Z by daira
Zooko asked for clarification on the problem this ticket is trying to solve.
Consider a remote operation such as RIStorageServer.slot_testv_and_readv_and_writev. The specification of that operation says that the tests, reads, and writes are done atomically with respect to other remote operations. So for example if the operation succeeds, then the reads will always be consistent with the tests.
Before pluggable backends, that was achieved by the implementation of slot_testv_and_readv_and_writev being synchronous. But for a cloud backend, the implementation cannot be synchronous because it must make HTTP[S] requests. Using a synchronous HTTP implementation would be a bad idea because the latency of HTTP (especially the worst-case latency) is much greater than the latency of local disk, and the reactor would be blocked for the whole sequence of requests, even though operations on other sharesets could safely proceed in parallel. So what happens is that the backend code creates a Deferrred for the result of the operation and immediately returns it. Nothing prevents the subsequent callbacks for distinct operations on a shareset from interleaving.
While developing the cloud backend, we filed this ticket for the problem and glossed over it. In practice, race conditions between concurrent operations on a shareset are rarely hit, especially when there are few concurrent operations because there is only one user. Nevertheless, the implementation is clearly incorrect as it stands.
Note that the problem isn't restricted to operations where the atomicity requirements were spelled out in the operation's specification. There were implicit assumptions about atomicity between other operations that were implemented synchronously -- for example that reads of mutable files cannot interleave with writes. This doesn't just apply to mutable sharesets, because an immutable shareset has mutable state (the bucket allocations made by RIStorageServer.allocate_buckets).
comment:6 Changed at 2013-04-24T00:14:29Z by daira
Also note that even for the disk backend on the cloud-backend branch, remote operations are no longer implemented synchronously, because they share implementation with the cloud backend (for example, slot_testv_and_readv_and_writev and slot_readv are implemented generically in src/allmydata/storage/backends/base.py, not separately for each backend).
comment:7 in reply to: ↑ 4 Changed at 2013-04-24T12:56:45Z by daira
Replying to daira:
Okay, I think I understand how this will work now. We'll have a WeakValueDictionary of DeferredLocks, indexed by storage index, in each backend object. The lock object will be passed on to ShareSet, BucketReader, and BucketWriter objects for that SI, and will be used to lock the remote operations.
Actually it doesn't need to be passed on to BucketReader and BucketWriter because they only handle individual immutable shares.
comment:8 in reply to: ↑ 4 Changed at 2013-04-24T13:05:46Z by daira
Replying to daira:
Separate locks are taken for each shareset, and nothing prevents sharesets matching the prefix from being deleted or added between listing the sharesets and taking these locks. Callers must be able to tolerate this.
should be:
A caller will typically perform operations that take locks on some of the sharesets returned by this method. Nothing prevents sharesets matching the prefix from being deleted or added between listing the sharesets and taking any such locks; callers must be able to tolerate this.
comment:9 Changed at 2013-04-24T21:44:11Z by zooko
Thanks for the explanation! I'm glad you're focusing on fixing this issue!
comment:10 Changed at 2013-04-25T23:39:43Z by daira
Fixed in https://github.com/LeastAuthority/tahoe-lafs/commit/81e33c5aafb732810aa44121885943ec0a72ffad on the cloud-rebased2 branch, but I also want to add a test. (It's a bit tricky because the test has to reliably trigger the situation that would be a race condition without the lock.)
comment:11 Changed at 2013-05-01T15:04:42Z by daira
- Keywords test-needed added; design-review-needed removed
- Owner set to zooko
Test added in 9384bd49 on the cloud-rebased2 branch. Review of the fix and test needed.
comment:12 Changed at 2013-05-01T15:04:59Z by daira
- Resolution set to fixed
- Status changed from new to closed