#1117 closed defect (fixed)

the immutable uploader should call remote_abort on buckets that it knows it won't be using

Reported by: kevan
Owned by: zooko
Priority: major
Milestone: 1.7.1
Component: code-peerselection
Version: 1.7.0
Keywords:
Cc:
Launchpad Bug:

Description

The immutable upload code doesn't seem to call the abort method of buckets on remote storage servers when an upload fails (e.g., because the user aborted it, or because peer selection didn't work out). This means that the placeholder sharefiles stick around and reduce available space until the client disconnects, or they expire in some other way. The immutable upload code should call abort on any buckets it knows it will no longer use, so that the allocated space is released promptly.

(this was first reported in http://tahoe-lafs.org/pipermail/tahoe-dev/2010-July/004656.html)

Attachments (2)

1117patch.dpatch (12.3 KB) - added by kevan at 2010-07-16T00:51:19Z.
test-1117.diff (1.6 KB) - added by warner at 2010-07-18T05:08:20Z.
test case to check that the code fails without the abort-shares patch


Change History (16)

comment:1 Changed at 2010-07-14T06:41:04Z by zooko

Kevan: is this a regression vs. v1.6.0? My guess is that this bug was already present in v1.6.0, but if not please add the "regression" tag to this ticket!

comment:2 Changed at 2010-07-14T21:27:22Z by warner

The storage server doesn't exactly use placeholder files, but the internal how-much-space-have-i-committed-to code will indeed keep counting that space until an abort() is sent, so the uploader should definitely abort the shares as soon as it realizes it isn't going to use them. Otherwise the allocation will stick around until the server connection is dropped.
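A minimal sketch of the accounting being described, with names (_active_writers, allocated_size, bucket_writer_closed) modeled on this comment's description rather than copied from Tahoe's actual server.py:

{{{
#!python
# Hedged sketch: committed space is the sum over still-active BucketWriters,
# so a writer that is never close()d or abort()ed keeps its allocation
# counted until the server connection is dropped.
class StorageServer(object):
    def __init__(self):
        self._active_writers = {}  # BucketWriter -> allocated size in bytes

    def allocated_size(self):
        # Space the server has promised to writers but not yet settled.
        return sum(self._active_writers.values())

    def bucket_writer_closed(self, bw):
        # Today this runs only on close(); the fix discussed below makes
        # abort() release the allocation the same way.
        del self._active_writers[bw]
}}}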

The share that's hanging out may also convince later uploaders to refrain from uploading a new copy of that same share. I think the server reports in-progress shares in exactly the same way as it reports really-complete shares. So failing to abort a share is likely to confuse later uploads too.

comment:3 Changed at 2010-07-16T00:50:33Z by kevan

From what I understand, the storage server only stops counting the space allocated to a bucket writer when that writer's remote_close method is called: remote_close invokes the server's bucket_writer_closed method, which removes the bucket writer from the list of active writers. remote_abort, on the other hand, only deletes the incoming file associated with the bucket. If I haven't misunderstood, this issue then breaks down into:

  1. The client needs to be careful about aborting shares when it knows that it will no longer use them.
  2. The server needs to treat remote_abort more like remote_close, only instead of copying the file from the incomingdir to the sharedir, it needs to delete that file (see the sketch below).
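A rough sketch of what point 2 could look like on the server side, assuming BucketWriter tracks an incomingdir path and a sharedir path as described above (attribute names and signatures are guesses, not Tahoe's actual code):

{{{
#!python
import os

class BucketWriter(object):
    def __init__(self, server, incominghome, finalhome):
        self.ss = server
        self.incominghome = incominghome  # path under incomingdir while writing
        self.finalhome = finalhome        # path under sharedir after close

    def remote_close(self):
        # Move the finished share into place, then release the accounting.
        os.rename(self.incominghome, self.finalhome)
        self.ss.bucket_writer_closed(self)

    def remote_abort(self):
        # Point 2: mirror remote_close, but delete the partial share instead
        # of promoting it -- and still release the space accounting.
        if os.path.exists(self.incominghome):
            os.remove(self.incominghome)
        self.ss.bucket_writer_closed(self)
}}}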

I've attached a patch that addresses both of these issues. This can be considered a backward-compatibility break for clients that were relying on the fact that abort()ing a bucket didn't cause the server to stop counting the space assigned to that bucket. I'm not sure how likely it is that there are any such clients.

In the tests for the client-side modifications, I use a fireEventually() to make sure that the abort messages reach the storage server before I check their effects (the bucket writer proxy's abort call uses callRemoteOnly instead of callRemote, since it doesn't care much about the result, which makes it harder to reason about when the messages arrive at their destination during testing). Is this reasonable? Is there a better way?
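For readers unfamiliar with the idiom, a hedged sketch of the test pattern being described (fireEventually comes from foolscap, though the import path varies by version; the helper names here are hypothetical):

{{{
#!python
# fireEventually() returns a Deferred that fires only after everything already
# queued on foolscap's eventual-send queue has run, so the fire-and-forget
# callRemoteOnly("abort") messages are delivered before the assertion runs.
from foolscap.eventual import fireEventually  # import path may vary

def test_failed_upload_aborts_buckets(self):
    d = self.start_upload_that_fails()            # hypothetical helper
    d.addBoth(lambda ign: fireEventually())       # drain queued abort messages
    d.addCallback(lambda ign: self.failUnlessEqual(
        self.server.allocated_size(), 0))         # allocation was released
    return d
}}}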

zooko: I think you're right; this bug seems to exist in 1.6.0 too, so this isn't a regression.

Changed at 2010-07-16T00:51:19Z by kevan

comment:4 Changed at 2010-07-16T00:51:41Z by kevan

  • Keywords review-needed added

comment:5 Changed at 2010-07-17T06:36:38Z by zooko

Since it is a bugfix and a patch exists, it is still a candidate for 1.7.1. Someone should review it carefully!

comment:6 Changed at 2010-07-17T07:26:10Z by zooko

I just read through attachment:1117patch.dpatch. I didn't see any mistakes, and the patch adds three unit tests--test_abort(), test_peer_selector_bucket_abort(), and test_encoder_bucket_abort(). However, I'm too sleepy to double-check all the uses of self.buckets in src/allmydata/immutable/upload.py and to understand exactly what those tests do, so I'm putting this off until tomorrow.

(Anyone else should feel free to review this before I get around to it.)

comment:7 Changed at 2010-07-17T21:40:23Z by davidsarah

  • Owner changed from kevan to davidsarah
  • Status changed from new to assigned

comment:8 Changed at 2010-07-18T03:02:04Z by warner

Thinking about this a bit further (and in light of the persistent failure-to-upload described in #1118)... it's not a space problem, but rather a consequence of the way the server handles shares that it thinks are already in the process of being uploaded.

If an upload fails partway through (after allocate_buckets), as when #1118 stopped at an assert statement, the storage servers will have BucketWriter objects with open filehandles to partially- (perhaps not-at-all-) written shares in incoming/. The client will neither close those shares, nor abort them, nor drop the connection, so they'll stick around until the client next restarts. When the client tries to upload the file a second time, its allocate_buckets call will hit source:src/allmydata/storage/server.py#L335, where the presence of the incoming/ file causes the server to refuse to accept a new share without claiming that it already has the share (indistinguishable from the case where it does not have enough space to accept the share).
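A hedged, simplified sketch of the allocate_buckets decision being described (method and attribute names are approximations of the storage server; _make_bucket_writer is a hypothetical helper):

{{{
#!python
import os

def remote_allocate_buckets(self, si_dir, sharenums):
    # A leftover file in incoming/ makes the server decline the share
    # quietly: it neither hands back a writer nor lists the share in
    # alreadygot, which to the client looks exactly like "out of space".
    alreadygot, bucketwriters = set(), {}
    for shnum in sharenums:
        finalhome = os.path.join(self.sharedir, si_dir, str(shnum))
        incominghome = os.path.join(self.incomingdir, si_dir, str(shnum))
        if os.path.exists(finalhome):
            alreadygot.add(shnum)      # a complete share is already on disk
        elif os.path.exists(incominghome):
            pass                       # in progress: refuse without explanation
        else:
            bucketwriters[shnum] = self._make_bucket_writer(incominghome,
                                                            finalhome)
    return alreadygot, bucketwriters
}}}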

This effectively disables those servers for that one file (i.e. for that one storage-index). If the grid only has 10 servers, then a single failed upload is enough to leave the client with no servers that will accept shares, and all subsequent uploads of that file (until the client is restarted, severing the TCP connections and aborting the shares) will fail. If the grid has 20 servers, then two failed uploads are enough to get into this nothing-will-work state.

As the rest of this ticket points out, the necessary fix is to examine the error paths out of the uploader code, to make sure that all paths result in the shares either being closed or aborted. This is non-trivial. We need to accumulate a list of remote BucketWriters as soon as they are received from the server (in response to the allocate_buckets call), and then have an addBoth handler (like a 'finally' block in synchronous try/except clauses) that aborts anything left in the list. When the upload succeeds, entries in this list should be removed as soon as the response to the close() message is received. Since the BucketWriter is received by the peer-selection code, whereas the best place for the addBoth handler is elsewhere (in the CHKUploader, or maybe the Uploader), it's not clear where this list ought to live.
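A minimal sketch of this 'finally'-style cleanup, with illustrative names (the real code would have to split these responsibilities between peer selection and the CHKUploader, which is exactly the open question above):

{{{
#!python
# Track every remote bucket writer from the moment allocate_buckets returns
# it, and let an addBoth handler abort whatever was never successfully closed.
# Class and method names are illustrative, not Tahoe's actual API.
class UploadCleanupSketch(object):
    def __init__(self):
        self._open_buckets = set()   # live RIBucketWriter references

    def bucket_allocated(self, bucket):
        # Record the writer as soon as the server hands it to us.
        self._open_buckets.add(bucket)

    def bucket_closed(self, bucket):
        # A confirmed close() means this bucket no longer needs aborting.
        self._open_buckets.discard(bucket)

    def run_upload(self):
        d = self._do_upload()        # hypothetical: allocate, push, close
        # Like a synchronous 'finally': runs on success and on any failure.
        d.addBoth(self._abort_leftover_buckets)
        return d

    def _abort_leftover_buckets(self, result):
        for bucket in self._open_buckets:
            bucket.callRemoteOnly("abort")  # fire-and-forget, per comment:3
        self._open_buckets.clear()
        return result   # pass the original result or Failure through
}}}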

comment:9 Changed at 2010-07-18T03:39:33Z by zooko

  • Keywords reviewed added; review-needed removed
  • Owner changed from davidsarah to zooko
  • Status changed from assigned to new

comment:10 Changed at 2010-07-18T03:39:40Z by zooko

  • Status changed from new to assigned

Changed at 2010-07-18T05:08:20Z by warner

test case to check that the code fails without the abort-shares patch

comment:11 Changed at 2010-07-18T05:09:13Z by warner

The test case in the patch I just attached fails without Kevan's patch. I have not yet confirmed that it passes *with* his patch.

comment:12 Changed at 2010-07-18T19:46:37Z by warner

I have just confirmed that my new test case does indeed pass when Kevan's patch is applied. The tests that are in his patch are ok, but they focus on allocated size, rather than the ability to perform a second upload (i.e. the lack of the buggy prevent-all-further-allocate_buckets behavior). So I think we should apply both patches.

comment:13 Changed at 2010-07-18T20:50:16Z by zooko@…

  • Resolution set to fixed
  • Status changed from assigned to closed

In 16bb529339e6cbd5:

tests, NEWS, CREDITS re: #1117
Give Brian and Kevan promotions, move release date in NEWS to the 18th, commit Brian's test for #1117.
fixes #1117

comment:14 Changed at 2010-07-18T20:58:54Z by zooko

  • Keywords reviewed removed