#17 closed defect (fixed)

upload needs to be tolerant of lost peers

Reported by: warner
Owned by: warner
Priority: major
Milestone: 0.9.0 (Allmydata 3.0 final)
Component: code-encoding
Version: 0.7.0
Keywords:
Cc:
Launchpad Bug:

Description

When we upload a file, we can tolerate not having enough peers (or those peers not offering enough space), based upon a threshold named "shares_of_happiness". We want to place 100 shares by default, and as long as we can find homes for at least 75 of them, we're happy.

But in the current src/allmydata/encode.py, if any of those peers go away while we're uploading, the entire upload fails. (Worse yet, the failure is not reported properly: there are a lot of unhandled errbacks on Deferreds in there.)

encode.Encoder._encoded_segment needs to be changed to count failures rather than allowing them to kill off the whole segment (and thus the whole file). When the encode/upload process finishes, it needs to return both the roothash and a count of how many shares were successfully placed, so that the enclosing upload.py code can decide whether it's done or whether it needs to try again.
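
For illustration, a minimal sketch of that shape, assuming a hypothetical put_block() remote call and a plain dict of landlords (the real encode.py API differs):

    from twisted.internet import defer

    def push_segment(landlords, segnum, blocks):
        # landlords maps shareid -> remote bucket proxy; blocks maps
        # shareid -> the encoded block for this segment.
        dl = []
        for shareid, proxy in list(landlords.items()):
            d = proxy.put_block(segnum, blocks[shareid])  # hypothetical remote call
            # A lost peer costs us one share, not the whole segment:
            d.addErrback(lambda why, sid=shareid: landlords.pop(sid, None))
            dl.append(d)
        d = defer.DeferredList(dl)
        # len(landlords) is now the number of shares still being placed; at the
        # end of the file, returning it alongside the roothash lets upload.py
        # decide whether it's done or needs to try again.
        d.addCallback(lambda results: len(landlords))
        return d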

At the moment, since we're bouncing storage nodes every hour to work around the silently-lost-connection issues, any upload that is in progress at :10, :20, or :30 is going to fail in this fashion.

Change History (7)

comment:1 Changed at 2007-04-28T19:15:36Z by warner

  • Component changed from component1 to code

comment:2 Changed at 2007-05-04T05:15:12Z by warner

  • Priority changed from major to critical

comment:3 Changed at 2007-05-30T00:48:14Z by warner

Oh, I think it gets worse: from some other tests I was doing, it looks like if you lose all peers, the upload process goes into an infinite loop and slowly consumes more and more memory.

comment:4 Changed at 2007-06-06T16:41:12Z by warner

  • Milestone set to release 0.2.1

comment:5 Changed at 2007-06-06T19:50:05Z by warner

  • Resolution set to fixed
  • Status changed from new to closed

6bb9debc166df756 and f4c048bbeba15f51 should address this: now we keep going as long as we can still place 'shares_of_happiness' shares (which defaults to 75 in our 25-of-100 encoding). Log messages are generated each time a peer is lost, to indicate how close we are to giving up.

If we lose so many peers that we drop below shares_of_happiness, the upload fails with a NotEnoughPeersError exception.
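
Sketched out, the check amounts to something like this (NotEnoughPeersError is the exception named above; the helper name, default, and logging are illustrative):

    import logging

    class NotEnoughPeersError(Exception):
        """Too few shares can still be placed for the upload to succeed."""

    def check_happiness(landlords, shares_of_happiness=75):
        # Called after a failed peer has been dropped: log how close we are
        # to giving up, and only fail once we fall below the threshold.
        remaining = len(landlords)
        logging.info("peer lost: %d shares still placeable, need %d",
                     remaining, shares_of_happiness)
        if remaining < shares_of_happiness:
            raise NotEnoughPeersError("only %d shares placeable, need %d"
                                      % (remaining, shares_of_happiness))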

comment:6 Changed at 2008-01-26T00:15:28Z by warner

  • Component changed from code to code-encoding
  • Milestone changed from 0.3.0 to 0.9.0 (Allmydata 3.0 final)
  • Priority changed from critical to major
  • Resolution fixed deleted
  • Status changed from closed to reopened
  • Version set to 0.7.0

Oops, it turns out there is still a problem: if a peer is quietly lost before the upload starts, the initial storage.WriteBucketProxy.start call (which writes a bunch of offsets into the remote share) will fail with some sort of connection-lost error (either when TCP times out, or when the storage server reconnects and replaces the existing connection). Failures in this particular method call are not caught the same way as later failures, and any such failure will cause the upload to fail.

The task is to modify encode.Encoder.start():213 where it says:

        for l in self.landlords.values():
            d.addCallback(lambda res, l=l: l.start())

to wrap the start() calls in the same kind of drop-that-server-on-error code that all the other remote calls use.
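
In other words, something along these lines, where _remove_landlord_on_error stands in for whatever drop-the-server errback the other remote calls already use (the name is illustrative):

        for l in self.landlords.values():
            d.addCallback(lambda res, l=l: l.start())
            # If this peer is already gone, drop it (and its share) instead of
            # letting the connection-lost error kill the whole upload, just as
            # the per-block remote calls do.
            d.addErrback(self._remove_landlord_on_error, l)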

This might be the cause of #193 (if the upload was stalled waiting for the lost peer's TCP connection to close), although I kind of doubt it. It might also be the cause of #253.

comment:7 Changed at 2008-01-28T19:24:50Z by warner

  • Resolution set to fixed
  • Status changed from reopened to closed

Fixed in 4c5518faefebc1c7. I *think* that's the last of them, so I'm closing out this ticket again.
