[tahoe-lafs-trac-stream] [tahoe-lafs] #302: stop permuting peerlist, use SI as offset into ring instead?

tahoe-lafs trac at tahoe-lafs.org
Wed Mar 27 14:37:10 UTC 2013


#302: stop permuting peerlist, use SI as offset into ring instead?
-------------------------+-------------------------------------------------
     Reporter:  warner   |      Owner:  zooko
         Type:  task     |     Status:  assigned
     Priority:  major    |  Milestone:  undecided
    Component:  code-    |    Version:  0.7.0
  peerselection          |   Keywords:  repair newcaps newurls performance
   Resolution:           |  preservation upload
Launchpad Bug:           |
-------------------------+-------------------------------------------------
Description changed by zooko:

Old description:

> We were chatting today about a subject that comes up about every three
> months: why do we use a consistently-permuted peerlist when choosing
> where shares should be placed?
>
> The StorageIndex is the output of a hash function and therefore randomly
> distributed, so the set of all storage indices in the network should be
> (in the long run) evenly distributed across the SHA-256 2**256
> numberspace.
>
> Our current [PeerSelection/TahoeTwo "Tahoe2"] algorithm uses the file's
> Storage Index to permute a list of all known storage servers (by hashing
> the (SI+peerid) string and sorting the result). This adds an additional
> level of random-distribution: each file gets a different ordering of
> storage servers.
>
> The alternate approach that Zooko suggested today was to skip the
> permutation and instead define the algorithm to be:
>
>  * put the storage servers in a circle, placed by their nodeid
>  * use the storage index as a starting location
>  * travel clockwise, asking each server in turn to accept a lease
>  * continue travelling until all shares are placed, or we've run out of
>    servers
>
> We went with the permuted list because we were concerned about the
> effects of non-uniform storage server capacities. There are two
> situations to look at: light loading (i.e. nobody is full and all lease
> requests are accepted), and heavy loading (some or most storage servers
> are full and are rejecting lease requests, so the uploader will visit
> other servers to place its shares).
>
> For light loading, the non-permuted approach causes any individual
> storage server to experience a load roughly equal to the percentage of
> the ring that is covered by its N counter-clockwise neighbors. On
> average, this will be the same for all peers, but some peerids might be
> clustered more than others, resulting in more traffic to those peers
> than the rest.
>
> For heavy loading, once a server is full, all of the traffic that would
> have landed on it will be redirected clockwise around the ring (imagine
> rain flowing across the ground until it finds a hole large enough to
> form a puddle). This will result in a concentration of load on the
> server just past the full region, and may cause that server to fill
> quickly. The likely result is a large section of the ring which is full,
> while there may be more space available on the other side of the ring.
>
> In contrast, the permuted approach removes the correlation between
> server locations: each file sees all servers in different locations. So
> instead of having "hot spots" (either caused by randomly-clustered
> peerids, or randomly-clustered full servers), the load will be
> distributed more evenly. We would expect the heavily-loaded grid to see
> all servers get full at roughly the same time.
>
> There are two arguments in favor of switching to the non-permuted
> approach. The first is simplicity: it is slightly easier to explain the
> non-permuted algorithm, and it is easier to predict where any given
> file's shares will wind up. The second is a potential benefit for
> repair. The issue is as follows: a storage server has just died (the
> hard drive experienced a fatal error), and all of those shares are gone.
> Repair can be invoked to replace the missing shares and bring the file
> back up to full strength, but which files need to be repaired? The most
> convenient list of storage indexes that need repairing was on the server
> that just died. Is there some other way to construct this list of repair
> work?
>
> (The assumption is that failure-driven repair is more sustainable than
> constant repair of all known files. This depends heavily upon the
> numbers we use: how many servers, how many files, how many clients, how
> many repair processes, what bandwidth is consumed by repair, etc).
>
> The benefit that non-permuted share distribution would offer is in the
> resulting correlation between shares held by server B and those held by
> its neighbors A and C. In a lightly-loaded grid, if all servers A+B+C
> have never rejected a lease request and are equally old, then every
> storage index on server B will also be on either A or C (assuming k>=2).
> Thus, if B dies, we can use A and C to construct the list of repair work
> that needs to be done.
>
> However, Rob astutely pointed out that there are plenty of other ways to
> accomplish this same job. For example, each server could be assigned a
> "buddy", using a simple foolscap protocol, and each time server B adds a
> new share, it tells its buddy "D" about the storage index. If B dies, we
> ask the buddy for the share list and dump it into the repair queue. We
> could use a central server for this purpose, or distribute it out: what
> really matters is that we have a way to find out who the buddy is.
>
> We need to think more about this. I like the load distribution
> properties of permuting, and I'm not particularly concerned about the
> descriptive complexity, but I too am concerned about the repair design,
> and would like to leave ourselves some tricks to pull out in case we run
> into future problems. The B-probably-has-A+C benefit of non-permutation
> breaks down if any of the servers are full or new, so I'm less convinced
> that it will be a significant help. But so much of this is guesswork; we
> don't really know.

New description:

 We were chatting today about a subject that comes up about every three
 months: why do we use a consistently-permuted peerlist when choosing where
 shares should be placed?

 The StorageIndex is the output of a hash function and therefore randomly
 distributed, so the set of all storage indices in the network should be
 (in the long run) evenly distributed across the SHA-256 2**256
 numberspace.

 Our current [source:git/docs/architecture.rst#server-selection "Tahoe2"]
 algorithm uses the file's Storage Index to permute a list of all known
 storage servers (by hashing the (SI+peerid) string and sorting the
 result). This adds an additional level of random-distribution: each file
 gets a different ordering of storage servers.
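
 A minimal sketch of that permuted ordering, assuming plain SHA-256 over
 the concatenated bytes (the real code uses Tahoe's tagged hashes, so
 treat the details here as illustrative only):

 {{{
 #!python
 import hashlib

 def permuted_servers(storage_index, peerids):
     # Rank every known server by hash(SI + peerid); sorting the digests
     # yields an ordering that is stable for this file but effectively
     # random from one file to the next.
     def rank(peerid):
         return hashlib.sha256(storage_index + peerid).digest()
     return sorted(peerids, key=rank)
 }}}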

 The alternate approach that Zooko suggested today was to skip the
 permutation and instead define the algorithm to be (a sketch in code
 follows the list):

  * put the storage servers in a circle, placed by their nodeid
  * use the storage index as a starting location
  * travel clockwise, asking each server in turn to accept a lease
  * continue travelling until all shares are placed, or we've run out of
    servers
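
 Roughly, and assuming nodeids and storage indexes are comparable byte
 strings in the same numberspace (the function and variable names here
 are hypothetical):

 {{{
 #!python
 import bisect

 def ring_offset_servers(storage_index, peerids):
     # Arrange the servers on a ring ordered by nodeid, start at the
     # position of the storage index, and walk clockwise, wrapping
     # around at the end of the sorted list.
     ring = sorted(peerids)
     start = bisect.bisect_left(ring, storage_index)
     return ring[start:] + ring[:start]
 }}}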

 We went with the permuted list because we were concerned about the
 effects of non-uniform storage server capacities. There are two
 situations to look at: light loading (i.e. nobody is full and all lease
 requests are accepted), and heavy loading (some or most storage servers
 are full and are rejecting lease requests, so the uploader will visit
 other servers to place its shares).

 For light loading, the non-permuted approach causes any individual
 storage server to experience a load roughly equal to the percentage of
 the ring that is covered by its N counter-clockwise neighbors. On
 average, this will be the same for all peers, but some peerids might be
 clustered more than others, resulting in more traffic to those peers
 than the rest.

 For heavy loading, once a server is full, all of the traffic that would
 have landed on it will be redirected clockwise around the ring (imagine
 rain flowing across the ground until it finds a hole large enough to
 form a puddle). This will result in a concentration of load on the
 server just past the full region, and may cause that server to fill
 quickly. The likely result is a large section of the ring which is full,
 while there may be more space available on the other side of the ring.

 In contrast, the permuted approach removes the correlation between
 server locations: each file sees all servers in different locations. So
 instead of having "hot spots" (either caused by randomly-clustered
 peerids, or randomly-clustered full servers), the load will be
 distributed more evenly. We would expect the heavily-loaded grid to see
 all servers get full at roughly the same time.
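
 A toy simulation (entirely hypothetical, not grid code) illustrates the
 contrast under light loading: with permutation the per-server share
 counts cluster tightly around the mean, while the ring-offset ordering
 loads each server in proportion to the gaps just counter-clockwise of
 it:

 {{{
 #!python
 import bisect, hashlib, os
 from collections import Counter

 def simulate(num_servers=20, num_files=20000, shares_per_file=3,
              permute=True):
     # Light loading: every server accepts every lease, so each file's
     # shares land on the first few servers in its ordering.
     peerids = sorted(os.urandom(32) for _ in range(num_servers))
     load = Counter()
     for _ in range(num_files):
         si = os.urandom(32)
         if permute:
             order = sorted(peerids,
                            key=lambda p: hashlib.sha256(si + p).digest())
         else:
             start = bisect.bisect_left(peerids, si)
             order = peerids[start:] + peerids[:start]
         for p in order[:shares_per_file]:
             load[p] += 1
     return sorted(load.values())

 print(simulate(permute=True))   # narrow spread around the mean
 print(simulate(permute=False))  # wide spread: some servers run hot
 }}}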

 There are two arguments in favor of switching to the non-permuted
 approach. The first is simplicity: it is slightly easier to explain the
 non-permuted algorithm, and it is easier to predict where any given
 file's shares will wind up. The second is a potential benefit for
 repair. The issue is as follows: a storage server has just died (the
 hard drive experienced a fatal error), and all of those shares are gone.
 Repair can be invoked to replace the missing shares and bring the file
 back up to full strength, but which files need to be repaired? The most
 convenient list of storage indexes that need repairing was on the server
 that just died. Is there some other way to construct this list of repair
 work?

 (The assumption is that failure-driven repair is more sustainable than
 constant repair of all known files. This depends heavily upon the
 numbers we use: how many servers, how many files, how many clients, how
 many repair processes, what bandwidth is consumed by repair, etc).

 The benefit that non-permuted share distribution would offer is in the
 resulting correlation between shares held by server B and those held by
 its neighbors A and C. In a lightly-loaded grid, if all servers A+B+C
 have never rejected a lease request and are equally old, then every
 storage index on server B will also be on either A or C (assuming k>=2).
 Thus, if B dies, we can use A and C to construct the list of repair work
 that needs to be done.
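
 Under those idealized conditions the repair queue for a dead server can
 be approximated from its ring neighbors alone. A hypothetical sketch,
 where shares_by_server maps each nodeid to the set of storage indexes
 it holds:

 {{{
 #!python
 def repair_candidates(dead_nodeid, ring, shares_by_server):
     # With non-permuted placement, k >= 2, and no full or new servers,
     # every storage index that was on the dead server should also be on
     # one of its immediate ring neighbors, so their union is a
     # (superset) estimate of the repair work.
     i = ring.index(dead_nodeid)
     prev_neighbor = ring[i - 1]                # counter-clockwise
     next_neighbor = ring[(i + 1) % len(ring)]  # clockwise
     return shares_by_server[prev_neighbor] | shares_by_server[next_neighbor]
 }}}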

 However, Rob astutely pointed out that there are plenty of other ways to
 accomplish this same job. For example, each server could be assigned a
 "buddy", using a simple foolscap protocol, and each time server B adds a
 new share, it tells its buddy "D" about the storage index. If B dies, we
 ask the buddy for the share list and dump it into the repair queue. We
 could use a central server for this purpose, or distribute it out: what
 really matters is that we have a way to find out who the buddy is.
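
 The bookkeeping on the buddy's side is small; a sketch of the state it
 would keep (the wire protocol, presumably foolscap, is omitted, and all
 names are hypothetical):

 {{{
 #!python
 class BuddyMirror:
     """Per-buddy record of which storage indexes each partner holds."""

     def __init__(self):
         self._mirrored = {}  # partner nodeid -> set of storage indexes

     def record_share(self, partner_nodeid, storage_index):
         # Called each time the partner accepts a new share.
         self._mirrored.setdefault(partner_nodeid, set()).add(storage_index)

     def repair_queue_for(self, dead_nodeid):
         # If the partner dies, its share list seeds the repair queue.
         return set(self._mirrored.get(dead_nodeid, set()))
 }}}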

 We need to think more about this. I like the load distribution
 properties of permuting, and I'm not particularly concerned about the
 descriptive complexity, but I too am concerned about the repair design,
 and would like to leave ourselves some tricks to pull out in case we run
 into future problems. The B-probably-has-A+C benefit of non-permutation
 breaks down if any of the servers are full or new, so I'm less convinced
 that it will be a significant help. But so much of this is guesswork; we
 don't really know.

--

-- 
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/302#comment:24>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage

