[tahoe-lafs-trac-stream] [tahoe-lafs] #302: stop permuting peerlist, use SI as offset into ring instead?
tahoe-lafs
trac at tahoe-lafs.org
Wed Mar 27 14:37:10 UTC 2013
#302: stop permuting peerlist, use SI as offset into ring instead?
-------------------------+-------------------------------------------------
     Reporter:  warner   |          Owner:  zooko
         Type:  task     |         Status:  assigned
     Priority:  major    |      Milestone:  undecided
    Component:  code-    |        Version:  0.7.0
      peerselection      |       Keywords:  repair newcaps newurls performance
   Resolution:           |                  preservation upload
Launchpad Bug:           |
-------------------------+-------------------------------------------------
Description changed by zooko:
New description:
We were chatting today about a subject that comes up about every three
months: why do we use a consistently-permuted peerlist when choosing
where shares should be placed?

The StorageIndex is the output of a hash function and therefore randomly
distributed, so the set of all storage indices in the network should be
(in the long run) evenly distributed across the SHA-256 2**256
numberspace.

Our current [source:git/docs/architecture.rst#server-selection "Tahoe2"]
algorithm uses the file's Storage Index to permute a list of all known
storage servers (by hashing the (SI+peerid) string and sorting the
result). This adds an additional level of random-distribution: each file
gets a different ordering of storage servers.

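A minimal sketch of that permutation, assuming storage indexes and
peerids are byte strings; the function name and the use of plain SHA-256
here are illustrative, not necessarily Tahoe's actual internals:

{{{
#!python
from hashlib import sha256

def permuted_serverlist(storage_index, peerids):
    # Order the servers for one file by hashing (SI + peerid) and sorting
    # on the digests; every file sees the servers in a different order.
    return sorted(peerids,
                  key=lambda peerid: sha256(storage_index + peerid).digest())
}}}
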
The alternate approach that Zooko suggested today was to skip the
permutation and instead define the algorithm to be (a rough sketch in
code follows the list):

* put the storage servers in a circle, placed by their nodeid
* use the storage index as a starting location
* travel clockwise, asking each server in turn to accept a lease
* continue travelling until all shares are placed, or we've run out of
  servers

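Here is that ring walk in sketch form, under the assumption that peerids
and storage indexes can be compared as points in the same numberspace;
`ask_to_hold` stands in for the real lease-request RPC and all of the
names are made up for illustration:

{{{
#!python
from bisect import bisect_left

def ring_serverlist(storage_index, peerids):
    # Place the servers on a ring by nodeid, start at the storage index,
    # and walk clockwise, wrapping around at the top of the numberspace.
    ring = sorted(peerids)
    start = bisect_left(ring, storage_index)
    return ring[start:] + ring[:start]

def place_shares(storage_index, peerids, sharenums, ask_to_hold):
    # Offer one share to each server in ring order until every share is
    # placed or we run out of servers.  ask_to_hold(peerid, sharenum)
    # stands in for the lease request and returns True on acceptance.
    placements = {}
    remaining = list(sharenums)
    for peerid in ring_serverlist(storage_index, peerids):
        if not remaining:
            break
        if ask_to_hold(peerid, remaining[0]):
            placements[remaining.pop(0)] = peerid
    return placements  # may be incomplete if we ran out of servers
}}}
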
The reason we went with the permuted list was that we were concerned
about the effects of non-uniform storage server capacities. There are
two situations to look at: light loading (i.e. nobody is full and all
lease requests are accepted), and heavy loading (some or most storage
servers are full, and are rejecting lease requests, so the uploader will
visit other servers to place their shares).

For light loading, the non-permuted approach causes any individual
storage server to experience a load roughly equal to the percentage of
the ring that is covered by its N counter-clockwise neighbors. On
average, this will be the same for all peers, but some peerids might be
clustered more than others, resulting in more traffic to those peers
than the rest.

For heavy loading, once a server is full, all of the traffic that would
have landed on it will be redirected clockwise around the ring (imagine
rain flowing across the ground until it finds a hole large enough to
form a puddle). This will result in a concentration of load on the
server just past the full region, and may cause that server to fill
quickly. The likely result is a large section of the ring which is full,
while there may be more space available on the other side of the ring.

In contrast, the permuted approach removes the correlation between
server locations: each file sees all servers in different locations. So
instead of having "hot spots" (either caused by randomly-clustered
peerids, or randomly-clustered full servers), the load will be
distributed more evenly. We would expect the heavily-loaded grid to see
all servers get full at roughly the same time.

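To make the hot-spot argument concrete, here is a toy Monte Carlo
comparison. It is purely illustrative: 20 servers with random 20-byte
nodeids, 10,000 random 16-byte storage indexes, and it only tracks which
server would receive the first share under light loading:

{{{
#!python
import os
from bisect import bisect_left
from collections import Counter
from hashlib import sha256

def ring_first(si, ring):
    # non-permuted rule: the first server clockwise of the storage index
    return ring[bisect_left(ring, si) % len(ring)]

def permuted_first(si, peerids):
    # permuted rule: the server that sorts first by sha256(SI + peerid)
    return min(peerids, key=lambda p: sha256(si + p).digest())

peerids = [os.urandom(20) for _ in range(20)]
ring = sorted(peerids)
plain, permuted = Counter(), Counter()
for _ in range(10000):
    si = os.urandom(16)
    plain[ring_first(si, ring)] += 1
    permuted[permuted_first(si, peerids)] += 1

# A perfectly even split would be 500 apiece.  The ring rule's busiest
# server typically lands well above that (its counter-clockwise gap is
# bigger than average), while the permuted rule stays close to even.
print("ring:     busiest server received", max(plain.values()), "first shares")
print("permuted: busiest server received", max(permuted.values()), "first shares")
}}}
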
There are two arguments in favor of switching to the non-permuted
approach. The first is simplicity: it is slightly easier to explain the
non-permuted algorithm, and it is easier to predict where any given
file's shares will wind up. The second is a potential benefit for
repair. The issue is as follows: a storage server has just died (the
hard drive experienced a fatal error), and all of those shares are gone.
Repair can be invoked to replace the missing shares and bring the file
back up to full strength, but which files need to be repaired? The most
convenient list of storage indexes that need repairing was on the server
that just died. Is there some other way to construct this list of repair
work?

(The assumption is that failure-driven repair is more sustainable than
constant repair of all known files. This depends heavily upon the
numbers we use: how many servers, how many files, how many clients, how
many repair processes, what bandwidth is consumed by repair, etc.)

The benefit that non-permuted share distribution would offer is in the
resulting correlation between shares held by server B and those held by
its neighbors A and C. In a lightly-loaded grid, if all servers A+B+C
have never rejected a lease request and are equally old, then every
storage index on server B will also be on either A or C (assuming k>=2).
Thus, if B dies, we can use A and C to construct the list of repair work
that needs to be done.

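A sketch of that neighbor-based reconstruction, assuming the ring
placement above and a lightly-loaded grid; `total_shares` stands for the
total number of shares per file, and every name here is hypothetical:

{{{
#!python
from bisect import bisect_left

def repair_list_for(dead_peerid, neighbor_share_lists, ring, total_shares):
    # Rebuild the dead server's repair list from the storage indexes its
    # ring neighbors (A and C) still hold.  With non-permuted placement a
    # file's shares land on consecutive servers, so we keep the storage
    # indexes whose placement window (the first `total_shares` servers
    # clockwise of the SI) includes the dead server.
    candidates = set()
    for share_list in neighbor_share_lists:
        candidates.update(share_list)
    needs_repair = set()
    for si in candidates:
        start = bisect_left(ring, si)
        window = [ring[(start + i) % len(ring)] for i in range(total_shares)]
        if dead_peerid in window:
            needs_repair.add(si)
    return needs_repair
}}}
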
However, Rob astutely pointed out that there are plenty of other ways to
accomplish this same job. For example, each server could be assigned a
"buddy", using a simple foolscap protocol, and each time server B adds a
new share, it tells its buddy "D" about the storage index. If B dies, we
ask the buddy for the share list and dump it into the repair queue. We
could use a central server for this purpose, or distribute it out: what
really matters is that we have a way to find out who the buddy is.

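The bookkeeping on the buddy's side is small. A plain-Python sketch (in
a real deployment the two notifications would be remote foolscap calls;
the class and method names here are made up):

{{{
#!python
class BuddyLedger:
    # Runs on the buddy ("D"): remembers the storage indexes its partner
    # has accepted, and replays them into a repair queue if the partner
    # dies.

    def __init__(self):
        self.shares_seen = {}   # partner peerid -> set of storage indexes

    def share_added(self, peerid, storage_index):
        # called by the partner each time it accepts a new share
        self.shares_seen.setdefault(peerid, set()).add(storage_index)

    def partner_died(self, peerid, repair_queue):
        # dump the dead partner's share list into the repair queue
        for storage_index in self.shares_seen.pop(peerid, set()):
            repair_queue.append(storage_index)
}}}
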
We need to think more about this. I like the load distribution
properties of permuting, and I'm not particularly concerned about the
descriptive complexity, but I too am concerned about the repair design,
and would like to leave ourselves some tricks to pull out in case we run
into future problems. The B-probably-has-A+C benefit of non-permutation
breaks down if any of the servers are full or new, so I'm less convinced
that it will be a significant help. But so much of this is guesswork; we
don't really know.
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/302#comment:24>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage