[tahoe-dev] [tahoe-lafs] #302: stop permuting peerlist, use SI as offset into ring instead?
tahoe-lafs
trac at allmydata.org
Fri Dec 25 21:57:31 PST 2009
#302: stop permuting peerlist, use SI as offset into ring instead?
------------------------------------+---------------------------------------
Reporter: warner | Owner:
Type: task | Status: new
Priority: major | Milestone: undecided
Component: code-peerselection | Version: 0.7.0
Keywords: repair newcaps newurls | Launchpad_bug:
------------------------------------+---------------------------------------
Comment(by warner):
ok, so the "ringsim.py" simulator that I just attached to this ticket
demonstrates one of the concerns I described above: non-permuted
peer-selection will result in higher bytes-per-second upload rates to some
servers than to others. (I haven't yet built a simulator to investigate
the
effect of full servers shedding load onto their clockwise neighbors).
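For readers who haven't opened the simulator, here is a minimal sketch of
the two selection strategies being compared (illustrative only, not the
actual ringsim.py or Tahoe client code; serverids and storage indexes are
assumed to be equal-length byte strings):
{{{
import hashlib
from bisect import bisect_left

def permuted_order(servers, storage_index):
    # permuted selection: every file sees its own pseudo-random ordering
    # of the servers, keyed by a hash of (serverid + SI)
    return sorted(servers,
                  key=lambda sid: hashlib.sha256(sid + storage_index).digest())

def ring_offset_order(servers, storage_index):
    # linear ring-offset selection: one fixed ordering (sorted serverids),
    # starting at the first server clockwise of the SI
    ring = sorted(servers)
    start = bisect_left(ring, storage_index)
    return ring[start:] + ring[:start]

# either way, the N shares go to the first N writable servers in the ordering
}}}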
Run the code like {{{python ./ringsim.py --seed=abc --permute=1}}} . It will
create a ring of 100 servers using "abc" as a seed to decide their nodeids
(any specific seed will result in a consistent distribution of nodeids).
Then it will upload files (each with a size in {{{randrange(2GiB)}}}, mean
size 1GiB) one at a time. Every few thousand uploads it will analyze the
space used per-server and emit a report line like:
{{{
uploaded 16000
min/max/(exp) usage-pf-ps 33.86 MB/38.66 MB/(35.72 MB): spread-pf: 4.80 MB (13.45%) stddev: 1.00 MB (2.81%)
}}}
The first block of numbers is "usage-per-file-per-server", meaning how much
storage space was used on each server, divided by the total number of files
that had been uploaded so far. If we pretend that we're uploading files at a
rate of one per second, this is actually measuring bytes-per-second. The
"min" value of 33.86 MB means that the least-used server had received
33.86MB per file (i.e. per second). The most-used (fullest) server had
received 38.66MB per second. Our average filesize of 1GiB and 3-of-10
encoding parameters means that we'd expect to place 35.72MB per-server
per-file (the mean filesize, expanded by 10/3, divided across 100 servers).
The "spread-pf" is the difference between the least-used and most-used
servers: 4.80MB = 38.66-33.86. 13.45% is that spread expressed as a
percentage of the expected usage value.
The "stddev" is the standard deviation of all 100 servers' usage values.
If
usage were perfectly uniform, this would be zero. 2.81% is the standard
deviation expressed as a percentage of the expected usage value.
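To make those fields concrete, here is a sketch of how the report statistics
could be computed from per-server byte totals (not the simulator's own code;
the real "(exp)" value is derived from the mean filesize and encoding
parameters rather than from the observed average used here):
{{{
from statistics import pstdev

def usage_report(bytes_per_server, num_files):
    per_file = [b / num_files for b in bytes_per_server]   # usage-pf-ps
    expected = sum(per_file) / len(per_file)               # stand-in for "(exp)"
    spread = max(per_file) - min(per_file)                 # spread-pf
    stddev = pstdev(per_file)
    return (min(per_file), max(per_file), expected,
            spread, 100.0 * spread / expected,
            stddev, 100.0 * stddev / expected)
}}}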
The simulator will run nearly forever. Run it with {{{--permute=1}}} and
notice how the min/max values converge on the expected value over time, and
how the spread and stddev drop towards zero. In my test run, after 200000
files, the spread was down to 1.61MB (4.5%) and the stddev down to 265kB
(0.74%). This is the law of large numbers in action.
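(That rate of convergence is about what independent random placement would
predict: the relative stddev should shrink roughly like 1/sqrt(number of
files), and sqrt(200000/16000) is about 3.5, close to the observed drop from
2.81% to 0.74%, a factor of 3.8.)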
Now, re-run the simulator with {{{--permute=0}}} and {{{--seed=abc}}}. It
runs much faster (because linear ring-offset selection is a lot easier than
hash-based permutation). Look at the usage report for 16000 files:
{{{
uploaded 16000
min/max/(exp) usage-pf-ps 7.10 MB/55.78 MB/(35.81 MB): spread-pf: 48.69 MB (135.96%) stddev: 12.12 MB (33.84%)
}}}
The spread is enormous, as is the standard deviation. The least-used server
is using roughly an eighth as much space as the most-full server, whereas in
the permuted case they were using within 15% of each other.
And if you let it run for a while and look at the 200000 file report, it
doesn't get better over time:
{{{
uploaded 200000
min/max/(exp) usage-pf-ps 6.90 MB/56.05 MB/(35.69 MB): spread-pf: 49.15 MB (137.73%) stddev: 12.17 MB (34.10%)
}}}
Even after 200k files, the least-to-most-used ratio is 8x. And the stddev is
basically constant.
A bit of extra code reveals why. The least-used server has a nodeid that
starts with dff96a, and the neighboring portion of the sorted list of
serverids (with the separation between each node and its CCW neighbor)
shows:
{{{
<Server dee63f41a151782ed22d012814dbce4e> 0112bfce339ff4781d6b7a6f9705aae2
<Server df510eb7034c80b737719d76ca1d19d6> 006acf7561fb088865449c4eb5414b88
<Server df6848a848d3a59966e5835a1883fa77> 001739f1458724e22f73e5e34e66e0a1
<Server dfaff86b8c4d56ca38e673186fb11f56> 0047afc34379b130d200efbe572d24df
<Server dfba0da90d512319e6447e6464a57bb3> 000a153d8103cc4fad5e0b4bf4f45c5d
<Server dff96a791f1d835d6be65724b91e1225> 003f5cd011cc604385a1d8c054789672
<--
<Server e10e706a39fd09fe19db22fc498ed1ed> 011505f11adf86a0adf4cbd79070bfc8
<Server e1bfb715d3cec4799ad3bb2e50f523fe> 00b146ab99d1ba7b80f8983207665211
<Server e2f95cb1e2313cf26e738b3e8a0b8934> 0139a59c0e627878d39fd01039166536
}}}
100 uniformly-distributed servers would have a separation of 028F5C28F6...,
but the randomly-chosen nodeids in our {{{--seed=abc}}} ring are not
uniformly distributed. In this case, lucky node dff96a happened to land
unusually close after node dfba0, with a separation of just 003f5d..., about
one tenth the ideal (uniform) separation. In fact it sits at the end of an
unusually dense cluster of nodeids.
(what we actually care about is the separation between node dff96a and its
10th CCW neighbor, since we're encoding each file into 10 shares. The
separation between dfba0 and dff96a is a big contributor to this, but not
the whole thing).
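Those separations are easy to compute directly from the sorted nodeids. A
sketch (illustrative only), treating nodeids as 128-bit integers and taking
"clockwise" to mean increasing value: {{{ccw_gaps(nodeids, span=1)}}} gives
each server's gap to its nearest CCW neighbor, and {{{span=10}}} gives the
arc of SI values for which that server would receive one of the 10 shares
under ring-offset selection:
{{{
def ccw_gaps(nodeids, span=1):
    # for each node, the arc (mod 2**128) back to its span'th CCW neighbor
    ring = sorted(nodeids)
    n = len(ring)
    return {nid: (nid - ring[(i - span) % n]) % 2**128
            for i, nid in enumerate(ring)}
}}}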
And similarly, the most-used server was 4f5ab8, and that portion of the ring
looks like:
{{{
<Server 3be5ef429769acbb7b9bb73443ea9fee> 0183079e758e2d8ac094ba3269f908fb
<Server 3f0b2004a830fc609b61621ee3b77b1f> 032530c210c74fa51fc5aaea9fccdb31
<Server 4681adee12a4e943c07fed69b644c640> 07768de96a73ece3251e8b4ad28d4b21
<Server 4f5ab87f270f850a05c443a14fa2042e> 08d90a91146a9bc645445637995d3dee
<--
<Server 50719e5b06af03bbea0f362bed7e4dd3> 0116e5dbdf9f7eb1e44af28a9ddc49a5
<Server 52b741b1eb3e4d31ef44b78845b13a5f> 0245a356e48f49760535815c5832ec8c
<Server 54b497bd7905c60256a6a8735b6c2581> 01fd560b8dc778d06761f0eb15baeb22
}}}
The 4f5ab8 node is sitting just clockwise of an unusually large gap, from
4681ad, with a separation of 08d90b, about 3.5 times the ideal (uniform)
separation.
This is the "lumpy distribution" problem that I was worried about. The
effect
is reduced when shares are spread over more servers. If I re-run the
simulation with {{{--N=40}}} (3-of-40 encoding), I see a spread of about
50%
the expected value, and a stddev of about 15%. There is a corresponding
increase in the effect when shares are spread over fewer servers:
{{{--N=5}}}
gives me a spread of 195% and a sddev of 46%.
The effect is easiest to understand when k=N=1. In that case, the inlet rate
for any given server is strictly equal to the total upload rate times the
fraction of the ring that lies between that server and its nearest CCW
neighbor. For our "abc" seed, the smallest separation is between node b80ea
and b8159, with a gap of 0006ea (which is 1/95 of the uniform gap), and the
largest is between 6ef56 and 7b41d, with a gap of 0c4c6 (about 4.8 times the
uniform gap). So we'd expect lucky node b8159 to get about 1/95 of the
average per-server traffic (roughly 0.0001 of the total), unlucky 7b41d to
get about .048 of the total, and a ratio between the two of about 455x.
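In terms of the {{{ccw_gaps()}}} sketch above, that inlet fraction is just
each server's gap divided by the size of the ring (again illustrative, with
{{{nodeids}}} the list of 128-bit serverids):
{{{
gaps = ccw_gaps(nodeids)                        # span=1, since k=N=1
fractions = {nid: g / 2**128 for nid, g in gaps.items()}
# uniform share for 100 servers is 0.01; a gap of 1/95 of uniform gives
# ~0.0001 of the traffic, a gap of 4.8x uniform gives ~0.048
}}}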
And indeed, although b8159 and 5a82c are in competition for least-used
server, after about 300000 files, we get this report:
{{{
uploaded 292000
min/max/(exp) usage-pf-ps 153.15 kB/52.44 MB/(10.68 MB): spread-pf: 52.29 MB (489.40%) stddev: 10.69 MB (100.02%)
least: b8159546f332951c52367c4ad92fd9f7
most: 7b41d2d31e6180d42eec56221d02cf4d
}}}
And 10.68MB/153kB is 70x, and 52.44/10.68 is 4.9x, matching our expectations
of 95x and 4.8x pretty closely.
If you re-run the program with e.g. {{{--seed=def --permute=0}}}, you'll get
a different distribution of nodeids, which happens to get a 4x ratio between
most-full and least-full, and a stddev of about 30%. Better, but still
pretty bad. {{{--seed=def --permute=1}}} behaves just as well as the "abc"
seed: stddev is again down to 0.74% after 200k files.
If you're lucky and find a seed that gives you a uniform distribution, then
{{{--permute=0}}} should give you the same statistics as {{{--permute=1}}}.
But for most seeds (i.e. most grids), you'll get a very lumpy distribution.
{{{--permute=1}}} tolerates arbitrarily lumpy server distributions.
So, based upon this simulation, I'm fairly convinced that permuted-list is
necessary to avoid long-term uneven upload rates to different servers. A
simple linear ring-offset algorithm will subject servers to vastly different
loads unless the nodeids can be controlled to maintain a uniform
distribution (which means changing them every time a server is added or
removed).
Now, do we actually need uniform upload rates? What we really want, to
attain maximum reliability, is to never double-up shares. That means we want
all servers to become full at the same time, so instead of equal
bytes-per-second for all servers, we actually want equal
percentage-of-space-per-second for all servers. That's much trickier. If we
could completely (and continuously) control nodeids (by decoupling
peer-selection index values from cryptographically-backed server pubkeys),
we could adjust them to achieve inter-server gaps that compensate for how
much space they have remaining: small servers would be clustered closer
together, large servers would be placed CW of large gaps. The math necessary
to do this strikes me as pretty complicated, and I think that changing
nodeids over time would damage efficient retrievability, since shares would
no longer be in the ideal places when the downloader tries to perform the
same peer-selection routine as the uploader did.
We could also have servers refuse some percentage of incoming shares even if
they had space for them, to get their percentage-full-per-second rates down
to match the grid-wide average. This would induce the same problems that the
ring-offset and lumpy-distribution scheme has: servers which happen to sit
CW of a self-throttling node will get more traffic than usual.
OTOH, it's a little bit easier than that: we don't need to engage in this
load-shaping work until we start to run out of servers. If we have at least
"N" servers with space available, then reliability is unaffected by the rate
at which they're filling up. So we could have servers accept shares at full
speed until it looked like the grid was starting to fill up, then have them
switch into a mode where they defer requests to other servers more and more
(to obtain uniform fill rates) as the remaining space dwindles. The shaping
effect would be negligible in a grid with lots of free space. A managed
grid, for which new servers are added before the grid gets full, would never
need to engage in load shaping. But any amount of load shaping that *was*
being performed would put off the day at which the first server gets full.
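As a toy illustration of that deferral idea (a sketch only, not anything the
storage server currently implements; the threshold and the shape of the
curve are arbitrary):
{{{
def accept_probability(my_fullness, grid_fullness, shaping_threshold=0.8):
    # my_fullness / grid_fullness: fraction of space used on this server /
    # across the whole grid, both in [0,1]
    if grid_fullness < shaping_threshold:
        return 1.0                  # plenty of room everywhere: accept all
    if my_fullness <= grid_fullness:
        return 1.0                  # behind the average: keep accepting
    # ahead of the average: defer in proportion to how far ahead we are,
    # scaled up as the grid's remaining space dwindles
    ahead = (my_fullness - grid_fullness) / max(1.0 - grid_fullness, 1e-9)
    return max(0.0, 1.0 - ahead)
}}}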
So, in summary, I am re-convinced that linear ring-offset has real problems,
and that permuted-list provides a more uniform bytes-per-second inlet rate,
which is easier to deal with and gives better system-wide properties.
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/302#comment:12>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid