[volunteergrid2-l] Why high availability is crucial
Shawn Willden
shawn at willden.org
Sat Jan 15 23:10:23 UTC 2011
It may seem that the reason high node uptime matters is so that files can be
retrieved reliably, i.e. read-availability. In fact, the bigger hurdle is
maintaining write-availability. This is fairly obvious once you consider
that a read only needs K servers, while a write needs H servers, and H is
usually significantly larger than K.
I think it's even more important than it appears, however, because I think
there's value in setting H very close to S (the number of servers in the
grid). If S=20 and H=18, then clearly it's crucial that availability of
individual servers be very high, otherwise the possibility of more than two
servers being down at once is high, and the grid is then unavailable for
writes.
So, why would you want to set H very high, rather than just sticking with
the default 3/7/10 (K/H/N) parameters?
There are two reasons you might want to increase H. The first is to
increase read-reliability and the second is so that you can increase K and
reduce expansion while maintaining a certain level of read-reliability. For
purposes of determining the likelihood that a file will be available at some
point in the future, I ignore N. Setting H and N to different values is
basically saying "I'll accept one level of reliability, but if I happen to
get lucky I'll get a higher one". That's fine, but when determining what
parameters to choose, it's H and K that make the difference. In fact if S
happens to decline so that at the moment of your upload S=H, then any value
of N > H is a waste.
If you want to find out what kinds of reliability you can expect from
different parameters, there's a tool in the Tahoe source tree.
Unfortunately, I haven't done the work to make it available from the web
UI, but if you want you can use it like this:
1. Go to the tahoe/src directory.
2. Run python without any command-line arguments to start the Python
interpreter.
3. Type "import allmydata.util.statistics as s" to import the statistics
module and give it a handy label (s).
4. Type "s.pr_file_loss([p]*H, K)", where "p" is the server reliability,
and H and K are the values you want to evaluate.
What value to use for p? Well, ideally it's the probability that the data
on the server will _not_ become lost before your next repair cycle. To be
conservative, I just use the server _availability_ target, which I'm
proposing is 0.95.
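Putting those steps together, a quick sketch of such a session might look
like this (assuming your Tahoe checkout is importable; the 3-of-7 values are
just an example, not a recommendation):

    >>> import allmydata.util.statistics as s
    >>> p = 0.95                    # proposed per-server availability target
    >>> K, H = 3, 7                 # example parameters to evaluate
    >>> s.pr_file_loss([p] * H, K)  # probability of losing the file before the next repair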
The value you get is an estimate of the likelihood that your file will be
lost before the next repair cycle. If you want to understand how it's
calculated and maybe argue with me about its validity, read my lossmodel
paper (in the docs dir). I think it's a very useful figure.
However, unless you're only storing one file, it's only part of the story.
Suppose you're going to store 10,000 files. On a sufficiently large grid
(which volunteergrid2 will not be), you can model the survival or failure of
each file independently, which means the probability that all of your files
survive is "(1-s.pr_file_loss([p]*H, K))**10000". Since volunteergrid2 will
not be big enough for the independent-survival model to be accurate, the
real estimate falls somewhere between that figure and
"1-s.pr_file_loss([p]*H, K)", which is the single-file survival probability.
To be conservative, I pay attention to the lower of the two, which is the
10,000-file number.
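Just to sketch those two bounds in code (again, the parameter values here
are illustrative):

    import allmydata.util.statistics as s

    p, K, H = 0.95, 3, 7                     # illustrative values
    loss = s.pr_file_loss([p] * H, K)        # single-file loss probability

    one_file_survives = 1 - loss             # optimistic end of the range
    all_10000_survive = (1 - loss) ** 10000  # independent-survival model:
                                             # the conservative end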
Anyway, if you use that tool and spend some time playing with different
values of H and K, what you find is that if you increase H you can increase
K and reduce your expansion factor while maintaining your survival
probability. If you think about it, this makes intuitive sense, because
although you're decreasing the amount of redundancy, you're actually
increasing the number of servers that must fail in order for your data to
get lost. With 3/7, if five servers fail, your data is gone. With 7/15,
nine servers must fail. With 35/50, 16 must fail. Of course that's five
out of seven, nine out of 15 and 16 out of 50, but still, with relatively
high availability numbers, the probabilities of those failure counts come
out very close to the same.
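Here's a small sketch that makes the comparison concrete: it evaluates the
same pr_file_loss figure for those three parameter sets at p=0.95, along
with the failure threshold (H-K+1) and the expansion factor if exactly H
shares end up stored (H/K):

    import allmydata.util.statistics as s

    p = 0.95
    for K, H in [(3, 7), (7, 15), (35, 50)]:
        print("%d/%d: lost only if %d of %d servers fail; expansion %.2f; "
              "P(loss) = %g"
              % (K, H, H - K + 1, H, float(H) / K,
                 s.pr_file_loss([p] * H, K)))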
From a read-performance perspective there's also some value in increasing K,
because it will allow more parallelism of downloads -- at least in theory.
With the present Tahoe codebase that doesn't help as much as it should, but
it will be fixed eventually. (At present, you do download in parallel from
K servers, but all K downloads are limited to the speed of the slowest, so
your effective bandwidth is K*min(server_speeds). If that were fixed, it
would just be the sum of the bandwidth available to the K servers.)
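A toy illustration of that difference, with made-up per-server speeds:

    K = 3
    server_speeds = [5.0, 10.0, 20.0]  # hypothetical Mbit/s for the K servers

    today    = K * min(server_speeds)  # everyone waits for the slowest: 15 Mbit/s
    with_fix = sum(server_speeds)      # sum of individual speeds:       35 Mbit/s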
So, if we can take as a given that larger values of K and H are a good thing
(and I'm happy to go into more detail about why that is if anyone likes;
I've glossed over a lot here), then the best way to choose your parameters
is to, ideally, set H=S and then choose the largest K that gives you the
level of reliability you're looking for.
But if you set H=S, then even a single server being unavailable means that
the grid is unavailable for writes. So you want to set H a little smaller
than S. How much smaller? That depends on what level of server
availability you have, and what level of write-availability you require.
I'd like to have 99% write-availability. If we have a 95% individual server
availability and a grid of 20 servers, the probability that at least a given
number of servers is available at any given moment is:
20 servers: 35.8%
19 servers: 73.6%
18 servers: 92.5%
17 servers: 98.4%
16 servers: 99.7%
15 servers: 99.9%
Again, if anyone would like to understand the way I calculated those, just
ask.
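For the impatient, here is a minimal standalone sketch of the kind of
binomial-tail calculation involved (it assumes the 20 servers fail
independently, each with 95% availability):

    from math import factorial

    def binom(n, k):
        return factorial(n) // (factorial(k) * factorial(n - k))

    def pr_at_least(m, n_servers=20, p=0.95):
        # Probability that at least m of n_servers independent servers are up.
        return sum(binom(n_servers, i) * p**i * (1 - p)**(n_servers - i)
                   for i in range(m, n_servers + 1))

    for m in range(20, 14, -1):
        print("%d servers: %.2f%%" % (m, 100 * pr_at_least(m)))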
At 99.9% availability, if I can't write to the grid it's more likely because
my network connection is down than because there aren't enough servers to
satisfy H=15.
So, that's why I'd really like everyone to commit to trying to maintain 95+%
availability on individual servers. In practice if you have a situation
which takes your box down for a few days, it's not a huge deal, because more
than likely most of the nodes will have >95% availability, but what we don't
want is a situation (like we have over on volunteergrid1) where a server is
unavailable for weeks.
If you can't commit to keeping your node available nearly all the time, I
would rather that you're not in the grid. Sorry if that seems harsh, but I
really want this to be a production grid that we can actually use with very
high confidence that it will always work, for both reads and writes.
Also, sorry for the length of this e-mail :-)
--
Shawn