[volunteergrid2-l] Finally starting to use VG2

Shawn Willden shawn at willden.org
Wed Oct 19 08:15:24 PDT 2011


(Sorry for the flurry of e-mails this morning)

I'm happy to announce that I'm finally starting to actually use VG2 as my
primary backup system.  I think we have the necessary number of
highly-reliable storage servers now to make it eminently usable.  Over the
next month or two (however long it takes), I'll be storing about 200 GB in
the grid.  I'm getting a per-file upload rate of about 130 KB/s, which is
pretty decent.  Assuming that were constant, my data would take about 17
days to upload, but there are pauses between files so I expect it to take
considerably longer.

My Tahoe redundancy settings are 7/12/12.  Since there are currently 13
active nodes, that means each of you should be seeing about 17 KB/s of
traffic from me (130 KB/s * 12/7 = 223 KB/s total upstream traffic from me,
divided by 13 active nodes = ~17 KB/s).

If we can get up to 20 reliable nodes, I'll probably change my settings to
11/17/17, or maybe even 12/18/18.  My rationale for these settings, BTW, is
based on some calculations done using some utility functions embedded in
Tahoe.  To get them, I did:

$ cd tahoe/src
$ python
Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import allmydata.util.statistics as s
>>> s.pr_file_loss([.95]*12, 7)
1.1107789644043027e-05
>>> s.pr_file_loss([.95]*17, 10)
6.3136217078519583e-07
>>> s.pr_file_loss([.95]*17, 11)
9.7284215413383693e-06
>>> s.pr_file_loss([.95]*18, 11)
1.0862151393128548e-06
>>> s.pr_file_loss([.95]*18, 12)
1.5228007433536422e-05
>>> s.pr_file_loss([.95]*18, 13)
0.00017196620536118082

The pr_file_loss() function computes the probability that a single file will
be lost, based on its two arguments:

   - A list of server reliability probabilities (I'm assuming our servers
     are 95% reliable)
   - The number of shares required to reconstruct the file.

The list of server reliabilities actually represents the reliabilities of
the shares deployed, so you should use your shares.happy value.  In earlier
versions of Tahoe-LAFS there was a further complication that multiple shares
could be delivered to one server and happiness still achieved.  It's
possible to construct a probability function that models that, but it's a
little more complicated.  With the newest versions of Tahoe (1.8 and newer,
I think), that shouldn't be an issue.
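For the curious, the number pr_file_loss() returns can be reproduced with a
plain binomial sum, assuming independent server failures.  This is just a
sketch of the math, not Tahoe's actual implementation, and the function name
below is mine:

```python
from math import comb

def file_loss_probability(p_server, n_shares, k_needed):
    """Probability that fewer than k_needed of n_shares survive, where
    each share survives independently with probability p_server.  A file
    is lost exactly when too few shares remain to reconstruct it."""
    return sum(
        comb(n_shares, i) * p_server**i * (1 - p_server)**(n_shares - i)
        for i in range(k_needed)
    )

# Matches s.pr_file_loss([.95]*12, 7) from the session above:
print(file_loss_probability(0.95, 12, 7))  # ~1.11e-05
```

Running this for the 7-of-12 case gives the same 1.11e-05 figure that the
interactive session shows.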

Note that this function also effectively assumes that either you have the
URI of the file or that shares.happy represents all or nearly all the nodes
in the grid.  If shares.happy is a small percentage of the nodes in the grid
then there's another complication because the expected reliability of each
file becomes (arguably) independent of the other files.  Since you typically
don't have the direct URI of a given file, that situation means you have to
consider the possibility that the directory nodes between your "root"
directory and the target file might be lost.  It also means that since you
really want *all* of your files to survive, you should really choose a
per-file reliability target that ensures that the probability of all files
surviving is acceptably high.  But if shares.happy is pretty close to the
total number of nodes in the grid then all of that complexity goes away,
because your files will basically all live or die together.
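To make the "all files must survive" point concrete: if per-file losses
really were independent, survival probabilities compound across files.  The
file count below is made up purely for illustration:

```python
import math

# Per-file loss probability, taken from the 7-of-12 session output above.
p_loss = 1.1107789644043027e-05

# Hypothetical number of files in a large backup (assumed, for illustration).
n_files = 50_000

# If losses were independent, the probability that every file survives:
p_all_survive = (1 - p_loss) ** n_files
print(p_all_survive)  # roughly 0.57 -- uncomfortably low

# A good approximation for small p_loss is exp(-n_files * p_loss):
print(math.exp(-n_files * p_loss))
```

A per-file loss probability that looks tiny can still leave you with only
even odds of keeping everything, which is why the per-file target has to be
chosen with the total file count in mind.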

The downside of setting shares.happy to a number very close to the size of
the grid is that a few nodes being down means you can't upload files
successfully (i.e. poor write-availability).  But that's why we demand high
availability of our storage nodes :-)

One other note:  I was intentionally a little vague about the term
"reliability".  Tahoe devs usually use two terms, "reliability" and
"availability".  "Availability" represents the probability that your file is
available at any given time T in the future, which is dependent on the
availability of the shares needed to recover your file at time T.
"Reliability" represents the probability that your file ever becomes
available at or after some time T in the future.  In other words,
reliability is kind of a limit function of availability.  In practical
terms, file availability means that the servers holding the necessary shares
are up when you look, while reliability means the shares actually still
exist on nodes that will be up sometime.

As a matter of conservatism, I use the expected availability figure as the
expected reliability figure, since barring catastrophe reliability is
strictly higher than availability.  My goal is 99.99% reliability, so I look
for settings that give me pr_file_loss < 1e-4.
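That target translates directly into a small search: for a given grid size
and server reliability, find the largest shares-needed value that keeps the
loss probability under 1e-4.  This sketch uses a hand-rolled binomial sum
rather than Tahoe's own function, and the helper names are mine:

```python
from math import comb

def file_loss_probability(p_server, n_shares, k_needed):
    """P(fewer than k_needed of n_shares survive), independent failures."""
    return sum(
        comb(n_shares, i) * p_server**i * (1 - p_server)**(n_shares - i)
        for i in range(k_needed)
    )

def max_k_under_target(p_server, n_shares, target=1e-4):
    """Largest shares-needed value whose loss probability stays below
    target.  Higher k means less expansion (n/k), so we want the biggest
    k the loss budget allows.  Loss grows with k, so the last passing k
    is the answer."""
    best = None
    for k in range(1, n_shares + 1):
        if file_loss_probability(p_server, n_shares, k) < target:
            best = k
    return best

# With 95%-reliable servers these reproduce the choices discussed above:
print(max_k_under_target(0.95, 12))  # 7  -> the 7/12/12 settings
print(max_k_under_target(0.95, 18))  # 12 -> the 12/18/18 settings
```

For a 17-node spread the same search gives 11, matching the 11/17/17
settings mentioned earlier.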

For more details on availability/reliability and how the computations are
done see my lossmodel paper, which is in the Tahoe source tree (in
docs/proposed/lossmodel.lyx), or at http://goo.gl/UtDeH

-- 
Shawn