[volunteergrid2-l] Recommended settings
Shawn Willden
shawn at willden.org
Thu Jun 30 09:25:41 PDT 2011
On Wed, Jun 29, 2011 at 11:36 PM, Brad Rupp <bradrupp at gmail.com> wrote:
> On 6/29/2011 5:12 PM, Shawn Willden wrote:
>
>> which would have prevented the worst of the allmydata problem.
>>
>
> What was the allmydata problem? The reason I ask is that I don't want to
> have a "problem" with the data in VG2.
Allmydata.com was a commercial venture selling cloud storage using
Tahoe-LAFS, and was the company that funded all of the initial Tahoe
development. There was (is, actually, though it needs maintenance) a fairly
nice Windows client that provided Dropbox-like functionality. On the
backend, allmydata had a large number of servers hosted in a couple of data
centers. They had hundreds of nodes in their grid by the end. They were
using the default encoding parameters, N=10, K=3. I'm not sure if H (the
servers-of-happiness parameter) existed at the time, but it wasn't really
relevant, because there were always more than 10 nodes available.
As they scaled up and added more nodes, they began to suffer from more
hardware failures, as is inevitable: if you have hundreds of machines, a
few are going to be broken at any given time. In many cases the broken
machines still held their data, but while they were down it was
unavailable. So whenever eight or more machines were down, any file that
happened to have eight or more of its ten shares on those machines became
unavailable, because fewer than K=3 of its shares remained reachable.
Given that allmydata was hosting billions (trillions?) of files for
thousands (tens of thousands?) of people, with shares spread across all
those machines, almost any set of 8 down nodes jointly held too many
shares of at least one file. I think this may have been compounded by some
repairer bugs (my memory is hazy, and I don't think
allmydata ever provided a complete post-mortem anyway).
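To put rough numbers on this, here's a back-of-the-envelope sketch in
Python. The grid size, the file count, and the assumption that share
placements are independent are all mine for illustration, not allmydata's
actual figures:

    from math import comb

    def p_file_unavailable(T, down, N=10, K=3):
        # A file's N shares sit on N distinct nodes out of T total. It
        # becomes unreadable when fewer than K shares are reachable,
        # i.e. when more than N - K shares land on the offline nodes.
        # Hypergeometric tail over the number of offline shares j:
        return sum(comb(down, j) * comb(T - down, N - j)
                   for j in range(N - K + 1, N + 1)) / comb(T, N)

    p = p_file_unavailable(T=200, down=8)  # ~8e-13: any one file is safe
    n_files = 10**12                       # but with a trillion files...
    print(1 - (1 - p) ** n_files)          # ~0.56: some file is likely dark

The per-file odds look fantastic; it's the sheer number of files that
kills you.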
What made this really problematic was that some of those unavailable files
were dirnodes. There are a lot of ways to handle your Tahoe storage, but
the most common is to have a single directory tree per user. If the dirnode
storing a user's root directory is unavailable, then _all_ of that user's
files are unavailable. By design, without the cap for a file, it's not even
possible to find the file data, and you couldn't decrypt it if you did.
Without the dirnode which stores the caps, the files in the directory are
essentially _gone_, even if all of the bits are present.
Most directory trees also end up being fairly deep, and losing any dirnode
in the chain means that every directory and every file below that dirnode is
gone.
The net result was that allmydata's users increasingly couldn't retrieve
their data. I suspect that the number of actual cases was quite small, but
if you put yourself in such a user's shoes you can imagine how angry you'd
be. Not only is your precious data gone (even if allmydata says they'll
have it back Real Soon Now), but this is a commercial service that
advertised extreme reliability, for which you've paid good money. Not a lot
of money, but enough that you really think you deserve what you paid for.
What do you do? Complain on every forum you can reach, of course.
I think allmydata's business was running very close to the edge financially
anyway -- that's pure speculation, but strongly supported by the way they
were letting their technical staff go even as the staff was fighting
increasing technical problems. I'm sure that funding issues played a big
role in their inability to keep machines up and running, and I suspect the
problems were exacerbated by the older machines failing first: the
machines that were in the grid when it was smaller, and which therefore
held shares of a disproportionate number of the earliest files users
created. The first file a user creates is their root dirnode.
All of this created a death spiral. Worsening finances exacerbated
technical difficulties. Technical difficulties worsened client relations.
Failing client relations further lowered revenues. Eventually the company
failed.
The part of the story that's relevant to us is the nature of the technical
difficulties. If N is much smaller than T (the total number of storage
nodes on the grid), then it becomes not just possible but _likely_ that a
small fraction of the nodes being down makes some files unavailable.
From a statistical perspective, if you have T >> N, then the standard
model accurately calculates the reliability of any given file. But if you
have per-file reliability r and n files, the probability that _all_ of
your files are available (which I call "total reliability") approaches
r**n. Even if r is something like 99.999%, as n gets big r**n can get
unacceptably small. Note that T has to be much larger than N before the
total reliability really approaches r**n; the closer N is to T, the closer
total reliability stays to r.
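The collapse is easy to see with a few illustrative values of n:

    # Total reliability r**n: even "five nines" per-file reliability
    # collapses once n gets large.
    r = 0.99999
    for n in (10**3, 10**5, 10**6):
        print(n, r ** n)  # ~0.990, ~0.368, ~4.5e-05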
If you think a little about that analysis, you'll see there are two
solutions. The first is to make sure that however many nodes there are,
the shares for each file are on nearly all of them. Ideally, if all of
your files have shares on every node on the grid, then all of your files
live or die together, and total reliability equals r, derived directly
from the server reliabilities, K and N. Note that with this solution
dirnodes don't need higher reliability, because they're exactly as likely
to go away as individual files.
The second is to increase r to apparently-insane levels, like 1 - 1E-15
or even higher. This ensures that r**n stays acceptably close to 1 as n
grows. How do you increase r? By increasing N. Simply increasing N
without adjusting K increases N/K -- the expansion factor. But it turns
out that if you increase N and K together, you can get higher reliability
with a lower expansion factor; the erasure coding becomes more effective
at maximizing reliability while minimizing expansion as you increase N. At
some point network overheads become problematic, but ignoring those,
bigger N is always better.
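Here's a sketch of that tradeoff using the simple binomial model with
independent servers; the 90% per-server availability is an assumed number
for illustration:

    from math import comb

    def file_reliability(N, K, p):
        # P(at least K of N shares sit on available servers), assuming
        # each server is independently available with probability p.
        return sum(comb(N, i) * p**i * (1 - p)**(N - i)
                   for i in range(K, N + 1))

    p = 0.9  # assumed per-server availability
    for N, K in [(10, 3), (30, 10)]:
        r = file_reliability(N, K, p)
        print(f"N={N:2d} K={K:2d} expansion={N/K:.2f} P(loss)={1 - r:.1e}")
    # N=10 K=3  -> expansion 3.33, P(loss) ~3.7e-07
    # N=30 K=10 -> expansion 3.00, P(loss) ~5.8e-15

Tripling N and K actually lowers the expansion factor slightly while
buying roughly eight orders of magnitude of reliability.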
If you allow T >> N and try to ramp up r, it also makes sense to increase
r for dirnodes even more, because of the "link in a chain" effect: break a
dirnode and everything reachable from it is lost.
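The compounding is easy to quantify (illustrative numbers again):

    # A file at depth d is reachable only if every dirnode above it
    # survives: reachability ~= r_dir**d * r_file, so depth multiplies
    # the failure odds of a single dirnode.
    r_dir = r_file = 1 - 1e-6  # assumed per-object reliability
    for d in (1, 5, 20):
        print(d, 1 - r_dir**d * r_file)  # ~2e-06, ~6e-06, ~2.1e-05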
So, from a mathematical perspective, allmydata's problem was that they
didn't model the exponential decay of total reliability as the number of
files grew, and they didn't consider the impact on reliability of the
linking effect of dirnodes.
--
Shawn