[tahoe-dev] Estimating reliability
Shawn Willden
shawn-tahoe at willden.org
Mon Jan 12 21:09:09 PST 2009
On Monday 12 January 2009 08:17:33 pm Brian Warner wrote:
> Well, what I think we'll wind up with is a table (parameterized on P_i and
> A), that says "if there are 10 shares left, there's a 99.99x% chance of the
> file surviving for another A seconds",
I'm not sure I see how the table would work, if you're allowing the storage
servers to have different p_i. However, the computation is very fast for
small share sets (less than a few hundred), so you don't really need a lookup
table.
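For what it's worth, the "at least k of N shares survive" probability with heterogeneous p_i is just the tail of a Poisson binomial distribution, and the standard dynamic program computes it in O(N^2). A quick sketch (hypothetical helper, not actual Tahoe code):

```python
def survival_probability(p, k):
    """Probability that at least k shares survive, given per-share
    survival probabilities p (tail of a Poisson binomial)."""
    # dist[j] = probability that exactly j shares survive so far
    dist = [1.0]
    for pi in p:
        new = [0.0] * (len(dist) + 1)
        for j, prob in enumerate(dist):
            new[j] += prob * (1.0 - pi)   # this share is lost
            new[j + 1] += prob * pi       # this share survives
        dist = new
    return sum(dist[k:])
```

For a few hundred shares this runs in well under a millisecond, which is why I don't think a lookup table is needed.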
> If P_i falls, or if A increases, then we may have to increase R to achieve
> the desired survivability, if it's even possible.
If increasing R doesn't do the job, the repairer can increase N/k. This is
getting to the heart of my point about 't'. Given a concrete survivability
goal and some reasonably-good per-peer reliability estimates, there are a
number of parameters a smart client can adjust in order to make sure that the
goal is met.
Given a sufficiently-complete model, we may be able to automatically optimize
k, N, A and R to minimize a cost function (of storage and bandwidth) while
ensuring that the survivability goal is met. This makes our user-tunable
parameters directly expressive of what the user cares about, instead of
arcane values that require understanding how the system works rather than
what the user wants it to do.
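As a toy illustration of that kind of optimization (assuming, for simplicity, a uniform per-server survival probability p, and using only the storage expansion N/k as the cost function; a real version would also weigh bandwidth and repair cost):

```python
from math import comb

def smallest_expansion(p, goal, max_n=20):
    """Brute-force search for the cheapest (k, n) encoding, by storage
    expansion n/k, whose one-period survival probability meets `goal`.
    Hypothetical sketch, not the repairer's actual logic."""
    best = None
    for n in range(1, max_n + 1):
        for k in range(1, n + 1):
            # P(at least k of n shares survive): binomial upper tail
            surv = sum(comb(n, j) * p**j * (1.0 - p)**(n - j)
                       for j in range(k, n + 1))
            if surv >= goal:
                cost = n / k
                if best is None or cost < best[0]:
                    best = (cost, k, n)
    return best  # (expansion, k, n) or None if the goal is unreachable
```

The search space is tiny, so brute force is fine; the interesting work is in making the survivability model good enough to trust the answer.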
> A and P_i would give you a matrix of decay probabilities Md(A,P_i): the
> chance that a file which starts the period with X shares will finish the
> period with Y shares. One side of the diagonal is full of zeros: we can
> only ever lose shares.
Yep, that's what the chain matrix will look like. One thing that occurs to me
is that with variable p_i, you can't really say that a file that starts with 8
shares has a probability x of ending with 6, because the probability depends
on the p_i of the specific 8 peers holding the shares. I can think of a
couple of ways to address that. Hmm.
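One way to see the problem: the "start=8" row of the decay matrix isn't a single row at all; it has to be computed from the p_i of the particular peers holding the shares. A sketch (hypothetical; the same per-share dynamic program as a Poisson binomial):

```python
def decay_row(p):
    """One row of the decay matrix, computed for the specific peers
    holding the shares: returns dist where dist[y] is the probability
    that exactly y of these len(p) shares survive one period."""
    dist = [1.0]
    for pi in p:
        new = [0.0] * (len(dist) + 1)
        for j, prob in enumerate(dist):
            new[j] += prob * (1.0 - pi)   # share on this peer is lost
            new[j + 1] += prob * pi       # share on this peer survives
        dist = new
    return dist
```

decay_row([0.99] * 8) and decay_row([0.7] * 8) give very different distributions even though both files "start with 8 shares", which is exactly the difficulty.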
> We'll do repair whenever it is desired and possible (i.e. k<=i<R), so we
> construct a second "post-repair" matrix Mpr, in which e.g. Mpr(start=10,
> finish=10) = sum([Md(start=10,finish=i) for i in range(k,R)+[N]]).
I'm not following your notation here.
> It's possible that at some point, the distribution of numbers of
> shares sort of flattens out, and lim(Psurvive(t=x)/Psurvive(t=x-1), x->inf)
> is well-defined (and, I suspect, would approach that limit from above).
That matches my intuition, that in the limit it looks much like a geometric
progression, with r converging to a limit. Hopefully, we'll find out.
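We could check that intuition numerically. Here's a sketch (hypothetical, uniform per-share survival probability p, no repair) that iterates the decay chain and watches the ratio Psurvive(t)/Psurvive(t-1):

```python
from math import comb

def survival_ratios(n, k, p, periods=50):
    """Iterate the decay-only share chain and return the successive
    ratios Psurvive(t)/Psurvive(t-1), to see whether they converge."""
    # T[x][y] = P(start period with x shares, end with y), binomial decay
    T = [[comb(x, y) * p**y * (1.0 - p)**(x - y) if y <= x else 0.0
          for y in range(n + 1)] for x in range(n + 1)]
    dist = [0.0] * (n + 1)
    dist[n] = 1.0                       # start with all n shares
    ratios, prev = [], 1.0
    for _ in range(periods):
        dist = [sum(dist[x] * T[x][y] for x in range(n + 1))
                for y in range(n + 1)]
        alive = sum(dist[k:])           # file recoverable: >= k shares
        ratios.append(alive / prev)
        prev = alive
    return ratios
```

For n=10, k=3, p=0.9 the ratios do settle toward p**k = 0.729, which fits the geometric-progression picture (each surviving share independently lives through a period with probability p, so the tail is dominated by the exactly-k term).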
> So yeah, avoiding all that math and sticking with the worst-case scenario
> instead, we assume that we've got R shares at the start of each period, and
> care about the probability that we wind up with fewer than k at the end. In
> that model, we only care about one row of the transition matrix (the
> start=R row), and we really only care about one number: the chance that
> we'll land in the bad end of that row (finish<k). This is then the same
> from one period to the next, and the long-term probability is easier to
> calculate (and the actual value would be better than this model). I'm not
> sure that this simplified model makes it at all easy to figure out the
> amount of repair work, however.
Me neither. It's also not clear to me now how I thought I could do it with
the threshold-based scheme. Apparently I needed a larger margin to write in.
If R=N, it's easy.
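For that worst-case model the arithmetic really is short: assume every period starts with R shares, take the per-period failure probability f = P(finish < k | start = R), and long-term survival over T periods is (1 - f)**T. A sketch with uniform per-share survival probability p (hypothetical helper):

```python
from math import comb

def worst_case_survival(n_periods, R, k, p):
    """Worst-case model: each period starts with R shares (repair always
    restores to R), so only the start=R row matters. f is the binomial
    lower tail P(fewer than k of R shares survive one period)."""
    f = sum(comb(R, y) * p**y * (1.0 - p)**(R - y) for y in range(k))
    return (1.0 - f) ** n_periods
```

As the quoted text says, the true survival probability is at least this, since real periods often start with more than the worst case assumes.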
> cool stuff, I'm tempted to spend a couple of days playing with this again
> (but really I should be working on code instead..)
Hehe. Yeah, I'm not getting anywhere on the backup client because I'm using
my available cycles playing with this stuff.
Shawn