[tahoe-dev] [tahoe-lafs] #700: have servers publish Bloom filter of which shares they have
tahoe-lafs
trac at allmydata.org
Tue May 12 06:44:09 PDT 2009
#700: have servers publish Bloom filter of which shares they have
------------------------------+---------------------------------------------
Reporter: warner | Owner:
Type: enhancement | Status: new
Priority: major | Milestone: undecided
Component: code-performance | Version: 1.4.1
Keywords: | Launchpad_bug:
------------------------------+---------------------------------------------
Comment(by swillden):
Very interesting data structure; one I hadn't encountered before.
According to the Wikipedia writeup, the numbers appear to be a little
better than what you mentioned, though. Unless I'm misreading, an optimal
Bloom filter with a 1% false positive rate should only consume 1.44
log_2(100) = 0.217 bits per key inserted, so a 1M-entry server would have
a Bloom filter about 212 KiB in size.
That's very cool, although it seems like it would only be a win for
clients who were retrieving a very large number of shares.
However, it might also be an efficient way for a repairer to check the
status of a large number of files, assuming it can trust the Bloom filters
to be accurate (i.e. servers are honest about what they have, modulo false
positives). Given the number of servers m, the FEC parameters n and k,
the false positive rate r and a number of positives j, it's possible to
calculate the probability p that at least a number t of shares exist.
if f(k;n,p) is the binomial distribution PMF, then the probability that at
least t shares exist is 1 - sum([ f(j-i;m-i,r) for i in range(0,t) ]).
Actually, for numerical accuracy this should be summed over range(t, 0,
-1), since the values are largest for i=t and can get very small as i->0.
If p is sufficiently high, then the repairer can assume that no repair is
needed. If the probability is too low (if j < t, the probability is
zero), then the repairer should query the servers whose Bloom filters
indicate the presence of a share to determine the actual number of
available shares. If that is too low, of course, then the repairer should
repair the file.
"Probabilistic repair" should probably choose a higher value for t than
deterministic repair, perhaps even t=n, and of course the threshold for p
should be pretty stringent. Still, for smallish networks (m not much
larger than n) with conservative FEC parameters (n significantly larger
than k), I think this might be a win, reducing network traffic and share
server load while allowing for more frequent repair checks.
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/700#comment:1>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid
More information about the tahoe-dev
mailing list