[tahoe-dev] [tahoe-lafs] #700: have servers publish Bloom filter of which shares they have

Tue May 12 06:44:09 PDT 2009

#700: have servers publish Bloom filter of which shares they have
------------------------------+---------------------------------------------
 Reporter:  warner            |           Owner:           
     Type:  enhancement       |          Status:  new      
 Priority:  major             |       Milestone:  undecided
Component:  code-performance  |         Version:  1.4.1    
 Keywords:                    |   Launchpad_bug:           
------------------------------+---------------------------------------------

Comment(by swillden):

 Very interesting data structure; one I hadn't encountered before.
 According to the Wikipedia writeup, the numbers appear to be a little
 better than what you mentioned, though.  Unless I'm misreading, an optimal
 Bloom filter with a 1% false positive rate should only consume 1.44
 log_2(100) = 0.217 bits per key inserted, so a 1M-entry server would have
 a Bloom filter about 212 KiB in size.

 That's very cool, although it seems like it would only be a win for
 clients who were retrieving a very large number of shares.

 However, it might also be an efficient way for a repairer to check the
 status of a large number of files, assuming it can trust the Bloom filters
 to be accurate (i.e. servers are honest about what they have, modulo false
 positives).  Given the number of servers m, the FEC parameters n and k,
 the false positive rate r and a number of positives j, it's possible to
 calculate the probability p that at least a number t of shares exist.

 if f(k;n,p) is the binomial distribution PMF, then the probability that at
 least t shares exist is 1 - sum([ f(j-i;m-i,r) for i in range(0,t) ]).
 Actually, for numerical accuracy this should be summed over range(t, 0,
 -1), since the values are largest for i=t and can get very small as i->0.

 If p is sufficiently high, then the repairer can assume that no repair is
 needed.  If the probability is too low (if j < t, the probability is
 zero), then the repairer should query the servers whose Bloom filters
 indicate the presence of a share to determine the actual number of
 available shares.  If that is too low, of course, then the repairer should
 repair the file.

 "Probabilistic repair" should probably choose a higher value for t than
 deterministic repair, perhaps even t=n, and of course the threshold for p
 should be pretty stringent.  Still, for smallish networks (m not much
 larger than n) with conservative FEC parameters (n significantly larger
 than k), I think this might be a win, reducing network traffic and share
 server load while allowing for more frequent repair checks.

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/700#comment:1>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid