[tahoe-dev] [tahoe-lafs] #700: have servers publish Bloom filter of which shares they have

Mon May 11 11:14:33 PDT 2009

#700: have servers publish Bloom filter of which shares they have
------------------------------+---------------------------------------------
 Reporter:  warner            |           Owner:           
     Type:  enhancement       |          Status:  new      
 Priority:  major             |       Milestone:  undecided
Component:  code-performance  |         Version:  1.4.1    
 Keywords:                    |   Launchpad_bug:           
------------------------------+---------------------------------------------
 It occurred to me that we could shave a round trip off the marginal
 immutable download time by having each server publish a
 [http://en.wikipedia.org/wiki/Bloom_filter Bloom filter] of which shares
 they are holding (keyed by storage-index). The server would update this
 filter in the share-crawler loop, so a large server (like the
 allmydata.com servers, with 1-3M objects each) would probably build a new
 one every few hours. Clients could lazily download a copy in the
 background. At file-download time, the clients could consult the Bloom
 filter instead of asking the servers directly, and assume that a hit in
 the filter means that the server actually has the share (and thus start
 the conversation by asking for the first chunk of data, instead of asking
 whether they have a share or not).

 An optimal Bloom filter with a 1% error rate requires about 9.6 bits per
 element, so a 1M-share server would require about 1.2MB for its filter.
 Less full servers would require considerably less space to hold the
 filter. Clients then need to decide how they want to optimize their
 bandwidth: downloading a few MB of filters ahead of time, or spending a
 roundtrip to find out which server has the shares of interest.

 If share density is high (i.e. most servers accepted the shares during the
 initial upload, and there has not been much server churn since then), it
 might make more sense to skip the filter and just ask the servers to start
 sending data (or report that they have none). The bloom filter would be
 most useful if it is expensive to talk to a server that does not end up
 having a share.

 In either case, when we redesign the immutable-share retrieval protocol,
 we'll probably want to pay attention to having a cheap way to start
 fetching share data as quickly as possible, maybe something like
 {{{readv(storageindex, spanvectors, shares_to_ignore) -> dict(shnums to
 spans), allshnums}}}

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/700>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid