Opened at 2009-05-11T18:14:32Z
Last modified at 2009-12-24T20:48:48Z
#700 new enhancement
have servers publish Bloom filter of which shares they have
Reported by: | warner | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | undecided |
Component: | code-storage | Version: | 1.4.1 |
Keywords: | performance repair | Cc: | |
Launchpad Bug: | | | |
Description
It occurred to me that we could shave a round trip off the marginal immutable download time by having each server publish a Bloom filter of which shares they are holding (keyed by storage-index). The server would update this filter in the share-crawler loop, so a large server (like the allmydata.com servers, with 1-3M objects each) would probably build a new one every few hours. Clients could lazily download a copy in the background. At file-download time, the clients could consult the Bloom filter instead of asking the servers directly, and assume that a hit in the filter means that the server actually has the share (and thus start the conversation by asking for the first chunk of data, instead of asking whether they have a share or not).
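To make the mechanism concrete, here is a minimal sketch (not Tahoe code) of the filter a server might build during a share-crawler pass, plus the membership test a client would run before its first request. The `ShareBloomFilter` class, the SHA-256 double-hashing scheme, and the example storage-index values are illustrative assumptions, not anything the protocol currently specifies.

```python
import hashlib
import math

class ShareBloomFilter:
    """Illustrative Bloom filter keyed by storage-index (bytes)."""

    def __init__(self, expected_entries, false_positive_rate=0.01):
        n = max(1, expected_entries)
        # standard sizing: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hash functions
        self.m = int(math.ceil(-n * math.log(false_positive_rate) / math.log(2) ** 2))
        self.k = max(1, int(round(self.m / n * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, storage_index):
        # derive k bit positions from a single SHA-256 via double hashing
        digest = hashlib.sha256(storage_index).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, storage_index):
        for pos in self._positions(storage_index):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, storage_index):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(storage_index))

# server side, in the crawler loop: add every storage-index that holds a share
server_filter = ShareBloomFilter(expected_entries=1000000)
server_filter.add(b"si-example-0001")

# client side, after lazily downloading the filter: skip servers that miss,
# and for hits, send the first data request instead of a have-you-got-it query
if b"si-example-0001" in server_filter:
    pass
```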
An optimal Bloom filter with a 1% error rate requires about 9.6 bits per element, so a 1M-share server would need about 1.2 MB for its filter. Less-full servers would require considerably less space to hold the filter. Clients then need to decide how they want to optimize their bandwidth: downloading a few MB of filters ahead of time, or spending a roundtrip to find out which server has the shares of interest.
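A quick check of that sizing arithmetic, using the standard bits-per-entry formula (this is just the math from the paragraph above, not code from the tree):

```python
import math

p = 0.01                                          # target false-positive rate
bits_per_entry = -math.log(p) / math.log(2) ** 2  # ~9.59 bits per element
shares = 1000000
print(round(bits_per_entry, 2),                   # -> 9.59 bits/entry
      round(shares * bits_per_entry / 8 / 1e6, 2))  # -> ~1.2 MB for 1M shares
```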
If share density is high (i.e. most servers accepted the shares during the initial upload, and there has not been much server churn since then), it might make more sense to skip the filter and just ask the servers to start sending data (or to report that they have none). The Bloom filter is most useful when it is expensive to talk to a server that turns out not to have a share.
In either case, when we redesign the immutable-share retrieval protocol, we'll probably want a cheap way to start fetching share data as quickly as possible, maybe something like `readv(storageindex, spanvectors, shares_to_ignore) -> dict(shnums to spans), allshnums`.
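As a rough illustration of that call shape (the names and the in-memory layout are made up for the example; this is not the real storage-server interface), the sketch below both reports which shares the server holds and returns the requested spans in the same round trip:

```python
def readv(shares, storage_index, spanvectors, shares_to_ignore):
    """shares maps storage_index -> {shnum: share_bytes} for this server.

    Returns ({shnum: [bytes for each requested (offset, length) span]},
             set of all shnums held), so a single call answers the
    which-shares-do-you-have question and starts delivering data for the
    shares the caller did not ask to skip.
    """
    held = shares.get(storage_index, {})
    data = {}
    for shnum, blob in held.items():
        if shnum in shares_to_ignore:
            continue
        data[shnum] = [blob[offset:offset + length]
                       for (offset, length) in spanvectors]
    return data, set(held)

# example: fetch the first 32 bytes of every share except shnum 2
data, all_shnums = readv({b"si-1": {0: b"x" * 100, 2: b"y" * 100}},
                         b"si-1", [(0, 32)], {2})
```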
Change History (6)
comment:1 Changed at 2009-05-12T13:44:09Z by swillden
Very interesting data structure; one I hadn't encountered before. According to the Wikipedia writeup, the numbers appear to be a little better than what you mentioned, though. Unless I'm misreading, an optimal Bloom filter with a 1% false positive rate should only consume 1.44 log_2(100) = 0.217 bits per key inserted, so a 1M-entry server would have a Bloom filter about 212 KiB in size.
That's very cool, although it seems like it would only be a win for clients who were retrieving a very large number of shares.
However, it might also be an efficient way for a repairer to check the status of a large number of files, assuming it can trust the Bloom filters to be accurate (i.e. servers are honest about what they have, modulo false positives). Given the number of servers m, the FEC parameters n and k, the false positive rate r and a number of positives j, it's possible to calculate the probability p that at least a number t of shares exist.
If f(k; n, p) is the binomial distribution PMF, then the probability that at least t shares exist is 1 - sum([ f(j-i; m-i, r) for i in range(0, t) ]). Actually, for numerical accuracy this should be summed over range(t, 0, -1), since the values are largest for i=t and can get very small as i->0.
If p is sufficiently high, then the repairer can assume that no repair is needed. If the probability is too low (if j < t, the probability is zero), then the repairer should query the servers whose Bloom filters indicate the presence of a share to determine the actual number of available shares. If that is too low, of course, then the repairer should repair the file.
"Probabilistic repair" should probably choose a higher value for t than deterministic repair, perhaps even t=n, and of course the threshold for p should be pretty stringent. Still, for smallish networks (m not much larger than n) with conservative FEC parameters (n significantly larger than k), I think this might be a win, reducing network traffic and share server load while allowing for more frequent repair checks.
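A direct transcription of that estimate into runnable form (hand-rolled binomial PMF via `math.comb`; this just encodes the sum as written in the comment, not a vetted formula, and it assumes j <= m):

```python
import math

def binom_pmf(x, n, p):
    """Binomial PMF f(x; n, p)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def prob_at_least_t_shares(m, j, r, t):
    """m servers, j Bloom-filter positives, false-positive rate r, threshold t."""
    if j < t:
        return 0.0
    return 1.0 - sum(binom_pmf(j - i, m - i, r) for i in range(t))

# e.g. 10 servers, 9 positives, 1% false-positive rate, need at least 7 shares:
# prob_at_least_t_shares(10, 9, 0.01, 7) is very close to 1.0
```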
comment:2 Changed at 2009-05-19T00:46:00Z by warner
Oh! Good one, yes, a repair-agent can certainly take advantage of this.
The 9.6-bits-per-key number came from Wikipedia too.. I'll have to reread the page to see which number is more accurate.
comment:3 Changed at 2009-12-04T04:57:53Z by davidsarah
- Component changed from code-performance to code-storage
- Keywords performance added
comment:4 Changed at 2009-12-13T02:28:18Z by davidsarah
- Keywords repair added
comment:5 Changed at 2009-12-13T05:02:25Z by zooko
I always use Norm Hardy's pages on Bloom Filters, which have a very clear explanation of how to calculate the size:
http://www.cap-lore.com/code/BloomTheory.html
(Also I very much like Norm Hardy as a scientist and a person.)
I like Shawn's observation, which I interpret as: A checker (unlike a downloader) is typically interested in lots and lots of shares at once. If the checker has a pregenerated manifest of files that he wants to check on (as opposed to if he is actively traversing a directory structure and checking on files as he finds them), then he might have a list of lots of shares in hand, and want to use this Bloom Filter trick to download one Bloom Filter from each storage server and then probabilistically find out which of them are likely to be all right without doing any further transactions -- sweet! Oh, and if he didn't start with a pregenerated manifest, it is still a win to traverse the directory structure, and accumulate the verify caps of all the files in the first pass and then do the Bloom Filter check in the second pass. Interesting!
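A sketch of that two-pass check, assuming each server's filter has already been downloaded and supports Python's `in` operator (like the illustrative `ShareBloomFilter` earlier in this ticket); the helper names are invented:

```python
def servers_probably_holding(manifest, server_filters):
    """Map each storage-index in the manifest to the servers whose published
    filter claims a share for it (false positives are possible)."""
    return {si: [name for name, bloom in server_filters.items() if si in bloom]
            for si in manifest}

def files_needing_a_closer_look(manifest, server_filters, threshold):
    """Storage-indexes for which fewer than `threshold` servers claim a share;
    only these need real queries (or repair)."""
    hits = servers_probably_holding(manifest, server_filters)
    return [si for si, servers in hits.items() if len(servers) < threshold]
```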
By the way, you can always add new elements into a Bloom Filter cheaply, so storage servers which have a Bloom Filter already built when they receive a new share can instantly add that new share into the filter. You can't remove items from a Bloom Filter. I guess the only time storage servers remove shares from disk is when they do a garbage collection pass over their entire store, so I guess the thing to do is when you start a garbage collection pass you also throw away your Bloom Filter and start building a new one, adding in every share which you don't collect, and then the new Bloom Filter will be ready at the end of your garbage collection pass.
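That lifecycle might look something like the sketch below, reusing the illustrative `ShareBloomFilter` class from earlier in this ticket (names invented; as a small variation on the comment, the old filter stays in service until the replacement built during the GC pass is finished):

```python
class ServerFilterState:
    """Illustrative bookkeeping for a server's published Bloom filter."""

    def __init__(self, expected_entries):
        self.live = ShareBloomFilter(expected_entries)  # the filter clients download
        self.rebuilding = None

    def share_written(self, storage_index):
        # adding is always cheap, so new shares go into the filter immediately
        self.live.add(storage_index)
        if self.rebuilding is not None:
            self.rebuilding.add(storage_index)

    def gc_started(self, expected_entries):
        # deletions can't be reflected in a Bloom filter, so start a fresh one
        self.rebuilding = ShareBloomFilter(expected_entries)

    def share_kept(self, storage_index):
        # called for every share the GC pass decides not to collect
        self.rebuilding.add(storage_index)

    def gc_finished(self):
        self.live, self.rebuilding = self.rebuilding, None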
comment:6 Changed at 2009-12-24T20:48:48Z by warner
One approach we've discussed a lot is for an occasionally-connected client (who holds a rootcap) to generate a list of currently-active storage-index values (known in the code as a "manifest"), and deliver it to an always-connected maintenance node, like a checker/verifier/repairer/lease-renewer agent. The agent would take on responsibility for the files indicated by the manifest: updating their leases, repairing as necessary. So the "pregenerated manifest of files" is actually a pretty reasonable thing to use.
There are actually two directions in which a Bloom filter might get used. We've discussed the first, in which the server generates the filter and publishes it for use by clients: this is most useful to accelerate the DYHB ("Do You Have Block") query, which is sent at the start of a download and is the only query the Checker sends. For download, we can tolerate the filter's false-positive rate because we're going to ask more questions later (like fetching the actual share data), so a false positive merely causes a minor performance hit. For the Checker, we have to be more careful about the percentages, because a false positive makes a file look healthier than it really is.
(Incidentally, we should take a step back and think about what sorts of failures we're anticipating here.. our current servers don't just delete shares on a whim, and we believe that disk errors tend to take out the whole disk rather than individual shares, so it seems unlikely that these DYHB queries will ever return different answers from one day to the next; the Checker is far more likely to see a whole server go offline than an individual share disappear. I've been looking for an excuse to use a Bloom filter for years now, but I shouldn't let that desire push me into wasting time on building something that won't actually be of much use.)
The second direction for using a Bloom filter would be to take the client's manifest and send it to the storage server, saying "please do something with any shares that match this list". This wouldn't be useful for a checker, but it could be used by a slow lease-updater process (one in which a share-crawler had a list of outstanding per-account Bloom filters, with instructions to add/renew a lease on anything that matched). OTOH, it would probably be easier to have a share-to-account (one-to-many) mapping table on each server, and have the client renew a "lease" on the account in general. (This is the scheme we've discussed before, in which each client sends one message per storage server per renewal period, instead of one per (SS*share*period), which would be an awful lot of messages.)
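For that client-to-server direction, a crawler-side sketch might look like this (the per-account filter table and the `renew_lease` callback are assumptions for the example, not an existing interface):

```python
def lease_renewal_pass(on_disk_storage_indexes, account_filters, renew_lease):
    """account_filters maps account-id -> the Bloom filter that account's
    client uploaded; any share matching a filter (including the occasional
    false positive) gets its lease for that account renewed."""
    for si in on_disk_storage_indexes:
        for account_id, bloom in account_filters.items():
            if si in bloom:
                renew_lease(account_id, si)
```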