[tahoe-dev] [tahoe-lafs] #700: have servers publish Bloom filter of which shares they have
tahoe-lafs
trac at allmydata.org
Mon May 11 11:14:33 PDT 2009
#700: have servers publish Bloom filter of which shares they have
------------------------------+---------------------------------------------
Reporter: warner | Owner:
Type: enhancement | Status: new
Priority: major | Milestone: undecided
Component: code-performance | Version: 1.4.1
Keywords: | Launchpad_bug:
------------------------------+---------------------------------------------
It occurred to me that we could shave a round trip off the marginal
immutable download time by having each server publish a
[http://en.wikipedia.org/wiki/Bloom_filter Bloom filter] of which shares
they are holding (keyed by storage-index). The server would update this
filter in the share-crawler loop, so a large server (like the
allmydata.com servers, with 1-3M objects each) would probably build a new
one every few hours. Clients could lazily download a copy in the
background. At file-download time, the clients could consult the Bloom
filter instead of asking the servers directly, and assume that a hit in
the filter means that the server actually has the share (and thus start
the conversation by asking for the first chunk of data, instead of asking
whether they have a share or not).
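To make the mechanism concrete, here is a minimal Bloom filter sketch keyed by storage-index bytes. This is illustrative only: the class name, hashing scheme, and parameters are assumptions, not the actual Tahoe-LAFS wire format. The key property the scheme relies on is that the filter has no false negatives, so a miss lets the client skip the server entirely.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch (hypothetical, not Tahoe's format).

    The server would add() every storage-index it holds during the
    share-crawler loop; clients test membership before spending a
    round trip on that server.
    """

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # No false negatives: if any bit is clear, the key was
        # definitely never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

A client holding a recent copy of a server's filter would do `if storage_index in bloom: ask_for_first_chunk(server)` and otherwise skip that server, at the cost of occasionally (filter error rate) asking a server that turns out not to have the share.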
An optimal Bloom filter with a 1% false-positive rate requires about 9.6
bits per element, so a 1M-share server would need about 1.2MB for its
filter; less-full servers would need considerably less space. Clients then
need to decide how they want to optimize their
bandwidth: downloading a few MB of filters ahead of time, or spending a
roundtrip to find out which server has the shares of interest.
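The 9.6-bits figure falls out of the standard Bloom filter sizing formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2. A quick check of the numbers in the paragraph above (the function name is just for illustration):

```python
import math

def bloom_size(n_elements, fp_rate):
    """Optimal Bloom filter size (bits) and hash count for a target
    false-positive rate: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2."""
    bits = -n_elements * math.log(fp_rate) / (math.log(2) ** 2)
    hashes = (bits / n_elements) * math.log(2)
    return math.ceil(bits), round(hashes)

bits, hashes = bloom_size(1_000_000, 0.01)
print(bits / 1_000_000)   # ~9.59 bits per element
print(bits / 8 / 1e6)     # ~1.2 MB total
print(hashes)             # 7 hash functions
```

So a server with 3M shares (the high end of the allmydata.com range mentioned above) would publish roughly a 3.6MB filter at the same error rate.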
If share density is high (i.e. most servers accepted the shares during the
initial upload, and there has not been much server churn since then), it
might make more sense to skip the filter and just ask the servers to start
sending data (or report that they have none). The Bloom filter would be
most useful if it is expensive to talk to a server that does not end up
having a share.
In either case, when we redesign the immutable-share retrieval protocol,
we'll probably want to pay attention to having a cheap way to start
fetching share data as quickly as possible, maybe something like
{{{
readv(storageindex, spanvectors, shares_to_ignore)
  -> dict(shnums to spans), allshnums
}}}
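One way that combined query+read call could look on the server side, sketched in Python against an in-memory stand-in for on-disk shares (all names and shapes here are guesses from the signature above, not a real API):

```python
def readv(storageindex, spanvectors, shares_to_ignore, shares):
    """Sketch of the proposed one-round-trip retrieval call.

    shares: dict mapping shnum -> bytes, standing in for the shares a
    server holds under this storage index.
    spanvectors: list of (offset, length) pairs to read from each share.

    Returns (data, allshnums): data maps each held, non-ignored shnum
    to the requested spans, and allshnums lists every shnum present,
    so the client learns what exists and gets its first chunk of data
    in a single exchange.
    """
    allshnums = sorted(shares)
    data = {}
    for shnum, blob in shares.items():
        if shnum in shares_to_ignore:
            continue
        data[shnum] = [blob[off:off + length] for off, length in spanvectors]
    return data, allshnums
```

The point of `shares_to_ignore` is that a client which has already fetched shnum 3 from another server can skip re-downloading it while still learning (via `allshnums`) that this server holds it.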
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/700>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid