[tahoe-dev] Faster share retrieval by using probabilistic set-testing ?

Johannes Zafolo at gmx.net
Mon Mar 5 07:30:57 UTC 2012


Hello,

As I am trying to understand performance issues, I would 
like to make a suggestion. I hope it isn't too stupid...

In the class StorageServer, there is a method for finding shares:

###########################
    def _get_bucket_shares(self, storage_index):
        """Return a list of (shnum, pathname) tuples for files that hold
        shares for this storage_index. In each tuple, 'shnum' will
    always be
        the integer form of the last component of 'pathname'."""
        storagedir = os.path.join(self.sharedir,
    storage_index_to_dir(storage_index))
        try:
            for f in os.listdir(storagedir):
                if NUM_RE.match(f):
                    filename = os.path.join(storagedir, f)
                    yield (int(f), filename)
        except OSError:
            # Commonly caused by there being no buckets at all.
            pass
###########################

Basically, this seems to check whether a storage_index matches 
members of a set of stored shares.

As far as I understand, if a k/N share is retrieved, the client has to
ask each of at least k storage nodes whether they possess matching
shares. Now, the simplest way is just to look whether a file name with
the share key exists, which seem to happen above. This consumes a little
bit time, as disk access is relatively slow.
As a possible alternative, there exists an efficient algorithm 
called "Bloom Filter" which is able to tell very fast whether a set
member is *probably* there, this works by implementing a probabilistic
set which can be held in memory and is much smaller than the actual
list of shares.

Because it is probabilistic, it can happen that there
are false positives (i.e. in 1 % of cases a share is guessed to be
there, but actually is not on disk), however there are
never false negatives (if the algorithm tells that
the share isn't there, this is always the truth). Because
99 % of disk accesses can be avoided, this could speed up things. 
The name of the algorithm is "Bloom filter" though its in reality a kind
of set, there is a well-written Wikipedia Article about it.  

http://en.wikipedia.org/wiki/Bloom_Filter

They have also been described by Don Knuth (TAOCP, Errata of Vol. 3).

Bloom filters are used in network caches like squid. 
Because the algorithm is based on hashing, it is also
feasible that the sets data gets transferred to the clients
without disclosure of confidential information; In 
other words, in a huge grid, clients could look up 
themselves which server will probably have shares.

An especially interesting property of these sets for the purpose
of caching is that it is easy and very efficient to combine 
them. One can say: give me the union set of the shares refreshed
this month and the shares refreshed last month
and the shares refreshed two months ago.

If there is any interest, I can implement it easily in
Python, using the bitarray module it's not more than a few lines...

Johannes


More information about the tahoe-dev mailing list