Version 11 (modified by zooko, at 2010-07-21T15:43:23Z) (diff)


Zooko says:

Different users of Tahoe-LAFS have different desires for "Which servers should I upload which shares to?".

  • wants to upload to a random selection, evenly distributed among servers which are not full; This is, unsurprisingly, what Tahoe v1.5 currently does.
  • Brian has mentioned that an deployment might prefer to have the servers with more remaining capacity receiving more shares, thus "filling up faster" than the servers with less remaining capacity (#872).
  • Kevin Reid wants, at least for one of his use cases, to specify several servers each of which is guaranteed to get at least K shares of each file, in addition to potentially other servers also getting shares.
  • Shawn Willden wants, likewise, to specify a server (e.g. his mom's PC) which is guaranteed to get at least K shares of certain files (the family pictures and movies files).
  • Some people -- I'm sorry I forget who -- have said they want to upload at least K shares to the K fastest servers.
  • Jacob Appelbaum and Harold Gonzales want to specify a set of servers which collectively are guaranteed to have at least K shares -- they intend to use this to specify the ones that are running as Tor hidden services and thus are attack-resistant (but also extra slow-and-expensive to reach). Interestingly the server selection policy on download should be that the K servers which are Tor hidden services should be downloaded from as a last resort.
  • Several people -- again I'm sorry I've forgotten specific attribution -- want to identify which servers live in which cluster or co-lo or geographical area, and then to distribute shares evenly across clusters/colos/geographical-areas instead of evenly across servers.
    • Here's an example of this desire, Nathan Eisenberg asked on the mailing list for "Proximity Aware Decoding":
    • If you have K+1 shares stored in a single location then you can repair after a loss (such as a hard drive failure) in that location without having to transfer data from other locations. This can save bandwidth expenses (since inter-location bandwidth is typically free), and of course it also means you can recover from that hard drive failure in that one location even if all the other locations have been stomped to death by Godzilla.

As I have emphasized a few times, we really should not try to write a super-clever algorithm into Tahoe which satisfies all of these people, plus all the other crazy people that will be using Tahoe-LAFS for other things in the future. Instead, we need some sort of configuration language or plugin system so that each crazy person can customize their own crazy server selection policy. I don't know the best way to implement this yet -- a domain specific language? Implement the above-mentioned list of seven policies into Tahoe-LAFS and have an option to choose which of the seven you want for this upload? My current favorite approach is: you give me a Python function. When the time comes to upload a file, I'll call that function and then use whichever servers it said to use.

Brian says:

Having a function or class to control server-selection is a great idea. The current code already separates out responsibility for server-selection into a distinct class, at least for immutable files (source:src/allmydata/immutable/ Tahoe2PeerSelector). It would be pretty easy to make the uploader use different classes according to a tahoe.cfg option.

However, there are some additional properties that need to be satified by the server-selection algorithm for it to work at all. The basic Tahoe model is that the filecap is both necessary and sufficient (given some sort of grid membership) to recover the file. This means that the eventual downloader needs to be able to find the same servers, or at least have a sufficiently high probability of finding "enough" servers within a reasonable amount of time, using only information which is found in the filecap.

If the downloader is allowed to ask every server in the grid for shares, then anything will work. If you want to keep the download setup time low, and/or if you expect to have more than a few dozen servers, then the algorithm needs to be able to do something better. Note that this is even more of an issue for mutable shares, where it is important that publish-new-version is able to track down and update all of the old shares: the chance of accidental rollback increases when it cannot reliably/cheaply find them all.

Another potential goal is for the download process to be tolerant of new servers, removed servers, and shares that have been moved (possibly as the result of repair or "rebalancing"). Some use cases will care about this, while others may never change the set of active servers and won't care.

It's worth pointing out the properties we were trying to get when we came up with the current "TahoeTwo" algorithm:

  • for mostly static grids, download uses minimal do-you-have-share queries
  • adding one server should only increase download search time by 1/numservers
  • repair/rebalancing/migration may move shares to new places, including servers which weren't present at upload time, and download should be able to find and use these shares, even though the filecap doesn't change
  • traffic load-balancing: all non-full servers get new shares at the same bytes-per-second, even if serverids are not uniformly distributed

We picked the pseudo-random permuted serverlist [dependent on the StorageIndex] to get these properties. I'd love to be able to get stronger diversity among hosts, racks, or data centers, but I don't yet know how to get that and get the properties listed above, while keeping the filecaps small.


If you're using an immutable file upload erasure-coding helper then either you can't use this new feature of a new strategy for share placement or else we have to define some way for the user to communicate their share placement strategy to the helper.


The main ticket:

  • #573 (Allow client to control which storage servers receive shares)

Related tickets:

  • #302 (stop permuting peerlist, use SI as offset into ring instead?)
  • #466 (extendable Introducer protocol: dictionary-based, signed announcements)
  • #467 (change peer-selection to allow introducerless explicit serverlist, alternative backends)
  • #872 (Adjust the probability of selecting a node according to its storage capacity (or other fitness measure))