#872 new enhancement

Adjust the probability of selecting a node according to its storage capacity (or other fitness measure)

Reported by: davidsarah
Owned by: davidsarah
Priority: major
Milestone: undecided
Component: code-peerselection
Version: 1.5.0
Keywords: performance scalability space-efficiency preservation bandwidth
Cc:
Launchpad Bug:

Description

If the probability of the peer selection algorithm putting a node close to the beginning of the list were proportional to its storage capacity, then that would better tolerate grids with a wide range of node capacities.

With a uniform selection probability, as at present, small-capacity nodes will be expected to receive many requests to store shares that they don't have room for, and to download shares that they don't have.

See http://allmydata.org/pipermail/tahoe-dev/2009-December/003408.html and followups for mailing list discussion.

Change History (20)

comment:1 Changed at 2009-12-27T05:13:22Z by davidsarah

comment:2 follow-up: Changed at 2009-12-27T06:49:50Z by warner

bwahaha, welcome to a big can of worms :)

source:docs/specifications/outline.txt (section 3: "Server Selection Algorithm, filecap format") is worth reading. It points out the requirement that all of the uploader's choices are somehow recorded and made available to the downloader. Or, rather, the downloader's sequence of servers needs to be "well" correlated with the uploader's sequence.

So any upload-time code which is influenced by things like current remaining server space will need a way to record its choices (or the information which went into that choice) in the filecap, so it can influence the downloader in the same way.

That said, choosing servers according to capacity would serve the purpose of filling servers at the same time as opposed to filling them at the same rate. (i.e., all servers become full at the same moment, versus each server sees the same inbound bytes-per-second rate). If all servers had the same capacity, these two options would be identical.

Part of the discussion in #302 is about whether this is good, important, or irrelevant. In general, I think that full-at-the-same-time is good, but I'm not sure it's actually better than fill-at-the-same-rate. I believe that maximum reliability occurs when each file has as many choices for servers as possible, but those options will dwindle over time as servers get full. A system which probabilistically favors some servers over others (based upon capacity or whatever) will have fewer choices to work with.

Hm, I think there's a rigorous argument in there somewhere. The entropy of the server-selection process (given a random storage-index) should be fairly well-defined. A non-probabilistic algorithm will just give you log2 of the number of possible choices. A probabilistic algorithm would be like that, but with each choice weighted by the probability of its selection. (I used to know this stuff, really I did.. I'll look up my old Information Theory textbook when I get home).

With that sort of definition, we could evaluate different algorithms according to how well they maximize that entropy. Moreover, the entropy is a function of the current state of the grid (like how many free servers are left), and that state will evolve in different ways according to the algorithm we choose. So we can further evaluate that entropy over time. Any non-homogeneous grid will see the entropy drop over time, as the grid fills up and the choices dwindle. We could develop a metric to talk about the entropy averaged across all files: maybe the best algorithm is the one that manages the highest average entropy, or perhaps the lowest variance, or something.

A probabilistic selection algorithm will always have lower per-file entropy than a non-probabilistic one, given the same number of potential servers (well, a non-uniform-probability algorithm, to be precise). But if it manages to preserve server availability longer, then the entropy averaged over the life of the grid (from empty to full) might be higher. *That*'s probably the way we should investigate the value of a different algorithm.
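For concreteness, here is a minimal sketch of that weighted-entropy calculation (it treats a single server pick rather than a whole selection sequence, and the function name and weights are made up for illustration): the Shannon entropy of a selection distribution is log2(n) in the uniform case, and strictly smaller for any non-uniform distribution over the same n choices.

from math import log

def selection_entropy(probs):
    # Shannon entropy, in bits, of a distribution over selection outcomes.
    # For a uniform distribution over n outcomes this is log2(n); any
    # non-uniform distribution over the same n outcomes gives less.
    return -sum(p * log(p, 2) for p in probs if p > 0)

# Uniform choice among 15 servers, versus a capacity-weighted choice that
# heavily favours 10 of them:
print(selection_entropy([1.0 / 15] * 15))               # log2(15) ~= 3.91 bits
weights = [10.0] * 10 + [1.0] * 5
total = sum(weights)
print(selection_entropy([w / total for w in weights]))  # ~3.55 bits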

comment:3 in reply to: ↑ 2 ; follow-ups: Changed at 2009-12-27T17:33:40Z by davidsarah

Replying to warner:

I believe that maximum reliability occurs when each file has as many choices for servers as possible, but those options will dwindle over time as servers get full. A system which probabilistically favors some servers over others (based upon capacity or whatever) will have fewer choices to work with.

I agree with the first sentence, but not the second. The "expected full at the same time" property will tend to maximize the number of storage nodes available to accept shares for as long as possible; that's why I believe it is better for file preservation.

There's a fairly straightforward way to change the selection algorithm to achieve this property. Suppose that capacity estimates are multiples of some unit C. If a storage node has capacity estimate e*C, we give it e entries in the list to be permuted (there's a way to make this more efficient; see below). That is, permute the list and then discard duplicates later than the first occurrence of a given node. The effect is similar to splitting each storage server into e virtual nodes that share the same disk space, but with the important difference that the upload algorithm will still try not to put multiple shares on a given server.
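For concreteness, here is a minimal sketch of that list-duplication variant. The names permuted_servers, storage_index, peerids and capacity_estimates, and the use of SHA-1 as the permutation hash, are illustrative rather than Tahoe's actual interfaces; the idea is just e permutation entries per server, hashing the peerid together with a small integer, then discarding duplicates.

import hashlib

def permuted_servers(storage_index, peerids, capacity_estimates):
    # peerids: list of server ids (bytes); capacity_estimates: dict mapping
    # peerid -> e, a small positive integer (capacity in units of C).
    entries = []
    for peerid in peerids:
        for i in range(capacity_estimates[peerid]):
            # One permutation entry per unit of capacity: hash the storage
            # index, the peerid, and the entry number.
            h = hashlib.sha1(storage_index + peerid + str(i).encode()).digest()
            entries.append((h, peerid))
    entries.sort()
    # Discard duplicates later than the first occurrence of each server.
    seen = set()
    permuted = []
    for _, peerid in entries:
        if peerid not in seen:
            seen.add(peerid)
            permuted.append(peerid)
    return permuted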

This means that the capacity estimate of a given storage node can't change, and must be known by the Introducer so that it can tell all other nodes. (The actual capacity can be different to the estimate; that won't cause any greater problems than at present.)

The performance of this algorithm as given above is poor when the sum of all e is large, but it can be improved by selecting the servers using a binary search tree rather than an explicit list. That is, each step of a Fisher-Yates shuffle would choose a random element from the search tree weighted by its capacity estimate, then delete that element from the tree. This is equivalent to using an arbitrarily small C.

A [non-uniform-probability] selection algorithm will always have lower per-file entropy than a [uniform-probability] one, given the same number of potential servers. But if it manages to preserve server availability longer, then the entropy averaged over the life of the grid (from empty to full) might be higher.

I'm not sure that selection entropy is the main issue. The two most important things we want are:

  • for the servers-of-happiness criterion to be met for as many uploads as possible;
  • for shares to be placed on the servers nearest the start of the list (hence with the best download performance).

Both of these are affected primarily by the proportion of servers that are available, not by their probability of selection.

comment:4 in reply to: ↑ 3 ; follow-up: Changed at 2009-12-27T21:30:17Z by warner

Replying to davidsarah:

Replying to warner:

I believe that maximum reliability occurs when each file has as many choices for servers as possible, but those options will dwindle over time as servers get full. A system which probabilistically favors some servers over others (based upon capacity or whatever) will have fewer choices to work with.

I agree with the first sentence, but not the second. The "expected full at the same time" property will tend to maximize the number of storage nodes available to accept shares for as long as possible; that's why I believe it is better for file preservation.

I think there are two levels of reliability in action here. The first and most important is to avoid doubling-up of shares (the "servers of happiness" threshold, but really it's strictly >=N servers). Certainly your reliability drops significantly when you go below this number of available servers.

The second order effect is the decorrelation of per-file server sets, which is the entropy thing I'm talking about. It only makes sense to talk about this one after you've ensured that you have the first level for everything.

Imagine that you had 20 servers, the usual 3-of-10 encoding, and the selection rule was that on even days you used servers 1-10, and on odd days you used servers 11-20. Each file would have the first kind of reliability (every file would use N distinct servers). But the second kind of reliability would be marginal: an attacker who destroys the right 50% of the servers would completely kill half the files (in fact they could fatally wound half the files with just 40% of the servers).

In contrast, if each file gets a random selection of all twenty servers, then there's minimal correlation between the servers used by any two files. An attacker who destroys servers 1-10 would expect to kill just 2126/184756 = 1.15% of the files.

from itertools import combinations

# Count, over all C(20,10) equally likely server sets, how many would be left
# with fewer than k=3 surviving shares if servers 1-10 were destroyed.
possibilities = 0
killed = 0
for servers in combinations(range(20), 10):
    possibilities += 1
    if len(set(servers) - set(range(10))) < 3:
        killed += 1
print killed, possibilities

So I think the first goal is to keep >=N servers free for as long as possible (ideally until the very last file fills the grid), but if we can achieve that, then our second goal should be to maximize the number of ways in which files are uploaded.

There's a fairly straightforward way to change the selection algorithm to achieve this property.

Yeah, I like the simplicity of that. But we need a stable way to inform the downloader about the capacities we saw, so they can get to the same list. Maybe a layer of indirection could help: the serverlist is stored in stable, well-known places that do not depend upon server capacity (and the serverlist isn't big enough to fill those places much), but the shares can go elsewhere (to places chosen for the fill-at-the-same-time goal).

I'm not sure that selection entropy is the main issue. The two most important things we want are:

  • for the servers-of-happiness criterion to be met for as many uploads as possible;
  • for shares to be placed on the servers nearest the start of the list (hence with the best download performance).

Both of these are affected primarily by the proportion of servers that are available, not by their probability of selection.

I'll agree with all of that. Certainly selection entropy is less important than the servers-of-happiness (really >=N) criterion. I don't know how it should compare against download performance.. probably below. I guess I'd put selection entropy as the third item in your list.

I hope to explain my entropy concept better.

Here's another example of why I think the probabilistic approach needs to be evaluated against the entropy concept. Imagine that you've got 3-of-10 encoding and 15 servers: 10 big ones and 5 tiny ones. The probabilistic algorithm will almost always pick the 10 big ones and ignore the 5 tiny ones. So even though we've nominally got 15 free servers, we rarely actually use them all. So almost every file we upload will share a server-set (big1-big10), making them more vulnerable (as a group). The entropy of the selection algorithm will be nearly zero, since the tiny servers are chosen with such low probability. The entropy will remain mostly constant over time, though, since you'll probably fill the tiny servers at the same time as the big servers, so your choices will remain about the same for the whole time.

Of course, if you send any more traffic towards those tiny servers (such as if you went for same-rate instead of same-time), they'll fill up sooner than the big ones, and they'll quickly be full. At that point, the entropy drops to zero, because you have exactly one option.

Since the servers in this example are not of uniform size, this loss of entropy is inevitable. There's a finite number of possibilities, and each byte you upload consumes some of them. A completely homogeneous grid with uniformly-sized uploads will run out of selection entropy all at the same time, just as the last file causes the grid to be full. The entropy-versus-time graph (from t=0 to t=grid-is-full) is flat. For heterogeneous grids and a non-probabilistic algorithm the graph looks like a step-wise decrementing function, starting high, dropping a bit after each server fills up, but flat inside each region (the last plateau is at 0, when there are exactly N servers left, then there's a region when we're doubling up shares that we'd represent with a red line or negative numbers or something else). I think a capacity-sensitive algorithm's graph would look completely flat: since all servers should fill at the same time, there should be no steps, but the overall entropy will be lower than if you chose freely between the initially-available servers.

A flat graph would mean that late-uploaded files are just as good as early-uploaded files. A decreasing curve means that early files have it better than late files (or, to be more precise, that a batch of files uploaded early will have less mutual-correlation than a similar batch uploaded late: killing a random set of servers would be expected to kill more of the late files than the early ones).

I suspect that the area under this curve is constant, independent of the selection algorithm, and that the area is a function of the set of server capacities. It would be maximal for homogeneous servers.

I'm merely thinking that it might be possible to measure the shape of this curve for different selection algorithms, and that it's something to keep in mind when picking one. If my suspicion about those shapes is correct, then the probabilistic approach seems the "fairest", in that it would give equal distributions to both early and late files.

I still don't know how to record enough information about the server choices into the filecap, though. Capacities will change over time, and to make this work right for the uploader, they'll need to keep changing their probabilities in response to new "how much space do you have left" updates from the servers.

comment:5 in reply to: ↑ 4 Changed at 2009-12-27T22:39:45Z by davidsarah

  • Keywords preservation added

Replying to warner:

I think there are two levels of reliability in action here. The first and most important is to avoid doubling-up of shares (the "servers of happiness" threshold, but really it's strictly >=N servers). Certainly your reliability drops significantly when you go below this number of available servers.

The second order effect is the decorrelation of per-file server sets, which is the entropy thing I'm talking about. It only makes sense to talk about this one after you've ensured that you have the first level for everything.

Agreed, but I think that "expected full at the same time" is likely to help with this decorrelation as well. The reason is that if some servers are full -- even if >= N servers are not full -- then the choice of servers has been reduced.

For example, suppose you have N small-capacity servers and N large-capacity servers. If you choose servers uniformly, then all of the small-capacity servers will fill up first, and then the choice of servers for shares of subsequent files will be reduced to N. So by attacking only the N large-capacity servers, the attacker is disproportionately likely to kill the more recently added files. (Reading the later part of your comment, it seems we're in violent agreement on this.)
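A rough simulation could make this concrete by tracking how many servers still have free space at the time of each upload, under uniform versus capacity-weighted selection. This is only a sketch: the function names, capacities and parameters below are made up, and "capacity-weighted" here means successive sampling weighted by the fixed capacity estimates, as in the comment:3 algorithm.

import random

def weighted_sample(items, weights, k, rng):
    # Draw k distinct items, each draw weighted by the remaining weights.
    items, weights = list(items), list(weights)
    chosen = []
    for _ in range(k):
        r = rng.random() * sum(weights)
        acc = 0.0
        for j, w in enumerate(weights):
            acc += w
            if r < acc:
                break
        chosen.append(items.pop(j))
        weights.pop(j)
    return chosen

def simulate(capacities, n=10, weighted=True, seed=0):
    # capacities: per-server capacity in share-sized units.  Returns the
    # number of servers with free space at the time of each upload.
    rng = random.Random(seed)
    free = list(capacities)
    history = []
    while True:
        avail = [i for i in range(len(free)) if free[i] > 0]
        history.append(len(avail))
        if len(avail) < n:
            return history   # can no longer place n distinct shares
        weights = [capacities[i] if weighted else 1 for i in avail]
        for i in weighted_sample(avail, weights, n, rng):
            free[i] -= 1

# N large servers plus N small ones, as in the example above:
print(simulate([100] * 10 + [10] * 10, weighted=False))  # small servers fill early, dropping the choice to 10
print(simulate([100] * 10 + [10] * 10, weighted=True))   # all 20 servers stay available for considerably longer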

Note that this is a very likely situation in practice given the tendency to add servers in batches with increasing capacities (as in the case of allmydata); and in that case even rebalancing all shares would not help. With the non-uniform choice, OTOH, rebalancing would restore the random distribution of all shares (regardless of when their file was originally uploaded) across servers.

The attacker still gets a greater advantage from killing a server with a higher capacity, but only to the extent that we would expect because it holds more shares. When the servers are mostly full, we cannot avoid that property.

The extent of the bias is also limited by the attempt to place only one share on each server when uploading. Suppose you have one server that has 100 times the capacity of all the others. The algorithm I suggested will almost always place one share of each file on that server -- but only one. This seems like reasonable behaviour (even for this unreasonably extreme configuration): it uses the capacity of the supersized server as well as possible without relying on it excessively.

There's a fairly straightforward way to change the selection algorithm to achieve this property.

Yeah, I like the simplicity of that. But we need a stable way to inform the downloader about the capacities we saw, so they can get to the same list.

If capacity estimates are fixed, then informing the downloader about them is no more difficult than informing the downloader about public keys. If the actual capacity of a server increases relative to its estimate, then the effect of that is never any worse than the uniform-probability selection. So I think there's a good case for just following the "worse is better" approach of assuming fixed capacity estimates (which makes the selection algorithm stable -- or as stable given server changes as it is now).

The two most important things we want are:

  • for the servers-of-happiness criterion to be met for as many uploads as possible;
  • for shares to be placed on the servers nearest the start of the list (hence with the best download performance).

[...]

I guess I'd put selection entropy as the third item in your list.

Agreed.

I hope to explain my entropy concept better.

Here's another example of why I think the probabilistic approach needs to be evaluated against the entropy concept. Imagine that you've got 3-of-10 encoding and 15 servers: 10 big ones and 5 tiny ones. The [non-uniform] probabilistic algorithm will almost always pick the 10 big ones and ignore the 5 tiny ones. So even though we've nominally got 15 free servers, we rarely actually use them all.

The same is true of the uniform algorithm as soon as the tiny servers fill up, which will be soon. The main difference seems to be that the non-uniform algorithm spreads out the non-uniformity in server selection over time.

[...]

A completely homogeneous grid with uniformly-sized uploads will run out of selection entropy all at the same time, just as the last file causes the grid to be full. The entropy-versus-time graph (from t=0 to t=grid-is-full) is flat. For heterogeneous grids and a non-probabilistic algorithm the graph looks like a step-wise decrementing function, starting high, dropping a bit after each server fills up, but flat inside each region (the last plateau is at 0, when there are exactly N servers left, then there's a region when we're doubling up shares that we'd represent with a red line or negative numbers or something else). I think a capacity-sensitive algorithm's graph would look completely flat: since all servers should fill at the same time, there should be no steps, but the overall entropy will be lower than if you chose freely between the initially-available servers.

You mean it will start at a lower value, I assume? That also matches my intuition (although we should do some simulations), and I think that the flat graph is preferable, because we don't want to favour earlier files at the expense of later ones.

I suspect that the area under this curve is constant, independent of the selection algorithm,

Now there's a bold prediction (that we should be able to test by simulation).

and that the area is a function of the set of server capacities. It would be maximal for homogeneous servers.

I'm merely thinking that it might be possible to measure the shape of this curve for different selection algorithms, and that it's something to keep in mind when picking one. If my suspicion about those shapes is correct, then the probabilistic approach seems the "fairest", in that it would give equal distributions to both early and late files.

I still don't know how to record enough information about the server choices into the filecap, though. Capacities will change over time, and to make this work right for the uploader, they'll need to keep changing their probabilities in response to new "how much space do you have left" updates from the servers.

As explained above, I don't think this is necessary.

comment:6 in reply to: ↑ 3 Changed at 2009-12-27T23:05:34Z by davidsarah

Replying to davidsarah:

Suppose that capacity estimates are multiples of some unit C. If a storage node has capacity estimate e*C, we give it e entries in the list to be permuted (there's a way to make this more efficient; see below). That is, permute the list and then discard duplicates later than the first occurrence of a given node.

[...]

The performance of this algorithm as given above is poor when the sum of all e is large, but it can be improved by selecting the servers using a binary search tree rather than an explicit list. That is, each step of a Fisher-Yates shuffle would choose a random element from the search tree weighted by its capacity estimate, then delete that element from the tree. This is equivalent to using an arbitrarily small C.

I didn't explain this well. The idea is that for each node of the search tree, you keep track of the total weights of its left and right subtrees. That allows you to pick a random node in the tree with probability proportional to its weight, by making depth binary choices where depth ~= log2 n for n servers. Deleting a node also takes depth time, because you have to update the total weights on the ancestors of the deleted node. The overall time is therefore O(n log n).

(This is more like the original Fisher-Yates shuffle than the Durstenfeld variant.)
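For concreteness, here is a simplified sketch of that idea (the name capacity_weighted_permutation and the parameters are illustrative, not Tahoe code). It uses an array-based tree of subtree weight totals rather than an explicit binary search tree; "deleting" a chosen server just zeroes its leaf weight and fixes the totals on the path to the root, which has the same O(log n) cost per step.

import random

def capacity_weighted_permutation(servers, weights, rng=random):
    # servers: list of server ids; weights: matching capacity estimates,
    # all assumed positive.  Returns a permutation in which each remaining
    # server is picked with probability proportional to its weight.
    n = 1
    while n < len(servers):
        n *= 2
    tree = [0.0] * (2 * n)          # tree[i] = total weight under node i
    for i, w in enumerate(weights):
        tree[n + i] = float(w)      # leaves live at indices n .. 2n-1
    for i in range(n - 1, 0, -1):
        tree[i] = tree[2 * i] + tree[2 * i + 1]

    order = []
    for _ in range(len(servers)):
        r = rng.random() * tree[1]  # tree[1] is the total remaining weight
        i = 1
        while i < n:                # walk down, going left or right by weight
            if r < tree[2 * i]:
                i = 2 * i
            else:
                r -= tree[2 * i]
                i = 2 * i + 1
        order.append(servers[i - n])
        w = tree[i]                 # remove the chosen leaf: zero its weight
        while i >= 1:               # and update the totals up to the root
            tree[i] -= w
            i //= 2
    return order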

comment:7 Changed at 2009-12-30T00:14:02Z by davidsarah

  • Keywords bandwidth added

Part of the goal of #543 ('rebalancing manager') is "to smooth out disk usage among all servers (more by percentage than by absolute usage)." This ticket might help by giving such a rebalancer less to do.

comment:8 Changed at 2009-12-30T04:35:52Z by davidsarah

Ah, there is another constraint on the shuffle algorithm: it must be approximately stable when servers are added or removed. The existing algorithm (essentially, hash each peerid and sort by the hashes) is stable because adding or removing a server just adds or removes its hash, and the other hashes are sorted in the same order. The first algorithm described in comment:3 is also stable in this sense, because it can be defined in a similar way by hashing the peerid and a small integer. (It's easy to make this compatible with the existing scheme when all capacity estimates are equal.)

However the Fisher-Yates-based algorithm described in comment:6 is not stable in the required sense, and I don't see how to make it so (a pity, because I'd just finished implementing it :-/ ).

comment:9 Changed at 2009-12-30T04:49:05Z by zooko

I'm not convinced that this stability or "consistent hashing" property is a hard requirement. All of the Tahoe-LAFS grids that have been deployed so far (with one exception) have so few storage servers that most reads query every server. The one exception is the allmydata.com production grid, which has about a hundred servers. It might work just fine to query all one hundred of them on every read, too.

Whether the consistent hashing property is important to real deployments is an empirical measurement question, IMO, and my guess is that for all of the current small grids the answer is "no measurable impact" and for allmydata.com the answer is "measurable impact, but not a critical problem".

comment:10 follow-up: Changed at 2009-12-30T05:11:49Z by davidsarah

Even if this stability property is not critical, it seems that losing it would be an unnecessary regression that might prevent us from scaling up to larger grids.

The original algorithm in comment:3 keeps this property while still meeting the goals of this ticket. I don't think the fact that it is less efficient when (sum of all e) is large would be a serious obstacle. Besides, I have an idea about how to do better, but I'll have to think about it some more.

comment:11 in reply to: ↑ 10 Changed at 2009-12-30T07:01:53Z by davidsarah

Replying to davidsarah:

Besides, I have an idea about how to do better, but I'll have to think about it some more.

The idea worked out.

The comment:3 algorithm is equivalent to picking the minimum hash value out of e independent hash values for each server. We can get the same result by taking a single hash, and transforming it so that it follows the same distribution as the minimum of e hashes would have done.

Let Xe be the random variable given by the minimum of e independent uniform random variables U1, ..., Ue, each on [0, 1). The cumulative distribution function of Xe is given by:

F_Xe(x) = P(Xe <= x)
        = P(min(U1, U2, ..., Ue) <= x)
        = 1 - P(U1 > x) P(U2 > x) ... P(Ue > x)
        = 1 - (1-x)^e

Then we can use inverse transform sampling to generate samples from Xe. For that we need the inverse of F_Xe, which (solving y = 1 - (1-x)^e for x) is

(F_Xe)^-1(y) = 1 - (1-y)^(1/e)

So if we let y be the peer id hash for a given server scaled to the range [0, 1), and e be its capacity estimate, then sorting according to 1 - (1-y)^(1/e) will give the same distribution of permutations that we would have got by the comment:3 algorithm.

Plotting (F_Xe)^-1 for various e gives results that are intuitively reasonable: increasing e biases the transformed hash toward lower values that are more likely to be near the start of the list (but for any e, there is still some chance that the server will not be picked).
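As a quick sanity check of that equivalence (a throwaway sketch, not part of the proposed change; the function names are made up), both ways of generating a server's sort key should follow the same distribution -- for instance, both have mean 1/(e+1):

import random

def min_of_e_hashes(e):
    # comment:3 style: the minimum of e independent uniform "hashes".
    return min(random.random() for _ in range(e))

def transformed_single_hash(e):
    # comment:11 style: one uniform hash pushed through the inverse CDF.
    y = random.random()
    return 1 - (1 - y) ** (1.0 / e)

trials = 100000
for e in (1, 2, 5, 10):
    a = sum(min_of_e_hashes(e) for _ in range(trials)) / trials
    b = sum(transformed_single_hash(e) for _ in range(trials)) / trials
    print(e, round(a, 3), round(b, 3))   # both close to 1/(e+1)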

comment:12 Changed at 2009-12-30T07:12:27Z by davidsarah

Also notice that:

  • when e = 1, this is compatible with the existing algorithm (since 1 - (1-y)^(1/1) = y, the order is the same as sorting by the hashes directly).
  • if a server's capacity estimate increases, it can only get nearer the start of the list, i.e. it should still be assigned a share if it had one before.

comment:13 Changed at 2009-12-31T00:42:13Z by davidsarah

To be more concrete, at source:src/allmydata/storage_client.py#L121 :

key = peer_selection_index
return sorted(servers, key=lambda x: sha.new(key+x[0]).digest())

would change to something like

hash_limit = 256**sha.digest_size
def weighted_hash(x):
    hash = sha.new(peer_selection_index + x[0]).digest()
    y = bigendian_to_int(hash)
    # Scale the hash to [0, 1) before applying the bias.  The exponent 1/e
    # differs from server to server, so the scaling does affect the relative
    # order of the weighted hashes across servers and can't be skipped.
    scaled = y / float(hash_limit)
    # capacity_estimate is this server's (fixed) capacity estimate e.
    return (1.0 - (1.0 - scaled)**(1.0 / capacity_estimate), y)

return sorted(servers, key=weighted_hash)

using this utility function to convert a binary string to an integer (which inexplicably doesn't seem to be in the stdlib):

def bigendian_to_int(s):
    n = 0 
    for i in s:
        n = n*256 + ord(i)
    return n

comment:14 Changed at 2009-12-31T00:48:04Z by davidsarah

Incidentally, I know that Python floating point arithmetic might not give exactly the same results between machines. That shouldn't matter because it can only have the effect of swapping two servers next to each other in the permuted order, which we should be tolerant of.

comment:15 Changed at 2010-01-02T20:15:55Z by warner

neat!

The stability of this all still depends upon the stability of the capacity estimates, right? I gather you've been assuming that any given server would publish its total capacity in a fixed record, along with its nodeid. I've been assuming that each server would publish its current-remaining-capacity periodically, in a record that changes over time, like the one that contains the location hints and version info.

I like the adaptiveness of schemes that keep publishing updates as remaining space dwindles. There will be a lot of random noise in our traffic rates, and if these rates are adjusted over time to match the remaining space, then we'll get a nice feedback loop to compensate for accidental fluctuations. Also, server operators are likely to add or remove space at various times, and it would be nice to be able to adapt to that.

But I don't know how to build a system with all of these nice properties at once: smooth filling of servers by percentage instead of rate, stable ordering of servers between upload time and download time, short filecaps, minimal auxiliary information (like an explicit serverlist stored in some intermediate location).

Even if this stability property is not critical, it seems that losing it would be an unnecessary regression that might prevent us from scaling up to larger grids.

Good argument. (Zooko and I have discussed download-time query flooding before, and we usually tend to land on the same sides each time). I don't yet know how to scale tahoe up to millions of nodes, but I think it will be important to have a well-defined and stable place to find your shares (even if you have to do O(log(N)) queries to find them). Giving up on that now, by requiring an O(N) search, feels like it will make that sort of scaling much much harder.

Maybe we should discuss the properties of a protocol with an intermediate step. I wrote up some of this in #599. The idea would be that upload-time could place shares anywhere it likes (to achieve good load-balancing, or geographical diversity, or ideal download bandwidth, whatever), but it would then write down a list of which servers got used, and store *that list* in a specific (well-known, stable) set of places.

Download reliability would depend upon having both one copy of the sharelist available and >=k shares. But the list should be small enough to let us afford 1-of-N encoding and have lots of copies, so the sharelist's impact on reliability should be dwarfed by the share's impact (if you can get 10 or 20 dBA better, it'll be lost in the noise).

However, repairers and rebalancers need to participate in the protocol: they have to find and update the sharelist. And we have to quantify how much of a problem it would be for the sharelist to be wrong, because some of the sharelist-holding servers might be unavailable when you go to move or create some shares. It's effectively adding a bit of mutable state to your normally-immutable shares, with all the CAP Theorem consequences that entails.

#599 suggests putting a list of "where are the other shares" hints on each share, which would turn your download algorithm into "search normally for the first share, then use the hints to accelerate the search for the others". This would get rid of the potential reliability penalty (since you get fate-sharing between sharelists and shares), but couldn't accommodate completely arbitrary share placement: you'd have to find at least one share before you found all the other ones. So it might help improve performance on large grids (where, due to regular churn, you might normally have to query hundreds or thousands of servers to find enough shares), but still wouldn't really permit the use of fancy load-balancing share-placement algorithms like what we're discussing here.

comment:16 Changed at 2010-01-03T02:10:51Z by davidsarah

Replying to warner:

neat!

Thanks.

The stability of this all still depends upon the stability of the capacity estimates, right?

Increasing the capacity estimate of a node can only move it nearer the start of the list for any given file (storage id), since for a fixed hash value y, 1 - (1-y)^(1/e) decreases as e increases while the other servers' keys are unchanged. Similarly, decreasing the capacity estimate of a node can only move it further from the start of the list for any given file. I think this is the strongest stability property that could be expected.

(When I said "the peer id hash for a given server" in comment:11, I actually meant the hash of the peer id and the storage id. The properties above hold because the sample biasing is done after computing the hash, and doesn't affect its input.)

I gather you've been assuming that any given server would publish its total capacity in a fixed record, along with its nodeid. I've been assuming that each server would publish its current-remaining-capacity periodically, in a record that changes over time, like the one that contains the location hints and version info.

Using the remaining capacity rather than the total capacity would make the peer selection less stable. It should still be quite stable while the grid is not close to being full (since servers will tend to fill at a rate proportional to their initial capacity), but when any given server is nearly full, its remaining space relative to other servers would no longer be a good approximation to its initial capacity relative to other servers.

I like the adaptiveness of schemes that keep publishing updates as remaining space dwindles. There will be a lot of random noise in our traffic rates, and if these rates are adjusted over time to match the remaining space, then we'll get a nice feedback loop to compensate for accidental fluctuations.

Yes. I'm not sure yet whether this outweighs the stability issue.

Also, server operators are likely to add or remove space at various times, and it would be nice to be able to adapt to that.

That could work if file caps contain an epoch number, and if the history of all remaining capacities at each epoch can be obtained by all nodes. However,

  • putting the epoch number in a cap would leak information about when the file was uploaded to holders of that cap. (It would need to be present in a verify cap in order to verify the shares.)
  • two immutable files with the same contents, encoding parameters, and convergence secret would not necessarily have the same read cap, or verify cap, or share locations. The latter could potentially lead to thrashing of share locations if a file is rebalanced using more than one cap with different epoch numbers.

But I don't know how to build a system with all of these nice properties at once: smooth filling of servers by percentage instead of rate, stable ordering of servers between upload time and download time, short filecaps, minimal auxilliary information (like an explicit serverlist stored in some intermediate location).

The epoch number scheme has most of these properties, with the caveats above, but the auxiliary information is the full history of remaining capacities (or fitness values; see below) at each epoch.

Even if this stability property is not critical, it seems that losing it would be an unnecessary regression that might prevent us from scaling up to larger grids.

Good argument. (Zooko and I have discussed download-time query flooding before, and we usually tend to land on the same sides each time). I don't yet know how to scale tahoe up to millions of nodes, but I think it will be important to have a well-defined and stable place to find your shares (even if you have to do O(log(N)) queries to find them).

The nodes don't need to maintain connections to all other nodes, so that's not the scaling constraint. The obvious scaling constraint is the size of the location info for other nodes. When you get to the point where that information takes an unreasonable amount of memory, you can split the network into subgrids, with each subgrid having a set of supernodes (which should be the most available nodes in each subgrid). Then each node only needs to know the locations of all the supernodes, and each supernode only needs to know the locations of other nodes within its subgrid. This creates a small-world network in which any node is at most two hops from any other. So, you can scale to roughly the square of the number of nodes that would otherwise be feasible: use the permuted list algorithm to pick the supernodes for a given file, and have them use the algorithm again to route to the actual storage servers.

But this isn't the right ticket to discuss scaling to many nodes; that would be #235.

Giving up on that now, by requiring an O(N) search, feels like it will make that sort of scaling much much harder.

Maybe we should discuss the properties of a protocol with an intermediate step. I wrote up some of this in #599. The idea would be that upload-time could place shares anywhere it likes (to achieve good load-balancing, or geographical diversity, or ideal download bandwidth, whatever), but it would then write down a list of which servers got used, and store *that list* in a specific (well-known, stable) set of places.

I think that the epoch scheme is a refinement of that. Note that the bias doesn't have to be by capacity; it could use any fitness function. Using a single fitness value for each server wouldn't give you geographic diversity, but it would allow biasing by bandwidth etc.

Download reliability would depend upon having both one copy of the sharelist available and >=k shares. But the list should be small enough to let us afford 1-of-N encoding and have lots of copies, so the sharelist's impact on reliability should be dwarfed by the share's impact (if you can get 10 or 20 dBA better, it'll be lost in the noise).

Yes. The size of a history of fitness values need not be much greater than the location and public key hash info, as long as there are not too many epochs.

However, repairers and rebalancers need to participate in the protocol: they have to find and update the sharelist. And we have to quantify how much of a problem it would be for the sharelist to be wrong, because some of the sharelist-holding servers might be unavailable when you go to move or create some shares. It's effectively adding a bit of mutable state to your normally-immutable shares, with all the CAP Theorem consequences that entails.

Count me as a skeptic of the relevance of the CAP theorem; it depends on a very strong notion of consistency. In any case, if the auxiliary information is the history of fitness values, then that history is only extended, not changed, so we don't really have mutable state.

#599 suggests putting a list of "where are the other shares" hints on each share, which would turn your download algorithm into "search normally for the first share, then use the hints to accelerate the search for the others". This would get rid of the potential reliability penalty (since you get fate-sharing between sharelists and shares), but couldn't accommodate completely arbitrary share placement: you'd have to find at least one share before you found all the other ones.

I'm not sure this has a significant advantage over the epoch number approach. It has the same disadvantages wrt. convergent immutable files, although it wouldn't necessarily leak information about when a file was uploaded.

comment:17 Changed at 2010-01-03T02:18:01Z by davidsarah

  • Summary changed from Adjust the probability of selecting a node according to its storage capacity to Adjust the probability of selecting a node according to its storage capacity (or other fitness measure)

comment:18 Changed at 2010-01-03T02:41:38Z by davidsarah

Fun analogy for anyone who knows image processing: if the cdf of the uniform distribution corresponds to a greyscale ramp, then applying the '1 - (1-x)^e' bias is effectively applying a gamma function to lighten or darken it, increasing the chance of picking a lighter or darker shade.

comment:19 Changed at 2010-12-30T02:30:27Z by davidsarah

  • Keywords space-efficiency added

comment:20 Changed at 2011-08-20T02:46:21Z by davidsarah

  • Owner set to davidsarah

On the p2p-hackers list, Tony Arcieri wrote:

There's a few things missing from Tahoe which I have seen endlessly discussed which would need to be added for it to fill this role. The first would be a way for peers to weight themselves in terms of their available storage capacity. Perhaps Tahoe could utilize a self-assigned weight score?
