#302 closed task (invalid)

stop permuting peerlist, use SI as offset into ring instead?

Reported by: warner Owned by: zooko
Priority: major Milestone: undecided
Component: code-peerselection Version: 0.7.0
Keywords: repair newcaps newurls performance preservation upload Cc:
Launchpad Bug:

Description (last modified by zooko)

We were chatting today about a subject that comes up about every three months: why do we use a consistently-permuted peerlist when choosing where shares should be placed?

The StorageIndex is the output of a hash function and therefore randomly distributed, so the set of all storage indices in the network should be (in the long run) evenly distributed across the SHA-256 2^256 numberspace.

Our current Tahoe2 algorithm uses the file's Storage Index to permute a list of all known storage servers (by hashing the (SI+peerid) string and sorting the result). This adds an additional level of random-distribution: each file gets a different ordering of storage servers.
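
As a rough illustration (not Tahoe's actual code -- the real implementation uses tagged SHA-256d hashes in the peer-selection module), the permutation step amounts to something like:

    import hashlib

    def permuted_servers(storage_index, peerids):
        # Each file's storage index induces a different, but repeatable,
        # ordering of the same server list: sort by hash(SI + peerid).
        return sorted(peerids,
                      key=lambda peerid: hashlib.sha256(storage_index + peerid).digest())

A downloader that runs the same computation over the same server list will query servers in the same order the uploader used.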

The alternate approach that Zooko suggested today was to skip the permutation and instead define the algorithm to be:

  • put the storage servers in a circle, placed by their nodeid
  • use the storage index as a starting location
  • travel clockwise, asking each server in turn to accept a lease
  • continue travelling until all shares are placed, or we've run out of servers
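
A minimal sketch of that selection order (assuming nodeids and storage indexes are comparable values in the same numberspace; the lease-request loop itself is omitted):

    from bisect import bisect_left

    def ring_offset_servers(storage_index, peerids):
        # Arrange servers on a ring by nodeid, start at the storage index,
        # and return them in clockwise order, wrapping at the top of the ring.
        ring = sorted(peerids)
        start = bisect_left(ring, storage_index) % len(ring)
        return ring[start:] + ring[:start]

The uploader would then walk this list, asking each server in turn to accept a lease until all shares are placed or the list is exhausted.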

The reason we went with the permuted list was because we were concerned about the effects of non-uniform storage server capacities. There are two situations to look at: light loading (i.e. nobody is full and all lease requests are accepted), and heavy loading (some or most storage servers are full, and are rejecting lease requests, so the uploader will visit other servers to place their shares).

For light loading, the non-permuted approach causes any individual storage server to experience a load roughly equal to the percentage of the ring that is covered by its N counter-clockwise neighbors. On average, this will be the same for all peers, but some peerids might be clustered more than others, resulting in more traffic to those peers than the rest.

For heavy loading, once a server is full, all of the traffic that would have landed on it will be redirected clockwise around the ring (imagine rain flowing across the ground until it finds a hole large enough to form a puddle). This will result in a concentration of load on the server just past the full region, and may cause that server to fill quickly. The likely result is a large section of the ring which is full, while there may be more space available on the other side of the ring.

In contrast, the permuted approach removes the correlation between server locations: each file sees all servers in different locations. So instead of having "hot spots" (either caused by randomly-clustered peerids, or randomly-clustered full servers), the load will be distributed more evenly. We would expect the heavily-loaded grid to see all servers get full at roughly the same time.

There are two arguments in favor of switching to the non-permuted approach. The first is simplicity: it is slightly easier to explain the non-permuted algorithm, and it is easier to predict where any given file's shares will wind up. The second is a potential benefit for repair. The issue is as follows: a storage server has just died (the hard drive experienced a fatal error), and all of those shares are gone. Repair can be invoked to replace the missing shares and bring the file back up to full strength, but which files need to be repaired? The most convenient list of storage indexes that need repairing was on the server that just died. Is there some other way to construct this list of repair work?

(The assumption is that failure-driven repair is more sustainable than constant repair of all known files. This depends heavily upon the numbers we use: how many servers, how many files, how many clients, how many repair processes, what bandwidth is consumed by repair, etc).

The benefit that non-permuted share distribution would offer is in the resulting correlation between shares held by server B and those held by its neighbors A and C. In a lightly-loaded grid, if all servers A+B+C have never rejected a lease request and are equally old, then every storage index on server B will also be on either A or C (assuming k>=2). Thus, if B dies, we can use A and C to construct the list of repair work that needs to be done.

However, Rob astutely pointed out that there are plenty of other ways to accomplish this same job. For example, each server could be assigned a "buddy", using a simple foolscap protocol, and each time server B adds a new share, it tells its buddy "D" about the storage index. If B dies, we ask the buddy for the share list and dump it into the repair queue. We could use a central server for this purpose, or distribute it out: what really matters is that we have a way to find out who the buddy is.

We need to think more about this. I like the load distribution properties of permuting, and I'm not particularly concerned about the descriptive complexity, but I too am concerned about the repair design, and would like to leave ourselves some tricks to pull out in case we run into future problems. The B-probably-has-A+C benefit of non-permutation breaks down if any of the servers are full or new, so I'm less convinced that it will be a significant help. But, so much of this is guesswork.. we don't really know.

Attachments (3)

ringsim.py (10.0 KB) - added by warner at 2009-12-26T04:37:16Z.
Brian's v1 simulator
ringsim.2.py (10.7 KB) - added by warner at 2009-12-26T05:59:43Z.
Brian's v2 simulator, prints nodeid gaps and min/max nodeid
ringsim.3.py (8.0 KB) - added by warner at 2009-12-27T02:53:46Z.
v3 of brian's simulator


Change History (28)

comment:1 Changed at 2008-02-08T04:01:27Z by zooko

Suppose you are going to tell some database -- either a very decentralized special-purpose database such as the "buddy system", or a centralized general database such as an Oracle DB -- which files need to be checked for repair if you die.

If we permute the peerlist based on the fileid, then you need to tell that database the complete set of files that you are holding shares on, and you need to update that database whenever you add a share, or else lazily allow a certain degree of lag between receiving shares of a new file and updating that database. Likewise, it might be good to update that database whenever you remove a share.

If we don't permute the peerlist based on the fileid, then you could summarize by notifying that database about the upper bound and lower bound of the shares that you hold. If due to some exceptional condition (you are the only storage server reachable to some uploader) you accept shares that are outside your normal range, then you might want to notify the database about those ones individually, but in the common case you don't need to notify the database about changes because the changes fall into your normal bounds.

comment:2 Changed at 2008-02-08T04:04:09Z by zooko

I'm not convinced that the load distribution properties of permuted-peerlist are actually better. Intuitively it feels better, and especially if you call correlations of maxed out servers "hot spots", but I don't think we really know whether one server that is out of space shedding overload to consecutive servers will be better or worse than that server shedding overload evenly to all other servers.

comment:3 Changed at 2008-02-08T04:24:24Z by zooko

I will offer more detail about my last comment, in which I say that I'm not convinced that the load-distribution properties of permuted-peerlist are better.

Which load-distribution pattern is better will depend on two things: our monitoring/alerting mechanism, and our provisioning mechanism. We don't know yet what either of those two will look like.

Let me give a couple of concrete examples of how certain plausible such mechanisms would make the load-distribution pattern of the linear peerlist better than the load-distribution pattern of the permuted peerlist:

  1. Monitoring/alerting mechanism: suppose we have a red light that goes on (e.g. an alert that triggers, sending e-mail to the operations team, for example), when a server reaches a critical low level of available storage. Suppose that, unbeknownst to the ops folks, the user access patterns have recently changed so that the grid as a whole is rapidly running out of space.
    a. In universe A, with the linear peer list, one of those red lights goes on, because one of the storage servers is out of space. Subsequently, the red light from the next server over goes on, because the load shed from the first one has caused the second one to run out of space. And so forth, at a linear rate -- a new server goes red every K minutes or hours.
    b. In universe B, with the permuted peer list, one of those red lights goes on, because one of the storage servers is out of space. No other alarms go off, because its load is shed evenly onto all other servers. Later, after K minutes or hours times the number of storage servers that we have, they all go red simultaneously.
  2. Provisioning: suppose one of the ways that we like to provision more storage space is to roll out a new server. If we have a linear peer list then you know that it will soak up overload which is shed from peers counterclockwise from it in the ring. If we have a permuted peer list then it will soak up overload which is shed evenly from all peers. The former knowledge may be useful, for example if some of your current nodes are smaller capacity (i.e. older) than others -- you might want to deploy the new node clockwise from them.

Now, I'm not saying that our monitoring/alerting/provisioning mechanisms will necessarily be like this. I'm saying that since we don't know that much about what those mechanisms will be like, we should choose the simpler scheme, even though personally the "more distributed" feel of the permuted scheme appeals to me.

An important point in the above hypothetical scenarios is that the linear ordering allows the ops team to conceptualize and predict the states of their servers. Simplicity is valuable for such reasons.

comment:4 Changed at 2008-06-01T20:48:33Z by warner

  • Milestone changed from eventually to undecided

comment:5 Changed at 2008-08-05T19:37:44Z by warner

Zooko and I discussed this in more depth in Boulder the other week, and he's convinced me to at least put a space for "which peer-selection method should be used" into the next version of the URI string format. One setting would mean the permuted-index that Tahoe currently uses. Another setting would mean skip the permutation step.

It will take some further convincing before I'm ready to believe that non-permuted should be the default. But the main argument that made me feel that non-permuted should definitely be an option was that it may give grid operators more control over the peer-selection process.

Specifically: imagine that we give storage servers two separate published identifiers (included in their Introducer announcements). The first is their cryptographically-secure foolscap nodeid, as usual. The second is a new "storage index ring position", and is an arbitrary 160-bit number, not authenticated at all. The default would be for both to be the same, but grid operators could change the ring position all they wanted.

The non-permuted peer-selection algorithm would key off these ring position numbers. They would still use the nodeid for things like computing shared secrets (for write-enablers, etc), but they'd use the ring position numbers to decide which servers they should talk to.

Now, suppose the grid operators bring 8 machines online, 4 in an east-coast colo facility, and 4 in a west-coast one (always plan for meteors). Each machine has 4 hard drives. How do you obtain maximum distribution of the shares? With the permuted peer list, you just hope you get lucky.. on average the shares will be fairly well distributed, but sometimes you'll wind up with 4 shares on the same machine (one per spindle), and much of the time you'll get more shares on one coast than the other.

But with non-permuted peer-selection, you just have the grid operators assign ring positions to be equally distributed around the ring, in an order that achieves the distribution you want:

east-coast machine-1 spindle-1
west-coast machine-5 spindle-1
east-coast machine-2 spindle-1
west-coast machine-6 spindle-1
...
west-coast machine-8 spindle-1
east-coast machine-1 spindle-2
...

Until the servers fill up, this will provide maximum distribution of files across failure boundaries. If the servers are fairly homogeneous, they're likely to fill up at about the same time.
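
A sketch of how the operators might compute those ring positions (the labels and the interleaving order are just the hypothetical east/west example above, not anything Tahoe defines):

    def assign_ring_positions(labels, bits=160):
        # Spread the servers uniformly around the 2^bits ring, in the
        # operator-chosen order, so consecutive ring neighbors sit on
        # different machines and different coasts.
        step = 2**bits // len(labels)
        return {label: i * step for i, label in enumerate(labels)}

    # Alternate coasts, one spindle per machine per pass.
    machines = [("east", 1), ("west", 5), ("east", 2), ("west", 6),
                ("east", 3), ("west", 7), ("east", 4), ("west", 8)]
    labels = ["%s-coast machine-%d spindle-%d" % (coast, m, spindle)
              for spindle in range(1, 5)
              for coast, m in machines]
    positions = assign_ring_positions(labels)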

Now, there are things to be discussed about non-homogeneous servers, and friendnet/non-managed grids in general, since I'm still concerned about what happens when a big server shadows a second one by sitting right before it in the ring. In a non-permuted scheme, the second server will see less traffic than you'd expect (not zero traffic, the actual amount depends upon how much space is between the N'th CCW node and the N-1'th CCW node). With random ring-location allocations, I fear that this non-uniform distribution may be a problem, but it will require some more analysis before I'd feel I understood the issue well enough to know. The kind of simulation Zooko was doing recently on the mailing list is a start, but he and I identified some fundamental problems with that work (starting with "what are we trying to measure?") that meant we can't really draw any conclusions from the results.

comment:6 Changed at 2008-09-05T19:36:46Z by zandr

This is not a win for ops, actually.

Let's discuss our current case. I have 40 storage nodes running now. In order to avoid lumpy distributions of blocks, we need to distribute the ring positions of these nodes evenly. (There is some discussion about this, but I'm with Brian: uneven distribution of nodes will result in uneven distribution of blocks)

So now, I bring up 10 new nodes in LA. I want them spread around the ring, but there's no easy way to make that distribution uniform. So to get back to a uniform distribution, I need to change every node's index.

That's a HUGE admin overhead, which will cause me to request a tool to do it for me. So now I have some foolscap-speaking control mechanism, or we end up with a queen setting blockserver ranges again. Both of those are significant complexity for very little real gain.

One thing that Brian and I discussed as an alternative was to have a tag associated with a storage server that described it. I can think of maybe three tiers, and values up to 20 or so for each tier, so we should probably use 8 bytes. :) So byte 1 would be site number, byte two would be chassis number, etc. We'd then select a set of peers with a large average distance in those tags, possibly by selecting one peer, then reshuffling the list so the next peer is as far as possible from the first, etc. This isn't a well-formed suggestion yet, but something to think about.
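
Purely as an illustration of that not-yet-well-formed idea (the tag layout and the greedy rule below are guesses, not anything agreed on): selection could repeatedly pick the peer whose tag differs from the already-chosen peers in as many tiers as possible.

    def tag_diverse_servers(tags, count):
        # tags: dict of peerid -> tuple like (site, chassis, spindle).
        # Greedy max-min selection: each new pick maximizes its smallest
        # tag distance (number of differing tiers) to the picks so far.
        def distance(a, b):
            return sum(1 for x, y in zip(a, b) if x != y)
        chosen = []
        remaining = dict(tags)
        while remaining and len(chosen) < count:
            best = max(remaining,
                       key=lambda p: min([distance(tags[p], tags[c]) for c in chosen] or [0]))
            chosen.append(best)
            del remaining[best]
        return chosen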

I actually think that the current system is fine, and works well. My only real concern with the current mechanism is different drive sizes: it feels like disks should fill at a uniform rate on a percentage basis, not on a bytes basis. I'm willing to be convinced this isn't a problem, however.

comment:7 Changed at 2008-09-06T05:42:17Z by warner

To follow up on the "lumpy distribution" point: Zooko and I have gone back and forth on this a bit. My current belief is that randomly-distributed nodeids will not provide uniform share-upload request rates. Let me throw out some rough analysis.

Imagine N=2 and four servers uniformly spread around the ring at A:90deg, B:180deg, C:270deg, D:0/360deg. Clearly any SI that lands in the 1-90 range will put a share on A and B, likewise if it lands in 90-180 the shares will go to B and C, etc.

Now if we move A to 80deg, then SIs from 1-80 will land on A and B, 80-180 will land on B and C, 181-270 on C and D, 271-360 will land on D and A. Assuming SIs are uniformly distributed, we get the following upload rates:

  • A: 80+90 = 170
  • B: 80+100 = 180
  • C: 100+90 = 190
  • D: 90+90 = 180

So we get non-uniform upload rates: server C (which is N servers clockwise around the ring from the larger-than-average gap) will fill up faster than the rest, and server A (which sits just after a smaller gap) will fill up slower.

If N=10 and we have 80 servers spread around, the same effect holds: the server that is 10 away from the larger gap will get more shares. It's possible that the effect is diluted a bit by large values of N (although I don't currently believe it), but the effect will be persistent: as long as those servers sit at those ring positions, their relative upload rates will be different.
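
The four-server example above is easy to check mechanically. Here is a small sketch (using degrees rather than 160-bit nodeids) that reproduces the 170/180/190/180 figures:

    def inlet_rates(positions, N=2, ring=360):
        # Each gap between consecutive servers routes its slice of the SI
        # space to the next N servers clockwise; sum those slices per server.
        order = sorted(positions, key=positions.get)
        rates = dict.fromkeys(positions, 0)
        for i, name in enumerate(order):
            gap = (positions[name] - positions[order[i - 1]]) % ring
            for j in range(N):
                rates[order[(i + j) % len(order)]] += gap
        return rates

    print(inlet_rates({"A": 80, "B": 180, "C": 270, "D": 360}))
    # expected: A=170, B=180, C=190, D=180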

The balancing tool that Zandr refers to would basically need to look at the set of servers and shift their ring positions to make the gaps all uniform. Each time a server was added, these positions would need to be readjusted. OTOH, you could roughly control the traffic rate for any particular server by adjusting the N'th CCW gap a bit.

But, this tool would be obligatory: the unmanaged behavior is bad. Using (effectively) randomly-chosen serverids as ring positions will result in (significantly, I believe) different upload rates, whereas we'd really prefer the upload rates to be completely uniform (specifically we want to fill all servers at the same time, to put off doubling up shares as much as possible). Whereas with the current permuted-list algorithm, the default unmanaged behavior is good.

So, I'm not yet convinced either way. I'm willing to leave space in the next version of the URI for a "peer-selection-algorithm number", so we can have some files uploaded without permutation. But I'm less convinced that it's a good idea than I was last month when I was in Boulder.

comment:8 Changed at 2009-10-28T07:35:23Z by davidsarah

  • Keywords newcaps newurls added

comment:9 Changed at 2009-11-02T08:39:34Z by warner

Zooko suggested I add a note about the following idea which came up in tahoe-dev:

Suppose an attacker gets to kill N servers of their choosing, and wants to cause as much damage as possible. And suppose that there were far more than N servers in the grid, and we're using 1-of-N encoding. Now, if we're using the permuted-list algorithm, they could pick one file to completely kill (choose an arbitrary file, locate its servers, kill them all; boom, the file is dead). But killing two files is awfully hard: you'd have to be lucky and find two files that happen to permute to the same first N servers. I think the chance of killing a second file is like 1 over (M choose N), where M is the size of the grid: i.e., the number of permutations is huge. And of course killing a third file is that probability squared, etc.

Whereas if you aren't using the permuted-list algorithm, and shares are placed on consecutive servers starting at the SI, the attacker can do a lot more damage. They just take out any N consecutive servers. They'll completely kill 1/M of the files on the grid (since there are only M total permutations in use, one for each server). And they'll kill all-but-one of the shares for another 2/M files (the two immediate neighbors), and all-but-two of another 2/M files, etc, in a sort of triangularly-shaped distribution.
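
To put illustrative numbers on that (the grid size and file count here are made up, not from any real deployment): with M=100 servers and N=10 shares, a second kill under permuted-list is astronomically unlikely, while ring-offset placement hands the attacker roughly 1/M of all files.

    from math import comb  # Python 3.8+

    M, N, num_files = 100, 10, 1000000  # illustrative only
    # Permuted list: a second file dies only if its permutation maps its
    # shares onto exactly the N servers the attacker chose.
    p_second_kill = 1.0 / comb(M, N)                 # ~5.8e-14
    extra_dead_permuted = num_files * p_second_kill  # effectively zero
    # Ring offset: every file whose SI falls just before the first killed
    # server has all N of its shares on the killed servers.
    extra_dead_ring_offset = num_files / M           # ~10000 files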

So I still think that permuted-list provides better properties.

comment:10 Changed at 2009-12-23T19:38:01Z by zooko

The simulator that I wrote showed that the effect Brian described in comment:7 was lost in the noise -- the permuted-per-file uploads and the flat ring uploads filled up their servers in an indistinguishably variant and noisy pattern. The two share placement strategies were identical for the first 96% of the run (that is: zero servers were full after the first 96% of the time, and then all servers were full at 100% of the run, regardless of which share placement strategy was used). Also the difference between the two strategies in that four percent interval was not salient. It was entirely up to luck (i.e., the pattern of which storage indexes appeared at random) whether some servers would fill up significantly faster than others, regardless of which share placement strategy was used.

Here is my note about it which includes a link to the simulator code: http://allmydata.org/pipermail/tahoe-dev/2008-July/000676.html .

comment:11 Changed at 2009-12-24T22:07:17Z by warner

Thanks for providing a link to the code! That will help me a lot.

I have failed to take the time to review the simulator code properly (as I said I'd try to do the last time this was brought up), but what I remember at the time was not believing that it was testing the right thing: I seem to recall believing that it was averaging together several values or several runs and thereby destroying the information which we were most interested in. So I need to examine the results carefully (and probably write a simulator or two of my own) before I'll be willing to act upon them.

I think it would be helpful to simulate a progression of encoding parameters, where we increase N from 1 to 10, and see how it affects the share distribution. The "lumpiness" that I mentioned earlier would be most noticeable when N=1 and become less noticeable as N grows. Perhaps Zooko's simulator was run only with N=10. Knowing just how much this effect goes away with higher values of N would be useful (as well as knowing how much it increases with other parameters, like the variance of storage-server capacities).

Changed at 2009-12-26T04:37:16Z by warner

Brian's v1 simulator

comment:12 Changed at 2009-12-26T05:57:31Z by warner

ok, so the "ringsim.py" simulator that I just attached to this ticket demonstrates one of the concerns I described above: non-permuted peer-selection will result in higher bytes-per-second upload rates to some servers than to others. (I haven't yet built a simulator to investigate the effect of full servers shedding load onto their clockwise neighbors).

Run the code like python ./ringsim.py --seed=abc --permute=1 . It will create a ring of 100 servers using "abc" as a seed to decide their nodeids. (any specific seed will result in a consistent distribution of nodeids).

Then it will upload files (each with a size in randrange(2GiB), mean size 1GiB) one at a time. Every few thousand uploads it will analyze the space used per-server and emit a report line like:

uploaded 16000
min/max/(exp) usage-pf-ps 33.86 MB/38.66 MB/(35.72 MB): spread-pf: 4.80 MB (13.45%) stddev: 1.00 MB (2.81%)

The first block of numbers is "usage-per-file-per-server", meaning how much storage space was used on each server, divided by the total number of files that had been uploaded so far. If we pretend that we're uploading files at a rate of one per second, this is actually measuring bytes-per-second. The "min" value of 33.86 MB means that the least-used server had received 33.86MB per file (i.e. per second). The most-used (fullest) server had received 38.66MB per second. Our average filesize of 1GiB and 3-of-10 encoding parameters means that we'd expect to place 35.72MB per-server per-file.

The "spread-pf" is the difference between the least-used and most-used servers: 4.80MB = 38.66-33.86. 13.45% is that spread expressed as a percentage of the expected usage value.

The "stddev" is the standard deviation of all 100 servers' usage values. If usage were perfectly uniform, this would be zero. 2.81% is the standard deviation expressed as a percentage of the expected usage value.

The simulator will run nearly forever. Run it with --permute=1 and notice how the min/max values converge on the expected value over time, and how the spread and stddev drop towards zero. In my test run, after 200000 files, the spread was down to 1.61MB (4.5%) and the stddev down to 265kB (0.74%). This is the law of large numbers in action.

Now, re-run the simulator with --permute=0 and --seed=abc. It runs much faster (because linear ring-offset selection is a lot easier than hash-based permutation). Look at the usage report for 16000 files:

uploaded 16000
min/max/(exp) usage-pf-ps 7.10 MB/55.78 MB/(35.81 MB): spread-pf: 48.69 MB (135.96%) stddev: 12.12 MB (33.84%)

The spread is enormous, as is the standard deviation. The least-used server is using roughly an eighth as much space as the most-full server, whereas in the permuted case they were using within 15% of each other.

And if you let it run for a while and look at the 200000 file report, it doesn't get better over time:

uploaded 200000
min/max/(exp) usage-pf-ps 6.90 MB/56.05 MB/(35.69 MB): spread-pf: 49.15 MB (137.73%) stddev: 12.17 MB (34.10%)

Even after 200k files, the least-to-most-used ratio is 8x. And the stddev is basically constant.

A bit of extra code reveals why. The least-used server has a nodeid that starts with dff96a, and the neighboring portion of the sorted list of serverids (with the separation between each node and its CCW neighbor) shows:

<Server dee63f41a151782ed22d012814dbce4e> 0112bfce339ff4781d6b7a6f9705aae2
<Server df510eb7034c80b737719d76ca1d19d6> 006acf7561fb088865449c4eb5414b88
<Server df6848a848d3a59966e5835a1883fa77> 001739f1458724e22f73e5e34e66e0a1
<Server dfaff86b8c4d56ca38e673186fb11f56> 0047afc34379b130d200efbe572d24df
<Server dfba0da90d512319e6447e6464a57bb3> 000a153d8103cc4fad5e0b4bf4f45c5d
<Server dff96a791f1d835d6be65724b91e1225> 003f5cd011cc604385a1d8c054789672 <--
<Server e10e706a39fd09fe19db22fc498ed1ed> 011505f11adf86a0adf4cbd79070bfc8
<Server e1bfb715d3cec4799ad3bb2e50f523fe> 00b146ab99d1ba7b80f8983207665211
<Server e2f95cb1e2313cf26e738b3e8a0b8934> 0139a59c0e627878d39fd01039166536

100 uniformly-distributed servers would have a separation of 028F5C28F6... but the randomly-chosen nodeids in our --seed=abc ring are not uniformly distributed. In this case, lucky node dff96a happened to land unusually close after node dfba0, with a separation of just 003f5d..., about one tenth the ideal (uniform) separation. In fact it sits at the end of an unusually dense cluster of nodeids.

(what we actually care about is the separation between node dff96a and its 10th CCW neighbor, since we're encoding each file into 10 shares. The separation between dfba0 and dff96a is a big contributor to this, but not the whole thing).

And similarly, the most-used server was 4f5ab8, and that portion of the ring looks like:

<Server 3be5ef429769acbb7b9bb73443ea9fee> 0183079e758e2d8ac094ba3269f908fb
<Server 3f0b2004a830fc609b61621ee3b77b1f> 032530c210c74fa51fc5aaea9fccdb31
<Server 4681adee12a4e943c07fed69b644c640> 07768de96a73ece3251e8b4ad28d4b21
<Server 4f5ab87f270f850a05c443a14fa2042e> 08d90a91146a9bc645445637995d3dee <--
<Server 50719e5b06af03bbea0f362bed7e4dd3> 0116e5dbdf9f7eb1e44af28a9ddc49a5
<Server 52b741b1eb3e4d31ef44b78845b13a5f> 0245a356e48f49760535815c5832ec8c
<Server 54b497bd7905c60256a6a8735b6c2581> 01fd560b8dc778d06761f0eb15baeb22

The 4f5ab8 node is sitting just clockwise of an unusually large gap, from 4681ad, with a separation of 08d90b, about 3.75 times the ideal (uniform) separation.

This is the "lumpy distribution" problem that I was worried about. The effect is reduced when shares are spread over more servers. If I re-run the simulation with --N=40 (3-of-40 encoding), I see a spread of about 50% the expected value, and a stddev of about 15%. There is a corresponding increase in the effect when shares are spread over fewer servers: --N=5 gives me a spread of 195% and a sddev of 46%.

The effect is easiest to understand when k=N=1. In that case, the inlet rate for any given server is strictly equal to the total upload rate times the fraction of the ring that lies between that server and its nearest CCW neighbor. For our "abc" seed, the smallest separation is between node b80ea and b8159, with a gap of 0006ea (which is 1/95 of the uniform gap), and the largest is between 6ef56 and 7b41d, with a gap of 0c4c6 (about 4.8 times the uniform gap). So we'd expect lucky node b8159 to get about 0.0001 of the total traffic (1/95th of the uniform 0.01), unlucky 7b41d to get about 0.048, and a ratio between the two of about 455x.

And indeed, although b8159 and 5a82c are in competition for least-used server, after about 300000 files, we get this report:

uploaded 292000
min/max/(exp) usage-pf-ps 153.15 kB/52.44 MB/(10.68 MB): spread-pf: 52.29 MB (489.40%) stddev: 10.69 MB (100.02%)
least: b8159546f332951c52367c4ad92fd9f7
most: 7b41d2d31e6180d42eec56221d02cf4d

And 10.68MB/153kB is 70x, and 52.44/10.68 is 4.9x, matching our expectations of 95x and 4.8x pretty closely.

If you re-run the program with e.g. --seed=def --permute=0, you'll get a different distribution of nodeids, which happens to get a 4x ratio between most-full and least-full, and a stddev of about 30%. Better, but still pretty bad. --seed=def --permute=1 behaves just as well as the "abc" seed: stddev is again down to 0.74% after 200k files.

If you're lucky and find a seed that gives you a uniform distribution, then --permute=0 should give you the same statistics as --permute=1. But for most seeds (i.e. most grids), you'll get a very lumpy distribution. --permute=1 tolerates arbitrarily lumpy server distributions.

So, based upon this simulation, I'm fairly convinced that permuted-list is necessary to avoid long-term uneven upload rates to different servers. A simple linear ring-offset algorithm will subject servers to vastly different loads unless the nodeids can be controlled to maintain a uniform distribution (which means changing them every time a server is added or removed).

Now, do we actually need uniform upload rates? What we really want, to attain maximum reliability, is to never double-up shares. That means we want all servers to become full at the same time, so instead of equal bytes-per-second for all servers, we actually want equal percentage-of-space-per-second for all servers. That's much trickier. If we could completely (and continuously) control nodeids (by decoupling peer-selection index values from cryptographic-backed server pubkeys), we could adjust them to achieve inter-server gaps that compensate for how much space they have remaining: small servers would be clustered closer together, large servers would be placed CW from large gaps. The math necessary to do this strikes me as pretty complicated, and I think that changing nodeids over time would damage efficient retrievability, since shares will no longer be in the ideal places when the downloader tries to perform the same peer-selection routine as the uploader did.
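
For the simplest case (k=N=1), that kind of adjustment could look like the following sketch: give each server a CCW gap proportional to its remaining space, so its inlet rate tracks its free capacity. This is purely illustrative and, as just noted, ignores the damage that moving ring positions does to retrievability.

    def capacity_weighted_positions(remaining_space, bits=160):
        # remaining_space: dict of server -> free bytes. With k=N=1, a
        # server's inlet rate is the fraction of the ring covered by its
        # CCW gap, so gaps proportional to free space would give every
        # server roughly the same time-to-full.
        total = sum(remaining_space.values())
        ring = 2**bits
        positions, running = {}, 0
        for server, space in sorted(remaining_space.items()):
            running += space
            positions[server] = (ring * running // total) % ring
        return positions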

We could also have servers refuse some percentage of incoming shares even if they had space for them, to get their percentage-full-per-second rates down to match the grid-wide average. This would induce the same problems that the ring-offset and lumpy-distribution scheme has: servers which happen to sit CW of a self-throttling node will get more traffic than usual.

OTOH, it's a little bit easier than that: we don't need to engage in this load-shaping work until we start to run out of servers. If we have at least "N" servers with space available, then reliability is unaffected by the rate at which they're filling up. So we could have servers accept shares at full speed until it looked like the grid was starting to fill up, then have them switch into a mode where they defer requests to other servers more and more (to obtain uniform fill rates) as the remaining space dwindles. The shaping effect would be negligible in a grid with lots of free space. A managed grid, for which new servers are added before the grid gets full, would never need to engage in load shaping. But any amount of load shaping that *was* being performed would put off the day at which the first server gets full.

So, in summary, I am re-convinced that linear ring-offset has real problems, and that permuted-list provides a more uniform bytes-per-second inlet rate, which is easier to deal with and gives better system-wide properties.

Changed at 2009-12-26T05:59:43Z by warner

Brian's v2 simulator, prints nodeid gaps and min/max nodeid

comment:13 Changed at 2009-12-26T16:12:04Z by zooko

Thanks for doing this work to simulate it and write up such a detailed and useful report! I think you are right that the unpermuted share placement can often (depending on node id placement and N) result in significantly higher inlet rates to some storage servers than others. But as you say it isn't clear how much this matters: "Now, do we actually need uniform upload rates? What we really want, to attain maximum reliability, is to never double-up shares. That means we want all servers to become full at the same time, so instead of equal bytes-per-second for all servers, we actually want equal percentage-of-space-per-second for all servers."

Note that in actual deployment, storage servers end up being of multiple generations, so for example on the allmydata.com prodgrid the oldest servers are running 1 TB hard drives, then once those started filling up we deployed the thumper which comprises about 48 storage servers each with a 0.5 TB hard drive, then once the thumper started getting full we deployed a few more servers, including ten which each had a 2 TB hard drive. The point is that there was never a time (after the initial deployment started to fill up) where we had similar amounts of free space on lots of servers so that equal inlet rates would lead to equal time-to-full.

My simulator (mentioned earlier in this thread) reported time-to-full instead of reporting inlet rate, and it indicated that regardless of whether you have permuted or non-permuted share placement, if you start with a large set of empty, same-sized servers and start filling them, then once the first one gets full then very quickly they all get full.

Note that there are two separate arguments: 1. A more uniform inlet rate might not be so important. 2. The time between the first one filling and the last one filling is a small fraction of the time between the start of the grid and the last one filling (regardless of share placement strategy).

I guess I'm not sure how you got from "do we actually need uniform upload rates?" to "easier to deal with and gives better system-wide properties" in your comment:12.

Oh! Also note that "What we really want, to attain maximum reliability, is to never double-up shares" is at least partially if not fully addressed by #778.

comment:14 follow-up: Changed at 2009-12-27T02:39:10Z by davidsarah

  • Keywords performance preservation upload added

The results of Brian's and Zooko's simulations seem contradictory to me. Suppose that one server is filling up x times faster than the average of all the servers. Then, for the case where all servers have the same amount of storage, I'd expect that server to fill up at the point where the grid is only approximately 1/x full. Is it my expectation that is wrong, or is it one of the simulations (hopefully not both :-)?

In any case, if #778 is fixed, then a non-uniform upload rate shouldn't affect preservation of files, for as long as there are still happy servers that are not full. But if some servers become full early, then that will affect upload performance because uploads will be expected to have to contact more servers before finding one that is non-full.

Also, if #778 is fixed but #543 and #699 aren't, then in order to recover from a full grid you would need to simultaneously add happy servers (no matter how much extra space each new server has). That sounds like it could be an operational problem. So fixing #543 and #699 should be a priority after releasing 1.6, I think.

Changed at 2009-12-27T02:53:46Z by warner

v3 of brian's simulator

comment:15 in reply to: ↑ 14 Changed at 2009-12-27T03:35:07Z by davidsarah

Replying to davidsarah:

... But if some servers become full early, then that will affect upload performance because uploads will be expected to have to contact more servers before finding one that is non-full.

And also download performance, because fewer shares will be in the expected places near the front of the permuted list.

comment:16 Changed at 2009-12-27T04:00:55Z by warner

Zooko: you make excellent points about heterogeneous server capacities. I've always gone back and forth between different mental images of how a Tahoe grid "ought" to work:

  • set up N homogeneous-sized servers, run until they're full, then quit your universe
  • set up N homogeneous-sized servers, run until they're almost full, then add more, repeat. (the first few years of allmydata.com, in which each "phase" or "wave" of servers usually consisted of 20 drives, 1TB each, four drives to a 1U chassis)
  • do that, but sometimes add heterogeneous-sized mixes of servers (the last year of allmydata.com, with the 48*500GB thumper)
  • collect a bunch of friends with completely randomly sized servers, watch them come and go over time (testgrid, volunteer grid)

I've also gone back and forth in my assumptions about how files will come and go. In the latest-snapshot-only backup world, we generally assume that a new user starts by uploading a whole bunch of files, then switches into a "maintenance" mode in which a small percentage of those files are replaced by new versions every few days. In maintenance mode, you're probably using more space than you're deleting, but at a much slower rate than during the initial upload. If you're adding more users than you're removing, the overall usage rate will remain positive, so you'll have to add more disk space eventually. It's like economic inflation, but for disk space.

Each of these styles will result in different sorts of share placement behavior. The allmydata.com phase1/phase2 style has resulted in files living in a specific generation of server, which of course is not ideal for downloads (since the shares are never in the ideal place): rebalancing is the nominal answer, but the traffic costs may be prohibitive.

Zooko writes:

I guess I'm not sure how you got from "do we actually need uniform upload rates?" to "easier to deal with and gives better system-wide properties" in your comment:12.

Let me first correct something I said. Brian writes:

Now, do we actually need uniform upload rates? What we really want, to attain maximum reliability, is to never double-up shares.

Actually, we want more than that. We want to maximize diversity of share placement. If we never doubled-up shares, but always put files on the same sets of servers (either servers 1-10, or servers 11-20, or 21-30, but never any other sets like 1,3,7,14,16,21,26,28,29,30), then random server failures have a higher probability of killing files. We get the lowest probability of losing files when there is minimal correlation between share placement of each file. We've discussed this before, partially in this ticket (the "Suppose an attacker gets to kill N servers of their choosing" comment:9), and partially elsewhere. I've built a simulator to prove this before, so I can expand on the idea if it's contentious.

So for maximum reliability, which I'll put into the category of "better system-wide properties", we want to keep as many servers available and in the pool of non-full servers for as long as possible.

In the allmydata.com phase1/phase2 scenario, all the servers had the same sized drives. We started with 20 servers (1TB each), and they all started to get full at about the same time. I think it took something like 20 weeks to fill those. When they were about 90% full (say two weeks to go), we bought a new batch of servers and brought them online. This dropped the uniform (per-server) inlet rate in half (since we then had 40 servers instead of 20). Four weeks later, the old batch finally filled up, at which point we were back down to 20 free servers and the inlet rate bumped back up to normal. We had 18 weeks of files spread among just the phase1 servers, then 2 weeks of files spread among phase1+phase2, then a duration (17 weeks? 19? math is hard) when files were only stored on the phase2 servers, then we spun up a phase3, etc. Even when we added the thumper, with its 48 drives of 500GB each, the available server space consisted mostly of uniformly-sized drives (i.e. phase4 was entirely the thumper).

To get maximum reliability from that environment, you want to fill them at equal bytes-per-second rates, because that way they'll all get full at the same time (and you'll always have a maximum number of hosts to choose from). You'd get even better reliability if you bought all your phases ahead of time, and filled 40 drives at half the rate for 40 weeks, instead of 20 drives for 20 weeks and then a different 20 drives for the next 20 weeks. But capital costs, interest rates, predicting future needs, etc (all the usual business decisions) encourage just-in-time procurement policies, so 20 active servers was a reasonable compromise.

If we had upload rates that were as non-uniform as the ring-offset algorithm would provide, we'd fill some servers in just a few weeks, reducing the pool of active servers from 20 down to 18 or 16 or something, which would reduce overall reliability; not as badly as losing 11 servers such that we were down to 9 active ones and started doubling up shares, but still a measurable loss.

Now, if you have a bunch of heterogeneous servers, like on the testgrid or the volunteergrid, it's more complicated. There is no way to maintain maximum diversity: either you shape your traffic (intentionally sending more shares to large servers than small ones, in an attempt to fill everybody at the same time), or you send uniform rates to everybody and fill the small ones first and then end up sending more shares for the later files to the large servers. Both approaches eventually result in non-ideal distribution. I think we just have to accept this fact: files that are uploaded during periods of more free servers will be more reliable. This might be at the beginning of time, when small servers aren't full yet, or later when new servers join the grid, or just after some sort of rebalancing process causes small servers to become briefly available again.

Having non-uniform upload rates in a heterogeneous grid means that a lot of your properties become the luck of the draw. Really small servers will be filled so quickly that you can just ignore them. The number of available servers at any given point in time will depend upon both the variation between upload rates and the variation in server capacity. My previous simulation showed that k=3,N=10,servers=100 could easily result in an 8x variation in upload rate. We could imagine that the smallest server anyone might run would be say 10GB, and the largest would be say 4TB, giving a 400x range in capacity. Both are huge.

So having uniform traffic rates is a good thing for homogeneous grids (or piecewise homogeneous grids like allmydata.com's successive waves of servers), and it would be hard to predict its effects on a heterogeneous grid. Good for one and neutral/unknown for the other.

As for "easier to deal with", I think it will be easier for us to tell server operators that they'll each be receiving 1/X of the total upload traffic (where X is the mostly-constant number of non-full servers, something that will be easy to display on the Welcome page), than to tell them that their traffic will different for each server and will depend in hard-to-predict ways upon the exact distribution of nodeids and the encoding parameters chosen by each uploader. We could build a tool that took k+N+serverlist and computed a rate for any given server, and display that on the welcome page, but I'd expect users to look at us funny and wonder why it keeps changing and why they can't control it (assuming that it proves hard to decouple nodeid from ring position). I think users would understand the practical effects of a uniform rate much more easily, and then be able to answer useful questions like "how long will it be until my server is full".

Zooko writes:

My simulator (mentioned earlier in this thread) reported time-to-full instead of reporting inlet rate, and it indicated that regardless of whether you have permuted or non-permuted share placement, if you start with a large set of empty, same-sized servers and start filling them, then once the first one gets full then very quickly they all get full.

With all due respect, your conclusion is wrong. In fact, the message you wrote in http://allmydata.org/pipermail/tahoe-dev/2008-July/000676.html doesn't actually support this conclusion:

Here is the result of "simulate_load.py --iters=128 --permute":

[GRAPHS]

I'm not sure that I can see any difference between those two.

[CONCLUSIONS] ..

Whoops! Stop the presses! I misled myself by running many iterations (128 of them), averaging the results together, and plotting just the average. If you run "simulate_load.py --iters=1" and look at the output of a single actual run you'll see that the results from any given run are all over the map, and that the variability of the system totally swamps any subtle difference in the average behavior between permuted and simple variants. (There are a couple of examples of --iters=1 runs appended.)

Which, if you had removed the text that was negated by the misleading averaging-of-many-iterations, would have just read:

you'll see that the results from any given run are all over the map, and that the variability of the system totally swamps any subtle difference in the average behavior between permuted and simple variants.

from which it is hard to draw any conclusions. (in the future, if you conclude A and then realize that A is wrong and then examine B, then please write a message that says "B" instead of a message that says "A! Not A! B". I'll get more out of it and I'll be more convinced by your B conclusions. Much of my vague disbelief about your simulator's results was due to the misleading train of thought through A+!A territory :-).

Now, the examples of --iters=1 you attached *do* appear to support your conclusion, in that it showed about 137k files before the first server was filled, and about 140k files when the first wrap occurred. I ran your simulator some more myself (with the same settings as my own: 3-of-10, 100 servers) to learn more, and saw similar results, all of which were at odds with the simulator that I wrote yesterday.

When I run the current version (v3) of my ringsim.py with --permute=0, I see the first server being filled at about 83k files, whereas the first wrapped file (which occurs when the numservers-N'th server is filled) occurs at about 128k files, and the entire grid fills at about 129k files. This hardly counts as "once the first one gets full then very quickly they all get full": the two events are separated by a third of the total uploads.

Now, why did your simulation results differ so greatly from mine? I was missing this until just today, but the source of the problem is a bug in your simulator. The SI-as-offset-into-ring algorithm, as defined in the Description of this ticket, is:

  • put the storage servers in a circle, placed by their nodeid
  • use the storage index as a starting location
  • travel clockwise, asking each server in turn to accept a lease
  • continue travelling until all shares are placed, or we've run out of servers

However your simulate_load.py code which simulated this algorithm is:

        else:
            # rotate a random number
            rot = random.randrange(0, len(servers))
            servers = servers[rot:] + servers[:rot]

which actually implements an entirely different algorithm, one which is clearly insensitive to the distribution of the nodeids. My simulator implements the real rotate-by-SI algorithm, and is thus sensitive to the lumpy distribution. That's why I was seeing a big difference between first-full and last-full, and you were not.

Your simulator's algorithm is even simpler. Maybe we should consider pursuing it, instead of these other two? Let's call it "random-spin". The full specification would be something like:

  • put the storage servers in a circle, placed by their nodeid
  • calculate a pseudorandom "spin count", perhaps equal to SI mod len(servers)
  • start with the spincount'th clockwise server
  • travel clockwise, asking each server in turn to accept a lease
  • continue travelling until all shares are placed, or we've run out of servers
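
For concreteness, a sketch of random-spin (assuming the storage index is a byte string; as discussed next, the spin value is very sensitive to the length of the server list):

    def random_spin_servers(storage_index, peerids):
        # Arrange servers on a ring by nodeid, spin by SI mod len(servers),
        # and walk clockwise from the resulting server.
        ring = sorted(peerids)
        spin = int.from_bytes(storage_index, "big") % len(ring)
        return ring[spin:] + ring[:spin]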

Unfortunately using SI % len(servers) for the spincount calculation would be massively sensitive to the number of servers: adding or removing a single server would effectively randomize its value, causing the downloader to look in entirely the wrong place. We need something more stable.

Perhaps if the spin count were recorded in the filecap? We could either generate it randomly (and give up any pretense of convergence), or use the SI to calculate it in the first place (and get location-of-storage convergence iff the serverlist remained exactly the same). Either way, if we write it down, the downloader could use that copy from the filecap, making it a bit more stable.

How sensitive is this to server churn? Specifically, if a server gets added, how much overlap will there be between the downloader's servers-to-query list and the uploader's? If the new node appears CW of the last (most-CW) uploader-server, then nobody will notice. If it is added CCW of the last uploader-server, then the overlap will be reduced by one. Adding multiple nodes in a single area affects the overlap to a varying extent depending upon the insertion point: it hurts the most if it's CCW of the first uploader-server, dropping to zero impact if it's CW of the last uploader-server.

This is a bit troubling, because it means that some files are going to be more sensitive to churn than others. A file that has a spincount of zero is most tolerant: 10% churn means only 10% loss of overlap. But if the spincount is high, so the set of uploader-servers starts at the 90% point of the ring (11 o'clock, nodeid 0xff..), then every single inserted server will mess up the overlap. With 3-of-10 encoding, adding 10 servers anywhere on the ring (even if there were a million servers there already) would double the downloader's search distance, and removing 10 servers would cause the downloader to have to query every single node in the grid to find its shares (the starting point would be beyond all the right nodes, so it'd have to search the long way around).

I believe the permuted-list algorithm is more uniformly sensitive to churn: regardless of where the file lives, adding 10% new nodes (i.e. adding 10 nodes to a grid that used to have 100 nodes) will reduce the overlap by 10%. Adding 10 servers to a million-node grid would have negligible effect.
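
One way to quantify "overlap" in these comparisons (a hypothetical helper, usable with any of the selection sketches above):

    def overlap_fraction(uploader_order, downloader_order, n):
        # Fraction of the uploader's first n servers that the downloader
        # also finds within its own first n, after membership churn.
        return len(set(uploader_order[:n]) & set(downloader_order[:n])) / float(n)

For example, you could compare the orderings produced for the same SI before and after adding or removing servers.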

I suspect that random-spin is, on average, equally sensitive to churn as permuted-list, but I believe there would be a huge variation from one file to the next, enough to cause significant problems for downloaders.

So again I'm back to favoring permuted-list.

Oh! Also note that "What we really want, to attain maximum reliability, is to never double-up shares" is at least partially if not fully addressed by #778.

Well, no, #778 ("use servers-of-happiness instead of shares-of-happiness") is a question of what qualifies as a successful upload versus a failed upload. It doesn't say anything about how to choose the servers. Instead, as I understand it, it's about evaluating the results of an upload to decide whether it needs to be repeated (by reporting it as a failure to the user, who will presumably try it again). #778 seems to provide a function that is run after the upload is complete, one which takes the set of surviving servers (i.e. the ones which stuck around until the end of the upload process) and returns a success/fail boolean.

To attain maximum reliability, in the long run, requires a server-selection policy/algorithm that gives us good results and maximum diversity across thousands and millions of uploads, with continual server churn. #778 is too narrowly scoped for that.

comment:17 Changed at 2009-12-27T04:38:39Z by warner

davidsarah writes:

The results of Brian's and Zooko's simulations seem contradictory to me.

Suppose that one server is filling up x times faster than the average of all the servers. Then, for the case where all servers have the same amount of storage, I'd expect that server to fill up at the point where the grid is only approximately 1/x full.

Yes, that's correct. When my ring-offset simulation (ringsim.py) is run with parameters --seed=abc --servers=40 --k=1 --N=1 --permute=0 --fileseed=123, then X is about 4 (about 9MB/upload vs an expected average of 2.28MB/file). We'd expect the grid to become full after 439201 uploads, but the first server becomes full after only 110344 uploads. (the current version of ringsim.py is fully deterministic: you should be able to run it with the same parameters and get exactly the same results).

Note that my previous simulator didn't make it easy to tell when a server became full, and had set the server capacity ridiculously high (1PB) to focus on the long-term bytes-per-second inlet rates (it also didn't actually implement the fall-over-to-other-servers code that would be necessary with finite capacity). The current version puts more attention on the grid-is-full period and handles server fall-over accurately.

Is it my expectation that is wrong, or is it one of the simulations (hopefully not both :-)?

Zooko's simulation was actually of a different algorithm, "random-spin", which has uniform inlet rates, but has other problems, described above.

In any case, if #778 is fixed, then a non-uniform upload rate shouldn't affect preservation of files, for as long as there are still happy servers that are not full.

On a per-file basis, yes. I think a non-uniform upload rate in a homogeneous grid will result in the (fewer than $HAPPY non-full servers) state happening slightly earlier. As evidence, I look at ringsim.py's results for the "FIRST FILE WRAPPED" event, which is a more stringent condition than $HAPPY would be (it's equivalent to setting $HAPPY=N). With --seed=abc, I see ring-offset (i.e. --permute=0) getting the first wrap at 124801 files, and the grid being full at 126738 files, and I see permuted-list get the wrap at 126528 and grid-full at 126738. So permuted-list was able to upload about 1700 more files without doubling-up shares than ring-offset could do.

But I'll continue to argue that, when you look at multiple files as a group, you get better reliability when you have more servers, in which case the filling-some-servers-earlier behavior will be detrimental to even non-wrapped files (ones that #778 would declare as successful).

But if some servers become full early, then that will affect upload performance because uploads will be expected to have to contact more servers before finding one that is non-full.

I haven't even been thinking about upload *performance*, but yeah, that's right. I suspect that random-spin and ring-offset will suffer problems here: some section of the ring will get full, and then any upload which touches that section will have to hunt a long distance for free servers (whereas other portions of the ring will have short searches because they're all free). Whereas permuted-list will effectively get a random list of servers every time, so the uploader's search distance should grow smoothly as the grid fills up, each file getting roughly the same distance. In other words, the mean of the upload search distance will grow as the grid fills, but for permuted-list the standard deviation will remain low, whereas for random-spin and ring-offset the stdev will be high.

comment:18 Changed at 2009-12-27T04:53:11Z by davidsarah

"either you shape your traffic (intentionally sending more shares to large servers than small ones, in an attempt to fill everybody at the same time)"

This is essentially #872.

comment:19 Changed at 2010-01-10T23:36:13Z by zooko

Oh! You're right, my simulator was not simulating random-index-into-ring correctly. Sorry about that.

comment:20 Changed at 2010-02-22T23:49:33Z by zooko

Okay, I was wrong because my simulator had a bug, and Brian was right that a flat ring has substantially worse load-balancing properties. I think we should close this ticket as wontfix, but I'd like to read through it one last time and extract any valuable bits first.

comment:21 Changed at 2010-12-29T08:37:53Z by zooko

  • Owner set to zooko
  • Status changed from new to assigned

comment:22 Changed at 2010-12-29T19:55:51Z by davidsarah

In Boris Mejías' thesis, the non-uniform distribution problem in ring-based DHTs is mentioned in section 2.3.1. He references this paper on how to solve it, although I haven't checked whether that would be applicable to Tahoe.

Last edited at 2010-12-29T19:56:42Z by davidsarah

comment:23 Changed at 2012-09-10T17:51:02Z by davidsarah

I think this ticket should now be wontfixed.

comment:24 Changed at 2013-03-27T14:37:09Z by zooko

  • Description modified (diff)

comment:25 Changed at 2013-06-25T15:58:18Z by zooko

  • Description modified (diff)
  • Resolution set to invalid
  • Status changed from assigned to closed

Okay, I'm finally closing this ticket! Brian was totally right, and his original idea of permuting the ring per-file-id was great, we've used it all along, and it does help a lot with issues of load-balancing storage which are endemic to the whole consistent-hashing concept. Thanks, Brian! ☺
