| 1 | The "Denver Airport" Protocol |
|---|
| 2 | |
|---|
| 3 | (discussed whilst returning robk to DEN, 12/1/06) |
|---|
| 4 | |
|---|
This is a scaling improvement on the "Select Peers" phase of Tahoe2. The
problem it tries to address is the storage and maintenance of the 1M-long
peer list, and the relative difficulty of gathering long-term reliability
information on a useful number of those peers.

In DEN, each node maintains a Chord-style set of connections to other nodes:
log2(N) "finger" connections to distant peers (the first of which is halfway
across the ring, the second 1/4 across, then 1/8th, etc). These connections
need to be kept alive with relatively short timeouts (5s?), so any breaks
can be rejoined quickly. In addition to the finger connections, each node
must also remain aware of K "successor" nodes (those which are immediately
clockwise of it on the ring). The node is not required to maintain
connections to these, but it should remain informed about their contact
information, so that it can create connections when necessary. We probably
need a connection open to the immediate successor at all times.

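A rough Python sketch of the finger-target calculation, assuming a 256-bit
nodeid ring (the names here are illustrative, not part of the protocol);
each finger connection would go to the first live node at or after its
target:

 RING_BITS = 256
 RING_SIZE = 2 ** RING_BITS

 def finger_targets(my_nodeid, num_fingers):
     # first target is halfway around the ring, then 1/4, then 1/8, etc;
     # each finger connects to the first live node at or after its target
     return [(my_nodeid + (RING_SIZE >> (i + 1))) % RING_SIZE
             for i in range(num_fingers)]
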
Since inbound connections exist too, each node has something like 2*log2(N)
plus up to 2*K connections.

Each node keeps a history of the uptime/availability of the nodes that it
remains connected to. Each message that is sent to these peers includes an
estimate of that peer's availability from the point of view of the outside
world. The receiving node averages these reports together to determine the
reliability it should announce to anyone it accepts leases for. This
reliability is expressed as a percentage uptime: P=1.0 means the peer is
available 24/7, P=0.0 means it is almost never reachable.

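A rough sketch of the averaging step; the plain running mean here is an
assumption (the note only says the reports are averaged), and the names are
illustrative:

 class ReliabilityEstimate:
     def __init__(self):
         self.total = 0.0
         self.count = 0

     def record_report(self, observed_P):
         # observed_P: one connected peer's view of our availability
         self.total += observed_P
         self.count += 1

     def announced_P(self):
         # the average we advertise when accepting leases
         return self.total / self.count if self.count else 0.0
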
When a node wishes to publish a file, it creates a list of (verifierid,
sharenum) tuples, and computes a hash of each tuple. These hashes then
represent starting points for the landlord search:

 # sha() here is the node's usual secure hash, applied to the
 # concatenation of verifierid and the ASCII sharenum
 starting_points = [(sharenum, sha(verifierid + str(sharenum)))
                    for sharenum in range(256)]

The node then constructs a reservation message that contains enough
information for the potential landlord to evaluate the lease, *and* to make
a connection back to the starting node:

 message = [verifierid, sharesize, requestor_furl, starting_points]

The node looks through its list of finger connections and splits this
message into up to log2(N) smaller messages, each of which contains only
the starting points that should be sent to that finger connection.
Specifically, we send a starting_point to finger A if the nodeid of that
finger is <= the starting_point and the nodeid of the next finger B is >
the starting_point. Each message sent out can contain multiple
starting_points, each for a different share.

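A rough sketch of the splitting rule, assuming the fingers are kept sorted
by nodeid (the names are illustrative, and the wrap-around at the top of
the ring is glossed over):

 def split_among_fingers(fingers, starting_points):
     # fingers: list of (finger_nodeid, connection), sorted by nodeid
     # returns a dict mapping connection -> starting_points to send it
     batches = {}
     for sharenum, sp in starting_points:
         for i, (nodeid, conn) in enumerate(fingers):
             next_nodeid = fingers[i+1][0] if i+1 < len(fingers) else None
             if nodeid <= sp and (next_nodeid is None or next_nodeid > sp):
                 batches.setdefault(conn, []).append((sharenum, sp))
                 break
     return batches
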
When a finger node receives this message, it performs the same splitting
algorithm, sending each starting_point on to its own fingers. Eventually a
starting_point is received by a node that knows the starting_point lies
between itself and its immediate successor. At this point the message
switches from "hop" mode (following fingers) to "search" mode (following
successors).

While in "search" mode, each node interprets the message as a lease request.
It checks its storage pool to see if it can accommodate the reservation. If
so, it uses requestor_furl to contact the originator and announces its
willingness to host the given sharenum. This message includes the
reliability measurement derived from the host's counterclockwise neighbors.

If the recipient cannot host the share, it forwards the request on to the
next successor, which repeats the cycle. Each message has a maximum hop
count which limits the number of peers that may be searched before giving
up. If a node sees itself to be the last such hop, it must establish a
connection to the originator and let them know that this sharenum could not
be hosted.

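A rough sketch of the search-mode handling described in the last two
paragraphs; storage_pool, next_successor, and contact_originator are
illustrative stand-ins, not defined anywhere in this note:

 def handle_search_mode(msg, hops_left, storage_pool, next_successor,
                        contact_originator, my_announced_P):
     if storage_pool.can_accommodate(msg.verifierid, msg.sharesize):
         # accept: call the originator back over its requestor_furl,
         # including our announced reliability measurement
         contact_originator(msg.requestor_furl, msg.sharenum,
                            accepted=True, reliability=my_announced_P)
     elif hops_left > 0:
         # can't host it: pass the request along the ring
         next_successor.send(msg, hops_left - 1)
     else:
         # last hop: tell the originator this sharenum found no home
         contact_originator(msg.requestor_furl, msg.sharenum, accepted=False)
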
The originator sends out something like 100 or 200 starting points, and
expects to get back responses (positive or negative) in a reasonable amount
of time (perhaps: if we receive half of the responses within time T, wait a
total of 2T for the remaining ones). If no response is received within the
timeout, either re-send the requests for those shares (to different fingers)
or send requests for completely different shares.

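A rough sketch of that timeout heuristic; receive() is an illustrative
blocking call that returns one response, or None when its timeout expires:

 import time

 def collect_responses(num_sent, receive, T):
     responses = []
     deadline = time.time() + T
     extended = False
     while len(responses) < num_sent:
         remaining = deadline - time.time()
         if remaining <= 0:
             break
         r = receive(timeout=remaining)   # one response, or None on timeout
         if r is not None:
             responses.append(r)
         if not extended and 2 * len(responses) >= num_sent:
             deadline += T                # half arrived: allow 2T in total
             extended = True
     return responses
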
Each share represents some fraction of a point "S", such that the points for
enough shares to reconstruct the whole file sum to 1.0 points. I.e., if we
construct 100 shares such that we need 25 of them to reconstruct the file,
then each share represents S = 1/25 = .04 points.

As the positive responses come in, we accumulate two counters: the capacity
counter (which gets a full S points for each positive response), and the
reliability counter (which gets S*(reliability-of-host) points). The
capacity counter is not allowed to go above some limit (like 4x), as
determined by provisioning. The node keeps adding leases until the
reliability counter has gone above some other threshold (larger than, but
close to, 1.0).

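A rough sketch of the two counters, with illustrative values for the
provisioned capacity limit and the reliability target:

 CAPACITY_LIMIT = 4.0       # e.g. 4x expansion, set by provisioning
 RELIABILITY_TARGET = 1.2   # "larger than, but close to, 1.0"

 def accept_leases(positive_responses, S):
     capacity = reliability = 0.0
     accepted = []
     for resp in positive_responses:
         if capacity + S > CAPACITY_LIMIT:
             break                    # don't exceed the provisioned limit
         capacity += S
         reliability += S * resp.reliability
         accepted.append(resp)
         if reliability > RELIABILITY_TARGET:
             break                    # enough reliability points
     return accepted
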
[ at download time, each host will be able to provide the share back with
probability P times an exponential decay factor related to peer death. Sum
these probabilities to get the average number of shares that will be
available. The interesting thing is actually the distribution of these
probabilities, and what threshold you have to pick to get a sufficiently
high chance of recovering the file. If there are N identical peers with
probability P, the number of recovered shares will have (approximately) a
gaussian distribution with an average of N*P and a stddev of
sqrt(N*P*(1-P)). The complementary CDF of this distribution is an S-curve,
with a sharper slope when N is large. The probability of recovering the
file is the value of this S-curve at the threshold value (the number of
necessary shares).

P is not actually constant across all peers; rather, we assume that it has
its own distribution: maybe gaussian, more likely exponential (power law).
This changes the shape of the S-curve. Assuming that we can characterize
the distribution of P with perhaps two parameters (say meanP and stddevP),
the S-curve is a function of meanP, stddevP, N, and threshold...

To get 99.99% or 99.999% recoverability, we must choose a threshold value
high enough to accommodate the random variations and uncertainty about the
real values of P for each of the hosts we've selected. By counting
reliability points, we are trying to estimate meanP/stddevP, so we know
which S-curve to look at. The threshold is fixed at 1.0, since that's what
erasure coding tells us we need to recover the file. The job is then to add
hosts (increasing N and possibly changing meanP/stddevP) until our
recoverability probability is as high as we want.
]

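A worked version of the identical-peer case above, using the gaussian
approximation (math.erfc gives the tail probability):

 import math

 def prob_file_recoverable(N, P, threshold):
     # shares available ~ Binomial(N, P), approximated as a gaussian
     mean = N * P
     stddev = math.sqrt(N * P * (1 - P))
     z = (threshold - mean) / stddev
     return 0.5 * math.erfc(z / math.sqrt(2))   # P(shares >= threshold)

 # e.g. 100 shares on peers with P=0.5, needing 25 to reconstruct:
 # prob_file_recoverable(100, 0.5, 25) -> about 0.9999997
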
The originator takes all acceptance messages and adds them, in order, to the
list of landlords that will be used to host the file. It stops when it gets
enough reliability points. Note that it does *not* discriminate against
unreliable hosts: they are less likely to have been found in the first
place, so we don't need to discriminate against them a second time. We do,
however, use the reliability points to acknowledge that sending data to an
unreliable peer is not as useful as sending it to a reliable one (there is
still value in doing so, though). The remaining reservation-acceptance
messages are cancelled and then put aside: if we need to make a second pass,
we ask those peers first.

Shares are then created and published as in Tahoe2. If we lose a connection
during the encoding, that share is lost. If we lose enough shares, we might
want to generate more to make up for them: this is done by using the
leftover acceptance messages first, then triggering a new Chord search for
the as-yet-unaccepted sharenums. These new peers will get shares from all
segments that have not yet been finished, then a second pass will be made
to catch them up on the earlier segments.

Properties of this approach:

The total number of peers that each node must know anything about is
bounded to something like 2*log2(N) + K, probably on the order of 50 to 100
total. This is the biggest advantage, since in Tahoe2 each node must know
at least the nodeid of all 1M peers. The maintenance traffic should be much
less as a result.

Each node must maintain open (kept-alive) connections to something like
2*log2(N) peers. In Tahoe2, this number is 0 (well, probably 1 for the
introducer).

During upload, each node must actively use 100 connections to a random set
of peers to push data (just like Tahoe2).

The probability that any given share-request goes unanswered is roughly the
number of hops it travels through times the chance that a peer dies while
holding on to the message. This should be pretty small, as the message
should only be held by a peer for a few seconds (more if their network is
busy). In Tahoe2, each share-request always gets a response, since they are
made directly to the target.

I visualize the peer-lookup process as the originator creating a
message-in-a-bottle for each share. Each message says "Dear Sir/Madam, I
would like to store X bytes of data for file Y (share #Z) on a system close
to (but not below) nodeid STARTING_POINT. If you find this amenable, please
contact me at FURL so we can make arrangements." These messages are then
bundled together according to their rough destination (STARTING_POINT) and
sent somewhere in the right direction.

Download happens the same way: lookup messages are disseminated towards the
STARTING_POINT and then search one successor at a time from there. There
are two ways that the share might go missing: if the node is now offline
(or has for some reason lost its shares), or if new nodes have joined since
the original upload and the search depth (maximum hop count) is too small
to accommodate the churn. Both result in the same amount of localized
traffic. In the latter case, a storage node might want to migrate the share
closer to the starting point, or perhaps just send the new nodes a note so
they can remember a pointer to the share.

Checking: anyone who wishes to do a filecheck needs to send out a lookup
message for every potential share. These lookup messages could have a higher
search depth than usual. It would also be useful to know how many peers each
message went through before being returned: this could drive repair, by
instructing the old host (which is further from the starting point than
you'd like) to push its share closer towards the starting point.