| 1 | The "Denver Airport" Protocol |
|---|
| 2 | |
|---|
| 3 | (discussed whilst returning robk to DEN, 12/1/06) |
|---|
| 4 | |
|---|
This is a scaling improvement on the "Select Peers" phase of Tahoe2. The
problem it tries to address is the storage and maintenance of the 1M-long
peer list, and the relative difficulty of gathering long-term reliability
information on a useful number of those peers.

In DEN, each node maintains a Chord-style set of connections to other nodes:
log2(N) "finger" connections to distant peers (the first of which is halfway
across the ring, the second 1/4 across, then 1/8th, etc). These connections
need to be kept alive with relatively short timeouts (5s?), so any breaks
can be rejoined quickly. In addition to the finger connections, each node
must also remain aware of K "successor" nodes (those which are immediately
clockwise of it on the ring). The node is not required to maintain
connections to these, but it should remain informed about their contact
information, so that it can create connections when necessary. We probably
need a connection open to the immediate successor at all times.

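A rough Python sketch of the finger-target calculation, assuming a 256-bit
nodeid ring (the names here are illustrative, not part of the protocol);
each finger connection would go to the first live node at or after its
target:

 RING_BITS = 256
 RING_SIZE = 2 ** RING_BITS

 def finger_targets(my_nodeid, num_fingers):
     # first target is halfway around the ring, then 1/4, then 1/8, etc;
     # each finger connects to the first live node at or after its target
     return [(my_nodeid + (RING_SIZE >> (i + 1))) % RING_SIZE
             for i in range(num_fingers)]
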
Since inbound connections exist too, each node has something like 2*log2(N)
plus up to 2*K connections.

Each node keeps a history of the uptime/availability of the nodes that it
remains connected to. Each message that is sent to these peers includes an
estimate of that peer's availability from the point of view of the outside
world. The receiving node averages these reports together to determine the
reliability it should announce to anyone it accepts leases for. This
reliability is expressed as a percentage uptime: P=1.0 means the peer is
available 24/7, P=0.0 means it is almost never reachable.

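A rough sketch of the averaging step; the plain running mean here is an
assumption (the note only says the reports are averaged), and the names are
illustrative:

 class ReliabilityEstimate:
     def __init__(self):
         self.total = 0.0
         self.count = 0

     def record_report(self, observed_P):
         # observed_P: one connected peer's view of our availability
         self.total += observed_P
         self.count += 1

     def announced_P(self):
         # the average we advertise when accepting leases
         return self.total / self.count if self.count else 0.0
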
When a node wishes to publish a file, it creates a list of (verifierid,
sharenum) tuples, and computes a hash of each tuple. These hashes then
represent starting points for the landlord search:

 # sha() here is the node's usual secure hash, applied to the
 # concatenation of verifierid and the ASCII sharenum
 starting_points = [(sharenum, sha(verifierid + str(sharenum)))
                    for sharenum in range(256)]

The node then constructs a reservation message that contains enough
information for the potential landlord to evaluate the lease, *and* to make
a connection back to the starting node:

 message = [verifierid, sharesize, requestor_furl, starting_points]

The node looks through its list of finger connections and splits this
message into up to log2(N) smaller messages, each of which contains only
the starting points that should be sent to that finger connection.
Specifically, we send a starting_point to finger A if the nodeid of that
finger is <= the starting_point and the nodeid of the next finger B is >
the starting_point. Each message sent out can contain multiple
starting_points, each for a different share.

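A rough sketch of the splitting rule, assuming the fingers are kept sorted
by nodeid (the names are illustrative, and the wrap-around at the top of
the ring is glossed over):

 def split_among_fingers(fingers, starting_points):
     # fingers: list of (finger_nodeid, connection), sorted by nodeid
     # returns a dict mapping connection -> starting_points to send it
     batches = {}
     for sharenum, sp in starting_points:
         for i, (nodeid, conn) in enumerate(fingers):
             next_nodeid = fingers[i+1][0] if i+1 < len(fingers) else None
             if nodeid <= sp and (next_nodeid is None or next_nodeid > sp):
                 batches.setdefault(conn, []).append((sharenum, sp))
                 break
     return batches
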
When a finger node receives this message, it performs the same splitting
algorithm, sending each starting_point on to its own fingers. Eventually a
starting_point is received by a node that knows the starting_point lies
between itself and its immediate successor. At this point the message
switches from "hop" mode (following fingers) to "search" mode (following
successors).

While in "search" mode, each node interprets the message as a lease request.
It checks its storage pool to see if it can accommodate the reservation. If
so, it uses requestor_furl to contact the originator and announces its
willingness to host the given sharenum. This message includes the
reliability measurement derived from the host's counterclockwise neighbors.

If the recipient cannot host the share, it forwards the request on to the
next successor, which repeats the cycle. Each message has a maximum hop
count which limits the number of peers that may be searched before giving
up. If a node sees itself to be the last such hop, it must establish a
connection to the originator and let them know that this sharenum could not
be hosted.

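A rough sketch of the search-mode handling described in the last two
paragraphs; storage_pool, next_successor, and contact_originator are
illustrative stand-ins, not defined anywhere in this note:

 def handle_search_mode(msg, hops_left, storage_pool, next_successor,
                        contact_originator, my_announced_P):
     if storage_pool.can_accommodate(msg.verifierid, msg.sharesize):
         # accept: call the originator back over its requestor_furl,
         # including our announced reliability measurement
         contact_originator(msg.requestor_furl, msg.sharenum,
                            accepted=True, reliability=my_announced_P)
     elif hops_left > 0:
         # can't host it: pass the request along the ring
         next_successor.send(msg, hops_left - 1)
     else:
         # last hop: tell the originator this sharenum found no home
         contact_originator(msg.requestor_furl, msg.sharenum, accepted=False)
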
The originator sends out something like 100 or 200 starting points, and
expects to get back responses (positive or negative) in a reasonable amount
of time (perhaps: if we receive half of the responses within time T, wait a
total of 2T for the remaining ones). If no response is received within the
timeout, either re-send the requests for those shares (to different fingers)
or send requests for completely different shares.

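A rough sketch of that timeout heuristic; receive() is an illustrative
blocking call that returns one response, or None when its timeout expires:

 import time

 def collect_responses(num_sent, receive, T):
     responses = []
     deadline = time.time() + T
     extended = False
     while len(responses) < num_sent:
         remaining = deadline - time.time()
         if remaining <= 0:
             break
         r = receive(timeout=remaining)   # one response, or None on timeout
         if r is not None:
             responses.append(r)
         if not extended and 2 * len(responses) >= num_sent:
             deadline += T                # half arrived: allow 2T in total
             extended = True
     return responses
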
Each share represents some fraction of a point "S", such that the points for
enough shares to reconstruct the whole file sum to 1.0 points. I.e., if we
construct 100 shares such that we need 25 of them to reconstruct the file,
then each share represents S = 1/25 = .04 points.

As the positive responses come in, we accumulate two counters: the capacity
counter (which gets a full S points for each positive response), and the
reliability counter (which gets S*(reliability-of-host) points). The
capacity counter is not allowed to go above some limit (like 4x), as
determined by provisioning. The node keeps adding leases until the
reliability counter has gone above some other threshold (larger than, but
close to, 1.0).

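A rough sketch of the two counters, with illustrative values for the
provisioned capacity limit and the reliability target:

 CAPACITY_LIMIT = 4.0       # e.g. 4x expansion, set by provisioning
 RELIABILITY_TARGET = 1.2   # "larger than, but close to, 1.0"

 def accept_leases(positive_responses, S):
     capacity = reliability = 0.0
     accepted = []
     for resp in positive_responses:
         if capacity + S > CAPACITY_LIMIT:
             break                    # don't exceed the provisioned limit
         capacity += S
         reliability += S * resp.reliability
         accepted.append(resp)
         if reliability > RELIABILITY_TARGET:
             break                    # enough reliability points
     return accepted
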
[ at download time, each host will be able to provide the share back with
probability P times an exponential decay factor related to peer death. Sum
these probabilities to get the average number of shares that will be
available. The interesting thing is actually the distribution of these
probabilities, and what threshold you have to pick to get a sufficiently
high chance of recovering the file. If there are N identical peers with
probability P, the number of recovered shares will have (approximately) a
gaussian distribution with an average of N*P and a stddev of
sqrt(N*P*(1-P)). The complementary CDF of this distribution is an S-curve,
with a sharper slope when N is large. The probability of recovering the
file is the value of this S-curve at the threshold value (the number of
necessary shares).

P is not actually constant across all peers; rather, we assume that it has
its own distribution: maybe gaussian, more likely exponential (power law).
This changes the shape of the S-curve. Assuming that we can characterize
the distribution of P with perhaps two parameters (say meanP and stddevP),
the S-curve is a function of meanP, stddevP, N, and threshold...

To get 99.99% or 99.999% recoverability, we must choose a threshold value
high enough to accommodate the random variations and uncertainty about the
real values of P for each of the hosts we've selected. By counting
reliability points, we are trying to estimate meanP/stddevP, so we know
which S-curve to look at. The threshold is fixed at 1.0, since that's what
erasure coding tells us we need to recover the file. The job is then to add
hosts (increasing N and possibly changing meanP/stddevP) until our
recoverability probability is as high as we want.
]

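A worked version of the identical-peer case above, using the gaussian
approximation (math.erfc gives the tail probability):

 import math

 def prob_file_recoverable(N, P, threshold):
     # shares available ~ Binomial(N, P), approximated as a gaussian
     mean = N * P
     stddev = math.sqrt(N * P * (1 - P))
     z = (threshold - mean) / stddev
     return 0.5 * math.erfc(z / math.sqrt(2))   # P(shares >= threshold)

 # e.g. 100 shares on peers with P=0.5, needing 25 to reconstruct:
 # prob_file_recoverable(100, 0.5, 25) -> about 0.9999997
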
The originator takes all acceptance messages and adds them, in order, to the
list of landlords that will be used to host the file. It stops when it gets
enough reliability points. Note that it does *not* discriminate against
unreliable hosts: they are less likely to have been found in the first
place, so we don't need to discriminate against them a second time. We do,
however, use the reliability points to acknowledge that sending data to an
unreliable peer is not as useful as sending it to a reliable one (there is
still value in doing so, though). The remaining reservation-acceptance
messages are cancelled and then put aside: if we need to make a second pass,
we ask those peers first.

Shares are then created and published as in Tahoe2. If we lose a connection
during the encoding, that share is lost. If we lose enough shares, we might
want to generate more to make up for them: this is done by using the
leftover acceptance messages first, then triggering a new Chord search for
the as-yet-unaccepted sharenums. These new peers will get shares from all
segments that have not yet been finished, then a second pass will be made
to catch them up on the earlier segments.

Properties of this approach:

The total number of peers that each node must know anything about is
bounded to something like 2*log2(N) + K, probably on the order of 50 to 100
total. This is the biggest advantage, since in Tahoe2 each node must know
at least the nodeid of all 1M peers. The maintenance traffic should be much
less as a result.

Each node must maintain open (kept-alive) connections to something like
2*log2(N) peers. In Tahoe2, this number is 0 (well, probably 1 for the
introducer).

During upload, each node must actively use 100 connections to a random set
of peers to push data (just like Tahoe2).

The probability that any given share-request goes unanswered is roughly the
number of hops it travels through times the chance that a peer dies while
holding on to the message. This should be pretty small, as the message
should only be held by a peer for a few seconds (more if their network is
busy). In Tahoe2, each share-request always gets a response, since they are
made directly to the target.

I visualize the peer-lookup process as the originator creating a
message-in-a-bottle for each share. Each message says "Dear Sir/Madam, I
would like to store X bytes of data for file Y (share #Z) on a system close
to (but not below) nodeid STARTING_POINT. If you find this amenable, please
contact me at FURL so we can make arrangements." These messages are then
bundled together according to their rough destination (STARTING_POINT) and
sent somewhere in the right direction.

Download happens the same way: lookup messages are disseminated towards the
STARTING_POINT and then search one successor at a time from there. There
are two ways that the share might go missing: if the node is now offline
(or has for some reason lost its shares), or if new nodes have joined since
the original upload and the search depth (maximum hop count) is too small
to accommodate the churn. Both result in the same amount of localized
traffic. In the latter case, a storage node might want to migrate the share
closer to the starting point, or perhaps just send the new nodes a note so
they can remember a pointer to the share.

Checking: anyone who wishes to do a filecheck needs to send out a lookup
message for every potential share. These lookup messages could have a higher
search depth than usual. It would also be useful to know how many peers each
message went through before being returned: this could drive repair, by
instructing the old host (which is further from the starting point than
you'd like) to push its share closer towards the starting point.