(protocol proposal, work-in-progress, not authoritative)

= Mutable Files =

Mutable File Slots are places with a stable identifier that can hold data
that changes over time. In contrast to CHK slots, for which the
URI/identifier is derived from the contents themselves, the Mutable File Slot
URI remains fixed for the life of the slot, regardless of what data is placed
inside it.

Each mutable slot is referenced by two different URIs. The "read-write" URI
grants read-write access to its holder, allowing them to put whatever
contents they like into the slot. The "read-only" URI is less powerful, only
granting read access, and not enabling modification of the data. The
read-write URI can be turned into the read-only URI, but not the other way
around.

The data in these slots is distributed over a number of servers, using the
same erasure coding that CHK files use, with 3-of-10 being a typical choice
of encoding parameters. The data is encrypted and signed in such a way that
only the holders of the read-write URI will be able to set the contents of
the slot, and only the holders of the read-only URI will be able to read
those contents. Holders of either URI will be able to validate the contents
as being written by someone with the read-write URI. The servers who hold the
shares cannot read or modify them: the worst they can do is deny service (by
deleting or corrupting the shares), or attempt a rollback attack (which can
only succeed with the cooperation of at least k servers).

== Consistency vs Availability ==

There is an age-old battle between consistency and availability. Epic papers
have been written, elaborate proofs have been established, and generations of
theorists have learned that you cannot simultaneously achieve guaranteed
consistency with guaranteed availability. In addition, the closer you push
towards a guarantee on either axis, the more the cost and complexity of the
design goes up.

Tahoe's design goals are to favor design simplicity first, and read
availability second, over the other criteria.

As we develop more sophisticated mutable slots, the API may expose multiple
read versions to the application layer. The Tahoe philosophy is to defer most
consistency recovery logic to the higher layers. Some applications have
effective ways to merge multiple versions, so inconsistency is not
necessarily a problem (e.g. directory nodes can usually merge multiple "add
child" operations).

== The Prime Coordination Directive: "Don't Do That" ==

The current rule for applications which run on top of Tahoe is "do not
perform simultaneous uncoordinated writes". That means you need some
mechanism outside of Tahoe to make sure that two parties are not trying to
modify the same mutable slot at the same time. For example:

* don't give the read-write URI to anyone else. Dirnodes in a private
  directory generally satisfy this case, as long as you don't use two
  clients on the same account at the same time
* if you give a read-write URI to someone else, stop using it yourself. An
  inbox would be a good example of this.
* if you give a read-write URI to someone else, call them on the phone
  before you write into it
* build an automated mechanism to have your agents coordinate writes.
  For example, we expect a future release to include a FURL for a
  "coordination server" in the dirnodes. The rule can be that you must
  contact the coordination server and obtain a lock/lease on the file
  before you're allowed to modify it.

If you do not follow this rule, Bad Things will happen. The worst-case Bad
Thing is that the entire file will be lost. A less-bad Bad Thing is that one
or more of the simultaneous writers will lose their changes. An observer of
the file may not see monotonically-increasing changes to the file, i.e. they
may see version 1, then version 2, then 3, then 2 again.

Tahoe takes some amount of care to reduce the badness of these Bad Things.
One way you can help nudge it from the "lose your file" case into the "lose
some changes" case is to reduce the number of competing versions: multiple
versions of the file that different parties are trying to establish as the
one true current contents. Each simultaneous writer counts as a "competing
version", as does the previous version of the file. If the count "S" of these
competing versions is larger than N/k, then the file runs the risk of being
lost completely. If at least one of the writers remains running after the
collision is detected, it will attempt to recover, but if S>(N/k) and all
writers crash after writing a few shares, the file will be lost.


== Small Distributed Mutable Files ==

SDMF slots are suitable for small (<1MB) files that are edited by rewriting
the entire file. The three operations are:

* allocate (with initial contents)
* set (with new contents)
* get (old contents)

The first use of SDMF slots will be to hold directories (dirnodes), which map
encrypted child names to rw-URI/ro-URI pairs.

=== SDMF slots overview ===

Each SDMF slot is created with a DSA public/private key pair, using a
system-wide common modulus and generator, in which the private key is a
random 256-bit number, and the public key is a larger value (about 2048 bits)
that can be derived with a bit of math from the private key. The public key
is known as the "verification key", while the private key is called the
"signature key".

The 256-bit signature key is used verbatim as the "write capability". This
can be converted into the 2048ish-bit verification key through a fairly cheap
set of modular exponentiation operations; this is done any time the holder of
the write-cap wants to read the data. (Note that the signature key can either
be a newly-generated random value, or the hash of something else, if we found
a need for a capability that's stronger than the write-cap).

This results in a write-cap which is 256 bits long and can thus be expressed
in an ASCII/transport-safe encoded form (base62 encoding, fits in 72
characters, including a local-node http: convenience prefix).

The private key is hashed to form a 256-bit "salt". The public key is also
hashed to form a 256-bit "pubkey hash". These two values are concatenated,
hashed, and truncated to 192 bits to form the first 192 bits of the read-cap.
The pubkey hash is hashed by itself and truncated to 64 bits to form the last
64 bits of the read-cap. The full read-cap is 256 bits long, just like the
write-cap.

The first 192 bits of the read-cap are hashed and truncated to form the first
64 bits of the storage index. The last 64 bits of the read-cap are hashed to
form the last 64 bits of the storage index. This gives us a 128-bit storage
index.
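
The following sketch walks through this derivation chain in Python. It is
illustrative only: the tag strings and the use of plain SHA-256 are
assumptions made to keep the example self-contained, not part of the
proposal.

import hashlib

def H(data, truncate_to=None):
    # Illustrative hash; the real design would presumably use tagged hashes.
    d = hashlib.sha256(data).digest()
    return d[:truncate_to] if truncate_to else d

def derive_caps(privkey, pubkey):
    # privkey is the 32-byte write-cap, pubkey is the DSA verification key.
    salt = H(b"salt:" + privkey)                   # 256-bit salt
    pubkey_hash = H(b"pubkey:" + pubkey)           # 256-bit pubkey hash
    readcap = (H(salt + pubkey_hash, 24)           # first 192 bits
               + H(pubkey_hash, 8))                # last 64 bits
    storage_index = (H(readcap[:24], 8)            # first 64 bits
                     + H(readcap[24:], 8))         # last 64 bits
    return salt, readcap, storage_index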

The verification-cap is the first 64 bits of the storage index plus the
pubkey hash, 320 bits total. The verification-cap doesn't need to be
expressed in a printable transport-safe form, so it's ok that it's longer.

The read-cap is hashed one way to form an AES encryption key that is used to
encrypt the salt; this key is called the "salt key". The encrypted salt is
stored in the share. The private key never changes, therefore the salt never
changes, and the salt key is only used for a single purpose, so there is no
need for an IV.

The read-cap is hashed a different way to form the master data encryption
key. A random "data salt" is generated each time the share's contents are
replaced, and the master data encryption key is concatenated with the data
salt, then hashed, to form the AES CTR-mode "read key" that will be used to
encrypt the actual file data. This is to avoid key reuse. An outstanding
issue is how to avoid key reuse when files are modified in place instead of
being replaced completely; this is not done in SDMF but might occur in MDMF.
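
A sketch of this encryption step is below, assuming SHA-256 for the key
derivations (the tag strings are invented for illustration) and the
third-party "cryptography" package for AES-CTR; neither choice is specified
by the proposal.

import hashlib
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_contents(readcap, plaintext):
    master_data_key = hashlib.sha256(b"data-key:" + readcap).digest()
    data_salt = os.urandom(32)   # fresh for every replacement of the contents
    read_key = hashlib.sha256(master_data_key + data_salt).digest()
    # A fixed counter block is tolerable only because read_key is never
    # reused for two different plaintexts (a new data_salt on every write).
    encryptor = Cipher(algorithms.AES(read_key),
                       modes.CTR(b"\x00" * 16)).encryptor()
    return data_salt, encryptor.update(plaintext) + encryptor.finalize()

def encrypt_salt(readcap, salt):
    # The salt key is single-purpose, so no IV is needed; CTR with a zero
    # counter block is one way to realize that (the mode is a guess here).
    salt_key = hashlib.sha256(b"salt-key:" + readcap).digest()
    enc = Cipher(algorithms.AES(salt_key), modes.CTR(b"\x00" * 16)).encryptor()
    return enc.update(salt) + enc.finalize()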

The private key is hashed one way to form the salt, and a different way to
form the "write enabler master". For each storage server on which a share is
kept, the write enabler master is concatenated with the server's nodeid and
hashed, and the result is called the "write enabler" for that particular
server. Note that multiple shares of the same slot stored on the same server
will all get the same write enabler, i.e. the write enabler is associated
with the "bucket", rather than the individual shares.
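
A minimal sketch of that derivation (the hash tags are again invented; the
real design would use its own tagged hashes):

import hashlib

def write_enabler_for(privkey, peer_nodeid):
    # One write enabler per (slot, server) pair; every share of this slot
    # stored on the same server (the same "bucket") shares this value.
    write_enabler_master = hashlib.sha256(b"we-master:" + privkey).digest()
    return hashlib.sha256(write_enabler_master + peer_nodeid).digest()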

The private key is hashed a third way to form the "data write key", which can
be used by applications which wish to store some data in a form that is only
available to those with a write-cap, and not to those with merely a read-cap.
This is used to implement transitive read-onlyness of dirnodes.

The public key is stored on the servers, as is the encrypted salt, the
(non-encrypted) data salt, the encrypted data, and a signature. The container
records the write-enabler, but of course this is not visible to readers. To
make sure that every byte of the share can be verified by a holder of the
verify-cap (and also by the storage server itself), the signature covers the
version number, the sequence number, the root hash "R" of the share Merkle
tree, the encoding parameters, and the encrypted salt. "R" itself covers the
hash trees and the share data.

The read-write URI is just the private key. The read-only URI is the
read-cap. The verify-only URI contains the pubkey hash and the first 64 bits
of the storage index.

  FMW:b2a(privatekey)
  FMR:b2a(readcap)
  FMV:b2a(storageindex[:64])b2a(pubkey-hash)
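
A sketch of how these strings could be assembled follows. The proposal calls
for base62; base32 is substituted here purely so the example stays within the
Python standard library.

from base64 import b32encode

def b2a(data):
    # Stand-in for the proposal's binary-to-ASCII encoding.
    return b32encode(data).decode("ascii").lower().rstrip("=")

def format_uris(privkey, readcap, storage_index, pubkey_hash):
    return {
        "read-write": "FMW:" + b2a(privkey),
        "read-only":  "FMR:" + b2a(readcap),
        "verify":     "FMV:" + b2a(storage_index[:8]) + b2a(pubkey_hash),
    }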

Note that this allows the read-only and verify-only URIs to be derived from
the read-write URI without actually retrieving any data from the share, but
instead by regenerating the public key from the private one. Holders of the
read-only or verify-only caps must validate the public key against their
pubkey hash (or its derivative) the first time they retrieve the pubkey,
before trusting any signatures they see.

The SDMF slot is allocated by sending a request to the storage server with a
desired size, the storage index, and the write enabler for that server's
nodeid. If granted, the write enabler is stashed inside the slot's backing
store file. All further write requests must be accompanied by the write
enabler or they will not be honored. The storage server does not share the
write enabler with anyone else.

The SDMF slot structure will be described in more detail below. The important
pieces are:

* a sequence number
* a root hash "R"
* the data salt
* the encoding parameters (including k, N, file size, segment size)
* a signed copy of [seqnum,R,data_salt,encoding_params] (using signature key)
* the verification key (not encrypted)
* the share hash chain (part of a Merkle tree over the share hashes)
* the block hash tree (Merkle tree over blocks of share data)
* the share data itself (erasure-coding of read-key-encrypted file data)
* the salt, encrypted with the salt key

The access pattern for read (assuming we hold the write-cap) is:
* generate public key from the private one
* hash private key to get the salt, hash public key, form read-cap
* form storage-index
* use storage-index to locate 'k' shares with identical 'R' values
  * either get one share, read 'k' from it, then read k-1 shares
  * or read, say, 5 shares, discover k, either get more or be finished
  * or copy k into the URIs
* .. jump to "COMMON READ", below

To read (assuming we only hold the read-cap), do:
* hash read-cap pieces to generate storage index and salt key
* use storage-index to locate 'k' shares with identical 'R' values
* retrieve verification key and encrypted salt
* decrypt salt
* hash decrypted salt and pubkey to generate another copy of the read-cap,
  make sure they match (this validates the pubkey)
* .. jump to "COMMON READ"

* COMMON READ:
  * read seqnum, R, data salt, encoding parameters, signature
  * verify signature against verification key
  * hash data salt and read-cap to generate read-key
  * read share data, compute block-hash Merkle tree and root "r"
  * read share hash chain (leading from "r" to "R")
  * validate share hash chain up to the root "R" (a sketch of these
    hash-tree checks follows this list)
  * submit share data to erasure decoding
  * decrypt decoded data with read-key
  * submit plaintext to application
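
The two hash-tree checks above amount to climbing Merkle paths: each block
must chain up to this share's root "r", and "r" must chain up to the
file-wide root "R" through the share hash chain. The sketch below is generic
(plain SHA-256, and an explicit sibling list rather than whatever node
numbering the real encoding would use):

import hashlib

def H(data):
    return hashlib.sha256(data).digest()

def climb(leaf_hash, path):
    # path is a list of (sibling_hash, sibling_is_left) pairs, one per level.
    node = leaf_hash
    for sibling, sibling_is_left in path:
        node = H(sibling + node) if sibling_is_left else H(node + sibling)
    return node

def validate_block(block, block_path, share_root_r, share_path, root_R):
    if climb(H(block), block_path) != share_root_r:
        return False                    # block hash tree does not check out
    return climb(share_root_r, share_path) == root_R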

The access pattern for write is:
* generate pubkey, salt, read-cap, storage-index as in read case
* generate data salt for this update, generate read-key
* encrypt plaintext from application with read-key
  * application can encrypt some data with the data-write-key to make it
    only available to writers (used for transitively-readonly dirnodes)
* erasure-code crypttext to form shares
* split shares into blocks
* compute Merkle tree of blocks, giving root "r" for each share
* compute Merkle tree of shares, find root "R" for the file as a whole
* create share data structures, one per server:
  * use seqnum which is one higher than the old version
  * share hash chain has log(N) hashes, different for each server
  * signed data is the same for each server
  * include pubkey, encrypted salt, data salt
* now we have N shares and need homes for them
* walk through peers
  * if share is not already present, allocate-and-set
  * otherwise, try to modify existing share:
    * send testv_and_writev operation to each one
    * testv says to accept share if their(seqnum+R) <= our(seqnum+R)
    * count how many servers wind up with which versions (histogram over R)
    * keep going until N servers have the same version, or we run out of
      servers
  * if any servers wound up with a different version, report error to
    application
  * if we ran out of servers, initiate recovery process (described below)
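
The "their(seqnum+R) <= our(seqnum+R)" test is essentially a lexicographic
comparison of (seqnum, R) pairs. One way to encode the version key being
compared (the concrete wire encoding is an assumption, not part of the
proposal):

import struct

def version_key(seqnum, root_hash_R):
    # Big-endian fixed-width seqnum followed by R, so that a byte-wise
    # comparison orders first by sequence number, then by root hash.
    return struct.pack(">Q", seqnum) + root_hash_R

def server_should_accept(held_seqnum, held_R, new_seqnum, new_R):
    # Keep the proposed share only if the share already held is not newer.
    return version_key(held_seqnum, held_R) <= version_key(new_seqnum, new_R)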

==== Cryptographic Properties ====

This scheme protects the data's confidentiality with 192 bits of key
material, since the read-cap contains 192 secret bits (derived from an
encrypted salt, which is encrypted using those same 192 bits plus some
additional public material).

The integrity of the data (assuming that the signature is valid) is protected
by the 256-bit hash which gets included in the signature. The privilege of
modifying the data (equivalent to the ability to form a valid signature) is
protected by a 256-bit random DSA private key, and the difficulty of
computing a discrete logarithm in a 2048-bit field.

There are a few weaker denial-of-service attacks possible. If N-k+1 of the
shares are damaged or unavailable, the client will be unable to recover the
file. Any coalition of more than N-k shareholders will be able to effect this
attack by merely refusing to provide the desired share. The "write enabler"
shared secret protects existing shares from being displaced by new ones,
except by the holder of the write-cap. One server cannot affect the other
shares of the same file, once those other shares are in place.

The worst DoS attack is the "roadblock attack", which must be made before
those shares get placed. Storage indexes are effectively random (being
derived from the hash of a random value), so they are not guessable before
the writer begins their upload, but there is a window of vulnerability during
the beginning of the upload, when some servers have heard about the storage
index but not all of them.

The roadblock attack we want to prevent is when the first server that the
uploader contacts quickly runs to all the other selected servers and places a
bogus share under the same storage index, before the uploader can contact
them. These shares will normally be accepted, since storage servers create
new shares on demand. The bogus shares would have randomly-generated
write-enablers, which will of course be different from the real uploader's
write-enabler, since the malicious server does not know the write-cap.

If this attack were successful, the uploader would be unable to place any of
their shares, because the slots have already been filled by the bogus shares.
The uploader would probably try for peers further and further away from the
desired location, but eventually they will hit a preconfigured distance limit
and give up. In addition, the further the writer searches, the less likely it
is that a reader will search as far. So a successful attack will either cause
the file to be uploaded but not be reachable, or it will cause the upload to
fail.

If the uploader tries again (creating a new privkey), they may get lucky and
the malicious servers will appear later in the query list, giving sufficient
honest servers a chance to see their share before the malicious one manages
to place bogus ones.

The first line of defense against this attack is the timing challenge: the
attacking server must be ready to act the moment a storage request arrives
(which will only occur for a certain percentage of all new-file uploads), and
only has a few seconds to act before the other servers will have allocated
the shares (and recorded the write-enabler, terminating the window of
vulnerability).

The second line of defense is post-verification, and is possible because the
storage index is partially derived from the public key hash. A storage server
can, at any time, verify every public bit of the container as being signed by
the verification key (this operation is recommended as a continual background
process, when disk usage is minimal, to detect disk errors). The server can
also hash the verification key to derive 64 bits of the storage index. If it
detects that these 64 bits do not match (but the rest of the share validates
correctly), then the implication is that this share was stored to the wrong
storage index, either due to a bug or a roadblock attack.

If an uploader finds that they are unable to place their shares because of
"bad write enabler" errors (as reported by the prospective storage servers),
they can "cry foul", and ask the storage server to perform this verification
on the share in question. If the pubkey and storage index do not match, the
storage server can delete the bogus share, thus allowing the real uploader to
place their share. Of course the origin of the offending bogus share should
be logged and reported to a central authority, so corrective measures can be
taken. It may be necessary to have this "cry foul" protocol include the new
write-enabler, to close the window during which the malicious server can
re-submit the bogus share during the adjudication process.

If the problem persists, the servers can be placed into pre-verification
mode, in which this verification is performed on all potential shares before
they are committed to disk. This mode is more CPU-intensive (since normally
the storage server ignores the contents of the container altogether), but
would solve the problem completely.

The mere existence of these potential defenses should be sufficient to deter
any actual attacks. Note that the storage index only has 64 bits of
pubkey-derived data in it, which is below the usual crypto guidelines for
security factors. In this case it's a pre-image attack which would be needed,
rather than a collision, and the actual attack would be to find a keypair for
which the public key can be hashed three times to produce the desired portion
of the storage index. We believe that 64 bits of material is sufficiently
resistant to this form of pre-image attack to serve as a suitable deterrent.


=== Server Storage Protocol ===

The storage servers will provide a mutable slot container which is oblivious
to the details of the data being contained inside it. Each storage index
refers to a "bucket", and each bucket has one or more shares inside it. (In a
well-provisioned network, each bucket will have only one share.) The bucket
is stored as a directory, using the base32-encoded storage index as the
directory name. Each share is stored in a single file, using the share number
as the filename.

The container holds space for a container magic number (for versioning), the
write enabler, the nodeid which accepted the write enabler (used for share
migration, described below), a small number of lease structures, the embedded
data itself, and expansion space for additional lease structures.

 #   offset    size    name
 1   0         32      magic verstr "tahoe mutable container v1" plus binary
 2   32        20      write enabler's nodeid
 3   52        32      write enabler
 4   84        8       data size (actual share data present) (a)
 5   92        8       offset of (8) count of extra leases (after data)
 6   100       368     four leases, 92 bytes each
                         0    4   ownerid (0 means "no lease here")
                         4    4   expiration timestamp
                         8   32   renewal token
                        40   32   cancel token
                        72   20   nodeid which accepted the tokens
 7   468       (a)     data
 8   ??        4       count of extra leases
 9   ??        n*92    extra leases
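
The fixed-size portion of this layout can be unpacked directly with Python's
struct module. The sketch below assumes the offsets in the table above,
treats leases as opaque tuples, and ignores the extra-lease region entirely.

import struct

HEADER_FMT = ">32s20s32sQQ"    # magic, WE nodeid, write enabler,
                               # data size, offset of extra-lease count
LEASE_FMT = ">LL32s32s20s"     # ownerid, expiration, renew, cancel, nodeid

def parse_container(container_bytes):
    (magic, we_nodeid, write_enabler,
     data_size, extra_lease_offset) = struct.unpack_from(HEADER_FMT,
                                                         container_bytes, 0)
    leases = []
    for i in range(4):
        lease = struct.unpack_from(LEASE_FMT, container_bytes, 100 + 92 * i)
        if lease[0] != 0:                  # ownerid 0 means "no lease here"
            leases.append(lease)
    data = container_bytes[468:468 + data_size]
    return magic, we_nodeid, write_enabler, leases, data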

The "extra leases" field must be copied and rewritten each time the size of
the enclosed data changes. The hope is that most buckets will have four or
fewer leases and this extra copying will not usually be necessary.

The (4) "data size" field contains the actual number of bytes of data present
in field (7), such that a client request to read beyond 504+(a) will result
in an error. This allows the client to (one day) read relative to the end of
the file. The container size (that is, (8)-(7)) might be larger, especially
if extra size was pre-allocated in anticipation of filling the container with
a lot of data.

The offset in (5) points at the *count* of extra leases, at (8). The actual
leases (at (9)) begin 4 bytes later. If the container size changes, both (8)
and (9) must be relocated by copying.

The server will honor any write commands that provide the write enabler and
do not exceed the server-wide storage size limitations. Read and write
commands MUST be restricted to the 'data' portion of the container: the
implementation of those commands MUST perform correct bounds-checking to make
sure other portions of the container are inaccessible to the clients.

The two methods provided by the storage server on these "MutableSlot" share
objects are:

* readv(ListOf(offset=int, length=int))
  * returns a list of bytestrings, of the various requested lengths
  * offset < 0 is interpreted relative to the end of the data
  * spans which hit the end of the data will return truncated data

* testv_and_writev(write_enabler, test_vector, write_vector)
  * this is a test-and-set operation which performs the given tests and only
    applies the desired writes if all tests succeed. This is used to detect
    simultaneous writers, and to reduce the chance that an update will lose
    data recently written by some other party (written after the last time
    this slot was read).
  * test_vector=ListOf(TupleOf(offset, length, opcode, specimen))
    * the opcode is a string, from the set [gt, ge, eq, le, lt, ne]
    * each element of the test vector is read from the slot's data and
      compared against the specimen using the desired (in)equality. If all
      tests evaluate True, the write is performed
  * write_vector=ListOf(TupleOf(offset, newdata))
    * offset < 0 is not yet defined, it probably means relative to the
      end of the data, which probably means append, but we haven't nailed
      it down quite yet
    * write vectors are executed in order, which specifies the results of
      overlapping writes
  * return value:
    * error: OutOfSpace
    * error: something else (io error, out of memory, whatever)
    * (True, old_test_data): the write was accepted (test_vector passed)
    * (False, old_test_data): the write was rejected (test_vector failed)
    * both 'accepted' and 'rejected' return the old data that was used
      for the test_vector comparison. This can be used by the client
      to detect write collisions, including collisions for which the
      desired behavior was to overwrite the old version.
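
A toy in-memory model of testv_and_writev, to make the semantics above
concrete. It ignores leases, size limits, and negative offsets, and it maps
the opcodes onto Python's byte-string comparisons; it is a sketch of the
semantics, not the server implementation.

import operator

_OPS = {"lt": operator.lt, "le": operator.le, "eq": operator.eq,
        "ge": operator.ge, "gt": operator.gt, "ne": operator.ne}

def testv_and_writev(data, write_enabler, expected_write_enabler,
                     test_vector, write_vector):
    # data is a bytearray holding the slot's current contents.
    if write_enabler != expected_write_enabler:
        raise ValueError("bad write enabler")
    old_test_data = [bytes(data[offset:offset + length])
                     for (offset, length, _opcode, _specimen) in test_vector]
    accepted = all(_OPS[opcode](bytes(data[offset:offset + length]), specimen)
                   for (offset, length, opcode, specimen) in test_vector)
    if accepted:
        for (offset, newdata) in write_vector:   # in order: later writes win
            end = offset + len(newdata)
            if end > len(data):
                data.extend(b"\x00" * (end - len(data)))
            data[offset:end] = newdata
    return accepted, old_test_data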

In addition, the storage server provides several methods to access these
share objects:

* allocate_mutable_slot(storage_index, sharenums=SetOf(int))
  * returns DictOf(int, MutableSlot)
* get_mutable_slot(storage_index)
  * returns DictOf(int, MutableSlot)
  * or raises KeyError

We intend to add an interface which allows small slots to allocate-and-write
in a single call, as well as do update or read in a single call. The goal is
to allow a reasonably-sized dirnode to be created (or updated, or read) in
just one round trip (to all N shareholders in parallel).

==== migrating shares ====

If a share must be migrated from one server to another, two values become
invalid: the write enabler (since it was computed for the old server), and
the lease renew/cancel tokens.

Suppose that a slot was first created on nodeA, and was thus initialized with
WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is
moved from nodeA to nodeB.

Readers may still be able to find the share in its new home, depending upon
how many servers are present in the grid, where the new nodeid lands in the
permuted index for this particular storage index, and how many servers the
reading client is willing to contact.

When a client attempts to write to this migrated share, it will get a "bad
write enabler" error, since the WE it computes for nodeB will not match the
WE(nodeA) that was embedded in the share. When this occurs, the "bad write
enabler" message must include the old nodeid (e.g. nodeA) that was in the
share.

The client then computes H(nodeB+H(WEM+nodeA)), which is the same as
H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which
is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to
anyone else. Also note that the client does not send a value to nodeB that
would allow the node to impersonate the client to a third node: everything
sent to nodeB will include something specific to nodeB in it.

The server locally computes H(nodeB+WE(nodeA)), using its own node id and the
old write enabler from the share. It compares this against the value supplied
by the client. If they match, this serves as proof that the client was able
to compute the old write enabler. The server then accepts the client's new
WE(nodeB) and writes it into the container.

This WE-fixup process requires an extra round trip, and requires the error
message to include the old nodeid, but does not require any public key
operations on either client or server.
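
A sketch of both halves of this exchange; plain SHA-256 and simple
concatenation stand in for whatever tagged hash the real protocol would use.

import hashlib

def H(data):
    return hashlib.sha256(data).digest()

def client_fixup_message(write_enabler_master, old_nodeid, new_nodeid):
    # Prove knowledge of WE(nodeA) without sending anything nodeB could
    # replay against a third server: everything is bound to new_nodeid.
    old_we = H(write_enabler_master + old_nodeid)     # WE(nodeA)
    proof = H(new_nodeid + old_we)                    # H(nodeB + WE(nodeA))
    new_we = H(write_enabler_master + new_nodeid)     # WE(nodeB)
    return proof, new_we

def server_accept_fixup(my_nodeid, stored_old_we, proof, new_we):
    # nodeB recomputes the proof from its own nodeid and the write enabler
    # already embedded in the migrated share.
    if H(my_nodeid + stored_old_we) != proof:
        raise ValueError("fixup proof does not match; rejecting new WE")
    return new_we          # the caller writes this into the container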

Migrating the leases will require a similar protocol. This protocol will be
defined concretely at a later date.

=== Code Details ===

The current FileNode class will be renamed ImmutableFileNode, and a new
MutableFileNode class will be created. Instances of this class will contain a
URI and a reference to the client (for peer selection and connection). The
methods of MutableFileNode are:

* replace(newdata) -> OK, ConsistencyError, NotEnoughPeersError
* get() -> [deferred] newdata, NotEnoughPeersError
  * if there are multiple retrievable versions in the grid, get() returns
    the first version it can reconstruct, and silently ignores the others.
    In the future, a more advanced API will signal and provide access to
    the multiple heads.

The peer-selection and data-structure manipulation (and signing/verification)
steps will be implemented in a separate class in allmydata/mutable.py.
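
A minimal interface sketch of the proposed class is shown below; the method
bodies, and the use of Twisted Deferreds, are placeholders rather than the
actual implementation.

class MutableFileNode:
    """Sketch of the proposed class; the real peer-selection and
    publish/retrieve logic would live in allmydata/mutable.py."""

    def __init__(self, uri, client):
        self.uri = uri          # read-write or read-only cap for the slot
        self.client = client    # used for peer selection and connections

    def replace(self, newdata):
        # Would return a Deferred that fires with OK, or errbacks with
        # ConsistencyError or NotEnoughPeersError.
        raise NotImplementedError

    def get(self):
        # Would return a Deferred firing with the first reconstructable
        # version; other simultaneous versions are silently ignored for now.
        raise NotImplementedError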

=== SDMF Slot Format ===

This SDMF data lives inside a server-side MutableSlot container. The server
is generally oblivious to this format, but it may look inside the container
when verification is desired.

This data is tightly packed. There are no gaps left between the different
fields, and the offset table is mainly present to allow future flexibility of
key sizes.

 #    offset   size    name
 1    0        1       version byte, \x01 for this format
 2    1        8       sequence number. 2^64-1 must be handled specially, TBD
 3    9        32      "R" (root of share hash Merkle tree)
 4    41       32      data salt (readkey is H(readcap+data_salt))
 5    73       32      encrypted salt (AESenc(key=H(readcap), salt))
 6    105      18      encoding parameters:
       105      1        k
       106      1        N
       107      8        segment size
       115      8        data length (of original plaintext)
 7    123      36      offset table:
       127      4        (9) signature
       131      4        (10) share hash chain
       135      4        (11) block hash tree
       139      4        (12) share data
       143      8        (13) EOF
 8    151      256     verification key (2048-bit DSA key)
 9    407      40      signature=DSAsig(H([1,2,3,4,5,6]))
 10   447      (a)     share hash chain, encoded as:
                         "".join([pack(">H32s", shnum, hash)
                                  for (shnum,hash) in needed_hashes])
 11   ??       (b)     block hash tree, encoded as:
                         "".join([pack(">32s",hash) for hash in block_hash_tree])
 12   ??       LEN     share data
 13   ??       --      EOF

(a) The share hash chain contains ceil(log2(N)) hashes, each 32 bytes long.
    This is the set of hashes necessary to validate this share's leaf in the
    share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes.
(b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes
    long. This is the set of hashes necessary to validate any given block of
    share data up to the per-share root "r". Each "r" is a leaf of the share
    hash tree (with root "R"), from which a minimal subset of hashes is put
    in the share hash chain in (10).
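
The fixed-size prefix of a share (fields 1 through 6, offsets 0 through 122)
can be unpacked as shown below. This is a sketch that assumes the layout
above and stops before the offset table and the variable-length fields.

import struct

PREFIX_FMT = ">B Q 32s 32s 32s B B Q Q"   # fields 1..6: 123 bytes total

def parse_sdmf_prefix(share):
    (version, seqnum, root_hash_R, data_salt, encrypted_salt,
     k, N, segment_size, data_length) = struct.unpack_from(PREFIX_FMT, share, 0)
    if version != 1:
        raise ValueError("unknown SDMF version byte: %d" % version)
    return dict(seqnum=seqnum, R=root_hash_R, data_salt=data_salt,
                encrypted_salt=encrypted_salt, k=k, N=N,
                segment_size=segment_size, data_length=data_length)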

=== Recovery ===

The first line of defense against damage caused by colliding writes is the
Prime Coordination Directive: "Don't Do That".

The second line of defense is to keep "S" (the number of competing versions)
lower than N/k. If this holds true, at least one competing version will have
k shares and thus be recoverable. Note that server unavailability counts
against us here: the old version stored on the unavailable server must be
included in the value of S.

The third line of defense is our use of testv_and_writev() (described above),
which increases the convergence of simultaneous writes: one of the writers
will be favored (the one with the highest "R"), and that version is more
likely to be accepted than the others. This defense is least effective in the
pathological situation where S simultaneous writers are active, the one with
the lowest "R" writes to N-k+1 of the shares and then dies, then the one with
the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the
one with the highest "R" writes to k-1 shares and dies. Any other sequencing
will allow the highest "R" to write to at least k shares and establish a new
revision.

The fourth line of defense is the fact that each client keeps writing until
at least one version has N shares. This uses additional servers, if
necessary, to make sure that either the client's version or some
newer/overriding version is highly available.

The fifth line of defense is the recovery algorithm, which seeks to make sure
that at least *one* version is highly available, even if that version is
somebody else's.

The write-shares-to-peers algorithm is as follows:

* permute peers according to storage index
* walk through peers, trying to assign one share per peer
* for each peer:
  * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test
    * this means that we will overwrite any old versions, and we will
      overwrite simultaneous writers of the same version if our R is higher.
      We will not overwrite writers using a higher seqnum.
  * record the version that each share winds up with. If the write was
    accepted, this is our own version. If it was rejected, read the
    old_test_data to find out what version was retained.
    * if old_test_data indicates the seqnum was equal or greater than our
      own, mark the "Simultaneous Writes Detected" flag, which will
      eventually result in an error being reported to the writer (in their
      close() call).
* build a histogram of "R" values
* repeat until the histogram indicates that some version (possibly ours)
  has N shares. Use new servers if necessary.
* If we run out of servers:
  * if there are at least shares-of-happiness of any one version, we're
    happy, so return. (the close() might still get an error)
  * not happy, need to reinforce something, goto RECOVERY

RECOVERY:
* read all shares, count the versions, identify the recoverable ones,
  discard the unrecoverable ones.
* sort versions: locate max(seqnums), put all versions with that seqnum
  in the list, sort by number of outstanding shares. Then put our own
  version. (TODO: put versions with seqnum <max but >us ahead of us?).
* for each version:
  * attempt to recover that version
  * if not possible, remove it from the list, go to next one
  * if recovered, start at beginning of peer list, push that version,
    continue until N shares are placed
  * if pushing our own version, bump up the seqnum to one higher than
    the max seqnum we saw
  * if we run out of servers:
    * schedule retry and exponential backoff to repeat RECOVERY
    * admit defeat after some period? presumably the client will be shut
      down eventually, maybe keep trying (once per hour?) until then.
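
The bookkeeping in both loops reduces to a histogram over (seqnum, R)
versions. A rough sketch of just that part (peer permutation and the actual
share pushes are omitted, and "shares_of_happiness" is the usual Tahoe
threshold):

from collections import Counter

def tally(responses):
    # responses maps peerid -> (seqnum, R) for the share that peer now holds.
    return Counter(responses.values())

def placement_outcome(responses, N, shares_of_happiness):
    histogram = tally(responses)
    if histogram:
        version, count = histogram.most_common(1)[0]
        if count >= N:
            return "done", version
        if count >= shares_of_happiness:
            return "happy", version     # close() may still report an error
    return "recovery", None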


== Medium Distributed Mutable Files ==

These are just like the SDMF case, but:

* we actually take advantage of the Merkle hash tree over the blocks, by
  reading a single segment of data at a time (and its necessary hashes), to
  reduce the read-time alacrity
* we allow arbitrary writes to the file (i.e. seek() is provided, and
  O_TRUNC is no longer required)
* we write more code on the client side (in the MutableFileNode class), to
  first read each segment that a write must modify. This looks exactly like
  the way a normal filesystem uses a block device, or how a CPU must perform
  a cache-line fill before modifying a single word.
* we might implement some sort of copy-based atomic update server call,
  to allow multiple writev() calls to appear atomic to any readers.

MDMF slots provide fairly efficient in-place edits of very large files (a few
GB). Appending data is also fairly efficient, although each time a power of 2
boundary is crossed, the entire file must effectively be re-uploaded (because
the size of the block hash tree changes), so if the filesize is known in
advance, that space ought to be pre-allocated (by leaving extra space between
the block hash tree and the actual data).

MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2
adds cache-line reads to allow random-access writes.

== Large Distributed Mutable Files ==

LDMF slots use a fundamentally different way to store the file, inspired by
Mercurial's "revlog" format. They enable very efficient insert/remove/replace
editing of arbitrary spans. Multiple versions of the file can be retained, in
a revision graph that can have multiple heads. Each revision can be
referenced by a cryptographic identifier. There are two forms of the URI, one
that means "most recent version", and a longer one that points to a specific
revision.

Metadata can be attached to the revisions, like timestamps, to enable rolling
back an entire tree to a specific point in history.

LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2
provides explicit support for revision identifiers and branching.

== TODO ==

improve allocate-and-write or get-writer-buckets API to allow one-call (or
maybe two-call) updates. The challenge is in figuring out which shares are on
which machines. First cut will have lots of round trips.

(eventually) define behavior when seqnum wraps. At the very least make sure
it can't cause a security problem. "the slot is worn out" is acceptable.

(eventually) define share-migration lease update protocol. Including the
nodeid who accepted the lease is useful: we can use the same protocol as we
do for updating the write enabler. However we need to know which lease to
update; maybe send back a list of all old nodeids that we find, then try all
of them when we accept the update?

We now do this in a specially-formatted IndexError exception:
  "UNABLE to renew non-existent lease. I have leases accepted by " +
  "nodeids: '12345','abcde','44221' ."

Every node in a given Tahoe grid must have the same common DSA modulus and
generator, but different grids could use different parameters. We haven't
figured out how to define a "grid id" yet, but I think the DSA parameters
should be part of that identifier. In practical terms, this might mean that
the Introducer tells each node what parameters to use, or perhaps the node
could have a config file which specifies them instead.
---|