Ticket #217: mutable.txt

File mutable.txt, 35.8 KB (added by warner, at 2007-12-31T21:04:43Z)

new DSA-based mutable-file protocol

(protocol proposal, work-in-progress, not authoritative)

= Mutable Files =

Mutable File Slots are places with a stable identifier that can hold data
that changes over time. In contrast to CHK slots, for which the
URI/identifier is derived from the contents themselves, the Mutable File Slot
URI remains fixed for the life of the slot, regardless of what data is placed
inside it.

Each mutable slot is referenced by two different URIs. The "read-write" URI
grants read-write access to its holder, allowing them to put whatever
contents they like into the slot. The "read-only" URI is less powerful, only
granting read access, and not enabling modification of the data. The
read-write URI can be turned into the read-only URI, but not the other way
around.

The data in these slots is distributed over a number of servers, using the
same erasure coding that CHK files use, with 3-of-10 being a typical choice
of encoding parameters. The data is encrypted and signed in such a way that
only the holders of the read-write URI will be able to set the contents of
the slot, and only the holders of the read-only URI will be able to read
those contents. Holders of either URI will be able to validate the contents
as being written by someone with the read-write URI. The servers who hold the
shares cannot read or modify them: the worst they can do is deny service (by
deleting or corrupting the shares), or attempt a rollback attack (which can
only succeed with the cooperation of at least k servers).

== Consistency vs Availability ==

There is an age-old battle between consistency and availability. Epic papers
have been written, elaborate proofs have been established, and generations of
theorists have learned that you cannot simultaneously achieve guaranteed
consistency with guaranteed availability. In addition, the closer you push
either axis toward a guarantee, the higher the cost and complexity of the
design.

Tahoe's design goals largely favor design simplicity, then slightly favor
read availability, over the other criteria.

As we develop more sophisticated mutable slots, the API may expose multiple
read versions to the application layer. The Tahoe philosophy is to defer most
consistency recovery logic to the higher layers. Some applications have
effective ways to merge multiple versions, so inconsistency is not
necessarily a problem (e.g. directory nodes can usually merge multiple "add
child" operations).

== The Prime Coordination Directive: "Don't Do That" ==

The current rule for applications which run on top of Tahoe is "do not
perform simultaneous uncoordinated writes". That means you need some
mechanism outside of Tahoe to make sure that two parties are not trying to
modify the same mutable slot at the same time. For example:

 * don't give the read-write URI to anyone else. Dirnodes in a private
   directory generally satisfy this case, as long as you don't use two
   clients on the same account at the same time
 * if you give a read-write URI to someone else, stop using it yourself. An
   inbox would be a good example of this.
 * if you give a read-write URI to someone else, call them on the phone
   before you write into it
 * build an automated mechanism to have your agents coordinate writes.
   For example, we expect a future release to include a FURL for a
   "coordination server" in the dirnodes. The rule can be that you must
   contact the coordination server and obtain a lock/lease on the file
   before you're allowed to modify it.

If you do not follow this rule, Bad Things will happen. The worst-case Bad
Thing is that the entire file will be lost. A less-bad Bad Thing is that one
or more of the simultaneous writers will lose their changes. An observer of
the file may not see monotonically-increasing changes to the file, e.g. they
may see version 1, then version 2, then 3, then 2 again.

Tahoe takes some amount of care to reduce the badness of these Bad Things.
One way you can help nudge it from the "lose your file" case into the "lose
some changes" case is to reduce the number of competing versions: multiple
versions of the file that different parties are trying to establish as the
one true current contents. Each simultaneous writer counts as a "competing
version", as does the previous version of the file. If the count "S" of these
competing versions is larger than N/k, then the file runs the risk of being
lost completely. (With the typical 3-of-10 encoding, N/k is about 3.3, so
four or more competing versions put the file at risk.) If at least one of
the writers remains running after the collision is detected, it will attempt
to recover, but if S>(N/k) and all writers crash after writing a few shares,
the file will be lost.


== Small Distributed Mutable Files ==

SDMF slots are suitable for small (<1MB) files that are edited by rewriting
the entire file. The three operations are:

 * allocate (with initial contents)
 * set (with new contents)
 * get (old contents)

The first use of SDMF slots will be to hold directories (dirnodes), which map
encrypted child names to rw-URI/ro-URI pairs.

=== SDMF slots overview ===

Each SDMF slot is created with a DSA public/private key pair, using a
system-wide common modulus and generator, in which the private key is a
random 256-bit number, and the public key is a larger value (about 2048 bits)
that can be derived with a bit of math from the private key. The public key
is known as the "verification key", while the private key is called the
"signature key".

The 256-bit signature key is used verbatim as the "write capability". This
can be converted into the 2048ish-bit verification key through a fairly cheap
set of modular exponentiation operations; this is done any time the holder of
the write-cap wants to read the data. (Note that the signature key can either
be a newly-generated random value, or the hash of something else, if we found
a need for a capability that's stronger than the write-cap).

This results in a write-cap which is 256 bits long and can thus be expressed
in an ASCII/transport-safe encoded form (base62 encoding, fits in 72
characters, including a local-node http: convenience prefix).

The private key is hashed to form a 256-bit "salt". The public key is also
hashed to form a 256-bit "pubkey hash". These two values are concatenated,
hashed, and truncated to 192 bits to form the first 192 bits of the read-cap.
The pubkey hash is hashed by itself and truncated to 64 bits to form the last
64 bits of the read-cap. The full read-cap is 256 bits long, just like the
write-cap.

The first 192 bits of the read-cap are hashed and truncated to form the first
64 bits of the storage index. The last 64 bits of the read-cap are hashed to
form the last 64 bits of the storage index. This gives us a 128-bit storage
index.

The verification-cap is the first 64 bits of the storage index plus the
pubkey hash, 320 bits total. The verification-cap doesn't need to be
expressed in a printable transport-safe form, so it's ok that it's longer.
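
To make the derivation chain above concrete, here is a minimal sketch in
Python. It assumes SHA-256 as the hash function and ignores the tagged-hash
discipline and the DSA math a real implementation would need; the pubkey
bytes are simply passed in rather than being derived by modular
exponentiation.

  import hashlib

  def H(*parts):
      # untagged SHA-256 stand-in for the protocol's hash function
      return hashlib.sha256(b"".join(parts)).digest()

  def derive_caps(privkey, pubkey):
      write_cap = privkey                     # 256-bit signature key, verbatim
      salt = H(privkey)                       # 256-bit salt
      pubkey_hash = H(pubkey)                 # 256-bit pubkey hash
      read_cap = H(salt, pubkey_hash)[:24] + H(pubkey_hash)[:8]    # 192+64 bits
      storage_index = H(read_cap[:24])[:8] + H(read_cap[24:])[:8]  # 128 bits
      verify_cap = storage_index[:8] + pubkey_hash                 # 320 bits
      return write_cap, read_cap, storage_index, verify_cap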

The read-cap is hashed one way to form an AES encryption key that is used to
encrypt the salt; this key is called the "salt key". The encrypted salt is
stored in the share. The private key never changes, therefore the salt never
changes, and the salt key is only used for a single purpose, so there is no
need for an IV.

The read-cap is hashed a different way to form the master data encryption
key. A random "data salt" is generated each time the share's contents are
replaced, and the master data encryption key is concatenated with the data
salt, then hashed, to form the AES CTR-mode "read key" that will be used to
encrypt the actual file data. This is to avoid key-reuse. An outstanding
issue is how to avoid key reuse when files are modified in place instead of
being replaced completely; this is not done in SDMF but might occur in MDMF.
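
A sketch of this per-update key derivation, under the same assumptions as
the sketch above; the tag strings are illustrative only, standing in for the
"hashed one way"/"hashed a different way" distinction, and the AES-CTR
encryption itself is not shown.

  import hashlib, os

  def H(*parts):
      return hashlib.sha256(b"".join(parts)).digest()

  def derive_data_keys(read_cap):
      salt_key   = H(b"salt key", read_cap)   # encrypts the (static) salt
      master_key = H(b"data key", read_cap)   # never applied to file data directly
      data_salt  = os.urandom(32)             # fresh for every replacement
      read_key   = H(master_key, data_salt)   # the actual AES CTR-mode key
      return salt_key, data_salt, read_key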

The private key is hashed one way to form the salt, and a different way to
form the "write enabler master". For each storage server on which a share is
kept, the write enabler master is concatenated with the server's nodeid and
hashed, and the result is called the "write enabler" for that particular
server. Note that multiple shares of the same slot stored on the same server
will all get the same write enabler, i.e. the write enabler is associated
with the "bucket", rather than the individual shares.
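
A sketch of the write-enabler derivation, with the same illustrative tagged
hashing as above:

  import hashlib

  def H(*parts):
      return hashlib.sha256(b"".join(parts)).digest()

  def write_enabler_for(privkey, server_nodeid):
      # one write-enabler-master per slot, one write enabler per server
      write_enabler_master = H(b"write enabler master", privkey)
      return H(write_enabler_master, server_nodeid)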

The private key is hashed a third way to form the "data write key", which can
be used by applications which wish to store some data in a form that is only
available to those with a write-cap, and not to those with merely a read-cap.
This is used to implement transitive read-onlyness of dirnodes.

The public key is stored on the servers, as is the encrypted salt, the
(non-encrypted) data salt, the encrypted data, and a signature. The container
records the write-enabler, but of course this is not visible to readers. To
make sure that every byte of the share can be verified by a holder of the
verify-cap (and also by the storage server itself), the signature covers the
version number, the sequence number, the root hash "R" of the share Merkle
tree, the data salt, the encoding parameters, and the encrypted salt. "R"
itself covers the hash trees and the share data.

The read-write URI is just the private key. The read-only URI is the
read-cap. The verify-only URI contains the pubkey hash and the first 64 bits
of the storage index.

 FMW:b2a(privatekey)
 FMR:b2a(readcap)
 FMV:b2a(storageindex[:64])b2a(pubkey-hash)

Note that this allows the read-only and verify-only URIs to be derived from
the read-write URI without actually retrieving any data from the share, but
instead by regenerating the public key from the private one. Uses of the
read-only or verify-only caps must validate the public key against their
pubkey hash (or its derivative) the first time they retrieve the pubkey,
before trusting any signatures they see.
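
A sketch of how these strings might be assembled; b2a() here is only a
stand-in for the real transport-safe encoding (the text above suggests
base62), and base32 is used purely for illustration.

  import base64

  def b2a(data):
      # placeholder encoding, not the encoding the protocol would actually use
      return base64.b32encode(data).decode("ascii").rstrip("=").lower()

  def uris(write_cap, read_cap, storage_index, pubkey_hash):
      return {"read-write":  "FMW:" + b2a(write_cap),
              "read-only":   "FMR:" + b2a(read_cap),
              "verify-only": "FMV:" + b2a(storage_index[:8]) + b2a(pubkey_hash)}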

The SDMF slot is allocated by sending a request to the storage server with a
desired size, the storage index, and the write enabler for that server's
nodeid. If granted, the write enabler is stashed inside the slot's backing
store file. All further write requests must be accompanied by the write
enabler or they will not be honored. The storage server does not share the
write enabler with anyone else.

The SDMF slot structure will be described in more detail below. The important
pieces are:

  * a sequence number
  * a root hash "R"
  * the data salt
  * the encoding parameters (including k, N, file size, segment size)
  * a signed copy of [seqnum,R,data_salt,encoding_params] (using signature key)
  * the verification key (not encrypted)
  * the share hash chain (part of a Merkle tree over the share hashes)
  * the block hash tree (Merkle tree over blocks of share data)
  * the share data itself (erasure-coding of read-key-encrypted file data)
  * the salt, encrypted with the salt key

The access pattern for read (assuming we hold the write-cap) is:
 * generate public key from the private one
 * hash private key to get the salt, hash public key, form read-cap
 * form storage-index
 * use storage-index to locate 'k' shares with identical 'R' values
   * either get one share, read 'k' from it, then read k-1 shares
   * or read, say, 5 shares, discover k, either get more or be finished
   * or copy k into the URIs
 * .. jump to "COMMON READ", below

To read (assuming we only hold the read-cap), do:
 * hash read-cap pieces to generate storage index and salt key
 * use storage-index to locate 'k' shares with identical 'R' values
 * retrieve verification key and encrypted salt
 * decrypt salt
 * hash decrypted salt and pubkey to generate another copy of the read-cap,
   make sure they match (this validates the pubkey)
 * .. jump to "COMMON READ"

 * COMMON READ:
 * read seqnum, R, data salt, encoding parameters, signature
 * verify signature against verification key
 * hash data salt and read-cap to generate read-key
 * read share data, compute block-hash Merkle tree and root "r"
 * read share hash chain (leading from "r" to "R")
 * validate share hash chain up to the root "R"
 * submit share data to erasure decoding
 * decrypt decoded data with read-key
 * submit plaintext to application
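
The pubkey-validation step in the read-cap path above deserves emphasis: the
reader recomputes the read-cap from the decrypted salt and the retrieved
public key, and refuses to trust any signature unless it matches. A minimal
sketch, with the same illustrative H() as the earlier sketches:

  import hashlib

  def H(*parts):
      return hashlib.sha256(b"".join(parts)).digest()

  def validate_pubkey(read_cap, decrypted_salt, retrieved_pubkey):
      # recompute the read-cap exactly as the writer derived it
      pubkey_hash = H(retrieved_pubkey)
      recomputed = H(decrypted_salt, pubkey_hash)[:24] + H(pubkey_hash)[:8]
      if recomputed != read_cap:
          raise ValueError("pubkey does not match read-cap; reject all signatures")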

The access pattern for write is:
 * generate pubkey, salt, read-cap, storage-index as in read case
 * generate data salt for this update, generate read-key
 * encrypt plaintext from application with read-key
   * application can encrypt some data with the data-write-key to make it
     only available to writers (used for transitively-readonly dirnodes)
 * erasure-code crypttext to form shares
 * split shares into blocks
 * compute Merkle tree of blocks, giving root "r" for each share
 * compute Merkle tree of shares, find root "R" for the file as a whole
 * create share data structures, one per server:
   * use seqnum which is one higher than the old version
   * share hash chain has log(N) hashes, different for each server
   * signed data is the same for each server
   * include pubkey, encrypted salt, data salt
 * now we have N shares and need homes for them
 * walk through peers
   * if share is not already present, allocate-and-set
   * otherwise, try to modify existing share:
   * send testv_and_writev operation to each one
   * testv says to accept share if their(seqnum+R) <= our(seqnum+R)
   * count how many servers wind up with which versions (histogram over R)
   * keep going until N servers have the same version, or we run out of servers
     * if any servers wound up with a different version, report error to
       application
     * if we ran out of servers, initiate recovery process (described below)
==== Cryptographic Properties ====

This scheme protects the data's confidentiality with 192 bits of key
material, since the read-cap contains 192 secret bits (derived from an
encrypted salt, which is encrypted using those same 192 bits plus some
additional public material).

The integrity of the data (assuming that the signature is valid) is protected
by the 256-bit hash which gets included in the signature. The privilege of
modifying the data (equivalent to the ability to form a valid signature) is
protected by a 256-bit random DSA private key, and the difficulty of
computing a discrete logarithm in a 2048-bit field.

There are a few weaker denial-of-service attacks possible. If N-k+1 of the
shares are damaged or unavailable, the client will be unable to recover the
file. Any coalition of more than N-k shareholders will be able to effect this
attack by merely refusing to provide the desired share. The "write enabler"
shared secret protects existing shares from being displaced by new ones,
except by the holder of the write-cap. One server cannot affect the other
shares of the same file, once those other shares are in place.

The worst DoS attack is the "roadblock attack", which must be made before
those shares get placed. Storage indexes are effectively random (being
derived from the hash of a random value), so they are not guessable before
the writer begins their upload, but there is a window of vulnerability during
the beginning of the upload, when some servers have heard about the storage
index but not all of them.

The roadblock attack we want to prevent is when the first server that the
uploader contacts quickly runs to all the other selected servers and places a
bogus share under the same storage index, before the uploader can contact
them. These shares will normally be accepted, since storage servers create
new shares on demand. The bogus shares would have randomly-generated
write-enablers, which will of course be different from the real uploader's
write-enabler, since the malicious server does not know the write-cap.

If this attack were successful, the uploader would be unable to place any of
their shares, because the slots have already been filled by the bogus shares.
The uploader would probably try for peers further and further away from the
desired location, but eventually they will hit a preconfigured distance limit
and give up. In addition, the further the writer searches, the less likely it
is that a reader will search as far. So a successful attack will either cause
the file to be uploaded but not be reachable, or it will cause the upload to
fail.

If the uploader tries again (creating a new privkey), they may get lucky and
the malicious servers will appear later in the query list, giving sufficient
honest servers a chance to see their share before the malicious one manages
to place bogus ones.

The first line of defense against this attack is timing: the attacking
server must be ready to act the moment a storage request arrives (which will
only occur for a certain percentage of all new-file uploads), and only has a
few seconds to act before the other servers will have allocated the shares
(and recorded the write-enabler, terminating the window of vulnerability).

The second line of defense is post-verification, and is possible because the
storage index is partially derived from the public key hash. A storage server
can, at any time, verify every public bit of the container as being signed by
the verification key (this operation is recommended as a continual background
process, when disk usage is minimal, to detect disk errors). The server can
also hash the verification key to derive 64 bits of the storage index. If it
detects that these 64 bits do not match (but the rest of the share validates
correctly), then the implication is that this share was stored to the wrong
storage index, either due to a bug or a roadblock attack.

If an uploader finds that they are unable to place their shares because of
"bad write enabler errors" (as reported by the prospective storage servers),
it can "cry foul", and ask the storage server to perform this verification on
the share in question. If the pubkey and storage index do not match, the
storage server can delete the bogus share, thus allowing the real uploader to
place their share. Of course the origin of the offending bogus share should
be logged and reported to a central authority, so corrective measures can be
taken. It may be necessary to have this "cry foul" protocol include the new
write-enabler, to close the window during which the malicious server can
re-submit the bogus share during the adjudication process.

If the problem persists, the servers can be placed into pre-verification
mode, in which this verification is performed on all potential shares before
being committed to disk. This mode is more CPU-intensive (since normally the
storage server ignores the contents of the container altogether), but would
solve the problem completely.

The mere existence of these potential defenses should be sufficient to deter
any actual attacks. Note that the storage index only has 64 bits of
pubkey-derived data in it, which is below the usual crypto guidelines for
security factors. In this case it's a pre-image attack which would be needed,
rather than a collision, and the actual attack would be to find a keypair for
which the public key can be hashed three times to produce the desired portion
of the storage index. We believe that 64 bits of material is sufficiently
resistant to this form of pre-image attack to serve as a suitable deterrent.


=== Server Storage Protocol ===

The storage servers will provide a mutable slot container which is oblivious
to the details of the data being contained inside it. Each storage index
refers to a "bucket", and each bucket has one or more shares inside it. (In a
well-provisioned network, each bucket will have only one share). The bucket
is stored as a directory, using the base32-encoded storage index as the
directory name. Each share is stored in a single file, using the share number
as the filename.

The container holds space for a container magic number (for versioning), the
write enabler, the nodeid which accepted the write enabler (used for share
migration, described below), a small number of lease structures, the embedded
data itself, and expansion space for additional lease structures.

 #   offset    size    name
 1   0         32      magic verstr "tahoe mutable container v1" plus binary
 2   32        20      write enabler's nodeid
 3   52        32      write enabler
 4   84        8       data size (actual share data present) (a)
 5   92        8       offset of (8) count of extra leases (after data)
 6   100       368     four leases, 92 bytes each
                        0    4   ownerid (0 means "no lease here")
                        4    4   expiration timestamp
                        8   32   renewal token
                        40  32   cancel token
                        72  20   nodeid which accepted the tokens
 7   468       (a)     data
 8   ??        4       count of extra leases
 9   ??        n*92    extra leases

The "extra leases" field must be copied and rewritten each time the size of
the enclosed data changes. The hope is that most buckets will have four or
fewer leases and this extra copying will not usually be necessary.

The (4) "data size" field contains the actual number of bytes of data present
in field (7), such that a client request to read beyond 468+(a) will result
in an error. This allows the client to (one day) read relative to the end of
the file. The container size (that is, (8)-(7)) might be larger, especially
if extra size was pre-allocated in anticipation of filling the container with
a lot of data.

The offset in (5) points at the *count* of extra leases, at (8). The actual
leases (at (9)) begin 4 bytes later. If the container size changes, both (8)
and (9) must be relocated by copying.
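
A sketch of this fixed-offset header in Python's struct notation, matching
the table above; the assumption of big-endian 4- and 8-byte integers is
mine, since the text does not state an endianness.

  import struct

  # magic, write-enabler nodeid, write enabler, data size, extra-lease offset
  HEADER = struct.Struct(">32s20s32sQQ")            # 100 bytes
  # ownerid, expiration, renew token, cancel token, accepting nodeid
  LEASE = struct.Struct(">LL32s32s20s")             # 92 bytes
  DATA_OFFSET = HEADER.size + 4 * LEASE.size        # 100 + 4*92 = 468

  def parse_header(container):
      magic, we_nodeid, write_enabler, data_size, extra_lease_offset = \
          HEADER.unpack_from(container, 0)
      return magic, we_nodeid, write_enabler, data_size, extra_lease_offset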

The server will honor any write commands that provide the write enabler and
do not exceed the server-wide storage size limitations. Read and write
commands MUST be restricted to the 'data' portion of the container: the
implementation of those commands MUST perform correct bounds-checking to
make sure other portions of the container are inaccessible to the clients.

The two methods provided by the storage server on these "MutableSlot" share
objects are:

 * readv(ListOf(offset=int, length=int))
   * returns a list of bytestrings, of the various requested lengths
   * offset < 0 is interpreted relative to the end of the data
   * spans which hit the end of the data will return truncated data

 * testv_and_writev(write_enabler, test_vector, write_vector)
   * this is a test-and-set operation which performs the given tests and only
     applies the desired writes if all tests succeed. This is used to detect
     simultaneous writers, and to reduce the chance that an update will lose
     data recently written by some other party (written after the last time
     this slot was read).
   * test_vector=ListOf(TupleOf(offset, length, opcode, specimen))
   * the opcode is a string, from the set [gt, ge, eq, le, lt, ne]
   * each element of the test vector is read from the slot's data and
     compared against the specimen using the desired (in)equality. If all
     tests evaluate True, the write is performed
   * write_vector=ListOf(TupleOf(offset, newdata))
     * offset < 0 is not yet defined, it probably means relative to the
       end of the data, which probably means append, but we haven't nailed
       it down quite yet
     * write vectors are executed in order, which specifies the results of
       overlapping writes
   * return value:
     * error: OutOfSpace
     * error: something else (io error, out of memory, whatever)
     * (True, old_test_data): the write was accepted (test_vector passed)
     * (False, old_test_data): the write was rejected (test_vector failed)
       * both 'accepted' and 'rejected' return the old data that was used
         for the test_vector comparison. This can be used by the client
         to detect write collisions, including collisions for which the
         desired behavior was to overwrite the old version.
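
A sketch of how a server might evaluate one of these requests, ignoring the
write-enabler check, lease handling, and size limits; 'data' is a bytearray
holding the slot's current contents, and the bytes comparisons are
lexicographic, which matches the spirit of the "their(seqnum+R) <=
our(seqnum+R)" test used in the write access pattern above.

  import operator

  OPS = {"lt": operator.lt, "le": operator.le, "eq": operator.eq,
         "ne": operator.ne, "ge": operator.ge, "gt": operator.gt}

  def testv_and_writev(data, test_vector, write_vector):
      # gather the old data covered by each test, then evaluate the tests
      old = [bytes(data[o:o+l]) for (o, l, _op, _spec) in test_vector]
      accepted = all(OPS[op](got, specimen)
                     for got, (_o, _l, op, specimen) in zip(old, test_vector))
      if accepted:
          for (offset, newdata) in write_vector:   # applied in order
              data[offset:offset+len(newdata)] = newdata
      return (accepted, old)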

In addition, the storage server provides several methods to access these
share objects:

 * allocate_mutable_slot(storage_index, sharenums=SetOf(int))
   * returns DictOf(int, MutableSlot)
 * get_mutable_slot(storage_index)
   * returns DictOf(int, MutableSlot)
   * or raises KeyError

We intend to add an interface which allows small slots to allocate-and-write
in a single call, as well as do update or read in a single call. The goal is
to allow a reasonably-sized dirnode to be created (or updated, or read) in
just one round trip (to all N shareholders in parallel).

==== migrating shares ====

If a share must be migrated from one server to another, two values become
invalid: the write enabler (since it was computed for the old server), and
the lease renew/cancel tokens.

Suppose that a slot was first created on nodeA, and was thus initialized with
WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is
moved from nodeA to nodeB.

Readers may still be able to find the share in its new home, depending upon
how many servers are present in the grid, where the new nodeid lands in the
permuted index for this particular storage index, and how many servers the
reading client is willing to contact.

When a client attempts to write to this migrated share, it will get a "bad
write enabler" error, since the WE it computes for nodeB will not match the
WE(nodeA) that was embedded in the share. When this occurs, the "bad write
enabler" message must include the old nodeid (e.g. nodeA) that was in the
share.

The client then computes H(nodeB+H(WEM+nodeA)), which is the same as
H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which
is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to
anyone else. Also note that the client does not send a value to nodeB that
would allow the node to impersonate the client to a third node: everything
sent to nodeB will include something specific to nodeB in it.

The server locally computes H(nodeB+WE(nodeA)), using its own node id and the
old write enabler from the share. It compares this against the value supplied
by the client. If they match, this serves as proof that the client was able
to compute the old write enabler. The server then accepts the client's new
WE(nodeB) and writes it into the container.

This WE-fixup process requires an extra round trip, and requires the error
message to include the old nodeid, but does not require any public key
operations on either client or server.
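
Both halves of that exchange fit in a few lines; a sketch, again using an
untagged SHA-256 stand-in for H():

  import hashlib

  def H(*parts):
      return hashlib.sha256(b"".join(parts)).digest()

  # client side: prove knowledge of the old write enabler, bound to nodeB
  def client_fixup(write_enabler_master, old_nodeid, new_nodeid):
      old_we = H(write_enabler_master, old_nodeid)    # WE(nodeA)
      proof = H(new_nodeid, old_we)                   # H(nodeB + WE(nodeA))
      new_we = H(write_enabler_master, new_nodeid)    # WE(nodeB), sent only to nodeB
      return proof, new_we

  # server (nodeB) side: check the proof against the stored write enabler
  def server_accept_fixup(my_nodeid, stored_we, proof, new_we):
      if H(my_nodeid, stored_we) != proof:
          raise ValueError("client could not prove knowledge of old write enabler")
      return new_we    # replace the stored write enabler with this value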

Migrating the leases will require a similar protocol. This protocol will be
defined concretely at a later date.

=== Code Details ===

The current FileNode class will be renamed ImmutableFileNode, and a new
MutableFileNode class will be created. Instances of this class will contain a
URI and a reference to the client (for peer selection and connection). The
methods of MutableFileNode are:

 * replace(newdata) -> OK, ConsistencyError, NotEnoughPeersError
 * get() -> [deferred] newdata, NotEnoughPeersError
   * if there are multiple retrievable versions in the grid, get() returns
     the first version it can reconstruct, and silently ignores the others.
     In the future, a more advanced API will signal and provide access to
     the multiple heads.

The peer-selection and data-structure manipulation (and signing/verification)
steps will be implemented in a separate class in allmydata/mutable.py .

=== SDMF Slot Format ===

This SDMF data lives inside a server-side MutableSlot container. The server
is generally oblivious to this format, but it may look inside the container
when verification is desired.

This data is tightly packed. There are no gaps left between the different
fields, and the offset table is mainly present to allow future flexibility of
key sizes.

 #   offset   size    name
 1    0        1       version byte, \x01 for this format
 2    1        8       sequence number. 2^64-1 must be handled specially, TBD
 3    9        32      "R" (root of share hash Merkle tree)
 4    41       32      data salt (readkey is H(readcap+data_salt))
 5    73       32      encrypted salt (AESenc(key=H(readcap), salt))
 6    105      18      encoding parameters:
       105      1        k
       106      1        N
       107      8        segment size
       115      8        data length (of original plaintext)
 7    123      36      offset table:
       127      4        (9) signature
       131      4        (10) share hash chain
       135      4        (11) block hash tree
       139      4        (12) share data
       143      8        (13) EOF
 8    151      256     verification key (2048-bit DSA key)
 9    407      40      signature=DSAsig(H([1,2,3,4,5,6]))
10    447      (a)     share hash chain, encoded as:
                        "".join([pack(">H32s", shnum, hash)
                                 for (shnum,hash) in needed_hashes])
11    ??       (b)     block hash tree, encoded as:
                        "".join([pack(">32s",hash) for hash in block_hash_tree])
12    ??       LEN     share data
13    ??       --      EOF

(a) The share hash chain contains ceil(log2(N)) hashes, each 32 bytes long.
    This is the set of hashes necessary to validate this share's leaf in the
    share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes.
(b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes
    long. This is the set of hashes necessary to validate any given block of
    share data up to the per-share root "r". Each "r" is a leaf of the share
    hash tree (with root "R"), from which a minimal subset of hashes is put
    in the share hash chain in (10).
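
The fixed-size prefix (rows 1 through 6, the region covered by the
signature) can be packed with a single struct format string; a sketch,
assuming big-endian integers, which the table does not actually specify:

  import struct

  # version, seqnum, R, data salt, encrypted salt, k, N, segsize, datalen
  SIGNED_PREFIX = struct.Struct(">BQ32s32s32sBBQQ")   # rows 1-6

  def pack_signed_prefix(seqnum, root_hash_R, data_salt, encrypted_salt,
                         k, N, segment_size, data_length):
      return SIGNED_PREFIX.pack(1, seqnum, root_hash_R, data_salt,
                                encrypted_salt, k, N, segment_size,
                                data_length)

SIGNED_PREFIX.size comes out to 123, which is where the offset table (row 7)
begins.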

=== Recovery ===

The first line of defense against damage caused by colliding writes is the
Prime Coordination Directive: "Don't Do That".

The second line of defense is to keep "S" (the number of competing versions)
lower than N/k. If this holds true, at least one competing version will have
k shares and thus be recoverable. Note that server unavailability counts
against us here: the old version stored on the unavailable server must be
included in the value of S.

The third line of defense is our use of testv_and_writev() (described below),
which increases the convergence of simultaneous writes: one of the writers
will be favored (the one with the highest "R"), and that version is more
likely to be accepted than the others. This defense is least effective in the
pathological situation where S simultaneous writers are active, the one with
the lowest "R" writes to N-k+1 of the shares and then dies, then the one with
the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the
one with the highest "R" writes to k-1 shares and dies. Any other sequencing
will allow the highest "R" to write to at least k shares and establish a new
revision.

The fourth line of defense is the fact that each client keeps writing until
at least one version has N shares. This uses additional servers, if
necessary, to make sure that either the client's version or some
newer/overriding version is highly available.

The fifth line of defense is the recovery algorithm, which seeks to make sure
that at least *one* version is highly available, even if that version is
somebody else's.

The write-shares-to-peers algorithm is as follows:

 * permute peers according to storage index
 * walk through peers, trying to assign one share per peer
 * for each peer:
   * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test
     * this means that we will overwrite any old versions, and we will
       overwrite simultaneous writers of the same version if our R is higher.
       We will not overwrite writers using a higher seqnum.
   * record the version that each share winds up with. If the write was
     accepted, this is our own version. If it was rejected, read the
     old_test_data to find out what version was retained.
   * if old_test_data indicates the seqnum was equal or greater than our
     own, mark the "Simultaneous Writes Detected" flag, which will eventually
     result in an error being reported to the writer (in their close() call).
   * build a histogram of "R" values
   * repeat until the histogram indicates that some version (possibly ours)
     has N shares. Use new servers if necessary.
   * If we run out of servers:
     * if there are at least shares-of-happiness of any one version, we're
       happy, so return. (the close() might still get an error)
     * not happy, need to reinforce something, goto RECOVERY
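
A sketch of the version-histogram bookkeeping in that loop; peer
communication is abstracted into a caller-supplied send() function (not a
real API), versions are (seqnum, R) tuples compared lexicographically, and
the "use new servers" retries and the RECOVERY path below are omitted.

  from collections import Counter

  def place_shares(peers, shares, our_version, send):
      # send(peer, share, our_version) -> (accepted, version_now_held_there)
      histogram = Counter()
      collision = False
      for peer, share in zip(peers, shares):
          accepted, their_version = send(peer, share, our_version)
          winner = our_version if accepted else their_version
          histogram[winner] += 1
          if not accepted and their_version[0] >= our_version[0]:
              collision = True   # a simultaneous (or newer) writer was detected
      return histogram, collision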

RECOVERY:
 * read all shares, count the versions, identify the recoverable ones,
   discard the unrecoverable ones.
 * sort versions: locate max(seqnums), put all versions with that seqnum
   in the list, sort by number of outstanding shares. Then put our own
   version. (TODO: put versions with seqnum <max but >us ahead of us?).
 * for each version:
   * attempt to recover that version
   * if not possible, remove it from the list, go to next one
   * if recovered, start at beginning of peer list, push that version,
     continue until N shares are placed
   * if pushing our own version, bump up the seqnum to one higher than
     the max seqnum we saw
   * if we run out of servers:
     * schedule retry and exponential backoff to repeat RECOVERY
   * admit defeat after some period? presumably the client will be shut down
     eventually, maybe keep trying (once per hour?) until then.


== Medium Distributed Mutable Files ==

These are just like the SDMF case, but:

 * we actually take advantage of the Merkle hash tree over the blocks, by
   reading a single segment of data at a time (and its necessary hashes), to
   reduce the read-time alacrity
 * we allow arbitrary writes to the file (i.e. seek() is provided, and
   O_TRUNC is no longer required)
 * we write more code on the client side (in the MutableFileNode class), to
   first read each segment that a write must modify. This looks exactly like
   the way a normal filesystem uses a block device, or how a CPU must perform
   a cache-line fill before modifying a single word.
 * we might implement some sort of copy-based atomic update server call,
   to allow multiple writev() calls to appear atomic to any readers.

MDMF slots provide fairly efficient in-place edits of very large files (a few
GB). Appending data is also fairly efficient, although each time a power of 2
boundary is crossed, the entire file must effectively be re-uploaded (because
the size of the block hash tree changes), so if the filesize is known in
advance, that space ought to be pre-allocated (by leaving extra space between
the block hash tree and the actual data).

MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2
adds cache-line reads to allow random-access writes.

== Large Distributed Mutable Files ==

LDMF slots use a fundamentally different way to store the file, inspired by
Mercurial's "revlog" format. They enable very efficient insert/remove/replace
editing of arbitrary spans. Multiple versions of the file can be retained, in
a revision graph that can have multiple heads. Each revision can be
referenced by a cryptographic identifier. There are two forms of the URI, one
that means "most recent version", and a longer one that points to a specific
revision.

Metadata can be attached to the revisions, like timestamps, to enable rolling
back an entire tree to a specific point in history.

LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2
provides explicit support for revision identifiers and branching.

== TODO ==

improve allocate-and-write or get-writer-buckets API to allow one-call (or
maybe two-call) updates. The challenge is in figuring out which shares are on
which machines. First cut will have lots of round trips.

(eventually) define behavior when seqnum wraps. At the very least make sure
it can't cause a security problem. "the slot is worn out" is acceptable.

(eventually) define share-migration lease update protocol. Including the
nodeid who accepted the lease is useful; we can use the same protocol as we
do for updating the write enabler. However we need to know which lease to
update... maybe send back a list of all old nodeids that we find, then try
all of them when we accept the update?

 We now do this in a specially-formatted IndexError exception:
  "UNABLE to renew non-existent lease. I have leases accepted by " +
  "nodeids: '12345','abcde','44221' ."

Every node in a given Tahoe grid must have the same common DSA modulus and
generator, but different grids could use different parameters. We haven't
figured out how to define a "grid id" yet, but I think the DSA parameters
should be part of that identifier. In practical terms, this might mean that
the Introducer tells each node what parameters to use, or perhaps the node
could have a config file which specifies them instead.