(protocol proposal, work-in-progress, not authoritative)

= Mutable Files =

Mutable File Slots are places with a stable identifier that can hold data
that changes over time. In contrast to CHK slots, for which the
URI/identifier is derived from the contents themselves, the Mutable File Slot
URI remains fixed for the life of the slot, regardless of what data is placed
inside it.

Each mutable slot is referenced by two different URIs. The "read-write" URI
grants read-write access to its holder, allowing them to put whatever
contents they like into the slot. The "read-only" URI is less powerful, only
granting read access, and not enabling modification of the data. The
read-write URI can be turned into the read-only URI, but not the other way
around.

The data in these slots is distributed over a number of servers, using the
same erasure coding that CHK files use, with 3-of-10 being a typical choice
of encoding parameters. The data is encrypted and signed in such a way that
only the holders of the read-write URI will be able to set the contents of
the slot, and only the holders of the read-only URI will be able to read
those contents. Holders of either URI will be able to validate the contents
as being written by someone with the read-write URI. The servers who hold the
shares cannot read or modify them: the worst they can do is deny service (by
deleting or corrupting the shares), or attempt a rollback attack (which can
only succeed with the cooperation of at least k servers).

== Consistency vs Availability ==

There is an age-old battle between consistency and availability. Epic papers
have been written, elaborate proofs have been established, and generations of
theorists have learned that you cannot simultaneously achieve guaranteed
consistency with guaranteed availability. In addition, the closer you push
towards a guarantee on either axis, the more the cost and complexity of the
design goes up.

Tahoe's design goals are to favor design simplicity first, and read
availability second, over the other criteria.

As we develop more sophisticated mutable slots, the API may expose multiple
read versions to the application layer. The Tahoe philosophy is to defer most
consistency recovery logic to the higher layers. Some applications have
effective ways to merge multiple versions, so inconsistency is not
necessarily a problem (e.g. directory nodes can usually merge multiple "add
child" operations).

== The Prime Coordination Directive: "Don't Do That" ==

The current rule for applications which run on top of Tahoe is "do not
perform simultaneous uncoordinated writes". That means you need some
mechanism outside of Tahoe to make sure that two parties are not trying to
modify the same mutable slot at the same time. For example:

* don't give the read-write URI to anyone else. Dirnodes in a private
  directory generally satisfy this case, as long as you don't use two
  clients on the same account at the same time
* if you give a read-write URI to someone else, stop using it yourself. An
  inbox would be a good example of this.
* if you give a read-write URI to someone else, call them on the phone
  before you write into it
* build an automated mechanism to have your agents coordinate writes.
  For example, we expect a future release to include a FURL for a
  "coordination server" in the dirnodes. The rule can be that you must
  contact the coordination server and obtain a lock/lease on the file
  before you're allowed to modify it.

If you do not follow this rule, Bad Things will happen. The worst-case Bad
Thing is that the entire file will be lost. A less-bad Bad Thing is that one
or more of the simultaneous writers will lose their changes. An observer of
the file may not see monotonically-increasing changes to the file, i.e. they
may see version 1, then version 2, then 3, then 2 again.

Tahoe takes some amount of care to reduce the badness of these Bad Things.
One way you can help nudge it from the "lose your file" case into the "lose
some changes" case is to reduce the number of competing versions: multiple
versions of the file that different parties are trying to establish as the
one true current contents. Each simultaneous writer counts as a "competing
version", as does the previous version of the file. If the count "S" of these
competing versions is larger than N/k, then the file runs the risk of being
lost completely. If at least one of the writers remains running after the
collision is detected, it will attempt to recover, but if S>(N/k) and all
writers crash after writing a few shares, the file will be lost.


== Small Distributed Mutable Files ==

SDMF slots are suitable for small (<1MB) files that are edited by rewriting
the entire file. The three operations are:

* allocate (with initial contents)
* set (with new contents)
* get (old contents)

The first use of SDMF slots will be to hold directories (dirnodes), which map
encrypted child names to rw-URI/ro-URI pairs.

=== SDMF slots overview ===

Each SDMF slot is created with a DSA public/private key pair, using a
system-wide common modulus and generator, in which the private key is a
random 256-bit number, and the public key is a larger value (about 2048 bits)
that can be derived with a bit of math from the private key. The public key
is known as the "verification key", while the private key is called the
"signature key".

The 256-bit signature key is used verbatim as the "write capability". This
can be converted into the 2048ish-bit verification key through a fairly cheap
set of modular exponentiation operations; this is done any time the holder of
the write-cap wants to read the data. (Note that the signature key can either
be a newly-generated random value, or the hash of something else, if we found
a need for a capability that's stronger than the write-cap).

This results in a write-cap which is 256 bits long and can thus be expressed
in an ASCII/transport-safe encoded form (base62 encoding, fits in 72
characters, including a local-node http: convenience prefix).

The private key is hashed to form a 256-bit "salt". The public key is also
hashed to form a 256-bit "pubkey hash". These two values are concatenated,
hashed, and truncated to 192 bits to form the first 192 bits of the read-cap.
The pubkey hash is hashed by itself and truncated to 64 bits to form the last
64 bits of the read-cap. The full read-cap is 256 bits long, just like the
write-cap.

The first 192 bits of the read-cap are hashed and truncated to form the first
64 bits of the storage index. The last 64 bits of the read-cap are hashed to
form the last 64 bits of the storage index. This gives us a 128-bit storage
index.
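
The following sketch walks through this derivation chain in Python. It is
illustrative only: the tag strings and the use of plain SHA-256 are
assumptions made to keep the example self-contained, not part of the
proposal.

import hashlib

def H(data, truncate_to=None):
    # Illustrative hash; the real design would presumably use tagged hashes.
    d = hashlib.sha256(data).digest()
    return d[:truncate_to] if truncate_to else d

def derive_caps(privkey, pubkey):
    # privkey is the 32-byte write-cap, pubkey is the DSA verification key.
    salt = H(b"salt:" + privkey)                   # 256-bit salt
    pubkey_hash = H(b"pubkey:" + pubkey)           # 256-bit pubkey hash
    readcap = (H(salt + pubkey_hash, 24)           # first 192 bits
               + H(pubkey_hash, 8))                # last 64 bits
    storage_index = (H(readcap[:24], 8)            # first 64 bits
                     + H(readcap[24:], 8))         # last 64 bits
    return salt, readcap, storage_index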

The verification-cap is the first 64 bits of the storage index plus the
pubkey hash, 320 bits total. The verification-cap doesn't need to be
expressed in a printable transport-safe form, so it's ok that it's longer.

The read-cap is hashed one way to form an AES encryption key that is used to
encrypt the salt; this key is called the "salt key". The encrypted salt is
stored in the share. The private key never changes, therefore the salt never
changes, and the salt key is only used for a single purpose, so there is no
need for an IV.

The read-cap is hashed a different way to form the master data encryption
key. A random "data salt" is generated each time the share's contents are
replaced, and the master data encryption key is concatenated with the data
salt, then hashed, to form the AES CTR-mode "read key" that will be used to
encrypt the actual file data. This is to avoid key reuse. An outstanding
issue is how to avoid key reuse when files are modified in place instead of
being replaced completely; this is not done in SDMF but might occur in MDMF.
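
A sketch of this encryption step is below, assuming SHA-256 for the key
derivations (the tag strings are invented for illustration) and the
third-party "cryptography" package for AES-CTR; neither choice is specified
by the proposal.

import hashlib
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_contents(readcap, plaintext):
    master_data_key = hashlib.sha256(b"data-key:" + readcap).digest()
    data_salt = os.urandom(32)   # fresh for every replacement of the contents
    read_key = hashlib.sha256(master_data_key + data_salt).digest()
    # A fixed counter block is tolerable only because read_key is never
    # reused for two different plaintexts (a new data_salt on every write).
    encryptor = Cipher(algorithms.AES(read_key),
                       modes.CTR(b"\x00" * 16)).encryptor()
    return data_salt, encryptor.update(plaintext) + encryptor.finalize()

def encrypt_salt(readcap, salt):
    # The salt key is single-purpose, so no IV is needed; CTR with a zero
    # counter block is one way to realize that (the mode is a guess here).
    salt_key = hashlib.sha256(b"salt-key:" + readcap).digest()
    enc = Cipher(algorithms.AES(salt_key), modes.CTR(b"\x00" * 16)).encryptor()
    return enc.update(salt) + enc.finalize()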

The private key is hashed one way to form the salt, and a different way to
form the "write enabler master". For each storage server on which a share is
kept, the write enabler master is concatenated with the server's nodeid and
hashed, and the result is called the "write enabler" for that particular
server. Note that multiple shares of the same slot stored on the same server
will all get the same write enabler, i.e. the write enabler is associated
with the "bucket", rather than the individual shares.
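
A minimal sketch of that derivation (the hash tags are again invented; the
real design would use its own tagged hashes):

import hashlib

def write_enabler_for(privkey, peer_nodeid):
    # One write enabler per (slot, server) pair; every share of this slot
    # stored on the same server (the same "bucket") shares this value.
    write_enabler_master = hashlib.sha256(b"we-master:" + privkey).digest()
    return hashlib.sha256(write_enabler_master + peer_nodeid).digest()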

The private key is hashed a third way to form the "data write key", which can
be used by applications which wish to store some data in a form that is only
available to those with a write-cap, and not to those with merely a read-cap.
This is used to implement transitive read-onlyness of dirnodes.

The public key is stored on the servers, as is the encrypted salt, the
(non-encrypted) data salt, the encrypted data, and a signature. The container
records the write-enabler, but of course this is not visible to readers. To
make sure that every byte of the share can be verified by a holder of the
verify-cap (and also by the storage server itself), the signature covers the
version number, the sequence number, the root hash "R" of the share Merkle
tree, the encoding parameters, and the encrypted salt. "R" itself covers the
hash trees and the share data.

The read-write URI is just the private key. The read-only URI is the
read-cap. The verify-only URI contains the pubkey hash and the first 64 bits
of the storage index.

  FMW:b2a(privatekey)
  FMR:b2a(readcap)
  FMV:b2a(storageindex[:64])b2a(pubkey-hash)
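
A sketch of how these strings could be assembled follows. The proposal calls
for base62; base32 is substituted here purely so the example stays within the
Python standard library.

from base64 import b32encode

def b2a(data):
    # Stand-in for the proposal's binary-to-ASCII encoding.
    return b32encode(data).decode("ascii").lower().rstrip("=")

def format_uris(privkey, readcap, storage_index, pubkey_hash):
    return {
        "read-write": "FMW:" + b2a(privkey),
        "read-only":  "FMR:" + b2a(readcap),
        "verify":     "FMV:" + b2a(storage_index[:8]) + b2a(pubkey_hash),
    }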

Note that this allows the read-only and verify-only URIs to be derived from
the read-write URI without actually retrieving any data from the share, but
instead by regenerating the public key from the private one. Holders of the
read-only or verify-only caps must validate the public key against their
pubkey hash (or its derivative) the first time they retrieve the pubkey,
before trusting any signatures they see.

The SDMF slot is allocated by sending a request to the storage server with a
desired size, the storage index, and the write enabler for that server's
nodeid. If granted, the write enabler is stashed inside the slot's backing
store file. All further write requests must be accompanied by the write
enabler or they will not be honored. The storage server does not share the
write enabler with anyone else.

The SDMF slot structure will be described in more detail below. The important
pieces are:

* a sequence number
* a root hash "R"
* the data salt
* the encoding parameters (including k, N, file size, segment size)
* a signed copy of [seqnum,R,data_salt,encoding_params] (using signature key)
* the verification key (not encrypted)
* the share hash chain (part of a Merkle tree over the share hashes)
* the block hash tree (Merkle tree over blocks of share data)
* the share data itself (erasure-coding of read-key-encrypted file data)
* the salt, encrypted with the salt key

The access pattern for read (assuming we hold the write-cap) is:
* generate public key from the private one
* hash private key to get the salt, hash public key, form read-cap
* form storage-index
* use storage-index to locate 'k' shares with identical 'R' values
  * either get one share, read 'k' from it, then read k-1 shares
  * or read, say, 5 shares, discover k, either get more or be finished
  * or copy k into the URIs
* .. jump to "COMMON READ", below

To read (assuming we only hold the read-cap), do:
* hash read-cap pieces to generate storage index and salt key
* use storage-index to locate 'k' shares with identical 'R' values
* retrieve verification key and encrypted salt
* decrypt salt
* hash decrypted salt and pubkey to generate another copy of the read-cap,
  make sure they match (this validates the pubkey)
* .. jump to "COMMON READ"

* COMMON READ:
  * read seqnum, R, data salt, encoding parameters, signature
  * verify signature against verification key
  * hash data salt and read-cap to generate read-key
  * read share data, compute block-hash Merkle tree and root "r"
  * read share hash chain (leading from "r" to "R")
  * validate share hash chain up to the root "R" (a sketch of these
    hash-tree checks follows this list)
  * submit share data to erasure decoding
  * decrypt decoded data with read-key
  * submit plaintext to application
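
The two hash-tree checks above amount to climbing Merkle paths: each block
must chain up to this share's root "r", and "r" must chain up to the
file-wide root "R" through the share hash chain. The sketch below is generic
(plain SHA-256, and an explicit sibling list rather than whatever node
numbering the real encoding would use):

import hashlib

def H(data):
    return hashlib.sha256(data).digest()

def climb(leaf_hash, path):
    # path is a list of (sibling_hash, sibling_is_left) pairs, one per level.
    node = leaf_hash
    for sibling, sibling_is_left in path:
        node = H(sibling + node) if sibling_is_left else H(node + sibling)
    return node

def validate_block(block, block_path, share_root_r, share_path, root_R):
    if climb(H(block), block_path) != share_root_r:
        return False                    # block hash tree does not check out
    return climb(share_root_r, share_path) == root_R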

The access pattern for write is:
* generate pubkey, salt, read-cap, storage-index as in read case
* generate data salt for this update, generate read-key
* encrypt plaintext from application with read-key
  * application can encrypt some data with the data-write-key to make it
    only available to writers (used for transitively-readonly dirnodes)
* erasure-code crypttext to form shares
* split shares into blocks
* compute Merkle tree of blocks, giving root "r" for each share
* compute Merkle tree of shares, find root "R" for the file as a whole
* create share data structures, one per server:
  * use seqnum which is one higher than the old version
  * share hash chain has log(N) hashes, different for each server
  * signed data is the same for each server
  * include pubkey, encrypted salt, data salt
* now we have N shares and need homes for them
* walk through peers
  * if share is not already present, allocate-and-set
  * otherwise, try to modify existing share:
    * send testv_and_writev operation to each one
    * testv says to accept share if their(seqnum+R) <= our(seqnum+R)
    * count how many servers wind up with which versions (histogram over R)
    * keep going until N servers have the same version, or we run out of
      servers
  * if any servers wound up with a different version, report error to
    application
  * if we ran out of servers, initiate recovery process (described below)
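
The "their(seqnum+R) <= our(seqnum+R)" test is essentially a lexicographic
comparison of (seqnum, R) pairs. One way to encode the version key being
compared (the concrete wire encoding is an assumption, not part of the
proposal):

import struct

def version_key(seqnum, root_hash_R):
    # Big-endian fixed-width seqnum followed by R, so that a byte-wise
    # comparison orders first by sequence number, then by root hash.
    return struct.pack(">Q", seqnum) + root_hash_R

def server_should_accept(held_seqnum, held_R, new_seqnum, new_R):
    # Keep the proposed share only if the share already held is not newer.
    return version_key(held_seqnum, held_R) <= version_key(new_seqnum, new_R)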

==== Cryptographic Properties ====

This scheme protects the data's confidentiality with 192 bits of key
material, since the read-cap contains 192 secret bits (derived from an
encrypted salt, which is encrypted using those same 192 bits plus some
additional public material).

The integrity of the data (assuming that the signature is valid) is protected
by the 256-bit hash which gets included in the signature. The privilege of
modifying the data (equivalent to the ability to form a valid signature) is
protected by a 256-bit random DSA private key, and the difficulty of
computing a discrete logarithm in a 2048-bit field.

There are a few weaker denial-of-service attacks possible. If N-k+1 of the
shares are damaged or unavailable, the client will be unable to recover the
file. Any coalition of more than N-k shareholders will be able to effect this
attack by merely refusing to provide the desired share. The "write enabler"
shared secret protects existing shares from being displaced by new ones,
except by the holder of the write-cap. One server cannot affect the other
shares of the same file, once those other shares are in place.

The worst DoS attack is the "roadblock attack", which must be made before
those shares get placed. Storage indexes are effectively random (being
derived from the hash of a random value), so they are not guessable before
the writer begins their upload, but there is a window of vulnerability during
the beginning of the upload, when some servers have heard about the storage
index but not all of them.

The roadblock attack we want to prevent is when the first server that the
uploader contacts quickly runs to all the other selected servers and places a
bogus share under the same storage index, before the uploader can contact
them. These shares will normally be accepted, since storage servers create
new shares on demand. The bogus shares would have randomly-generated
write-enablers, which will of course be different from the real uploader's
write-enabler, since the malicious server does not know the write-cap.

If this attack were successful, the uploader would be unable to place any of
their shares, because the slots have already been filled by the bogus shares.
The uploader would probably try for peers further and further away from the
desired location, but eventually they will hit a preconfigured distance limit
and give up. In addition, the further the writer searches, the less likely it
is that a reader will search as far. So a successful attack will either cause
the file to be uploaded but not be reachable, or it will cause the upload to
fail.

If the uploader tries again (creating a new privkey), they may get lucky and
the malicious servers will appear later in the query list, giving sufficient
honest servers a chance to see their share before the malicious one manages
to place bogus ones.

The first line of defense against this attack is the timing challenge: the
attacking server must be ready to act the moment a storage request arrives
(which will only occur for a certain percentage of all new-file uploads), and
only has a few seconds to act before the other servers will have allocated
the shares (and recorded the write-enabler, terminating the window of
vulnerability).

The second line of defense is post-verification, and is possible because the
storage index is partially derived from the public key hash. A storage server
can, at any time, verify every public bit of the container as being signed by
the verification key (this operation is recommended as a continual background
process, when disk usage is minimal, to detect disk errors). The server can
also hash the verification key to derive 64 bits of the storage index. If it
detects that these 64 bits do not match (but the rest of the share validates
correctly), then the implication is that this share was stored to the wrong
storage index, either due to a bug or a roadblock attack.

If an uploader finds that they are unable to place their shares because of
"bad write enabler" errors (as reported by the prospective storage servers),
they can "cry foul", and ask the storage server to perform this verification
on the share in question. If the pubkey and storage index do not match, the
storage server can delete the bogus share, thus allowing the real uploader to
place their share. Of course the origin of the offending bogus share should
be logged and reported to a central authority, so corrective measures can be
taken. It may be necessary to have this "cry foul" protocol include the new
write-enabler, to close the window during which the malicious server can
re-submit the bogus share during the adjudication process.

If the problem persists, the servers can be placed into pre-verification
mode, in which this verification is performed on all potential shares before
they are committed to disk. This mode is more CPU-intensive (since normally
the storage server ignores the contents of the container altogether), but
would solve the problem completely.

The mere existence of these potential defenses should be sufficient to deter
any actual attacks. Note that the storage index only has 64 bits of
pubkey-derived data in it, which is below the usual crypto guidelines for
security factors. In this case it's a pre-image attack which would be needed,
rather than a collision, and the actual attack would be to find a keypair for
which the public key can be hashed three times to produce the desired portion
of the storage index. We believe that 64 bits of material is sufficiently
resistant to this form of pre-image attack to serve as a suitable deterrent.


=== Server Storage Protocol ===

The storage servers will provide a mutable slot container which is oblivious
to the details of the data being contained inside it. Each storage index
refers to a "bucket", and each bucket has one or more shares inside it. (In a
well-provisioned network, each bucket will have only one share.) The bucket
is stored as a directory, using the base32-encoded storage index as the
directory name. Each share is stored in a single file, using the share number
as the filename.

The container holds space for a container magic number (for versioning), the
write enabler, the nodeid which accepted the write enabler (used for share
migration, described below), a small number of lease structures, the embedded
data itself, and expansion space for additional lease structures.

 #   offset    size    name
 1   0         32      magic verstr "tahoe mutable container v1" plus binary
 2   32        20      write enabler's nodeid
 3   52        32      write enabler
 4   84        8       data size (actual share data present) (a)
 5   92        8       offset of (8) count of extra leases (after data)
 6   100       368     four leases, 92 bytes each
                         0    4   ownerid (0 means "no lease here")
                         4    4   expiration timestamp
                         8   32   renewal token
                        40   32   cancel token
                        72   20   nodeid which accepted the tokens
 7   468       (a)     data
 8   ??        4       count of extra leases
 9   ??        n*92    extra leases
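
The fixed-size portion of this layout can be unpacked directly with Python's
struct module. The sketch below assumes the offsets in the table above,
treats leases as opaque tuples, and ignores the extra-lease region entirely.

import struct

HEADER_FMT = ">32s20s32sQQ"    # magic, WE nodeid, write enabler,
                               # data size, offset of extra-lease count
LEASE_FMT = ">LL32s32s20s"     # ownerid, expiration, renew, cancel, nodeid

def parse_container(container_bytes):
    (magic, we_nodeid, write_enabler,
     data_size, extra_lease_offset) = struct.unpack_from(HEADER_FMT,
                                                         container_bytes, 0)
    leases = []
    for i in range(4):
        lease = struct.unpack_from(LEASE_FMT, container_bytes, 100 + 92 * i)
        if lease[0] != 0:                  # ownerid 0 means "no lease here"
            leases.append(lease)
    data = container_bytes[468:468 + data_size]
    return magic, we_nodeid, write_enabler, leases, data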

The "extra leases" field must be copied and rewritten each time the size of
the enclosed data changes. The hope is that most buckets will have four or
fewer leases and this extra copying will not usually be necessary.

The (4) "data size" field contains the actual number of bytes of data present
in field (7), such that a client request to read beyond 504+(a) will result
in an error. This allows the client to (one day) read relative to the end of
the file. The container size (that is, (8)-(7)) might be larger, especially
if extra size was pre-allocated in anticipation of filling the container with
a lot of data.

The offset in (5) points at the *count* of extra leases, at (8). The actual
leases (at (9)) begin 4 bytes later. If the container size changes, both (8)
and (9) must be relocated by copying.

The server will honor any write commands that provide the write enabler and
do not exceed the server-wide storage size limitations. Read and write
commands MUST be restricted to the 'data' portion of the container: the
implementation of those commands MUST perform correct bounds-checking to make
sure other portions of the container are inaccessible to the clients.

The two methods provided by the storage server on these "MutableSlot" share
objects are:

* readv(ListOf(offset=int, length=int))
  * returns a list of bytestrings, of the various requested lengths
  * offset < 0 is interpreted relative to the end of the data
  * spans which hit the end of the data will return truncated data

* testv_and_writev(write_enabler, test_vector, write_vector)
  * this is a test-and-set operation which performs the given tests and only
    applies the desired writes if all tests succeed. This is used to detect
    simultaneous writers, and to reduce the chance that an update will lose
    data recently written by some other party (written after the last time
    this slot was read).
  * test_vector=ListOf(TupleOf(offset, length, opcode, specimen))
    * the opcode is a string, from the set [gt, ge, eq, le, lt, ne]
    * each element of the test vector is read from the slot's data and
      compared against the specimen using the desired (in)equality. If all
      tests evaluate True, the write is performed
  * write_vector=ListOf(TupleOf(offset, newdata))
    * offset < 0 is not yet defined, it probably means relative to the
      end of the data, which probably means append, but we haven't nailed
      it down quite yet
    * write vectors are executed in order, which specifies the results of
      overlapping writes
  * return value:
    * error: OutOfSpace
    * error: something else (io error, out of memory, whatever)
    * (True, old_test_data): the write was accepted (test_vector passed)
    * (False, old_test_data): the write was rejected (test_vector failed)
    * both 'accepted' and 'rejected' return the old data that was used
      for the test_vector comparison. This can be used by the client
      to detect write collisions, including collisions for which the
      desired behavior was to overwrite the old version.
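
A toy in-memory model of testv_and_writev, to make the semantics above
concrete. It ignores leases, size limits, and negative offsets, and it maps
the opcodes onto Python's byte-string comparisons; it is a sketch of the
semantics, not the server implementation.

import operator

_OPS = {"lt": operator.lt, "le": operator.le, "eq": operator.eq,
        "ge": operator.ge, "gt": operator.gt, "ne": operator.ne}

def testv_and_writev(data, write_enabler, expected_write_enabler,
                     test_vector, write_vector):
    # data is a bytearray holding the slot's current contents.
    if write_enabler != expected_write_enabler:
        raise ValueError("bad write enabler")
    old_test_data = [bytes(data[offset:offset + length])
                     for (offset, length, _opcode, _specimen) in test_vector]
    accepted = all(_OPS[opcode](bytes(data[offset:offset + length]), specimen)
                   for (offset, length, opcode, specimen) in test_vector)
    if accepted:
        for (offset, newdata) in write_vector:   # in order: later writes win
            end = offset + len(newdata)
            if end > len(data):
                data.extend(b"\x00" * (end - len(data)))
            data[offset:end] = newdata
    return accepted, old_test_data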

In addition, the storage server provides several methods to access these
share objects:

* allocate_mutable_slot(storage_index, sharenums=SetOf(int))
  * returns DictOf(int, MutableSlot)
* get_mutable_slot(storage_index)
  * returns DictOf(int, MutableSlot)
  * or raises KeyError

We intend to add an interface which allows small slots to allocate-and-write
in a single call, as well as do update or read in a single call. The goal is
to allow a reasonably-sized dirnode to be created (or updated, or read) in
just one round trip (to all N shareholders in parallel).

==== migrating shares ====

If a share must be migrated from one server to another, two values become
invalid: the write enabler (since it was computed for the old server), and
the lease renew/cancel tokens.

Suppose that a slot was first created on nodeA, and was thus initialized with
WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is
moved from nodeA to nodeB.

Readers may still be able to find the share in its new home, depending upon
how many servers are present in the grid, where the new nodeid lands in the
permuted index for this particular storage index, and how many servers the
reading client is willing to contact.

When a client attempts to write to this migrated share, it will get a "bad
write enabler" error, since the WE it computes for nodeB will not match the
WE(nodeA) that was embedded in the share. When this occurs, the "bad write
enabler" message must include the old nodeid (e.g. nodeA) that was in the
share.

The client then computes H(nodeB+H(WEM+nodeA)), which is the same as
H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which
is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to
anyone else. Also note that the client does not send a value to nodeB that
would allow the node to impersonate the client to a third node: everything
sent to nodeB will include something specific to nodeB in it.

The server locally computes H(nodeB+WE(nodeA)), using its own node id and the
old write enabler from the share. It compares this against the value supplied
by the client. If they match, this serves as proof that the client was able
to compute the old write enabler. The server then accepts the client's new
WE(nodeB) and writes it into the container.

This WE-fixup process requires an extra round trip, and requires the error
message to include the old nodeid, but does not require any public key
operations on either client or server.
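
A sketch of both halves of this exchange; plain SHA-256 and simple
concatenation stand in for whatever tagged hash the real protocol would use.

import hashlib

def H(data):
    return hashlib.sha256(data).digest()

def client_fixup_message(write_enabler_master, old_nodeid, new_nodeid):
    # Prove knowledge of WE(nodeA) without sending anything nodeB could
    # replay against a third server: everything is bound to new_nodeid.
    old_we = H(write_enabler_master + old_nodeid)     # WE(nodeA)
    proof = H(new_nodeid + old_we)                    # H(nodeB + WE(nodeA))
    new_we = H(write_enabler_master + new_nodeid)     # WE(nodeB)
    return proof, new_we

def server_accept_fixup(my_nodeid, stored_old_we, proof, new_we):
    # nodeB recomputes the proof from its own nodeid and the write enabler
    # already embedded in the migrated share.
    if H(my_nodeid + stored_old_we) != proof:
        raise ValueError("fixup proof does not match; rejecting new WE")
    return new_we          # the caller writes this into the container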

Migrating the leases will require a similar protocol. This protocol will be
defined concretely at a later date.

=== Code Details ===

The current FileNode class will be renamed ImmutableFileNode, and a new
MutableFileNode class will be created. Instances of this class will contain a
URI and a reference to the client (for peer selection and connection). The
methods of MutableFileNode are:

* replace(newdata) -> OK, ConsistencyError, NotEnoughPeersError
* get() -> [deferred] newdata, NotEnoughPeersError
  * if there are multiple retrievable versions in the grid, get() returns
    the first version it can reconstruct, and silently ignores the others.
    In the future, a more advanced API will signal and provide access to
    the multiple heads.

The peer-selection and data-structure manipulation (and signing/verification)
steps will be implemented in a separate class in allmydata/mutable.py.
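
A minimal interface sketch of the proposed class is shown below; the method
bodies, and the use of Twisted Deferreds, are placeholders rather than the
actual implementation.

class MutableFileNode:
    """Sketch of the proposed class; the real peer-selection and
    publish/retrieve logic would live in allmydata/mutable.py."""

    def __init__(self, uri, client):
        self.uri = uri          # read-write or read-only cap for the slot
        self.client = client    # used for peer selection and connections

    def replace(self, newdata):
        # Would return a Deferred that fires with OK, or errbacks with
        # ConsistencyError or NotEnoughPeersError.
        raise NotImplementedError

    def get(self):
        # Would return a Deferred firing with the first reconstructable
        # version; other simultaneous versions are silently ignored for now.
        raise NotImplementedError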

=== SDMF Slot Format ===

This SDMF data lives inside a server-side MutableSlot container. The server
is generally oblivious to this format, but it may look inside the container
when verification is desired.

This data is tightly packed. There are no gaps left between the different
fields, and the offset table is mainly present to allow future flexibility of
key sizes.

 #    offset   size    name
 1    0        1       version byte, \x01 for this format
 2    1        8       sequence number. 2^64-1 must be handled specially, TBD
 3    9        32      "R" (root of share hash Merkle tree)
 4    41       32      data salt (readkey is H(readcap+data_salt))
 5    73       32      encrypted salt (AESenc(key=H(readcap), salt))
 6    105      18      encoding parameters:
       105      1        k
       106      1        N
       107      8        segment size
       115      8        data length (of original plaintext)
 7    123      36      offset table:
       127      4        (9) signature
       131      4        (10) share hash chain
       135      4        (11) block hash tree
       139      4        (12) share data
       143      8        (13) EOF
 8    151      256     verification key (2048-bit DSA key)
 9    407      40      signature=DSAsig(H([1,2,3,4,5,6]))
 10   447      (a)     share hash chain, encoded as:
                         "".join([pack(">H32s", shnum, hash)
                                  for (shnum,hash) in needed_hashes])
 11   ??       (b)     block hash tree, encoded as:
                         "".join([pack(">32s",hash) for hash in block_hash_tree])
 12   ??       LEN     share data
 13   ??       --      EOF

(a) The share hash chain contains ceil(log2(N)) hashes, each 32 bytes long.
    This is the set of hashes necessary to validate this share's leaf in the
    share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes.
(b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes
    long. This is the set of hashes necessary to validate any given block of
    share data up to the per-share root "r". Each "r" is a leaf of the share
    hash tree (with root "R"), from which a minimal subset of hashes is put
    in the share hash chain in (10).
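
The fixed-size prefix of a share (fields 1 through 6, offsets 0 through 122)
can be unpacked as shown below. This is a sketch that assumes the layout
above and stops before the offset table and the variable-length fields.

import struct

PREFIX_FMT = ">B Q 32s 32s 32s B B Q Q"   # fields 1..6: 123 bytes total

def parse_sdmf_prefix(share):
    (version, seqnum, root_hash_R, data_salt, encrypted_salt,
     k, N, segment_size, data_length) = struct.unpack_from(PREFIX_FMT, share, 0)
    if version != 1:
        raise ValueError("unknown SDMF version byte: %d" % version)
    return dict(seqnum=seqnum, R=root_hash_R, data_salt=data_salt,
                encrypted_salt=encrypted_salt, k=k, N=N,
                segment_size=segment_size, data_length=data_length)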

=== Recovery ===

The first line of defense against damage caused by colliding writes is the
Prime Coordination Directive: "Don't Do That".

The second line of defense is to keep "S" (the number of competing versions)
lower than N/k. If this holds true, at least one competing version will have
k shares and thus be recoverable. Note that server unavailability counts
against us here: the old version stored on the unavailable server must be
included in the value of S.

The third line of defense is our use of testv_and_writev() (described above),
which increases the convergence of simultaneous writes: one of the writers
will be favored (the one with the highest "R"), and that version is more
likely to be accepted than the others. This defense is least effective in the
pathological situation where S simultaneous writers are active, the one with
the lowest "R" writes to N-k+1 of the shares and then dies, then the one with
the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the
one with the highest "R" writes to k-1 shares and dies. Any other sequencing
will allow the highest "R" to write to at least k shares and establish a new
revision.

The fourth line of defense is the fact that each client keeps writing until
at least one version has N shares. This uses additional servers, if
necessary, to make sure that either the client's version or some
newer/overriding version is highly available.

The fifth line of defense is the recovery algorithm, which seeks to make sure
that at least *one* version is highly available, even if that version is
somebody else's.

The write-shares-to-peers algorithm is as follows:

* permute peers according to storage index
* walk through peers, trying to assign one share per peer
* for each peer:
  * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test
    * this means that we will overwrite any old versions, and we will
      overwrite simultaneous writers of the same version if our R is higher.
      We will not overwrite writers using a higher seqnum.
  * record the version that each share winds up with. If the write was
    accepted, this is our own version. If it was rejected, read the
    old_test_data to find out what version was retained.
    * if old_test_data indicates the seqnum was equal or greater than our
      own, mark the "Simultaneous Writes Detected" flag, which will
      eventually result in an error being reported to the writer (in their
      close() call).
* build a histogram of "R" values
* repeat until the histogram indicates that some version (possibly ours)
  has N shares. Use new servers if necessary.
* If we run out of servers:
  * if there are at least shares-of-happiness of any one version, we're
    happy, so return. (the close() might still get an error)
  * not happy, need to reinforce something, goto RECOVERY

RECOVERY:
* read all shares, count the versions, identify the recoverable ones,
  discard the unrecoverable ones.
* sort versions: locate max(seqnums), put all versions with that seqnum
  in the list, sort by number of outstanding shares. Then put our own
  version. (TODO: put versions with seqnum <max but >us ahead of us?).
* for each version:
  * attempt to recover that version
  * if not possible, remove it from the list, go to next one
  * if recovered, start at beginning of peer list, push that version,
    continue until N shares are placed
  * if pushing our own version, bump up the seqnum to one higher than
    the max seqnum we saw
  * if we run out of servers:
    * schedule retry and exponential backoff to repeat RECOVERY
    * admit defeat after some period? presumably the client will be shut
      down eventually, maybe keep trying (once per hour?) until then.
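
The bookkeeping in both loops reduces to a histogram over (seqnum, R)
versions. A rough sketch of just that part (peer permutation and the actual
share pushes are omitted, and "shares_of_happiness" is the usual Tahoe
threshold):

from collections import Counter

def tally(responses):
    # responses maps peerid -> (seqnum, R) for the share that peer now holds.
    return Counter(responses.values())

def placement_outcome(responses, N, shares_of_happiness):
    histogram = tally(responses)
    if histogram:
        version, count = histogram.most_common(1)[0]
        if count >= N:
            return "done", version
        if count >= shares_of_happiness:
            return "happy", version     # close() may still report an error
    return "recovery", None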


== Medium Distributed Mutable Files ==

These are just like the SDMF case, but:

* we actually take advantage of the Merkle hash tree over the blocks, by
  reading a single segment of data at a time (and its necessary hashes), to
  reduce the read-time alacrity
* we allow arbitrary writes to the file (i.e. seek() is provided, and
  O_TRUNC is no longer required)
* we write more code on the client side (in the MutableFileNode class), to
  first read each segment that a write must modify. This looks exactly like
  the way a normal filesystem uses a block device, or how a CPU must perform
  a cache-line fill before modifying a single word.
* we might implement some sort of copy-based atomic update server call,
  to allow multiple writev() calls to appear atomic to any readers.

MDMF slots provide fairly efficient in-place edits of very large files (a few
GB). Appending data is also fairly efficient, although each time a power of 2
boundary is crossed, the entire file must effectively be re-uploaded (because
the size of the block hash tree changes), so if the filesize is known in
advance, that space ought to be pre-allocated (by leaving extra space between
the block hash tree and the actual data).

MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2
adds cache-line reads to allow random-access writes.

== Large Distributed Mutable Files ==

LDMF slots use a fundamentally different way to store the file, inspired by
Mercurial's "revlog" format. They enable very efficient insert/remove/replace
editing of arbitrary spans. Multiple versions of the file can be retained, in
a revision graph that can have multiple heads. Each revision can be
referenced by a cryptographic identifier. There are two forms of the URI, one
that means "most recent version", and a longer one that points to a specific
revision.

Metadata can be attached to the revisions, like timestamps, to enable rolling
back an entire tree to a specific point in history.

LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2
provides explicit support for revision identifiers and branching.

== TODO ==

improve allocate-and-write or get-writer-buckets API to allow one-call (or
maybe two-call) updates. The challenge is in figuring out which shares are on
which machines. First cut will have lots of round trips.

(eventually) define behavior when seqnum wraps. At the very least make sure
it can't cause a security problem. "the slot is worn out" is acceptable.

(eventually) define share-migration lease update protocol. Including the
nodeid who accepted the lease is useful: we can use the same protocol as we
do for updating the write enabler. However we need to know which lease to
update; maybe send back a list of all old nodeids that we find, then try all
of them when we accept the update?

We now do this in a specially-formatted IndexError exception:
  "UNABLE to renew non-existent lease. I have leases accepted by " +
  "nodeids: '12345','abcde','44221' ."

Every node in a given Tahoe grid must have the same common DSA modulus and
generator, but different grids could use different parameters. We haven't
figured out how to define a "grid id" yet, but I think the DSA parameters
should be part of that identifier. In practical terms, this might mean that
the Introducer tells each node what parameters to use, or perhaps the node
could have a config file which specifies them instead.
---|