.. -*- coding: utf-8-with-signature -*-

=======================
Tahoe-LAFS Architecture
=======================

1. `Overview`_
2. `The Key-Value Store`_
3. `File Encoding`_
4. `Capabilities`_
5. `Server Selection`_
6. `Swarming Download, Trickling Upload`_
7. `The File Store Layer`_
8. `Leases, Refreshing, Garbage Collection`_
9. `File Repairer`_
10. `Security`_
11. `Reliability`_


Overview
========

(See the `docs/specifications directory`_ for more details.)

There are three layers: the key-value store, the file store, and the
application.

The lowest layer is the key-value store. The keys are "capabilities" -- short
ASCII strings -- and the values are sequences of data bytes. This data is
encrypted and distributed across a number of nodes, such that it will survive
the loss of most of the nodes. There are no hard limits on the size of the
values, but there may be performance issues with extremely large values (just
due to the limitation of network bandwidth). In practice, values as small as
a few bytes and as large as tens of gigabytes are in common use.

The middle layer is the decentralized file store: a directed graph in which
the intermediate nodes are directories and the leaf nodes are files. The leaf
nodes contain only the data -- they contain no metadata other than the length
in bytes. The edges leading to leaf nodes have metadata attached to them
about the file they point to. Therefore, the same file may be associated with
different metadata if it is referred to through different edges.

The top layer consists of the applications using the file store.
Allmydata.com used it for a backup service: the application periodically
copied files from the local disk onto the decentralized file store, and later
provided read-only access to those files, allowing users to recover them.
There are several other applications built on top of the Tahoe-LAFS
file store (see the RelatedProjects_ page of the wiki for a list).

.. _docs/specifications directory: https://github.com/tahoe-lafs/tahoe-lafs/tree/master/docs/specifications
.. _RelatedProjects: https://tahoe-lafs.org/trac/tahoe-lafs/wiki/RelatedProjects

The Key-Value Store
===================

The key-value store is implemented by a grid of Tahoe-LAFS storage servers --
user-space processes. Tahoe-LAFS storage clients communicate with the storage
servers over TCP.

There are two supported protocols:

* Foolscap, the only supported protocol in releases before v1.19.
* HTTPS, new in v1.19.

By default HTTPS is enabled. When HTTPS is enabled on the server, the server
transparently listens for both Foolscap and HTTPS on the same port. When it is
disabled, the server only supports Foolscap. Clients can use either; by default
they will use HTTPS when possible, falling back to Foolscap, but when configured
appropriately they will only use Foolscap. At this time the only limitation of
HTTPS is that I2P is not supported, so any usage of I2P uses only Foolscap.

Storage servers hold data in the form of "shares". Shares are encoded pieces
of files. There are a configurable number of shares for each file, 10 by
default. Normally, each share is stored on a separate server, but in some
cases a single server can hold multiple shares of a file.

Nodes learn about each other through an "introducer". Each server connects to
the introducer at startup and announces its presence. Each client connects to
the introducer at startup, and receives a list of all servers from it. Each
client then connects to every server, creating a "bi-clique" topology. In the
current release, nodes behind NAT boxes will connect to all nodes that they
can open connections to, but they cannot open connections to other nodes
behind NAT boxes. Therefore, the more nodes behind NAT boxes, the less the
topology resembles the intended bi-clique topology.

The introducer is a Single Point of Failure ("SPoF"), in that clients who
never connect to the introducer will be unable to connect to any storage
servers, but once a client has been introduced to everybody, it does not need
the introducer again until it is restarted. The danger of a SPoF is further
reduced in two ways. First, the introducer is defined by a hostname and a
private key, which are easy to move to a new host in case the original one
suffers an unrecoverable hardware problem. Second, even if the private key is
lost, clients can be reconfigured to use a new introducer.

For future releases, we have plans to decentralize introduction, allowing any
server to tell a new client about all the others.


File Encoding
=============

When a client stores a file on the grid, it first encrypts the file. It then
breaks the encrypted file into small segments, in order to reduce the memory
footprint, and to decrease the lag between initiating a download and
receiving the first part of the file; for example the lag between hitting
"play" and a movie actually starting.

The client then erasure-codes each segment, producing blocks of which only a
subset are needed to reconstruct the segment (3 out of 10, with the default
settings).

It sends one block from each segment to a given server. The set of blocks on
a given server constitutes a "share". Therefore a subset of the shares (3 out
of 10, by default) are needed to reconstruct the file.

A hash of the encryption key is used to form the "storage index", which is
used for both server selection (described below) and to index shares within
the Storage Servers on the selected nodes.

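The sketch below illustrates this pipeline in Python; it is not the
Tahoe-LAFS implementation. The helper names are hypothetical, the storage
index is shown as a plain truncated hash (the real derivation uses a tagged
hash), and ``erasure_encode`` is only a stand-in for a real k-of-N coder such
as the ``zfec`` library used by Tahoe-LAFS::

    import hashlib
    import os

    def derive_storage_index(encryption_key: bytes) -> bytes:
        # A hash of the encryption key, truncated to a short identifier.
        return hashlib.sha256(b"storage-index:" + encryption_key).digest()[:16]

    def segments(ciphertext: bytes, segment_size: int = 128 * 1024):
        # Break the ciphertext into fixed-size segments (the last may be short).
        for i in range(0, len(ciphertext), segment_size):
            yield ciphertext[i:i + segment_size]

    def erasure_encode(segment: bytes, k: int = 3, n: int = 10):
        # Stand-in only: a real coder returns n blocks, any k of which are
        # enough to rebuild the segment. Here we just slice and pad to show
        # the data layout; the padding blocks carry no real redundancy.
        block_len = -(-len(segment) // k)            # ceiling division
        primary = [segment[i * block_len:(i + 1) * block_len].ljust(block_len, b"\0")
                   for i in range(k)]
        return primary + [b"\0" * block_len] * (n - k)

    encryption_key = os.urandom(16)
    storage_index = derive_storage_index(encryption_key)
    shares = {sharenum: [] for sharenum in range(10)}
    for seg in segments(b"...ciphertext bytes..."):
        for sharenum, block in enumerate(erasure_encode(seg)):
            # The share sent to one server is the list of blocks that carry
            # the same share number, one block per segment.
            shares[sharenum].append(block)
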
The client computes secure hashes of the ciphertext and of the shares. It
uses `Merkle Trees`_ so that it is possible to verify the correctness of a
subset of the data without requiring all of the data. For example, this
allows you to verify the correctness of the first segment of a movie file and
then begin playing the movie file in your movie viewer before the entire
movie file has been downloaded.

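To make that concrete, here is a minimal Merkle tree over segment hashes and
a check of one leaf against the root. It is an illustration of the idea only;
the tag strings and tree layout do not match the Tahoe-LAFS hash structure::

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(leaves):
        # Build the tree bottom-up; duplicate the last node on odd-sized levels.
        level = [h(b"leaf:" + leaf) for leaf in leaves]
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])
            level = [h(b"node:" + level[i] + level[i + 1])
                     for i in range(0, len(level), 2)]
        return level[0]

    segments = [b"segment-0", b"segment-1", b"segment-2", b"segment-3"]
    root = merkle_root(segments)

    # Verifying one downloaded segment needs only the hashes on its path to
    # the root (supplied alongside the share), not the other segments.
    sibling_hashes = [h(b"leaf:" + segments[1]),
                      h(b"node:" + h(b"leaf:" + segments[2])
                                 + h(b"leaf:" + segments[3]))]
    recomputed = h(b"node:" + h(b"node:" + h(b"leaf:" + segments[0])
                                         + sibling_hashes[0])
                             + sibling_hashes[1])
    assert recomputed == root
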
These hashes are stored in a small datastructure named the Capability
Extension Block which is stored on the storage servers alongside each share.

The capability contains the encryption key, the hash of the Capability
Extension Block, and any encoding parameters necessary to perform the
eventual decoding process. For convenience, it also contains the size of the
file being stored.

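Rendered as a string, an immutable read capability therefore carries a small,
fixed set of fields. The sketch below shows the general shape only; the exact
field encodings are defined in the specifications, and ``make_chk_cap`` is a
hypothetical helper, not a Tahoe-LAFS API::

    from base64 import b32encode

    def b32(data: bytes) -> str:
        # Tahoe-LAFS uses an unpadded lowercase base32 alphabet; this is close
        # enough for illustration.
        return b32encode(data).decode("ascii").rstrip("=").lower()

    def make_chk_cap(key: bytes, ueb_hash: bytes, k: int, n: int, size: int) -> str:
        # Encryption key, hash of the Capability Extension Block, encoding
        # parameters, and file size -- everything needed to fetch and decode.
        return f"URI:CHK:{b32(key)}:{b32(ueb_hash)}:{k}:{n}:{size}"

    print(make_chk_cap(b"\x01" * 16, b"\x02" * 32, 3, 10, 1048576))
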
To download, a client turns a capability back into a sequence of bytes: it
obtains the blocks from storage servers, uses erasure-decoding to turn them
into segments of ciphertext, uses the decryption key to convert that into
plaintext, and then emits the plaintext bytes to the output target.

.. _`Merkle Trees`: http://systems.cs.colorado.edu/grunwald/Classes/Fall2003-InformationStorage/Papers/merkle-tree.pdf


Capabilities
============

Capabilities to immutable files represent a specific set of bytes. Think of
it like a hash function: you feed in a bunch of bytes, and you get out a
capability, which is deterministically derived from the input data: changing
even one bit of the input data will result in a completely different
capability.

Read-only capabilities to mutable files represent the ability to get a set of
bytes representing some version of the file, most likely the latest version.
Each read-only capability is unique. In fact, each mutable file has a unique
public/private key pair created when the mutable file is created, and the
read-only capability to that file includes a secure hash of the public key.

Read-write capabilities to mutable files represent the ability to read the
file (just like a read-only capability) and also to write a new version of
the file, overwriting any extant version. Read-write capabilities are unique
-- each one includes the secure hash of the private key associated with that
mutable file.

The capability provides both "location" and "identification": you can use it
to retrieve a set of bytes, and then you can use it to validate ("identify")
that these potential bytes are indeed the ones that you were looking for.

The "key-value store" layer doesn't include human-meaningful names.
Capabilities sit on the "global+secure" edge of `Zooko's Triangle`_. They are
self-authenticating, meaning that nobody can trick you into accepting a file
that doesn't match the capability you used to refer to that file. The
file store layer (described below) adds human-meaningful names atop the
key-value layer.

.. _`Zooko's Triangle`: https://en.wikipedia.org/wiki/Zooko%27s_triangle


Server Selection
================

When a file is uploaded, the encoded shares are sent to some servers. But to
which ones? The "server selection" algorithm is used to make this choice.

The storage index is used to consistently-permute the set of all server nodes
(by sorting them by ``HASH(storage_index+nodeid)``). Each file gets a different
permutation, which (on average) will evenly distribute shares among the grid
and avoid hotspots. Each server has announced its available space when it
connected to the introducer, and we use that available space information to
remove any servers that cannot hold an encoded share for our file. Then we ask
some of the servers thus removed if they are already holding any encoded shares
for our file; we use this information later. (We ask any servers which are in
the first 2*``N`` elements of the permuted list.)

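A sketch of that permutation step, assuming hypothetical ``Server`` records
that carry the ``nodeid`` and the space announced to the introducer::

    import hashlib
    from dataclasses import dataclass

    @dataclass
    class Server:
        nodeid: bytes
        available_space: int

    def permute_servers(storage_index: bytes, servers, share_size: int, n: int = 10):
        # Sort every server by HASH(storage_index + nodeid): each storage index
        # yields a different, but stable, ordering of the same servers.
        permuted = sorted(
            servers,
            key=lambda s: hashlib.sha256(storage_index + s.nodeid).digest())
        # Drop servers that announced too little space to hold one share, but
        # keep the first 2*N entries so we can ask them about existing shares.
        writable = [s for s in permuted if s.available_space >= share_size]
        ask_about_existing = permuted[:2 * n]
        return writable, ask_about_existing
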
We then use the permuted list of servers to ask each server, in turn, if it
will hold a share for us (a share that was not reported as being already
present when we talked to the full servers earlier, and that we have not
already planned to upload to a different server). We plan to send a share to a
server by sending an 'allocate_buckets() query' to the server with the number
of that share. Some will say yes they can hold that share, others (those who
have become full since they announced their available space) will say no; when
a server refuses our request, we take that share to the next server on the
list. In the response to allocate_buckets() the server will also inform us of
any shares of that file that it already has. We keep going until we run out of
shares that need to be stored. At the end of the process, we'll have a table
that maps each share number to a server, and then we can begin the encode and
push phase, using the table to decide where each share should be sent.

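In outline, the negotiation looks like the following greedy loop. This is a
simplified sketch rather than the real uploader; ``allocate_buckets`` here is
an assumed callable that returns two sets: the share numbers the server
accepted, and the share numbers of this file it already holds::

    def place_shares(writable_servers, allocate_buckets, total_shares=10):
        # Build the table mapping each share number to the server that will
        # (or already does) hold it.
        share_to_server = {}
        remaining = set(range(total_shares))
        servers = list(writable_servers)
        while remaining and servers:
            for server in list(servers):
                if not remaining:
                    break
                sharenum = min(remaining)
                accepted, already_have = allocate_buckets(server, {sharenum})
                for num in accepted | already_have:
                    share_to_server[num] = server
                remaining -= accepted | already_have
                if sharenum not in accepted | already_have:
                    # Full or refusing servers are dropped from the permuted list.
                    servers.remove(server)
        return share_to_server
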
Most of the time, this will result in one share per server, which gives us
maximum reliability. If there are fewer writable servers than there are
unstored shares, we'll be forced to loop around, eventually giving multiple
shares to a single server.

If we have to loop through the node list a second time, we accelerate the query
process by asking each node to hold multiple shares on the second pass. In
most cases, this means we'll never send more than two queries to any given
node.

If a server is unreachable, or has an error, or refuses to accept any of our
shares, we remove it from the permuted list, so we won't query it again for
this file. If a server already has shares for the file we're uploading, we add
that information to the share-to-server table. This lets us do less work for
files which have been uploaded once before, while making sure we still wind up
with as many shares as we desire.

Before a file upload is called successful, it has to pass an upload health
check. For immutable files, we check to see that a condition called
'servers-of-happiness' is satisfied. When satisfied, 'servers-of-happiness'
assures us that enough pieces of the file are distributed across enough
servers on the grid to ensure that the availability of the file will not be
affected if a few of those servers later fail. For mutable files and
directories, we check to see that all of the encoded shares generated during
the upload process were successfully placed on the grid. This is a weaker
check than 'servers-of-happiness'; it does not consider any information about
how the encoded shares are placed on the grid, and cannot detect situations in
which all or a majority of the encoded shares generated during the upload
process reside on only one storage server. We hope to extend
'servers-of-happiness' to mutable files in a future release of Tahoe-LAFS. If,
at the end of the upload process, the appropriate upload health check fails,
the upload is considered a failure.

The current defaults use ``k`` = 3, ``servers_of_happiness`` = 7, and ``N`` = 10.
``N`` = 10 means that we'll try to place 10 shares. ``k`` = 3 means that we need
any three shares to recover the file. ``servers_of_happiness`` = 7 means that
we'll consider an immutable file upload to be successful if we can place shares
on enough servers that there are 7 different servers, the correct functioning
of any ``k`` of which guarantees the availability of the immutable file.

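These three parameters are set per client in ``tahoe.cfg``. A sketch of the
relevant section with the default values (see the configuration documentation
for the authoritative option descriptions)::

    [client]
    # k: how many shares are needed to reconstruct a file
    shares.needed = 3
    # servers_of_happiness: how well distributed the shares must be
    shares.happy = 7
    # N: how many shares to create in total
    shares.total = 10
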
``N`` = 10 and ``k`` = 3 means there is a 3.3x expansion factor. On a small grid, you
should set ``N`` about equal to the number of storage servers in your grid; on a
large grid, you might set it to something smaller to avoid the overhead of
contacting every server to place a file. In either case, you should then set ``k``
such that ``N``/``k`` reflects your desired availability goals. The best value for
``servers_of_happiness`` will depend on how you use Tahoe-LAFS. In a friendnet
with a variable number of servers, it might make sense to set it to the smallest
number of servers that you expect to have online and accepting shares at any
given time. In a stable environment without much server churn, it may make
sense to set ``servers_of_happiness`` = ``N``.

When downloading a file, the current version just asks all known servers for
any shares they might have. Once it has received enough responses that it
knows where to find the needed k shares, it downloads at least the first
segment from those servers. This means that it tends to download shares from
the fastest servers. If some servers had more than one share, it will continue
sending "Do You Have Block" requests to other servers, so that it can download
subsequent segments from distinct servers (sorted by their DYHB round-trip
times), if possible.

*future work*

A future release will use the server selection algorithm to reduce the
number of queries that must be sent out.

Other peer-node selection algorithms are possible. One earlier version
(known as "Tahoe 3") used the permutation to place the nodes around a large
ring, distributed the shares evenly around the same ring, then walked
clockwise from 0 with a basket. Each time it encountered a share, it put it
in the basket; each time it encountered a server, it gave that server as many
shares from the basket as it would accept. This reduced the number of queries
(usually to 1) for small grids (where ``N`` is larger than the number of
nodes), but resulted in extremely non-uniform share distribution, which
significantly hurt reliability (sometimes the permutation resulted in most
of the shares being dumped on a single node).

Another algorithm (known as "denver airport" [#naming]_) uses the permuted hash to
decide on an approximate target for each share, then sends lease requests
via Chord routing. The request includes the contact information of the
uploading node, and asks that the node which eventually accepts the lease
should contact the uploader directly. The shares are then transferred over
direct connections rather than through multiple Chord hops. Download uses
the same approach. This allows nodes to avoid maintaining a large number of
long-term connections, at the expense of complexity and latency.

.. [#naming] all of these names are derived from the location where they were
   concocted, in this case in a car ride from Boulder to DEN. To be
   precise, "Tahoe 1" was an unworkable scheme in which everyone who holds
   shares for a given file would form a sort of cabal which kept track of
   all the others, "Tahoe 2" is the first-100-nodes in the permuted hash
   described in this document, and "Tahoe 3" (or perhaps "Potrero hill 1")
   was the abandoned ring-with-many-hands approach.


Swarming Download, Trickling Upload
===================================

Because the shares being downloaded are distributed across a large number of
nodes, the download process will pull from many of them at the same time. The
current encoding parameters require 3 shares to be retrieved for each
segment, which means that up to 3 nodes will be used simultaneously. For
larger networks, 8-of-22 encoding could be used, meaning 8 nodes can be used
simultaneously. This allows the download process to use the sum of the
available nodes' upload bandwidths, resulting in downloads that take full
advantage of the common 8x disparity between download and upload bandwidth on
modern ADSL lines.

On the other hand, uploads are hampered by the need to upload encoded shares
that are larger than the original data (3.3x larger with the current default
encoding parameters), through the slow end of the asymmetric connection. This
means that on a typical 8x ADSL line, uploading a file will take about 26
times longer (the 3.3x expansion multiplied by the 8x bandwidth asymmetry)
than downloading it again later.

Smaller expansion ratios can reduce this upload penalty, at the expense of
reliability (see `Reliability`_, below). By using an "upload helper", this
penalty is eliminated: the client does a 1x upload of encrypted data to the
helper, then the helper performs encoding and pushes the shares to the
storage servers. This is an improvement if the helper has significantly
higher upload bandwidth than the client, so it makes the most sense for a
commercially-run grid for which all of the storage servers are in a colo
facility with high interconnect bandwidth. In this case, the helper is placed
in the same facility, so the helper-to-storage-server bandwidth is huge.

See :doc:`helper` for details about the upload helper.


The File Store Layer
====================

The "file store" layer is responsible for mapping human-meaningful pathnames
(directories and filenames) to pieces of data. The actual bytes inside these
files are referenced by capability, but the file store layer is where the
directory names, file names, and metadata are kept.

The file store layer is a graph of directories. Each directory contains a
table of named children. These children are either other directories or
files. All children are referenced by their capability.

A directory has two forms of capability: read-write caps and read-only caps.
The table of children inside the directory has a read-write and read-only
capability for each child. If you have a read-only capability for a given
directory, you will not be able to access the read-write capability of its
children. This results in "transitively read-only" directory access.

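Conceptually, a directory's child table can be pictured as a mapping from
child name to a pair of capabilities plus the per-edge metadata. This is an
illustration of the model, not the actual dirnode storage format, and the cap
strings are elided placeholders::

    from typing import Optional, TypedDict

    class Child(TypedDict):
        rw_cap: Optional[str]   # hidden from holders of the read-only dircap
        ro_cap: str
        metadata: dict          # per-edge metadata, e.g. timestamps

    # A read-write view of one directory. Someone holding only the read-only
    # cap for this directory never sees the rw_cap column, which is what
    # makes access "transitively read-only".
    children: dict[str, Child] = {
        "notes.txt": {
            "rw_cap": None,             # immutable file: no write cap exists
            "ro_cap": "URI:CHK:...",
            "metadata": {"mtime": 1700000000},
        },
        "projects": {
            "rw_cap": "URI:DIR2:...",
            "ro_cap": "URI:DIR2-RO:...",
            "metadata": {},
        },
    }
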
By having two different capabilities, you can choose which you want to share
with someone else. If you create a new directory and share the read-write
capability for it with a friend, then you will both be able to modify its
contents. If instead you give them the read-only capability, then they will
*not* be able to modify the contents. Any capability that you receive can be
linked in to any directory that you can modify, so very powerful
shared+published directory structures can be built from these components.

This structure enables individual users to have their own personal space, with
links to spaces that are shared with specific other users, and other spaces
that are globally visible.


Leases, Refreshing, Garbage Collection
======================================

When a file or directory in the file store is no longer referenced, the space
that its shares occupied on each storage server can be freed, making room for
other shares. Tahoe-LAFS uses a garbage collection ("GC") mechanism to
implement this space-reclamation process. Each share has one or more
"leases", which are managed by clients who want the file/directory to be
retained. The storage server accepts each share for a pre-defined period of
time, and is allowed to delete the share if all of the leases are cancelled
or allowed to expire.

Garbage collection is not enabled by default: storage servers will not delete
shares without being explicitly configured to do so. When GC is enabled,
clients are responsible for renewing their leases on a periodic basis at
least frequently enough to prevent any of the leases from expiring before the
next renewal pass.

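As a sketch, a storage server that expires leases by age might carry something
like the following in its ``tahoe.cfg``; treat the option names and values as
illustrative and consult :doc:`garbage-collection` for the authoritative
reference::

    [storage]
    # Enable lease expiration (garbage collection) on this server.
    expire.enabled = true
    # Expire leases by age rather than by a fixed cutoff date.
    expire.mode = age
    # Delete shares whose newest lease is older than this, unless renewed.
    expire.override_lease_duration = 33 days
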
See :doc:`garbage-collection` for further information, and for how to
configure garbage collection.

File Repairer
=============

Shares may go away because the storage server hosting them has suffered a
failure: either temporary downtime (affecting availability of the file), or a
permanent data loss (affecting the preservation of the file). Hard drives
crash, power supplies explode, coffee spills, and asteroids strike. The goal
of a robust distributed file store is to survive these setbacks.

To work against this slow, continual loss of shares, a File Checker is used
to periodically count the number of shares still available for any given
file. A more extensive form of checking known as the File Verifier can
download the ciphertext of the target file and perform integrity checks
(using strong hashes) to make sure the data is still intact. When the file is
found to have decayed below some threshold, the File Repairer can be used to
regenerate and re-upload the missing shares. These processes are conceptually
distinct (the repairer is only run if the checker/verifier decides it is
necessary), but in practice they will be closely related, and may run in the
same process.

The repairer process does not get the full capability of the file to be
maintained: it merely gets the "repairer capability" subset, which does not
include the decryption key. The File Verifier uses that data to find out
which nodes ought to hold shares for this file, and to see if those nodes are
still around and willing to provide the data. If the file is not healthy
enough, the File Repairer is invoked to download the ciphertext, regenerate
any missing shares, and upload them to new nodes. The goal of the File
Repairer is to finish up with a full set of ``N`` shares.

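Checking and repair can also be driven by hand through the CLI. The commands
below are indicative invocations (see the ``tahoe check`` and
``tahoe deep-check`` documentation for the exact flags)::

    # Lightweight check: ask servers which shares they are holding.
    tahoe check tahoe:myfile.txt

    # Download and verify the shares, and repair the file if it is unhealthy.
    tahoe check --verify --repair tahoe:myfile.txt

    # Recursively check (and repair) everything reachable from an alias.
    tahoe deep-check --repair --add-lease tahoe:
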
There are a number of engineering issues to be resolved here. The bandwidth,
disk IO, and CPU time consumed by the verification/repair process must be
balanced against the robustness that it provides to the grid. The nodes
involved in repair will have very different access patterns than normal
nodes, such that these processes may need to be run on hosts with more memory
or network connectivity than usual. The frequency of repair will directly
affect the resources consumed. In some cases, verification of multiple files
can be performed at the same time, and repair of files can be delegated off
to other nodes.

*future work*

Currently there are two modes of checking on the health of your file:
"Checker" simply asks storage servers which shares they have and does
nothing to try to verify that they aren't lying. "Verifier" downloads and
cryptographically verifies every bit of every share of the file from every
server, which costs a lot of network and CPU. A future improvement would be
to make a random-sampling verifier which downloads and cryptographically
verifies only a few randomly-chosen blocks from each server. This would
require much less network and CPU while still making it extremely unlikely
that any sort of corruption -- even malicious corruption intended to evade
detection -- would go unnoticed. This would be an instance of a
cryptographic notion called "Proof of Retrievability". Note that to implement
this requires no change to the server or to the cryptographic data structure
-- with the current data structure and the current protocol it is up to the
client which blocks they choose to download, so this would be solely a change
in client behavior.


Security
========

The design goal for this project is that an attacker may be able to deny
service (i.e. prevent you from recovering a file that was uploaded earlier)
but can accomplish none of the following three attacks:

1) violate confidentiality: the attacker gets to view data to which you have
   not granted them access
2) violate integrity: the attacker convinces you that the wrong data is
   actually the data you were intending to retrieve
3) violate unforgeability: the attacker gets to modify a mutable file or
   directory (either the pathnames or the file contents) to which you have
   not given them write permission

Integrity (the promise that the downloaded data will match the uploaded data)
is provided by the hashes embedded in the capability (for immutable files) or
the digital signature (for mutable files). Confidentiality (the promise that
the data is only readable by people with the capability) is provided by the
encryption key embedded in the capability (for both immutable and mutable
files). Data availability (the hope that data which has been uploaded in the
past will be downloadable in the future) is provided by the grid, which
distributes failures in a way that reduces the correlation between individual
node failure and overall file recovery failure, and by the erasure-coding
technique used to generate shares.

Many of these security properties depend upon the usual cryptographic
assumptions: the resistance of AES and RSA to attack, the resistance of
SHA-256 to collision attacks and pre-image attacks, and upon the proximity of
2^-128 and 2^-256 to zero. A break in AES would allow a confidentiality
violation, a collision break in SHA-256 would allow a consistency violation,
and a break in RSA would allow a mutability violation.

There is no attempt made to provide anonymity, either for the origin of a
piece of data or for the identity of the subsequent downloaders. In general,
anyone who already knows the contents of a file will be in a strong position
to determine who else is uploading or downloading it. Also, it is quite easy
for a sufficiently large coalition of nodes to correlate the set of nodes who
are all uploading or downloading the same file, even if the attacker does not
know the contents of the file in question.

Also note that the file size and (when convergence is being used) a keyed
hash of the plaintext are not protected. Many people can determine the size
of the file you are accessing, and if they already know the contents of a
given file, they will be able to determine that you are uploading or
downloading the same one.

The capability-based security model is used throughout this project.
Directory operations are expressed in terms of distinct read- and write-
capabilities. Knowing the read-capability of a file is equivalent to the
ability to read the corresponding data. The capability to validate the
correctness of a file is strictly weaker than the read-capability (possession
of read-capability automatically grants you possession of
validate-capability, but not vice versa). These capabilities may be expressly
delegated (irrevocably) by simply transferring the relevant secrets.

The application layer can provide whatever access model is desired, built on
top of this capability access model.


Reliability
===========

File encoding and peer-node selection parameters can be adjusted to achieve
different goals. Each choice results in a number of properties; there are
many tradeoffs.

First, some terms: the erasure-coding algorithm is described as ``k``-out-of-``N``
(for this release, the default values are ``k`` = 3 and ``N`` = 10). Each grid will
have some number of nodes; this number will rise and fall over time as nodes
join, drop out, come back, and leave forever. Files are of various sizes, some
are popular, others are unpopular. Nodes have various capacities, variable
upload/download bandwidths, and network latency. Most of the mathematical
models that look at node failure assume some average (and independent)
probability 'P' of a given node being available: this can be high (servers
tend to be online and available >90% of the time) or low (laptops tend to be
turned on for an hour then disappear for several days). Files are encoded in
segments of a given maximum size, which affects memory usage.

The ratio of ``N``/``k`` is the "expansion factor". Higher expansion factors
improve reliability very quickly (the binomial distribution curve is very sharp),
but consume much more grid capacity. When P=50%, the absolute value of ``k``
affects the granularity of the binomial curve (1-out-of-2 is much worse than
50-out-of-100), but high values asymptotically approach a constant (i.e.
500-of-1000 is not much better than 50-of-100). When P is high and the
expansion factor is held at a constant, higher values of ``k`` and ``N`` give
much better reliability (for P=99%, 50-out-of-100 is much much better than
5-of-10, roughly 10^50 times better), because there are more shares that can
be lost without losing the file.

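A back-of-the-envelope calculation makes this concrete. Assuming each server
is independently available with probability ``p``, the chance of losing a
file is the chance that fewer than ``k`` of its ``N`` shares remain
reachable; this is a model sketch, not a Tahoe-LAFS API::

    from math import comb

    def p_file_lost(k: int, n: int, p: float) -> float:
        # Probability that fewer than k of the n shares sit on available
        # servers, with each server independently up with probability p.
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

    # Same 2x expansion factor, very different reliability at p = 0.99:
    print(p_file_lost(5, 10, 0.99))     # roughly 2e-10
    print(p_file_lost(50, 100, 0.99))   # smaller by dozens of orders of magnitude
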
Likewise, the total number of nodes in the network affects the same
granularity: having only one node means a single point of failure, no matter
how many copies of the file you make. Independent nodes (with uncorrelated
failures) are necessary to hit the mathematical ideals: if you have 100 nodes
but they are all in the same office building, then a single power failure
will take out all of them at once. Pseudospoofing, also called a "Sybil Attack",
is where a single attacker convinces you that they are actually multiple
servers, so that you think you are using a large number of independent nodes,
but in fact you have a single point of failure (where the attacker turns off
all their machines at once). Large grids, with lots of truly independent nodes,
will enable the use of lower expansion factors to achieve the same reliability,
but will increase overhead because each node needs to know something about
every other, and the rate at which nodes come and go will be higher (requiring
network maintenance traffic). Also, the File Repairer work will increase with
larger grids, although then the job can be distributed out to more nodes.

Higher values of ``N`` increase overhead: more shares means more Merkle hashes
that must be included with the data, and more nodes to contact to retrieve
the shares. Smaller segment sizes reduce memory usage (since each segment
must be held in memory while erasure coding runs) and improve "alacrity"
(since downloading can validate a smaller piece of data faster, delivering it
to the target sooner), but also increase overhead (because more blocks means
more Merkle hashes to validate them).

In general, small private grids should work well, but the participants will
have to decide between storage overhead and reliability. Large stable grids
will be able to reduce the expansion factor down to a bare minimum while
still retaining high reliability, but large unstable grids (where nodes are
coming and going very quickly) may require more repair/verification bandwidth
than actual upload/download traffic.
---|