source: trunk/docs/architecture.rst

Last change on this file was a8646fe, checked in by Andreas Fischer <af@…>, at 2020-01-04T12:44:24Z

docs: Fix small typo (f -> of) in architecture.rst

  • Property mode set to 100644
File size: 29.6 KB
1.. -*- coding: utf-8-with-signature -*-
4Tahoe-LAFS Architecture
71.  `Overview`_
82.  `The Key-Value Store`_
93.  `File Encoding`_
104.  `Capabilities`_
115.  `Server Selection`_
126.  `Swarming Download, Trickling Upload`_
137.  `The File Store Layer`_
148.  `Leases, Refreshing, Garbage Collection`_
159.  `File Repairer`_
1610. `Security`_
1711. `Reliability`_
23(See the `docs/specifications directory`_ for more details.)
25There are three layers: the key-value store, the file store, and the
28The lowest layer is the key-value store. The keys are "capabilities" -- short
29ASCII strings -- and the values are sequences of data bytes. This data is
30encrypted and distributed across a number of nodes, such that it will survive
31the loss of most of the nodes. There are no hard limits on the size of the
32values, but there may be performance issues with extremely large values (just
33due to the limitation of network bandwidth). In practice, values as small as
34a few bytes and as large as tens of gigabytes are in common use.
36The middle layer is the decentralized file store: a directed graph in which
37the intermediate nodes are directories and the leaf nodes are files. The leaf
38nodes contain only the data -- they contain no metadata other than the length
39in bytes. The edges leading to leaf nodes have metadata attached to them
40about the file they point to. Therefore, the same file may be associated with
41different metadata if it is referred to through different edges.
43The top layer consists of the applications using the file store. used it for a backup service: the application periodically
45copies files from the local disk onto the decentralized file store. We later
46provide read-only access to those files, allowing users to recover them.
47There are several other applications built on top of the Tahoe-LAFS
48file store (see the RelatedProjects_ page of the wiki for a list).
50.. _docs/specifications directory:
51.. _RelatedProjects:
53The Key-Value Store
56The key-value store is implemented by a grid of Tahoe-LAFS storage servers --
57user-space processes. Tahoe-LAFS storage clients communicate with the storage
58servers over TCP.
60Storage servers hold data in the form of "shares". Shares are encoded pieces
61of files. There are a configurable number of shares for each file, 10 by
62default. Normally, each share is stored on a separate server, but in some
63cases a single server can hold multiple shares of a file.
65Nodes learn about each other through an "introducer". Each server connects to
66the introducer at startup and announces its presence. Each client connects to
67the introducer at startup, and receives a list of all servers from it. Each
68client then connects to every server, creating a "bi-clique" topology. In the
69current release, nodes behind NAT boxes will connect to all nodes that they
70can open connections to, but they cannot open connections to other nodes
71behind NAT boxes. Therefore, the more nodes behind NAT boxes, the less the
72topology resembles the intended bi-clique topology.
74The introducer is a Single Point of Failure ("SPoF"), in that clients who
75never connect to the introducer will be unable to connect to any storage
76servers, but once a client has been introduced to everybody, it does not need
77the introducer again until it is restarted. The danger of a SPoF is further
78reduced in two ways. First, the introducer is defined by a hostname and a
79private key, which are easy to move to a new host in case the original one
80suffers an unrecoverable hardware problem. Second, even if the private key is
81lost, clients can be reconfigured to use a new introducer.
83For future releases, we have plans to decentralize introduction, allowing any
84server to tell a new client about all the others.
87File Encoding
90When a client stores a file on the grid, it first encrypts the file. It then
91breaks the encrypted file into small segments, in order to reduce the memory
92footprint, and to decrease the lag between initiating a download and
93receiving the first part of the file; for example the lag between hitting
94"play" and a movie actually starting.
96The client then erasure-codes each segment, producing blocks of which only a
97subset are needed to reconstruct the segment (3 out of 10, with the default
100It sends one block from each segment to a given server. The set of blocks on
101a given server constitutes a "share". Therefore a subset of the shares (3 out
102of 10, by default) are needed to reconstruct the file.
104A hash of the encryption key is used to form the "storage index", which is
105used for both server selection (described below) and to index shares within
106the Storage Servers on the selected nodes.
108The client computes secure hashes of the ciphertext and of the shares. It
109uses `Merkle Trees`_ so that it is possible to verify the correctness of a
110subset of the data without requiring all of the data. For example, this
111allows you to verify the correctness of the first segment of a movie file and
112then begin playing the movie file in your movie viewer before the entire
113movie file has been downloaded.
115These hashes are stored in a small datastructure named the Capability
116Extension Block which is stored on the storage servers alongside each share.
118The capability contains the encryption key, the hash of the Capability
119Extension Block, and any encoding parameters necessary to perform the
120eventual decoding process. For convenience, it also contains the size of the
121file being stored.
123To download, the client that wishes to turn a capability into a sequence of
124bytes will obtain the blocks from storage servers, use erasure-decoding to
125turn them into segments of ciphertext, use the decryption key to convert that
126into plaintext, then emit the plaintext bytes to the output target.
128.. _`Merkle Trees`:
134Capabilities to immutable files represent a specific set of bytes. Think of
135it like a hash function: you feed in a bunch of bytes, and you get out a
136capability, which is deterministically derived from the input data: changing
137even one bit of the input data will result in a completely different
140Read-only capabilities to mutable files represent the ability to get a set of
141bytes representing some version of the file, most likely the latest version.
142Each read-only capability is unique. In fact, each mutable file has a unique
143public/private key pair created when the mutable file is created, and the
144read-only capability to that file includes a secure hash of the public key.
146Read-write capabilities to mutable files represent the ability to read the
147file (just like a read-only capability) and also to write a new version of
148the file, overwriting any extant version. Read-write capabilities are unique
149-- each one includes the secure hash of the private key associated with that
150mutable file.
152The capability provides both "location" and "identification": you can use it
153to retrieve a set of bytes, and then you can use it to validate ("identify")
154that these potential bytes are indeed the ones that you were looking for.
156The "key-value store" layer doesn't include human-meaningful names.
157Capabilities sit on the "global+secure" edge of `Zooko's Triangle`_. They are
158self-authenticating, meaning that nobody can trick you into accepting a file
159that doesn't match the capability you used to refer to that file. The
160file store layer (described below) adds human-meaningful names atop the
161key-value layer.
163.. _`Zooko's Triangle`:
166Server Selection
169When a file is uploaded, the encoded shares are sent to some servers. But to
170which ones? The "server selection" algorithm is used to make this choice.
172The storage index is used to consistently-permute the set of all servers nodes
173(by sorting them by ``HASH(storage_index+nodeid)``). Each file gets a different
174permutation, which (on average) will evenly distribute shares among the grid
175and avoid hotspots. Each server has announced its available space when it
176connected to the introducer, and we use that available space information to
177remove any servers that cannot hold an encoded share for our file. Then we ask
178some of the servers thus removed if they are already holding any encoded shares
179for our file; we use this information later. (We ask any servers which are in
180the first 2*``N`` elements of the permuted list.)
182We then use the permuted list of servers to ask each server, in turn, if it
183will hold a share for us (a share that was not reported as being already
184present when we talked to the full servers earlier, and that we have not
185already planned to upload to a different server). We plan to send a share to a
186server by sending an 'allocate_buckets() query' to the server with the number
187of that share. Some will say yes they can hold that share, others (those who
188have become full since they announced their available space) will say no; when
189a server refuses our request, we take that share to the next server on the
190list. In the response to allocate_buckets() the server will also inform us of
191any shares of that file that it already has. We keep going until we run out of
192shares that need to be stored. At the end of the process, we'll have a table
193that maps each share number to a server, and then we can begin the encode and
194push phase, using the table to decide where each share should be sent.
196Most of the time, this will result in one share per server, which gives us
197maximum reliability.  If there are fewer writable servers than there are
198unstored shares, we'll be forced to loop around, eventually giving multiple
199shares to a single server.
201If we have to loop through the node list a second time, we accelerate the query
202process, by asking each node to hold multiple shares on the second pass. In
203most cases, this means we'll never send more than two queries to any given
206If a server is unreachable, or has an error, or refuses to accept any of our
207shares, we remove it from the permuted list, so we won't query it again for
208this file. If a server already has shares for the file we're uploading, we add
209that information to the share-to-server table. This lets us do less work for
210files which have been uploaded once before, while making sure we still wind up
211with as many shares as we desire.
213Before a file upload is called successful, it has to pass an upload health
214check. For immutable files, we check to see that a condition called
215'servers-of-happiness' is satisfied. When satisfied, 'servers-of-happiness'
216assures us that enough pieces of the file are distributed across enough
217servers on the grid to ensure that the availability of the file will not be
218affected if a few of those servers later fail. For mutable files and
219directories, we check to see that all of the encoded shares generated during
220the upload process were successfully placed on the grid. This is a weaker
221check than 'servers-of-happiness'; it does not consider any information about
222how the encoded shares are placed on the grid, and cannot detect situations in
223which all or a majority of the encoded shares generated during the upload
224process reside on only one storage server. We hope to extend
225'servers-of-happiness' to mutable files in a future release of Tahoe-LAFS. If,
226at the end of the upload process, the appropriate upload health check fails,
227the upload is considered a failure.
229The current defaults use ``k`` = 3, ``servers_of_happiness`` = 7, and ``N`` = 10.
230``N`` = 10 means that we'll try to place 10 shares. ``k`` = 3 means that we need
231any three shares to recover the file. ``servers_of_happiness`` = 7 means that
232we'll consider an immutable file upload to be successful if we can place shares
233on enough servers that there are 7 different servers, the correct functioning
234of any ``k`` of which guarantee the availability of the immutable file.
236``N`` = 10 and ``k`` = 3 means there is a 3.3x expansion factor. On a small grid, you
237should set ``N`` about equal to the number of storage servers in your grid; on a
238large grid, you might set it to something smaller to avoid the overhead of
239contacting every server to place a file. In either case, you should then set ``k``
240such that ``N``/``k`` reflects your desired availability goals. The best value for
241``servers_of_happiness`` will depend on how you use Tahoe-LAFS. In a friendnet
242with a variable number of servers, it might make sense to set it to the smallest
243number of servers that you expect to have online and accepting shares at any
244given time. In a stable environment without much server churn, it may make
245sense to set ``servers_of_happiness`` = ``N``.
247When downloading a file, the current version just asks all known servers for
248any shares they might have. Once it has received enough responses that it
249knows where to find the needed k shares, it downloads at least the first
250segment from those servers. This means that it tends to download shares from
251the fastest servers. If some servers had more than one share, it will continue
252sending "Do You Have Block" requests to other servers, so that it can download
253subsequent segments from distinct servers (sorted by their DYHB round-trip
254times), if possible.
256  *future work*
258  A future release will use the server selection algorithm to reduce the
259  number of queries that must be sent out.
261  Other peer-node selection algorithms are possible. One earlier version
262  (known as "Tahoe 3") used the permutation to place the nodes around a large
263  ring, distributed the shares evenly around the same ring, then walked
264  clockwise from 0 with a basket. Each time it encountered a share, it put it
265  in the basket, each time it encountered a server, give it as many shares
266  from the basket as they'd accept. This reduced the number of queries
267  (usually to 1) for small grids (where ``N`` is larger than the number of
268  nodes), but resulted in extremely non-uniform share distribution, which
269  significantly hurt reliability (sometimes the permutation resulted in most
270  of the shares being dumped on a single node).
272  Another algorithm (known as "denver airport" [#naming]_) uses the permuted hash to
273  decide on an approximate target for each share, then sends lease requests
274  via Chord routing. The request includes the contact information of the
275  uploading node, and asks that the node which eventually accepts the lease
276  should contact the uploader directly. The shares are then transferred over
277  direct connections rather than through multiple Chord hops. Download uses
278  the same approach. This allows nodes to avoid maintaining a large number of
279  long-term connections, at the expense of complexity and latency.
281.. [#naming]  all of these names are derived from the location where they were
282        concocted, in this case in a car ride from Boulder to DEN. To be
283        precise, "Tahoe 1" was an unworkable scheme in which everyone who holds
284        shares for a given file would form a sort of cabal which kept track of
285        all the others, "Tahoe 2" is the first-100-nodes in the permuted hash
286        described in this document, and "Tahoe 3" (or perhaps "Potrero hill 1")
287        was the abandoned ring-with-many-hands approach.
290Swarming Download, Trickling Upload
293Because the shares being downloaded are distributed across a large number of
294nodes, the download process will pull from many of them at the same time. The
295current encoding parameters require 3 shares to be retrieved for each
296segment, which means that up to 3 nodes will be used simultaneously. For
297larger networks, 8-of-22 encoding could be used, meaning 8 nodes can be used
298simultaneously. This allows the download process to use the sum of the
299available nodes' upload bandwidths, resulting in downloads that take full
300advantage of the common 8x disparity between download and upload bandwith on
301modern ADSL lines.
303On the other hand, uploads are hampered by the need to upload encoded shares
304that are larger than the original data (3.3x larger with the current default
305encoding parameters), through the slow end of the asymmetric connection. This
306means that on a typical 8x ADSL line, uploading a file will take about 32
307times longer than downloading it again later.
309Smaller expansion ratios can reduce this upload penalty, at the expense of
310reliability (see `Reliability`_, below). By using an "upload helper", this
311penalty is eliminated: the client does a 1x upload of encrypted data to the
312helper, then the helper performs encoding and pushes the shares to the
313storage servers. This is an improvement if the helper has significantly
314higher upload bandwidth than the client, so it makes the most sense for a
315commercially-run grid for which all of the storage servers are in a colo
316facility with high interconnect bandwidth. In this case, the helper is placed
317in the same facility, so the helper-to-storage-server bandwidth is huge.
319See :doc:`helper` for details about the upload helper.
322The File Store Layer
325The "file store" layer is responsible for mapping human-meaningful pathnames
326(directories and filenames) to pieces of data. The actual bytes inside these
327files are referenced by capability, but the file store layer is where the
328directory names, file names, and metadata are kept.
330The file store layer is a graph of directories. Each directory contains a
331table of named children. These children are either other directories or
332files. All children are referenced by their capability.
334A directory has two forms of capability: read-write caps and read-only caps.
335The table of children inside the directory has a read-write and read-only
336capability for each child. If you have a read-only capability for a given
337directory, you will not be able to access the read-write capability of its
338children. This results in "transitively read-only" directory access.
340By having two different capabilities, you can choose which you want to share
341with someone else. If you create a new directory and share the read-write
342capability for it with a friend, then you will both be able to modify its
343contents. If instead you give them the read-only capability, then they will
344*not* be able to modify the contents. Any capability that you receive can be
345linked in to any directory that you can modify, so very powerful
346shared+published directory structures can be built from these components.
348This structure enable individual users to have their own personal space, with
349links to spaces that are shared with specific other users, and other spaces
350that are globally visible.
353Leases, Refreshing, Garbage Collection
356When a file or directory in the file store is no longer referenced, the space
357that its shares occupied on each storage server can be freed, making room for
358other shares. Tahoe-LAFS uses a garbage collection ("GC") mechanism to
359implement this space-reclamation process. Each share has one or more
360"leases", which are managed by clients who want the file/directory to be
361retained. The storage server accepts each share for a pre-defined period of
362time, and is allowed to delete the share if all of the leases are cancelled
363or allowed to expire.
365Garbage collection is not enabled by default: storage servers will not delete
366shares without being explicitly configured to do so. When GC is enabled,
367clients are responsible for renewing their leases on a periodic basis at
368least frequently enough to prevent any of the leases from expiring before the
369next renewal pass.
371See :doc:`garbage-collection` for further information, and for how to
372configure garbage collection.
374File Repairer
377Shares may go away because the storage server hosting them has suffered a
378failure: either temporary downtime (affecting availability of the file), or a
379permanent data loss (affecting the preservation of the file). Hard drives
380crash, power supplies explode, coffee spills, and asteroids strike. The goal
381of a robust distributed file store is to survive these setbacks.
383To work against this slow, continual loss of shares, a File Checker is used
384to periodically count the number of shares still available for any given
385file. A more extensive form of checking known as the File Verifier can
386download the ciphertext of the target file and perform integrity checks
387(using strong hashes) to make sure the data is still intact. When the file is
388found to have decayed below some threshold, the File Repairer can be used to
389regenerate and re-upload the missing shares. These processes are conceptually
390distinct (the repairer is only run if the checker/verifier decides it is
391necessary), but in practice they will be closely related, and may run in the
392same process.
394The repairer process does not get the full capability of the file to be
395maintained: it merely gets the "repairer capability" subset, which does not
396include the decryption key. The File Verifier uses that data to find out
397which nodes ought to hold shares for this file, and to see if those nodes are
398still around and willing to provide the data. If the file is not healthy
399enough, the File Repairer is invoked to download the ciphertext, regenerate
400any missing shares, and upload them to new nodes. The goal of the File
401Repairer is to finish up with a full set of ``N`` shares.
403There are a number of engineering issues to be resolved here. The bandwidth,
404disk IO, and CPU time consumed by the verification/repair process must be
405balanced against the robustness that it provides to the grid. The nodes
406involved in repair will have very different access patterns than normal
407nodes, such that these processes may need to be run on hosts with more memory
408or network connectivity than usual. The frequency of repair will directly
409affect the resources consumed. In some cases, verification of multiple files
410can be performed at the same time, and repair of files can be delegated off
411to other nodes.
413  *future work*
415  Currently there are two modes of checking on the health of your file:
416  "Checker" simply asks storage servers which shares they have and does
417  nothing to try to verify that they aren't lying. "Verifier" downloads and
418  cryptographically verifies every bit of every share of the file from every
419  server, which costs a lot of network and CPU. A future improvement would be
420  to make a random-sampling verifier which downloads and cryptographically
421  verifies only a few randomly-chosen blocks from each server. This would
422  require much less network and CPU but it could make it extremely unlikely
423  that any sort of corruption -- even malicious corruption intended to evade
424  detection -- would evade detection. This would be an instance of a
425  cryptographic notion called "Proof of Retrievability". Note that to implement
426  this requires no change to the server or to the cryptographic data structure
427  -- with the current data structure and the current protocol it is up to the
428  client which blocks they choose to download, so this would be solely a change
429  in client behavior.
435The design goal for this project is that an attacker may be able to deny
436service (i.e. prevent you from recovering a file that was uploaded earlier)
437but can accomplish none of the following three attacks:
4391) violate confidentiality: the attacker gets to view data to which you have
440   not granted them access
4412) violate integrity: the attacker convinces you that the wrong data is
442   actually the data you were intending to retrieve
4433) violate unforgeability: the attacker gets to modify a mutable file or
444   directory (either the pathnames or the file contents) to which you have
445   not given them write permission
447Integrity (the promise that the downloaded data will match the uploaded data)
448is provided by the hashes embedded in the capability (for immutable files) or
449the digital signature (for mutable files). Confidentiality (the promise that
450the data is only readable by people with the capability) is provided by the
451encryption key embedded in the capability (for both immutable and mutable
452files). Data availability (the hope that data which has been uploaded in the
453past will be downloadable in the future) is provided by the grid, which
454distributes failures in a way that reduces the correlation between individual
455node failure and overall file recovery failure, and by the erasure-coding
456technique used to generate shares.
458Many of these security properties depend upon the usual cryptographic
459assumptions: the resistance of AES and RSA to attack, the resistance of
460SHA-256 to collision attacks and pre-image attacks, and upon the proximity of
4612^-128 and 2^-256 to zero. A break in AES would allow a confidentiality
462violation, a collision break in SHA-256 would allow a consistency violation,
463and a break in RSA would allow a mutability violation.
465There is no attempt made to provide anonymity, neither of the origin of a
466piece of data nor the identity of the subsequent downloaders. In general,
467anyone who already knows the contents of a file will be in a strong position
468to determine who else is uploading or downloading it. Also, it is quite easy
469for a sufficiently large coalition of nodes to correlate the set of nodes who
470are all uploading or downloading the same file, even if the attacker does not
471know the contents of the file in question.
473Also note that the file size and (when convergence is being used) a keyed
474hash of the plaintext are not protected. Many people can determine the size
475of the file you are accessing, and if they already know the contents of a
476given file, they will be able to determine that you are uploading or
477downloading the same one.
479The capability-based security model is used throughout this project.
480Directory operations are expressed in terms of distinct read- and write-
481capabilities. Knowing the read-capability of a file is equivalent to the
482ability to read the corresponding data. The capability to validate the
483correctness of a file is strictly weaker than the read-capability (possession
484of read-capability automatically grants you possession of
485validate-capability, but not vice versa). These capabilities may be expressly
486delegated (irrevocably) by simply transferring the relevant secrets.
488The application layer can provide whatever access model is desired, built on
489top of this capability access model.
495File encoding and peer-node selection parameters can be adjusted to achieve
496different goals. Each choice results in a number of properties; there are
497many tradeoffs.
499First, some terms: the erasure-coding algorithm is described as ``k``-out-of-``N``
500(for this release, the default values are ``k`` = 3 and ``N`` = 10). Each grid will
501have some number of nodes; this number will rise and fall over time as nodes
502join, drop out, come back, and leave forever. Files are of various sizes, some
503are popular, others are unpopular. Nodes have various capacities, variable
504upload/download bandwidths, and network latency. Most of the mathematical
505models that look at node failure assume some average (and independent)
506probability 'P' of a given node being available: this can be high (servers
507tend to be online and available >90% of the time) or low (laptops tend to be
508turned on for an hour then disappear for several days). Files are encoded in
509segments of a given maximum size, which affects memory usage.
511The ratio of ``N``/``k`` is the "expansion factor". Higher expansion factors
512improve reliability very quickly (the binomial distribution curve is very sharp),
513but consumes much more grid capacity. When P=50%, the absolute value of ``k``
514affects the granularity of the binomial curve (1-out-of-2 is much worse than
51550-out-of-100), but high values asymptotically approach a constant (i.e.
516500-of-1000 is not much better than 50-of-100). When P is high and the
517expansion factor is held at a constant, higher values of ``k`` and ``N`` give
518much better reliability (for P=99%, 50-out-of-100 is much much better than
5195-of-10, roughly 10^50 times better), because there are more shares that can
520be lost without losing the file.
522Likewise, the total number of nodes in the network affects the same
523granularity: having only one node means a single point of failure, no matter
524how many copies of the file you make. Independent nodes (with uncorrelated
525failures) are necessary to hit the mathematical ideals: if you have 100 nodes
526but they are all in the same office building, then a single power failure
527will take out all of them at once. Pseudospoofing, also called a "Sybil Attack",
528is where a single attacker convinces you that they are actually multiple
529servers, so that you think you are using a large number of independent nodes,
530but in fact you have a single point of failure (where the attacker turns off
531all their machines at once). Large grids, with lots of truly independent nodes,
532will enable the use of lower expansion factors to achieve the same reliability,
533but will increase overhead because each node needs to know something about
534every other, and the rate at which nodes come and go will be higher (requiring
535network maintenance traffic). Also, the File Repairer work will increase with
536larger grids, although then the job can be distributed out to more nodes.
538Higher values of ``N`` increase overhead: more shares means more Merkle hashes
539that must be included with the data, and more nodes to contact to retrieve
540the shares. Smaller segment sizes reduce memory usage (since each segment
541must be held in memory while erasure coding runs) and improves "alacrity"
542(since downloading can validate a smaller piece of data faster, delivering it
543to the target sooner), but also increase overhead (because more blocks means
544more Merkle hashes to validate them).
546In general, small private grids should work well, but the participants will
547have to decide between storage overhead and reliability. Large stable grids
548will be able to reduce the expansion factor down to a bare minimum while
549still retaining high reliability, but large unstable grids (where nodes are
550coming and going very quickly) may require more repair/verification bandwidth
551than actual upload/download traffic.
Note: See TracBrowser for help on using the repository browser.