architecture.rst @ 0977e52

Last change on this file since 0977e52 was 2055a66, checked in by Brian Warner <warner@…>, at 2017-06-06T10:20:49Z

Doc changes that require more careful review. refs #2345

Signed-off-by: Daira Hopwood <daira@…>

1	.. -*- coding: utf-8-with-signature -*-
2
3	=======================
4	Tahoe-LAFS Architecture
5	=======================
6
7	1. `Overview`_
8	2. `The Key-Value Store`_
9	3. `File Encoding`_
10	4. `Capabilities`_
11	5. `Server Selection`_
12	6. `Swarming Download, Trickling Upload`_
13	7. `The File Store Layer`_
14	8. `Leases, Refreshing, Garbage Collection`_
15	9. `File Repairer`_
16	10. `Security`_
17	11. `Reliability`_
18
19
20	Overview
21	========
22
23	(See the `docs/specifications directory`_ for more details.)
24
25	There are three layers: the key-value store, the file store, and the
26	application.
27
28	The lowest layer is the key-value store. The keys are "capabilities" -- short
29	ASCII strings -- and the values are sequences of data bytes. This data is
30	encrypted and distributed across a number of nodes, such that it will survive
31	the loss of most of the nodes. There are no hard limits on the size of the
32	values, but there may be performance issues with extremely large values (just
33	due to the limitation of network bandwidth). In practice, values as small as
34	a few bytes and as large as tens of gigabytes are in common use.
35
36	The middle layer is the decentralized file store: a directed graph in which
37	the intermediate nodes are directories and the leaf nodes are files. The leaf
38	nodes contain only the data -- they contain no metadata other than the length
39	in bytes. The edges leading to leaf nodes have metadata attached to them
40	about the file they point to. Therefore, the same file may be associated with
41	different metadata if it is referred to through different edges.
42
43	The top layer consists of the applications using the file store.
44	Allmydata.com used it for a backup service: the application periodically
45	copies files from the local disk onto the decentralized file store. We later
46	provide read-only access to those files, allowing users to recover them.
47	There are several other applications built on top of the Tahoe-LAFS
48	file store (see the RelatedProjects_ page of the wiki for a list).
49
50	.. _docs/specifications directory: https://github.com/tahoe-lafs/tahoe-lafs/tree/master/docs/specifications
51	.. _RelatedProjects: https://tahoe-lafs.org/trac/tahoe-lafs/wiki/RelatedProjects
52
53	The Key-Value Store
54	===================
55
56	The key-value store is implemented by a grid of Tahoe-LAFS storage servers --
57	user-space processes. Tahoe-LAFS storage clients communicate with the storage
58	servers over TCP.
59
60	Storage servers hold data in the form of "shares". Shares are encoded pieces
61	of files. There are a configurable number of shares for each file, 10 by
62	default. Normally, each share is stored on a separate server, but in some
63	cases a single server can hold multiple shares of a file.
64
65	Nodes learn about each other through an "introducer". Each server connects to
66	the introducer at startup and announces its presence. Each client connects to
67	the introducer at startup, and receives a list of all servers from it. Each
68	client then connects to every server, creating a "bi-clique" topology. In the
69	current release, nodes behind NAT boxes will connect to all nodes that they
70	can open connections to, but they cannot open connections to other nodes
71	behind NAT boxes. Therefore, the more nodes behind NAT boxes, the less the
72	topology resembles the intended bi-clique topology.
73
74	The introducer is a Single Point of Failure ("SPoF"), in that clients who
75	never connect to the introducer will be unable to connect to any storage
76	servers, but once a client has been introduced to everybody, it does not need
77	the introducer again until it is restarted. The danger of a SPoF is further
78	reduced in two ways. First, the introducer is defined by a hostname and a
79	private key, which are easy to move to a new host in case the original one
80	suffers an unrecoverable hardware problem. Second, even if the private key is
81	lost, clients can be reconfigured to use a new introducer.
82
83	For future releases, we have plans to decentralize introduction, allowing any
84	server to tell a new client about all the others.
85
86
87	File Encoding
88	=============
89
90	When a client stores a file on the grid, it first encrypts the file. It then
91	breaks the encrypted file into small segments, in order to reduce the memory
92	footprint, and to decrease the lag between initiating a download and
93	receiving the first part of the file; for example the lag between hitting
94	"play" and a movie actually starting.
95
96	The client then erasure-codes each segment, producing blocks of which only a
97	subset are needed to reconstruct the segment (3 out of 10, with the default
98	settings).
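
As a rough illustration of the "any ``k`` out of ``N``" property, the sketch
below implements a toy erasure code over the small prime field GF(257). It is
not the encoder Tahoe-LAFS actually uses; the function names ``encode`` and
``decode``, the field, and the use of integer symbols rather than bytes are
all assumptions made for this example.

```python
P = 257  # small prime modulus; symbols are integers in range(P)


def encode(data, k, n):
    """Treat k data symbols as polynomial coefficients.

    Block number i (counting from 0) is the polynomial evaluated at x = i + 1.
    """
    assert len(data) == k
    return [
        sum(c * pow(x, j, P) for j, c in enumerate(data)) % P
        for x in range(1, n + 1)
    ]


def decode(points, k):
    """Recover the k coefficients from any k distinct (x, y) points
    using Lagrange interpolation modulo P."""
    points = points[:k]
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(points):
        basis = [1]  # coefficients of the basis polynomial, lowest degree first
        denom = 1
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            # Multiply the basis polynomial by (x - xj).
            grown = [0] * (len(basis) + 1)
            for d, b in enumerate(basis):
                grown[d] = (grown[d] - xj * b) % P
                grown[d + 1] = (grown[d + 1] + b) % P
            basis = grown
            denom = denom * (xi - xj) % P
        # Divide by the denominator via Fermat's little theorem.
        scale = yi * pow(denom, P - 2, P) % P
        for d, b in enumerate(basis):
            coeffs[d] = (coeffs[d] + scale * b) % P
    return coeffs
```

With ``k`` = 3 and ``N`` = 10, ``encode([72, 105, 33], 3, 10)`` yields ten
blocks, and passing any three of them (paired with their x values) to
``decode`` returns the original three symbols.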
99
100	It sends one block from each segment to a given server. The set of blocks on
101	a given server constitutes a "share". Therefore a subset of the shares (3 out
102	of 10, by default) are needed to reconstruct the file.
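
The relationship between segments, blocks, and shares can be sketched as a
simple transposition. This is only an illustration: ``build_shares`` and the
``encode_segment`` callback are hypothetical names, and the callback stands in
for whatever erasure coder produces ``N`` blocks per segment.

```python
def build_shares(segments, encode_segment, n):
    """Group blocks into shares: share i holds block i of every segment.

    encode_segment(segment) must return a list of n blocks.
    """
    shares = [[] for _ in range(n)]
    for segment in segments:
        blocks = encode_segment(segment)
        assert len(blocks) == n
        for share_number, block in enumerate(blocks):
            shares[share_number].append(block)
    return shares
```

Because every share carries exactly one block per segment, a downloader that
holds enough shares holds enough blocks to rebuild each segment in turn.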
103
104	A hash of the encryption key is used to form the "storage index", which is
105	used for both server selection (described below) and to index shares within
106	the Storage Servers on the selected nodes.
107
108	The client computes secure hashes of the ciphertext and of the shares. It
109	uses `Merkle Trees`_ so that it is possible to verify the correctness of a
110	subset of the data without requiring all of the data. For example, this
111	allows you to verify the correctness of the first segment of a movie file and
112	then begin playing the movie file in your movie viewer before the entire
113	movie file has been downloaded.
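
The idea of verifying one segment without the rest of the data can be sketched
with a small Merkle tree. This is a generic illustration using SHA-256 from
the Python standard library; the tag strings, the padding rule for odd levels,
and the function names are assumptions of this example and do not describe the
hash-tree format Tahoe-LAFS stores on disk.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    level = [_h(b"leaf:" + leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # pad odd levels by repeating the last hash
        level = [_h(b"node:" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def proof_for(levels, index):
    """Collect the sibling hashes needed to check leaf number `index`."""
    path = []
    for level in levels[:-1]:
        sibling_index = index ^ 1
        if sibling_index < len(level):
            sibling = level[sibling_index]
        else:
            sibling = level[index]  # the padded duplicate
        path.append((sibling, index % 2 == 0))  # True: sibling is on the right
        index //= 2
    return path


def verify(leaf, path, root):
    """Recompute the root from one leaf plus its sibling path."""
    node = _h(b"leaf:" + leaf)
    for sibling, sibling_on_right in path:
        pair = node + sibling if sibling_on_right else sibling + node
        node = _h(b"node:" + pair)
    return node == root
```

A reader who trusts only the root hash can check the first segment with
``verify(segment, proof_for(levels, 0), root)``; the proof contains about
log2(number of segments) hashes rather than the whole file.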
114
115	These hashes are stored in a small data structure named the Capability
116	Extension Block which is stored on the storage servers alongside each share.
117
118	The capability contains the encryption key, the hash of the Capability
119	Extension Block, and any encoding parameters necessary to perform the
120	eventual decoding process. For convenience, it also contains the size of the
121	file being stored.
122
123	To download, the client that wishes to turn a capability into a sequence of
124	bytes will obtain the blocks from storage servers, use erasure-decoding to
125	turn them into segments of ciphertext, use the decryption key to convert that
126	into plaintext, then emit the plaintext bytes to the output target.
127
128	.. _`Merkle Trees`: http://systems.cs.colorado.edu/grunwald/Classes/Fall2003-InformationStorage/Papers/merkle-tree.pdf
129
130
131	Capabilities
132	============
133
134	Capabilities to immutable files represent a specific set of bytes. Think of
135	it like a hash function: you feed in a bunch of bytes, and you get out a
136	capability, which is deterministically derived from the input data: changing
137	even one bit of the input data will result in a completely different
138	capability.
139
140	Read-only capabilities to mutable files represent the ability to get a set of
141	bytes representing some version of the file, most likely the latest version.
142	Each read-only capability is unique. In fact, each mutable file has a unique
143	public/private key pair created when the mutable file is created, and the
144	read-only capability to that file includes a secure hash of the public key.
145
146	Read-write capabilities to mutable files represent the ability to read the
147	file (just like a read-only capability) and also to write a new version of
148	the file, overwriting any extant version. Read-write capabilities are unique
149	-- each one includes the secure hash of the private key associated with that
150	mutable file.
151
152	The capability provides both "location" and "identification": you can use it
153	to retrieve a set of bytes, and then you can use it to validate ("identify")
154	that these potential bytes are indeed the ones that you were looking for.
155
156	The "key-value store" layer doesn't include human-meaningful names.
157	Capabilities sit on the "global+secure" edge of `Zooko's Triangle`_. They are
158	self-authenticating, meaning that nobody can trick you into accepting a file
159	that doesn't match the capability you used to refer to that file. The
160	file store layer (described below) adds human-meaningful names atop the
161	key-value layer.
162
163	.. _`Zooko's Triangle`: https://en.wikipedia.org/wiki/Zooko%27s_triangle
164
165
166	Server Selection
167	================
168
169	When a file is uploaded, the encoded shares are sent to some servers. But to
170	which ones? The "server selection" algorithm is used to make this choice.
171
172	The storage index is used to consistently-permute the set of all server nodes
173	(by sorting them by ``HASH(storage_index+nodeid)``). Each file gets a different
174	permutation, which (on average) will evenly distribute shares among the grid
175	and avoid hotspots. Each server has announced its available space when it
176	connected to the introducer, and we use that available space information to
177	remove any servers that cannot hold an encoded share for our file. Then we ask
178	some of the servers thus removed if they are already holding any encoded shares
179	for our file; we use this information later. (We ask any servers which are in
180	the first 2*``N`` elements of the permuted list.)
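
The permutation step can be sketched in a few lines. The use of SHA-256 and
plain byte concatenation here is an assumption made for illustration; the text
above only specifies ``HASH(storage_index+nodeid)``, and ``permuted_servers``
is a hypothetical helper name.

```python
import hashlib


def permuted_servers(storage_index, node_ids):
    """Order servers by the hash of (storage_index + nodeid).

    Both arguments are bytes. The result depends only on the storage index
    and the set of node ids, not on the order in which they were supplied.
    """
    return sorted(
        node_ids,
        key=lambda node_id: hashlib.sha256(storage_index + node_id).digest(),
    )
```

Every client that knows the same set of servers computes the same list for a
given file, which is what lets a downloader guess where an uploader placed the
shares.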
181
182	We then use the permuted list of servers to ask each server, in turn, if it
183	will hold a share for us (a share that was not reported as being already
184	present when we talked to the full servers earlier, and that we have not
185	already planned to upload to a different server). We plan to send a share to a
186	server by sending an ``allocate_buckets()`` query to the server with the number
187	of that share. Some will say yes, they can hold that share; others (those who
188	have become full since they announced their available space) will say no; when
189	a server refuses our request, we take that share to the next server on the
190	list. In the response to ``allocate_buckets()`` the server will also inform us of
191	any shares of that file that it already has. We keep going until we run out of
192	shares that need to be stored. At the end of the process, we'll have a table
193	that maps each share number to a server, and then we can begin the encode and
194	push phase, using the table to decide where each share should be sent.
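
A simplified sketch of this placement loop follows. It is not the uploader's
real code: ``place_shares`` is a hypothetical name, the ``accepts`` callback
stands in for the ``allocate_buckets()`` round trip, and the sketch ignores
shares that servers already hold as well as the second-pass batching
optimization.

```python
from collections import deque


def place_shares(permuted, n, accepts):
    """Build a table mapping share number -> server.

    permuted: servers in permuted order.
    accepts(server, share_number) -> bool: stand-in for the server's answer.
    """
    queue = deque(permuted)
    placement = {}
    for share_number in range(n):
        while queue:
            server = queue.popleft()
            if accepts(server, share_number):
                placement[share_number] = server
                queue.append(server)  # still eligible if we loop around
                break
            # A refusing server is dropped and not asked again for this file.
        else:
            break  # no writable servers remain
    return placement
```

With four servers of which one is full and five shares to place, the loop
wraps around once, so two servers end up holding two shares each.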
195
196	Most of the time, this will result in one share per server, which gives us
197	maximum reliability. If there are fewer writable servers than there are
198	unstored shares, we'll be forced to loop around, eventually giving multiple
199	shares to a single server.
200
201	If we have to loop through the node list a second time, we accelerate the query
202	process, by asking each node to hold multiple shares on the second pass. In
203	most cases, this means we'll never send more than two queries to any given
204	node.
205
206	If a server is unreachable, or has an error, or refuses to accept any of our
207	shares, we remove it from the permuted list, so we won't query it again for
208	this file. If a server already has shares for the file we're uploading, we add
209	that information to the share-to-server table. This lets us do less work for
210	files which have been uploaded once before, while making sure we still wind up
211	with as many shares as we desire.
212
213	Before a file upload is called successful, it has to pass an upload health
214	check. For immutable files, we check to see that a condition called
215	'servers-of-happiness' is satisfied. When satisfied, 'servers-of-happiness'
216	assures us that enough pieces of the file are distributed across enough
217	servers on the grid to ensure that the availability of the file will not be
218	affected if a few of those servers later fail. For mutable files and
219	directories, we check to see that all of the encoded shares generated during
220	the upload process were successfully placed on the grid. This is a weaker
221	check than 'servers-of-happiness'; it does not consider any information about
222	how the encoded shares are placed on the grid, and cannot detect situations in
223	which all or a majority of the encoded shares generated during the upload
224	process reside on only one storage server. We hope to extend
225	'servers-of-happiness' to mutable files in a future release of Tahoe-LAFS. If,
226	at the end of the upload process, the appropriate upload health check fails,
227	the upload is considered a failure.
228
229	The current defaults use ``k`` = 3, ``servers_of_happiness`` = 7, and ``N`` = 10.
230	``N`` = 10 means that we'll try to place 10 shares. ``k`` = 3 means that we need
231	any three shares to recover the file. ``servers_of_happiness`` = 7 means that
232	we'll consider an immutable file upload to be successful if we can place shares
233	on enough servers that there are 7 different servers, the correct functioning
234	of any ``k`` of which guarantees the availability of the immutable file.
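
One way to make this condition concrete is to view servers and shares as a
bipartite graph and count the largest set of (server, share) pairs in which no
server and no share appears twice. The sketch below assumes that
interpretation; ``happiness`` is a hypothetical helper, not a function from
the Tahoe-LAFS code base.

```python
def happiness(shares_by_server):
    """Size of a maximum matching between servers and the shares they hold.

    shares_by_server maps a server name to the set of share numbers it holds.
    """
    matched = {}  # share number -> server currently matched to it

    def try_assign(server, seen):
        for share in sorted(shares_by_server[server]):
            if share in seen:
                continue
            seen.add(share)
            # Take a free share, or displace its holder if that holder
            # can be moved to a different share.
            if share not in matched or try_assign(matched[share], seen):
                matched[share] = server
                return True
        return False

    return sum(1 for server in shares_by_server if try_assign(server, set()))
```

Under this reading, an upload would pass the check when
``happiness(placement) >= servers_of_happiness``. Ten shares piled onto one
server score 1, while ten shares on ten distinct servers score 10.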
235
236	``N`` = 10 and ``k`` = 3 means there is a 3.3x expansion factor. On a small grid, you
237	should set ``N`` about equal to the number of storage servers in your grid; on a
238	large grid, you might set it to something smaller to avoid the overhead of
239	contacting every server to place a file. In either case, you should then set ``k``
240	such that ``N``/``k`` reflects your desired availability goals. The best value for
241	``servers_of_happiness`` will depend on how you use Tahoe-LAFS. In a friendnet
242	with a variable number of servers, it might make sense to set it to the smallest
243	number of servers that you expect to have online and accepting shares at any
244	given time. In a stable environment without much server churn, it may make
245	sense to set ``servers_of_happiness`` = ``N``.
246
247	When downloading a file, the current version just asks all known servers for
248	any shares they might have. Once it has received enough responses that it
249	knows where to find the needed k shares, it downloads at least the first
250	segment from those servers. This means that it tends to download shares from
251	the fastest servers. If some servers had more than one share, it will continue
252	sending "Do You Have Block" requests to other servers, so that it can download
253	subsequent segments from distinct servers (sorted by their DYHB round-trip
254	times), if possible.
255
256	future work
257
258	A future release will use the server selection algorithm to reduce the
259	number of queries that must be sent out.
260
261	Other peer-node selection algorithms are possible. One earlier version
262	(known as "Tahoe 3") used the permutation to place the nodes around a large
263	ring, distributed the shares evenly around the same ring, then walked
264	clockwise from 0 with a basket. Each time it encountered a share, it put it
265	in the basket; each time it encountered a server, it gave that server as many shares
266	from the basket as it would accept. This reduced the number of queries
267	(usually to 1) for small grids (where ``N`` is larger than the number of
268	nodes), but resulted in extremely non-uniform share distribution, which
269	significantly hurt reliability (sometimes the permutation resulted in most
270	of the shares being dumped on a single node).
271
272	Another algorithm (known as "denver airport" [#naming]_) uses the permuted hash to
273	decide on an approximate target for each share, then sends lease requests
274	via Chord routing. The request includes the contact information of the
275	uploading node, and asks that the node which eventually accepts the lease
276	should contact the uploader directly. The shares are then transferred over
277	direct connections rather than through multiple Chord hops. Download uses
278	the same approach. This allows nodes to avoid maintaining a large number of
279	long-term connections, at the expense of complexity and latency.
280
281	.. [#naming] all of these names are derived from the location where they were
282	concocted, in this case in a car ride from Boulder to DEN. To be
283	precise, "Tahoe 1" was an unworkable scheme in which everyone who holds
284	shares for a given file would form a sort of cabal which kept track of
285	all the others, "Tahoe 2" is the first-100-nodes in the permuted hash
286	described in this document, and "Tahoe 3" (or perhaps "Potrero hill 1")
287	was the abandoned ring-with-many-hands approach.
288
289
290	Swarming Download, Trickling Upload
291	===================================
292
293	Because the shares being downloaded are distributed across a large number of
294	nodes, the download process will pull from many of them at the same time. The
295	current encoding parameters require 3 shares to be retrieved for each
296	segment, which means that up to 3 nodes will be used simultaneously. For
297	larger networks, 8-of-22 encoding could be used, meaning 8 nodes can be used
298	simultaneously. This allows the download process to use the sum of the
299	available nodes' upload bandwidths, resulting in downloads that take full
300	advantage of the common 8x disparity between download and upload bandwidth on
301	modern ADSL lines.
302
303	On the other hand, uploads are hampered by the need to upload encoded shares
304	that are larger than the original data (3.3x larger with the current default
305	encoding parameters), through the slow end of the asymmetric connection. This
306	means that on a typical 8x ADSL line, uploading a file will take roughly 27
307	times longer (3.3x expansion times 8x asymmetry) than downloading it again later.
308
309	Smaller expansion ratios can reduce this upload penalty, at the expense of
310	reliability (see `Reliability`_, below). By using an "upload helper", this
311	penalty is eliminated: the client does a 1x upload of encrypted data to the
312	helper, then the helper performs encoding and pushes the shares to the
313	storage servers. This is an improvement if the helper has significantly
314	higher upload bandwidth than the client, so it makes the most sense for a
315	commercially-run grid for which all of the storage servers are in a colo
316	facility with high interconnect bandwidth. In this case, the helper is placed
317	in the same facility, so the helper-to-storage-server bandwidth is huge.
318
319	See :doc:`helper` for details about the upload helper.
320
321
322	The File Store Layer
323	====================
324
325	The "file store" layer is responsible for mapping human-meaningful pathnames
326	(directories and filenames) to pieces of data. The actual bytes inside these
327	files are referenced by capability, but the file store layer is where the
328	directory names, file names, and metadata are kept.
329
330	The file store layer is a graph of directories. Each directory contains a
331	table of named children. These children are either other directories or
332	files. All children are referenced by their capability.
333
334	A directory has two forms of capability: read-write caps and read-only caps.
335	The table of children inside the directory has a read-write and read-only
336	capability for each child. If you have a read-only capability for a given
337	directory, you will not be able to access the read-write capability of its
338	children. This results in "transitively read-only" directory access.
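
The "transitively read-only" behaviour can be pictured with a toy model. The
class names and capability strings below are invented for illustration, and
the sketch hides write capabilities with a simple flag; a real system has to
enforce this cryptographically rather than by convention.

```python
from dataclasses import dataclass, field


@dataclass
class Child:
    rw_cap: str  # read-write capability for the child
    ro_cap: str  # read-only capability for the child


@dataclass
class Directory:
    children: dict = field(default_factory=dict)  # name -> Child

    def list_children(self, have_write_cap):
        """Return name -> (rw_cap or None, ro_cap).

        A holder of only the directory's read-only capability never sees
        the children's read-write capabilities.
        """
        view = {}
        for name, child in self.children.items():
            rw = child.rw_cap if have_write_cap else None
            view[name] = (rw, child.ro_cap)
        return view
```

Because a read-only visitor only ever obtains read-only capabilities for
subdirectories, the restriction carries down through the whole subtree.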
339
340	By having two different capabilities, you can choose which you want to share
341	with someone else. If you create a new directory and share the read-write
342	capability for it with a friend, then you will both be able to modify its
343	contents. If instead you give them the read-only capability, then they will
344	not be able to modify the contents. Any capability that you receive can be
345	linked in to any directory that you can modify, so very powerful
346	shared+published directory structures can be built from these components.
347
348	This structure enables individual users to have their own personal space, with
349	links to spaces that are shared with specific other users, and other spaces
350	that are globally visible.
351
352
353	Leases, Refreshing, Garbage Collection
354	======================================
355
356	When a file or directory in the file store is no longer referenced, the space
357	that its shares occupied on each storage server can be freed, making room for
358	other shares. Tahoe-LAFS uses a garbage collection ("GC") mechanism to
359	implement this space-reclamation process. Each share has one or more
360	"leases", which are managed by clients who want the file/directory to be
361	retained. The storage server accepts each share for a pre-defined period of
362	time, and is allowed to delete the share if all of the leases are cancelled
363	or allowed to expire.
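
The lease rule can be sketched as follows. The ``Share`` class, its method
names, and the use of plain numeric timestamps are assumptions of this
example, not the storage server's actual data model.

```python
class Share:
    """Toy model of a share and the leases that keep it alive."""

    def __init__(self):
        self.leases = {}  # lease id -> expiration time (seconds)

    def add_or_renew_lease(self, lease_id, duration, now):
        # Renewing simply pushes the expiration time forward.
        self.leases[lease_id] = now + duration

    def cancel_lease(self, lease_id):
        self.leases.pop(lease_id, None)

    def is_garbage(self, now):
        # Deletable only when every lease is cancelled or expired.
        return all(expires <= now for expires in self.leases.values())
```

A share with two leases stays alive until the later one lapses, so one client
forgetting to renew does not endanger a file that another client still wants.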
364
365	Garbage collection is not enabled by default: storage servers will not delete
366	shares without being explicitly configured to do so. When GC is enabled,
367	clients are responsible for renewing their leases on a periodic basis at
368	least frequently enough to prevent any of the leases from expiring before the
369	next renewal pass.
370
371	See :doc:`garbage-collection` for further information, and for how to
372	configure garbage collection.
373
374	File Repairer
375	=============
376
377	Shares may go away because the storage server hosting them has suffered a
378	failure: either temporary downtime (affecting availability of the file), or a
379	permanent data loss (affecting the preservation of the file). Hard drives
380	crash, power supplies explode, coffee spills, and asteroids strike. The goal
381	of a robust distributed file store is to survive these setbacks.
382
383	To work against this slow, continual loss of shares, a File Checker is used
384	to periodically count the number of shares still available for any given
385	file. A more extensive form of checking known as the File Verifier can
386	download the ciphertext of the target file and perform integrity checks
387	(using strong hashes) to make sure the data is still intact. When the file is
388	found to have decayed below some threshold, the File Repairer can be used to
389	regenerate and re-upload the missing shares. These processes are conceptually
390	distinct (the repairer is only run if the checker/verifier decides it is
391	necessary), but in practice they will be closely related, and may run in the
392	same process.
393
394	The repairer process does not get the full capability of the file to be
395	maintained: it merely gets the "repairer capability" subset, which does not
396	include the decryption key. The File Verifier uses that data to find out
397	which nodes ought to hold shares for this file, and to see if those nodes are
398	still around and willing to provide the data. If the file is not healthy
399	enough, the File Repairer is invoked to download the ciphertext, regenerate
400	any missing shares, and upload them to new nodes. The goal of the File
401	Repairer is to finish up with a full set of ``N`` shares.
402
403	There are a number of engineering issues to be resolved here. The bandwidth,
404	disk IO, and CPU time consumed by the verification/repair process must be
405	balanced against the robustness that it provides to the grid. The nodes
406	involved in repair will have very different access patterns than normal
407	nodes, such that these processes may need to be run on hosts with more memory
408	or network connectivity than usual. The frequency of repair will directly
409	affect the resources consumed. In some cases, verification of multiple files
410	can be performed at the same time, and repair of files can be delegated off
411	to other nodes.
412
413	future work
414
415	Currently there are two modes of checking on the health of your file:
416	"Checker" simply asks storage servers which shares they have and does
417	nothing to try to verify that they aren't lying. "Verifier" downloads and
418	cryptographically verifies every bit of every share of the file from every
419	server, which costs a lot of network and CPU. A future improvement would be
420	to make a random-sampling verifier which downloads and cryptographically
421	verifies only a few randomly-chosen blocks from each server. This would
422	require much less network and CPU, yet it could still make it extremely unlikely
423	that any sort of corruption -- even malicious corruption intended to evade
424	detection -- would go unnoticed. This would be an instance of a
425	cryptographic notion called "Proof of Retrievability". Note that to implement
426	this requires no change to the server or to the cryptographic data structure
427	-- with the current data structure and the current protocol it is up to the
428	client which blocks they choose to download, so this would be solely a change
429	in client behavior.
430
431
432	Security
433	========
434
435	The design goal for this project is that an attacker may be able to deny
436	service (i.e. prevent you from recovering a file that was uploaded earlier)
437	but can accomplish none of the following three attacks:
438
439	1) violate confidentiality: the attacker gets to view data to which you have
440	not granted them access
441	2) violate integrity: the attacker convinces you that the wrong data is
442	actually the data you were intending to retrieve
443	3) violate unforgeability: the attacker gets to modify a mutable file or
444	directory (either the pathnames or the file contents) to which you have
445	not given them write permission
446
447	Integrity (the promise that the downloaded data will match the uploaded data)
448	is provided by the hashes embedded in the capability (for immutable files) or
449	the digital signature (for mutable files). Confidentiality (the promise that
450	the data is only readable by people with the capability) is provided by the
451	encryption key embedded in the capability (for both immutable and mutable
452	files). Data availability (the hope that data which has been uploaded in the
453	past will be downloadable in the future) is provided by the grid, which
454	distributes failures in a way that reduces the correlation between individual
455	node failure and overall file recovery failure, and by the erasure-coding
456	technique used to generate shares.
457
458	Many of these security properties depend upon the usual cryptographic
459	assumptions: the resistance of AES and RSA to attack, the resistance of
460	SHA-256 to collision attacks and pre-image attacks, and upon the proximity of
461	2^-128 and 2^-256 to zero. A break in AES would allow a confidentiality
462	violation, a collision break in SHA-256 would allow an integrity violation,
463	and a break in RSA would allow an unforgeability violation.
464
465	There is no attempt made to provide anonymity, either of the origin of a
466	piece of data or of the identity of the subsequent downloaders. In general,
467	anyone who already knows the contents of a file will be in a strong position
468	to determine who else is uploading or downloading it. Also, it is quite easy
469	for a sufficiently large coalition of nodes to correlate the set of nodes who
470	are all uploading or downloading the same file, even if the attacker does not
471	know the contents of the file in question.
472
473	Also note that the file size and (when convergence is being used) a keyed
474	hash of the plaintext are not protected. Many people can determine the size
475	of the file you are accessing, and if they already know the contents of a
476	given file, they will be able to determine that you are uploading or
477	downloading the same one.
478
479	The capability-based security model is used throughout this project.
480	Directory operations are expressed in terms of distinct read- and write-
481	capabilities. Knowing the read-capability of a file is equivalent to the
482	ability to read the corresponding data. The capability to validate the
483	correctness of a file is strictly weaker than the read-capability (possession
484	of read-capability automatically grants you possession of
485	validate-capability, but not vice versa). These capabilities may be expressly
486	delegated (irrevocably) by simply transferring the relevant secrets.
487
488	The application layer can provide whatever access model is desired, built on
489	top of this capability access model.
490
491
492	Reliability
493	===========
494
495	File encoding and peer-node selection parameters can be adjusted to achieve
496	different goals. Each choice results in a number of properties; there are
497	many tradeoffs.
498
499	First, some terms: the erasure-coding algorithm is described as ``k``-out-of-``N``
500	(for this release, the default values are ``k`` = 3 and ``N`` = 10). Each grid will
501	have some number of nodes; this number will rise and fall over time as nodes
502	join, drop out, come back, and leave forever. Files are of various sizes, some
503	are popular, others are unpopular. Nodes have various capacities, variable
504	upload/download bandwidths, and network latency. Most of the mathematical
505	models that look at node failure assume some average (and independent)
506	probability 'P' of a given node being available: this can be high (servers
507	tend to be online and available >90% of the time) or low (laptops tend to be
508	turned on for an hour then disappear for several days). Files are encoded in
509	segments of a given maximum size, which affects memory usage.
510
511	The ratio of ``N``/``k`` is the "expansion factor". Higher expansion factors
512	improve reliability very quickly (the binomial distribution curve is very sharp),
513	but consume much more grid capacity. When P=50%, the absolute value of ``k``
514	affects the granularity of the binomial curve (1-out-of-2 is much worse than
515	50-out-of-100), but high values asymptotically approach a constant (i.e.
516	500-of-1000 is not much better than 50-of-100). When P is high and the
517	expansion factor is held at a constant, higher values of ``k`` and ``N`` give
518	much better reliability (for P=99%, 50-out-of-100 is much much better than
519	5-of-10, roughly 10^50 times better), because there are more shares that can
520	be lost without losing the file.
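
Under the independence assumption described above, a file is lost when fewer
than ``k`` of its ``N`` share-holding nodes are available, which is a binomial
tail sum. The helper below is a small sketch of that calculation;
``unavailability`` is a name chosen for this example, and it assumes one share
per node and Python 3.8 or later for ``math.comb``.

```python
from math import comb


def unavailability(k, n, p):
    """Probability that fewer than k of n independent nodes are available,
    where each node is available with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
```

Comparing ``unavailability(5, 10, 0.99)`` with
``unavailability(50, 100, 0.99)`` shows the effect described above: at the
same expansion factor, the larger encoding is far less likely to lose the
file.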
521
522	Likewise, the total number of nodes in the network affects the same
523	granularity: having only one node means a single point of failure, no matter
524	how many copies of the file you make. Independent nodes (with uncorrelated
525	failures) are necessary to hit the mathematical ideals: if you have 100 nodes
526	but they are all in the same office building, then a single power failure
527	will take out all of them at once. Pseudospoofing, also called a "Sybil Attack",
528	is where a single attacker convinces you that they are actually multiple
529	servers, so that you think you are using a large number of independent nodes,
530	but in fact you have a single point of failure (where the attacker turns off
531	all their machines at once). Large grids, with lots of truly independent nodes,
532	will enable the use of lower expansion factors to achieve the same reliability,
533	but will increase overhead because each node needs to know something about
534	every other, and the rate at which nodes come and go will be higher (requiring
535	network maintenance traffic). Also, the File Repairer work will increase with
536	larger grids, although then the job can be distributed out to more nodes.
537
538	Higher values of ``N`` increase overhead: more shares means more Merkle hashes
539	that must be included with the data, and more nodes to contact to retrieve
540	the shares. Smaller segment sizes reduce memory usage (since each segment
541	must be held in memory while erasure coding runs) and improve "alacrity"
542	(since downloading can validate a smaller piece of data faster, delivering it
543	to the target sooner), but also increase overhead (because more blocks means
544	more Merkle hashes to validate them).
545
546	In general, small private grids should work well, but the participants will
547	have to decide between storage overhead and reliability. Large stable grids
548	will be able to reduce the expansion factor down to a bare minimum while
549	still retaining high reliability, but large unstable grids (where nodes are
550	coming and going very quickly) may require more repair/verification bandwidth
551	than actual upload/download traffic.
