.. -*- coding: utf-8-with-signature -*-

=============================================================
Redundant Array of Independent Clouds: Share To Cloud Mapping
=============================================================


Introduction
============

This document describes a proposed design for the mapping of LAFS shares to
objects in a cloud storage service. It also analyzes the costs for each of the
functional requirements, including network, disk, storage and API usage costs.


Terminology
===========

*LAFS share*
  A Tahoe-LAFS share representing part of a file after encryption and
  erasure encoding.

*LAFS shareset*
  The set of shares stored by a LAFS storage server for a given storage index.
  The shares within a shareset are numbered by a small integer.

*Cloud storage service*
  A service such as Amazon S3 `²`_, Rackspace Cloud Files `³`_,
  Google Cloud Storage `⁴`_, or Windows Azure `⁵`_, that provides cloud storage.

*Cloud storage interface*
  A protocol interface supported by a cloud storage service, such as the
  S3 interface `⁶`_, the OpenStack Object Storage interface `⁷`_, the
  Google Cloud Storage interface `⁸`_, or the Azure interface `⁹`_. There may be
  multiple services implementing a given cloud storage interface. In this design,
  only REST-based APIs `¹⁰`_ over HTTP will be used as interfaces.

*Store object*
  A file-like abstraction provided by a cloud storage service, storing a
  sequence of bytes. Store objects are mutable in the sense that the contents
  and metadata of the store object with a given name in a given backend store
  can be replaced. Store objects are called “blobs” in the Azure interface,
  and “objects” in the other interfaces.

*Cloud backend store*
  A container for store objects provided by a cloud service. Cloud backend
  stores are called “buckets” in the S3 and Google Cloud Storage interfaces,
  and “containers” in the Azure and OpenStack Storage interfaces.


Functional Requirements
=======================

* *Upload*: a LAFS share can be uploaded to an appropriately configured
  Tahoe-LAFS storage server and the data is stored to the cloud
  storage service.

* *Scalable shares*: there is no hard limit on the size of LAFS share
  that can be uploaded.

  If the cloud storage interface offers scalable files, then this could be
  implemented by using that feature of the specific cloud storage
  interface. Alternately, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size immutable share to a set of size-limited
  store objects.

* *Streaming upload*: the size of the LAFS share that is uploaded
  can exceed the amount of RAM and even the amount of direct attached
  storage on the storage server. I.e., the storage server is required to
  stream the data directly to the ultimate cloud storage service while
  processing it, rather than buffering the data until the client has
  finished uploading and then transferring it to the cloud storage service.

* *Download*: a LAFS share can be downloaded from an appropriately
  configured Tahoe-LAFS storage server, and the data is loaded from the
  cloud storage service.

* *Streaming download*: the size of the LAFS share that is
  downloaded can exceed the amount of RAM and even the amount of direct
  attached storage on the storage server. I.e., the storage server is
  required to stream the data directly to the client while processing it,
  rather than buffering the data until the cloud storage service has
  finished serving it and then transferring it to the client.

* *Modify*: a LAFS share can have part of its contents modified.

  If the cloud storage interface offers scalable mutable files, then this
  could be implemented by using that feature of the specific cloud storage
  interface. Alternately, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size mutable share to a set of size-limited
  store objects.

* *Efficient modify*: the size of the LAFS share being
  modified can exceed the amount of RAM and even the amount of direct
  attached storage on the storage server. I.e., the storage server is
  required to download, patch, and upload only the segment(s) of the share
  that are being modified, rather than downloading, patching, and
  uploading the entire share.

* *Tracking leases*: the Tahoe-LAFS storage server is required to track when
  each share has its lease renewed so that unused shares (shares whose lease
  has not been renewed within a time limit, e.g. 30 days) can be garbage
  collected. This does not necessarily require code specific to each cloud
  storage interface, because the lease tracking can be performed in the
  storage server's generic component rather than in the component supporting
  each interface.


Mapping
=======

This section describes the mapping between LAFS shares and store objects.

A LAFS share will be split into one or more “chunks” that are each stored in a
store object. A LAFS share of size `C` bytes will be stored as `ceiling(C / chunksize)`
chunks. The last chunk has a size between 1 and `chunksize` bytes inclusive.
(It is not possible for `C` to be zero, because valid shares always have a header,
so there is at least one chunk for each share.)

For an existing share, the chunk size is determined by the size of the first
chunk. For a new share, it is a parameter that may depend on the storage
interface. It is an error for any chunk to be larger than the first chunk, or
for any chunk other than the last to be smaller than the first chunk.
If a mutable share with total size less than the default chunk size for the
storage interface is being modified, the new contents are split using the
default chunk size.

*Rationale*: this design allows the `chunksize` parameter to be changed for
new shares written via a particular storage interface, without breaking
compatibility with existing stored shares. All cloud storage interfaces
return the sizes of store objects with requests to list objects, and so
the size of the first chunk can be determined without an additional request.

The name of the store object for chunk `i` > 0 of a LAFS share with storage index
`STORAGEINDEX` and share number `SHNUM` will be

  shares/`ST`/`STORAGEINDEX`/`SHNUM.i`

where `ST` is the first two characters of `STORAGEINDEX`. When `i` is 0, the
`.0` is omitted.

*Rationale*: this layout maintains compatibility with data stored by the
prototype S3 backend, for which Least Authority Enterprises has existing
customers. This prototype always used a single store object to store each
share, with name

  shares/`ST`/`STORAGEINDEX`/`SHNUM`

By using the same prefix “shares/`ST`/`STORAGEINDEX`/” for old and new layouts,
the storage server can obtain a list of store objects associated with a given
shareset without having to know the layout in advance, and without having to
make multiple API requests. This also simplifies sharing of test code between the
disk and cloud backends.
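
As a sketch, the chunk count and store object names described above can be
computed with a few lines of Python (a hypothetical helper for illustration
only; the function and argument names are not taken from the actual backend
code):

```python
def chunk_object_names(storage_index, shnum, share_size, chunksize):
    """Return the store object names for a share of share_size bytes."""
    assert share_size > 0, "valid shares always have a header"
    num_chunks = -(-share_size // chunksize)  # ceiling(share_size / chunksize)
    prefix = "shares/%s/%s/" % (storage_index[:2], storage_index)
    names = []
    for i in range(num_chunks):
        # Chunk 0 omits the ".0" suffix, matching the prototype S3
        # backend's one-object-per-share layout.
        suffix = "" if i == 0 else ".%d" % i
        names.append("%s%d%s" % (prefix, shnum, suffix))
    return names
```

For example, with a 1 MB `chunksize`, a share slightly over 2 MB maps to three
store objects under the shareset prefix: `SHNUM`, `SHNUM.1`, and `SHNUM.2`.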

Mutable and immutable shares will be “chunked” in the same way.


Rationale for Chunking
----------------------

Limiting the amount of data received or sent in a single request has the
following advantages:

* It is unnecessary to write separate code to take advantage of the
  “large object” features of each cloud storage interface, which differ
  significantly in their design.
* Data needed for each PUT request can be discarded after it completes.
  If a PUT request fails, it can be retried while only holding the data
  for that request in memory.


Costs
=====

In this section we analyze the costs of the proposed design in terms of network,
disk, memory, cloud storage, and API usage.


Network usage—bandwidth and number-of-round-trips
-------------------------------------------------

When a Tahoe-LAFS storage client allocates a new share on a storage server,
the backend will request a list of the existing store objects with the
appropriate prefix. This takes one HTTP request in the common case, but may
take more for the S3 interface, which has a limit of 1000 objects returned in
a single “GET Bucket” request.
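
For interfaces that cap the number of names returned per listing request (such
as the 1000-object limit of S3's “GET Bucket”), the backend can follow
pagination markers until the listing is complete. A minimal sketch, assuming a
hypothetical `list_objects(prefix, marker)` adapter call that returns a
`(names, truncated, next_marker)` tuple:

```python
def list_all_objects(list_objects, prefix):
    """Collect every store object name under `prefix`, following the
    service's pagination markers until no results remain."""
    names = []
    marker = None
    while True:
        page, truncated, marker = list_objects(prefix, marker)
        names.extend(page)
        if not truncated:
            return names
```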

If the share is to be read, the client will make a number of calls each
specifying the offset and length of the required span of bytes. On the first
request that overlaps a given chunk of the share, the server will make an
HTTP GET request for that store object. The server may also speculatively
make GET requests for store objects that are likely to be needed soon (which
can be predicted since reads are normally sequential), in order to reduce
latency.

Each read will be satisfied as soon as the corresponding data is available,
without waiting for the rest of the chunk, in order to minimize read latency.

All four cloud storage interfaces support GET requests using the
Range HTTP header. This could be used to optimize reads where the
Tahoe-LAFS storage client requires only part of a share.
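
The chunk indices and in-chunk byte ranges needed to satisfy a client read can
be found with simple arithmetic; a sketch (the helper name is hypothetical):

```python
def chunk_spans(offset, length, chunksize):
    """Map a read of `length` bytes at `offset` within a share to
    (chunk_index, start, end) spans, one per Range GET request."""
    spans = []
    remaining = length
    while remaining > 0:
        i = offset // chunksize
        start = offset - i * chunksize
        n = min(remaining, chunksize - start)  # bytes available in this chunk
        spans.append((i, start, start + n))
        offset += n
        remaining -= n
    return spans
```

For instance, a 200-byte read at offset 900 with a 1000-byte chunk size touches
chunks 0 and 1: bytes 900–1000 of the first and bytes 0–100 of the second.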

If the share is to be written, the server will make an HTTP PUT request for
each chunk that has been completed. Tahoe-LAFS clients only write immutable
shares sequentially, and so we can rely on that property to simplify the
implementation.

When modifying shares of an existing mutable file, the storage server will
be able to make PUT requests only for chunks that have changed.
(Current Tahoe-LAFS v1.9 clients will not take advantage of this ability, but
future versions will probably do so for MDMF files.)

In some cases, it may be necessary to retry a request (see the `Structure of
Implementation`_ section below). In the case of a PUT request, at the point
at which a retry is needed, the new chunk contents to be stored will still be
in memory and so this is not problematic.

In the absence of retries, the maximum number of GET requests that will be made
when downloading a file, or the maximum number of PUT requests when uploading
or modifying a file, will be equal to the number of chunks in the file.

If the new mutable share content has fewer chunks than the old content,
then the remaining store objects for old chunks must be deleted (using one
HTTP request each). When reading a share, the backend must tolerate the case
where these store objects have not been deleted successfully.

The last write to a share will be reported as successful only when all
corresponding HTTP PUTs and DELETEs have completed successfully.


Disk usage (local to the storage server)
----------------------------------------

It is never necessary for the storage server to write the content of share
chunks to local disk, either when they are read or when they are written. Each
chunk is held only in memory.

A proposed change to the Tahoe-LAFS storage server implementation uses an SQLite
database to store metadata about shares. In that case the same database would
be used for the cloud backend. This would enable lease tracking to be implemented
in the same way for disk and cloud backends.


Memory usage
------------

The use of chunking simplifies bounding the memory usage of the storage server
when handling files that may be larger than memory. However, this depends on
limiting the number of chunks that are simultaneously held in memory.
Multiple chunks can be held in memory either because of pipelining of requests
for a single share, or because multiple shares are being read or written
(possibly by multiple clients).

For immutable shares, the Tahoe-LAFS storage protocol requires the client to
specify in advance the maximum amount of data it will write. Also, a cooperative
client (including all existing released versions of the Tahoe-LAFS code) will
limit the amount of data that is pipelined, currently to 50 KiB. Since the chunk
size will be greater than that, it is possible to ensure that for each allocation,
the maximum chunk data memory usage is the lesser of two chunks' worth of data
and the allocation size. (There is some additional overhead but it is small
compared to the chunk data.) If the maximum memory usage of a new allocation
would exceed the memory available, the allocation can be delayed or possibly
denied, so that the total memory usage is bounded.
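
The bound described above amounts to a small admission check; a sketch, where
the helper names are illustrative rather than part of the proposed server code:

```python
def allocation_memory_bound(allocation_size, chunksize):
    """Worst-case chunk-data memory for one allocation: at most two
    chunks can be in memory at once (since the client pipelines less
    than one chunk's worth of data), and never more than the
    allocation itself."""
    return min(2 * chunksize, allocation_size)

def can_admit(current_usage, allocation_size, chunksize, memory_limit):
    # Delay or deny the allocation if admitting it could exceed the limit.
    bound = allocation_memory_bound(allocation_size, chunksize)
    return current_usage + bound <= memory_limit
```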

It is not clear that the existing protocol allows allocations for mutable
shares to be bounded in general; this may be addressed in a future protocol change.

The above discussion assumes that clients do not maliciously send large
messages as a denial-of-service attack. Foolscap (the protocol layer underlying
the Tahoe-LAFS storage protocol) does not attempt to resist denial of service.


Storage
-------

The storage requirements, including not-yet-collected garbage shares, are
the same as for the Tahoe-LAFS disk backend. That is, the total size of cloud
objects stored is equal to the total size of shares that the disk backend
would store.

Erasure coding causes the size of shares for each file to be a
factor `shares.total` / `shares.needed` times the file size, plus overhead
that is logarithmic in the file size `¹¹`_.
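
For example, with a 3-of-10 encoding (`shares.needed` = 3, `shares.total` = 10),
the total stored share data is roughly 10/3 of the file size; a sketch that
ignores the logarithmic per-share overhead:

```python
def total_share_bytes(file_size, shares_needed, shares_total):
    # Each of shares_total shares holds about file_size / shares_needed
    # bytes; the logarithmic overhead is omitted here.
    return shares_total * (file_size // shares_needed)
```

So a 3 MB file stored with 3-of-10 encoding occupies about 10 MB of cloud
storage in total.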


API usage
---------

Cloud storage backends typically charge a small fee per API request. The number of
requests to the cloud storage service for various operations is discussed under
“network usage” above.


Structure of Implementation
===========================

A generic “cloud backend”, based on the prototype S3 backend but with support
for chunking as described above, will be written.

An instance of the cloud backend can be attached to one of several
“cloud interface adapters”, one for each cloud storage interface. These
adapters will operate only on chunks, and need not distinguish between
mutable and immutable shares. They will be a relatively “thin” abstraction
layer over the HTTP APIs of each cloud storage interface, similar to the
S3Bucket abstraction in the prototype.

For some cloud storage services it may be necessary to transparently retry
requests in order to recover from transient failures. (Although the erasure
coding may enable a file to be retrieved even when shares are not stored by or
not readable from all cloud storage services used in a Tahoe-LAFS grid, it may
be desirable to retry cloud storage service requests in order to improve overall
reliability.) Support for this will be implemented in the generic cloud backend,
and used whenever a cloud interface adapter reports a transient failure. Our
experience with the prototype suggests that it is necessary to retry on transient
failures for Amazon's S3 service.
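
Transient-failure retry in the generic cloud backend could look roughly like
the following synchronous sketch (`TransientError`, the attempt count, and the
delays are illustrative assumptions; the real backend would likely use
Twisted-style asynchronous calls instead):

```python
import time

class TransientError(Exception):
    """Raised by a cloud interface adapter for failures worth retrying."""

def with_retries(operation, max_attempts=4, initial_delay=0.5):
    """Run operation(), retrying on TransientError with exponential
    backoff; non-transient failures propagate immediately."""
    delay = initial_delay
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2
```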

There will also be a “mock” cloud interface adapter, based on the prototype's
MockS3Bucket. This allows tests of the generic cloud backend to be run without
a connection to a real cloud service. The mock adapter will be able to simulate
transient and non-transient failures.


Known Issues
============

This design worsens a known “write hole” issue in Tahoe-LAFS when updating
the contents of mutable files. An update to a mutable file can require
changing the contents of multiple chunks, and if the client fails or is
disconnected during the operation the resulting state of the store objects
for that share may be inconsistent—no longer containing all of the old version,
but not yet containing all of the new version. A mutable share can be left in
an inconsistent state even by the existing Tahoe-LAFS disk backend if it fails
during a write, but that has a smaller chance of occurrence because the current
client behavior leads to mutable shares being written to disk in a single
system call.

The best fix for this issue probably requires changing the Tahoe-LAFS storage
protocol, perhaps by extending it to use a two-phase or three-phase commit
(ticket #1755).


References
==========

¹ omitted

.. _²:

² “Amazon S3” Amazon (2012)

   https://aws.amazon.com/s3/

.. _³:

³ “Rackspace Cloud Files” Rackspace (2012)

   https://www.rackspace.com/cloud/cloud_hosting_products/files/

.. _⁴:

⁴ “Google Cloud Storage” Google (2012)

   https://developers.google.com/storage/

.. _⁵:

⁵ “Windows Azure Storage” Microsoft (2012)

   https://www.windowsazure.com/en-us/develop/net/fundamentals/cloud-storage/

.. _⁶:

⁶ “Amazon Simple Storage Service (Amazon S3) API Reference: REST API” Amazon (2012)

   http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html

.. _⁷:

⁷ “OpenStack Object Storage” openstack.org (2012)

   http://openstack.org/projects/storage/

.. _⁸:

⁸ “Google Cloud Storage Reference Guide” Google (2012)

   https://developers.google.com/storage/docs/reference-guide

.. _⁹:

⁹ “Windows Azure Storage Services REST API Reference” Microsoft (2012)

   http://msdn.microsoft.com/en-us/library/windowsazure/dd179355.aspx

.. _¹⁰:

¹⁰ “Representational state transfer” English Wikipedia (2012)

   https://en.wikipedia.org/wiki/Representational_state_transfer

.. _¹¹:

¹¹ “Performance costs for some common operations” tahoe-lafs.org (2012)

   :doc:`../../performance`
|---|