=============================================================
Redundant Array of Independent Clouds: Share To Cloud Mapping
=============================================================


Introduction
============

This document describes a proposed design for the mapping of LAFS shares to
objects in a cloud storage service. It also analyzes the costs of meeting each
functional requirement, in terms of network, disk, memory, cloud storage, and
API usage.


Terminology
===========

*LAFS share*
   A Tahoe-LAFS share representing part of a file after encryption and
   erasure encoding.

*LAFS shareset*
   The set of shares stored by a LAFS storage server for a given storage index.
   The shares within a shareset are numbered by a small integer.

*Cloud storage service*
   A service that provides cloud storage, such as Amazon S3 `²`_, Rackspace
   Cloud Files `³`_, Google Cloud Storage `⁴`_, or Windows Azure `⁵`_.

*Cloud storage interface*
   A protocol interface supported by a cloud storage service, such as the
   S3 interface `⁶`_, the OpenStack Object Storage interface `⁷`_, the
   Google Cloud Storage interface `⁸`_, or the Azure interface `⁹`_. There may be
   multiple services implementing a given cloud storage interface. In this design,
   only REST-based APIs `¹⁰`_ over HTTP will be used as interfaces.

*Cloud object*
   A file-like abstraction provided by a cloud storage service, storing a
   sequence of bytes. Cloud objects are mutable in the sense that the contents
   and metadata of the cloud object with a given name in a given cloud container
   can be replaced. Cloud objects are called “blobs” in the Azure interface,
   and “objects” in the other interfaces.

*Cloud container*
   A container for cloud objects provided by a cloud service. Cloud containers
   are called “buckets” in the S3 and Google Cloud Storage interfaces, and
   “containers” in the Azure and OpenStack Object Storage interfaces.


Functional Requirements
=======================

* *Upload*: a LAFS share can be uploaded to an appropriately configured
  Tahoe-LAFS storage server and the data is stored to the cloud
  storage service.

 * *Scalable shares*: there is no hard limit on the size of LAFS share
   that can be uploaded.

   If the cloud storage interface offers scalable files, then this could be
   implemented by using that feature of the specific cloud storage
   interface. Alternatively, it could be implemented by mapping from the LAFS
   abstraction of an unlimited-size immutable share to a set of size-limited
   cloud objects.

 * *Streaming upload*: the size of the LAFS share that is uploaded
   can exceed the amount of RAM and even the amount of direct attached
   storage on the storage server. That is, the storage server is required to
   stream the data directly to the ultimate cloud storage service while
   processing it, rather than buffering the data until the client has finished
   uploading and then transferring it to the cloud storage service.

* *Download*: a LAFS share can be downloaded from an appropriately
  configured Tahoe-LAFS storage server, and the data is loaded from the
  cloud storage service.

 * *Streaming download*: the size of the LAFS share that is
   downloaded can exceed the amount of RAM and even the amount of direct
   attached storage on the storage server. That is, the storage server is
   required to stream the data directly to the client while processing it,
   rather than buffering the data until the cloud storage service has finished
   serving it and then transferring it to the client.

* *Modify*: a LAFS share can have part of its contents modified.

  If the cloud storage interface offers scalable mutable files, then this
  could be implemented by using that feature of the specific cloud storage
  interface. Alternatively, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size mutable share to a set of size-limited
  cloud objects.

 * *Efficient modify*: the size of the LAFS share being
   modified can exceed the amount of RAM and even the amount of direct
   attached storage on the storage server. That is, the storage server is
   required to download, patch, and upload only the segment(s) of the share
   that are being modified, rather than downloading, patching, and uploading
   the entire share.

* *Tracking leases*: the Tahoe-LAFS storage server is required to track when
  each share has its lease renewed so that unused shares (shares whose lease
  has not been renewed within a time limit, e.g. 30 days) can be garbage
  collected. This does not necessarily require code specific to each cloud
  storage interface, because the lease tracking can be performed in the
  storage server's generic component rather than in the component supporting
  each interface.


Mapping
=======

This section describes the mapping between LAFS shares and cloud objects.

A LAFS share will be split into one or more “chunks” that are each stored in a
cloud object. A LAFS share of size `C` bytes will be stored as `ceiling(C / chunksize)`
chunks. The last chunk has a size between 1 and `chunksize` bytes inclusive.
(It is not possible for `C` to be zero, because valid shares always have a header,
so there is at least one chunk for each share.)

For an existing share, the chunk size is determined by the size of the first
chunk. For a new share, it is a parameter that may depend on the storage
interface. It is an error for any chunk to be larger than the first chunk, or
for any chunk other than the last to be smaller than the first chunk.
If a mutable share with total size less than the default chunk size for the
storage interface is being modified, the new contents are split using the
default chunk size.

  *Rationale*: this design allows the `chunksize` parameter to be changed for
  new shares written via a particular storage interface, without breaking
  compatibility with existing stored shares. All cloud storage interfaces
  return the sizes of cloud objects with requests to list objects, and so
  the size of the first chunk can be determined without an additional request.

The name of the cloud object for chunk `i` >= 0 of a LAFS share with storage index
`STORAGEINDEX` and share number `SHNUM`, will be

  shares/`ST`/`STORAGEINDEX`/`SHNUM.i`

where `ST` is the first two characters of `STORAGEINDEX`. When `i` is 0, the
`.0` is omitted.

  *Rationale*: this layout maintains compatibility with data stored by the
  prototype S3 backend, for which Least Authority Enterprises has existing
  customers. This prototype always used a single cloud object to store each
  share, with name

    shares/`ST`/`STORAGEINDEX`/`SHNUM`

  By using the same prefix “shares/`ST`/`STORAGEINDEX`/” for old and new layouts,
  the storage server can obtain a list of cloud objects associated with a given
  shareset without having to know the layout in advance, and without having to
  make multiple API requests. This also simplifies sharing of test code between the
  disk and cloud backends.

Mutable and immutable shares will be “chunked” in the same way.
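
As an illustration of the chunk count and naming rules above, here is a
minimal Python sketch; the function and parameter names are invented for
this example and are not part of the proposed implementation::

  import math

  def chunk_object_names(storageindex, shnum, share_size, chunksize):
      """Return the cloud object names for a share of share_size bytes,
      stored as ceiling(share_size / chunksize) chunks. share_size cannot
      be zero, because valid shares always have a header."""
      assert share_size > 0
      prefix = "shares/%s/%s/%d" % (storageindex[:2], storageindex, shnum)
      num_chunks = int(math.ceil(share_size / float(chunksize)))
      # Chunk 0 omits the ".0" suffix, matching the single-object layout
      # used by the prototype S3 backend.
      return [prefix if i == 0 else "%s.%d" % (prefix, i)
              for i in range(num_chunks)]

For example, a 25 MiB share with a 10 MiB chunk size is stored as three cloud
objects, the last holding the final 5 MiB.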


Rationale for Chunking
----------------------

Limiting the amount of data received or sent in a single request has the
following advantages:

* It is unnecessary to write separate code to take advantage of the
  “large object” features of each cloud storage interface, which differ
  significantly in their design.
* Data needed for each PUT request can be discarded after it completes.
  If a PUT request fails, it can be retried while only holding the data
  for that request in memory.


Costs
=====

In this section we analyze the costs of the proposed design in terms of network,
disk, memory, cloud storage, and API usage.


Network usage—bandwidth and number-of-round-trips
-------------------------------------------------

When a Tahoe-LAFS storage client allocates a new share on a storage server,
the backend will request a list of the existing cloud objects with the
appropriate prefix. This takes one HTTP request in the common case, but may
take more for the S3 interface, which has a limit of 1000 objects returned in
a single “GET Bucket” request.
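
For the S3 interface the listing must therefore be paginated. A minimal
sketch of such a listing loop, assuming a hypothetical adapter method
`list_objects(prefix, marker)` (invented for this example) that wraps one
listing request and returns the page of names plus a truncation flag::

  def list_shareset_objects(adapter, storageindex):
      # List every cloud object for a shareset, using the common
      # "shares/ST/STORAGEINDEX/" prefix described in the Mapping section.
      prefix = "shares/%s/%s/" % (storageindex[:2], storageindex)
      names = []
      marker = None
      while True:
          # S3 returns at most 1000 objects per "GET Bucket" request; the
          # last name seen is passed as the marker to fetch the next page.
          page, truncated = adapter.list_objects(prefix, marker)
          names.extend(page)
          if not truncated:
              return names
          marker = page[-1]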

If the share is to be read, the client will make a number of calls, each
specifying the offset and length of the required span of bytes. On the first
request that overlaps a given chunk of the share, the server will make an
HTTP GET request for that cloud object. The server may also speculatively
make GET requests for cloud objects that are likely to be needed soon (which
can be predicted since reads are normally sequential), in order to reduce
latency.

Each read will be satisfied as soon as the corresponding data is available,
without waiting for the rest of the chunk, in order to minimize read latency.

All four cloud storage interfaces support GET requests using the
Range HTTP header. This could be used to optimize reads where the
Tahoe-LAFS storage client requires only part of a share.
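
As a sketch of how a partial-share read maps onto chunk GETs, the following
invented helper (not the actual implementation) computes which chunks a
requested span touches and the Range header value for each; the same
calculation applies when deciding which chunks an efficient modify must
download and patch::

  def chunk_ranges(offset, length, chunksize):
      """Map a read of length bytes at offset within a share to a list of
      (chunk_index, range_header) pairs. HTTP Range end offsets are
      inclusive."""
      spans = []
      end = offset + length  # exclusive end of the requested span
      for i in range(offset // chunksize, (end - 1) // chunksize + 1):
          lo = max(offset, i * chunksize) - i * chunksize
          hi = min(end, (i + 1) * chunksize) - i * chunksize
          spans.append((i, "bytes=%d-%d" % (lo, hi - 1)))
      return spans

For example, a read of 2000000 bytes at offset 9000000 with a chunk size of
10000000 touches chunk 0 (`bytes=9000000-9999999`) and chunk 1
(`bytes=0-999999`).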

If the share is to be written, the server will make an HTTP PUT request for
each chunk that has been completed. Tahoe-LAFS clients only write immutable
shares sequentially, and so we can rely on that property to simplify the
implementation.
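
Because immutable shares are written sequentially, the server need buffer at
most one chunk at a time, flushing each chunk with a PUT as soon as it is
complete. A sketch of this, assuming a hypothetical
`adapter.put_object(name, data)` method (invented for this example)::

  class ChunkedShareWriter:
      """Accumulate sequential writes and PUT each chunk once it is full."""
      def __init__(self, adapter, object_names, chunksize):
          self.adapter = adapter
          self.names = object_names
          self.chunksize = chunksize
          self.buf = b""
          self.index = 0

      def write(self, data):
          self.buf += data
          while len(self.buf) >= self.chunksize:
              # PUT a completed chunk; its data can then be discarded.
              self.adapter.put_object(self.names[self.index],
                                      self.buf[:self.chunksize])
              self.buf = self.buf[self.chunksize:]
              self.index += 1

      def close(self):
          # The last chunk may hold between 1 and chunksize bytes.
          if self.buf:
              self.adapter.put_object(self.names[self.index], self.buf)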

When modifying shares of an existing mutable file, the storage server will
be able to make PUT requests only for chunks that have changed.
(Current Tahoe-LAFS v1.9 clients will not take advantage of this ability, but
future versions will probably do so for MDMF files.)

In some cases, it may be necessary to retry a request (see the `Structure of
Implementation`_ section below). In the case of a PUT request, at the point
at which a retry is needed, the new chunk contents to be stored will still be
in memory and so this is not problematic.

In the absence of retries, the maximum number of GET requests that will be made
when downloading a file, or the maximum number of PUT requests when uploading
or modifying a file, will be equal to the number of chunks in the file.

If the new mutable share content has fewer chunks than the old content,
then the remaining cloud objects for old chunks must be deleted (using one
HTTP request each). When reading a share, the backend must tolerate the case
where these cloud objects have not been deleted successfully.

The last write to a share will be reported as successful only when all
corresponding HTTP PUTs and DELETEs have completed successfully.
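
Since Tahoe-LAFS is built on Twisted, one plausible shape for this (a sketch,
not the actual code) is to gather the Deferreds for the outstanding requests
and fire the overall result only when all of them have succeeded::

  from twisted.internet.defer import gatherResults

  def finish_write(put_deferreds, delete_deferreds):
      # Fires only when every PUT and DELETE has completed successfully;
      # any failure propagates, so the write is not reported as successful.
      return gatherResults(put_deferreds + delete_deferreds)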



Disk usage (local to the storage server)
----------------------------------------

It is never necessary for the storage server to write the content of share
chunks to local disk, either when they are read or when they are written. Each
chunk is held only in memory.

A proposed change to the Tahoe-LAFS storage server implementation uses an SQLite
database to store metadata about shares. In that case the same database would
be used for the cloud backend. This would enable lease tracking to be implemented
in the same way for disk and cloud backends.
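
A minimal sketch of what such lease tracking might look like, using an
invented `leases` table (the schema in the proposed change may differ)::

  import sqlite3
  import time

  db = sqlite3.connect("shares.db")
  db.execute("CREATE TABLE IF NOT EXISTS leases"
             " (storage_index TEXT, shnum INTEGER, renewal_time REAL,"
             "  PRIMARY KEY (storage_index, shnum))")

  def renew_lease(storage_index, shnum):
      db.execute("INSERT OR REPLACE INTO leases VALUES (?, ?, ?)",
                 (storage_index, shnum, time.time()))
      db.commit()

  def expired_shares(age_limit=30 * 24 * 3600):
      # Shares whose lease has not been renewed within the limit
      # (e.g. 30 days) are candidates for garbage collection.
      cutoff = time.time() - age_limit
      return db.execute("SELECT storage_index, shnum FROM leases"
                        " WHERE renewal_time < ?", (cutoff,)).fetchall()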


Memory usage
------------

The use of chunking simplifies bounding the memory usage of the storage server
when handling files that may be larger than memory. However, this depends on
limiting the number of chunks that are simultaneously held in memory.
Multiple chunks can be held in memory either because of pipelining of requests
for a single share, or because multiple shares are being read or written
(possibly by multiple clients).

For immutable shares, the Tahoe-LAFS storage protocol requires the client to
specify in advance the maximum amount of data it will write. Also, a cooperative
client (including all existing released versions of the Tahoe-LAFS code) will
limit the amount of data that is pipelined, currently to 50 KiB. Since the chunk
size will be greater than that, it is possible to ensure that, for each allocation,
the maximum memory used for chunk data is the lesser of two chunks' worth of data
and the allocation size. (There is some additional overhead but it is small
compared to the chunk data.) If the maximum memory usage of a new allocation
would exceed the memory available, the allocation can be delayed or possibly
denied, so that the total memory usage is bounded.
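
A sketch of the bound described above, with invented names (`chunksize`,
`memory_limit`) rather than actual configuration parameters::

  def max_chunk_memory(allocated_size, chunksize):
      # Per-allocation bound: at most two chunks' worth of data need be in
      # memory at once (one being filled by pipelined writes and one being
      # PUT), and never more than the allocation itself.
      return min(2 * chunksize, allocated_size)

  def can_admit(allocated_size, current_usage, chunksize, memory_limit):
      # Delay or deny the allocation if admitting it could exceed the
      # server's memory budget.
      return (current_usage + max_chunk_memory(allocated_size, chunksize)
              <= memory_limit)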

It is not clear that the existing protocol allows allocations for mutable
shares to be bounded in general; this may be addressed in a future protocol change.

The above discussion assumes that clients do not maliciously send large
messages as a denial-of-service attack. Foolscap (the protocol layer underlying
the Tahoe-LAFS storage protocol) does not attempt to resist denial of service.


Storage
-------

The storage requirements, including not-yet-collected garbage shares, are
the same as for the Tahoe-LAFS disk backend. That is, the total size of cloud
objects stored is equal to the total size of shares that the disk backend
would store.

Erasure coding causes the total size of the shares for each file to be a
factor `shares.total` / `shares.needed` times the file size, plus overhead
that is logarithmic in the file size `¹¹`_.
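
For example, with the default encoding parameters `shares.needed` = 3 and
`shares.total` = 10, the shares for a 30 MB file total roughly
(10 / 3) × 30 MB = 100 MB, plus the logarithmic overhead.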


API usage
---------

Cloud storage services typically charge a small fee per API request. The number of
requests to the cloud storage service for various operations is discussed under
“network usage” above.


Structure of Implementation
===========================

A generic “cloud backend”, based on the prototype S3 backend but with support
for chunking as described above, will be written.

An instance of the cloud backend can be attached to one of several
“cloud interface adapters”, one for each cloud storage interface. These
adapters will operate only on chunks, and need not distinguish between
mutable and immutable shares. They will be a relatively “thin” abstraction
layer over the HTTP APIs of each cloud storage interface, similar to the
S3Bucket abstraction in the prototype.

For some cloud storage services it may be necessary to transparently retry
requests in order to recover from transient failures. (Although the erasure
coding may enable a file to be retrieved even when shares are not stored by or
not readable from all cloud storage services used in a Tahoe-LAFS grid, it may
be desirable to retry cloud storage service requests in order to improve overall
reliability.) Support for this will be implemented in the generic cloud backend,
and used whenever a cloud interface adapter reports a transient failure. Our
experience with the prototype suggests that it is necessary to retry on transient
failures for Amazon's S3 service.
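
A minimal sketch of the kind of retry loop the generic cloud backend might
use, assuming a hypothetical `TransientError` exception raised by adapters
for failures worth retrying (both names are invented for this example)::

  import time

  class TransientError(Exception):
      """Raised by a cloud interface adapter for a retryable failure."""

  def call_with_retries(op, max_tries=5, first_delay=0.5):
      # Retry op() on transient failures with exponential backoff;
      # non-transient failures and the final transient failure propagate.
      delay = first_delay
      for attempt in range(max_tries):
          try:
              return op()
          except TransientError:
              if attempt == max_tries - 1:
                  raise
              time.sleep(delay)
              delay *= 2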

There will also be a “mock” cloud interface adapter, based on the prototype's
MockS3Bucket. This allows tests of the generic cloud backend to be run without
a connection to a real cloud service. The mock adapter will be able to simulate
transient and non-transient failures.
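
One possible shape for such a mock, again with invented names (the real
MockS3Bucket differs), reusing the `TransientError` from the sketch above::

  class MockContainer:
      """In-memory stand-in for a cloud container that can inject
      failures for testing."""
      def __init__(self, fail_once_names=()):
          self.objects = {}
          self._fail_once = set(fail_once_names)

      def put_object(self, name, data):
          if name in self._fail_once:
              # Simulate a transient failure on the first PUT of this name.
              self._fail_once.remove(name)
              raise TransientError("simulated failure on %s" % name)
          self.objects[name] = data

      def get_object(self, name):
          return self.objects[name]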


Known Issues
============

This design worsens a known “write hole” issue in Tahoe-LAFS when updating
the contents of mutable files. An update to a mutable file can require changing
the contents of multiple chunks, and if the client fails or is disconnected
during the operation the resulting state of the stored cloud objects may be
inconsistent—no longer containing all of the old version, but not yet containing
all of the new version. A mutable share can be left in an inconsistent state
even by the existing Tahoe-LAFS disk backend if it fails during a write, but
that has a smaller chance of occurrence because the current client behavior
leads to mutable shares being written to disk in a single system call.

The best fix for this issue probably requires changing the Tahoe-LAFS storage
protocol, perhaps by extending it to use a two-phase or three-phase commit
(ticket #1755).



References
==========

¹ omitted

.. _²:

² “Amazon S3” Amazon (2012)

   https://aws.amazon.com/s3/

.. _³:

³ “Rackspace Cloud Files” Rackspace (2012)

   https://www.rackspace.com/cloud/cloud_hosting_products/files/

.. _⁴:

⁴ “Google Cloud Storage” Google (2012)

   https://developers.google.com/storage/

.. _⁵:

⁵ “Windows Azure Storage” Microsoft (2012)

   https://www.windowsazure.com/en-us/develop/net/fundamentals/cloud-storage/

.. _⁶:

⁶ “Amazon Simple Storage Service (Amazon S3) API Reference: REST API” Amazon (2012)

   http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html

.. _⁷:

⁷ “OpenStack Object Storage” openstack.org (2012)

   http://openstack.org/projects/storage/

.. _⁸:

⁸ “Google Cloud Storage Reference Guide” Google (2012)

   https://developers.google.com/storage/docs/reference-guide

.. _⁹:

⁹ “Windows Azure Storage Services REST API Reference” Microsoft (2012)

   http://msdn.microsoft.com/en-us/library/windowsazure/dd179355.aspx

.. _¹⁰:

¹⁰ “Representational state transfer” English Wikipedia (2012)

    https://en.wikipedia.org/wiki/Representational_state_transfer

.. _¹¹:

¹¹ “Performance costs for some common operations” tahoe-lafs.org (2012)

    https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/performance.rst