=============================================================
Redundant Array of Independent Clouds: Share To Cloud Mapping
=============================================================


Introduction
============

This document describes a proposed design for the mapping of LAFS shares to
objects in a cloud storage service. It also analyzes the costs of meeting each
functional requirement, in terms of network, disk, memory, cloud storage, and
API usage.


Terminology
===========

*LAFS share*
   A Tahoe-LAFS share representing part of a file after encryption and
   erasure encoding.

*LAFS shareset*
   The set of shares stored by a LAFS storage server for a given storage index.
   The shares within a shareset are numbered by a small integer.

*Cloud storage service*
   A service that provides cloud storage, such as Amazon S3 `²`_, Rackspace
   Cloud Files `³`_, Google Cloud Storage `⁴`_, or Windows Azure `⁵`_.

*Cloud storage interface*
   A protocol interface supported by a cloud storage service, such as the
   S3 interface `⁶`_, the OpenStack Object Storage interface `⁷`_, the
   Google Cloud Storage interface `⁸`_, or the Azure interface `⁹`_. There may be
   multiple services implementing a given cloud storage interface. In this design,
   only REST-based APIs `¹⁰`_ over HTTP will be used as interfaces.

*Cloud object*
   A file-like abstraction provided by a cloud storage service, storing a
   sequence of bytes. Cloud objects are mutable in the sense that the contents
   and metadata of the cloud object with a given name in a given cloud container
   can be replaced. Cloud objects are called “blobs” in the Azure interface,
   and “objects” in the other interfaces.

*Cloud container*
   A container for cloud objects provided by a cloud service. Cloud containers
   are called “buckets” in the S3 and Google Cloud Storage interfaces, and
   “containers” in the Azure and OpenStack Object Storage interfaces.


Functional Requirements
=======================

* *Upload*: a LAFS share can be uploaded to an appropriately configured
  Tahoe-LAFS storage server and the data is stored to the cloud
  storage service.

 * *Scalable shares*: there is no hard limit on the size of LAFS share
   that can be uploaded.

   If the cloud storage interface offers scalable files, then this could be
   implemented by using that feature of the specific cloud storage
   interface. Alternatively, it could be implemented by mapping from the LAFS
   abstraction of an unlimited-size immutable share to a set of size-limited
   cloud objects.

 * *Streaming upload*: the size of the LAFS share that is uploaded
   can exceed the amount of RAM and even the amount of direct attached
   storage on the storage server. That is, the storage server is required to
   stream the data directly to the ultimate cloud storage service while
   processing it, rather than buffering the data until the client has finished
   uploading and then transferring it to the cloud storage service.

* *Download*: a LAFS share can be downloaded from an appropriately
  configured Tahoe-LAFS storage server, and the data is loaded from the
  cloud storage service.

 * *Streaming download*: the size of the LAFS share that is
   downloaded can exceed the amount of RAM and even the amount of direct
   attached storage on the storage server. That is, the storage server is
   required to stream the data directly to the client while processing it,
   rather than buffering the data until the cloud storage service has finished
   serving it and then transferring it to the client.

* *Modify*: a LAFS share can have part of its contents modified.

  If the cloud storage interface offers scalable mutable files, then this
  could be implemented by using that feature of the specific cloud storage
  interface. Alternatively, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size mutable share to a set of size-limited
  cloud objects.

 * *Efficient modify*: the size of the LAFS share being
   modified can exceed the amount of RAM and even the amount of direct
   attached storage on the storage server. That is, the storage server is
   required to download, patch, and upload only the segment(s) of the share
   that are being modified, rather than downloading, patching, and uploading
   the entire share.

* *Tracking leases*: the Tahoe-LAFS storage server is required to track when
  each share has its lease renewed so that unused shares (shares whose lease
  has not been renewed within a time limit, e.g. 30 days) can be garbage
  collected. This does not necessarily require code specific to each cloud
  storage interface, because the lease tracking can be performed in the
  storage server's generic component rather than in the component supporting
  each interface.


Mapping
=======

This section describes the mapping between LAFS shares and cloud objects.

A LAFS share will be split into one or more “chunks” that are each stored in a
cloud object. A LAFS share of size `C` bytes will be stored as `ceiling(C / chunksize)`
chunks. The last chunk has a size between 1 and `chunksize` bytes inclusive.
(It is not possible for `C` to be zero, because valid shares always have a header,
so there is at least one chunk for each share.)

For an existing share, the chunk size is determined by the size of the first
chunk. For a new share, it is a parameter that may depend on the storage
interface. It is an error for any chunk to be larger than the first chunk, or
for any chunk other than the last to be smaller than the first chunk.
If a mutable share with total size less than the default chunk size for the
storage interface is being modified, the new contents are split using the
default chunk size.

  *Rationale*: this design allows the `chunksize` parameter to be changed for
  new shares written via a particular storage interface, without breaking
  compatibility with existing stored shares. All cloud storage interfaces
  return the sizes of cloud objects with requests to list objects, and so
  the size of the first chunk can be determined without an additional request.

The name of the cloud object for chunk `i` >= 0 of a LAFS share with storage index
`STORAGEINDEX` and share number `SHNUM`, will be

  shares/`ST`/`STORAGEINDEX`/`SHNUM.i`

where `ST` is the first two characters of `STORAGEINDEX`. When `i` is 0, the
`.0` is omitted.

  *Rationale*: this layout maintains compatibility with data stored by the
  prototype S3 backend, for which Least Authority Enterprises has existing
  customers. This prototype always used a single cloud object to store each
  share, with name

    shares/`ST`/`STORAGEINDEX`/`SHNUM`

  By using the same prefix “shares/`ST`/`STORAGEINDEX`/” for old and new layouts,
  the storage server can obtain a list of cloud objects associated with a given
  shareset without having to know the layout in advance, and without having to
  make multiple API requests. This also simplifies sharing of test code between the
  disk and cloud backends.

Mutable and immutable shares will be “chunked” in the same way.
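
As an illustration of the chunk count and naming rules above, here is a
minimal Python sketch; the function and parameter names are invented for
this example and are not part of the proposed implementation::

  import math

  def chunk_object_names(storageindex, shnum, share_size, chunksize):
      """Return the cloud object names for a share of share_size bytes,
      stored as ceiling(share_size / chunksize) chunks. share_size cannot
      be zero, because valid shares always have a header."""
      assert share_size > 0
      prefix = "shares/%s/%s/%d" % (storageindex[:2], storageindex, shnum)
      num_chunks = int(math.ceil(share_size / float(chunksize)))
      # Chunk 0 omits the ".0" suffix, matching the single-object layout
      # used by the prototype S3 backend.
      return [prefix if i == 0 else "%s.%d" % (prefix, i)
              for i in range(num_chunks)]

For example, a 25 MiB share with a 10 MiB chunk size is stored as three cloud
objects, the last holding the final 5 MiB.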


Rationale for Chunking
----------------------

Limiting the amount of data received or sent in a single request has the
following advantages:

* It is unnecessary to write separate code to take advantage of the
  “large object” features of each cloud storage interface, which differ
  significantly in their design.
* Data needed for each PUT request can be discarded after it completes.
  If a PUT request fails, it can be retried while only holding the data
  for that request in memory.


Costs
=====

In this section we analyze the costs of the proposed design in terms of network,
disk, memory, cloud storage, and API usage.


Network usage—bandwidth and number-of-round-trips
-------------------------------------------------

When a Tahoe-LAFS storage client allocates a new share on a storage server,
the backend will request a list of the existing cloud objects with the
appropriate prefix. This takes one HTTP request in the common case, but may
take more for the S3 interface, which has a limit of 1000 objects returned in
a single “GET Bucket” request.
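
For the S3 interface the listing must therefore be paginated. A minimal
sketch of such a listing loop, assuming a hypothetical adapter method
`list_objects(prefix, marker)` (invented for this example) that wraps one
listing request and returns the page of names plus a truncation flag::

  def list_shareset_objects(adapter, storageindex):
      # List every cloud object for a shareset, using the common
      # "shares/ST/STORAGEINDEX/" prefix described in the Mapping section.
      prefix = "shares/%s/%s/" % (storageindex[:2], storageindex)
      names = []
      marker = None
      while True:
          # S3 returns at most 1000 objects per "GET Bucket" request; the
          # last name seen is passed as the marker to fetch the next page.
          page, truncated = adapter.list_objects(prefix, marker)
          names.extend(page)
          if not truncated:
              return names
          marker = page[-1]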

If the share is to be read, the client will make a number of calls, each
specifying the offset and length of the required span of bytes. On the first
request that overlaps a given chunk of the share, the server will make an
HTTP GET request for that cloud object. The server may also speculatively
make GET requests for cloud objects that are likely to be needed soon (which
can be predicted since reads are normally sequential), in order to reduce
latency.

Each read will be satisfied as soon as the corresponding data is available,
without waiting for the rest of the chunk, in order to minimize read latency.

All four cloud storage interfaces support GET requests using the
Range HTTP header. This could be used to optimize reads where the
Tahoe-LAFS storage client requires only part of a share.
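
As a sketch of how a partial-share read maps onto chunk GETs, the following
invented helper (not the actual implementation) computes which chunks a
requested span touches and the Range header value for each; the same
calculation applies when deciding which chunks an efficient modify must
download and patch::

  def chunk_ranges(offset, length, chunksize):
      """Map a read of length bytes at offset within a share to a list of
      (chunk_index, range_header) pairs. HTTP Range end offsets are
      inclusive."""
      spans = []
      end = offset + length  # exclusive end of the requested span
      for i in range(offset // chunksize, (end - 1) // chunksize + 1):
          lo = max(offset, i * chunksize) - i * chunksize
          hi = min(end, (i + 1) * chunksize) - i * chunksize
          spans.append((i, "bytes=%d-%d" % (lo, hi - 1)))
      return spans

For example, a read of 2000000 bytes at offset 9000000 with a chunk size of
10000000 touches chunk 0 (`bytes=9000000-9999999`) and chunk 1
(`bytes=0-999999`).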

If the share is to be written, the server will make an HTTP PUT request for
each chunk that has been completed. Tahoe-LAFS clients only write immutable
shares sequentially, and so we can rely on that property to simplify the
implementation.
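
Because immutable shares are written sequentially, the server need buffer at
most one chunk at a time, flushing each chunk with a PUT as soon as it is
complete. A sketch of this, assuming a hypothetical
`adapter.put_object(name, data)` method (invented for this example)::

  class ChunkedShareWriter:
      """Accumulate sequential writes and PUT each chunk once it is full."""
      def __init__(self, adapter, object_names, chunksize):
          self.adapter = adapter
          self.names = object_names
          self.chunksize = chunksize
          self.buf = b""
          self.index = 0

      def write(self, data):
          self.buf += data
          while len(self.buf) >= self.chunksize:
              # PUT a completed chunk; its data can then be discarded.
              self.adapter.put_object(self.names[self.index],
                                      self.buf[:self.chunksize])
              self.buf = self.buf[self.chunksize:]
              self.index += 1

      def close(self):
          # The last chunk may hold between 1 and chunksize bytes.
          if self.buf:
              self.adapter.put_object(self.names[self.index], self.buf)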

When modifying shares of an existing mutable file, the storage server will
be able to make PUT requests only for chunks that have changed.
(Current Tahoe-LAFS v1.9 clients will not take advantage of this ability, but
future versions will probably do so for MDMF files.)

In some cases, it may be necessary to retry a request (see the `Structure of
Implementation`_ section below). In the case of a PUT request, at the point
at which a retry is needed, the new chunk contents to be stored will still be
in memory and so this is not problematic.

In the absence of retries, the maximum number of GET requests that will be made
when downloading a file, or the maximum number of PUT requests when uploading
or modifying a file, will be equal to the number of chunks in the file.

If the new mutable share content has fewer chunks than the old content,
then the remaining cloud objects for old chunks must be deleted (using one
HTTP request each). When reading a share, the backend must tolerate the case
where these cloud objects have not been deleted successfully.

The last write to a share will be reported as successful only when all
corresponding HTTP PUTs and DELETEs have completed successfully.
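
Since Tahoe-LAFS is built on Twisted, one plausible shape for this (a sketch,
not the actual code) is to gather the Deferreds for the outstanding requests
and fire the overall result only when all of them have succeeded::

  from twisted.internet.defer import gatherResults

  def finish_write(put_deferreds, delete_deferreds):
      # Fires only when every PUT and DELETE has completed successfully;
      # any failure propagates, so the write is not reported as successful.
      return gatherResults(put_deferreds + delete_deferreds)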



Disk usage (local to the storage server)
----------------------------------------

It is never necessary for the storage server to write the content of share
chunks to local disk, either when they are read or when they are written. Each
chunk is held only in memory.

A proposed change to the Tahoe-LAFS storage server implementation uses an SQLite
database to store metadata about shares. In that case the same database would
be used for the cloud backend. This would enable lease tracking to be implemented
in the same way for disk and cloud backends.
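
A minimal sketch of what such lease tracking might look like, using an
invented `leases` table (the schema in the proposed change may differ)::

  import sqlite3
  import time

  db = sqlite3.connect("shares.db")
  db.execute("CREATE TABLE IF NOT EXISTS leases"
             " (storage_index TEXT, shnum INTEGER, renewal_time REAL,"
             "  PRIMARY KEY (storage_index, shnum))")

  def renew_lease(storage_index, shnum):
      db.execute("INSERT OR REPLACE INTO leases VALUES (?, ?, ?)",
                 (storage_index, shnum, time.time()))
      db.commit()

  def expired_shares(age_limit=30 * 24 * 3600):
      # Shares whose lease has not been renewed within the limit
      # (e.g. 30 days) are candidates for garbage collection.
      cutoff = time.time() - age_limit
      return db.execute("SELECT storage_index, shnum FROM leases"
                        " WHERE renewal_time < ?", (cutoff,)).fetchall()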


Memory usage
------------

The use of chunking simplifies bounding the memory usage of the storage server
when handling files that may be larger than memory. However, this depends on
limiting the number of chunks that are simultaneously held in memory.
Multiple chunks can be held in memory either because of pipelining of requests
for a single share, or because multiple shares are being read or written
(possibly by multiple clients).

For immutable shares, the Tahoe-LAFS storage protocol requires the client to
specify in advance the maximum amount of data it will write. Also, a cooperative
client (including all existing released versions of the Tahoe-LAFS code) will
limit the amount of data that is pipelined, currently to 50 KiB. Since the chunk
size will be greater than that, it is possible to ensure that, for each allocation,
the maximum memory used for chunk data is the lesser of two chunks' worth of data
and the allocation size. (There is some additional overhead but it is small
compared to the chunk data.) If the maximum memory usage of a new allocation
would exceed the memory available, the allocation can be delayed or possibly
denied, so that the total memory usage is bounded.
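
A sketch of the bound described above, with invented names (`chunksize`,
`memory_limit`) rather than actual configuration parameters::

  def max_chunk_memory(allocated_size, chunksize):
      # Per-allocation bound: at most two chunks' worth of data need be in
      # memory at once (one being filled by pipelined writes and one being
      # PUT), and never more than the allocation itself.
      return min(2 * chunksize, allocated_size)

  def can_admit(allocated_size, current_usage, chunksize, memory_limit):
      # Delay or deny the allocation if admitting it could exceed the
      # server's memory budget.
      return (current_usage + max_chunk_memory(allocated_size, chunksize)
              <= memory_limit)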

It is not clear that the existing protocol allows allocations for mutable
shares to be bounded in general; this may be addressed in a future protocol change.

The above discussion assumes that clients do not maliciously send large
messages as a denial-of-service attack. Foolscap (the protocol layer underlying
the Tahoe-LAFS storage protocol) does not attempt to resist denial of service.


Storage
-------

The storage requirements, including not-yet-collected garbage shares, are
the same as for the Tahoe-LAFS disk backend. That is, the total size of cloud
objects stored is equal to the total size of shares that the disk backend
would store.

Erasure coding causes the total size of the shares for each file to be a
factor `shares.total` / `shares.needed` times the file size, plus overhead
that is logarithmic in the file size `¹¹`_.
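
For example, with the default encoding parameters `shares.needed` = 3 and
`shares.total` = 10, the shares for a 30 MB file total roughly
(10 / 3) × 30 MB = 100 MB, plus the logarithmic overhead.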


API usage
---------

Cloud storage services typically charge a small fee per API request. The number of
requests to the cloud storage service for various operations is discussed under
“network usage” above.


Structure of Implementation
===========================

A generic “cloud backend”, based on the prototype S3 backend but with support
for chunking as described above, will be written.

An instance of the cloud backend can be attached to one of several
“cloud interface adapters”, one for each cloud storage interface. These
adapters will operate only on chunks, and need not distinguish between
mutable and immutable shares. They will be a relatively “thin” abstraction
layer over the HTTP APIs of each cloud storage interface, similar to the
S3Bucket abstraction in the prototype.

For some cloud storage services it may be necessary to transparently retry
requests in order to recover from transient failures. (Although the erasure
coding may enable a file to be retrieved even when shares are not stored by or
not readable from all cloud storage services used in a Tahoe-LAFS grid, it may
be desirable to retry cloud storage service requests in order to improve overall
reliability.) Support for this will be implemented in the generic cloud backend,
and used whenever a cloud interface adapter reports a transient failure. Our
experience with the prototype suggests that it is necessary to retry on transient
failures for Amazon's S3 service.
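
A minimal sketch of the kind of retry loop the generic cloud backend might
use, assuming a hypothetical `TransientError` exception raised by adapters
for failures worth retrying (both names are invented for this example)::

  import time

  class TransientError(Exception):
      """Raised by a cloud interface adapter for a retryable failure."""

  def call_with_retries(op, max_tries=5, first_delay=0.5):
      # Retry op() on transient failures with exponential backoff;
      # non-transient failures and the final transient failure propagate.
      delay = first_delay
      for attempt in range(max_tries):
          try:
              return op()
          except TransientError:
              if attempt == max_tries - 1:
                  raise
              time.sleep(delay)
              delay *= 2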

There will also be a “mock” cloud interface adapter, based on the prototype's
MockS3Bucket. This allows tests of the generic cloud backend to be run without
a connection to a real cloud service. The mock adapter will be able to simulate
transient and non-transient failures.
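
One possible shape for such a mock, again with invented names (the real
MockS3Bucket differs), reusing the `TransientError` from the sketch above::

  class MockContainer:
      """In-memory stand-in for a cloud container that can inject
      failures for testing."""
      def __init__(self, fail_once_names=()):
          self.objects = {}
          self._fail_once = set(fail_once_names)

      def put_object(self, name, data):
          if name in self._fail_once:
              # Simulate a transient failure on the first PUT of this name.
              self._fail_once.remove(name)
              raise TransientError("simulated failure on %s" % name)
          self.objects[name] = data

      def get_object(self, name):
          return self.objects[name]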


Known Issues
============

This design worsens a known “write hole” issue in Tahoe-LAFS when updating
the contents of mutable files. An update to a mutable file can require changing
the contents of multiple chunks, and if the client fails or is disconnected
during the operation the resulting state of the stored cloud objects may be
inconsistent—no longer containing all of the old version, but not yet containing
all of the new version. A mutable share can be left in an inconsistent state
even by the existing Tahoe-LAFS disk backend if it fails during a write, but
that has a smaller chance of occurrence because the current client behavior
leads to mutable shares being written to disk in a single system call.

The best fix for this issue probably requires changing the Tahoe-LAFS storage
protocol, perhaps by extending it to use a two-phase or three-phase commit
(ticket #1755).



References
==========

¹ omitted

.. _²:

² “Amazon S3” Amazon (2012)

   https://aws.amazon.com/s3/

.. _³:

³ “Rackspace Cloud Files” Rackspace (2012)

   https://www.rackspace.com/cloud/cloud_hosting_products/files/

.. _⁴:

⁴ “Google Cloud Storage” Google (2012)

   https://developers.google.com/storage/

.. _⁵:

⁵ “Windows Azure Storage” Microsoft (2012)

   https://www.windowsazure.com/en-us/develop/net/fundamentals/cloud-storage/

.. _⁶:

⁶ “Amazon Simple Storage Service (Amazon S3) API Reference: REST API” Amazon (2012)

   http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html

.. _⁷:

⁷ “OpenStack Object Storage” openstack.org (2012)

   http://openstack.org/projects/storage/

.. _⁸:

⁸ “Google Cloud Storage Reference Guide” Google (2012)

   https://developers.google.com/storage/docs/reference-guide

.. _⁹:

⁹ “Windows Azure Storage Services REST API Reference” Microsoft (2012)

   http://msdn.microsoft.com/en-us/library/windowsazure/dd179355.aspx

.. _¹⁰:

¹⁰ “Representational state transfer” English Wikipedia (2012)

    https://en.wikipedia.org/wiki/Representational_state_transfer

.. _¹¹:

¹¹ “Performance costs for some common operations” tahoe-lafs.org (2012)

    https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/performance.rst