source: trunk/docs/specifications/backends/raic.rst

Last change on this file was 91047bf, checked in by Brian Warner <warner@…>, at 2016-12-12T21:57:28Z

.. -*- coding: utf-8-with-signature -*-

=============================================================
Redundant Array of Independent Clouds: Share To Cloud Mapping
=============================================================


Introduction
============

This document describes a proposed design for the mapping of LAFS shares to
objects in a cloud storage service. It also analyzes the costs for each of the
functional requirements, including network, disk, storage and API usage costs.


Terminology
===========

*LAFS share*
   A Tahoe-LAFS share representing part of a file after encryption and
   erasure encoding.

*LAFS shareset*
   The set of shares stored by a LAFS storage server for a given storage index.
   The shares within a shareset are numbered by a small integer.

*Cloud storage service*
   A service such as Amazon S3 `²`_, Rackspace Cloud Files `³`_,
   Google Cloud Storage `⁴`_, or Windows Azure `⁵`_, that provides cloud storage.

*Cloud storage interface*
   A protocol interface supported by a cloud storage service, such as the
   S3 interface `⁶`_, the OpenStack Object Storage interface `⁷`_, the
   Google Cloud Storage interface `⁸`_, or the Azure interface `⁹`_. There may be
   multiple services implementing a given cloud storage interface. In this design,
   only REST-based APIs `¹⁰`_ over HTTP will be used as interfaces.

*Store object*
   A file-like abstraction provided by a cloud storage service, storing a
   sequence of bytes. Store objects are mutable in the sense that the contents
   and metadata of the store object with a given name in a given backend store
   can be replaced. Store objects are called “blobs” in the Azure interface,
   and “objects” in the other interfaces.

*Cloud backend store*
   A container for store objects provided by a cloud service. Cloud backend
   stores are called “buckets” in the S3 and Google Cloud Storage interfaces,
   and “containers” in the Azure and OpenStack Storage interfaces.

Functional Requirements
=======================

* *Upload*: a LAFS share can be uploaded to an appropriately configured
  Tahoe-LAFS storage server and the data is stored to the cloud
  storage service.

 * *Scalable shares*: there is no hard limit on the size of LAFS share
   that can be uploaded.

   If the cloud storage interface offers scalable files, then this could be
   implemented by using that feature of the specific cloud storage
   interface. Alternately, it could be implemented by mapping from the LAFS
   abstraction of an unlimited-size immutable share to a set of size-limited
   store objects.

 * *Streaming upload*: the size of the LAFS share that is uploaded
   can exceed the amount of RAM and even the amount of direct attached
   storage on the storage server. I.e., the storage server is required to
   stream the data directly to the ultimate cloud storage service while
   processing it, rather than buffering the data until the client is finished
   uploading and then transferring it to the cloud storage service.

* *Download*: a LAFS share can be downloaded from an appropriately
  configured Tahoe-LAFS storage server, and the data is loaded from the
  cloud storage service.

 * *Streaming download*: the size of the LAFS share that is
   downloaded can exceed the amount of RAM and even the amount of direct
   attached storage on the storage server. I.e., the storage server is
   required to stream the data directly to the client while processing it,
   rather than buffering the data until the cloud storage service is finished
   serving and then transferring it to the client.

* *Modify*: a LAFS share can have part of its contents modified.

  If the cloud storage interface offers scalable mutable files, then this
  could be implemented by using that feature of the specific cloud storage
  interface. Alternately, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size mutable share to a set of size-limited
  store objects.

 * *Efficient modify*: the size of the LAFS share being
   modified can exceed the amount of RAM and even the amount of direct
   attached storage on the storage server. I.e., the storage server is
   required to download, patch, and upload only the segment(s) of the share
   that are being modified, rather than downloading, patching, and uploading
   the entire share.

* *Tracking leases*: The Tahoe-LAFS storage server is required to track when
  each share has its lease renewed so that unused shares (shares whose lease
  has not been renewed within a time limit, e.g. 30 days) can be garbage
  collected. This does not necessarily require code specific to each cloud
  storage interface, because the lease tracking can be performed in the
  storage server's generic component rather than in the component supporting
  each interface.

Mapping
=======

This section describes the mapping between LAFS shares and store objects.

A LAFS share will be split into one or more “chunks” that are each stored in a
store object. A LAFS share of size `C` bytes will be stored as `ceiling(C / chunksize)`
chunks. The last chunk has a size between 1 and `chunksize` bytes inclusive.
(It is not possible for `C` to be zero, because valid shares always have a header,
so there is at least one chunk for each share.)
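
The chunk arithmetic above can be sketched as follows (an illustrative helper, not part of the Tahoe-LAFS code; `chunksize` is whatever value the storage interface chooses):

```python
def chunk_layout(share_size, chunksize):
    """Return (number_of_chunks, last_chunk_size) for a share of
    `share_size` bytes split into chunks of `chunksize` bytes.
    A valid share always has a header, so it is never empty."""
    assert share_size > 0
    # ceiling(share_size / chunksize), computed in integer arithmetic
    num_chunks = (share_size + chunksize - 1) // chunksize
    # The last chunk holds the remainder: between 1 and chunksize bytes.
    last_size = share_size - (num_chunks - 1) * chunksize
    return num_chunks, last_size
```

For example, a 1025-byte share with a 1024-byte chunk size occupies two store objects, the second holding a single byte.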

For an existing share, the chunk size is determined by the size of the first
chunk. For a new share, it is a parameter that may depend on the storage
interface. It is an error for any chunk to be larger than the first chunk, or
for any chunk other than the last to be smaller than the first chunk.
If a mutable share with total size less than the default chunk size for the
storage interface is being modified, the new contents are split using the
default chunk size.

  *Rationale*: this design allows the `chunksize` parameter to be changed for
  new shares written via a particular storage interface, without breaking
  compatibility with existing stored shares. All cloud storage interfaces
  return the sizes of store objects with requests to list objects, and so
  the size of the first chunk can be determined without an additional request.

The name of the store object for chunk `i` > 0 of a LAFS share with storage index
`STORAGEINDEX` and share number `SHNUM` will be

  shares/`ST`/`STORAGEINDEX`/`SHNUM.i`

where `ST` is the first two characters of `STORAGEINDEX`. When `i` is 0, the
`.0` is omitted.

  *Rationale*: this layout maintains compatibility with data stored by the
  prototype S3 backend, for which Least Authority Enterprises has existing
  customers. This prototype always used a single store object to store each
  share, with name

    shares/`ST`/`STORAGEINDEX`/`SHNUM`

  By using the same prefix “shares/`ST`/`STORAGEINDEX`/” for old and new layouts,
  the storage server can obtain a list of store objects associated with a given
  shareset without having to know the layout in advance, and without having to
  make multiple API requests. This also simplifies sharing of test code between the
  disk and cloud backends.
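
The naming scheme can be sketched as a small helper (the function name is hypothetical, and the storage index is assumed to be available as a string):

```python
def store_object_name(storageindex, shnum, i):
    """Return the store object name for chunk `i` of share `shnum`
    in the shareset with the given storage index."""
    prefix = "shares/%s/%s/" % (storageindex[:2], storageindex)
    if i == 0:
        # Chunk 0 omits the ".0" suffix, matching the prototype
        # S3 backend's one-object-per-share layout.
        return prefix + str(shnum)
    return "%s%d.%d" % (prefix, shnum, i)
```

Listing objects under the common prefix `shares/ST/STORAGEINDEX/` then returns both old-layout and new-layout names.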

Mutable and immutable shares will be “chunked” in the same way.


Rationale for Chunking
----------------------

Limiting the amount of data received or sent in a single request has the
following advantages:

* It is unnecessary to write separate code to take advantage of the
  “large object” features of each cloud storage interface, which differ
  significantly in their design.
* Data needed for each PUT request can be discarded after it completes.
  If a PUT request fails, it can be retried while only holding the data
  for that request in memory.


Costs
=====

In this section we analyze the costs of the proposed design in terms of network,
disk, memory, cloud storage, and API usage.


Network usage—bandwidth and number-of-round-trips
-------------------------------------------------

When a Tahoe-LAFS storage client allocates a new share on a storage server,
the backend will request a list of the existing store objects with the
appropriate prefix. This takes one HTTP request in the common case, but may
take more for the S3 interface, which has a limit of 1000 objects returned in
a single “GET Bucket” request.

If the share is to be read, the client will make a number of calls, each
specifying the offset and length of the required span of bytes. On the first
request that overlaps a given chunk of the share, the server will make an
HTTP GET request for that store object. The server may also speculatively
make GET requests for store objects that are likely to be needed soon (which
can be predicted since reads are normally sequential), in order to reduce
latency.

Each read will be satisfied as soon as the corresponding data is available,
without waiting for the rest of the chunk, in order to minimize read latency.

All four cloud storage interfaces support GET requests using the
Range HTTP header. This could be used to optimize reads where the
Tahoe-LAFS storage client requires only part of a share.
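
As a sketch of how a requested byte span maps onto chunk-sized store objects when the Range optimization is used (the helper and its name are illustrative, not taken from the Tahoe-LAFS code):

```python
def chunks_for_span(offset, length, chunksize):
    """Return (chunk_index, Range header value) pairs needed to
    satisfy a read of `length` bytes starting at `offset`."""
    if length <= 0:
        return []
    first = offset // chunksize
    last = (offset + length - 1) // chunksize
    requests = []
    for i in range(first, last + 1):
        chunk_start = i * chunksize
        # Intersect the requested span with this chunk's extent.
        lo = max(offset, chunk_start)
        hi = min(offset + length, chunk_start + chunksize)
        # HTTP byte ranges are inclusive and relative to the store object.
        requests.append((i, "bytes=%d-%d" % (lo - chunk_start,
                                             hi - 1 - chunk_start)))
    return requests
```

A span that straddles a chunk boundary thus costs one GET per chunk touched, consistent with the per-chunk request counts discussed below.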

If the share is to be written, the server will make an HTTP PUT request for
each chunk that has been completed. Tahoe-LAFS clients only write immutable
shares sequentially, and so we can rely on that property to simplify the
implementation.

When modifying shares of an existing mutable file, the storage server will
be able to make PUT requests only for chunks that have changed.
(Current Tahoe-LAFS v1.9 clients will not take advantage of this ability, but
future versions will probably do so for MDMF files.)

In some cases, it may be necessary to retry a request (see the `Structure of
Implementation`_ section below). In the case of a PUT request, at the point
at which a retry is needed, the new chunk contents to be stored will still be
in memory and so this is not problematic.

In the absence of retries, the maximum number of GET requests that will be made
when downloading a file, or the maximum number of PUT requests when uploading
or modifying a file, will be equal to the number of chunks in the file.

If the new mutable share content has fewer chunks than the old content,
then the remaining store objects for old chunks must be deleted (using one
HTTP request each). When reading a share, the backend must tolerate the case
where these store objects have not been deleted successfully.
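
The set of store objects that become stale when a mutable share shrinks can be sketched as follows (helper names are hypothetical; the naming and chunk arithmetic follow the Mapping section above):

```python
def stale_chunk_names(storageindex, shnum, old_size, new_size, chunksize):
    """Names of store objects holding chunks of the old share contents
    that are no longer needed once the share shrinks to `new_size` bytes.
    Each returned object costs one HTTP DELETE request to remove."""
    def num_chunks(size):
        return (size + chunksize - 1) // chunksize
    prefix = "shares/%s/%s/" % (storageindex[:2], storageindex)
    names = []
    for i in range(num_chunks(new_size), num_chunks(old_size)):
        # Chunk 0 (which omits the ".0" suffix) always survives, so
        # every stale chunk name carries an explicit ".i" suffix.
        names.append("%s%d.%d" % (prefix, shnum, i))
    return names
```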

The last write to a share will be reported as successful only when all
corresponding HTTP PUTs and DELETEs have completed successfully.



Disk usage (local to the storage server)
----------------------------------------

It is never necessary for the storage server to write the content of share
chunks to local disk, either when they are read or when they are written. Each
chunk is held only in memory.

A proposed change to the Tahoe-LAFS storage server implementation uses an SQLite
database to store metadata about shares. In that case the same database would
be used for the cloud backend. This would enable lease tracking to be implemented
in the same way for disk and cloud backends.

Memory usage
------------

The use of chunking simplifies bounding the memory usage of the storage server
when handling files that may be larger than memory. However, this depends on
limiting the number of chunks that are simultaneously held in memory.
Multiple chunks can be held in memory either because of pipelining of requests
for a single share, or because multiple shares are being read or written
(possibly by multiple clients).

For immutable shares, the Tahoe-LAFS storage protocol requires the client to
specify in advance the maximum amount of data it will write. Also, a cooperative
client (including all existing released versions of the Tahoe-LAFS code) will
limit the amount of data that is pipelined, currently to 50 KiB. Since the chunk
size will be greater than that, it is possible to ensure that for each allocation,
the maximum chunk data memory usage is the lesser of two chunks and the allocation
size. (There is some additional overhead but it is small compared to the chunk
data.) If the maximum memory usage of a new allocation would exceed the memory
available, the allocation can be delayed or possibly denied, so that the total
memory usage is bounded.
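
The admission check described above might be sketched like this (a simplified model; the function names and bookkeeping are illustrative, not the Tahoe-LAFS implementation):

```python
def allocation_memory_bound(allocated_size, chunksize):
    """Worst-case chunk-data memory for one immutable allocation:
    the lesser of two chunks and the full allocation size."""
    return min(2 * chunksize, allocated_size)

def can_admit(current_usage, allocated_size, chunksize, memory_limit):
    """Admit a new allocation only if its worst-case usage fits within
    the remaining memory budget; otherwise it is delayed or denied."""
    bound = allocation_memory_bound(allocated_size, chunksize)
    return current_usage + bound <= memory_limit
```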

It is not clear that the existing protocol allows allocations for mutable
shares to be bounded in general; this may be addressed in a future protocol change.

The above discussion assumes that clients do not maliciously send large
messages as a denial-of-service attack. Foolscap (the protocol layer underlying
the Tahoe-LAFS storage protocol) does not attempt to resist denial of service.


Storage
-------

The storage requirements, including not-yet-collected garbage shares, are
the same as for the Tahoe-LAFS disk backend. That is, the total size of cloud
objects stored is equal to the total size of shares that the disk backend
would store.

Erasure coding causes the size of shares for each file to be a
factor `shares.total` / `shares.needed` times the file size, plus overhead
that is logarithmic in the file size `¹¹`_.
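
For example, ignoring the logarithmic overhead, the aggregate share size follows directly from the encoding parameters (the defaults `shares.needed = 3` and `shares.total = 10` below are Tahoe-LAFS's usual defaults; the helper name is illustrative):

```python
def total_share_bytes(file_size, shares_needed=3, shares_total=10):
    """Approximate total size of all shares for a file, ignoring the
    per-share overhead that is logarithmic in the file size."""
    return file_size * shares_total // shares_needed
```

With the default 3-of-10 encoding, a 3 MB file expands to roughly 10 MB of stored share data across the grid.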


API usage
---------

Cloud storage backends typically charge a small fee per API request. The number of
requests to the cloud storage service for various operations is discussed under
“network usage” above.


Structure of Implementation
===========================

A generic “cloud backend”, based on the prototype S3 backend but with support
for chunking as described above, will be written.

An instance of the cloud backend can be attached to one of several
“cloud interface adapters”, one for each cloud storage interface. These
adapters will operate only on chunks, and need not distinguish between
mutable and immutable shares. They will be a relatively “thin” abstraction
layer over the HTTP APIs of each cloud storage interface, similar to the
S3Bucket abstraction in the prototype.

For some cloud storage services it may be necessary to transparently retry
requests in order to recover from transient failures. (Although the erasure
coding may enable a file to be retrieved even when shares are not stored by or
not readable from all cloud storage services used in a Tahoe-LAFS grid, it may
be desirable to retry cloud storage service requests in order to improve overall
reliability.) Support for this will be implemented in the generic cloud backend,
and used whenever a cloud interface adapter reports a transient failure. Our
experience with the prototype suggests that it is necessary to retry on transient
failures for Amazon's S3 service.
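
A retry wrapper of the kind described might look like the following sketch (the `TransientFailure` exception, backoff schedule, and function names are illustrative assumptions, not the actual Tahoe-LAFS code):

```python
import time

class TransientFailure(Exception):
    """Raised by a cloud interface adapter for a failure worth retrying."""

def with_retries(operation, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds, retrying on TransientFailure
    with exponential backoff; re-raise after the final attempt."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientFailure:
            if attempt == attempts - 1:
                raise
            # Back off exponentially: base_delay, 2*base_delay, ...
            sleep(base_delay * (2 ** attempt))
```

Because a retried PUT still holds its chunk contents in memory (see “Network usage” above), the wrapper needs no buffering of its own.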

There will also be a “mock” cloud interface adapter, based on the prototype's
MockS3Bucket. This allows tests of the generic cloud backend to be run without
a connection to a real cloud service. The mock adapter will be able to simulate
transient and non-transient failures.


Known Issues
============

This design worsens a known “write hole” issue in Tahoe-LAFS when updating
the contents of mutable files. An update to a mutable file can require
changing the contents of multiple chunks, and if the client fails or is
disconnected during the operation the resulting state of the store objects
for that share may be inconsistent—no longer containing all of the old version,
but not yet containing all of the new version. A mutable share can be left in
an inconsistent state even by the existing Tahoe-LAFS disk backend if it fails
during a write, but that has a smaller chance of occurrence because the current
client behavior leads to mutable shares being written to disk in a single
system call.

The best fix for this issue probably requires changing the Tahoe-LAFS storage
protocol, perhaps by extending it to use a two-phase or three-phase commit
(ticket #1755).



References
===========

¹ omitted

.. _²:

² “Amazon S3” Amazon (2012)

   https://aws.amazon.com/s3/

.. _³:

³ “Rackspace Cloud Files” Rackspace (2012)

   https://www.rackspace.com/cloud/cloud_hosting_products/files/

.. _⁴:

⁴ “Google Cloud Storage” Google (2012)

   https://developers.google.com/storage/

.. _⁵:

⁵ “Windows Azure Storage” Microsoft (2012)

   https://www.windowsazure.com/en-us/develop/net/fundamentals/cloud-storage/

.. _⁶:

⁶ “Amazon Simple Storage Service (Amazon S3) API Reference: REST API” Amazon (2012)

   http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html

.. _⁷:

⁷ “OpenStack Object Storage” openstack.org (2012)

   http://openstack.org/projects/storage/

.. _⁸:

⁸ “Google Cloud Storage Reference Guide” Google (2012)

   https://developers.google.com/storage/docs/reference-guide

.. _⁹:

⁹ “Windows Azure Storage Services REST API Reference” Microsoft (2012)

   http://msdn.microsoft.com/en-us/library/windowsazure/dd179355.aspx

.. _¹⁰:

¹⁰ “Representational state transfer” English Wikipedia (2012)

    https://en.wikipedia.org/wiki/Representational_state_transfer

.. _¹¹:

¹¹ “Performance costs for some common operations” tahoe-lafs.org (2012)

    :doc:`../../performance`