.. -*- coding: utf-8-with-signature -*-

=============================================================
Redundant Array of Independent Clouds: Share To Cloud Mapping
=============================================================


Introduction
============

This document describes a proposed design for the mapping of LAFS shares to
objects in a cloud storage service. It also analyzes the costs for each of the
functional requirements, including network, disk, storage and API usage costs.


Terminology
===========

*LAFS share*
  A Tahoe-LAFS share representing part of a file after encryption and
  erasure encoding.

*LAFS shareset*
  The set of shares stored by a LAFS storage server for a given storage index.
  The shares within a shareset are numbered by a small integer.

*Cloud storage service*
  A service such as Amazon S3 `²`_, Rackspace Cloud Files `³`_,
  Google Cloud Storage `⁴`_, or Windows Azure `⁵`_, that provides cloud storage.

*Cloud storage interface*
  A protocol interface supported by a cloud storage service, such as the
  S3 interface `⁶`_, the OpenStack Object Storage interface `⁷`_, the
  Google Cloud Storage interface `⁸`_, or the Azure interface `⁹`_. There may be
  multiple services implementing a given cloud storage interface. In this design,
  only REST-based APIs `¹⁰`_ over HTTP will be used as interfaces.

*Store object*
  A file-like abstraction provided by a cloud storage service, storing a
  sequence of bytes. Store objects are mutable in the sense that the contents
  and metadata of the store object with a given name in a given backend store
  can be replaced. Store objects are called “blobs” in the Azure interface,
  and “objects” in the other interfaces.

*Cloud backend store*
  A container for store objects provided by a cloud service. Cloud backend
  stores are called “buckets” in the S3 and Google Cloud Storage interfaces,
  and “containers” in the Azure and OpenStack Storage interfaces.


Functional Requirements
=======================

* *Upload*: a LAFS share can be uploaded to an appropriately configured
  Tahoe-LAFS storage server and the data is stored to the cloud
  storage service.

* *Scalable shares*: there is no hard limit on the size of LAFS share
  that can be uploaded.

  If the cloud storage interface offers scalable files, then this could be
  implemented by using that feature of the specific cloud storage
  interface. Alternately, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size immutable share to a set of size-limited
  store objects.

* *Streaming upload*: the size of the LAFS share that is uploaded
  can exceed the amount of RAM and even the amount of direct attached
  storage on the storage server. I.e., the storage server is required to
  stream the data directly to the ultimate cloud storage service while
  processing it, rather than buffering the data until the client has
  finished uploading and then transferring it to the cloud storage service.

* *Download*: a LAFS share can be downloaded from an appropriately
  configured Tahoe-LAFS storage server, and the data is loaded from the
  cloud storage service.

* *Streaming download*: the size of the LAFS share that is
  downloaded can exceed the amount of RAM and even the amount of direct
  attached storage on the storage server. I.e., the storage server is
  required to stream the data directly to the client while processing it,
  rather than buffering the data until the cloud storage service has
  finished serving it and then transferring it to the client.

* *Modify*: a LAFS share can have part of its contents modified.

  If the cloud storage interface offers scalable mutable files, then this
  could be implemented by using that feature of the specific cloud storage
  interface. Alternately, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size mutable share to a set of size-limited
  store objects.

* *Efficient modify*: the size of the LAFS share being
  modified can exceed the amount of RAM and even the amount of direct
  attached storage on the storage server. I.e., the storage server is
  required to download, patch, and upload only the segment(s) of the share
  that are being modified, rather than downloading, patching, and
  uploading the entire share.

* *Tracking leases*: the Tahoe-LAFS storage server is required to track when
  each share has its lease renewed so that unused shares (shares whose lease
  has not been renewed within a time limit, e.g. 30 days) can be garbage
  collected. This does not necessarily require code specific to each cloud
  storage interface, because the lease tracking can be performed in the
  storage server's generic component rather than in the component supporting
  each interface.


Mapping
=======

This section describes the mapping between LAFS shares and store objects.

A LAFS share will be split into one or more “chunks” that are each stored in a
store object. A LAFS share of size `C` bytes will be stored as `ceiling(C / chunksize)`
chunks. The last chunk has a size between 1 and `chunksize` bytes inclusive.
(It is not possible for `C` to be zero, because valid shares always have a header,
so there is at least one chunk for each share.)

For an existing share, the chunk size is determined by the size of the first
chunk. For a new share, it is a parameter that may depend on the storage
interface. It is an error for any chunk to be larger than the first chunk, or
for any chunk other than the last to be smaller than the first chunk.
If a mutable share with total size less than the default chunk size for the
storage interface is being modified, the new contents are split using the
default chunk size.

*Rationale*: this design allows the `chunksize` parameter to be changed for
new shares written via a particular storage interface, without breaking
compatibility with existing stored shares. All cloud storage interfaces
return the sizes of store objects with requests to list objects, and so
the size of the first chunk can be determined without an additional request.

The name of the store object for chunk `i` > 0 of a LAFS share with storage index
`STORAGEINDEX` and share number `SHNUM` will be

  shares/`ST`/`STORAGEINDEX`/`SHNUM.i`

where `ST` is the first two characters of `STORAGEINDEX`. When `i` is 0, the
`.0` is omitted.

*Rationale*: this layout maintains compatibility with data stored by the
prototype S3 backend, for which Least Authority Enterprises has existing
customers. This prototype always used a single store object to store each
share, with name

  shares/`ST`/`STORAGEINDEX`/`SHNUM`

By using the same prefix “shares/`ST`/`STORAGEINDEX`/” for old and new layouts,
the storage server can obtain a list of store objects associated with a given
shareset without having to know the layout in advance, and without having to
make multiple API requests. This also simplifies sharing of test code between the
disk and cloud backends.
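
As a sketch, the chunk count and store object names described above can be
computed with a few lines of Python (a hypothetical helper for illustration
only; the function and argument names are not taken from the actual backend
code):

```python
def chunk_object_names(storage_index, shnum, share_size, chunksize):
    """Return the store object names for a share of share_size bytes."""
    assert share_size > 0, "valid shares always have a header"
    num_chunks = -(-share_size // chunksize)  # ceiling(share_size / chunksize)
    prefix = "shares/%s/%s/" % (storage_index[:2], storage_index)
    names = []
    for i in range(num_chunks):
        # Chunk 0 omits the ".0" suffix, matching the prototype S3
        # backend's one-object-per-share layout.
        suffix = "" if i == 0 else ".%d" % i
        names.append("%s%d%s" % (prefix, shnum, suffix))
    return names
```

For example, with a 1 MB `chunksize`, a share slightly over 2 MB maps to three
store objects under the shareset prefix: `SHNUM`, `SHNUM.1`, and `SHNUM.2`.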

Mutable and immutable shares will be “chunked” in the same way.


Rationale for Chunking
----------------------

Limiting the amount of data received or sent in a single request has the
following advantages:

* It is unnecessary to write separate code to take advantage of the
  “large object” features of each cloud storage interface, which differ
  significantly in their design.
* Data needed for each PUT request can be discarded after it completes.
  If a PUT request fails, it can be retried while only holding the data
  for that request in memory.


Costs
=====

In this section we analyze the costs of the proposed design in terms of network,
disk, memory, cloud storage, and API usage.


Network usage—bandwidth and number-of-round-trips
-------------------------------------------------

When a Tahoe-LAFS storage client allocates a new share on a storage server,
the backend will request a list of the existing store objects with the
appropriate prefix. This takes one HTTP request in the common case, but may
take more for the S3 interface, which has a limit of 1000 objects returned in
a single “GET Bucket” request.
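
For interfaces that cap the number of names returned per listing request (such
as the 1000-object limit of S3's “GET Bucket”), the backend can follow
pagination markers until the listing is complete. A minimal sketch, assuming a
hypothetical `list_objects(prefix, marker)` adapter call that returns a
`(names, truncated, next_marker)` tuple:

```python
def list_all_objects(list_objects, prefix):
    """Collect every store object name under `prefix`, following the
    service's pagination markers until no results remain."""
    names = []
    marker = None
    while True:
        page, truncated, marker = list_objects(prefix, marker)
        names.extend(page)
        if not truncated:
            return names
```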

If the share is to be read, the client will make a number of calls each
specifying the offset and length of the required span of bytes. On the first
request that overlaps a given chunk of the share, the server will make an
HTTP GET request for that store object. The server may also speculatively
make GET requests for store objects that are likely to be needed soon (which
can be predicted since reads are normally sequential), in order to reduce
latency.

Each read will be satisfied as soon as the corresponding data is available,
without waiting for the rest of the chunk, in order to minimize read latency.

All four cloud storage interfaces support GET requests using the
Range HTTP header. This could be used to optimize reads where the
Tahoe-LAFS storage client requires only part of a share.
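
The chunk indices and in-chunk byte ranges needed to satisfy a client read can
be found with simple arithmetic; a sketch (the helper name is hypothetical):

```python
def chunk_spans(offset, length, chunksize):
    """Map a read of `length` bytes at `offset` within a share to
    (chunk_index, start, end) spans, one per Range GET request."""
    spans = []
    remaining = length
    while remaining > 0:
        i = offset // chunksize
        start = offset - i * chunksize
        n = min(remaining, chunksize - start)  # bytes available in this chunk
        spans.append((i, start, start + n))
        offset += n
        remaining -= n
    return spans
```

For instance, a 200-byte read at offset 900 with a 1000-byte chunk size touches
chunks 0 and 1: bytes 900–1000 of the first and bytes 0–100 of the second.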

If the share is to be written, the server will make an HTTP PUT request for
each chunk that has been completed. Tahoe-LAFS clients only write immutable
shares sequentially, and so we can rely on that property to simplify the
implementation.

When modifying shares of an existing mutable file, the storage server will
be able to make PUT requests only for chunks that have changed.
(Current Tahoe-LAFS v1.9 clients will not take advantage of this ability, but
future versions will probably do so for MDMF files.)

In some cases, it may be necessary to retry a request (see the `Structure of
Implementation`_ section below). In the case of a PUT request, at the point
at which a retry is needed, the new chunk contents to be stored will still be
in memory and so this is not problematic.

In the absence of retries, the maximum number of GET requests that will be made
when downloading a file, or the maximum number of PUT requests when uploading
or modifying a file, will be equal to the number of chunks in the file.

If the new mutable share content has fewer chunks than the old content,
then the remaining store objects for old chunks must be deleted (using one
HTTP request each). When reading a share, the backend must tolerate the case
where these store objects have not been deleted successfully.

The last write to a share will be reported as successful only when all
corresponding HTTP PUTs and DELETEs have completed successfully.


Disk usage (local to the storage server)
----------------------------------------

It is never necessary for the storage server to write the content of share
chunks to local disk, either when they are read or when they are written. Each
chunk is held only in memory.

A proposed change to the Tahoe-LAFS storage server implementation uses an SQLite
database to store metadata about shares. In that case the same database would
be used for the cloud backend. This would enable lease tracking to be implemented
in the same way for disk and cloud backends.


Memory usage
------------

The use of chunking simplifies bounding the memory usage of the storage server
when handling files that may be larger than memory. However, this depends on
limiting the number of chunks that are simultaneously held in memory.
Multiple chunks can be held in memory either because of pipelining of requests
for a single share, or because multiple shares are being read or written
(possibly by multiple clients).

For immutable shares, the Tahoe-LAFS storage protocol requires the client to
specify in advance the maximum amount of data it will write. Also, a cooperative
client (including all existing released versions of the Tahoe-LAFS code) will
limit the amount of data that is pipelined, currently to 50 KiB. Since the chunk
size will be greater than that, it is possible to ensure that for each allocation,
the maximum chunk data memory usage is the lesser of two chunks' worth of data
and the allocation size. (There is some additional overhead but it is small
compared to the chunk data.) If the maximum memory usage of a new allocation
would exceed the memory available, the allocation can be delayed or possibly
denied, so that the total memory usage is bounded.
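
The bound described above amounts to a small admission check; a sketch, where
the helper names are illustrative rather than part of the proposed server code:

```python
def allocation_memory_bound(allocation_size, chunksize):
    """Worst-case chunk-data memory for one allocation: at most two
    chunks can be in memory at once (since the client pipelines less
    than one chunk's worth of data), and never more than the
    allocation itself."""
    return min(2 * chunksize, allocation_size)

def can_admit(current_usage, allocation_size, chunksize, memory_limit):
    # Delay or deny the allocation if admitting it could exceed the limit.
    bound = allocation_memory_bound(allocation_size, chunksize)
    return current_usage + bound <= memory_limit
```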

It is not clear that the existing protocol allows allocations for mutable
shares to be bounded in general; this may be addressed in a future protocol change.

The above discussion assumes that clients do not maliciously send large
messages as a denial-of-service attack. Foolscap (the protocol layer underlying
the Tahoe-LAFS storage protocol) does not attempt to resist denial of service.


Storage
-------

The storage requirements, including not-yet-collected garbage shares, are
the same as for the Tahoe-LAFS disk backend. That is, the total size of cloud
objects stored is equal to the total size of shares that the disk backend
would store.

Erasure coding causes the size of shares for each file to be a
factor `shares.total` / `shares.needed` times the file size, plus overhead
that is logarithmic in the file size `¹¹`_.
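
For example, with a 3-of-10 encoding (`shares.needed` = 3, `shares.total` = 10),
the total stored share data is roughly 10/3 of the file size; a sketch that
ignores the logarithmic per-share overhead:

```python
def total_share_bytes(file_size, shares_needed, shares_total):
    # Each of shares_total shares holds about file_size / shares_needed
    # bytes; the logarithmic overhead is omitted here.
    return shares_total * (file_size // shares_needed)
```

So a 3 MB file stored with 3-of-10 encoding occupies about 10 MB of cloud
storage in total.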


API usage
---------

Cloud storage backends typically charge a small fee per API request. The number of
requests to the cloud storage service for various operations is discussed under
“network usage” above.


Structure of Implementation
===========================

A generic “cloud backend”, based on the prototype S3 backend but with support
for chunking as described above, will be written.

An instance of the cloud backend can be attached to one of several
“cloud interface adapters”, one for each cloud storage interface. These
adapters will operate only on chunks, and need not distinguish between
mutable and immutable shares. They will be a relatively “thin” abstraction
layer over the HTTP APIs of each cloud storage interface, similar to the
S3Bucket abstraction in the prototype.

For some cloud storage services it may be necessary to transparently retry
requests in order to recover from transient failures. (Although the erasure
coding may enable a file to be retrieved even when shares are not stored by or
not readable from all cloud storage services used in a Tahoe-LAFS grid, it may
be desirable to retry cloud storage service requests in order to improve overall
reliability.) Support for this will be implemented in the generic cloud backend,
and used whenever a cloud interface adapter reports a transient failure. Our
experience with the prototype suggests that it is necessary to retry on transient
failures for Amazon's S3 service.
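
Transient-failure retry in the generic cloud backend could look roughly like
the following synchronous sketch (`TransientError`, the attempt count, and the
delays are illustrative assumptions; the real backend would likely use
Twisted-style asynchronous calls instead):

```python
import time

class TransientError(Exception):
    """Raised by a cloud interface adapter for failures worth retrying."""

def with_retries(operation, max_attempts=4, initial_delay=0.5):
    """Run operation(), retrying on TransientError with exponential
    backoff; non-transient failures propagate immediately."""
    delay = initial_delay
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2
```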

There will also be a “mock” cloud interface adapter, based on the prototype's
MockS3Bucket. This allows tests of the generic cloud backend to be run without
a connection to a real cloud service. The mock adapter will be able to simulate
transient and non-transient failures.


Known Issues
============

This design worsens a known “write hole” issue in Tahoe-LAFS when updating
the contents of mutable files. An update to a mutable file can require
changing the contents of multiple chunks, and if the client fails or is
disconnected during the operation the resulting state of the store objects
for that share may be inconsistent—no longer containing all of the old version,
but not yet containing all of the new version. A mutable share can be left in
an inconsistent state even by the existing Tahoe-LAFS disk backend if it fails
during a write, but that has a smaller chance of occurrence because the current
client behavior leads to mutable shares being written to disk in a single
system call.

The best fix for this issue probably requires changing the Tahoe-LAFS storage
protocol, perhaps by extending it to use a two-phase or three-phase commit
(ticket #1755).


References
==========

¹ omitted

.. _²:

² “Amazon S3” Amazon (2012)

   https://aws.amazon.com/s3/

.. _³:

³ “Rackspace Cloud Files” Rackspace (2012)

   https://www.rackspace.com/cloud/cloud_hosting_products/files/

.. _⁴:

⁴ “Google Cloud Storage” Google (2012)

   https://developers.google.com/storage/

.. _⁵:

⁵ “Windows Azure Storage” Microsoft (2012)

   https://www.windowsazure.com/en-us/develop/net/fundamentals/cloud-storage/

.. _⁶:

⁶ “Amazon Simple Storage Service (Amazon S3) API Reference: REST API” Amazon (2012)

   http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html

.. _⁷:

⁷ “OpenStack Object Storage” openstack.org (2012)

   http://openstack.org/projects/storage/

.. _⁸:

⁸ “Google Cloud Storage Reference Guide” Google (2012)

   https://developers.google.com/storage/docs/reference-guide

.. _⁹:

⁹ “Windows Azure Storage Services REST API Reference” Microsoft (2012)

   http://msdn.microsoft.com/en-us/library/windowsazure/dd179355.aspx

.. _¹⁰:

¹⁰ “Representational state transfer” English Wikipedia (2012)

   https://en.wikipedia.org/wiki/Representational_state_transfer

.. _¹¹:

¹¹ “Performance costs for some common operations” tahoe-lafs.org (2012)

   :doc:`../../performance`
|---|