| 1 | .. -*- coding: utf-8-with-signature -*- |
|---|
| 2 | |
|---|
| 3 | ===================== |
|---|
| 4 | Lease database design |
|---|
| 5 | ===================== |
|---|
| 6 | |
|---|
| 7 | The target audience for this document is developers who wish to understand |
|---|
| 8 | the new lease database (leasedb) planned to be added in Tahoe-LAFS v1.11.0. |
|---|
| 9 | |
|---|
| 10 | |
|---|
| 11 | Introduction |
|---|
| 12 | ------------ |
|---|
| 13 | |
|---|
| 14 | A "lease" is a request by an account that a share not be deleted before a |
|---|
| 15 | specified time. Each storage server stores leases in order to know which |
|---|
| 16 | shares to spare from garbage collection. |
|---|
| 17 | |
|---|
| 18 | Motivation |
|---|
| 19 | ---------- |
|---|
| 20 | |
|---|
| 21 | The leasedb will replace the current design in which leases are stored in |
|---|
| 22 | the storage server's share container files. That design has several |
|---|
| 23 | disadvantages: |
|---|
| 24 | |
|---|
| 25 | - Updating a lease requires modifying a share container file (even for |
|---|
| 26 | immutable shares). This complicates the implementation of share classes. |
|---|
| 27 | The mixing of share contents and lease data in share files also led to a |
|---|
| 28 | security bug (ticket `#1528`_). |
|---|
| 29 | |
|---|
| 30 | - When only the disk backend is supported, it is possible to read and |
|---|
| 31 | update leases synchronously because the share files are stored locally |
|---|
| 32 | to the storage server. For the cloud backend, accessing share files |
|---|
| 33 | requires an HTTP request, and so must be asynchronous. Accepting this |
|---|
| 34 | asynchrony for lease queries would be both inefficient and complex. |
|---|
| 35 | Moving lease information out of shares and into a local database allows |
|---|
| 36 | lease queries to stay synchronous. |
|---|
| 37 | |
|---|
| 38 | Also, the current cryptographic protocol for renewing and cancelling leases |
|---|
| 39 | (based on shared secrets derived from secure hash functions) is complex, |
|---|
| 40 | and the cancellation part was never used. |
|---|
| 41 | |
|---|
| 42 | The leasedb solves the first two problems by storing the lease information in |
|---|
| 43 | a local database instead of in the share container files. The share data |
|---|
| 44 | itself is still held in the share container file. |
|---|
| 45 | |
|---|
| 46 | At the same time as implementing leasedb, we devised a simpler protocol for |
|---|
| 47 | allocating and cancelling leases: a client can use a public key digital |
|---|
| 48 | signature to authenticate access to a foolscap object representing the |
|---|
| 49 | authority of an account. This protocol is not yet implemented; at the time |
|---|
| 50 | of writing, only an "anonymous" account is supported. |
|---|
| 51 | |
|---|
| 52 | The leasedb also provides an efficient way to get summarized information, |
|---|
| 53 | such as total space usage of shares leased by an account, for accounting |
|---|
| 54 | purposes. |
|---|
| 55 | |
|---|
| 56 | .. _`#1528`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1528 |
|---|
| 57 | |
|---|
| 58 | |
|---|
| 59 | Design constraints |
|---|
| 60 | ------------------ |
|---|
| 61 | |
|---|
| 62 | A share is stored as a collection of objects. The persistent storage may be |
|---|
| 63 | remote from the server (for example, cloud storage). |
|---|
| 64 | |
|---|
| 65 | Writing to the persistent store objects is in general not an atomic |
|---|
| 66 | operation. So the leasedb also keeps track of which shares are in an |
|---|
| 67 | inconsistent state because they have been partly written. (This may |
|---|
| 68 | change in future when we implement a protocol to improve atomicity of |
|---|
| 69 | updates to mutable shares.) |
|---|
| 70 | |
|---|
| 71 | Leases are no longer stored in shares. The same share format is used as |
|---|
| 72 | before, but the lease slots are ignored, and are cleared when rewriting a |
|---|
| 73 | mutable share. The new design also does not use lease renewal or cancel |
|---|
| 74 | secrets. (They are accepted as parameters in the storage protocol interfaces |
|---|
| 75 | for backward compatibility, but are ignored. Cancel secrets were already |
|---|
| 76 | ignored due to the fix for `#1528`_.) |
|---|
| 77 | |
|---|
| 78 | The new design needs to be fail-safe in the sense that if the lease database |
|---|
| 79 | is lost or corruption is detected, no share data will be lost (even though |
|---|
| 80 | the metadata about leases held by particular accounts has been lost). |
|---|
| 81 | |
|---|
| 82 | |
|---|
| 83 | Accounting crawler |
|---|
| 84 | ------------------ |
|---|
| 85 | |
|---|
| 86 | A "crawler" is a long-running process that visits share container files at a |
|---|
| 87 | slow rate, so as not to overload the server by trying to visit all share |
|---|
| 88 | container files one after another immediately. |
|---|
| 89 | |
|---|
| 90 | The accounting crawler replaces the previous "lease crawler". It examines |
|---|
| 91 | each share container file and compares it with the state of the leasedb, and |
|---|
| 92 | may update the state of the share and/or the leasedb. |
|---|
| 93 | |
|---|
| 94 | The accounting crawler may perform the following functions (but see ticket |
|---|
| 95 | #1834 for a proposal to reduce the scope of its responsibility): |
|---|
| 96 | |
|---|
| 97 | - Remove leases that are past their expiration time. (Currently, this is |
|---|
| 98 | done automatically before deleting shares, but we plan to allow expiration |
|---|
| 99 | to be performed separately for individual accounts in future.) |
|---|
| 100 | |
|---|
| 101 | - Delete the objects containing unleased shares — that is, shares that have |
|---|
| 102 | stable entries in the leasedb but no current leases (see below for the |
|---|
| 103 | definition of "stable" entries). |
|---|
| 104 | |
|---|
| 105 | - Discover shares that have been manually added to storage, via ``scp`` or |
|---|
| 106 | some other out-of-band means. |
|---|
| 107 | |
|---|
| 108 | - Discover shares that are present when a storage server is upgraded to |
|---|
| 109 | a leasedb-supporting version from a previous version, and give them |
|---|
| 110 | "starter leases". |
|---|
| 111 | |
|---|
| 112 | - Recover from a situation where the leasedb is lost or detectably |
|---|
| 113 | corrupted. This is handled in the same way as upgrading from a previous |
|---|
| 114 | version. |
|---|
| 115 | |
|---|
| 116 | - Detect shares that have unexpectedly disappeared from storage. The |
|---|
| 117 | disappearance of a share is logged, and its entry and leases are removed |
|---|
| 118 | from the leasedb. |
|---|
| 119 | |
|---|
| 120 | |
|---|
| 121 | Accounts |
|---|
| 122 | -------- |
|---|
| 123 | |
|---|
| 124 | An account holds leases for some subset of shares stored by a server. The |
|---|
| 125 | leasedb schema can handle many distinct accounts, but for the time being we |
|---|
| 126 | create only two accounts: an anonymous account and a starter account. The |
|---|
| 127 | starter account is used for leases on shares discovered by the accounting |
|---|
| 128 | crawler; the anonymous account is used for all other leases. |
|---|
| 129 | |
|---|
| 130 | The leasedb has at most one lease entry per account per (storage_index, |
|---|
| 131 | shnum) pair. This entry stores the times when the lease was last renewed and |
|---|
| 132 | when it is set to expire (if the expiration policy does not force it to |
|---|
| 133 | expire earlier), represented as Unix UTC-seconds-since-epoch timestamps. |
|---|
| 134 | |
|---|
| 135 | For more on expiration policy, see :doc:`../garbage-collection`. |
|---|
| 136 | |
|---|
| 137 | |
|---|
| 138 | Share states |
|---|
| 139 | ------------ |
|---|
| 140 | |
|---|
| 141 | The leasedb holds an explicit indicator of the state of each share. |
|---|
| 142 | |
|---|
| 143 | The diagram and descriptions below give the possible values of the "state" |
|---|
| 144 | indicator, what that value means, and transitions between states, for any |
|---|
| 145 | (storage_index, shnum) pair on each server:: |
|---|
| 146 | |
|---|
| 147 | |
|---|
| 148 | # STATE_STABLE -------. |
|---|
| 149 | # ^ | ^ | | |
|---|
| 150 | # | v | | v |
|---|
| 151 | # STATE_COMING | | STATE_GOING |
|---|
| 152 | # ^ | | | |
|---|
| 153 | # | | v | |
|---|
| 154 | # '----- NONE <------' |
|---|
| 155 | |
|---|
| 156 | |
|---|
| 157 | **NONE**: There is no entry in the ``shares`` table for this (storage_index, |
|---|
| 158 | shnum) in this server's leasedb. This is the initial state. |
|---|
| 159 | |
|---|
| 160 | **STATE_COMING**: The share is being created or (if a mutable share) |
|---|
| 161 | updated. The store objects may have been at least partially written, but |
|---|
| 162 | the storage server doesn't have confirmation that they have all been |
|---|
| 163 | completely written. |
|---|
| 164 | |
|---|
| 165 | **STATE_STABLE**: The store objects have been completely written and are |
|---|
| 166 | not in the process of being modified or deleted by the storage server. (It |
|---|
| 167 | could have been modified or deleted behind the back of the storage server, |
|---|
| 168 | but if it has, the server has not noticed that yet.) The share may or may not |
|---|
| 169 | be leased. |
|---|
| 170 | |
|---|
| 171 | **STATE_GOING**: The share is being deleted. |
|---|
| 172 | |
|---|
| 173 | State transitions |
|---|
| 174 | ----------------- |
|---|
| 175 | |
|---|
| 176 | • **STATE_GOING** → **NONE** |
|---|
| 177 | |
|---|
| 178 | trigger: The storage server gains confidence that all store objects for |
|---|
| 179 | the share have been removed. |
|---|
| 180 | |
|---|
| 181 | implementation: |
|---|
| 182 | |
|---|
| 183 | 1. Remove the entry in the leasedb. |
|---|
| 184 | |
|---|
| 185 | • **STATE_STABLE** → **NONE** |
|---|
| 186 | |
|---|
| 187 | trigger: The accounting crawler noticed that all the store objects for |
|---|
| 188 | this share are gone. |
|---|
| 189 | |
|---|
| 190 | implementation: |
|---|
| 191 | |
|---|
| 192 | 1. Remove the entry in the leasedb. |
|---|
| 193 | |
|---|
| 194 | • **NONE** → **STATE_COMING** |
|---|
| 195 | |
|---|
| 196 | triggers: A new share is being created, as explicitly signalled by a |
|---|
| 197 | client invoking a creation command, *or* the accounting crawler discovers |
|---|
| 198 | an incomplete share. |
|---|
| 199 | |
|---|
| 200 | implementation: |
|---|
| 201 | |
|---|
| 202 | 1. Add an entry to the leasedb with **STATE_COMING**. |
|---|
| 203 | |
|---|
| 204 | 2. (In case of explicit creation) begin writing the store objects to hold |
|---|
| 205 | the share. |
|---|
| 206 | |
|---|
| 207 | • **STATE_STABLE** → **STATE_COMING** |
|---|
| 208 | |
|---|
| 209 | trigger: A mutable share is being modified, as explicitly signalled by a |
|---|
| 210 | client invoking a modification command. |
|---|
| 211 | |
|---|
| 212 | implementation: |
|---|
| 213 | |
|---|
| 214 | 1. Add an entry to the leasedb with **STATE_COMING**. |
|---|
| 215 | |
|---|
| 216 | 2. Begin updating the store objects. |
|---|
| 217 | |
|---|
| 218 | • **STATE_COMING** → **STATE_STABLE** |
|---|
| 219 | |
|---|
| 220 | trigger: All store objects have been written. |
|---|
| 221 | |
|---|
| 222 | implementation: |
|---|
| 223 | |
|---|
| 224 | 1. Change the state value of this entry in the leasedb from |
|---|
| 225 | **STATE_COMING** to **STATE_STABLE**. |
|---|
| 226 | |
|---|
| 227 | • **NONE** → **STATE_STABLE** |
|---|
| 228 | |
|---|
| 229 | trigger: The accounting crawler discovers a complete share. |
|---|
| 230 | |
|---|
| 231 | implementation: |
|---|
| 232 | |
|---|
| 233 | 1. Add an entry to the leasedb with **STATE_STABLE**. |
|---|
| 234 | |
|---|
| 235 | • **STATE_STABLE** → **STATE_GOING** |
|---|
| 236 | |
|---|
| 237 | trigger: The share should be deleted because it is unleased. |
|---|
| 238 | |
|---|
| 239 | implementation: |
|---|
| 240 | |
|---|
| 241 | 1. Change the state value of this entry in the leasedb from |
|---|
| 242 | **STATE_STABLE** to **STATE_GOING**. |
|---|
| 243 | |
|---|
| 244 | 2. Initiate removal of the store objects. |
|---|
| 245 | |
|---|
| 246 | |
|---|
| 247 | The following constraints are needed to avoid race conditions: |
|---|
| 248 | |
|---|
| 249 | - While a share is being deleted (entry in **STATE_GOING**), we do not accept |
|---|
| 250 | any requests to recreate it. That would result in add and delete requests |
|---|
| 251 | for store objects being sent concurrently, with undefined results. |
|---|
| 252 | |
|---|
| 253 | - While a share is being added or modified (entry in **STATE_COMING**), we |
|---|
| 254 | treat it as leased. |
|---|
| 255 | |
|---|
| 256 | - Creation or modification requests for a given mutable share are serialized. |
|---|
| 257 | |
|---|
| 258 | |
|---|
| 259 | Unresolved design issues |
|---|
| 260 | ------------------------ |
|---|
| 261 | |
|---|
| 262 | - What happens if a write to store objects for a new share fails |
|---|
| 263 | permanently? If we delete the share entry, then the accounting crawler |
|---|
| 264 | will eventually get to those store objects and see that their lengths |
|---|
| 265 | are inconsistent with the length in the container header. This will cause |
|---|
| 266 | the share to be treated as corrupted. Should we instead attempt to |
|---|
| 267 | delete those objects immediately? If so, do we need a direct |
|---|
| 268 | **STATE_COMING** → **STATE_GOING** transition to handle this case? |
|---|
| 269 | |
|---|
| 270 | - What happens if only some store objects for a share disappear |
|---|
| 271 | unexpectedly? This case is similar to only some objects having been |
|---|
| 272 | written when we get an unrecoverable error during creation of a share, but |
|---|
| 273 | perhaps we want to treat it differently in order to preserve information |
|---|
| 274 | about the storage service having lost data. |
|---|
| 275 | |
|---|
| 276 | - Does the leasedb need to track corrupted shares? |
|---|
| 277 | |
|---|
| 278 | |
|---|
| 279 | Future directions |
|---|
| 280 | ----------------- |
|---|
| 281 | |
|---|
| 282 | Clients will have key pairs identifying accounts, and will be able to add |
|---|
| 283 | leases for a specific account. Various space usage policies can be defined. |
|---|
| 284 | |
|---|
| 285 | Better migration tools ('tahoe storage export'?) will create export files |
|---|
| 286 | that include both the share data and the lease data, and then an import tool |
|---|
| 287 | will both put the share in the right place and update the recipient node's |
|---|
| 288 | leasedb. |
|---|