
= Tahoe Directory Nodes =

As explained in the architecture docs, Tahoe can be roughly viewed as a
collection of three layers. The lowest layer is the distributed filestore, or
DHT: it provides operations that accept files and upload them to the mesh,
creating a URI in the process which securely references the file's contents.
The middle layer is the filesystem, creating a structure of directories and
filenames resembling the traditional unix/windows filesystems. The top layer
is the application layer, which uses the lower layers to provide useful
services to users, like a backup application, or a way to share files with
friends.

This document examines the middle layer, the "filesystem".

== DHT Primitives ==

In the lowest layer (DHT), we've defined two operations thus far, both of
which refer to "CHK URIs", which reference immutable data:

 chk_uri = put(data)
 data = get(chk_uri)

We anticipate creating mutable slots in the DHT layer at some point, which
will add some new operations to this layer:

 slotname = create_slot()
 set(slotname, data)
 data = get(slotname)

== Filesystem Goals ==

The main goal for the middle (filesystem) layer is to give users a way to
organize the data that they have uploaded into the mesh. The traditional way
to do this in computer filesystems is to put this data into files, give those
files names, and collect these names into directories.

Each directory is a series of name-value pairs, which maps "child name" to an
object of some kind. Those child objects might be files, or they might be
other directories.

The directory structure is therefore a directed graph of nodes, in which each
node might be a directory node or a file node. All file nodes are terminal
nodes.
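
For illustration only (this is not the storage format described below), a
dirnode can be thought of as a small table mapping child names to the URIs of
files or other directories. The names and URI values in this Python sketch
are made up:

 # illustrative only: a directory maps child names to child URIs
 directory = {
     "report.txt": "URI:CHK:<made-up file URI>",    # a file node (terminal)
     "photos":     "URI:DIR:<made-up dirnode URI>", # another directory node
 }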

== Dirnode Goals ==

What properties might be desirable for these directory nodes? In no
particular order:

1: functional. Code which does not work doesn't count.

2: easy to document, explain, and understand

3: private: it should not be possible for others to see the contents of a directory

4: integrity: it should not be possible for others to modify the contents of a directory

5: available: directories should survive host failure, just like files do

6: efficient: in storage, communication bandwidth, number of round-trips

7: easy to delegate individual directories in a flexible way

8: updateness: everybody looking at a directory should see the same contents

9: monotonicity: everybody looking at a directory should see the same sequence of updates

We do not meet all of these goals. For the current release, we favored !#1,
!#2, and !#7 above the rest, which led us to the following design. In a later
section, we discuss some alternate designs and potential changes to the
existing code that can help us achieve the other goals.

In tahoe-0.4.0, each "dirnode" is stored as a file on a single "vdrive
server". The name of this file is an unguessable string. The contents are an
encrypted representation of the directory's name-to-child mapping. Foolscap
is used to provide remote access to this file. A collection of "directory
URIs" is used to hold all the parameters necessary to access, read, and
write this dirnode.

== Dirnode secret values ==

Each dirnode begins life as a "writekey", a randomly-generated AES key. This
key is hashed (using a tagged hash, see src/allmydata/util/hashutil.py for
details) to form the "readkey". The readkey is hashed to form the "storage
index". The writekey is hashed with a different tag to form the "write
enabler".

Clients who have read-write access to the dirnode know the writekey, and can
derive all the other secrets from it. Clients with merely read-only access to
the dirnode know the readkey (and can derive the storage index), but do not
know the writekey or the write enabler. The vdrive server knows only the
storage index and the write enabler.
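
The following Python sketch shows the shape of this derivation chain, using
tagged SHA-256 hashes in the spirit of src/allmydata/util/hashutil.py. The
tag strings and truncation lengths here are illustrative assumptions, not the
exact values used by the code:

 import os
 from hashlib import sha256

 def tagged_hash(tag, value):
     # netstring-style tagging keeps different uses of the hash from colliding
     return sha256(b"%d:%s," % (len(tag), tag) + value).digest()

 writekey = os.urandom(16)   # random AES key; knowing it grants read-write access
 readkey = tagged_hash(b"writekey_to_readkey", writekey)[:16]    # read-only access
 storage_index = tagged_hash(b"readkey_to_storage_index", readkey)[:16]
 write_enabler = tagged_hash(b"writekey_to_write_enabler", writekey)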

== Dirnode capability URIs ==

The "write capability" for a dirnode grants read-write access to its
contents. This is expressed in concrete form as the "dirnode write URI": a
printable string which contains the following pieces of information:

 * furl of the vdrive server hosting this dirnode
 * writekey

The "read capability" grants read-only access to a dirnode, and its "dirnode
read URI" contains:

 * furl of the vdrive server hosting this dirnode
 * readkey

For example,
URI:DIR:pb://xextf3eap44o3wi27mf7ehiur6wvhzr6@207.7.153.180:56677,127.0.0.1:56677/vdrive:shrrn75qq3x7uxfzk326ncahd4======
is a write-capability URI, while
URI:DIR-RO:pb://xextf3eap44o3wi27mf7ehiur6wvhzr6@207.7.153.180:56677,127.0.0.1:56677/vdrive:4c2legsthoe52qywuaturgwdrm======
is a read-capability URI, both for the same dirnode.
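
The rough shape of these strings can be seen by splitting the examples above.
This Python sketch assumes the simple "URI:DIR[-RO]:<furl>:<base32 key>"
layout shown in those examples; it is not the real URI-handling code:

 def parse_dirnode_uri(uri):
     # "URI:DIR:<furl>:<key>" or "URI:DIR-RO:<furl>:<key>". The furl itself
     # contains colons, so the base32 key is everything after the last one.
     _, kind, remainder = uri.split(":", 2)
     furl, key_b32 = remainder.rsplit(":", 1)
     return (kind == "DIR"), furl, key_b32   # (writeable?, furl, key)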


== Dirnode storage format ==

Each dirnode is stored in a single file, saved on the vdrive server, using
the (base32-encoded) storage index as a filename. The contents of this file
are a serialized dictionary which maps H_name (explained below) to a tuple
with three values: (E_name, E_write, E_read). The vdrive server is made
available as a Foolscap "Referenceable" object, with the following
operations:

 create_dirnode(index, write_enabler) -> None
 list(index) -> list of (E_name, E_write, E_read) tuples
 get(index, H_name) -> (E_write, E_read)
 set(index, write_enabler, H_name, E_name, E_write, E_read)
 delete(index, write_enabler, H_name)

For any given entry of this dictionary, the following values are obtained by
hashing or encryption:

 H_name is the hash of the readkey and the child's name.
 E_name is the child's name, encrypted with the readkey
 E_write is the child's write-URI, encrypted with the writekey
 E_read is the child's read-URI, encrypted with the readkey

All encryption uses AES in CTR mode, in which the high-order 10 or 12 bytes
of the 16-byte counter block are used as an IV (randomly chosen each time the
data is changed), and the remaining bytes are used as the CTR-mode offset. An
HMAC-SHA256 is computed for each encrypted value and stored alongside. The
stored E_name/E_write/E_read values are thus the concatenation of IV,
encrypted data, and HMAC.
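
The following Python sketch shows how one such value might be produced, using
the "cryptography" package for AES-CTR and the standard library for the HMAC.
The 12-byte IV is one of the two sizes mentioned above; the tag used for
H_name and the choice of HMAC key are illustrative assumptions rather than
the exact layout used by the code:

 import hmac, os
 from hashlib import sha256
 from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

 def hash_name(readkey, name):
     # H_name: a hash of the readkey and the child's name (tag is assumed)
     return sha256(b"dirnode-name-hash:" + readkey + b":" + name).digest()

 def encrypt_entry_value(key, plaintext):
     # the IV occupies the high-order bytes of the 16-byte CTR counter block;
     # the low-order bytes start at zero and act as the block offset
     iv = os.urandom(12)
     encryptor = Cipher(algorithms.AES(key), modes.CTR(iv + b"\x00" * 4)).encryptor()
     ciphertext = encryptor.update(plaintext) + encryptor.finalize()
     mac = hmac.new(key, iv + ciphertext, sha256).digest()  # HMAC key: an assumption
     return iv + ciphertext + mac    # stored as E_name, E_write, or E_read

 # building one entry (readkey/writekey are the 16-byte AES keys from above):
 #   H_name  = hash_name(readkey, b"photos")
 #   E_name  = encrypt_entry_value(readkey, b"photos")
 #   E_write = encrypt_entry_value(writekey, child_write_uri)
 #   E_read  = encrypt_entry_value(readkey, child_read_uri)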

When a new dirnode is created, it records the write_enabler. All operations
that modify an existing dirnode (set and delete) require that the
write_enabler be presented.

This approach ensures that clients who do not have the read or write keys
(including the vdrive server, which knows the storage index but not the keys)
will be unable to see any of the contents of the dirnode. Clients who have
the readkey but not the writekey will not be allowed to modify the dirnode.
The H_name value allows clients to perform lookups of specific keys rather
than requiring them to download the whole dirnode for each operation.
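
As a sketch of how a client might use the operations listed above to look up
a single child without fetching the whole table: Foolscap's callRemote
returns Twisted Deferreds, so this assumes a running reactor; tagged_hash and
hash_name come from the earlier sketches, and decrypt_entry_value (the
inverse of the encryption sketch above) is assumed. Error handling and Tub
shutdown are omitted:

 from foolscap.api import Tub

 def lookup_child_readcap(furl, readkey, childname):
     # returns a Deferred that fires with the child's decrypted read-URI
     storage_index = tagged_hash(b"readkey_to_storage_index", readkey)[:16]
     h_name = hash_name(readkey, childname)
     tub = Tub()
     tub.startService()
     d = tub.getReference(furl)   # connect to the vdrive server named in the URI
     d.addCallback(lambda rref: rref.callRemote("get", storage_index, h_name))
     def _decrypt(result):
         E_write, E_read = result
         # E_write needs the writekey; a read-only client simply ignores it
         return decrypt_entry_value(readkey, E_read)
     d.addCallback(_decrypt)
     return d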

By putting both read-only and read-write child access capabilities in each
entry, encrypted by different keys, this approach provides transitive
read-only-ness: if a client has only a readkey for the parent dirnode, they
will only get readkeys (and not writekeys) for any children, including other
directories. When we create mutable slots in the mesh and we start having
read-write file URIs, we can use the same approach to ensure that read-only
access to a directory means read-only access to the files as well.


== Design Goals, redux ==

How well does this design meet the goals?

 * !#1 functional: YES: the code works and has extensive unit tests
 * !#2 documentable: YES: this document is the existence proof
 * !#3 private: MOSTLY: see the discussion below
 * !#4 integrity: MOSTLY: the vdrive server can roll back individual slots
 * !#5 availability: BARELY: if the vdrive server is offline, the dirnode will be unusable. If the vdrive server fails, the dirnode will be lost forever.
 * !#6 efficient: MOSTLY:
   * network: single dirnode lookup is very efficient, since clients can fetch specific keys rather than being required to get or set the entire dirnode each time. Traversing many directories takes a lot of roundtrips, and these can't be collapsed with promise-pipelining because the intermediate values must only be visible to the client. Modifying many dirnodes at once (e.g. importing a large pre-existing directory tree) is pretty slow, since each graph edge must be created independently.
   * storage: each child has a separate IV, which makes them larger than if all children were aggregated into a single encrypted string

 * !#7 delegation: VERY: each dirnode is a completely independent object, to which clients can be granted separate read-write or read-only access

 * !#8 updateness: VERY: with only a single point of access, and no caching, each client operation starts by fetching the current value, so there are no opportunities for staleness

 * !#9 monotonicity: VERY: the single point of access also protects against retrograde motion


=== Privacy leaks in the vdrive server ===

Dirnodes are very private against other clients: traffic between the client
and the vdrive server is protected by the Foolscap SSL connection, so they
can observe very little. Storage index values are hashes of secrets and thus
unguessable, and they are not made public, so other clients cannot snoop
through encrypted dirnodes that they have not been told about.

On the other hand, the vdrive server gets to see the access patterns of each
client who is using dirnodes hosted there. The childnames and URIs are
encrypted and not visible to anyone (including the vdrive server), but the
vdrive server is in a good position to infer a lot of data about the
directory structure. It knows the length of all childnames, and from the
length of the child URIs themselves it can tell whether children are file
URIs vs. directory URIs vs. read-only directory URIs. By watching a client's
access patterns it can deduce the connection between (encrypted) child 1 and
target directory 2 (i.e. if the client does a 'get' of the first child, then
immediately does an operation on directory 2, it can assume the two are
related). From this the vdrive server can build a graph with the same shape
as the filesystem, even though the nodes and edges will be unlabeled.

By providing CHK-level storage services as well (or colluding with a server
that does), the vdrive server can infer the storage index of file nodes that
are downloaded shortly after their childname is looked up.


=== Integrity failures in the vdrive server ===

The HMAC prevents the vdrive server from modifying the child names or child
URI values without detection: changing a few bytes will cause an HMAC failure
that the client can detect. This means the vdrive server can make the dirnode
unavailable, but not corrupt it.

However, the vdrive server can perform a rollback attack: either replacing an
individual entry in the encrypted table with an old version, or replacing the
entire table. Despite not knowing what the child names or URIs are, the
vdrive server can undo changes made by authorized clients. It could also
perform selective rollback, showing different clients different versions of
the filesystem. Solving this problem requires either storing mutable data
(like a sequence number or hash) in the URI which points to this dirnode
(rendering it non-constant, and losing most of its value), or spreading the
dirnode out over multiple non-colluding servers (which might improve
availability but makes updateness and monotonicity harder).


=== Improving the availability of dirnodes ===

Clearly it is somewhat disappointing to have a sexy distributed filestore at
the bottom layer and then have a single-point-of-failure vdrive server on top
of it. However, this approach meets many of the design goals and is extremely
simple to explain and implement. There are many avenues to improve the
reliability and availability of dirnodes. (Note that reliability and
availability can be separate goals.)

A simple way to improve the reliability of dirnodes would be to make the
vdrive server responsible for saving the dirnode contents in a fashion
that will survive the failure of its local disk, for example by simply
rsync'ing all the dirnodes off to a separate machine on a periodic basis, and
pulling them back in the case of disk failure.

To improve availability, we must allow clients to access their dirnodes even
if the vdrive server is offline. The first step here is to create multiple
vdrive servers, putting a list of furls into the DIR:URI, with instructions
to update all of them during write, and accept the first answer that comes
back during read. This introduces issues of updateness and monotonicity: if a
dirnode is changed while one of the vdrive servers is offline, the servers
will diverge, and subsequent clients will see different contents depending
upon which server they ask.
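
A rough Python sketch of that write-to-all / read-from-any policy, over a
hypothetical list of synchronous per-server proxies whose set/get methods
mirror the vdrive operations listed earlier (this design is not implemented,
and the divergence problem just described is deliberately ignored here):

 def dirnode_set(servers, index, write_enabler, h_name, e_name, e_write, e_read):
     # best effort: try to update every listed vdrive server
     for server in servers:
         try:
             server.set(index, write_enabler, h_name, e_name, e_write, e_read)
         except ConnectionError:
             pass   # an offline server silently diverges, as noted above

 def dirnode_get(servers, index, h_name):
     # accept the first answer that comes back
     for server in servers:
         try:
             return server.get(index, h_name)
         except ConnectionError:
             continue
     raise RuntimeError("no vdrive server reachable")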

A more comforting way to improve both reliability and availability is to
spread the dirnodes out over the mesh in the same way that CHK files work.
The general name for this approach is the "SSK directory slot", a structure
for keeping a mutable slot on multiple hosts, setting and retrieving its
contents at various times, and reconciling differences by comparing sequence
numbers. The "slot name" is the hash of a public key, which is also used to
sign updates, such that the SSK storage hosts will only accept updates from
those in possession of the corresponding private key. This approach (although
not yet implemented) will provide fairly good reliability and availability
properties, at the expense of complexity and updateness/monotonicity. It can
also improve integrity, since an attacker would have to corrupt multiple
storage servers to successfully perform a rollback attack.

Reducing centralization can improve reliability, as long as the overall
reliability of the mesh is greater than the reliability of the original
centralized services.

=== Improving the efficiency of dirnodes ===

By storing each child of a dirnode in a separate element of the dictionary,
we provide efficient directory traversal and clean+simple dirnode delegation
behavior. This comes at the cost of efficiency for other operations,
specifically things that operate on multiple dirnodes at once.

When a backup program is run for the first time, it needs to copy a large
amount of data from a pre-existing filesystem into reliable storage. This
means that a large and complex directory structure needs to be duplicated in
the dirnode layer. With the one-object-per-dirnode approach described here,
this requires as many operations as there are edges in the imported
filesystem graph.

Another approach would be to aggregate multiple directories into a single
storage object. This object would contain a serialized graph rather than a
single name-to-child dictionary. Most directory operations would fetch the
whole block of data (and presumably cache it for a while to avoid lots of
re-fetches), and modification operations would need to replace the whole
thing at once. This "realm" approach would have the added benefit of
combining more data into a single encrypted bundle (perhaps hiding the shape
of the graph from the vdrive server better), and would reduce round-trips
when performing deep directory traversals (assuming the realm was already
cached). It would also prevent fine-grained rollback attacks from working:
the vdrive server could change the entire dirnode to look like an earlier
state, but it could not independently roll back individual edges.

The drawbacks of this aggregation would be that small accesses (adding a
single child, looking up a single child) would require pulling or pushing a
lot of unrelated data, increasing network overhead (and necessitating
test-and-set semantics for the modification side, which increases the chances
that a user operation will fail, making it more challenging to provide
promises of atomicity to the user).
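
To make the test-and-set point concrete, a modification under this scheme
would look roughly like the following Python sketch; read_realm and
write_realm_if_unchanged are hypothetical calls on a hypothetical aggregated
server, not part of the current code:

 def modify_realm(server, realm_id, mutate):
     # test-and-set: re-read and retry if another client changed the realm first
     while True:
         version, realm_data = server.read_realm(realm_id)
         new_data = mutate(realm_data)
         if server.write_realm_if_unchanged(realm_id, version, new_data):
             return new_data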

It would also make it much more difficult to enable the delegation
("sharing") of specific directories. Since each aggregate "realm" provides
all-or-nothing access control, the act of delegating any directory from the
middle of the realm would require the realm first be split into the upper
piece that isn't being shared and the lower piece that is. This splitting
would have to be done in response to what is essentially a read operation,
which is not traditionally supposed to be a high-effort action.


=== Dirnode expiration and leases ===

Dirnodes are created any time a client wishes to add a new directory. How
long do they live? What's to keep them from sticking around forever, taking
up space that nobody can reach any longer?

Our plan is to define the vdrive servers to keep dirnodes alive with
"leases". Clients which know and care about specific dirnodes can ask to keep
them alive for a while, by renewing a lease on them (with a typical period of
one month). Clients are expected to assist in the deletion of dirnodes by
canceling their leases as soon as they are done with them. This means that
when a client deletes a directory, it should also cancel its lease on that
directory. When the lease count on a dirnode goes to zero, the vdrive server
can delete the related storage. Multiple clients may all have leases on the
same dirnode: the server may delete the dirnode only after all of the leases
have gone away.
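
A sketch of the lease bookkeeping this implies on the vdrive server, kept
in memory only for brevity; a real server would persist leases and also
expire them when the lease period runs out, and delete_dirnode_storage is a
hypothetical helper:

 import time

 LEASE_PERIOD = 31 * 24 * 3600   # roughly one month, in seconds

 class LeaseTable:
     def __init__(self):
         # storage_index -> {lease_owner: expiration_time}
         self.leases = {}

     def renew(self, storage_index, owner):
         self.leases.setdefault(storage_index, {})[owner] = time.time() + LEASE_PERIOD

     def cancel(self, storage_index, owner):
         owners = self.leases.get(storage_index, {})
         owners.pop(owner, None)
         if not owners:
             # last lease gone: the dirnode's storage can be deleted
             self.leases.pop(storage_index, None)
             delete_dirnode_storage(storage_index)   # hypothetical helper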

We expect that clients will periodically create a "manifest": a list of
so-called "refresh capabilities" for all of the dirnodes and files that they
can reach. They will give this manifest to the "repairer", which is a service
that keeps files (and dirnodes) alive on behalf of clients who cannot take on
this responsibility for themselves. These refresh capabilities include the
storage index, but do *not* include the readkeys or writekeys, so the
repairer does not get to read the files or directories that it is helping to
keep alive.

After each change to the user's vdrive, the client creates a manifest and
looks for differences from their previous version. Anything which was removed
prompts the client to send out lease-cancellation messages, allowing the data
to be deleted.
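
A sketch of that differencing step in Python, treating refresh capabilities
as opaque hashable values; cancel_lease is a hypothetical client-side call
standing in for the lease-cancellation messages described above:

 def reconcile_manifests(old_manifest, new_manifest):
     # anything reachable before but not now gets its lease cancelled,
     # allowing the vdrive/storage servers to delete it
     for refresh_cap in set(old_manifest) - set(new_manifest):
         cancel_lease(refresh_cap)      # hypothetical client-side call
     return new_manifest                # saved for comparison after the next change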


== Starting Points: root dirnodes ==

Any client can record the URI of a directory node in some external form (say,
in a local file) and use it as the starting point of later traversal. The
current vdrive servers are configured to create a "root" dirnode at startup
and publish its URI to the world: this forms the basis of the "global shared
vdrive" used in the demonstration application. In addition, client code is
currently designed to create a new (unattached) dirnode at startup and record
its URI: this forms the root of the "per-user private vdrive" presented as
the "~" directory.

== Mounting and Sharing Directories ==

The biggest benefit of this dirnode approach is that sharing individual
directories is almost trivial. Alice creates a subdirectory that she wants to
use to share files with Bob. This subdirectory is attached to Alice's
filesystem at "~alice/share-with-bob". She asks her filesystem for the
read-write directory URI for that new directory, and emails it to Bob. When
Bob receives the URI, he asks his own local vdrive to attach the given URI,
perhaps at a place named "~bob/shared-with-alice". Every time either party
writes a file into this directory, the other will be able to read it. If
Alice prefers, she can give a read-only URI to Bob instead, and then Bob will
be able to read files but not change the contents of the directory. Neither
Alice nor Bob will get access to any files above the mounted directory: there
are no 'parent directory' pointers. If Alice creates a nested set of
directories, "~alice/share-with-bob/subdir2", and gives a read-only URI to
share-with-bob to Bob, then Bob will be unable to write to either
share-with-bob/ or subdir2/.

A suitable UI needs to be created to allow users to easily perform this
sharing action: dragging a folder from their vdrive to an IM or email user
icon, for example. The UI will need to give the sending user an opportunity
to indicate whether they want to grant read-write or read-only access to the
recipient. The recipient then needs an interface to drag the new folder into
their vdrive and give it a home.

== Revocation ==

When Alice decides that she no longer wants Bob to be able to access the
shared directory, what should she do? Suppose she's shared this folder with
both Bob and Carol, and now she wants Carol to retain access to it but Bob to
be shut out. Ideally Carol should not have to do anything: her access should
continue unabated.

The current plan is to have her client create a deep copy of the folder in
question, delegate access to the new folder to the remaining members of the
group (Carol), asking the lucky survivors to replace their old reference with
the new one. Bob may still have access to the old folder, but he is now the
only one who cares: everyone else has moved on, and he will no longer be able
to see their new changes. In a strict sense, this is the strongest form of
revocation that can be accomplished: there is no point trying to force Bob to
forget about the files that he read a moment before being kicked out. In
addition it must be noted that anyone who can access the directory can proxy
for Bob, reading files to him and accepting changes whenever he wants.
Preventing delegation between communicating parties is just as pointless as
asking Bob to forget previously accessed files. However, there may be value
in configuring the UI to ask Carol not to share files with Bob, or in
removing all files from Bob's view at the same time his access is revoked.