wiki:DirectoryNode

Version 2 (modified by zooko, at 2008-04-08T15:32:30Z) (diff)

--

THIS PAGE IS STALE -- DO NOT BELIEVE WHAT IT SAYS.

ktwilight has offered to merge this page into docs/dirnodes.txt. After all the non-stale information from this wiki page has been merged into that text file, delete this wiki page. --Zooko 2008-04-08

Tahoe Directory Nodes

As explained in the architecture docs, Tahoe can be roughly viewed as a collection of three layers. The lowest layer is the distributed filestore, or DHT: it provides operations that accept files and upload them to the mesh, creating a URI in the process which securely references the file's contents. The middle layer is the filesystem, creating a structure of directories and filenames resembling the traditional unix/windows filesystems. The top layer is the application layer, which uses the lower layers to provide useful services to users, like a backup application, or a way to share files with friends.

This document examines the middle layer, the "filesystem".

DHT Primitives

In the lowest layer (DHT), we've defined two operations thus far, both of which refer to "CHK URIs", which reference immutable data:

chk_uri = put(data) data = get(chk_uri)

We anticipate creating mutable slots in the DHT layer at some point, which will add some new operations to this layer:

slotname = create_slot() set(slotname, data) data = get(slotname)

Filesystem Goals

The main goal for the middle (filesystem) layer is to give users a way to organize the data that they have uploaded into the mesh. The traditional way to do this in computer filesystems is to put this data into files, give those files names, and collect these names into directories.

Each directory is a series of name-value pairs, which maps "child name" to an object of some kind. Those child objects might be files, or they might be other directories.

The directory structure is therefore a directed graph of nodes, in which each node might be a directory node or a file node. All file nodes are terminal nodes.

Dirnode Goals

What properties might be desireable for these directory nodes? In no particular order:

1: functional. Code which does not work doesn't count.

2: easy to document, explain, and understand

3: private: it should not be possible for others to see the contents of a directory

4: integrity: it should not be possible for others to modify the contents of a directory

5: available: directories should survive host failure, just like files do

6: efficient: in storage, communication bandwidth, number of round-trips

7: easy to delegate individual directories in a flexible way

8: updateness: everybody looking at a directory should see the same contents

9: monotonicity: everybody looking at a directory should see the same sequence of updates

We do not meet all of these goals. For the current release, we favored #1, #2, and #7 above the rest, which lead us to the following design. In a later #section, we discuss some alternate designs and potential changes to the #existing code that can help us achieve the other goals.

In tahoe-0.4.0, each "dirnode" is stored as a file on a single "vdrive server". The name of this file is an unguessable string. The contents are an encrypted representation of the directory's name-to-child mapping. Foolscap is used to provide remote access to this file. A collection of "directory URIs" are used to hold all the parameters necessary to access, read, and write this dirnode.

Dirnode secret values

Each dirnode begins life as a "writekey", a randomly-generated AES key. This key is hashed (using a tagged hash, see src/allmydata/util/hashutil.py for details) to form the "readkey". The readkey is hashed to form the "storage index". The writekey is hashed with a different tag to form the "write enabler".

Clients who have read-write access to the dirnode know the writekey, and can derive all the other secrets from it. Clients with merely read-only access to the dirnode know the readkey (and can derive the storage index), but do not know the writekey or the write enabler. The vdrive server knows only the storage index and the write enabler.

Dirnode capability URIs

The "write capability" for a dirnode grants read-write access to its contents. This is expressed on concrete form as the "dirnode write URI": a printable string which contains the following pieces of information:

  • furl of the vdrive server hosting this dirnode
  • writekey

The "read capability" grants read-only access to a dirnode, and its "dirnode read URI" contains:

  • furl of the vdrive server hosting this dirnode
  • readkey

For example, URI:DIR:pb://xextf3eap44o3wi27mf7ehiur6wvhzr6@207.7.153.180:56677,127.0.0.1:56677/vdrive:shrrn75qq3x7uxfzk326ncahd4====== is a write-capability URI, while URI:DIR-RO:pb://xextf3eap44o3wi27mf7ehiur6wvhzr6@207.7.153.180:56677,127.0.0.1:56677/vdrive:4c2legsthoe52qywuaturgwdrm====== is a read-capability URI, both for the same dirnode.

Dirnode storage format

Each dirnode is stored in a single file, saved on the vdrive server, using the (base32-encoded) storage index as a filename. The contents of this file are a serialized dictionary which maps H_name (explained below) to a tuple with three values: (E_name, E_write, E_read). The vdrive server is made available as a Foolscap "Referenceable" object, with the following operations:

create_dirnode(index, write_enabler) -> None list(index) -> list of (E_name, E_write, E_read) tuples get(index, H_name) -> (E_write, E_read) set(index, write_enabler, H_name, E_name, E_write, E_read) delete(index, write_enabler, H_name)

For any given entry of this dictionary, the following values are obtained by hashing or encryption:

H_name is the hash of the readkey and the child's name. E_name is the child's name, encrypted with the readkey E_write is the child's write-URI, encrypted with the writekey E_read is the child's read-URI, encrypted with the readkey

All encryption uses AES in CTR mode, in which the high-order 10 or 12 bytes of the 16-byte key are used as an IV (randomly chosen each time the data is changed), and the remaining bytes are used as the CTR-mode offset. An HMAC-SHA256 is computed for each encrypted value and stored alongside. The stored E_name/E_write/E_read values are thus the concatenation of IV, encrypted data, and HMAC.

When a new dirnode is created, it records the write_enabler. All operations that modify an existing dirnode (set and delete) require the write_enabler be presented.

This approach insures that clients who do not have the read or write keys (including the vdrive server, which knows the storage index but not the keys) will be unable to see any of the contents of the dirnode. Clients who have the readkey but not the writekey will not be allowed to modify the dirnode. The H_name value allows clients to perform lookups of specific keys rather than requiring them to download the whole dirnode for each operation.

By putting both read-only and read-write child access capabilities in each entry, encrypted by different keys, this approach provides transitive read-only-ness: if a client has only a readkey for the parent dirnode, they will only get readkeys (and not writekeys) for any children, including other directories. When we create mutable slots in the mesh and we start having read-write file URIs, we can use the same approach to insure that read-only access to a directory means read-only access to the files as well.

Design Goals, redux

How well does this design meet the goals?

  • #1 functional: YES: the code works and has extensive unit tests
  • #2 documentable: YES: this document is the existence proof
  • #3 private: MOSTLY: see the discussion below
  • #4 integrity: MOSTLY: the vdrive server can rollback individual slots
  • #5 availability: BARELY: if the vdrive server is offline, the dirnode will be unuseable. If the vdrive erver fails, the dirnode will be lost forever.
  • #6 efficient: MOSTLY:
    • network: single dirnode lookup is very efficient, since clients can fetch specific keys rather than being required to get or set the entire dirnode each time. Traversing many directories takes a lot of roundtrips, and these can't be collapsed with promise-pipelining because the intermediate values must only be visible to the client. Modifying many dirnodes at once (e.g. importing a large pre-existing directory tree) is pretty slow, since each graph edge must be created independently.
    • storage: each child has a separate IV, which makes them larger than if all children were aggregated into a single encrypted string
  • #7 delegation: VERY: each dirnode is a completely independent object, to which clients can be granted separate ead-write or read-only access
  • #8 updateness: VERY: with only a single point of access, and no caching, each client operation starts by fetching the current value, so there are no opportunities for staleness
  • #9 monotonicity: VERY: the single point of access also protects against retrograde motion

Privacy leaks in the vdrive server

Dirnodes are very private against other clients: traffic between the client and the vdrive server is protected by the Foolscap SSL connection, so they can observe very little. Storage index values are hashes of secrets and thus unguessable, and they are not made public, so other clients cannot snoop through encrypted dirnodes that they have not been told about.

On the other hand, the vdrive server gets to see the access patterns of each client who is using dirnodes hosted there. The childnames and URIs are encrypted and not visible to anyone (including the vdrive server), but the vdrive server is in a good position to infer a lot of data about the directory structure. It knows the length of all childnames, and from the length of the child URIs themselves it can tell whether children are file URIs vs. directory URIs vs read-only directory URIs. By watching a client's access patterns it can deduce the connection between (encrypted) child 1 and target directory 2 (i.e. if the client does a 'get' of the first child, then immediately does an operation on directory 2, it can assume the two are related. From this the vdrive server can build a graph with the same shape as the filesystem, even though the nodes and edges will be unlabled.

By providing CHK-level storage services as well (or colluding with a server who is), the vdrive server can infer the storage index of file nodes that are downloaded shortly after their childname is looked up.

Integrity failures in the vdrive server

The HMAC prevents the vdrive server from modifying the child names or child URI values without detection: changing a few bytes will cause an HMAC failure that the client can detect. This means the vdrive server can make the dirnode unavailable, but not corrupt it.

However, the vdrive server can perform a rollback attack: either replacing an individual entry in the encrypted table with an old version, or replacing the entire table. Despite not knowing what the child names or URIs are, the vdrive server can undo changes made by authorized clients. It could also perform selective rollback, showing different clients different versions of the filesystem. To solve this problem either requires mutable data (like a sequence number or hash) to be stored in the URI which points to this dirnode (rendering them non-constant, and losing most of their value), or requires spreading the dirnode out over multiple non-colluding servers (which might improve availability but makes updateness and monotonicity harder).

Improving the availability of dirnodes

Clearly it is somewhat disappointing to have a sexy distributed filestore at the bottom layer and then have a single-point-of-failure vdrive server on top of it. However, this approach meets many of the design goals and is extremely simple to explain and implement. There are many avenues to improve the reliability and availability of dirnodes. (note that reliability and availability can be separate goals).

A simple way to improve the reliability of dirnodes would be to make the vdrive server be responsible for saving the dirnode contents in a fashion that will survive the failure of its local disk, for example by simply rsync'ing all the dirnodes off to a separate machine on a periodic basis, and pulling them back in the case of disk failure.

To improve availability, we must allow clients to access their dirnodes even if the vdrive server is offline. The first step here is to create multiple vdrive servers, putting a list of furls into the DIR:URI, with instructions to update all of them during write, and accept the first answer that comes back during read. This introduces issues of updateness and monotonicity: if a dirnode is changed while one of the vdrive servers is offline, the servers will diverge, and subsequent clients will see different contents depending upon which server they ask.

A more comforting way to improve both reliability and availability is to spread the dirnodes out over the mesh in the same way that CHK files work. The general name for this approach is the "SSK directory slot", a structure for keeping a mutable slot on multiple hosts, setting and retrieving its contents at various times, and reconciling differences by comparing sequence numbers. The "slot name" is the hash of a public key, which is also used to sign updates, such that the SSK storage hosts will only accept updates from those in possession of the corresponding private key. This approach (although not yet implemented) will provide fairly good reliability and availability properties, at the expense of complexity and updateness/monotonicity. It can also improve integrity, since an attacker would have to corrupt multiple storage servers to successfully perform a rollback attack.

Reducing centralization can improve reliability, as long as the overall reliability of the mesh is greater than the reliability of the original centralized services.

Improving the efficiency of dirnodes

By storing each child of a dirnode in a separate element of the dictionary, we provide efficient directory traversal and clean+simple dirnode delegation behavior. This comes at the cost of efficiency for other operations, specifically things that operation on multiple dirnodes at once.

When a backup program is run for the first time, it needs to copy a large amount of data from a pre-existing filesystem into reliable storage. This means that a large and complex directory structure needs to be duplicated in the dirnode layer. With the one-object-per-dirnode approach described here, this requires as many operations as there are edges in the imported filesystem graph.

Another approach would be to aggregate multiple directories into a single storage object. This object would contain a serialized graph rather than a single name-to-child dictionary. Most directory operations would fetch the whole block of data (and presumeably cache it for a while to avoid lots of re-fetches), and modification operations would need to replace the whole thing at once. This "realm" approach would have the added benefit of combining more data into a single encrypted bundle (perhaps hiding the shape of the graph from the vdrive server better), and would reduce round-trips when performing deep directory traversals (assuming the realm was already cached). It would also prevent fine-grained rollback attacks from working: the vdrive server could change the entire dirnode to look like an earlier state, but it could not independently roll back individual edges.

The drawbacks of this aggregation would be that small accesses (adding a single child, looking up a single child) would require pulling or pushing a lot of unrelated data, increasing network overhead (and necessitating test-and-set semantics for the modification side, which increases the chances that a user operation will fail, making it more challenging to provide promises of atomicity to the user).

It would also make it much more difficult to enable the delegation ("sharing") of specific directories. Since each aggregate "realm" provides all-or-nothing access control, the act of delegating any directory from the middle of the realm would require the realm first be split into the upper piece that isn't being shared and the lower piece that is. This splitting would have to be done in response to what is essentially a read operation, which is not traditionally supposed to be a high-effort action.

Dirnode expiration and leases

Dirnodes are created any time a client wishes to add a new directory. How long do they live? What's to keep them from sticking around forever, taking up space that nobody can reach any longer?

Our plan is to define the vdrive servers to keep dirnodes alive with "leases". Clients which know and care about specific dirnodes can ask to keep them alive for a while, by renewing a lease on them (with a typical period of one month). Clients are expected to assist in the deletion of dirnodes by canceling their leases as soon as they are done with them. This means that when a client deletes a directory, it should also cancel its lease on that directory. When the lease count on a dirnode goes to zero, the vdrive server can delete the related storage. Multiple clients may all have leases on the same dirnode: the server may delete the dirnode only after all of the leases have gone away.

We expect that clients will periodically create a "manifest": a list of so-called "refresh capabilities" for all of the dirnodes and files that they can reach. They will give this manifest to the "repairer", which is a service that keeps files (and dirnodes) alive on behalf of clients who cannot take on this responsibility for themselves. These refresh capabilities include the storage index, but do *not* include the readkeys or writekeys, so the repairer does not get to read the files or directories that it is helping to keep alive.

After each change to the user's vdrive, the client creates a manifest and looks for differences from their previous version. Anything which was removed prompts the client to send out lease-cancellation messages, allowing the data to be deleted.

Starting Points: root dirnodes

Any client can record the URI of a directory node in some external form (say, in a local file) and use it as the starting point of later traversal. The current vdrive servers are configured to create a "root" dirnode at startup and publish its URI to the world: this forms the basis of the "global shared vdrive" used in the demonstration application. In addition, client code is currently designed to create a new (unattached) dirnode at startup and record its URI: this forms the root of the "per-user private vdrive" presented as the "~" directory.

Mounting and Sharing Directories

The biggest benefit of this dirnode approach is that sharing individual directories is almost trivial. Alice creates a subdirectory that she wants to use to share files with Bob. This subdirectory is attached to Alice's filesystem at "~alice/share-with-bob". She asks her filesystem for the read-write directory URI for that new directory, and emails it to Bob. When Bob receives the URI, he asks his own local vdrive to attach the given URI, perhaps at a place named "~bob/shared-with-alice". Every time either party writes a file into this directory, the other will be able to read it. If Alice prefers, she can give a read-only URI to Bob instead, and then Bob will be able to read files but not change the contents of the directory. Neither Alice nor Bob will get access to any files above the mounted directory: there are no 'parent directory' pointers. If Alice creates a nested set of directories, "~alice/share-with-bob/subdir2", and gives a read-only URI to share-with-bob to Bob, then Bob will be unable to write to either share-with-bob/ or subdir2/.

A suitable UI needs to be created to allow users to easily perform this sharing action: dragging a folder their vdrive to an IM or email user icon, for example. The UI will need to give the sending user an opportunity to indicate whether they want to grant read-write or read-only access to the recipient. The recipient then needs an interface to drag the new folder into their vdrive and give it a home.

Revocation

When Alice decides that she no longer wants Bob to be able to access the shared directory, what should she do? Suppose she's shared this folder with both Bob and Carol, and now she wants Carol to retain access to it but Bob to be shut out. Ideally Carol should not have to do anything: her access should continue unabated.

The current plan is to have her client create a deep copy of the folder in question, delegate access to the new folder to the remaining members of the group (Carol), asking the lucky survivors to replace their old reference with the new one. Bob may still have access to the old folder, but he is now the only one who cares: everyone else has moved on, and he will no longer be able to see their new changes. In a strict sense, this is the strongest form of revocation that can be accomplished: there is no point trying to force Bob to forget about the files that he read a moment before being kicked out. In addition it must be noted that anyone who can access the directory can proxy for Bob, reading files to him and accepting changes whenever he wants. Preventing delegation between communication parties is just as pointless as asking Bob to forget previously accessed files. However, there may be value to configuring the UI to ask Carol to not share files with Bob, or to removing all files from Bob's view at the same time his access is revoked.