Changes between Initial Version and Version 1 of DirectoryNode


Timestamp: 2007-07-03T00:45:10Z (17 years ago)
Author: warner
Comment: copy in docs/dirnodes.txt from the source tree


= Tahoe Directory Nodes =

As explained in the architecture docs, Tahoe can be roughly viewed as a
collection of three layers. The lowest layer is the distributed filestore, or
DHT: it provides operations that accept files and upload them to the mesh,
creating a URI in the process which securely references the file's contents.
The middle layer is the filesystem, creating a structure of directories and
filenames resembling the traditional unix/windows filesystems. The top layer
is the application layer, which uses the lower layers to provide useful
services to users, like a backup application, or a way to share files with
friends.

This document examines the middle layer, the "filesystem".

== DHT Primitives ==

In the lowest layer (DHT), we've defined two operations thus far, both of
which refer to "CHK URIs", which reference immutable data:

 chk_uri = put(data)
 data = get(chk_uri)

We anticipate creating mutable slots in the DHT layer at some point, which
will add some new operations to this layer:

 slotname = create_slot()
 set(slotname, data)
 data = get(slotname)

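For illustration, here is a toy in-memory Python model of these primitives.
The ToyDHT class and its details (hash choice, slot naming) are made up for
this sketch; it only shows the shape of the API, not Tahoe's actual
implementation:

 import base64, hashlib, os

 class ToyDHT:
     """Toy stand-in for the DHT layer, purely for illustration."""
     def __init__(self):
         self._immutable = {}   # chk_uri -> data
         self._slots = {}       # slotname -> data

     def put(self, data):
         # A CHK URI is derived from the content, so identical data always
         # maps to the same URI.
         chk_uri = "URI:CHK:" + hashlib.sha256(data).hexdigest()
         self._immutable[chk_uri] = data
         return chk_uri

     def get(self, name):
         return self._immutable[name] if name in self._immutable else self._slots[name]

     # anticipated mutable-slot operations
     def create_slot(self):
         slotname = base64.b32encode(os.urandom(16)).decode("ascii").lower()
         self._slots[slotname] = b""
         return slotname

     def set(self, slotname, data):
         self._slots[slotname] = data
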
== Filesystem Goals ==

The main goal for the middle (filesystem) layer is to give users a way to
organize the data that they have uploaded into the mesh. The traditional way
to do this in computer filesystems is to put this data into files, give those
files names, and collect these names into directories.

Each directory is a series of name-value pairs, which maps "child name" to an
object of some kind. Those child objects might be files, or they might be
other directories.

The directory structure is therefore a directed graph of nodes, in which each
node might be a directory node or a file node. All file nodes are terminal
nodes.

== Dirnode Goals ==

What properties might be desirable for these directory nodes? In no
particular order:

 1: functional. Code which does not work doesn't count.

 2: easy to document, explain, and understand

 3: private: it should not be possible for others to see the contents of a directory

 4: integrity: it should not be possible for others to modify the contents of a directory

 5: available: directories should survive host failure, just like files do

 6: efficient: in storage, communication bandwidth, number of round-trips

 7: easy to delegate individual directories in a flexible way

 8: updateness: everybody looking at a directory should see the same contents

 9: monotonicity: everybody looking at a directory should see the same sequence of updates

We do not meet all of these goals. For the current release, we favored #1,
#2, and #7 above the rest, which led us to the following design. In a later
section, we discuss some alternate designs and potential changes to the
existing code that can help us achieve the other goals.

In tahoe-0.4.0, each "dirnode" is stored as a file on a single "vdrive
server". The name of this file is an unguessable string. The contents are an
encrypted representation of the directory's name-to-child mapping. Foolscap
is used to provide remote access to this file. A collection of "directory
URIs" are used to hold all the parameters necessary to access, read, and
write this dirnode.

== Dirnode secret values ==

Each dirnode begins life as a "writekey", a randomly-generated AES key. This
key is hashed (using a tagged hash, see src/allmydata/util/hashutil.py for
details) to form the "readkey". The readkey is hashed to form the "storage
index". The writekey is hashed with a different tag to form the "write
enabler".

Clients who have read-write access to the dirnode know the writekey, and can
derive all the other secrets from it. Clients with merely read-only access to
the dirnode know the readkey (and can derive the storage index), but do not
know the writekey or the write enabler. The vdrive server knows only the
storage index and the write enabler.

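The derivation chain can be sketched in a few lines of Python. The tag
strings and truncation lengths below are placeholders (the real tagged-hash
construction lives in src/allmydata/util/hashutil.py); the point is only that
each secret is a one-way function of the one before it:

 import hashlib, os

 def tagged_hash(tag, value):
     # Placeholder tagged hash; see hashutil.py for the real construction.
     return hashlib.sha256(tag + b":" + value).digest()

 writekey = os.urandom(16)                                    # random AES key
 readkey = tagged_hash(b"dirnode-readkey", writekey)[:16]     # derived from writekey
 storage_index = tagged_hash(b"dirnode-index", readkey)[:16]  # derived from readkey
 write_enabler = tagged_hash(b"dirnode-write-enabler", writekey)

A read-only client can be handed just the readkey: it can recompute the
storage index, but cannot work backwards to the writekey or write enabler.
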
== Dirnode capability URIs ==

The "write capability" for a dirnode grants read-write access to its
contents. This is expressed in concrete form as the "dirnode write URI": a
printable string which contains the following pieces of information:

 * furl of the vdrive server hosting this dirnode
 * writekey

The "read capability" grants read-only access to a dirnode, and its "dirnode
read URI" contains:

 * furl of the vdrive server hosting this dirnode
 * readkey

For example,
URI:DIR:pb://xextf3eap44o3wi27mf7ehiur6wvhzr6@207.7.153.180:56677,127.0.0.1:56677/vdrive:shrrn75qq3x7uxfzk326ncahd4======
is a write-capability URI, while
URI:DIR-RO:pb://xextf3eap44o3wi27mf7ehiur6wvhzr6@207.7.153.180:56677,127.0.0.1:56677/vdrive:4c2legsthoe52qywuaturgwdrm======
is a read-capability URI, both for the same dirnode.


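The example URIs above split cleanly into those two pieces. A rough parser
(simple string handling for illustration, not the real parsing code in the
allmydata source) might look like:

 def parse_dirnode_uri(uri):
     # "URI:DIR:<furl>/vdrive:<writekey>" or "URI:DIR-RO:<furl>/vdrive:<readkey>"
     _, flavor, rest = uri.split(":", 2)   # "URI", "DIR" or "DIR-RO", remainder
     furl, key = rest.rsplit(":", 1)       # vdrive server FURL, base32-encoded key
     return flavor, furl, key

 flavor, furl, key = parse_dirnode_uri(
     "URI:DIR-RO:pb://xextf3eap44o3wi27mf7ehiur6wvhzr6@207.7.153.180:56677,"
     "127.0.0.1:56677/vdrive:4c2legsthoe52qywuaturgwdrm======")
 assert flavor == "DIR-RO"
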
== Dirnode storage format ==

Each dirnode is stored in a single file, saved on the vdrive server, using
the (base32-encoded) storage index as a filename. The contents of this file
are a serialized dictionary which maps H_name (explained below) to a tuple
with three values: (E_name, E_write, E_read). The vdrive server is made
available as a Foolscap "Referenceable" object, with the following
operations:

 create_dirnode(index, write_enabler) -> None
 list(index) -> list of (E_name, E_write, E_read) tuples
 get(index, H_name) -> (E_write, E_read)
 set(index, write_enabler, H_name, E_name, E_write, E_read)
 delete(index, write_enabler, H_name)

For any given entry of this dictionary, the following values are obtained by
hashing or encryption:

  H_name is the hash of the readkey and the child's name.
  E_name is the child's name, encrypted with the readkey
  E_write is the child's write-URI, encrypted with the writekey
  E_read is the child's read-URI, encrypted with the readkey

All encryption uses AES in CTR mode, in which the high-order 10 or 12 bytes
of the 16-byte key are used as an IV (randomly chosen each time the data is
changed), and the remaining bytes are used as the CTR-mode offset. An
HMAC-SHA256 is computed for each encrypted value and stored alongside. The
stored E_name/E_write/E_read values are thus the concatenation of IV,
encrypted data, and HMAC.

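A sketch of how one such entry might be produced, using the "cryptography"
package. The IV width, counter layout, and reuse of the same key for both
encryption and the HMAC are illustrative guesses rather than the exact
construction in the source; the sketch only shows the IV + ciphertext + HMAC
concatenation described above:

 import hashlib, hmac, os
 from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

 def hashed_name(readkey, childname):
     # H_name: lets the server index entries without learning the name.
     return hashlib.sha256(readkey + childname.encode("utf-8")).digest()

 def encrypt_value(key, plaintext):
     iv = os.urandom(12)                   # fresh IV each time the data changes
     counter_block = iv + b"\x00" * 4      # remaining bytes act as the CTR offset
     enc = Cipher(algorithms.AES(key), modes.CTR(counter_block)).encryptor()
     ciphertext = enc.update(plaintext) + enc.finalize()
     mac = hmac.new(key, iv + ciphertext, hashlib.sha256).digest()
     return iv + ciphertext + mac          # stored as E_name / E_write / E_read

 readkey = os.urandom(16)
 H_name = hashed_name(readkey, "photos")
 E_name = encrypt_value(readkey, b"photos")
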
When a new dirnode is created, it records the write_enabler. All operations
that modify an existing dirnode (set and delete) require the write_enabler be
presented.

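Putting the table format and the write_enabler check together, the vdrive
server's job can be modelled with a small in-memory class. This is a toy
sketch for illustration only; the real server is a Foolscap Referenceable
that persists each dirnode to a file named after its storage index:

 class ToyVDriveServer:
     def __init__(self):
         self._tables = {}    # storage index -> {H_name: (E_name, E_write, E_read)}
         self._enablers = {}  # storage index -> write_enabler

     def create_dirnode(self, index, write_enabler):
         self._tables.setdefault(index, {})
         self._enablers[index] = write_enabler

     def list(self, index):
         return list(self._tables[index].values())

     def get(self, index, H_name):
         E_name, E_write, E_read = self._tables[index][H_name]
         return (E_write, E_read)

     def set(self, index, write_enabler, H_name, E_name, E_write, E_read):
         if write_enabler != self._enablers[index]:
             raise ValueError("bad write enabler")   # read-only clients stop here
         self._tables[index][H_name] = (E_name, E_write, E_read)

     def delete(self, index, write_enabler, H_name):
         if write_enabler != self._enablers[index]:
             raise ValueError("bad write enabler")
         del self._tables[index][H_name]

Note that the server holds only opaque hashes and ciphertexts: it can store
and return entries, but cannot read names or URIs.
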
This approach ensures that clients who do not have the read or write keys
(including the vdrive server, which knows the storage index but not the keys)
will be unable to see any of the contents of the dirnode. Clients who have
the readkey but not the writekey will not be allowed to modify the dirnode.
The H_name value allows clients to perform lookups of specific keys rather
than requiring them to download the whole dirnode for each operation.

By putting both read-only and read-write child access capabilities in each
entry, encrypted by different keys, this approach provides transitive
read-only-ness: if a client has only a readkey for the parent dirnode, they
will only get readkeys (and not writekeys) for any children, including other
directories. When we create mutable slots in the mesh and we start having
read-write file URIs, we can use the same approach to ensure that read-only
access to a directory means read-only access to the files as well.


== Design Goals, redux ==

How well does this design meet the goals?

 * #1 functional: YES: the code works and has extensive unit tests
 * #2 documentable: YES: this document is the existence proof
 * #3 private: MOSTLY: see the discussion below
 * #4 integrity: MOSTLY: the vdrive server can roll back individual slots
 * #5 availability: BARELY: if the vdrive server is offline, the dirnode
   will be unusable. If the vdrive server fails, the dirnode will be lost
   forever.
 * #6 efficient: MOSTLY:
  * network: single dirnode lookup is very efficient, since clients can
    fetch specific keys rather than being required to get or set the
    entire dirnode each time. Traversing many directories takes a lot of
    roundtrips, and these can't be collapsed with promise-pipelining
    because the intermediate values must only be visible to the client.
    Modifying many dirnodes at once (e.g. importing a large pre-existing
    directory tree) is pretty slow, since each graph edge must be created
    independently.
  * storage: each child has a separate IV, which makes them larger than
    if all children were aggregated into a single encrypted string
 * #7 delegation: VERY: each dirnode is a completely independent object,
   to which clients can be granted separate read-write or read-only access
 * #8 updateness: VERY: with only a single point of access, and no caching,
   each client operation starts by fetching the current value, so there
   are no opportunities for staleness
 * #9 monotonicity: VERY: the single point of access also protects against
   retrograde motion


=== Privacy leaks in the vdrive server ===

Dirnodes are very private against other clients: traffic between the client
and the vdrive server is protected by the Foolscap SSL connection, so they
can observe very little. Storage index values are hashes of secrets and thus
unguessable, and they are not made public, so other clients cannot snoop
through encrypted dirnodes that they have not been told about.

On the other hand, the vdrive server gets to see the access patterns of each
client who is using dirnodes hosted there. The childnames and URIs are
encrypted and not visible to anyone (including the vdrive server), but the
vdrive server is in a good position to infer a lot of data about the
directory structure. It knows the length of all childnames, and from the
length of the child URIs themselves it can tell whether children are file
URIs vs. directory URIs vs. read-only directory URIs. By watching a client's
access patterns it can deduce the connection between (encrypted) child 1 and
target directory 2 (i.e. if the client does a 'get' of the first child, then
immediately does an operation on directory 2, it can assume the two are
related). From this the vdrive server can build a graph with the same shape
as the filesystem, even though the nodes and edges will be unlabeled.

By providing CHK-level storage services as well (or colluding with a server
who is), the vdrive server can infer the storage index of file nodes that are
downloaded shortly after their childname is looked up.

=== Integrity failures in the vdrive server ===

The HMAC prevents the vdrive server from modifying the child names or child
URI values without detection: changing a few bytes will cause an HMAC failure
that the client can detect. This means the vdrive server can make the dirnode
unavailable, but not corrupt it.

However, the vdrive server can perform a rollback attack: either replacing an
individual entry in the encrypted table with an old version, or replacing the
entire table. Despite not knowing what the child names or URIs are, the
vdrive server can undo changes made by authorized clients. It could also
perform selective rollback, showing different clients different versions of
the filesystem. Solving this problem would require either storing mutable
data (like a sequence number or hash) in the URI which points to this dirnode
(rendering it non-constant, and losing most of its value), or spreading the
dirnode out over multiple non-colluding servers (which might improve
availability but makes updateness and monotonicity harder).


=== Improving the availability of dirnodes ===

Clearly it is somewhat disappointing to have a sexy distributed filestore at
the bottom layer and then have a single-point-of-failure vdrive server on top
of it. However, this approach meets many of the design goals and is extremely
simple to explain and implement. There are many avenues to improve the
reliability and availability of dirnodes. (Note that reliability and
availability can be separate goals.)

A simple way to improve the reliability of dirnodes would be to make the
vdrive server responsible for saving the dirnode contents in a fashion
that will survive the failure of its local disk, for example by simply
rsync'ing all the dirnodes off to a separate machine on a periodic basis, and
pulling them back in the case of disk failure.

To improve availability, we must allow clients to access their dirnodes even
if the vdrive server is offline. The first step here is to create multiple
vdrive servers, putting a list of furls into the DIR:URI, with instructions
to update all of them during write, and accept the first answer that comes
back during read. This introduces issues of updateness and monotonicity: if a
dirnode is changed while one of the vdrive servers is offline, the servers
will diverge, and subsequent clients will see different contents depending
upon which server they ask.

A more comforting way to improve both reliability and availability is to
spread the dirnodes out over the mesh in the same way that CHK files work.
The general name for this approach is the "SSK directory slot", a structure
for keeping a mutable slot on multiple hosts, setting and retrieving its
contents at various times, and reconciling differences by comparing sequence
numbers. The "slot name" is the hash of a public key, which is also used to
sign updates, such that the SSK storage hosts will only accept updates from
those in possession of the corresponding private key. This approach (although
not yet implemented) will provide fairly good reliability and availability
properties, at the expense of complexity and updateness/monotonicity. It can
also improve integrity, since an attacker would have to corrupt multiple
storage servers to successfully perform a rollback attack.

Reducing centralization can improve reliability, as long as the overall
reliability of the mesh is greater than the reliability of the original
centralized services.

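The SSK idea can be illustrated with a short sketch. The key type, encodings,
and wire format below are placeholders for whatever the eventual design uses;
the sketch only shows the relationship between the slot name, the public key,
and signed, sequence-numbered updates:

 import hashlib
 from cryptography.hazmat.primitives import serialization
 from cryptography.hazmat.primitives.asymmetric import ed25519

 signing_key = ed25519.Ed25519PrivateKey.generate()
 verify_key_bytes = signing_key.public_key().public_bytes(
     serialization.Encoding.Raw, serialization.PublicFormat.Raw)

 # The slot name is derived from the public key, so anyone can locate the
 # slot, but only the private-key holder can produce acceptable updates.
 slot_name = hashlib.sha256(verify_key_bytes).hexdigest()

 seqnum = 1
 contents = b"serialized dirnode contents"
 update = seqnum.to_bytes(8, "big") + contents
 signature = signing_key.sign(update)
 # A storage host holding verify_key_bytes accepts the update only if the
 # signature verifies and seqnum exceeds the version it already holds.
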
=== Improving the efficiency of dirnodes ===

By storing each child of a dirnode in a separate element of the dictionary,
we provide efficient directory traversal and clean+simple dirnode delegation
behavior. This comes at the cost of efficiency for other operations,
specifically things that operate on multiple dirnodes at once.

When a backup program is run for the first time, it needs to copy a large
amount of data from a pre-existing filesystem into reliable storage. This
means that a large and complex directory structure needs to be duplicated in
the dirnode layer. With the one-object-per-dirnode approach described here,
this requires as many operations as there are edges in the imported
filesystem graph.

Another approach would be to aggregate multiple directories into a single
storage object. This object would contain a serialized graph rather than a
single name-to-child dictionary. Most directory operations would fetch the
whole block of data (and presumably cache it for a while to avoid lots of
re-fetches), and modification operations would need to replace the whole
thing at once. This "realm" approach would have the added benefit of
combining more data into a single encrypted bundle (perhaps hiding the shape
of the graph from the vdrive server better), and would reduce round-trips
when performing deep directory traversals (assuming the realm was already
cached). It would also prevent fine-grained rollback attacks from working:
the vdrive server could change the entire dirnode to look like an earlier
state, but it could not independently roll back individual edges.

The drawbacks of this aggregation would be that small accesses (adding a
single child, looking up a single child) would require pulling or pushing a
lot of unrelated data, increasing network overhead (and necessitating
test-and-set semantics for the modification side, which increases the chances
that a user operation will fail, making it more challenging to provide
promises of atomicity to the user).

It would also make it much more difficult to enable the delegation
("sharing") of specific directories. Since each aggregate "realm" provides
all-or-nothing access control, the act of delegating any directory from the
middle of the realm would require the realm first be split into the upper
piece that isn't being shared and the lower piece that is. This splitting
would have to be done in response to what is essentially a read operation,
which is not traditionally supposed to be a high-effort action.


=== Dirnode expiration and leases ===

Dirnodes are created any time a client wishes to add a new directory. How
long do they live? What's to keep them from sticking around forever, taking
up space that nobody can reach any longer?

Our plan is to define the vdrive servers to keep dirnodes alive with
"leases". Clients which know and care about specific dirnodes can ask to keep
them alive for a while, by renewing a lease on them (with a typical period of
one month). Clients are expected to assist in the deletion of dirnodes by
canceling their leases as soon as they are done with them. This means that
when a client deletes a directory, it should also cancel its lease on that
directory. When the lease count on a dirnode goes to zero, the vdrive server
can delete the related storage. Multiple clients may all have leases on the
same dirnode: the server may delete the dirnode only after all of the leases
have gone away.

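The lease accounting the server needs is simple to model. This toy sketch is
illustrative only (the real accounting lives on the vdrive/storage servers,
and the exact lease period is a policy choice):

 import time

 class ToyLeaseTable:
     LEASE_PERIOD = 31 * 24 * 3600   # roughly one month

     def __init__(self):
         self._leases = {}   # storage index -> {client_id: expiration time}

     def renew(self, index, client_id, now=None):
         now = now if now is not None else time.time()
         self._leases.setdefault(index, {})[client_id] = now + self.LEASE_PERIOD

     def cancel(self, index, client_id):
         self._leases.get(index, {}).pop(client_id, None)

     def deletable(self, index, now=None):
         # The server may delete the dirnode only once every lease is gone.
         now = now if now is not None else time.time()
         live = [t for t in self._leases.get(index, {}).values() if t > now]
         return not live
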
We expect that clients will periodically create a "manifest": a list of
so-called "refresh capabilities" for all of the dirnodes and files that they
can reach. They will give this manifest to the "repairer", which is a service
that keeps files (and dirnodes) alive on behalf of clients who cannot take on
this responsibility for themselves. These refresh capabilities include the
storage index, but do *not* include the readkeys or writekeys, so the
repairer does not get to read the files or directories that it is helping to
keep alive.

After each change to the user's vdrive, the client creates a manifest and
looks for differences from their previous version. Anything which was removed
prompts the client to send out lease-cancellation messages, allowing the data
to be deleted.


== Starting Points: root dirnodes ==

Any client can record the URI of a directory node in some external form (say,
in a local file) and use it as the starting point of later traversal. The
current vdrive servers are configured to create a "root" dirnode at startup
and publish its URI to the world: this forms the basis of the "global shared
vdrive" used in the demonstration application. In addition, client code is
currently designed to create a new (unattached) dirnode at startup and record
its URI: this forms the root of the "per-user private vdrive" presented as
the "~" directory.

== Mounting and Sharing Directories ==

The biggest benefit of this dirnode approach is that sharing individual
directories is almost trivial. Alice creates a subdirectory that she wants to
use to share files with Bob. This subdirectory is attached to Alice's
filesystem at "~alice/share-with-bob". She asks her filesystem for the
read-write directory URI for that new directory, and emails it to Bob. When
Bob receives the URI, he asks his own local vdrive to attach the given URI,
perhaps at a place named "~bob/shared-with-alice". Every time either party
writes a file into this directory, the other will be able to read it. If
Alice prefers, she can give a read-only URI to Bob instead, and then Bob will
be able to read files but not change the contents of the directory. Neither
Alice nor Bob will get access to any files above the mounted directory: there
are no 'parent directory' pointers. If Alice creates a nested set of
directories, "~alice/share-with-bob/subdir2", and gives a read-only URI for
share-with-bob to Bob, then Bob will be unable to write to either
share-with-bob/ or subdir2/.

A suitable UI needs to be created to allow users to easily perform this
sharing action: dragging a folder from their vdrive onto an IM or email user
icon, for example. The UI will need to give the sending user an opportunity
to indicate whether they want to grant read-write or read-only access to the
recipient. The recipient then needs an interface to drag the new folder into
their vdrive and give it a home.

== Revocation ==

When Alice decides that she no longer wants Bob to be able to access the
shared directory, what should she do? Suppose she's shared this folder with
both Bob and Carol, and now she wants Carol to retain access to it but Bob to
be shut out. Ideally Carol should not have to do anything: her access should
continue unabated.

The current plan is to have her client create a deep copy of the folder in
question, delegate access to the new folder to the remaining members of the
group (Carol), and ask the lucky survivors to replace their old reference
with the new one. Bob may still have access to the old folder, but he is now
the only one who cares: everyone else has moved on, and he will no longer be
able to see their new changes. In a strict sense, this is the strongest form
of revocation that can be accomplished: there is no point trying to force Bob
to forget about the files that he read a moment before being kicked out. In
addition it must be noted that anyone who can access the directory can proxy
for Bob, reading files to him and accepting changes whenever he wants.
Preventing delegation between communicating parties is just as pointless as
asking Bob to forget previously accessed files. However, there may be value
in configuring the UI to ask Carol not to share files with Bob, or in
removing all files from Bob's view at the same time his access is revoked.