[tahoe-dev] Privacy of data when stored on allmydata.com
warner-tahoe at allmydata.com
Thu Feb 5 13:04:57 PST 2009
Hi Andrej.. my responses are also inline.
Also, note that a lot of this explanation involves the difference between the
Tahoe free-software project and AllMyData.com's commercial service. We
sometimes confuse the two, because Zooko and Peter and I are all employees of
allmydata: we're both the main hackers on Tahoe and the main hackers on a
.com service which uses Tahoe. As tahoe hackers, we've designed tahoe to
handle two use cases: a hacker-oriented friendnet, and a commercial
consumer-oriented backup service. While we've made compromises on the
commercial service, we've tried very hard to not let those affect tahoe.
> > * Tahoe is very carefully designed to protect the integrity and
> > confidentiality of the data stored therein. If you don't have the
> > filecap/dircap, you can't read the file.
> Should I read this as "user needs to have additional copies of
> 'filecap/dircap' thingy" or he will not be able to restore if they are lost
> on the PC that created backup?
Absolutely. Tahoe's capability-based access control model puts all the
authority in the cap. You can read a file if and only if you know the readcap
for that file. You might know the readcap because you uploaded the original
file and wrote the cap down, or because someone else (who knew it already)
gave it to you, or because you pulled it out of a directory that you can
read.
The ability to read the contents of a directory is controlled in a similar
way, but since directories are mutable, there are two separate capabilities:
the read+write cap (which we usually abbreviate to "writecap"), and the
read-only cap ("readcap"). If you have the writecap, you can derive the
readcap. If you only have the readcap, you cannot derive the writecap (in
fact the readcap is basically a cryptographic hash of the writecap).
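
The one-way relationship between writecap and readcap can be sketched in a few lines of Python. This is only an illustration: the SHA-256 hash, the "demo-writecap:"/"demo-readcap:" string formats, and the function names are all assumptions made for the example, and Tahoe's real derivation and cap encoding are different.

```python
import hashlib
import secrets

def make_writecap() -> str:
    """Create a random secret that stands in for a writecap (made-up format)."""
    return "demo-writecap:" + secrets.token_hex(16)

def derive_readcap(writecap: str) -> str:
    """Derive the readcap by hashing the writecap (hash choice is an assumption)."""
    digest = hashlib.sha256(writecap.encode("ascii")).hexdigest()
    return "demo-readcap:" + digest[:32]

writecap = make_writecap()
readcap = derive_readcap(writecap)

# Anyone holding the writecap can recompute the same readcap...
assert derive_readcap(writecap) == readcap
# ...but going from the readcap back to the writecap would mean inverting the hash.
```

Handing someone only `readcap` therefore gives them nothing they could turn back into `writecap`.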
The directory is just a table that maps child name to a cap. If you can read
the directory, you get to know all the caps inside it. (actually it's a bit
more complicated: if you have the directory readcap, you can get readcaps for
everything inside it, but if you also know the directory writecap, you can
get all the writecaps inside it: this enables "transitive readonlyness").
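
A toy model of that directory table, as a hedged sketch: the cap strings and the names `ChildEntry` and `list_children` are hypothetical, and a plain boolean flag stands in for whatever mechanism actually keeps child writecaps away from readcap holders.

```python
from typing import Dict, NamedTuple, Optional

class ChildEntry(NamedTuple):
    readcap: str
    writecap: Optional[str]  # None for immutable children, which have no writecap

# A directory is a table mapping child name -> caps (hypothetical cap strings).
directory: Dict[str, ChildEntry] = {
    "notes.txt": ChildEntry(readcap="R-notes", writecap=None),
    "photos": ChildEntry(readcap="R-photos", writecap="W-photos"),
}

def list_children(table: Dict[str, ChildEntry], have_writecap: bool) -> Dict[str, str]:
    """Return the strongest cap the holder may learn for each child."""
    visible = {}
    for name, entry in table.items():
        if have_writecap and entry.writecap is not None:
            visible[name] = entry.writecap
        else:
            visible[name] = entry.readcap
    return visible

as_reader = list_children(directory, have_writecap=False)
as_writer = list_children(directory, have_writecap=True)
```

Here `as_reader` contains only readcaps, so read-only access to the parent stays read-only all the way down, while `as_writer` exposes the writecap for "photos".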
If the phrase "Directed Graph" means anything to you, that's the best way to
explain how Tahoe files and directories work: they form a directed graph, in
which some out-of-band mechanism is used to hold one or more starting points.
The "alias" file is just such a mechanism. Since Tahoe (as opposed to
allmydata) is aimed at a hacker audience, the documentation is phrased in
terms of caps and this directed graph. (note that the term "DAG", for
Directed Acyclic Graph, is a bit more common, but Tahoe isn't limited to
DAGs, since it's perfectly happy to let directories form cycles).
Now, since Allmydata (as opposed to tahoe) is aimed at a consumer audience,
the allmydata.com user experience hides most of this stuff. These users know
about accounts (identified with an email address) and passwords. The
allmydata web site is responsible for turning an email+password pair into a
rootcap, and then either using that rootcap for the web-based "webdrive", or
giving it to a native windows client, where it is used as the root directory
of the windows virtual drive. These users almost never see the
filecaps or dircaps.
The Tahoe CLI is sort of halfway between these places. You can use it with
caps directly, but for convenience, most folks put a directory writecap in
their alias file and then forget about it. But, being aware of the underlying
model will help CLI users build an accurate picture of how file access is
controlled.
> Aha, another cap... maybe we need a dictionary? Is there a Wiki somewhere
> for this project? I have the feeling it would be useful...
True, we've used a lot of abbreviations. Here's a bit of a table. Each kind
of capability string has a distinctive prefix.
 1: immutable file read-only capability string    URI:CHK:
 2: immutable file verify capability string       URI:CHK-Verifier:
 3: mutable file read-write capability string     URI:SSK:
 4: mutable file read-only capability string      URI:SSK-RO:
 5: mutable file verify capability string         URI:SSK-Verifier:
 6: directory read-write capability string        URI:DIR2:
 7: directory read-only capability string         URI:DIR2-RO:
 8: directory verify capability string            URI:DIR2-Verifier:
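
Because each prefix is distinctive, telling the eight kinds apart is a simple string match. A small sketch, assuming only the prefixes listed above (the `classify_cap` helper and the shortened descriptions are made up for this example, not part of Tahoe's API):

```python
from typing import Optional

# Prefixes copied from the table above; the descriptions are shortened.
CAP_PREFIXES = {
    "URI:CHK:": "immutable file read-only cap",
    "URI:CHK-Verifier:": "immutable file verify cap",
    "URI:SSK:": "mutable file read-write cap",
    "URI:SSK-RO:": "mutable file read-only cap",
    "URI:SSK-Verifier:": "mutable file verify cap",
    "URI:DIR2:": "directory read-write cap",
    "URI:DIR2-RO:": "directory read-only cap",
    "URI:DIR2-Verifier:": "directory verify cap",
}

def classify_cap(cap: str) -> Optional[str]:
    """Return the kind of capability string, or None if the prefix is unknown."""
    for prefix, kind in CAP_PREFIXES.items():
        # The trailing colon keeps "URI:CHK:" from matching "URI:CHK-Verifier:".
        if cap.startswith(prefix):
            return kind
    return None
```

For example, `classify_cap("URI:DIR2-RO:...")` reports a directory read-only cap, and anything without a known prefix yields `None`.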
In Tahoe, directories are built out of mutable files (a directory is really
just a particular way to interpret the contents of a given mutable file), and
non-directory mutable files aren't used very much. All normal data files are
uploaded into immutable files by default.
Some capabilities can be used to derive others. If you have #1, you can
derive #2 (but not the other way around). The full table is:
 #1->#2
 #3->#4->#5
 #6->#7->#8
So we use "filecap" to talk about #1+#3+#4, but (since most files are
immutable) we're usually talking about #1. We use "dircap" to talk about #6
and #7. We use "readcap" to talk about #1, #4, and #7, but usually we refer to
#7 as a "directory readcap". We use "writecap" to talk about #3 and #6.
A "verifycap" is the weakest capability that still allows every bit of every
share to be validated (hashes checked, signatures verified, etc). That means
#2, #5, and #8.
When we talk about a "repaircap", we mean "the weakest capability that can
still be used to repair the file". Given the current limitations of the
repairer and our webapi, that means we're talking about #1, #3, or #6.
Eventually we'll fix this limitation, and any verifycap should be useable as
a repaircap too. (there's much less work involved to let #2 repair a file..
it's just an incomplete API, rather than a fundamental redesign of the server
We then use the somewhat-vague term "rootcap" to refer to a cap (usually a
directory write cap) that is not present inside any directory, so the only
way to ever reach it is to remember it somewhere outside of Tahoe. It might
be remembered in the allmydata.com rootcap database (indexed by account name
plus password), or it might be remembered in a ~/.tahoe/private/aliases file,
or it might just be written down on a piece of paper. The point is that you
have to start from somewhere, and we refer to such a starting point as a
rootcap.
> > Customers who maintain private directories
> Is that any dir that is created with tahoe create-alias/add-alias without
> specifying allmydata.com rootcap obtained from
> https://www.allmydata.com/native_client.php ?
Basically, yeah, we're defining a "private directory" as a directory which is
not reachable from the allmydata.com rootcap (to be technical, it is a
directory that is not included in the transitive closure of
child-is-in-directory relationships given the allmydata.com customer rootcap
table as a starting point). "tahoe mkdir" can be used to create private
directories (and "tahoe create-alias" is really just a wrapper for
mkdir+add-alias). If you then create a link to that new directory from
something that allmydata.com can reach, then it's not private anymore (with
respect to allmydata.com).
With the current tahoe limitations described earlier, if the tools at
allmydata.com can't reach a file or directory through the user's rootcap,
then we can't perform file repair or accounting on them. Furthermore, those
objects are basically invisible to the garbage-collection process, so the
next time we do a GC run (to delete shares for files that have been deleted
from customer directories), they'll qualify as garbage, and will be removed.
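
The "transitive closure" and garbage-collection reasoning above amounts to a reachability walk over the directory graph. A minimal sketch, with entirely hypothetical object names standing in for caps; it illustrates the idea, not allmydata.com's actual GC tooling:

```python
from typing import Dict, Iterable, List, Set

def reachable(graph: Dict[str, List[str]], roots: Iterable[str]) -> Set[str]:
    """Return every object reachable from the given roots (cycles are fine)."""
    seen: Set[str] = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited, which also stops us looping on cycles
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen

# Hypothetical grid contents: directory -> children; files have no children.
graph = {
    "customer-root": ["docs", "a.txt"],
    "docs": ["b.txt", "customer-root"],  # a cycle back to the root
    "private-dir": ["secret.txt"],       # created separately, never linked in
}
all_objects = {"customer-root", "docs", "a.txt", "b.txt", "private-dir", "secret.txt"}

live = reachable(graph, ["customer-root"])
garbage = all_objects - live
```

In this toy grid, `garbage` ends up holding "private-dir" and "secret.txt": nothing reachable from the customer rootcap points at them, so a GC pass that starts from the rootcap table cannot tell them apart from deleted files.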
> I assume that "repair" of files assume that original file still exists
Nope. The tahoe repair process (which you can trigger yourself by running
"tahoe check --repair FILE_OR_DIR") uses the remaining shares to generate new
ones. It basically downloads the ciphertext of the file, one segment at a
time, creates whatever new shares are required, and uploads them to new
servers. As long as the file or directory is still retrievable (i.e. as long
as there are still at least "k" shares remaining), it can be repaired back up
to the full "N" shares. Since the tahoe defaults are k=3 and N=10, any file
can be repaired as long as there are at least 3 shares left.
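
The share arithmetic can be written down directly. A small sketch using the k=3, N=10 defaults mentioned above (`repair_plan` is a hypothetical helper, not a Tahoe function):

```python
def repair_plan(shares_remaining: int, k: int = 3, n: int = 10) -> int:
    """Return how many new shares a repair must create.

    Raises ValueError when fewer than k shares survive, because the
    file can then no longer be reconstructed at all.
    """
    if shares_remaining < k:
        raise ValueError(
            f"only {shares_remaining} shares left, need at least {k}"
        )
    return max(0, n - shares_remaining)
```

With the defaults, a file that is down to 3 shares needs 7 new ones to get back to 10, while a file with only 2 shares left is unrecoverable.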
> I would say that this thinking does not apply to most of your clients that
> make a living with or from their data.
As a Tahoe hacker, I'd agree with you. As an allmydata.com employee, I have
to say that we've received very few such requests, and it really does seem
that most consumers expect that an online data-storage company will have the
ability to see their files. Sad but true. It's my hope that providing better
tools will eventually lead consumers to expect more.
> There are few more issues here, but you get the picture.
I absolutely agree with that. The reason we're such fans of capability-based
access control systems is because they make it easier to achieve POLA: the
Principle Of Least Authority. As a storage provider, we should not have the
ability to read your files, and if we don't have that ability, then an
attacker who manages to break into our database will not have it either. We
need the ability to repair your files, and our business model requires that
we have the ability to measure how much space they're taking up. But unless
the customer requires us to be able to recover their files after they
lose/forget their rootcap (and there are other mechanisms to provide that
feature), we don't need to see the files.
So, once we finish fixing Tahoe to remove the limitations described earlier,
allmydata.com can be improved to offer truly private storage to customers who
want it, and who are willing to put in the extra effort that is required.
(note that to achieve this, you must use your own computer to access your
files, so you can't use the www.allmydata.com webdrive, since that would
reveal your caps to our machines.. for some users, this convenience is too
much to give up).
> > Once Accounting and traversal caps are done, we can give our customers a
> > choice between password-recovery and privacy. The installer will have a
> > checkbox that explains the options and lets them choose whether to give
> > us a rootcap or not.
> That would be great. Don't forget to mention the need to have additional
> copies of whatever-caps stored on an additional machine, in order to be able
> to restore data if the installation disk is lost?
The capability scheme we're using focusses everything into that one rootcap.
Assuming that you've put all your files and directories into a big tree, the
only thing you need to remember (and keep secret) is that root directory
writecap. This is a single string, about 90ish characters long, which is
short enough to print out and keep in a safety deposit box. (when we get to
DSA-based directories, we think we can shorten this to about 20 base32
characters, and might be able to express it as an algorithmically-generated
passphrase, so it might even be possible to just memorize it). As the saying
goes, cryptography is a way to turn a large security problem into a small
one.
Data security has a fundamental tension between confidentiality and
availability/reliability. If you publish your data in a major newspaper, you
get no confidentiality, but on the other hand the chances of actually losing
the data forever are pretty much zero. If you keep it in a safe that's wired
with a self-destruct charge that explodes if someone breaks in, then the
chances of anyone else seeing your private data are pretty much zero, but
there are all sorts of accidents that could make your data unavailable
permanently. The way you manage your tahoe rootcap puts you somewhere on this
same spectrum. But because the rootcap is pretty small, it's easier to work
with, so you can pick a spot that you'll be comfortable with.
When we get to the point where we can have that checkbox, the "don't share it
with allmydata" side will include a list of suggestions for keeping that
rootcap safe. I also expect we'll build some secret-splitting tools, like
something that prints out letters to send to 5 friends, such that if you
reassemble the data in 3 of those letters, you can reconstruct the rootcap..
things like that.
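
One well-known way to get the "any 3 of 5 letters" property is Shamir's secret sharing. The sketch below is an illustration of that technique only; no such tool exists in Tahoe as described here, the function names are hypothetical, and the example rootcap is a placeholder string.

```python
import secrets
from typing import List, Tuple

# 2**1279 - 1 is a Mersenne prime, large enough for secrets up to 159 bytes.
PRIME = 2**1279 - 1

def split_secret(secret: bytes, k: int = 3, n: int = 5) -> List[Tuple[int, int]]:
    """Split secret into n shares such that any k of them reconstruct it."""
    value = int.from_bytes(secret, "big")
    if value >= PRIME:
        raise ValueError("secret too long for this field")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [value] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def recover_secret(shares: List[Tuple[int, int]], length: int) -> bytes:
    """Lagrange-interpolate the polynomial at x=0 from k distinct shares."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total.to_bytes(length, "big")

rootcap = b"URI:DIR2:placeholder-not-a-real-cap"
letters = split_secret(rootcap, k=3, n=5)
recovered = recover_secret([letters[0], letters[2], letters[4]], len(rootcap))
```

Each element of `letters` would go to one friend; any three of them are enough for `recover_secret`, while two reveal nothing useful about the rootcap.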
> > Some backup schemes will aggregate files together in a way that makes
> > differential backups more difficult or less efficient..
> As mentioned in my response to François, this is not a solution in my case,
> as a) it requires at least a third of the disk space to be free to create
> the encrypted tar file, and b) it will require a download of the complete
> initial full backup, plus potentially several differential backups, just to
> restore a single file.
Yeah. I'm not personally fond of aggregated backup schemes for those same
reasons. They do tend to be faster and more efficient, though.
> I will follow closely the discussion on upcoming solutions proposed, but
> this seems a bit in the future. I am wondering when can we hope for
> Accounting and traversal caps to be implemented
I'm not sure. Accounting is the next big project that we have scheduled..
maybe a couple of months? DSA directories and traversal caps are probably the
next big one after that.
> and if plugging in some sort of on the fly encryption to FUSE mounter such
> as blackmatch would be a quick fix?
Maybe. You could use one of the existing FUSE-based encryption tools (like
encfs) to get a filesystem full of separate encrypted files, then copy *that*
into allmydata. The encrypted files will change just as quickly or slowly as
the plaintext, so your backups should be reasonably efficient. This scheme
would require 2x disk space, though, one for your original plaintext, and a
second for the ciphertext.
I've not used the FUSE-to-Tahoe layers at all, so I don't know how well
they'd work. I also don't know how well FUSE handles multiple layers at once
(you'd want to experiment with having encfs on top of blackmatch, I guess).
> It certainly does, and thank you for that. It would make a great content for
> a Wiki page :-)
Feel free to add content to the tahoe Trac (at allmydata.org) if it seems
like it would be useful. I don't know if we have an allmydata.com wiki
anywhere, but maybe we could enhance the FAQ section of the website.