[tahoe-dev] tahoe overview draft 2 (was Re: question about sharing...)

Sat Jul 2 07:49:59 PDT 2011

Hi Zooko,

Thanks for the thoughtful, detailed feedback.  I've tried to address
your points in this draft and welcome comments.

[[http://tahoe-lafs.org/trac/tahoe-lafs][Tahoe-LAFS]] is described as
"the first decentralized storage system with provider-independent
security".  Its name indicates that it's a "file system", but it differs
from traditional file systems in ways that are important to understand
before you start using it.  This page explains at a high level, in
plain English, how Tahoe-LAFS works and provides links that will allow
you to learn about it in detail.

Before we go any further, please read the
[[http://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/about.rst][one-page
summary]], then come back here.  As you saw on that page, Tahoe-LAFS
reliably and safely stores your data on unreliable servers, and the
administrators of those servers won't be able to read your data.  It
does this by encrypting your data before storing it, so that all the
servers ever see is random-looking bits and they can't recover the
actual content of your files.

Tahoe-LAFS also guards against storage server failure by cleverly
encoding your data and spreading it across many servers so that your
data is still available even if a few of the servers fail.  Of course,
this will use more disk storage than simply storing the file once, but
the overhead is less than you might think, and you can manage the
trade-off between extra storage and fault-tolerance.  If you wanted to
be very safe, for example, you could configure Tahoe-LAFS to store
your data on 10 storage servers and be able to recover your data if
only 3 of them were available.  This would take about 3.3 times as much
storage space (10/3) as storing it on a single disk, but it is far more
robust.  If you don't require this level of availability then
Tahoe-LAFS can store your data with less redundancy.  It's up to you.
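
To make the storage trade-off concrete, here's a minimal sketch in
Python of the arithmetic for a "3-of-10" setup like the one above.  The
function names are just for illustration.  The 90% per-server uptime
figure is an assumption made up for the example, not a measurement, and
the calculation assumes that servers fail independently of one another.

```python
from math import comb


def expansion_factor(needed: int, total: int) -> float:
    """Storage used, relative to the original file size."""
    return total / needed


def tolerated_failures(needed: int, total: int) -> int:
    """How many servers can be lost while the data stays recoverable."""
    return total - needed


def recovery_probability(needed: int, total: int, p_up: float) -> float:
    """Chance that at least `needed` of `total` servers are reachable.

    Assumes each server is up independently with probability p_up.
    """
    return sum(
        comb(total, i) * p_up ** i * (1 - p_up) ** (total - i)
        for i in range(needed, total + 1)
    )


# The example from the text: any 3 of 10 servers are enough.
print("expansion factor:", round(expansion_factor(3, 10), 2))
print("servers that may fail:", tolerated_failures(3, 10))
# Hypothetical uptime of 90% per server (an assumption, see above).
print("chance of recovery:", recovery_probability(3, 10, 0.9))
```

Trying other pairs, such as 3-of-5, shows how the extra storage shrinks
as you give up some fault-tolerance.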

http://bigasterisk.com/tahoe-playground/ shows, in a fun visual way,
how this works.

Capabilities (vs. Traditional Filesystem Permissions)

Before we get into how Tahoe-LAFS stores files, it will be useful to
recap how a traditional file system works.  Traditional filesystems
start at a well-known "root" and allow users to explore the filesystem
from there.  Because the root is well-known, you can go to it and list
the files in it; you can also go "up" from any directory to its
parent.  Because users can explore file systems in this way, each user
would be able to do anything they wanted unless there were some sort
of permission check, so these filesystems implement "Access Control
List" (ACL) permission checks.  These checks specify which users are
allowed to access each file, and prevent users from doing things they
can figure out how to do but are not permitted to do.  I can, for
example, discover a directory's existence and learn its name, but I
might not be allowed to read from it.  In order to implement these
permission checks, though, the file system has to know who you are, so
you need to log in.  In order to prove that you are who you claim to
be, you have to provide a password and/or other credentials.  Someone
also has to specify who has what kind of access to each file and
directory.  This approach works well enough, but it is complex, and
that complexity makes it very difficult to ensure that it's secure.

Tahoe-LAFS does away with the complexity inherent in the ACL approach.
Each file and directory in Tahoe-LAFS is identified by a "capability",
which is a string of characters that looks something like this:

URI:CHK:riplmjitnwh25ur3jomzyxrww4:et4gkxykswl7lstw5q4g5suf6y2xyyphvid5nn2r3ktvhytbs5da:3:10:3472

This capability serves as both the identifier of the file or
directory and as the authorization code necessary to get access to it.

A file can have more than one capability.  For example, one capability
might allow you to read the file while a different capability
allows you to read and write it.  Each capability contains the
two things that you need to access the file: how to find the encrypted
bits (the "storage index"), and how to decrypt them (the "encryption
key").
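
Here's a small sketch that pulls the example capability string above
apart on its colons.  The field names are my reading of the string
rather than something this page defines.  I'm assuming that the two
long base32-looking fields are key material and an integrity hash, that
"3" and "10" are the number of shares needed and the total number of
shares (they match the 3-of-10 example earlier), and that the final
number is the file size in bytes.

```python
from typing import NamedTuple


class ChkCap(NamedTuple):
    key: str             # assumed: base32-encoded key material
    integrity_hash: str  # assumed: base32-encoded hash
    needed: int          # assumed: shares needed to recover the file
    total: int           # assumed: total shares produced
    size: int            # assumed: file size in bytes


def parse_chk_cap(cap: str) -> ChkCap:
    """Split a 'URI:CHK:...' string into its colon-separated fields."""
    parts = cap.split(":")
    if len(parts) != 7 or parts[:2] != ["URI", "CHK"]:
        raise ValueError("not a URI:CHK capability string")
    return ChkCap(parts[2], parts[3],
                  int(parts[4]), int(parts[5]), int(parts[6]))


EXAMPLE = ("URI:CHK:riplmjitnwh25ur3jomzyxrww4:"
           "et4gkxykswl7lstw5q4g5suf6y2xyyphvid5nn2r3ktvhytbs5da:3:10:3472")

cap = parse_chk_cap(EXAMPLE)
print(cap.needed, "of", cap.total, "shares,", cap.size, "bytes")
```

Nothing in the string says who you are; whoever holds it holds
everything needed to use it.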

Access to any given file is a simple yes/no proposition: if you know
that file's capability then you'll be able to read it; if you don't,
you won't.  It doesn't matter who you are, what group you're in, or
whether you're a "superuser".  In fact,
Tahoe-LAFS doesn't have any sense of "identity" at all: you don't have
to sign in or provide credentials to prove who you are, because
Tahoe-LAFS doesn't know or care.

It's important to understand that a capability specifies the location
of a file, but it's different from a traditional file system "path".
Tahoe-LAFS has no well-known "root", so there's no way to poke around
and try to discover things inside it.  Each directory and file can be
found only by its capability and can't be discovered in any other way.
Each capability is 256 bits long, so there are 2 to the 256th power
possible values (more than 10 to the 77th power), which means that
they're basically impossible to guess.
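
For a sense of scale, here's a quick Python check of how big a 256-bit
space is.  The guessing rate of a trillion guesses per second is an
arbitrary assumption, chosen to be generous to the attacker.

```python
# Number of distinct 256-bit values.  Python integers are arbitrary
# precision, so this is exact.
keyspace = 2 ** 256
digits = len(str(keyspace))

# Assumption for illustration only: an attacker who can test a
# trillion (10**12) candidate capabilities every second.
GUESSES_PER_SECOND = 10 ** 12
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Years needed to try every value at that rate (integer division).
years_to_exhaust = keyspace // (GUESSES_PER_SECOND * SECONDS_PER_YEAR)

print("decimal digits in 2**256:", digits)
print("digits in the number of years:", len(str(years_to_exhaust)))
```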

Each directory capability acts like a traditional file system "root"
directory: it's the top of a tree structure that you can explore.  A
user with a directory capability can list the files and directories in
it, and can browse down into the subdirectories below it just like a
traditional file system, but they can't browse "up" beyond the
capability to see other directories within the same Tahoe-LAFS file
system.  Because there's no single "root" directory, users cannot
discover things that they're not supposed to know about, so the in-line
ACL checks implemented by traditional file systems are unnecessary.
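
The "down but never up" property can be sketched with a toy in-memory
model.  This is not how Tahoe-LAFS stores directories; ToyGrid and its
methods are hypothetical names invented for the illustration.  The point
is only that a directory records its children's capabilities and holds
no link back to a parent.

```python
import secrets


class Directory:
    """A toy directory: it knows its children, but not its parent."""

    def __init__(self) -> None:
        self.children = {}  # name -> capability string


class ToyGrid:
    def __init__(self) -> None:
        self._nodes = {}  # capability string -> Directory or bytes

    def _store(self, node) -> str:
        cap = secrets.token_hex(32)  # 256 random bits, as hex
        self._nodes[cap] = node
        return cap

    def mkdir(self) -> str:
        return self._store(Directory())

    def put_file(self, data: bytes) -> str:
        return self._store(data)

    def link(self, dir_cap: str, name: str, child_cap: str) -> None:
        self._nodes[dir_cap].children[name] = child_cap

    def walk(self, cap: str, prefix: str = "") -> list:
        """List every path reachable from `cap`, going down only."""
        node = self._nodes[cap]  # KeyError for an unknown capability
        paths = []
        if isinstance(node, Directory):
            for name in sorted(node.children):
                paths.append(prefix + name)
                paths.extend(
                    self.walk(node.children[name], prefix + name + "/"))
        return paths


grid = ToyGrid()
top, photos, taxes = grid.mkdir(), grid.mkdir(), grid.mkdir()
grid.link(top, "photos", photos)
grid.link(top, "taxes", taxes)
grid.link(photos, "beach.jpg", grid.put_file(b"jpeg bytes"))
grid.link(taxes, "2010.pdf", grid.put_file(b"pdf bytes"))

print(grid.walk(top))     # everything is reachable from the top
print(grid.walk(photos))  # only the photos subtree is reachable
```

Someone holding only the "photos" capability has no operation available
that would lead them to "taxes" or to the directory above.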

If you're curious about the capability model, it's worth taking some
time to learn more about it:
http://www.eros-os.org/essays/capintro.html

Sharing

As you can imagine, the Tahoe-LAFS capability model makes sharing
easy: just give the other person the capability string of the file or
directory that you'd like to share with them.  Once you've done that,
the other person will be able to do everything that the capability
enables, and nothing more.  If you share a read-only capability then
the person you shared it with will be able to read the file but not
write it.  If you share a read-write capability then they will be able
to do both.
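
Here's a toy sketch of the read-only versus read-write distinction.  It
is an illustration of the idea, not the actual Tahoe-LAFS cap format or
derivation: the "rw-" and "ro-" prefixes, the ToyStore class, and the
use of SHA-256 are all assumptions made for the example.  In the real
system the enforcement comes from cryptography rather than from a
lookup table; the table here only stands in for that.

```python
import hashlib
import secrets


def new_write_cap() -> str:
    """Make an unguessable read-write capability (toy format)."""
    return "rw-" + secrets.token_hex(32)


def read_cap_from(write_cap: str) -> str:
    """Derive the matching read-only capability with a one-way hash."""
    return "ro-" + hashlib.sha256(write_cap.encode("ascii")).hexdigest()


class ToyStore:
    def __init__(self) -> None:
        self._caps = {}  # capability string -> (file id, writable?)
        self._data = {}  # file id -> contents

    def create(self) -> str:
        write_cap = new_write_cap()
        file_id = len(self._data)
        self._data[file_id] = b""
        self._caps[write_cap] = (file_id, True)
        self._caps[read_cap_from(write_cap)] = (file_id, False)
        return write_cap

    def read(self, cap: str) -> bytes:
        file_id, _writable = self._caps[cap]  # KeyError if unknown
        return self._data[file_id]

    def write(self, cap: str, data: bytes) -> None:
        file_id, writable = self._caps[cap]
        if not writable:
            raise PermissionError("this capability is read-only")
        self._data[file_id] = data


store = ToyStore()
my_cap = store.create()              # I keep the read-write cap
friend_cap = read_cap_from(my_cap)   # and hand out the read-only one
store.write(my_cap, b"draft 2")
print(store.read(friend_cap))
try:
    store.write(friend_cap, b"vandalism")
except PermissionError as err:
    print("refused:", err)
```

Because the hash only goes one way, the holder of the read-write
capability can always produce the read-only one, but not the reverse.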

The simplicity of the Tahoe-LAFS capability model makes very
fine-grained sharing control easy.  In a traditional Unix filesystem,
for example, you can control access to a file based on the person that
owns the file, the group that owns the file, and "other" people.
Since creating a group is something that only superusers can do, it's
easy to imagine situations where a regular user would like to share a
file with a small community of people but can't do it because she
isn't allowed to create groups.  A typical solution to this problem is
to add another layer of complexity on top of the file system layer to
provide this level of control.  The Tahoe-LAFS capability model provides
this functionality directly.  If I want to give one person a read-write
capability and give another four people a read-only capability, that's
easy to do: just send the appropriate capability string to the
appropriate people.

We've talked about how each Tahoe-LAFS directory capability acts like
a filesystem "root" directory.  This makes sharing parts of a
directory tree easy and safe.  Let's say I have a directory with two
subdirectories in it, one for work stuff, and one for the startup idea
that I'm kicking around with a few friends.  I'd like to share each of
these directories with a set of people but I don't want those sets to
intersect (or even know about one another).  This is difficult to do
with traditional filesystems since you can typically list the contents
of the directory above the one you're in even if you can't go into any
of the other directories in it.  So my coworkers could see that there's
a "startup idea" directory even if they couldn't list the files in it.

With Tahoe-LAFS, if I share the read-write capability of the "work"
directory with my coworkers, they will see a filesystem with files and
directories and they'll be able to add their own files and directories
to it.  They'll have no way to find out that there's a "startup idea"
folder, though, since the capability of the "work" directory won't
allow them to explore other areas of the filesystem.

Revoking

Sharing is easy but "revoking" is much harder to do, for both
traditional file systems and Tahoe-LAFS.  The fundamental problem,
which is the same in both cases, is that once you give someone else
the ability to read some data, you can't prevent that person from
re-distributing that data in ways that you might not intend.  At first
glance, it would appear that a traditional filesystem offers stronger
protection for this case, but in fact the ACL approach and the
capability approach are similar in the ways in which data can "leak".
Both techniques prevent "outsiders" from seeing data that they're not
supposed to see, but both can be subverted by people who are given the
ability to read some data and then choose to export it and distribute
it outside the scope of the system.

In both cases, then, it's important to be very sure that you trust the
people that you're sharing data with, because once you share the data
there's no going back.

Links

Tahoe-LAFS home page:
http://tahoe-lafs.org/
More info on Access Control Lists:
http://en.wikipedia.org/wiki/Access_control_list
A relevant mailing list discussion:
http://tahoe-lafs.org/pipermail/tahoe-dev/2011-June/006388.html