[tahoe-dev] AllMyData Architecture

Thu Nov 1 23:53:20 PDT 2007

I have picked up a great deal on Friday mornings about your  
architecture but there are still holes in my understanding.
I don't see a top level architecture at your web site.
I don't know at which audience the page at
(http://tahoebs1.allmydata.com:8011/) is aimed.
Perhaps it is meant to be a part of a presentation.
For instance, there is no file named (/data1/tahoe/client1/start.html)
on my computer. Perhaps I came in late, but I cannot find the context
that would explain that file name.

Starting from the page at (http://allmydata.org/trac/tahoe) I am  
invited to install the software.
I have from this page no inkling what the software will do for me, or  
to me; not even a bald claim.
I trust you guys but I don't even understand what I am trusting you  
to do for me.
I understand some of the top level goals of the software (only  
because of Friday mornings) but not how I can ask the software to do  
those things for me.

I have been confused for some time about what the parts of the
software are and where they run.

Comments on the installation instructions.

I am not a python wizard and the term "Python Package Index" leaves  
me cold.
Perhaps a link would help.
Perhaps this mode is only for Python practitioners.

In the "Running-In-Place" scheme I presume that the code that I avoid  
loading runs on some other machine instead thereby providing a service.
There are trust issues here. If I build and run code on my machine
then I know I can, in principle, read and understand it.
Otherwise I trust a service.
This may be OK, but I have as yet no way of knowing what I am
trusting it to do.

I know enough about your architecture to know that it provides  
unprecedented security properties.
If you are trying to gain adherents as a result of these properties  
then there must be ways to understand how these properties arise.
Reading the code is necessary for those who don't trust you.
But even in that extreme case it is necessary to understand the  
claims and invariants upon which these unique security properties rest.
I see not even bald claims for these properties.

I gather that it works something along the following lines.

There is a body of code that runs on the data owner's machine. (I  
don't know what you call it; I will call it the 'adapter' here.)
The adapter presents some sort of file system interface to whoever  
addresses it as a file system.
The adapter requires access to the Internet to reach a set of other
machines whose nature is mostly abstracted in this note.
Files written to this file system occupy space on the other machines.
The written information is encrypted so that only the size and time
of the written data are visible outside your computer; indeed the
content is known only to the adapter and the writer, and, of course,
their respective TCBs.
As a file system the adapter also supports reading files therein.
This activity is also visible to the other machines.
The data is represented on the other machines redundantly so that the  
data remains available even when some of the other machines are not  
available.
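
The adapter behavior described above can be sketched in a few lines of
Python. Everything here is an assumption made for illustration: the
names `Adapter` and `StorageServer`, the toy SHA-256 keystream cipher
(not vetted cryptography), and plain replication standing in for
whatever redundancy scheme the real software uses.

```python
import hashlib
import os


def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


class StorageServer:
    """Hypothetical 'other machine': stores opaque blobs, may be offline."""

    def __init__(self):
        self.blobs = {}
        self.online = True

    def put(self, blob_id: str, blob: bytes) -> None:
        if not self.online:
            raise ConnectionError("server offline")
        self.blobs[blob_id] = blob

    def get(self, blob_id: str) -> bytes:
        if not self.online:
            raise ConnectionError("server offline")
        return self.blobs[blob_id]


class Adapter:
    """Hypothetical adapter: encrypts locally, stores copies remotely."""

    def __init__(self, servers):
        self.servers = servers
        self.index = {}  # file name -> (blob_id, key); never leaves this machine

    def write(self, name: str, plaintext: bytes) -> None:
        key = os.urandom(32)  # fresh key per file
        ciphertext = xor_cipher(key, plaintext)
        blob_id = hashlib.sha256(ciphertext).hexdigest()
        for server in self.servers:  # redundancy by simple replication
            try:
                server.put(blob_id, ciphertext)
            except ConnectionError:
                pass
        self.index[name] = (blob_id, key)

    def read(self, name: str) -> bytes:
        blob_id, key = self.index[name]
        for server in self.servers:  # any one reachable copy suffices
            try:
                ciphertext = server.get(blob_id)
            except (ConnectionError, KeyError):
                continue
            return xor_cipher(key, ciphertext)
        raise IOError("no server holding the data is available")
```

In this sketch the servers see only the blob's size and the time of
each request; the key stays in the adapter's index, and a read succeeds
as long as one server holding a copy is reachable.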

With these bald claims I can already begin to reason about security  
matters even without understanding or buying into capability discipline.
This could perhaps motivate a class of hackers to look closer. Here
are some further high-level yet precise claims.

There is a software interface to the adapter whereby a program that  
can read a file from the file system can instead acquire a token from  
the adapter for that file.
That token (What I suppose you call the URI) is about 100 characters  
long and is pure data.
If another computer somewhere, attached to the Internet, running an  
adapter with access to the same(?) set of 'other machines' acquires  
this token, it can be delivered to that adapter so as to create a  
virtual hard-link (Unix speak) to the original file.
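
To illustrate what "pure data" means here, the following sketch packs
a decryption key and a ciphertext hash into a printable string and
parses it back on another machine. The `FileToken` layout, the
`TOKEN:` prefix, and the base32 encoding are assumptions for
illustration, not the real URI format.

```python
import base64
from typing import Dict, NamedTuple


class FileToken(NamedTuple):
    """Hypothetical token contents: all that is needed to fetch and decrypt."""
    key: bytes        # 32-byte decryption key
    blob_hash: bytes  # 32-byte SHA-256 of the stored ciphertext


def _b32(data: bytes) -> str:
    """Base32 without padding, lowercase, so the token is easy to paste."""
    return base64.b32encode(data).decode("ascii").rstrip("=").lower()


def _unb32(text: str) -> bytes:
    """Restore the padding that _b32 stripped, then decode."""
    padded = text.upper() + "=" * (-len(text) % 8)
    return base64.b32decode(padded)


def token_to_string(token: FileToken) -> str:
    return "TOKEN:" + _b32(token.key) + ":" + _b32(token.blob_hash)


def token_from_string(text: str) -> FileToken:
    prefix, key_text, hash_text = text.split(":")
    if prefix != "TOKEN":
        raise ValueError("not a token string")
    return FileToken(_unb32(key_text), _unb32(hash_text))


class Namespace:
    """Hypothetical per-adapter name table: linking is just storing a token."""

    def __init__(self):
        self.links: Dict[str, FileToken] = {}

    def link(self, name: str, token_string: str) -> None:
        self.links[name] = token_from_string(token_string)
```

With two 32-byte fields the string comes out on the order of 100
characters, and it can travel over any channel (email, chat, paper)
before being handed to the second adapter's `link`.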

Alternatively there is available a token from the first adapter for  
the file that affords only read access.
This token is cryptographically secure, as is the read-only restriction.
Tokens for directories come in three flavors: (1) read-write, which
allows the token holder to add and delete entries from the directory
(add-only is another obvious candidate); (2) read-only, which does not
allow modification to the directory; and (3) transitive read, which
allows only read-only access to files therein, and transitive read
access to directories therein. (This text is wobbly.)
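
The three flavors can be modeled as follows. This is a sketch of the
access rules only: the names `Mode`, `Token`, `add_entry`, `lookup`,
and `write_file` are hypothetical, and in-process object references
stand in for the cryptographic enforcement, which is not modeled.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, Union


class Mode(IntEnum):
    """Ordered by strength; a token can only be weakened, never amplified."""
    TRANSITIVE_READ = 0
    READ_ONLY = 1
    READ_WRITE = 2


@dataclass
class File:
    data: bytes = b""


@dataclass
class Directory:
    entries: Dict[str, "Token"] = field(default_factory=dict)


@dataclass(frozen=True)
class Token:
    target: Union[File, Directory]
    mode: Mode

    def attenuate(self, mode: Mode) -> "Token":
        """Derive a token of equal or lesser strength for the same target."""
        if mode > self.mode:
            raise PermissionError("cannot amplify a token")
        return Token(self.target, mode)


def add_entry(directory: Token, name: str, child: Token) -> None:
    """Only a read-write directory token may modify the directory."""
    if directory.mode is not Mode.READ_WRITE:
        raise PermissionError("directory token is not read-write")
    directory.target.entries[name] = child


def lookup(directory: Token, name: str) -> Token:
    """Return the child token, weakened if the parent is transitive read."""
    child = directory.target.entries[name]
    if directory.mode is Mode.TRANSITIVE_READ:
        if isinstance(child.target, Directory):
            return child.attenuate(Mode.TRANSITIVE_READ)
        return child.attenuate(min(child.mode, Mode.READ_ONLY))
    return child  # read-only and read-write parents return the entry as stored


def write_file(token: Token, data: bytes) -> None:
    if token.mode is not Mode.READ_WRITE:
        raise PermissionError("file token is not read-write")
    token.target.data = data
```

Under these assumptions the difference between flavors (2) and (3)
shows up in `lookup`: a read-only directory token cannot change the
directory but hands back whatever tokens are stored in it, while a
transitive read token weakens everything reached through it.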

Now some of the capability discipline leaks into the description;  
some will recognize it and others will not.
Many more and perhaps most of the current security properties are now  
made clear, at least as claims.
All without reading any code, if they trust you guys.

Another claim: secrecy and integrity of your data depend only on the
logic of your adapter; they do not depend on the logic of a possibly
modified adapter in the machines to which you send tokens.
Availability depends on the logic of code running in the other  
computers and the availability of those other computers.
I have ignored here those important properties relating to data healing.
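
One way such an integrity claim could hold is for the reading adapter
to check everything it fetches against a hash it already holds. The
sketch below assumes the token carries a SHA-256 of the ciphertext;
`fetch_verified` and `IntegrityError` are hypothetical names, and this
is not a description of the actual mechanism.

```python
import hashlib
import hmac
from typing import Callable


class IntegrityError(Exception):
    """Raised when fetched data does not match the expected hash."""


def fetch_verified(fetch: Callable[[], bytes], expected_hash: bytes) -> bytes:
    """Fetch a blob from an untrusted source and accept it only if it matches.

    `fetch` is any callable returning bytes, e.g. a request to a storage
    server; `expected_hash` is the SHA-256 digest held by the reader.
    """
    blob = fetch()
    actual = hashlib.sha256(blob).digest()
    if not hmac.compare_digest(actual, expected_hash):
        raise IntegrityError("stored data was altered or substituted")
    return blob
```

The check runs entirely in the reader's own adapter, so a misbehaving
storage machine can withhold data (an availability failure) but cannot
pass off altered data as genuine.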

There is now a logical framework to begin understanding the more
detailed mechanisms inside that make the software feasible and the
claims thereby plausible.

I have been sloppy but it is possible to make precise claims with  
little or no more bulk.
I think there should be such prominent claims near the front door for  
the hackers and security professionals.
If your lawyers are queasy, then state these as aspirations, not
claims, and invite the customer to audit your architecture for
shortcomings.

I am excited about your technology. Good luck.