[tahoe-dev] streaming upload and BackupDB proposal

zooko zooko at zooko.com
Fri Sep 26 07:05:55 PDT 2008


Hi Brian:

There is no user-visible notion equivalent to "inode" on NTFS.  Nobody
can relink or delete the file while you are holding it open (as far as
I know).  Maybe Mike or Greg knows more details.


 > Note that we can use this hash-to-decide-about-uploading pass (let's
 > call it the pre-hash pass) to pre-compute the AES key for the file,
 > saving one of the two passes that the upload process would normally
 > need to do.

I am dissatisfied with the fact that we tend to make multiple passes
over a file in order to upload it.  Note that we don't have to do so
-- Tahoe supports streaming upload in which you upload the file in a
single pass.  (I am also very dissatisfied with the fact that people
using Tahoe through the wapi don't have the option of streaming at all
-- ticket #320).

The original reason that we make multiple passes was convergent
encryption (which, if I recall correctly, was explained to me by Jim
McCoy in the summer of 1998, ten years ago), in order to save storage
space.
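
To make the trade-off concrete, here is a rough Python sketch of the
difference (this is not Tahoe's actual key-derivation code, which
differs in detail; the point is only that a content-derived key
forces a full read of the file before encryption can start, while a
random key does not):

    import hashlib
    import os

    def convergent_key(path, chunk_size=2**16):
        # Derive the encryption key from the file's own contents.
        # This requires reading the whole file once *before* any
        # encryption or upload can start -- the extra pre-hash pass.
        # (Illustrative only, not Tahoe's real derivation.)
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.digest()  # same plaintext -> same key -> same ciphertext

    def random_key():
        # A random key needs no knowledge of the plaintext, so the
        # file can be encrypted and uploaded in a single streaming pass.
        return os.urandom(32)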

In the year 2000, Douceur reported that he could save 50% of storage
space by using convergent encryption on a bunch of Windows
workstations on the Microsoft campus.  More recently, measurements on
the allmydata.com user files, as well as measurements in the wild
using dupfilefind [1] and similar tools, have shown that for most
people and for most uses the gains from convergence are nowhere near
that -- more like 0.1% or less.

(Also of course, along the way Drew Perttula observed that convergence
has a small chance of causing a large privacy failure, depending on
the shape of your files and on what the attacker knows [2].)

So people who don't want to take that risk of exposing their files,
or who know that the gains they can get from convergence are likely
to be less than 0.1% of storage, really do not want to spend the
time, disk grinding, and CPU usage to pre-hash all of their files
before uploading them.

So in my mind there is one reason left why the current version of
Tahoe -- v1.3.0 -- defaults to doing a hashing pass over the file
before uploading it.  That reason is that we don't have a backupdb,
so we have no way of knowing whether we've already uploaded a file.

Therefore, to my way of thinking, one of the major improvements that
the backupdb can give us is that we can start using streaming upload
on backed-up files by default, and thus get backup performance similar
to Apple Time Machine and our numerous other competitors.
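
Just to sketch what I mean (the sqlite schema and helper names below
are my own illustration, not a concrete backupdb design; the
stream_upload callable is a stand-in for whatever does the actual
single-pass upload and returns a filecap):

    import os
    import sqlite3

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS uploaded_files (
        path    TEXT PRIMARY KEY,
        size    INTEGER,
        mtime   REAL,
        filecap TEXT
    )
    """

    def open_backupdb(dbfile):
        db = sqlite3.connect(dbfile)
        db.execute(SCHEMA)
        return db

    def backup_file(db, path, stream_upload):
        # Skip the upload (and any hashing) if (path, size, mtime) is
        # unchanged since last time; otherwise do a single-pass
        # streaming upload and record the resulting filecap.
        st = os.lstat(path)
        row = db.execute("SELECT size, mtime, filecap FROM uploaded_files"
                         " WHERE path=?", (path,)).fetchone()
        if row and row[0] == st.st_size and row[1] == st.st_mtime:
            return row[2]  # already uploaded, nothing to do
        filecap = stream_upload(path)  # one pass over the file
        db.execute("INSERT OR REPLACE INTO uploaded_files VALUES (?,?,?,?)",
                   (path, st.st_size, st.st_mtime, filecap))
        db.commit()
        return filecap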

Now it might be a win for the backupdb to do a hashing pre-pass in
certain cases.  You gave the motivating example in your message that
if you "mv /home/warner /home/brianwarner", you don't want it to spend
6.3 days backing everything up, but instead spend 15 hours hashing it
all.  I agree that it might make sense for backupdb to hash files
sometimes in order to check if they are really the same, when it has
good reason to suspect that they are the same.  However, in deciding
when to do that, let us treat the cost of hashing as potentially
significant (which it is, if the user is using streaming upload),
rather than as free (which it is, as you say, if the user is using
convergent upload anyway).

Hm...  You know, we might be able to do a lot better than "15 hours of
rehashing every file" if we track dev,inode pairs along with
size,mtime...  :-)  We might be able to get it down to "2.6 minutes of
traversing the filesystem checking that the shape of the tree and the
dev,inode,size,mtime metadata still fits".
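
Something like the following, purely as an illustration (the "known"
mapping stands in for whatever the backupdb would actually store, and
st_dev/st_ino are only meaningful on POSIX filesystems, which is
related to the NTFS question at the top of this message):

    import os

    def check_tree(root, known):
        # 'known' is a hypothetical mapping built from the backupdb:
        #   {(st_dev, st_ino): (size, mtime, filecap)}
        # For each file, a cheap lstat tells us whether it is one we
        # have already uploaded -- even after a wholesale rename like
        # "mv /home/warner /home/brianwarner" -- without rehashing.
        reusable, needs_upload = [], []
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                entry = known.get((st.st_dev, st.st_ino))
                if entry and entry[0] == st.st_size and entry[1] == st.st_mtime:
                    reusable.append((path, entry[2]))
                else:
                    needs_upload.append(path)
        return reusable, needs_upload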

Okay, time for me to walk the children to school.  More later!  :-)

Regards,

Zooko

[1] http://pypi.python.org/pypi/dupfilefind
[2] http://hacktahoe.org/drew_perttula.html

tickets mentioned in this mail:

http://allmydata.org/trac/tahoe/ticket/320 # add streaming upload to HTTP interface


