[tahoe-dev] Tahoe performance

Wed Feb 18 17:34:16 PST 2009

On Fri, 13 Feb 2009 15:57:17 -0600
Luke Scharf <luke.scharf at clusterbee.net> wrote:

> He's running both sides of rsync locally. You can use rsync to sync local
> filesystems, which is what he's suggesting. All you need in order for that
> to work is for the filesystem to be mountable and behave in a minimally
> POSIX-ish way.

As Shawn pointed out, using FUSE on host A to mount a Tahoe virtual
directory, then running rsync locally on A to copy files from your ext3
partition to your FUSE-mounted Tahoe partition, would let rsync skip any
file whose metadata (size and timestamp) already matches. This is the same
technique that 'tahoe backup' uses at the application level. For rsync to win
here, the FUSE binding must preserve the metadata properly. I'm not sure
which FUSE bindings, if any, implement the utime() system call which
rsync/cp/touch use to make the target file look like the original one.
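
To make the metadata check concrete, here's a rough sketch of the
size-plus-mtime comparison, along with a copy that stamps the source's
timestamps onto the target. The function names are made up for illustration,
and this is not rsync's actual code. It also assumes the target filesystem
honors utime(); on a FUSE binding that doesn't, the os.utime() call would
fail or be ignored and every later run would see a mismatch.

```python
import os
import shutil


def needs_copy(src, dst):
    """Return True unless dst exists with the same size and mtime as src."""
    try:
        d = os.stat(dst)
    except FileNotFoundError:
        return True
    s = os.stat(src)
    # Compare whole seconds, since some filesystems drop sub-second precision.
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def copy_preserving_mtime(src, dst):
    """One-read/one-write copy, then make dst's timestamps match src."""
    shutil.copyfile(src, dst)
    s = os.stat(src)
    os.utime(dst, (s.st_atime, s.st_mtime))  # depends on utime() support
```

If utime() works, a second pass over an unchanged tree touches no file
contents at all, only the metadata.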

When rsync decides that the file might have changed, over a regular network
(e.g. ssh), it uses a clever differencing algorithm. It does one read at the
source, one read at the destination, some amount of writing over the network,
and some amount of writing at the destination, where the amount of
network+writing depends upon how much has actually changed (there's some
reading over the network too, but it's a small fraction of the filesize). The
algorithm is so clever that it can efficiently handle insertions and
deletions.
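
For anyone who hasn't looked at how the differencing works, here's a toy
version of the idea: a weak checksum that can be "rolled" one byte at a time,
a per-block signature of the old file, and a delta that falls back to literal
bytes where nothing matches. It's a sketch under assumptions, not rsync's
real implementation: the names are mine, the block size is tiny, and SHA-256
stands in for rsync's strong hash.

```python
import hashlib

M = 1 << 16  # modulus for both halves of the weak checksum


def weak(block):
    """Weak checksum of a block: (plain sum, position-weighted sum)."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window forward by one byte without re-summing it."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def signature(old, n):
    """Map weak checksum -> [(strong hash, block number)] for the old data."""
    sig = {}
    for off in range(0, len(old) - n + 1, n):
        block = old[off:off + n]
        entry = (hashlib.sha256(block).digest(), off // n)
        sig.setdefault(weak(block), []).append(entry)
    return sig


def delta(new, sig, n):
    """Describe new as ('copy', block_no) and ('literal', bytes) operations."""
    ops, literal, pos, ab = [], bytearray(), 0, None
    while pos + n <= len(new):
        if ab is None:
            ab = weak(new[pos:pos + n])
        match = None
        for strong, block_no in sig.get(ab, ()):
            if hashlib.sha256(new[pos:pos + n]).digest() == strong:
                match = block_no
                break
        if match is not None:
            if literal:
                ops.append(("literal", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            pos += n
            ab = None  # recompute at the next window start
        else:
            literal.append(new[pos])
            if pos + n < len(new):
                ab = roll(ab[0], ab[1], new[pos], new[pos + n], n)
            pos += 1
    literal.extend(new[pos:])
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops


def patch(old, ops, n):
    """Rebuild the new data from the old data plus the delta operations."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg * n:(arg + 1) * n] if kind == "copy" else arg
    return bytes(out)
```

Because the window slides a byte at a time, an insertion in the middle of the
file only costs the inserted bytes plus the one block it lands in; the blocks
after it are found again at their shifted offsets.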

When rsync is used locally (i.e. no ssh), I don't know whether it uses this
algorithm, or if it just compares timestamps and does a one-read/one-write
copy when the timestamps don't match. For files that haven't changed much,
the IO load is about the same: one read plus one write for the plain copy,
versus two reads plus writes of just the deltas for the clever algorithm. For
some media (think USB flash drives), reads are much faster than writes. So I
can imagine rsync being either clever or non-clever in the local case.

The problem arises if/when rsync's clever algorithm gets pointed at a "local"
target that's really a FUSE-mounted Tahoe filesystem. In that case, both
reads *and* writes involve a lot of network IO, and Tahoe is going to use
immutable files anyway, so "partial writes" are just as expensive as a full
write (worse, in fact, unless the FUSE layer is smart enough to cache the
file during the read pass). So if the file hasn't changed, you get (one
disk read + one tahoe read), and if the file *has* changed, you get (one disk
read + one tahoe read + one tahoe write), where each tahoe read/write is of
the whole file.
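
The arithmetic above can be written down as a tiny cost model. This is just
bookkeeping for the cases described in this message, with made-up names and
two assumptions: every Tahoe read or write moves the whole file, and in the
"metadata" case a changed file always shows a changed size or timestamp.

```python
def io_cost(size, changed, strategy):
    """Bytes moved for one file of the given size, per kind of IO.

    strategy "metadata": timestamps/sizes are compared first, and the file
        is only read and written when they differ.
    strategy "delta-over-fuse": the differencing pass runs against a
        FUSE-mounted Tahoe target backed by immutable files.
    """
    if strategy == "metadata":
        moved = size if changed else 0
        return {"disk_read": moved, "tahoe_read": 0, "tahoe_write": moved}
    if strategy == "delta-over-fuse":
        return {
            "disk_read": size,
            "tahoe_read": size,  # whole-file read just to compare
            "tahoe_write": size if changed else 0,  # no partial writes
        }
    raise ValueError("unknown strategy: %r" % (strategy,))
```

For a 1 GB file that hasn't changed, the first strategy moves nothing and the
second still pulls the full 1 GB back out of the grid.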

Basically, we can't take advantage of the full rsync cleverness until:

 * most files are stored in mutable files, instead of immutable ones
 * we implement efficient partial-write for mutable files (this is our "MDMF"
   goal: "Medium-size Distributed Mutable Files")
 * we have a FUSE binding that exposes this partial-write

and even then, we'd prefer an improvement:

 * store rsync signatures next to the Tahoe file, so that "one whole read"
   pass can be replaced with a smaller read of just the rsync checksums

I'm not sure how to take advantage of that last part without rewriting rsync,
because there's certainly no way at the open/seek/read POSIX level to tell
the filesystem that we only care about a CRC or MD4 of the chunk being read,
not the full bytes themselves. This is a reason to use a separate tool,
rather than rsync+FUSE: the POSIX file-io abstraction boundary hides some
intent that would be useful to know about.
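
To give a feel for how much smaller the checksum read would be, here's a
sketch of a separate tool writing per-block checksums into a sidecar file
next to the data. Everything here is hypothetical: the ".sig" naming, the
20-byte record layout, and the hash choices (Adler-32 and a truncated SHA-256
stand in for rsync's weak and strong checksums). It uses ordinary local files
and says nothing about how Tahoe itself would store such a thing.

```python
import hashlib
import struct
import zlib

# One record per block: 4-byte weak checksum + 16-byte strong checksum.
ENTRY = struct.Struct(">I16s")


def write_sidecar(path, block_size=4096):
    """Write path + '.sig' holding one checksum record per block of path."""
    sidecar = path + ".sig"
    with open(path, "rb") as src, open(sidecar, "wb") as out:
        while True:
            block = src.read(block_size)
            if not block:
                break
            weak = zlib.adler32(block) & 0xFFFFFFFF
            strong = hashlib.sha256(block).digest()[:16]
            out.write(ENTRY.pack(weak, strong))
    return sidecar


def read_sidecar(sidecar):
    """Return the list of (weak, strong) records without touching the data."""
    with open(sidecar, "rb") as f:
        data = f.read()
    return [ENTRY.unpack_from(data, off)
            for off in range(0, len(data), ENTRY.size)]
```

With 4096-byte blocks and 20-byte records, the sidecar is roughly 0.5% of the
file's size, which is the read a signature-aware tool would do in place of
the whole-file pass.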

The tool that Shawn is building sounds like it's designed to accomplish all
of these goals, and then some, at the expense of having the resulting
filesystem be mostly stored in a custom database (i.e. regular Tahoe nodes
won't know how to interpret it, so you couldn't view one of the directories
without that tool, and you couldn't share just a piece of the filesystem with
someone else).

cheers,
 -Brian