[tahoe-dev] newbie questions
Brian Warner
warner-tahoe at allmydata.com
Fri Sep 19 13:29:39 PDT 2008
On Fri, 19 Sep 2008 16:16:40 +0200
Alain Baeckeroot <alain.baeckeroot at univ-avignon.fr> wrote:
> Hello
Welcome to the list!
> We just discovered Tahoe and are curious to try it for our internal
> use, for a cheap, easy to install, and redundant distributed archive
> system.
Excellent!
> 1/ Is it possible to specify that we have N servers and want
> tolerance to K failure ? Or to know the state of redundancy ?
FYI: in our nomenclature (which is reasonably close to the standard usage)
"N" is the number of shares created, and "k" is the number of shares you
need. So you can tolerate losing N-k shares and still be able to recover the
file. If there are exactly N servers, you'll have one share per server, and
you can lose N-k servers and still be able to recover the file. (when you
have more than N servers, you can probably tolerate slightly more than N-k
lost servers).
Tahoe currently uses 3-out-of-10 encoding (k=3, N=10), so you can tolerate 7
lost shares (although you'd want to repair well before you got to that
point). There's nothing magic about these (very conservative) numbers. To
change them, just edit line 58 of src/allmydata/client.py, where
DEFAULT_ENCODING_PARAMETERS is defined. (We don't yet have a simple
configuration-file setting for this, sorry.)
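For reference, the definition you'd be editing looks roughly like the sketch
below; I'm quoting it from memory, so the exact keys and default values may
differ slightly in your checkout:

    # src/allmydata/client.py, around line 58 (sketch from memory; check
    # your tree for the exact keys and default values)
    DEFAULT_ENCODING_PARAMETERS = {"k": 3,      # shares needed to recover
                                   "happy": 7,  # "shares of happiness"
                                   "n": 10,     # total shares created
                                   }
    # e.g. for 15 servers with tolerance for 4 lost servers, use k=11, n=15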
The "check" and "deep-check" operations will tell you how many shares exist
for any given file, which is what I think you mean by "state of redundancy".
The docs/webapi.txt document describes how to perform these operations
through an HTTP POST command.
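As a rough sketch (assuming your node's webapi listens on the default port
3456, and with one of your own read-caps substituted in; docs/webapi.txt is
the authority on the exact query arguments), a check can be driven from
Python like this:

    import urllib
    cap = "URI:DIR2-RO:..."   # substitute one of your own read-caps
    url = "http://127.0.0.1:3456/uri/%s?t=check&output=JSON" % urllib.quote(cap)
    # giving urlopen a (possibly empty) body makes it issue a POST
    print(urllib.urlopen(url, "").read())
    # a directory cap can be walked recursively with t=start-deep-check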
> 2/ http://lwn.net/Articles/280483/ explains that files are locally
> encrypted with AES, then split with some kind of error correction
> algorithm. Is it possible to not encrypt, and only use the Tahoe as a
> redundant distributed filesystem ?
No, not currently. We've found that AES is fast enough (something on the
order of 8MBps/64Mbps) that removing it wouldn't make the system work
significantly faster, and the security properties are much easier to maintain
(and the code is much simpler and safer) by making the encryption mandatory.
If you'd like to do some benchmarking with and without AES, I'd love to hear
about your results. The upload/download process records many performance
statistics showing how long each part of the process took. If you were
to patch src/allmydata/immutable/upload.py:417 to replace the 'AES' object
with a dummy version, and again in download.py:51, then you could measure the
performance without the encryption overhead. Make sure not to mingle files
created this way with the regular encrypted ones, of course, since by
changing the algorithm to remove the encryption, you'll break the property
that a single URI (aka read-cap) refers to exactly one file.
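If it helps, here is the kind of stand-in object I have in mind; it mimics
the small interface (a key in the constructor, a process() method) that the
pycryptopp AES object exposes, but please double-check that interface against
the code in your tree:

    class NullCipher:
        """Benchmarking-only stand-in for the AES object: no encryption."""
        def __init__(self, key):
            self.key = key        # accepted but ignored
        def process(self, data):
            return data           # pass the plaintext straight through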
> 3/ Did someone benchmark the performance on a LAN ? with CLI and/or
> fuse ?
We have automated performance measurements, run both on a LAN (the "in-colo"
test) and over a home DSL link (the "dsl" test).
http://allmydata.org/trac/tahoe/wiki/Performance contains some summaries of
the results and links to the graphs of performance over time, like this one:
http://allmydata.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_speedstats_rate.html .
We currently upload in-colo at about 1.4MBps/11.3Mbps, and download at about
2.3MBps/18.6Mbps . We think that by adding pipelining of segments, we should
be able to at least double this rate (since from the graph of performance
over time, you can see that we used to get 4.5MBps down, before we reduced
the segment size last March).
These tests are all driven by code inside a Tahoe node. When driven by a
webapi operation or a CLI command (the CLI mostly uses the webapi interface),
the data must also be transferred to/from the node over HTTP, so the
performance will be slightly lower. We don't have any tests of performance
through FUSE.
Other performance numbers of interest include how much latency there is
(which matters more for small files than large ones), and the performance of
mutable files (which are used to contain directories). The allmydata.org
Performance page contains automated test results for these values too.
> 4/ About fuse modules, found in contrib/
> impl_a and impl_b said that only read is supported
> but http://allmydata.org/~warner/pycon-tahoe.html says FUSE plugin:
> "allowing them to read and write arbitrary files."
> We would be very happy if read and write work under linux :-) (we
> don't use atime, nor do tricky things on our filesystems)
Our linux FUSE modules are not very mature yet. My PyCon paper was meant to
point out that a fully-functional FUSE plugin will allow arbitrary
applications to access files inside the tahoe virtual filesystem, as opposed
to the user needing a special FTP-like program to manually move files between
the tahoe virtual filesystem and their local disk (where other applications
could work on them).
The windows FUSE module (which is actually based on the SMB protocol) works
fairly well, for both read and write, and is the basis for the allmydata.com
commercial product. Rob Kinninmont is working on a Mac FUSE module, which
ought to work on linux as well (the MacFUSE folks claim to be source-code
compatible with Linux/BSD FUSE). I believe his work is both read and write,
but I'll leave that to him to describe.
> 5/ Does it scale to TB filesystems ?
>
> Ideally we would like a ~15+ nodes with 500 GB each, and tolerance of
> 3~4 faulty servers.
Yes. The allmydata.com commercial grid currently has about 40 nodes, each
with a single 1TB disk, for a total backend space of 40TB. There is currently
about 11TB of user data in this grid, at 3-out-of-10 encoding, filling about
36TB of backend space. With k=3/N=10, we can lose 7 disks and not lose any
user data. When a disk fails, we will use the new Repairer (in the
soon-to-be-released 1.3.0 version) to regenerate the missing shares onto a
new server.
From a fault-analysis point of view, the number of files we'd lose if we lost
8 disks is small, and grows smaller with the total number of servers. This is
because the files are pseudo-randomly distributed. I don't have the math on
me right now, but I believe it is a very small percentage (the question to
pose to your math major friends is: choose 10 servers out of a set of 40, now
what are the chances that the 10 you picked include servers
#1-#8?).
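If you'd rather not wait for the math majors, a quick brute-force answer to
that question (assuming each file's 10 shares land on 10 distinct servers
chosen uniformly at random) looks like this:

    # chance that a random 10-of-40 placement includes all 8 dead servers,
    # i.e. that a given file is left with fewer than k=3 shares
    def comb(n, k):
        result = 1
        for i in range(k):
            result = result * (n - i) // (i + 1)
        return result
    print(float(comb(32, 2)) / comb(40, 10))   # ~6e-7, about 1 file in 1.7M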
If you had 15 nodes with 500GB each (so 7.5TB of backend space), and wanted
to tolerate 4 failures, you could use k=11/N=15, which could accommodate 5.5TB
of user data.
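The arithmetic behind that estimate, using your example numbers:

    servers, disk = 15, 500e9    # 15 nodes with 500 GB each
    k, n = 11, 15                # n-k = 4 lost servers tolerated
    raw = servers * disk         # 7.5 TB of backend space
    print(raw * k / n)           # ~5.5e12 bytes (5.5 TB) of user data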
> Congrats for this very impressive job. Best regards.
> Alain Baeckeroot.
Thanks!
-Brian