[tahoe-dev] tahoe-lafs suitable for 500TB ~ 3PB cluster?

Zooko O'Whielacronx zookog at gmail.com
Sun Apr 21 21:14:33 UTC 2013


Hi, Dieter! Welcome.

On Tue, Apr 16, 2013 at 11:55 AM, Plaetinck, Dieter <dieter at vimeo.com> wrote:
> I'm looking to build a large highly available storage cluster
> i'm not really interested in the specific encryption/security features.  So far I've been using openstack swift
> for a smaller cluster (70TB usable, 216TB raw) and I'm reasonably satisfied, but it has a lot of overhead in terms of cpu, network and storage, because it uses a simple replication strategy (in our case rep. level 3) and because the design is just simple and inefficient.

This is an interesting use case! I would be very pleased if Tahoe-LAFS
satisfied this. I don't know of any reason off the top of my head why
it can't. However, the largest Tahoe-LAFS deployment to date to my
knowledge was the allmydata.com system, which topped out at about 200
disks. So your cluster would be the largest one to date, which of
course means you'll probably get to contribute to the public good by
discovering and reporting never-before-seen problems. ;-)

> I'm looking to store video files (avg about 150MB, max size 5GB).
> the size of the cluster will be between 500TB and 3PB (usable) space, depending on how feasible it would be to implement.
> at 500TB i would need 160Mbps output performance, at 1PB about 600Mbps, at 2PB about 1500Mbps and at 3PB about 6Gbps.
> output performance scales exponentially wrt cluster size.

I don't understand this. Why do output (what we would call "download"
or "read") bandwidth requirements go up when the cluster size goes up?
Oh, I guess because you need to service more users with a larger
cluster.
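
Just to put rough numbers on what each storage server would have to
deliver, here is some back-of-the-envelope arithmetic (a sketch only:
the bandwidth figures are the ones from your message, and the server
counts are assumptions I made up, not recommendations):

    # Illustrative arithmetic only: capacities and required output rates are
    # from Dieter's message; the storage-server counts are assumed.
    CASES = [
        # (usable capacity, required aggregate output in Mbps, assumed servers)
        ("500 TB", 160, 25),
        ("1 PB", 600, 50),
        ("2 PB", 1500, 100),
        ("3 PB", 6000, 150),
    ]

    for capacity, aggregate_mbps, servers in CASES:
        # With erasure coding, each download pulls shares from several servers
        # at once, so aggregate read demand spreads roughly evenly over them.
        per_server = float(aggregate_mbps) / servers
        print("%s: ~%d Mbps aggregate, ~%.0f Mbps per server across %d servers"
              % (capacity, aggregate_mbps, per_server, servers))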

I think the fastest and best way to find out is to try deploying
Tahoe-LAFS on some subset of your hardware and network. Take a couple
of machines, driving a few dozen disks, deploy Tahoe-LAFS on them, and
measure the download performance. If it isn't good enough, write to
this list so that we know we should look for ways to improve it.
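
For a crude download-throughput measurement, something like this works
(a sketch only: it just shells out to the ordinary "tahoe put" and
"tahoe get" commands, and the test-file path is hypothetical):

    # Crude download-throughput measurement sketch. Assumes a running
    # Tahoe-LAFS client node on this machine; the test file is hypothetical.
    import os
    import subprocess
    import time

    TEST_FILE = "/tmp/sample-video.mp4"   # e.g. one of your ~150 MB videos

    # Upload once; "tahoe put" prints the resulting file capability.
    cap = subprocess.check_output(["tahoe", "put", TEST_FILE]).decode("ascii").strip()

    # Time a download of that capability.
    start = time.time()
    subprocess.check_call(["tahoe", "get", cap, "/tmp/downloaded.mp4"])
    elapsed = time.time() - start

    bits = os.path.getsize(TEST_FILE) * 8
    print("download: %.1f Mbps" % (bits / elapsed / 1e6))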

By the way, I'm planning to do a very similar experiment with someone
else within a week or so. If that experiment comes together, we'll
post our results to this list.

> * what is the cpu,memory,network overhead of tahoe-lafs?

I'd say "pretty reasonablish" on all these. It takes about 56 MB of
RAM per storage server process. We used to run, IIRC, eight such
processes on a 1-core Opteron, so that gives you a feel for how little
CPU one requires.

There's a performance problem in Tahoe-LAFS, but it isn't *overhead*,
it is unnecessary delay. Network operations often stall unnecessarily
and take longer than they need to. But they aren't busily saturating
the network while they stall; they're leaving it idle, so it doesn't
interfere as much as it might with other network usage.

We have a wiki page about performance. I'm afraid it isn't really
well-organized, but the facts on it are mostly right:

https://tahoe-lafs.org/trac/tahoe-lafs/wiki/Performance

>  does it have a lot of consistency checking overhead? what does it have to do that's not just (for ingest) splitting incoming files, sending chunks to servers to store them to disk, and on output the reverse? i assume the error codes are not too expensive to compute and to check because the cpu has opcodes for them?

It uses SHA256 and Merkle Trees, and it tests integrity both before
and after the erasure-decoding step. This is really heavy-duty from
the perspective of filesystems folks, but we came from the perspective
of crypto folks, where SHA256 is a strong standard.
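
To give a flavor of it, here is a toy sketch of a SHA256 Merkle tree
over file segments (this just illustrates the idea; it is not
Tahoe-LAFS's actual share format or hash-tree layout):

    # Toy sketch: SHA256 Merkle tree over file segments. Illustration only;
    # NOT Tahoe-LAFS's actual hash-tree format.
    import binascii
    import hashlib

    def sha256(data):
        return hashlib.sha256(data).digest()

    def merkle_root(segments):
        level = [sha256(s) for s in segments]
        while len(level) > 1:
            if len(level) % 2:            # duplicate the last node if odd
                level.append(level[-1])
            level = [sha256(level[i] + level[i + 1])
                     for i in range(0, len(level), 2)]
        return level[0]

    # A downloader verifies each segment it fetches against hashes chaining
    # up to a root it already trusts, so corruption is caught early.
    segments = [("segment-%d" % i).encode("ascii") for i in range(8)]
    print(binascii.hexlify(merkle_root(segments)).decode("ascii"))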

> * could i run it on commodity hardware servers? (i.e. say servers with 24x4TB=96TB, a common 8-core cpu and 4-8 GB ram?)

Yes.
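
As a sanity check against those specs, the arithmetic on the ~56 MB
figure above works out comfortably (the process count per box is an
assumption, not a requirement):

    # Back-of-the-envelope RAM check for the proposed boxes.
    # ~56 MB per storage server process is the figure from experience above;
    # running a handful of processes per box is an assumption.
    MB_PER_PROCESS = 56
    PROCESSES_PER_BOX = 4          # assumed, e.g. one per group of six disks
    print("~%d MB for storage processes, vs. 4-8 GB of RAM per box"
          % (MB_PER_PROCESS * PROCESSES_PER_BOX))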

> * can you change the parameters on a production cluster? (i.e. extend the cluster with new servers (or permanently take servers out), and increase or decrease how many nodes you can lose)

There are some problematic cases that can arise here, because
Tahoe-LAFS doesn't spontaneously "rebalance" or migrate data around to
optimize for the new population of servers. But it is really nice that
it gracefully handles addition and removal (even unannounced or
unplanned removal) of servers without noticeable downtime.

The problematic cases have to do with the long term. What happens when
you've spread a file out across 10 servers, and after a few years, 7
of that original tranche have been decommissioned? There's a process
for re-spreading the file across newer servers, but that process isn't
triggered automatically, and there are some cases where it doesn't do
the right thing.
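
For reference, the re-spreading process I mean is a repair pass, e.g.
running "tahoe deep-check --repair" over your directories. Here is a
minimal sketch of driving it on a schedule (the "backups:" alias is
hypothetical):

    # Minimal sketch of triggering the (non-automatic) repair pass that
    # re-spreads shares onto currently-available servers. The "backups:"
    # alias is hypothetical.
    import subprocess

    def repair_pass(alias="backups:"):
        subprocess.check_call(
            ["tahoe", "deep-check", "--repair", "--add-lease", alias])

    if __name__ == "__main__":
        repair_pass()              # e.g. run this weekly from cron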

Hopefully by the time 7 of your original servers have been
decommissioned, we'll have improved this. ;-) It is one of the features
which _might_ be the topic of a Google Summer of Code project for this
summer:

https://tahoe-lafs.org/trac/tahoe-lafs/wiki/GSoCIdeas

Regards,

Zooko

