[tahoe-dev] Real-world Tahoe-LAFS grid deployment

Francois Deppierraz francois at ctrlaltdel.ch
Sun Nov 14 11:57:19 UTC 2010


Hi folks,

For a long time now, I have been thinking about posting some details on
my current Tahoe-LAFS deployment and the operational issues I ran into.

The grid is currently composed of 71 storage nodes for a total raw
capacity of 24 TB. Most of the capacity comes from three 4U Transtec
servers located in three different data-centers. Each server typically
has 24 SATA disks of 320 GB each, two of them as a RAID-1 system
partition and the remaining 22 as independent storage nodes. All three
data-centers have 100 Mbps+ Internet access and a latency of less than
10 ms between each other.

This grid currently serves several different use cases:

  1. An HTTPS-accessible shared file store used by one company, a few
     associations and friends.
  2. A backup repository for multiple servers, with 'tahoe backup' run
     as a daily cron job against a Tahoe-LAFS gateway running locally.
  3. A private file repository locally accessed by HTTP and SFTP.
  4. An enterprise authentication-less rapidshare-like file sharing
     service.

The first incarnation of this grid was composed of one virtual machine
and two physical servers, each with two 1 TB disks. Back then, I was
using 2-of-5 encoding parameters because there weren't enough storage
servers to handle the default 3-of-10.

After the three Transtec servers were added, I switched to 3-of-10
until I read Zooko's email¹, in which he advised setting M to the number
of storage nodes in the grid. Now, almost all the clients of this grid
are using 22-of-66.
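
To compare the three parameter sets mentioned above, here is a minimal
sketch in Python. It assumes the usual reading of "k-of-N" (any k of the
N shares are enough to recover a file) and one share per storage node;
the function name encoding_summary is only illustrative and is not part
of Tahoe-LAFS.

```python
from fractions import Fraction


def encoding_summary(k, n):
    """Return (expansion factor, tolerable share losses) for k-of-n.

    The expansion factor n/k is how much raw storage one byte of data
    consumes; n - k is how many shares can vanish before the file
    becomes unrecoverable.
    """
    return Fraction(n, k), n - k


# The three parameter sets used over the life of the grid.
for k, n in [(2, 5), (3, 10), (22, 66)]:
    expansion, losses = encoding_summary(k, n)
    print(f"{k}-of-{n}: {float(expansion):.2f}x expansion, "
          f"tolerates {losses} lost shares")
```

Under those assumptions, 22-of-66 costs slightly less raw space per byte
than 3-of-10 while tolerating many more lost shares, as long as the
shares really do end up on distinct disks.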

The need to change encoding parameters makes adding more servers to the
grid or removing old servers cumbersome. To make matters worse, the
current repairer uses the same encoding parameters which were used when
the file was initially uploaded. In this case, that means that a file
which was uploaded early in the life of the grid (my root directory for
instance) is still only stored on 5 disks. With very bad luck, those
disks might even be housed in the same server, because Tahoe-LAFS is
currently missing peer selection with data-center/rack/server awareness
(ticket #467).

Basically, increasing the grid size currently means re-uploading all
the files with new encoding parameters. Fortunately, this is easily done
with commands like 'tahoe cp -r tahoe:old tahoe:new', but it takes quite
a bit of time...

Another bunch of issues has to do with the repair process itself.

I'm currently using the 'tahoe deep-check --repair' command in cron jobs
to handle the periodic check and repair process. This process is quite
slow: it takes about 1.3 seconds per file on this grid. For instance,
one root directory on this grid contains 59'000 files, and the check and
repair process takes about 21 hours to finish even though only 60
repairs were performed.
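
The figures above are consistent with each other, as this small Python
sketch shows. It assumes the per-file cost is roughly constant; the
function names are only illustrative.

```python
def estimated_hours(file_count, seconds_per_file):
    """Estimate the wall-clock duration of a deep-check run, in hours."""
    return file_count * seconds_per_file / 3600.0


def repaired_fraction(repairs, file_count):
    """Fraction of checked files that actually needed a repair."""
    return repairs / file_count


hours = estimated_hours(59_000, 1.3)
fraction = repaired_fraction(60, 59_000)
print(f"about {hours:.1f} hours, {fraction:.2%} of files repaired")
```

In other words, nearly all of the run time is spent checking files that
turn out to be healthy.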

Another issue I experienced during the setup of this check and repair
process is bug #755. Basically, the deep-check command fails as soon as
a single file is not accessible or irremediably broken. I plan to have a
patch ready for release 1.9.0 that fixes that.

Because the check and repair process is not yet working as expected,
garbage collection is currently disabled on this grid to prevent data loss.

I would love to see a smarter repair process implemented. This is
already documented in tickets such as #483 (repairer service), #450
(checker / repair agent) and #543 (rebalancing manager). The repairer
could also perhaps use a sqlite database similar to the backupdb to
remember the last time a specific capability was checked. This could be
used to considerably speed up a daily deep-check operation.
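
As a rough illustration of the sqlite idea, here is a minimal sketch in
Python. Everything in it is an assumption made for the example: the
table name, the schema and the function names are hypothetical and do
not mirror the actual backupdb. A real implementation would probably
also want to avoid storing capabilities in clear text.

```python
import sqlite3
import time


def open_checkdb(path=":memory:"):
    """Open (or create) a database of last-check timestamps."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS last_checked ("
        " cap TEXT PRIMARY KEY,"
        " checked_at REAL NOT NULL)"
    )
    return db


def needs_check(db, cap, max_age, now=None):
    """True if cap was never checked or was checked over max_age seconds ago."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT checked_at FROM last_checked WHERE cap = ?", (cap,)
    ).fetchone()
    return row is None or now - row[0] > max_age


def record_check(db, cap, now=None):
    """Remember that cap has just been checked."""
    now = time.time() if now is None else now
    db.execute(
        "INSERT OR REPLACE INTO last_checked (cap, checked_at) VALUES (?, ?)",
        (cap, now),
    )
    db.commit()
```

A daily job could then call needs_check() on each capability it walks
and skip those checked within, say, the last week, which would spread
the full check over several runs.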

I'm very interested in other people's experiences with actual, planned
or even aborted Tahoe-LAFS deployments. So, please speak up now ;)

Cheers,

François

¹ http://www.mail-archive.com/tahoe-dev@tahoe-lafs.org/msg00793.html
http://tahoe-lafs.org/trac/tahoe-lafs/ticket/467
http://tahoe-lafs.org/trac/tahoe-lafs/ticket/755
http://tahoe-lafs.org/trac/tahoe-lafs/ticket/483
http://tahoe-lafs.org/trac/tahoe-lafs/ticket/450
http://tahoe-lafs.org/trac/tahoe-lafs/ticket/543

