[tahoe-dev] Choosing a distributed FS?

Thu Apr 11 02:34:17 UTC 2013

Hi All,

Whilst I am familiar with the general concepts of a distributed 
filesystem/datastore, I have only recently started looking for candidate 
implementations, and find myself a little overwhelmed by the amount of 
choice available.

The elegance of Tahoe-LAFS appeals to me, but I am still unsure how well 
my various applications would fit with it and/or other options.

I have 4 possible applications in mind, and was hoping people here could 
give me advice to help me narrow down the field of candidates.

Whilst I am definitely not asking people in this forum to recommend some 
other product, I am hoping to at least get comments along the lines of: 
"Tahoe-LAFS would definitely be a good fit" or "Tahoe-LAFS would 
definitely not be a good fit - you could look at XYZ instead."
Of course, if you are comfortable giving more detailed comments on how 
different products would fit my needs, I would be very grateful.

I support the IT of an Architecture practice that employs about 40 
architects, so we have a set of servers in a computer room, and 40+ 
workstations on desks, all connected with a Gb LAN.
We manage the workstations with FOG ( http://www.fogproject.org/ ) which 
supports PXE boot, Wake-on-LAN (WOL), and a task scheduler, so I can 
schedule tasks that will perform unattended actions, such as booting all 
workstations to a specific PXE-provided OS, and initiating some action.
(Current actions in this vein include overnight virus scanning, for 
example.)
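
For anyone unfamiliar with WOL: FOG sends the wake-up packets itself, so 
nothing below is needed in practice, but the mechanism is simple enough 
to sketch. A "magic packet" is six 0xFF bytes followed by the target MAC 
address repeated sixteen times, usually broadcast over UDP. The function 
names, the broadcast address and the choice of port 9 are assumptions for 
illustration, not anything taken from FOG.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN magic packet for a MAC address."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # Six 0xFF bytes, then the MAC address repeated sixteen times.
    return b"\xff" * 6 + mac_bytes * 16


def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is a common convention)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Calling wake("00:11:22:33:44:55") (a made-up MAC) would broadcast one 
packet on the local subnet.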

So, here are the applications I am considering:

1. We typically use less than half the disk space on the workstations - 
so I am considering creating a distributed datastore out of the unused 
space, and utilising it for non-online storage - thereby freeing up 
considerable space on the servers.

The standard software image we deploy to the workstations is a 60GB 
image (of Windows - yuk!).
On the older workstations there may only be 20GB spare, but on many of 
the newer ones, there would be 60-140GB, depending on the drive 
installed. Assuming an average of 40GB spare, 40x40GB = 1.6TB, which is 
not too shabby.
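
One caveat on that 1.6TB figure: it is raw space. Tahoe-LAFS 
erasure-codes each file into N shares, any k of which are enough to 
rebuild it, so usable space is roughly raw space x k/N. Below is a 
minimal sketch of that arithmetic, assuming the Tahoe-LAFS default 
3-of-10 encoding and ignoring per-share overhead and the fact that the 
workstations have different amounts of spare disk. The function name is 
made up for illustration.

```python
def usable_capacity_gb(nodes: int, spare_gb_per_node: float,
                       k: int = 3, n: int = 10) -> float:
    """Rough usable capacity of a k-of-n erasure-coded grid, in GB.

    Every stored byte is expanded by a factor of n/k, so only k/n of
    the raw space holds distinct user data.
    """
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    raw_gb = nodes * spare_gb_per_node
    return raw_gb * k / n


if __name__ == "__main__":
    # 40 workstations with 40GB spare each, default 3-of-10 encoding.
    print(usable_capacity_gb(40, 40))
    # A hypothetical, less redundant 7-of-10 encoding for comparison.
    print(usable_capacity_gb(40, 40, k=7, n=10))
```

So with the default encoding the 1.6TB of raw space would hold roughly 
0.5TB of backups; a higher k trades redundancy for capacity.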

So my thought was to configure a small Linux kernel with a distributed 
FS installed (Tahoe-LAFS?), and use FOG to boot the workstations to this 
kernel each night - thus giving me a TB-scale datastore that I can use 
each night.
For example, this would make an ideal area for disk-based backup of the 
servers (fileserver, email server, intranet server, FOG image server).

It could also be a useful place to archive OS images.

The data would normally be large immutable files:
* multi-GB tar archives of full and incremental backup images;
* multi-GB OS image files.

Most data would not be appended to, but would simply be stored, and 
possibly deleted after some time.
Tahoe-LAFS seems a good choice here, although I have been looking at Ceph 
as well.

2. I am also considering if I want to make this distributed FS online 
during the day.
Tahoe-LAFS can support Windows storage nodes (yes?), and so I *could* add 
Tahoe-LAFS to our standard workstation image, and thereby have a TB-scale 
datastore available during office hours.
I am still not sure this is practical, as the number of running 
workstations will vary, and if someone came in over the weekend, we 
would have to start most of the workstations (using WOL) to get the 
datastore up and running.
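
To put a rough number on "how many workstations need to be up": with 
k-of-N encoding a file is readable as long as at least k of the N nodes 
holding its shares are online. The sketch below assumes each of those 
nodes is online independently with the same probability, which is a 
simplification (people tend to arrive and leave together), and the 
function name is hypothetical.

```python
from math import comb


def retrieval_probability(k: int, n: int, p_online: float) -> float:
    """Probability that at least k of n share-holding nodes are online.

    Assumes every node is up independently with probability p_online,
    i.e. a plain binomial model.
    """
    if not 0 <= p_online <= 1:
        raise ValueError("p_online must be between 0 and 1")
    return sum(
        comb(n, i) * p_online**i * (1 - p_online) ** (n - i)
        for i in range(k, n + 1)
    )


if __name__ == "__main__":
    # Default 3-of-10 encoding with half of the share holders switched on.
    print(retrieval_probability(3, 10, 0.5))
    # The same situation with a hypothetical 7-of-10 encoding.
    print(retrieval_probability(7, 10, 0.5))
```

Under this model a low k tolerates many machines being off, at the cost 
of storing more redundant data.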

And I am concerned that the load of being an active storage node might 
slow the workstations down sufficiently to annoy the users.

I would have to investigate the file-save semantics of the applications 
we use most (ArchiCAD, SketchUp, MS Word, image-editing), but I think 
they are mostly file-replace operations rather than file-append operations.

Does anyone have suggestions or comments on this?

3. I am planning to consolidate onto 3 new servers (HP ProLiant something); 
run our existing server processes in container-type VMs (e.g. OpenVZ); and 
create a distributed filesystem out of the local disks direct-attached 
to the servers (about 350GB each) to store VM images, email-inboxes, etc.
I still haven't worked out whether I will assemble the software 
components for this myself, or use a pre-assembled solution such as 
Proxmox or OpenStack.

The privacy features of Tahoe-LAFS are not so important in this 
application, and I am wondering if something like Ceph (or other?) would 
be a better fit?

4. I also support a volunteer library.
I have installed a single server and 6 or 7 workstations, all running 
Linux.
I am going to create a distributed FS, again using spare space on the 
workstations, for storing unattended backups of the server data.
In this instance, I wouldn't even need to boot the workstations to a 
different OS. The benefit of a separate OS would be isolation of the 
backup data; and the downside would be that the backup data would not be 
readily available for immediate recovery of a lost or corrupted file.

Again, Tahoe-LAFS and Ceph seem viable candidates, although there seem to 
be countless others which could also be considered.

Thanks in advance for any and all advice and suggestions.

Cheers!
Nik