[tahoe-dev] The airspeed of a disk-laden swallow (African)
Shawn Willden
shawn-tahoe at willden.org
Sun May 31 19:16:35 PDT 2009
Has there been any discussion or thought about how hard it would be to
bootstrap Tahoe storage nodes via sneakernet?
I'm finding myself putting a huge amount of effort into dealing with the
problem of files "aging" between the time they're recognized as needing to
be backed up and the time they actually get uploaded. This "bootstrap"
problem for an initial backup is a really big one, given the amount of data
most people I know have and the upstream bandwidth most of them have.
Since my backup work is focused primarily on "friendnets", sneakernet may be a
very viable option. "Never underestimate the bandwidth of a station wagon
loaded with tapes hurtling down the highway", and all that.
If I could mount a 2 TB drive on my machine, generate 3-of-10 shares for 600
GB of data and write them to the drive, and then take it around to my
friends' homes copying the right 200 GB for each of them and appropriately
registering it with their Tahoe node, I could eliminate over two *years* of
uploading at my upstream data rates. Having bootstrapped the storage nodes
that way, I calculate that my connection could easily keep up with daily
backups, even without using compressed deltas, and without a helper.
Occasionally I might get a big surge in new data, causing it to take a few
days (or weeks) to catch up again, and perhaps once in a while I might have
to do the sneakernet thing again, but probably not.
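For what it's worth, the back-of-the-envelope arithmetic behind that "two
years" figure looks roughly like this (the 256 kbit/s upstream is just an
illustrative assumption, not my exact rate):

    data_gb = 600
    k, n = 3, 10                      # 3-of-10 erasure coding
    upstream_kbps = 256               # assumed upstream rate, kbit/s (illustrative)

    share_bytes = data_gb * 1e9 * n / k               # ~2 TB of shares to push
    seconds = share_bytes * 8 / (upstream_kbps * 1000.0)
    print("%.0f days, about %.1f years" % (seconds / 86400, seconds / 86400 / 365))
    # -> roughly 720 days, i.e. about two years

Copying the same 2 TB onto a drive and driving it around takes hours, which
is the whole appeal.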
Has anyone thought about this, or what would be involved?
On the client end, it seems like Tahoe could go through the normal process of
identifying storage servers, generating shares, etc., but then simply write
them to disk somewhere rather than uploading them, each share somehow labeled
with the intended destination storage server. Theoretically, it could even
be mixed -- if sending to server foo, bar or baz, then write to this disk,
otherwise transmit normally.
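Concretely, I'm imagining something shaped like the sketch below. All of the
names and the on-disk layout here are hypothetical, not existing Tahoe APIs;
it's just the shape of the idea:

    import os

    SNEAKERNET_SERVERS = {"serverid-foo", "serverid-bar"}   # servers I'd visit by car
    EXPORT_DIR = "/mnt/sneakernet"                           # the 2 TB drive

    def place_shares(storage_index, shares):
        """shares: dict mapping (server_id, sharenum) -> share bytes."""
        for (server_id, sharenum), data in shares.items():
            if server_id in SNEAKERNET_SERVERS:
                # write the share to the drive, labeled with its destination server
                dest = os.path.join(EXPORT_DIR, server_id, storage_index)
                if not os.path.isdir(dest):
                    os.makedirs(dest)
                with open(os.path.join(dest, str(sharenum)), "wb") as f:
                    f.write(data)
            else:
                # the normal path (hypothetical helper)
                upload_share(server_id, storage_index, sharenum, data)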
On the server end, either the data could be placed directly into the storage
area, with whatever necessary bookkeeping updated, or else a simulated client
could push the data to the server, but at IPC (or at least LAN) speeds.
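If the client wrote the shares out in something close to the server's native
layout, the "place directly into the storage area" option might be little
more than a copy, roughly like this (I'm assuming shares live under
BASEDIR/storage/shares/<prefix>/<storage-index>/<sharenum>; someone who
knows the storage server code better should correct me, and the lease and
accounting side is hand-waved entirely):

    import os, shutil

    def import_shares(export_dir, server_basedir):
        # export_dir: the per-server directory written by the client-side
        # sketch above, i.e. EXPORT_DIR/<this server's id>/
        shares_root = os.path.join(server_basedir, "storage", "shares")
        for storage_index in os.listdir(export_dir):
            src = os.path.join(export_dir, storage_index)
            dest = os.path.join(shares_root, storage_index[:2], storage_index)
            if not os.path.isdir(dest):
                os.makedirs(dest)
            for sharenum in os.listdir(src):
                shutil.copy2(os.path.join(src, sharenum),
                             os.path.join(dest, sharenum))
        # whatever lease/bookkeeping state the server keeps would still need
        # updating -- that's the part the simulated client would handle instead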
Crazy idea, I know. It sure would solve a lot of my problems, though.
Any thoughts on how difficult this would be to implement?
Shawn