[tahoe-dev] Potential use for personal backup

Saint Germain saintger at gmail.com
Tue Jun 12 00:06:31 UTC 2012


On Sun, 3 Jun 2012 22:31:09 -0600, "Zooko Wilcox-O'Hearn"
<zooko at zooko.com> wrote :

> On Sat, Jun 2, 2012 at 1:12 PM, Saint Germain <saintger at gmail.com>
> wrote:
> > On Wed, 23 May 2012 00:39:23 +0200, Guido Witmond <guido at wtmnd.nl>
> > wrote :
> >
> >> I can offer you a 25 GB storage node for testing for a short
> >> while, say a month. It's on my home-adsl with 10Mb/1Mb up- and
> >> download speeds. Don't expect miracles but it should give you an
> >> idea what to expect.
> 
> That's very cool that Guido Witmond offers free storage service to
> random people from the mailing list so they can test. :-) I'm hereby
> offering you free storage service too -- see below.
> 

Back on the mailing list after a busy week.
Thanks for the offer, but you are right, I will start simple first.
I will keep your offer in mind in case I find the energy to do a
"remote backup" benchmark.

> > One interesting point is that I started with a 1-out-of-2 scheme
> > (the other storage node was local on my computer) and that Guido's
> > storage node was unavailable. So when I copied a file on the
> > "grid", it was pretty quick.
> >
> > So that leads to one interesting question. Next time I will connect
> > with Guido's node available, will it upload the files I have
> > uploaded on my local storage node ?
> 
> As Terrell correctly answered, it will not automatically do this
> repair/rebalance/fix-up behavior.
> 
> From one perspective, this is a feature rather than a bug. It means
> the behavior of the system is simpler and easier to predict.
> 
> See the FAQ "Will this thing run only when I tell it to? Will it use
> up a lot of my network bandwidth, CPU, or RAM?":
> 
> https://tahoe-lafs.org/trac/tahoe-lafs/wiki/FAQ#Q18_unobtrusive_software
> 
> Of course, from another perspective this is missing functionality.
> 

Yes I understand now.

> May I make a suggestion? Start simple!
> 
> Set K=H=N=1 and use a single storage server.
> 
> Then do your experiments and measurements. For one thing, this will
> avoid complexity and interesting failure modes, and for another thing
> it will make the system more comparable to other open source backup
> tools that you are evaluating like obnam  ¹ and backshift ² and so on.
> Those systems typically have a single server.
> 
> I think the erasure-coding and the way that Tahoe-LAFS automatically
> uses "the current set of servers"  (instead of requiring an
> administrator to tweak some configuration whenever a server comes or
> goes) are really cool features. But, they should probably be regarded
> as "advanced features" that are not necessary for a standard backup
> operation, and which inevitably bring complications.
> 
> 

OK, I did what you recommended and I have finished my benchmark with
Tahoe-LAFS. I used an artificial random directory so that everyone can
reproduce it (24 GB, 10 000 files). The files contain only numbers, in
order to see the effect of compression.
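
For anyone who wants to build a similar data set at a smaller scale, here
is a minimal sketch. It is not the script used for this benchmark: the
function name, the file naming and the sizes are all assumptions. A fixed
seed makes the content identical on every run.

```python
from pathlib import Path
import random


def make_digit_tree(root, n_files, file_size, seed=0, line_len=80):
    """Create n_files files of exactly file_size bytes, digits and newlines only.

    Hypothetical helper for illustration; names and layout are assumptions.
    """
    rng = random.Random(seed)  # fixed seed, so every run yields the same bytes
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_files):
        lines = []
        remaining = file_size
        while remaining > 0:
            n = min(line_len, remaining - 1)  # keep one byte for the newline
            lines.append("".join(rng.choices("0123456789", k=n)) + "\n")
            remaining -= n + 1
        path = root / f"file_{i:05d}.txt"
        # write_bytes avoids newline translation, so sizes are exact everywhere
        path.write_bytes("".join(lines).encode("ascii"))
        paths.append(path)
    return paths
```

For example, make_digit_tree("bench", 1000, 1_000_000) would give a
directory of 1000 files of about 1 MB each.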

So I started from there for my first backup:
http://groups.google.com/group/backshift/msg/d3d645e4a5a55ae0

Then I made some modifications:
  - Rename a directory of small files (1000 files of 1 MB)
  - Rename a 500 MB file
  - Modify a 500 MB file (delete a line in the middle of the file)
  - Modify a 500 MB file (modify a line in the middle of the file)
  - Duplicate a 500 MB file
  - Duplicate a 500 MB file and modify a line in the middle

First backup: 24 GB in 111 min
Second backup (after modifications): +1751 MB in 118 min
Restore: 68 min

As a basis for comparison, here are the results for bup:
First backup: 12693 MB in 36 min
Second backup (after modifications): +37 MB in 12 min
Restore: 34 min
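
To put the timings on a common scale, they can be converted to
throughput. This is only a rough sketch: it assumes that both tools
processed the same 24 GB source directory and that 1 GB = 1000 MB. The
variable and function names are mine.

```python
SOURCE_MB = 24 * 1000  # assumption: 24 GB source set, with 1 GB = 1000 MB


def mb_per_s(megabytes, minutes):
    """Average throughput in MB/s for a given volume and duration."""
    return megabytes / (minutes * 60)


# Durations in minutes, taken from the results above.
durations = {
    "tahoe first backup": 111,
    "tahoe restore": 68,
    "bup first backup": 36,
    "bup restore": 34,
}

rates = {name: mb_per_s(SOURCE_MB, mins) for name, mins in durations.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} MB/s of source data")

# Ratio of the reported bup repository growth to the source size
# (assumes 12693 MB is what bup stored for the whole 24 GB).
bup_stored_ratio = 12693 / SOURCE_MB
print(f"bup stored/source ratio: {bup_stored_ratio:.2f}")
```

These are averages over the whole run, so they say nothing about where
the time is spent.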

The complete benchmark (with a description of the setup, the
modifications, etc.) is currently only in French.
I may need to do the translation myself if Google Translate is not good
enough ;-)


> Now, David-Sarah Hopwood, Zancas Wilcox, and I have founded a startup
> company to commercialize Tahoe-LAFS storage service. Our business
> model is "ciphertext storage service". Customers pay us a metered
> price ($1.00/GB/month) and we operate a Tahoe-LAFS storage server on
> behalf of that customer. The storage server writes the ciphertext to
> Amazon S3 instead of to a local disk. The name of our company is Least
> Authority Enterprises (https://leastauthority.com).
> 

Great!
Maybe you can add it to:
http://en.wikipedia.org/wiki/Comparison_of_online_backup_services
http://www.onlinebackupdeals.com
Even though I understand that Tahoe-LAFS is more than just backup.

> I hope to make Least Authority Enterprises into "the simplest way to
> get started using Tahoe-LAFS".
> 
> Least Authority Enterprises hereby offers you a free trial account.
> We'll leave it running until the first month where we get a bill from
> Amazon.com and we say "Holy crap! We're still paying for Saint
> Germain's experiment!". Then we'll turn it off.
> 
> You *can* combine the Least Authority Enterprises (LAE) server with
> Guido Witmond's server, in a 1-out-of-1 scenario (each file will be
> randomly allocated to one server or the other) or in a 1-out-of-2
> (each file will be replicated to both servers). But like I say, you
> should probably start simple and do your basic experiment before you
> try something like that.
> 

Thanks, but I followed your advice and started with a single storage
node on my computer, so I don't need your kind offer at the moment.
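
For reference, the K=H=N=1 setup corresponds to three encoding options
in the [client] section of tahoe.cfg. The sketch below only generates
that fragment and computes the storage expansion factor N/K; the helper
names are hypothetical, and the option names are the ones I understand
from the Tahoe-LAFS configuration documentation.

```python
import configparser
import io


def client_encoding_section(needed=1, happy=1, total=1):
    """Return a tahoe.cfg [client] fragment for the given K, H, N.

    Hypothetical helper; it only formats text and does not touch a node.
    """
    if not (1 <= needed <= total and 1 <= happy <= total):
        raise ValueError("expected 1 <= needed <= total and 1 <= happy <= total")
    cfg = configparser.ConfigParser()
    cfg["client"] = {
        "shares.needed": str(needed),  # K: shares required to rebuild a file
        "shares.happy": str(happy),    # H: servers required for a successful upload
        "shares.total": str(total),    # N: shares produced per file
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


def expansion_factor(needed, total):
    """Stored bytes per source byte under K-of-N encoding (N / K)."""
    return total / needed


print(client_encoding_section(1, 1, 1))
```

With 1-out-of-1 the expansion factor is 1, while the 1-out-of-2 scheme
mentioned earlier stores every file twice.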

> The LAE storage server will have faster network than Guido's home ADSL
> line, but the speed of the network probably doesn't matter much.
> 
> The biggest performance win will be when you run "tahoe backup" and it
> sees that the timestamps on a file haven't changed, so it skips over
> the file.
> 
> The second-biggest performance win will be when you upload a file (by
> "tahoe backup" or otherwise) and after transferring the file contents
> to your Tahoe-LAFS gateway, and after the gateway encrypts and hashes
> it, the gateway talks to the storage servers and finds out that a copy
> of this identical file has already been uploaded, so it does not need
> to transfer the contents to the storage server. (This is only a
> significant win if the distance from your client to your gateway is
> shorter than the distance from your gateway to your server. This is
> another reason -- besides confidentiality and integrity -- why it is a
> good idea to run the gateway locally on the same computer as the
> client.)
> 

See my benchmark above. I seem to get no performance win at all?
But it correctly detected the duplicated file, so what could have
gone wrong?
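
One way to picture the two wins described above is the following toy
model. It is a sketch under my own assumptions, not the actual "tahoe
backup" code: the dictionaries, the function names and the use of a plain
SHA-256 digest as the duplicate check are all hypothetical.

```python
import hashlib
import os


def file_digest(path, block=1 << 20):
    """SHA-256 of a file, read block by block."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            h.update(chunk)
    return h.hexdigest()


def plan_backup(paths, seen_stat, seen_digests):
    """Classify each path as 'skip', 'dedup' or 'upload'.

    seen_stat:    dict path -> (size, mtime_ns) recorded by the previous run
    seen_digests: set of content digests already stored
    Both are updated in place (hypothetical stand-ins for a backup database).
    """
    plan = {}
    for path in paths:
        st = os.stat(path)
        key = (st.st_size, st.st_mtime_ns)
        if seen_stat.get(path) == key:
            plan[path] = "skip"        # first win: the file is not even read
            continue
        digest = file_digest(path)     # the whole file is read and hashed
        if digest in seen_digests:
            plan[path] = "dedup"       # second win: no transfer to storage
        else:
            plan[path] = "upload"
            seen_digests.add(digest)
        seen_stat[path] = key
    return plan
```

In this model a renamed file has a new path, so it misses the timestamp
check and still has to be read and hashed, even if nothing is sent to
the storage server. With the gateway and the storage node on the same
computer, the second win would also be small, as noted in the quoted
text. Whether either point explains my numbers is only a guess.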

> The biggest performance *loss* will be when you've made a small change
> to a large file, such as a minor modification to a 2 GB virtual
> machine image, and tahoe re-uploads the entire 2 GB file instead of
> computing a delta. :-)
> 

That I can confirm ;-)
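
The difference between whole-file and chunk-level deduplication can be
shown with a small sketch. It is a simplification under my own
assumptions: fixed-size chunks and plain SHA-256 identifiers, which is
not how Tahoe-LAFS or bup actually work (bup splits files with a rolling
checksum). All function names are hypothetical.

```python
import hashlib


def chunk_ids(data, chunk_size):
    """SHA-256 identifier of each fixed-size chunk of data."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def new_bytes_whole_file(old, new):
    """Bytes to store when identity is the hash of the whole file."""
    same = hashlib.sha256(old).digest() == hashlib.sha256(new).digest()
    return 0 if same else len(new)


def new_bytes_chunked(old, new, chunk_size):
    """Bytes to store when only chunks not already known are kept."""
    known = set(chunk_ids(old, chunk_size))
    total = 0
    for i in range(0, len(new), chunk_size):
        chunk = new[i:i + chunk_size]
        if hashlib.sha256(chunk).hexdigest() not in known:
            total += len(chunk)
    return total


# 1 MiB of non-repeating data: consecutive 4-byte counters.
old = b"".join(i.to_bytes(4, "big") for i in range(262144))

# Case 1: change one byte in the middle, length unchanged.
modified = bytearray(old)
modified[len(old) // 2] ^= 0xFF
modified = bytes(modified)

# Case 2: delete one byte in the middle, everything after it shifts.
shortened = old[:len(old) // 2] + old[len(old) // 2 + 1:]

for name, new in (("modify", modified), ("delete", shortened)):
    print(name,
          "whole-file:", new_bytes_whole_file(old, new),
          "chunked:", new_bytes_chunked(old, new, 65536))
```

An in-place change costs one chunk with chunking but the whole file
with whole-file identity. A deletion shifts every later fixed-size
chunk, which is why tools aiming at small deltas use content-defined
chunk boundaries rather than fixed offsets.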

Thank you again for your patience and your help!

