[tahoe-dev] Potential use for personal backup

Saint Germain saintger at gmail.com
Thu Jun 14 23:21:13 UTC 2012


On Thu, 14 Jun 2012 14:15:41 -0300, "Zooko Wilcox-O'Hearn"
<zooko at zooko.com> wrote:

> Thanks for reporting about these backup measurements, Saint Germain!
> 
> On Mon, Jun 11, 2012 at 9:06 PM, Saint Germain <saintger at gmail.com>
> wrote:
> >
> > As a basis for comparison, here are the results for bup:
> > First backup: 12693 MB in 36 min
> 
> Hm, does bup do compression? If not, why wasn't this 24 GB?
> 

As it is based on Git, I think it uses zlib compression.
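For what it's worth, Git-style object storage deflates each object with
zlib before writing it out, which would explain the roughly 2:1 ratio.
A rough Python illustration of the idea (not bup's actual code):

    import hashlib, zlib

    def store_object(data):
        """Return (object id, bytes that would be written to disk)."""
        # content id (Git also hashes a small header first, omitted here)
        oid = hashlib.sha1(data).hexdigest()
        return oid, zlib.compress(data)   # deflate before storing

    payload = b"some file contents\n" * 1000
    oid, packed = store_object(payload)
    print(oid, len(payload), "->", len(packed), "bytes on disk")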

> > Second backup (after modifications): +37 MB in 12 min
> 
> This is very cool! I want to steal this feature from
> bup/backshift/Low-Bandwidth-File-System and add it to Tahoe-LAFS...
> 
> 

Don't forget Obnam. It is the only one which does deduplication on file
data chunks together with encryption (which is not easy, as you
certainly already know).
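The tricky part, as I understand it, is that naive encryption destroys
deduplication: two runs encrypting the same chunk under different keys
produce different ciphertexts. The usual trick is convergent encryption,
where the key is derived from the chunk itself. A toy sketch of the idea
(illustration only, not Obnam's or Tahoe-LAFS's real key derivation):

    import hashlib

    def convergent_ids(chunk, secret=b""):
        # Key depends only on the content (plus an optional per-user
        # secret), so identical chunks lead to identical ciphertexts
        # and the server only needs to keep one copy.
        key = hashlib.sha256(secret + chunk).digest()
        storage_index = hashlib.sha256(b"index:" + key).digest()
        return key, storage_index

    k1, i1 = convergent_ids(b"same chunk")
    k2, i2 = convergent_ids(b"same chunk")
    assert i1 == i2   # identical chunks -> identical index -> dedupable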

> > Then I did some modifications:
> >  - Rename a directory of small files (1000 files of 1 MB)
> >  - Rename a 500 MB file
> >  - Modify a 500 MB file (delete a line in the middle of the file)
> >  - Modify a 500 MB file (modify a line in the middle of the file)
> >  - Duplicate a 500 MB file
> >  - Duplicate a 500 MB file and modify a line in the middle
> >
> > First backup: 24 GB in 111 min
> > Second backup (after modifications): +1751 MB in 118 min
> > Restore: 68 min
> 
> 
> >> The second-biggest performance win will be when you upload a file
> >> (by "tahoe backup" or otherwise) and after transferring the file
> >> contents to your Tahoe-LAFS gateway, and after the gateway
> >> encrypts and hashes it, the gateway talks to the storage servers
> >> and finds out that a copy of this identical file has already been
> >> uploaded, so it does not need to transfer the contents to the
> >> storage server.
> 
> > See my benchmark above. I seem to have no performance win at all?
> > But it has correctly detected the deduplicated file, so what could
> > have gone wrong?
> 
> I think what's going on here is that it takes just as long to transfer
> the file from the tahoe client to the gateway and let the gateway
> discover that it is a duplicate as to transfer the file from the
> client to the gateway and then from the gateway to the storage server
> for storage, since they are all on the same machine.
> 
> So you can't see any performance improvement due to the deduplication
> when the gateway and the storage server are on the same computer.
> 
> If you want to see a performance improvement due to the deduplication,
> then keep the gateway on the same computer with the client, but move
> the storage server far away, e.g. to the free Tahoe-LAFS-on-S3 service
> that we sent you email about. ;-)
> 
> As well as showing up a performance differential by making the
> gateway↔storage-server connection much slower, this would also be a realistic
> benchmark, since having your storage off-site is a useful thing to do
> in real usage.
> 

Yes, I would like to run another benchmark using remote storage.
I don't know when I'll have the time, though.
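Just to check that I understand the flow you describe: the gateway
always receives the whole file from the client, but once it has the
content-derived storage index it can ask the servers whether they
already hold it and skip the upload. A toy sketch with invented names
(not the real Tahoe-LAFS API):

    import hashlib

    class FakeServer:                 # stand-in for a remote storage server
        def __init__(self):
            self.blobs = {}
        def has(self, index):
            return index in self.blobs
        def put(self, index, data):
            self.blobs[index] = data

    def backup_file(plaintext, servers):
        """Return True if the data actually had to be transferred."""
        ciphertext = plaintext[::-1]                    # toy stand-in for encryption
        index = hashlib.sha256(ciphertext).hexdigest()  # content-derived storage index
        if any(s.has(index) for s in servers):          # cheap existence query
            return False                                # duplicate: no upload needed
        for s in servers:
            s.put(index, ciphertext)
        return True

    servers = [FakeServer()]
    print(backup_file(b"big file", servers))   # True: first copy is uploaded
    print(backup_file(b"big file", servers))   # False: duplicate, nothing sent

With the gateway and the storage server on the same machine, that
skipped transfer costs almost nothing, which would explain the numbers
I got.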

> >> The biggest performance *lose* will be when you've made a small
> >> change to a large file, such as a minor modification to a 2 GB
> >> virtual machine image, and tahoe re-uploads the entire 2 GB file
> >> instead of computing a delta. :-)
> >
> > That I can confirm ;-)
> 
> Thanks. :-) I'd really like to experiment with using backshift's
> variable-length, content-determined chunking and LZMA compression and
> Tahoe-LAFS's storage.
> 
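The variable-length chunking part looks very promising to me: a rolling
hash over the last few dozen bytes decides where the chunk boundaries
are, so an edit in the middle of a big file only changes the chunks near
the edit instead of shifting every fixed-size block after it. A rough
sketch (not backshift's actual algorithm):

    import random

    def chunk(data, min_size=64, mask=0x0FFF):
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + byte) & 0xFFFFFFFF   # old bytes shift out after ~32 steps
            if i - start >= min_size and (h & mask) == 0:   # boundary, ~4 KiB average
                chunks.append(data[start:i + 1])
                start = i + 1
        chunks.append(data[start:])
        return chunks

    random.seed(0)
    original = bytes(random.randrange(256) for _ in range(200000))
    edited = original[:100000] + b"one inserted line\n" + original[100000:]
    old, new = set(chunk(original)), chunk(edited)
    print(sum(1 for c in new if c not in old), "of", len(new), "chunks need re-upload")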

I've had a bad experience with LZMA (see the Backshift mailing list).
It certainly saves more space than gzip or bz2, but at an exorbitant
price.
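For anyone who wants to see the size/time trade-off on their own data,
the Python standard library is enough for a rough comparison (the sample
path below is just an example):

    import bz2, lzma, time, zlib

    with open("/usr/share/dict/words", "rb") as f:   # any reasonably large file
        data = f.read()

    for name, compress in (("zlib", zlib.compress),
                           ("bz2", bz2.compress),
                           ("lzma", lzma.compress)):
        t0 = time.perf_counter()
        out = compress(data)
        print(name, len(out), "bytes in", round(time.perf_counter() - t0, 2), "s")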
Again, don't forget Obnam: I was quite impressed by it.

Regards,

