#598 closed enhancement (fixed)
add 'tahoe backup' command: fast versioned readonly backups
Reported by: | warner | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 1.3.0 |
Component: | code-frontend-cli | Version: | 1.2.0 |
Keywords: | Cc: | tahoe-dev@… | |
Launchpad Bug: |
Description
As a complement to the only-the-latest-version 'tahoe sync' command described in #597, I'd like to have a full-featured multiple-version 'tahoe backup' command too. This would behave like the existing Windows-only allmydata.com backup tool:
tahoe backup LOCALDIR ALIAS:BACKUPBASEDIR
LOCALDIR refers to a directory on the local disk. ALIAS:BACKUPBASEDIR will refer to a writeable Tahoe directory; it will be created if it does not already exist.
Each time this is run, ALIAS:BACKUPBASEDIR/$TIMESTAMP will be created, as a read-only directory that contains an exact mirror of the local disk's LOCALDIR subtree. In addition, ALIAS:BACKUPBASEDIR/Latest will be a read-only reference to the same directory. Over time, BACKUPBASEDIR/ will be filled with a series of timestamped directories, containing historical backups.
Whenever possible, $TIMESTAMP[n] will contain references to files and directories created under $TIMESTAMP[n-1]; i.e. backups will share unchanged objects with earlier backups. Each backup, once finished, will not be changed again. If/when Tahoe acquires immutable dirnodes, 'tahoe backup' will take advantage of them. Meanwhile, it will use read-only dirnodes, by throwing out the write-cap for the $TIMESTAMP directory when the backup is done.
This will use the same backupdb as described in #597 to reduce the amount of work that must be done for unchanged files.
A basic backup system could be constructed by simply running 'tahoe backup' in a cron job. It might be a good idea to have a lockfile of some sort to make this usage safer (i.e. to prevent an overrunning backup from overlapping with the next scheduled run).
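A minimal sketch of the lockfile idea, assuming an exclusive-create file is an acceptable lock. The helper name `backup_lock` and the lockfile path are hypothetical and not part of any Tahoe API:

```python
import os
from contextlib import contextmanager

@contextmanager
def backup_lock(path):
    """Hold an exclusive lockfile at `path` (hypothetical helper).

    Raises RuntimeError if the lockfile already exists, which is taken to
    mean that a previous backup run has not finished yet.
    """
    try:
        # O_EXCL makes the create fail atomically if the file is already there.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError("another backup appears to be running: %s" % path)
    try:
        # Record our pid so a stale lock can be inspected by hand.
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        yield
    finally:
        os.unlink(path)
```

A cron wrapper could run the backup inside `with backup_lock(...)` and exit quietly on `RuntimeError`. A crashed run would leave a stale lockfile behind, so a real tool would probably also want to check the recorded pid.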
Change History (9)
comment:1 Changed at 2009-01-31T00:10:03Z by warner
comment:2 Changed at 2009-01-31T00:22:51Z by warner
The basic flowchart I've got in mind:
- start with a writecap to the Backups/ directory
- locate the most recent version, get its readcap (or None)
- newdircap = process(olddircap, localdir)
  - process(olddircap, localdir):
    - fetch contents of olddircap, if any
    - create empty mapping for new directory contents
    - list localdir
    - for each directory:
      - newdircontents[name] = process(olddircontents[name], localdir+name)
    - for each file:
      - newdircontents[name] = upload-with-backupdb(localdir+name)
    - now compare newdircontents with olddircontents, including metadata
    - if identical, return olddircap
    - if not, mkdir, set_children(newdircontents), return readonly(new-dircap)
- add top-level newdircap to Backups/$CURRENT_TIMESTAMP
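The recursive step in the flowchart above can be sketched roughly as follows. `FakeGrid` is a hypothetical in-memory stand-in for the grid (it is not a Tahoe API), caps are plain strings, and the metadata comparison is left out for brevity:

```python
import hashlib
import os

class FakeGrid:
    """In-memory stand-in for a grid (hypothetical, for illustration only)."""
    def __init__(self):
        self.dirs = {}     # dircap -> {name: childcap}
        self.mkdirs = 0

    def read_dir(self, dircap):
        return dict(self.dirs.get(dircap, {}))

    def upload(self, path):
        # Stand-in for upload-with-backupdb: the cap is derived from content.
        with open(path, "rb") as f:
            return "file:" + hashlib.sha256(f.read()).hexdigest()

    def mkdir_readonly(self, children):
        self.mkdirs += 1
        cap = "dir-ro:%d" % self.mkdirs
        self.dirs[cap] = dict(children)
        return cap

def process(grid, olddircap, localdir):
    """Return a read-only dircap mirroring localdir, reusing olddircap if unchanged."""
    old = grid.read_dir(olddircap) if olddircap else {}
    new = {}
    for name in sorted(os.listdir(localdir)):
        path = os.path.join(localdir, name)
        if os.path.isdir(path):
            new[name] = process(grid, old.get(name), path)
        elif os.path.isfile(path):
            new[name] = grid.upload(path)
    if olddircap is not None and new == old:
        return olddircap               # identical: share the earlier directory
    return grid.mkdir_readonly(new)    # changed: create a fresh read-only dir
```

Running `process()` twice over an unchanged tree returns the same top-level cap the second time, and a change in one subdirectory leaves sibling subdirectories shared with the earlier backup.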
If upload-with-backupdb works as described in http://allmydata.org/pipermail/tahoe-dev/2008-May/000620.html , then the workload of a null backup will be the recursive read of the entire most-recent-version subtree. To avoid even that:
- maintain a backupdb table that maps from HASH(newdircontents) to dircap
- instead of comparing newdircontents with olddircontents, hash newdircontents and look for the result in the table
- if mkdir() must be used, add an entry to the table afterwards
- allow entries to be removed from the table at some point (perhaps any entry which is not used in a 'tahoe backup' run should be discarded at the end of that run)
With that in place, a null backup should involve nothing but local stat() calls.
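A rough sketch of that directory-hash table, assuming directory contents can be represented as a name-to-cap mapping. `DirCache`, `hash_dir_contents`, and the `mkdir` callback are hypothetical names, and a plain dict stands in for the backupdb table:

```python
import hashlib
import json

def hash_dir_contents(children):
    """Stable digest of a {name: cap} mapping (key order does not matter)."""
    canonical = json.dumps(children, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class DirCache:
    """Maps HASH(newdircontents) -> dircap, pruning entries unused in a run."""
    def __init__(self):
        self.table = {}     # dirhash -> dircap
        self.used = set()   # hashes touched during the current run

    def lookup_or_create(self, children, mkdir):
        h = hash_dir_contents(children)
        cap = self.table.get(h)
        if cap is None:
            # Not seen before: a directory really must be created.
            cap = mkdir(children)
            self.table[h] = cap
        self.used.add(h)
        return cap

    def end_of_run(self):
        # Discard any entry that this run did not use.
        self.table = {h: c for h, c in self.table.items() if h in self.used}
        self.used = set()
```

With this in place, a directory whose contents hash is already in the table needs no grid access at all; only a miss falls through to `mkdir`.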
comment:3 Changed at 2009-02-03T01:52:56Z by warner
Some data points: home directory sizes on some developers' machines:
- warner@fluxx: 98k dirs, 776k files, 50GB of data
- warner@luther: 85k dirs, 1065k files, 81GB
- zandr: 500k dirs, 1.1M files, 1.2TB
- zandr (notebook): 61k dirs, 400k files, 184GB
- zooko: 153k dirs, 1306k files, ??
So, to use "tahoe backup" on these systems, the backupdb must be able to efficiently manage a million entries. I think this is too big for a simple pickle to handle well.
I'll do some experimentation, but my current plan is to use a sqlite database, one for the file-oriented backupdb, and a second for the directory-contents db.
Going forward, of course, it would be nice to allow the use of mysql or postgres. But sqlite is in the python2.5 stdlib, has a synchronous interface (which makes the implementation of tahoe_backup.py a bit easier), and doesn't require any external setup, whereas mysql/postgres would require a separate process to be configured and a DB to be set up, along with user-account setup. Another question is whether to use sqlite directly or to use the Axiom layer (which we're using as an experiment in the disk-watcher). I'm inclined to use sqlite directly, again to avoid adding lots of new dependencies.
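To make the sqlite option concrete, here is a small sketch of a file-oriented table using the stdlib `sqlite3` module. The table name, column names, and helper functions are assumptions made for illustration; the ticket does not specify a schema:

```python
import sqlite3

# Hypothetical schema: one row per local file that has been uploaded.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path    TEXT PRIMARY KEY,
    size    INTEGER NOT NULL,
    mtime   REAL NOT NULL,
    filecap TEXT NOT NULL
);
"""

def open_backupdb(dbfile):
    """Open (and if necessary create) the database file."""
    db = sqlite3.connect(dbfile)
    db.executescript(SCHEMA)
    return db

def record_file(db, path, size, mtime, filecap):
    # 'with db' commits on success and rolls back on error.
    with db:
        db.execute(
            "INSERT OR REPLACE INTO files (path, size, mtime, filecap) "
            "VALUES (?, ?, ?, ?)",
            (path, size, mtime, filecap),
        )

def lookup_file(db, path):
    """Return (size, mtime, filecap) for path, or None if unknown."""
    return db.execute(
        "SELECT size, mtime, filecap FROM files WHERE path = ?", (path,)
    ).fetchone()
```

Because `path` is the primary key, lookups go through an index, which is the property that matters at the million-entry scale mentioned above.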
comment:4 Changed at 2009-02-03T02:06:50Z by warner
zooko's system with 153k dirs and 1306k files has about 69GB of data
comment:5 Changed at 2009-02-03T03:10:30Z by warner
cfce8b5eab431772 has the first cut: no backupdb, but the other functionality is there.
comment:6 Changed at 2009-02-03T03:43:58Z by zooko
My system with 153k dirs and 1306k files has 35,350 files which are duplicates -- that set of 35,350 files has only 17,675 unique md5 hashes.
comment:7 Changed at 2009-02-03T04:00:49Z by zooko
- Cc tahoe-dev@… added
Note that I'm adding Cc: tahoe-dev@… to this ticket, so until that Cc: is removed any comments posted here will be mailed to the list.
comment:8 Changed at 2009-02-06T04:19:15Z by warner
- Milestone changed from undecided to 1.3.0
- Resolution set to fixed
- Status changed from new to closed
Done. 177ffa0870390c6e was the last patch: the "tahoe backup" command now uses the backupdb and avoids uploading any file that looks like it was unchanged. I'll create a separate ticket (#606) for adding a directory cache to the backupdb; that can be a future enhancement that will improve performance even further.
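As an illustration of what "looks like it was unchanged" could mean, the sketch below assumes the check compares the file's current size and mtime against the values recorded at the last upload. That criterion is an assumption, not something this ticket states; `looks_unchanged`, `backup_file`, and the dict used as a backupdb are likewise hypothetical:

```python
import os

def looks_unchanged(path, record):
    """True if the file's size and mtime match the recorded values."""
    if record is None:
        return False
    st = os.stat(path)
    return st.st_size == record["size"] and st.st_mtime == record["mtime"]

def backup_file(path, backupdb, upload):
    """Return a filecap for path, calling upload(path) only when needed.

    backupdb is a plain dict here (path -> record); upload is a callback
    that performs the real upload and returns a filecap.
    """
    record = backupdb.get(path)
    if looks_unchanged(path, record):
        return record["filecap"]       # skip the upload entirely
    st = os.stat(path)                 # stat before uploading
    filecap = upload(path)
    backupdb[path] = {
        "size": st.st_size,
        "mtime": st.st_mtime,
        "filecap": filecap,
    }
    return filecap
```

A second call for an untouched file then costs one local `stat()` and no network traffic.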
comment:9 Changed at 2009-02-20T23:46:42Z by azazel
I've done a few small benchmarks of uploading one of my darcs repos to the production grid. I uploaded it first using "tahoe cp -r -v" and then I uploaded a tar (not zipped) of the same data. The repo is composed of 67 dirs and 4098 files; the tar size is 27 MB. The "cp -r -v" took roughly 3.5 hours, and the "cp repo.tar" took 760 seconds. The client is configured to use a helper.
Here are the stats for one of the files involved in the first upload:
- Timings:
  - File Size: 3424 bytes
  - Total: 3.88s (882Bps)
    - Storage Index: 194us (17.64MBps)
    - [Contacting Helper]: 723ms
      - [Helper Already-In-Grid Check]: 228ms
    - [Upload Ciphertext To Helper]: 352ms (9.7kBps)
    - Peer Selection: 879ms
    - Encode And Push: 1.05s (69.9kBps)
      - Cumulative Encoding: 705us (4.86MBps)
      - Cumulative Pushing: 48ms (71.0kBps)
      - Send Hashes And Close: 881ms
    - [Helper Total]: 3.37s
Next, the stats for the tar upload:
- Timings:
  - File Size: 27176960 bytes
  - Total: 760.13s (35.8kBps)
    - Storage Index: 261us (104099.02MBps)
    - [Contacting Helper]: 702ms
      - [Helper Already-In-Grid Check]: 454ms
    - [Upload Ciphertext To Helper]: 723.25s (37.6kBps)
    - Peer Selection: 461ms
    - Encode And Push: 35.01s (807.5kBps)
      - Cumulative Encoding: 1.50s (18.13MBps)
      - Cumulative Pushing: 32.16s (845.1kBps)
      - Send Hashes And Close: 996ms
    - [Helper Total]: 759.47s
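As a rough cross-check of the figures quoted in this comment, the short calculation below derives the average time per file for the recursive copy and compares it with the single-tar upload. It uses only numbers stated above (4098 files, roughly 3.5 hours, 760 seconds, 27176960 bytes), ignores the time spent creating directories, and the variable names are mine:

```python
# Figures quoted in this comment.
files = 4098
recursive_seconds = 3.5 * 3600   # "roughly 3.5 hours"
tar_seconds = 760
tar_bytes = 27176960

# Average wall-clock time per file for "tahoe cp -r -v".
per_file_seconds = recursive_seconds / files

# How much slower the per-file copy was than the single tar upload.
slowdown = recursive_seconds / tar_seconds

# Effective throughput of each approach, in bytes per second.
recursive_rate = tar_bytes / recursive_seconds
tar_rate = tar_bytes / tar_seconds

print("per file: %.2f s" % per_file_seconds)
print("slowdown: %.1fx" % slowdown)
print("rates: %.0f B/s vs %.0f B/s" % (recursive_rate, tar_rate))
```

The average works out to a little over 3 seconds per file, and the recursive copy is between 16 and 17 times slower than the tar upload, so fixed per-upload costs dominate for a tree of small files.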
This small test demonstrated an overhead of 1.5 ~ 2 seconds for every upload operation. Lastly, here are the results of a "du --si $repo; find $repo -type f |wc -l; find $repo -type d |wc -l" command:
33k wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/EGG-INFO
136k wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/command/generate
144k wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/command
25k wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/examples/db
29k wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/examples
115k wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/loadable
29k wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/setup_cmd
78k wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/test/test_command/test_generate
82k wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/test/test_command
213k wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/test/test_loadable
426k wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/test
922k wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture
955k wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg
353k wip/cute/lib/python2.5/site-packages/zope.schema-3.5.0a2-py2.5.egg/zope/schema/tests
627k wip/cute/lib/python2.5/site-packages/zope.schema-3.5.0a2-py2.5.egg/zope/schema
635k wip/cute/lib/python2.5/site-packages/zope.schema-3.5.0a2-py2.5.egg/zope
54k wip/cute/lib/python2.5/site-packages/zope.schema-3.5.0a2-py2.5.egg/EGG-INFO
689k wip/cute/lib/python2.5/site-packages/zope.schema-3.5.0a2-py2.5.egg
46k wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg/zope/interface/common/tests
168k wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg/zope/interface/common
267k wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg/zope/interface/tests
1,0M wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg/zope/interface
1,1M wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg/zope
87k wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg/EGG-INFO
1,1M wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg
21k wip/cute/lib/python2.5/site-packages/zope.event-3.4.0-py2.5.egg/zope/event
29k wip/cute/lib/python2.5/site-packages/zope.event-3.4.0-py2.5.egg/zope
29k wip/cute/lib/python2.5/site-packages/zope.event-3.4.0-py2.5.egg/EGG-INFO
58k wip/cute/lib/python2.5/site-packages/zope.event-3.4.0-py2.5.egg
41k wip/cute/lib/python2.5/site-packages/zope.component-3.5.1-py2.5.egg/zope/component/bbb
29k wip/cute/lib/python2.5/site-packages/zope.component-3.5.1-py2.5.egg/zope/component/testfiles
672k wip/cute/lib/python2.5/site-packages/zope.component-3.5.1-py2.5.egg/zope/component
680k wip/cute/lib/python2.5/site-packages/zope.component-3.5.1-py2.5.egg/zope
95k wip/cute/lib/python2.5/site-packages/zope.component-3.5.1-py2.5.egg/EGG-INFO
775k wip/cute/lib/python2.5/site-packages/zope.component-3.5.1-py2.5.egg
4,0M wip/cute/lib/python2.5/site-packages
13k wip/cute/lib/python2.5/distutils
4,0M wip/cute/lib/python2.5
4,0M wip/cute/lib
0 wip/cute/include
1,2M wip/cute/bin
5,3M wip/cute/cute/_darcs/pristine.hashed
13M wip/cute/cute/_darcs/patches
21k wip/cute/cute/_darcs/prefs
435k wip/cute/cute/_darcs/inventories
19M wip/cute/cute/_darcs
463k wip/cute/cute/docs/tutorial/images
517k wip/cute/cute/docs/tutorial
0 wip/cute/cute/docs/experiments
517k wip/cute/cute/docs
29k wip/cute/cute/lib/cute/app
25k wip/cute/cute/lib/cute/ui/widgets
91k wip/cute/cute/lib/cute/ui/resources
13k wip/cute/cute/lib/cute/ui/designer_plugins
13k wip/cute/cute/lib/cute/ui/ui
8,2k wip/cute/cute/lib/cute/ui/test
304k wip/cute/cute/lib/cute/ui
8,2k wip/cute/cute/lib/cute/db/search
17k wip/cute/cute/lib/cute/db/source
91k wip/cute/cute/lib/cute/db
3,6M wip/cute/cute/lib/cute/tests/sample_data/birt/images/logos
263k wip/cute/cute/lib/cute/tests/sample_data/birt/images/productlines
4,1M wip/cute/cute/lib/cute/tests/sample_data/birt/images
4,3M wip/cute/cute/lib/cute/tests/sample_data/birt
4,3M wip/cute/cute/lib/cute/tests/sample_data
4,3M wip/cute/cute/lib/cute/tests
4,8M wip/cute/cute/lib/cute
4,8M wip/cute/cute/lib
24M wip/cute/cute
33M wip/cute
3647
70
Looks like this is more important than #597.