[tahoe-dev] BackupDB proposal
Ben Laurie
ben at links.org
Thu May 29 01:45:11 PDT 2008
Brian Warner wrote:
> Here's a design I've been hacking on for a "backup database", meant to
> avoid duplicate uploads of files that are already present in the grid.
> This would speed up "tahoe cp -r" -style backups by avoiding a lot of
> the network IO for files that haven't changed since the last time you
> did a backup.
>
> Please take a look at it and let me know what you think. Zooko and Rob,
> I'm especially interested in your feedback, since you've got a lot more
> experience with the old MV backupdb than I've got.
>
> When this feature goes in (perhaps later this week?), this document will live
> in docs/backupdb.txt .
Rather than messing around with a database, I would store hashes
alongside each file and check whether the hash has changed. Obviously
you incur the cost of rehashing the local file each time, but, well, who
cares?
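
Something like the following (untested sketch; storing the resulting filecap
in the same sidecar file, and the ".tahoe-hash" name, are just made up to show
the shape of the idea):

{{{
import hashlib, os

def backup_one(path, upload):
    # upload() stands in for whatever actually pushes the file to the grid
    # and returns a filecap.
    sidecar = path + ".tahoe-hash"
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    if os.path.exists(sidecar):
        old_digest, old_cap = open(sidecar).read().split()
        if old_digest == digest:
            return old_cap  # hash unchanged: reuse the existing filecap
    cap = upload(path)
    open(sidecar, "w").write("%s %s\n" % (digest, cap))
    return cap
}}}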
>
> cheers,
> -Brian
>
>
> = The Tahoe BackupDB =
>
> To speed up backup operations, Tahoe maintains a small database known as the
> "backupdb". This is used to avoid re-uploading files which have already been
> uploaded recently.
>
> This database lives in ~/.tahoe/private/backupdb.sqlite, and is a
> single-file SQLite database. It is used by the "tahoe backup" command, and
> by the "tahoe cp" command when the --use-backupdb option is included.
>
> The purpose of this database is specifically to manage the file-to-cap
> translation (the "upload" step). It does not address directory updates.
>
> The overall goal of optimizing backup is to reduce the work required when the
> source disk has not changed since the last backup. In the ideal case, running
> "tahoe backup" twice in a row, with no intervening changes to the disk, will
> not require any network traffic.
>
> This database is optional. If it is deleted, the worst effect is that a
> subsequent backup operation may use more effort (network bandwidth, CPU
> cycles, and disk IO) than it would have without the backupdb.
>
> == Schema ==
>
> The database contains the following tables:
>
> CREATE TABLE version
> (
>   version integer  -- contains one row, set to 0
> );
>
> CREATE TABLE last_upload
> (
>   path varchar(1024),  -- index; this is os.path.abspath(fn)
>   size integer,        -- os.stat(fn)[stat.ST_SIZE]
>   mtime number,        -- os.stat(fn)[stat.ST_MTIME]
>   fileid integer
> );
>
> CREATE TABLE caps
> (
>   fileid integer PRIMARY KEY AUTOINCREMENT,
>   filecap varchar(256),  -- URI:CHK:...
>   last_uploaded timestamp,
>   last_checked timestamp
> );
>
> CREATE TABLE keys_to_files
> (
>   readkey varchar(256) PRIMARY KEY,  -- index; AES key portion of filecap
>   fileid integer
> );
>
> Notes: if we extend the backupdb to assist with directory maintenance (see
> below), we may need paths in multiple places, so it would make sense to
> create a table for them, and change the last_upload table to refer to a
> pathid instead of an absolute path:
>
> CREATE TABLE paths
> (
>   path varchar(1024),  -- index
>   pathid integer PRIMARY KEY AUTOINCREMENT
> );
>
> == Operation ==
>
> The upload process starts with a pathname (like ~/.emacs) and wants to end up
> with a file-cap (like URI:CHK:...).
>
> The first step is to convert the path to an absolute form
> (/home/warner/emacs) and do a lookup in the last_upload table. If the path is
> not present in this table, the file must be uploaded. The upload process is:
>
> 1. record the file's size and modification time
> 2. upload the file into the grid, obtaining an immutable file read-cap
> 3. add an entry to the 'caps' table, with the read-cap, and the current time
> 4. extract the read-key from the read-cap, add an entry to 'keys_to_files'
> 5. add an entry to 'last_upload'
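>
> A minimal sketch of the bookkeeping for these five steps, assuming
> hypothetical upload() and readkey_from_cap() helpers (the real upload is
> performed by the node; only the SQL bookkeeping is shown):
>
> {{{
> import os, stat, time
>
> def record_upload(db, abspath):
>     s = os.stat(abspath)
>     size, mtime = s[stat.ST_SIZE], s[stat.ST_MTIME]             # step 1
>     filecap = upload(abspath)                  # step 2 (hypothetical helper)
>     now = time.time()
>     cur = db.execute("INSERT INTO caps (filecap, last_uploaded, last_checked)"
>                      " VALUES (?,?,?)", (filecap, now, now))    # step 3
>     fileid = cur.lastrowid
>     db.execute("INSERT INTO keys_to_files (readkey, fileid) VALUES (?,?)",
>                (readkey_from_cap(filecap), fileid))  # step 4 (hypothetical helper)
>     db.execute("INSERT INTO last_upload (path, size, mtime, fileid)"
>                " VALUES (?,?,?,?)", (abspath, size, mtime, fileid))  # step 5
>     db.commit()
>     return filecap
> }}}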
>
> If the path *is* present in 'last_upload', the easy-to-compute identifying
> information is compared: file size and modification time. If both match, the
> file is presumed to be unmodified (see below). If either differs, the row is
> removed from the last_upload table, and a hash check decides whether a new
> upload is actually necessary.
>
> If only the mtime differs, the file may have changed; if the size differs,
> the file has certainly changed. In either case, the client will compute the
> CHK read-key for the file by hashing its contents, using exactly the same
> algorithm as the node does when it uploads a file (including
> ~/.tahoe/private/convergence). It then checks the 'keys_to_files' table to
> see if this file has been uploaded before: perhaps the file was moved from
> elsewhere on the disk. If no match is found, the file must be uploaded, so
> the upload process above is followed.
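>
> For illustration only, the content-hashing step has roughly the following
> shape. This is *not* the real CHK key-derivation (that is defined by the
> node's upload code and its parameters); it only shows the idea of "hash the
> convergence secret together with the file contents to get a deterministic
> per-file key":
>
> {{{
> import hashlib
>
> def compute_readkey_sketch(abspath, convergence_secret):
>     # Illustrative stand-in, not Tahoe's actual key-derivation.
>     h = hashlib.sha256()
>     h.update(convergence_secret)
>     f = open(abspath, "rb")
>     while True:
>         chunk = f.read(2**16)
>         if not chunk:
>             break
>         h.update(chunk)
>     f.close()
>     return h.digest()[:16]  # AES-128 key material
> }}}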
>
> If the read-key *is* found in the 'keys_to_files' table, then the file has
> been uploaded before, but we should consider performing a file check / verify
> operation to make sure we can skip a new upload. The fileid is used to
> retrieve the entry from the 'caps' table, and the last_checked timestamp is
> examined. If this timestamp is too old, a filecheck operation should be
> performed, and the file repaired if the results are not satisfactory. A
> "random early check" algorithm should be used, in which a check is performed
> with a probability that increases with the age of the previous results. E.g.
> files that were last checked within a month are not checked, files that were
> checked 5 weeks ago are re-checked with 25% probability, 6 weeks with 50%,
> more than 8 weeks are always checked. This reduces the "thundering herd" of
> filechecks-on-everything that would otherwise result when a backup operation
> is run one month after the original backup. The readkey can be submitted to
> the upload operation, to remove a duplicate hashing pass through the file and
> reduce the disk IO. In a future version of the storage server protocol, this
> could also improve the "streamingness" of the upload process.
>
> If the file's size and mtime match, the file is considered to be unmodified,
> and the last_checked timestamp from the 'caps' table is examined as above
> (possibly resulting in a filecheck or repair). The --no-timestamps option
> disables this check: this removes the danger of false-positives (i.e. not
> uploading a new file, because it appeared to be the same as a previously
> uploaded one), but increases the amount of disk IO that must be performed
> (every byte of every file must be hashed to compute the readkey).
>
> This algorithm is summarized in the following pseudocode:
>
> {{{
> import os, stat, time, random
>
> # db(), upload(), hash_with_convergence(), check(), and repair() are
> # placeholders for the node's real helpers.
>
> def backup(path):
>     abspath = os.path.abspath(path)
>     current_size = os.stat(abspath)[stat.ST_SIZE]
>     current_mtime = os.stat(abspath)[stat.ST_MTIME]
>     now = time.time()
>     result = check_for_upload(abspath)
>     if isinstance(result, MUST_UPLOAD):
>         filecap = upload(abspath, key=result.readkey)
>         fileid = db("INSERT INTO caps (filecap, last_uploaded, last_checked)",
>                     (filecap, now, now))
>         db("INSERT INTO keys_to_files", (result.readkey, fileid))
>         db("INSERT INTO last_upload",
>            (abspath, current_size, current_mtime, fileid))
>     if isinstance(result, (MOVED, ALREADY_UPLOADED)):
>         # "random early check": probability rises linearly from 0% at one
>         # month of age to 100% at two months
>         age = now - result.last_checked
>         probability = (age - 1*MONTH) / (1*MONTH)
>         probability = min(max(probability, 0.0), 1.0)
>         if random.random() < probability:
>             do_filecheck(result.filecap)
>     if isinstance(result, MOVED):
>         db("INSERT INTO last_upload",
>            (abspath, current_size, current_mtime, result.fileid))
>
>
> def check_for_upload(abspath):
>     row = db("SELECT (size,mtime,fileid) FROM last_upload WHERE path == %s"
>              % abspath)
>     if not row:
>         return check_moved(abspath)
>     current_size = os.stat(abspath)[stat.ST_SIZE]
>     current_mtime = os.stat(abspath)[stat.ST_MTIME]
>     (last_size, last_mtime, last_fileid) = row
>     if file_changed(current_size, last_size, current_mtime, last_mtime):
>         db("DELETE FROM last_upload WHERE fileid=%s" % last_fileid)
>         return check_moved(abspath)
>     (filecap, last_checked) = db("SELECT (filecap, last_checked) FROM caps" +
>                                  " WHERE fileid == %s" % last_fileid)
>     return ALREADY_UPLOADED(filecap=filecap, last_checked=last_checked)
>
> def file_changed(current_size, last_size, current_mtime, last_mtime):
>     if last_size != current_size:
>         return True
>     if NO_TIMESTAMPS:
>         return True
>     if last_mtime != current_mtime:
>         return True
>     return False
>
> def check_moved(abspath):
>     readkey = hash_with_convergence(abspath)
>     fileid = db("SELECT (fileid) FROM keys_to_files WHERE readkey == %s"
>                 % readkey)
>     if not fileid:
>         return MUST_UPLOAD(readkey=readkey)
>     (filecap, last_checked) = db("SELECT (filecap, last_checked) FROM caps" +
>                                  " WHERE fileid == %s" % fileid)
>     return MOVED(fileid=fileid, filecap=filecap, last_checked=last_checked)
>
> def do_filecheck(filecap):
>     health = check(filecap)
>     if health < DESIRED:
>         repair(filecap)
>
> }}}
> _______________________________________________
> tahoe-dev mailing list
> tahoe-dev at allmydata.org
> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
>
>
--
http://www.apache-ssl.org/ben.html http://www.links.org/
"There is no limit to what a man can do or how far he can go if he
doesn't mind who gets the credit." - Robert Woodruff