= The Tahoe BackupDB =

To speed up backup operations, Tahoe maintains a small database known as the
"backupdb". This is used to avoid re-uploading files which have already been
uploaded recently.

This database lives in ~/.tahoe/private/backupdb.sqlite, and is a SQLite
single-file database. It is used by the "tahoe backup" command (unless the
--no-backupdb option is included). In the future, it will also be used by
"tahoe mirror", and by "tahoe cp" when the --use-backupdb option is included.

The purpose of this database is specifically to manage the file-to-cap
translation (the "upload" step). It does not address directory updates. A
future version will include a directory cache.

The overall goal of optimizing backup is to reduce the work required when the
source disk has not changed since the last backup. In the ideal case, running
"tahoe backup" twice in a row, with no intervening changes to the disk, will
not require any network traffic.

This database is optional. If it is deleted, the worst effect is that a
subsequent backup operation may use more effort (network bandwidth, CPU
cycles, and disk IO) than it would have with the backupdb intact.

The database uses sqlite3, which is included as part of the standard python
library with python2.5 and later. For python2.4, please install the
"pysqlite2" package (which, despite the name, actually provides sqlite3
rather than sqlite2).

== Schema ==

The database contains the following tables:

CREATE TABLE version
(
 version integer  -- contains one row, set to 1
);

CREATE TABLE local_files
(
 path  varchar(1024) PRIMARY KEY, -- index, this is os.path.abspath(fn)
 size  integer,         -- os.stat(fn)[stat.ST_SIZE]
 mtime number,          -- os.stat(fn)[stat.ST_MTIME]
 ctime number,          -- os.stat(fn)[stat.ST_CTIME]
 fileid integer
);

CREATE TABLE caps
(
 fileid integer PRIMARY KEY AUTOINCREMENT,
 filecap varchar(256) UNIQUE    -- URI:CHK:...
);

CREATE TABLE last_upload
(
 fileid INTEGER PRIMARY KEY,
 last_uploaded TIMESTAMP,
 last_checked TIMESTAMP
);

Notes: if we extend the backupdb to assist with directory maintenance (see
below), we may need paths in multiple places, so it would make sense to
create a table for them, and change the local_files table to refer to a
pathid instead of an absolute path:

CREATE TABLE paths
(
 path varchar(1024) UNIQUE,  -- index
 pathid integer PRIMARY KEY AUTOINCREMENT
);

== Operation ==

The upload process starts with a pathname (like ~/.emacs) and wants to end up
with a file-cap (like URI:CHK:...).

The first step is to convert the path to an absolute form
(/home/warner/.emacs) and do a lookup in the local_files table. If the path
is not present in this table, the file must be uploaded. The upload process
is:

 1. record the file's size, ctime, and mtime
 2. upload the file into the grid, obtaining an immutable file read-cap
 3. add an entry to the 'caps' table, with the read-cap, to get a fileid
 4. add an entry to the 'last_upload' table, with the current time
 5. add an entry to the 'local_files' table, with the fileid, the path,
    and the local file's size/ctime/mtime

If the path *is* present in 'local_files', the easy-to-compute identifying
information is compared: file size and ctime/mtime. If these differ, the file
must be uploaded. The row is removed from the local_files table, and the
upload process above is followed.

If the path is present but ctime or mtime differs, the file may have changed.
If the size differs, then the file has certainly changed. At this point, a
future version of the "backup" command might hash the file and look for a
match in an as-yet-defined table, in the hopes that the file has simply been
moved from somewhere else on the disk.
This enhancement requires changes to the Tahoe upload API before it can be
significantly more efficient than simply handing the file to Tahoe and
relying upon the normal convergence to notice the similarity.

If ctime, mtime, or size is different, the client will upload the file, as
above.

If these identifiers are the same, the client will assume that the file is
unchanged (unless the --ignore-timestamps option is provided, in which case
the client always re-uploads the file), and it may be allowed to skip the
upload. For safety, however, we require the client to periodically perform a
filecheck on these probably-already-uploaded files, and re-upload anything
that doesn't look healthy. The client looks the fileid up in the
'last_upload' table, to see how long it has been since the file was last
checked.

A "random early check" algorithm should be used, in which a check is
performed with a probability that increases with the age of the previous
results. E.g. files that were last checked within a month are not checked,
files that were checked 5 weeks ago are re-checked with 25% probability, 6
weeks with 50%, and files last checked more than 8 weeks ago are always
checked. This reduces the "thundering herd" of filechecks-on-everything that
would otherwise result when a backup operation is run one month after the
original backup.

If a filecheck reveals the file is not healthy, it is re-uploaded.

If the filecheck shows the file is healthy, or if the filecheck was skipped,
the client gets to skip the upload, and uses the previous filecap (from the
'caps' table) to add to the parent directory.

If a new file is uploaded, a new entry is put in the 'caps' and 'last_upload'
tables, and an entry is made in the 'local_files' table to reflect the
mapping from local disk pathname to uploaded filecap. If an old file is
re-uploaded, the 'last_upload' entry is updated with the new timestamps. If
an old file is checked and found healthy, the 'last_upload' entry is updated.
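
The decision described above (upload, filecheck, or skip) can be sketched in
Python as follows. This is an illustration under stated assumptions, not the
code used by "tahoe backup": the names check_probability and decide are
hypothetical, "a month" is treated as 4 weeks, the probability is assumed to
rise linearly between 4 and 8 weeks (which passes through the 25% and 50%
examples in the text), and last_checked is assumed to be stored as seconds
since the epoch. The table holding paths is assumed to be called
'local_files', as in the upload steps.

```python
import os
import random
import sqlite3
import time

WEEK = 7 * 24 * 60 * 60          # seconds
NO_CHECK_BEFORE = 4 * WEEK       # assumption: "a month" is treated as 4 weeks
ALWAYS_CHECK_AFTER = 8 * WEEK    # older results are always re-checked


def check_probability(age):
    """Probability of a filecheck when the last check is `age` seconds old.

    Assumed linear ramp from 0.0 at 4 weeks to 1.0 at 8 weeks.
    """
    if age <= NO_CHECK_BEFORE:
        return 0.0
    if age >= ALWAYS_CHECK_AFTER:
        return 1.0
    span = float(ALWAYS_CHECK_AFTER - NO_CHECK_BEFORE)
    return (age - NO_CHECK_BEFORE) / span


def decide(db, path, now=None, rng=random.random):
    """Return "upload", "check", or "skip" for one local file.

    `db` is an open sqlite3 connection to the backupdb.
    """
    now = time.time() if now is None else now
    path = os.path.abspath(path)
    st = os.stat(path)
    row = db.execute(
        "SELECT size, mtime, ctime, fileid FROM local_files WHERE path=?",
        (path,),
    ).fetchone()
    if row is None:
        return "upload"              # path has never been backed up
    if tuple(row[:3]) != (st.st_size, st.st_mtime, st.st_ctime):
        return "upload"              # size or timestamps differ
    checked = db.execute(
        "SELECT last_checked FROM last_upload WHERE fileid=?", (row[3],)
    ).fetchone()
    if checked is None:
        return "check"               # no check recorded: be conservative
    if rng() < check_probability(now - checked[0]):
        return "check"               # caller re-uploads only if unhealthy
    return "skip"                    # reuse the filecap from 'caps'
```

A caller would act on the result: "upload" follows the upload steps, "check"
runs a filecheck and re-uploads only if the file is unhealthy, and "skip"
reuses the stored filecap. The rng parameter exists only so that the random
choice can be made deterministic in tests.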

Relying upon timestamps is a compromise between efficiency and safety: a file
which is modified without changing the timestamp or size will be treated as
unmodified, and the "tahoe backup" command will not copy the new contents
into the grid. The --ignore-timestamps option can be used to disable this
optimization, forcing every byte of the file to be hashed and encoded.

== Directory Caching ==

A future version of the backupdb will also record a secure hash of the most
recent contents of each tahoe directory that was used in the last backup run.
The directories created by the "tahoe backup" command are all read-only, so
it should be difficult to violate the assumption that these directories are
unmodified since the previous pass. In the future, Tahoe will provide truly
immutable directories, making this assumption even more solid.

In the current implementation, when the backup algorithm is faced with the
decision to either create a new directory or share an old one, it must read
the contents of the old directory to compare it against the desired new
contents. This means that a "null backup" (performing a backup when nothing
has been changed) must still read every Tahoe directory from the previous
backup.

With a directory-caching backupdb, these directory reads will be bypassed,
and the null backup will use minimal network bandwidth: one directory read
and two modifies. The Archives/ directory must be read to locate the latest
backup, and must be modified to add a new snapshot, and the Latest/ directory
will be updated to point to that same snapshot.