#119 closed task (fixed)

lease expiration / deletion / garbage-collection

Reported by: warner Owned by:
Priority: major Milestone: 1.4.1
Component: code-storage Version: 0.7.0
Keywords: Cc:
Launchpad Bug:

Description

I think the last Big Thing we need to develop (as opposed to implement or fix) is a structure that both maintains the long-term health of files and ensures their eventual deletion. I think these need to be developed together, since they are closely related.

Leases need to expire after a while (we're thinking of one month as a good timeout). Files that are supposed to stick around longer than this need to be kept alive either by the original uploader or by someone to whom they've delegated this task. If the original uploader expects to be around at least once a month, they can do it themselves, but for a backup application we can't impose this requirement. We refer to this task as "refreshing", and the provider of this service is either doing it out of the kindness of their heart (in the friend-net use case) or as part of a paid service (in the commercial-offering use case).
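By way of illustration only, a refreshing service could boil down to a periodic loop like the one below; renew_lease, storage_index, and renew_secret are hypothetical stand-ins, not the real Tahoe interfaces:

  import time

  LEASE_PERIOD = 31 * 24 * 3600     # assumed one-month lease
  REFRESH_INTERVAL = 7 * 24 * 3600  # renew weekly, comfortably inside LEASE_PERIOD

  def refresh_all(delegated_files, servers, renew_lease):
      # Renew the lease on every share of every file we've been asked to keep
      # alive. renew_lease(server, storage_index, renew_secret) is hypothetical.
      for f in delegated_files:
          for server in servers:
              try:
                  renew_lease(server, f.storage_index, f.renew_secret)
              except Exception:
                  # an unreachable server or missing share is not fatal here;
                  # lost shares are the checker/repairer's problem
                  pass

  def run_refresher(delegated_files, servers, renew_lease):
      while True:
          refresh_all(delegated_files, servers, renew_lease)
          time.sleep(REFRESH_INTERVAL)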

The refreshing process will also perform "file checking", which is simply counting the number of shares that are available for any given file. This gives a rough measure of the "health" of the file. The process may also perform "file verification" from time to time, which is downloading the crypttext and checking its hash against the value in the URI extension block.
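As a rough sketch of the "counting shares" idea (not the real checker interface), assuming k-of-N erasure coding and a hypothetical get_share_numbers() query on each server:

  def check_file(storage_index, servers, k=3, N=10):
      # Count the distinct share numbers present on the grid for one file.
      # The file is recoverable as long as at least k of the N original
      # shares survive; get_share_numbers() is a hypothetical query.
      found = set()
      for server in servers:
          found.update(server.get_share_numbers(storage_index))
      healthy = len(found) >= k
      return healthy, len(found), N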

If either checking/verification process discovers a problem, the "file repairer" may be triggered, which uses the remaining shares to reconstruct the correct crypttext, then re-encodes and re-uploads any shares which have been lost.
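A repair pass could then regenerate only the share numbers that have gone missing. Everything below (download_shares, decode, encode, upload_share, get_share_numbers) is a hypothetical stand-in for the real download/erasure-coding/upload machinery:

  def maybe_repair(storage_index, servers, k, N,
                   download_shares, decode, encode, upload_share):
      # If some shares are gone but at least k remain, rebuild the crypttext
      # from any k surviving shares, re-encode it, and upload only the
      # missing share numbers.
      found = set()
      for server in servers:
          found.update(server.get_share_numbers(storage_index))
      missing = set(range(N)) - found
      if not missing or len(found) < k:
          return missing        # nothing to do, or the file is beyond repair
      shares = download_shares(storage_index, sorted(found)[:k])
      crypttext = decode(shares, k, N)
      new_shares = encode(crypttext, k, N)   # regenerates all N shares
      for shnum in missing:
          upload_share(storage_index, shnum, new_shares[shnum])
      return set()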

These processes all serve to improve the health of the file, at varying bandwidth/CPU costs: refreshing/checking is cheap, repair/re-upload is expensive. The intent is to use the refreshing service to keep the file as healthy as possible at low cost, and to use the checker results to trigger the more costly repair operations as rarely as possible. Refreshing must take place at least once a month to keep the leases alive. The required filecheck frequency will depend upon how quickly storage servers drop out of the grid: we expect that file health will follow an exponential decay curve, so we must check frequently enough to reduce the chance that health decays beyond the point of repair. The exact parameters will of course be tunable, to pick a tradeoff between bandwidth consumed and the chance that a file decays too quickly to be saved.
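To make the "check often enough" argument concrete: if each share independently survives one check interval with probability p, the chance that a file remains repairable (at least k of its n current shares survive) is a binomial tail. A small illustration, with made-up numbers:

  from math import comb

  def prob_still_repairable(n, k, p):
      # P(at least k of n independent shares survive one check interval)
      return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

  # With 10 shares, 3 needed, and each share surviving a month with
  # probability 0.95, the per-interval chance of losing the file is tiny
  # (on the order of 1e-9).  If servers churn faster (say p = 0.80), that
  # risk grows to roughly 1e-4 per interval, so we would want to check
  # (and repair back up to 10 shares) correspondingly more often.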

Files that are deleted from a vdrive need to have their shares dereferenced in a timely fashion (I'm thinking by the end of the day for this). If the reference count drops to zero, the share should be deleted immediately (for a storage server on a home user's machine who wants their disk for other purposes), or marked for deletion as soon as the storage is needed for something else (for a dedicated commercial server with nothing better to do with that disk space; there's a chance that someone will re-upload the file that was just deleted, and if the share is still around then we can avoid repeating the upload). Deleted files should also be removed from the filechecker and repair mechanisms.
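A toy sketch of that dereference-and-delete policy, with a hypothetical reference-counted share store; the keep_unreferenced_shares flag distinguishes the home-user case (delete immediately) from the dedicated-server case (mark for deletion, reclaim only under space pressure):

  class ShareStore:
      # Hypothetical reference-counted share store (not the real server code).
      def __init__(self, keep_unreferenced_shares=False):
          self.refcounts = {}     # storage_index -> reference count
          self.deletable = set()  # marked for deletion, reclaimed under pressure
          self.keep_unreferenced_shares = keep_unreferenced_shares

      def add_ref(self, storage_index):
          # a re-upload of a recently deleted file can resurrect a marked share
          self.deletable.discard(storage_index)
          self.refcounts[storage_index] = self.refcounts.get(storage_index, 0) + 1

      def deref(self, storage_index):
          self.refcounts[storage_index] -= 1
          if self.refcounts[storage_index] > 0:
              return
          del self.refcounts[storage_index]
          if self.keep_unreferenced_shares:
              self.deletable.add(storage_index)  # dedicated server: keep for now
          else:
              self._delete_now(storage_index)    # home user: free the disk

      def reclaim_space(self):
          # called only when the disk is actually needed for something else
          for si in list(self.deletable):
              self._delete_now(si)
              self.deletable.discard(si)

      def _delete_now(self, storage_index):
          pass  # remove the share files from disk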

Note that files should be deleted promptly, rather than allowing their leases to expire on their own, to reduce the storage overhead (storage consumed beyond that required for the desired files). The lease-expiration mechanism is a necessary fallback to keep storage usage from growing without bound, but without prompt deletion, high churn rates could cause actual storage consumption to grow larger than desired.

Finally, many of our use cases will want to enforce a utilization quota on each user, limiting the amount of storage space they are allowed to consume. The file-repair service may be a good place to enforce this (with a rule saying that you can upload as much as you want, but the repair service won't help you exceed your quota). Eventually we may want each client to have membership credentials which would allow storage servers to measure how much space each client is consuming: with this, a daily (or slower) process could calculate how much global space is consumed by each client, and flag or revoke membership for clients which use more space than they've contracted for.
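The accounting pass hinted at above could be as simple as summing lease sizes per account and comparing against a contracted quota; all the names below are hypothetical:

  from collections import defaultdict

  def find_over_quota(leases, quotas):
      # leases: iterable of (account_id, share_size_in_bytes)
      # quotas: account_id -> contracted bytes
      # Returns the accounts consuming more than they contracted for; a daily
      # (or slower) job could flag these, and the repair service could decline
      # to regenerate shares for them.
      usage = defaultdict(int)
      for account_id, size in leases:
          usage[account_id] += size
      return {a: used for a, used in usage.items() if used > quotas.get(a, 0)}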

Change History (11)

comment:1 Changed at 2007-09-25T04:19:56Z by zooko

  • Milestone changed from 1.0 to 0.7.0
  • Version changed from 0.5.0 to 0.6.0

comment:2 Changed at 2007-11-01T20:14:26Z by zooko

  • Milestone changed from 0.7.0 to 0.7.1
  • Version changed from 0.6.0 to 0.6.1

We're focussing on an imminent v0.7.0 (see the roadmap) which hopefully has #197 -- Small Distributed Mutable Files and also a fix for #199 -- bad SHA-256. So I'm bumping less urgent tickets to v0.7.1.

comment:3 Changed at 2007-11-13T18:23:23Z by zooko

  • Milestone changed from 0.7.1 to 1.0
  • Version changed from 0.6.1 to 0.7.0

This is an important, required feature, but it is a big one to implement, and I don't think we are going to get it done in the next six weeks, so I'm putting it in Milestone 1.0.

comment:4 Changed at 2008-01-05T03:52:19Z by warner

  • Milestone changed from 1.0 to 0.8.0

comment:5 Changed at 2008-01-09T01:09:21Z by warner

  • Milestone changed from 0.8.0 (Allmydata 3.0 Beta) to 0.10.0

we've decided to push this out past 0.9.0

comment:6 Changed at 2008-05-09T00:09:49Z by warner

  • Milestone changed from 1.1.0 to undecided

this isn't a 1.1.0 thing

comment:7 Changed at 2008-06-03T05:26:22Z by warner

Here are some random notes that used to be in roadmap.txt:

 multiple categories of leases:
  1: committed leases -- we will not delete these in any case, but will instead
     tell an uploader that we are full
   1a: active leases
   1b: in-progress leases (partially filled, not closed, pb connection is
       currently open)
  2: uncommitted leases -- we will delete these in order to make room for new
     lease requests
   2a: interrupted leases (partially filled, not closed, pb connection is
       currently not open, but they might come back)
   2b: expired leases

  (I'm not sure about the precedence of these last two. Probably deleting
  expired leases before deleting interrupted leases would be okay; a rough
  sketch of these categories and that eviction order follows below.)
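Restating those categories as code, purely for illustration (the real storage server has no such enum), together with the eviction precedence suggested in the note above:

  from enum import IntEnum

  class LeaseCategory(IntEnum):
      # higher value = more willing to delete it to make room for new leases
      ACTIVE      = 0   # 1a: committed, in active use -- never delete
      IN_PROGRESS = 1   # 1b: committed, partially filled, connection open
      INTERRUPTED = 2   # 2a: uncommitted, connection gone, might come back
      EXPIRED     = 3   # 2b: uncommitted, lease has timed out

  def eviction_order(leases):
      # Delete expired leases first, then interrupted ones; committed leases
      # are never deleted -- we tell the uploader we are full instead.
      evictable = [l for l in leases if l.category >= LeaseCategory.INTERRUPTED]
      return sorted(evictable, key=lambda l: l.category, reverse=True)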

comment:8 Changed at 2008-09-03T01:36:58Z by warner

  • Summary changed from lease expiration / deletion / filechecking / quotas to lease expiration / deletion / garbage-collection / quotas

We've basically split lease/gc into a separate task from checker/repairer, so I'm removing the checker/repairer aspects of this ticket. This ticket will focus on lease/gc work.

comment:9 Changed at 2008-09-24T13:19:26Z by zooko

  • Summary changed from lease expiration / deletion / garbage-collection / quotas to lease expiration / deletion / garbage-collection

I'm not sure, but I think we've tentatively agreed to focus on garbage collection separately from the notion of accounting or quotas, so I'm changing the name of this ticket.

comment:10 Changed at 2008-09-24T13:51:20Z by zooko

I mentioned this ticket as one of the most important-to-me improvements that we could make in the Tahoe code: http://allmydata.org/pipermail/tahoe-dev/2008-September/000809.html

comment:11 Changed at 2009-03-24T00:52:05Z by warner

  • Milestone changed from eventually to 1.3.1
  • Resolution set to fixed
  • Status changed from new to closed

I recently pushed a number of changes that roughly implement this. What we have right now (and will be in 1.3.1 or whatever-comes-after-1.3.0) is:

  • uploading a new immutable share, or creating a new mutable slot, results in a fixed-duration anonymous 31-day lease
  • the "tahoe check/deep-check --add-lease" CLI command (and some webapi equivalents) will add new fixed-duration anonymous 31-day leases to shares of existing files and directories
  • the storage server can optionally be configured to expire leases and delete shares when the last lease expires, in one of three modes (a sketch of this decision logic follows the list):
    • honor the original 31-day timer
    • use an alternative timeout (perhaps 60 days)
    • expire leases that were created/renewed before an absolute cutoff date
  • storage server has a webapi page to display expiration status, space recovered, etc
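A sketch of those three expiration modes as a single decision function; the real configuration knobs live in the storage server and are documented in docs/garbage-collection.txt, and the parameter names here are illustrative only:

  import time

  DAY = 24 * 3600

  def lease_is_expired(grant_or_renew_time, mode,
                       override_days=None, cutoff_date=None, now=None):
      # mode is one of:
      #   "original" -- honor the 31-day timer the lease was created with
      #   "age"      -- use an alternative timeout, e.g. override_days=60
      #   "cutoff"   -- expire leases granted/renewed before an absolute
      #                 cutoff_date (a Unix timestamp here)
      # When the last lease on a share expires, the share can be deleted.
      now = time.time() if now is None else now
      if mode == "original":
          return now > grant_or_renew_time + 31 * DAY
      if mode == "age":
          return now > grant_or_renew_time + override_days * DAY
      if mode == "cutoff":
          return grant_or_renew_time < cutoff_date
      raise ValueError("unknown expiration mode: %r" % (mode,))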

There are lots of details about how GC currently works in source:docs/garbage-collection.txt . There are ways it can be improved: in particular by associating leases with account identifiers, which narrows the scope of each lease and makes it easier for leaseholders to safely cancel them, and by switching to an expire-the-account mode instead of the current expire-the-file mode to reduce renewal traffic. But for moderate-sized grids, the mark-and-sweep lease/GC approach ought to be sufficient.
