#871 new defect

handle out-of-disk-space condition

Reported by: zooko
Owned by: somebody
Priority: major
Milestone: soon
Component: code
Version: 1.5.0
Keywords: reliability availability
Cc:
Launchpad Bug:

Description

How does a Tahoe-LAFS node handle running out of disk space? This happens somewhat frequently with the allmydata.com nodes: they are configured to keep about 10 GB of space free (via the reserved_space setting, to leave room for updates to mutable shares), and they are all configured to serve as web gateways. When someone uses one of them as a web gateway, the download cache sometimes fills the remaining 10 GB and causes the download to fail. The cache then doesn't get cleaned up, so from that point on the node hits out-of-disk-space problems whenever it runs, such as being unable to open the twistd.log file. I will open another ticket about the cache not getting cleaned up; this ticket is about making the Tahoe-LAFS node fail gracefully, and with a useful error message, when there is no disk space.

Change History (15)

comment:1 Changed at 2009-12-26T02:59:09Z by warner

heh, I think you mean "fail gracefully without an error message".. where would the message go? :)

More seriously, though, this is a tricky situation. A lot of operations can continue to work normally. We certainly want storage-server reads to keep working, and these should never require additional disk space. Many client operations should also work: full immutable downloads are held entirely in RAM (since we do streaming downloads and pause the process until the HTTP client accepts each segment), and small uploads are entirely in RAM. Large uploads (either mutable or immutable) cause twisted.web to use a tempfile, and random-access immutable downloads currently use a tempfile. All mutable downloads are RAM-based, as are all directory operations.

I suppose that when the log stops working due to a full disk, it would be nice if we could connect via 'flogtool tail' and find out about the node's predicament. The easiest way to emit a message that will be retrievable a long time later is to emit one at a very high severity level. This will trigger an incident, which won't be writable because the disk is full, so we need to make sure foolscap handles that reasonably.
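
For concreteness, here is a minimal sketch of what "emit one at a very high severity level" could look like, assuming foolscap's logging API (log.msg with a level argument); the helper name, message text, and threshold check are made up for illustration, and the "WEIRD triggers an Incident by default" behavior is my understanding rather than something confirmed in this ticket:

from foolscap.logging import log  # assumption: using foolscap's logging API directly

def note_space_shortage(where, remaining, reserved):
    # Hypothetical helper: if free space has dipped below the reserve,
    # emit a high-severity event. The event stays in foolscap's in-RAM
    # buffers and (by default, as I understand it) is high enough to
    # trigger an Incident, so a later 'flogtool tail' connection should
    # still see it even if the incident file itself cannot be written.
    if remaining < reserved:
        log.msg(format="low disk space in %(where)s: %(remaining)d bytes left",
                where=where, remaining=remaining, level=log.WEIRD)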

I hesitate to suggest something so complex, but perhaps we should consider a second form of reserved-space parameter, applied to the non-storage-server disk consumers inside the node. Or maybe we could track down the non-storage-server disk consumers and make them all obey the same reserved-space parameter that the storage server tries to obey. With this sort of feature, the node would fail sort-of gracefully when the reserved-space limit was exceeded, by refusing to accept large uploads or to perform large random-access downloads that would require more disk space. We'd have to decide which sorts of logging would be subject to this limit. Maybe a single Incident when the threshold was crossed (which would be logged successfully, using some of the remaining space) would at least put notice of impending space exhaustion on the disk, where operators could find it later.
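
To make the "second reserved-space parameter" idea concrete, a rough sketch follows. Everything here is hypothetical: client_reserved_space, NoSpaceError, and the guard function are illustrations of the idea, not existing Tahoe code, and os.statvfs is POSIX-only:

import os

class NoSpaceError(Exception):
    """Raised instead of letting a large upload/download start when
    the node is below its client-side reserve."""

def check_client_reserve(basedir, client_reserved_space):
    # Same idea as the storage server's reserved_space check, but
    # applied to the node's own disk consumers (tempfiles, cache).
    # os.statvfs is POSIX-only; Windows would need a different call.
    s = os.statvfs(basedir)
    remaining = s.f_bavail * s.f_frsize
    if remaining < client_reserved_space:
        raise NoSpaceError("refusing operation: %d bytes free, %d reserved"
                           % (remaining, client_reserved_space))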

comment:2 Changed at 2009-12-27T04:10:46Z by davidsarah

  • Keywords reliability availability added

If the fix causes operations other than share upload to respect the reserved_space setting, then there should still be enough space to log the failure. (Writing to the logfile can be subject to a slightly smaller reserved_space limit.)
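
One way to express the "slightly smaller reserved_space limit for the logfile" idea; the function names and the 1 MiB margin are made up for illustration:

LOG_HEADROOM = 1024 * 1024  # illustrative margin kept available just for logging

def can_write_data(remaining, reserved_space):
    # ordinary node writes must leave the full reserve intact
    return remaining > reserved_space

def can_write_log(remaining, reserved_space):
    # log writes may eat slightly into the reserve, so that the
    # failure itself can still be recorded
    return remaining > reserved_space - LOG_HEADROOM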

comment:3 Changed at 2009-12-27T04:58:03Z by warner

well, unless the full disk is the result of some other non-Tahoe process altogether, which is completely ignorant of tahoe's reserved_space concept. Gotta plan for the worst..

comment:4 Changed at 2009-12-27T21:18:03Z by zooko

So your idea is to make all of the node operations respect the reserved_space parameter except for the logging operations, and then add a (high severity) log message showing that the reserved_space limit has been reached? That sounds good. Oh, yeah but as you say, what should the node do when there isn't any disk space? What would be complicated about triggering a very high-severity incident when an out-of-disk-space condition is detected? That sounds straightforward to me, and as far as I understand foolscap, an investigator who later connected with a flogtool tail would then see that high-severity incident report, right?

comment:5 Changed at 2009-12-27T21:55:03Z by warner

Yes to all of that. I hadn't been thinking of two separate messages, but maybe that makes sense: one when reserved_space is exceeded the first time, another when, hm, well, when disk_avail==0 (or disk_avail<REALLYSMALL). But since we'd already be guarding all our writes with reserved_space, I don't know exactly where we'd be checking for the second threshold.
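
If both messages were wanted, one possible (purely illustrative) placement would be to classify the free space at the same write-guard point; REALLY_SMALL and the return labels are invented for the sketch:

REALLY_SMALL = 10 * 1024 * 1024  # illustrative "almost completely full" threshold

def classify_space(remaining, reserved_space):
    # second, more alarming message: the disk is essentially exhausted
    if remaining <= REALLY_SMALL:
        return "critical"
    # first message: we have crossed below the configured reserve
    if remaining <= reserved_space:
        return "low"
    return "ok"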

Anyways, the requirement on Foolscap is that its "Incident Reporter" (which is the piece that tries to write the .flog file into BASEDIR/logs/incidents/) must survive the out-of-disk condition without breaking logging or losing the in-RAM copy of the incident. As long as that incident is in RAM, a flogtool tail process should see it later. (I just added foolscap#144 to track this requirement).
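
The requirement could be stated roughly as: swallow the write failure, keep the in-RAM copy. A hedged sketch of the shape of such a guard (not foolscap's actual code; the function and its arguments are invented for illustration):

import errno

def write_incident_file(path, serialized_events):
    # serialized_events stays in RAM regardless of what happens here,
    # so 'flogtool tail' can still retrieve the incident later
    try:
        with open(path, "wb") as f:
            f.write(serialized_events)
    except (IOError, OSError) as e:
        if e.errno == errno.ENOSPC:
            return False  # disk full: give up on the file, but don't break logging
        raise
    return True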

The only other thing I'd want to think about is how to keep the message (or messages) from being emitted over and over. The obvious place to put this message would be in the storage server (where it tests disk-space-remaining against reserved_space) and in the cachey thing (where we're going to add code to do the same). Should there be a flag of some sort to say "only emit this message once"? But if something resolves the too-full condition and then, a month later, the disk gets full again, would we want the message to be re-emitted?

It almost seems like we'd want a switch that the operator resets when they fix the overfull condition, sort of like the "Check Engine" light on your car's dashboard that stays on until the mechanic fixes everything. (or, assuming your mechanic is a concurrency expert and has a healthy fear of race conditions, they'll turn off the light and *then* fix everything).

Maybe the rule should be that if you see this incident, you should do something to free up space, and then restart the node (to reset the flag).

comment:6 Changed at 2009-12-27T23:27:06Z by davidsarah

The flag should be reset when the free space is observed to be above a threshold (reserved_space plus constant) when we test it. I think there's no need to poll the free space -- testing it when we are about to write something should be sufficient. There's also no need to remember the flag across restarts.

comment:7 follow-up: Changed at 2009-12-27T23:49:15Z by warner

So, something like this?:

did_message = False   # have we already logged the out-of-space message?

def try_to_write():
    global did_message
    if freespace() > reserved_space + hysteresis_constant:
        did_message = False   # comfortably above the reserve again: re-arm the message
    if freespace() > reserved_space:
        do_write()
    else:
        if not did_message:
            did_message = True
            log_message()

comment:8 in reply to: ↑ 7 Changed at 2009-12-28T00:09:39Z by davidsarah

Replying to warner:

So, something like this? [...]

Yes. Nitpicks:

  • it should only make the OS call to get free space once.
  • this shouldn't be duplicated code, so it would have to return a boolean saying whether to do the write rather than actually doing it.
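
Folding both nitpicks into warner's sketch might look something like this; it is still pseudocode, with freespace, do_write, log_message, reserved_space, and hysteresis_constant as placeholders:

did_message = False

def check_space():
    global did_message
    remaining = freespace()   # single OS call per check
    if remaining > reserved_space + hysteresis_constant:
        did_message = False   # re-arm the message once we're comfortably clear
    if remaining > reserved_space:
        return True           # caller may go ahead and write
    if not did_message:
        did_message = True
        log_message()
    return False              # caller should refuse the write

def try_to_write():
    if check_space():
        do_write()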

comment:9 Changed at 2010-01-16T00:47:16Z by davidsarah

"Making all of the node operations respect the reserved_space parameter" includes #390 ('readonly_storage' and 'reserved_space' not honored for mutable-slot write requests).

comment:10 Changed at 2010-02-01T20:01:18Z by davidsarah

  • Milestone changed from undecided to 1.7.0

comment:11 Changed at 2010-02-15T19:53:10Z by davidsarah

  • Milestone changed from 1.7.0 to 1.6.1

comment:12 Changed at 2010-02-16T05:24:16Z by zooko

  • Milestone changed from 1.6.1 to 1.7.0

Bumping this from v1.6.1 because it isn't a regression and we have other tickets to do in v1.6.1.

comment:13 Changed at 2010-05-15T03:55:49Z by zooko

  • Milestone changed from 1.7.0 to eventually

This is a "reliability" issue, meaning that it is one of those things that developers can get away with ignoring most of the time because most of the time they aren't encountering the conditions which cause this issue to arise.

Therefore, it's the kind of ticket I value tracking, so that we don't forget about it and leave users to suffer the consequences. But v1.7 is over, and I'm moving this to "eventually" rather than to v1.8 because I'm not sure how this ticket's priority compares to the hundreds of other tickets I'm not looking at right now, and because I don't want the "bulldozer effect" of a big and growing pile of tickets getting pushed from one Milestone to the next. :-)

comment:14 Changed at 2010-12-30T22:34:59Z by davidsarah

  • Milestone changed from eventually to soon

comment:15 Changed at 2011-04-17T19:05:40Z by zooko

See also #1279 which is more about what happens if the disk is full when the server is trying to start.
