#891 new defect

web gateway memory grows without bound under load

Reported by: zooko
Owned by: warner
Priority: critical
Milestone: soon
Component: code-frontend-web
Version: 1.5.0
Keywords: reliability scalability memory
Cc:
Launchpad Bug:

Description

I watched as two allmydata.com web gateways slowly grew to multiple GB of RAM while consuming maximum CPU. I kept watching until their behavior killed my ssh session. Fortunately I left a flogtool tail running, so we got to capture one's final minutes. It looks to me like a client is able to initiate jobs faster than the web gateway can complete them, and the client kept this up at a steady rate until the web gateway died.

Attachments (2)

dump.flog.bz2 (84.9 KB) - added by zooko at 2010-01-10T06:18:37Z.
"flogtool tail --save-as=dump.flog" of the final minutes of the web gateway's life
dump-2.flog.bz2 (31.6 KB) - added by zooko at 2010-01-10T06:26:09Z.
Another "flogtool tail --save-as=dump-2.log" run which *overlaps* with the previous one (named dump.log) but which has different contents…


Change History (8)

Changed at 2010-01-10T06:18:37Z by zooko

"flogtool tail --save-as=dump.flog" of the final minutes of the web gateway's life

Changed at 2010-01-10T06:26:09Z by zooko

Another "flogtool tail --save-as=dump-2.log" run which *overlaps* with the previous one (named dump.log) but which has different contents...

comment:1 Changed at 2010-01-10T06:28:56Z by zooko

So while I was running flogtool tail --save-as=dump.flog I started a second tail, like this: flogtool tail --save-as=dump-2.flog. Here is the result of that second tail, which confusingly doesn't seem to be a contiguous subset of the first, although maybe I'm just reading it wrong.

comment:2 Changed at 2010-02-27T09:07:13Z by davidsarah

  • Keywords memory added
  • Milestone changed from undecided to 1.7.0

comment:3 Changed at 2010-06-16T03:58:49Z by davidsarah

  • Milestone changed from 1.7.0 to soon

comment:4 Changed at 2010-06-19T18:16:05Z by warner

Incidentally, the best way to grab logs from a doomed system like this is to get the target node's "logport.furl" (from BASEDIR/private/logport.furl) and then run the flogtool tail command from another computer altogether. That way the flogtool command isn't competing with the doomed process for memory. You might have done it this way; it's not immediately obvious to me.
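
For example, something along these lines (a sketch only; GATEWAY is a placeholder hostname, and the --save-as spelling just follows the commands quoted in this ticket, so check flogtool tail --help for the exact option your foolscap version uses):

    # copy the gateway node's log port FURL to a different machine
    scp GATEWAY:BASEDIR/private/logport.furl .
    # tail the node's log port from that machine, so the log collector
    # is not competing with the dying gateway process for memory
    flogtool tail --save-as=dump.flog logport.furl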

I'll take a look at the logs as soon as I can.

comment:5 Changed at 2010-06-21T20:35:48Z by zooko

No, I ran flogtool tail on the same system. If I recall correctly the system had enough memory available; it was just that the python process was approaching its 3 GB limit (a per-process VM limit whose reason for existing I forget).

comment:6 Changed at 2012-05-23T00:14:29Z by warner

Hm, assuming we can reproduce this after two years, and assuming there's no bug causing pathological memory leaks, what would be the best sort of fix? We could impose an arbitrary limit on the number of parallel operations that the gateway is willing to perform. Or (on some OSes) have it monitor its own memory usage and refuse new operations when the footprint grows above a certain threshold. Both seem a bit unclean, but might be practical.
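
A minimal sketch of how those two guards might be combined in the gateway's Python/Twisted setting (hypothetical code, not anything that exists in Tahoe-LAFS; the names MAX_PARALLEL, MAX_RSS_BYTES, GatewayOverloaded, and run_guarded are made up, and the memory check is Linux-specific):

    # Hypothetical sketch only; not existing Tahoe-LAFS code.
    # Combines the two ideas above: a cap on parallel operations, plus a
    # crude (Linux-only) resident-memory check before accepting new work.
    import resource

    from twisted.internet import defer

    MAX_PARALLEL = 50                  # arbitrary cap on concurrent operations
    MAX_RSS_BYTES = 2 * 1024 ** 3      # refuse new work above ~2 GB resident

    _semaphore = defer.DeferredSemaphore(MAX_PARALLEL)

    class GatewayOverloaded(Exception):
        """Signal that the gateway should answer 503 instead of starting work."""

    def _current_rss_bytes():
        # Linux-specific: the second field of /proc/self/statm is the number
        # of resident pages for this process.
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * resource.getpagesize()

    def run_guarded(operation, *args, **kwargs):
        # Refuse outright if the footprint is already too large; otherwise
        # queue the operation behind at most MAX_PARALLEL others.
        if _current_rss_bytes() > MAX_RSS_BYTES:
            return defer.fail(GatewayOverloaded("resident memory above threshold"))
        return _semaphore.run(operation, *args, **kwargs)

Each incoming web operation would be routed through run_guarded, so a client submitting jobs faster than the gateway can finish them gets queued or turned away instead of growing the process without bound.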
