id summary reporter owner description type status priority milestone component version resolution keywords cc launchpad_bug
379 very large memory usage when adding to large directories warner somebody "We've seen the webapi servers that are participating in the allmydata migration effort show memory usage spikes of up to 3.1GB over the past two days. The CPU usage goes to 100% for about 5 minutes while this happens. This is occurring about once every 15 minutes, causing the migration process to run significantly slower.

We've isolated the problem to the following items:

 * our !RemoteInterface for mutable share writes (i.e. slot_testv_and_readv_and_writev) declares a maximum share data size of 1 MiB (1048576 bytes), but the maximum size of a mutable file (3.5 MB) leads to shares that can exceed this. This occurs for directories of about 9600 entries.
 * adding new children to such a directory causes an outbound foolscap Violation, because the share-size constraint is being violated
 * the Violation is a Failure object, and Twisted (specifically maybeDeferred) causes the cleanFailure method to be run, which turns the entire stack trace (including local variables for each frame) into a bunch of strings
 * the 1MB-ish shares are present in the argument list for about a dozen stack frames (in the locals). Every one of these strings gets repr'ed, and since they're binary data, each repr'ing gets a 4x expansion.
 * the memory usage is coming from the dozens of copies of this expanded share data inside the Failure's .stack and .frames attributes.
 * in addition, the way we're using {{{DeferredList}}} fails to catch errors in callRemote: servers fail to accept shares, but we don't notice.

What we need to do to address this:

 * remove the size constraint on {{{ShareData}}}.
   This will stop these particular Violations from happening.
 * fix Twisted ticket [http://twistedmatrix.com/trac/ticket/2466 Twisted#2466], somehow, then require the resulting release of Twisted.
 * add fireOnOneErrback=True to the uses of {{{DeferredList}}} in mutable.py, to properly catch exceptions

Any exception in share transmission is likely to consume this sort of memory. Workarounds we could conceivably implement before the Twisted problem gets fixed:

 * avoid holding large strings in local variables: pass them as instance attributes instead, and be careful to delete them from locals as soon as possible
 * hold large strings in some sort of object (with a repr() that doesn't show the whole thing) instead of a real string, and teach Foolscap (via {{{ISliceable}}}) to serialize them as strings.
 * have Foolscap trim Failures in callRemote somehow
 * try to break up the stacks by using fireEventually in more places
 * rewrite Foolscap's outbound constraint checking to not do everything on the same stack

Our current plan is to fix the constraint and then hope that we don't trigger other failures while we contribute to the Twisted ticket and wait for a new release. Later refactorings of share management will probably put more data in instance attributes rather than passing it through method arguments. If necessary, we can ship tahoe with a patched version of Twisted." defect closed critical 1.1.0 code-dirnodes 1.0.0 fixed memory
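
The repr() expansion described in the ticket, and the "object with a short repr()" workaround, can be illustrated with a small standalone sketch. It makes several assumptions: it uses Python 3 `bytes` rather than the Python 2 strings Tahoe used at the time, the class name `SummarizedBytes` and the helper `repr_expansion` are hypothetical and do not exist in Tahoe or Foolscap, and the payload is built only from bytes that repr() renders as `\xNN` escapes, which is the worst case. Real share data would expand somewhat less, because printable bytes are shown as a single character.

```python
# Illustrative sketch only. SummarizedBytes and repr_expansion are
# hypothetical names, not part of Tahoe, Foolscap, or Twisted.


def repr_expansion(data: bytes) -> float:
    """Return len(repr(data)) / len(data), i.e. how much repr() inflates data."""
    return len(repr(data)) / len(data)


class SummarizedBytes:
    """Hold a large byte string while keeping repr() short.

    Anything that stringifies local variables (as a cleaned Failure does)
    would then copy a short summary instead of the escaped payload.
    """

    def __init__(self, data: bytes) -> None:
        self.data = data

    def __len__(self) -> int:
        return len(self.data)

    def __repr__(self) -> str:
        return "<SummarizedBytes: %d bytes>" % len(self.data)


# 1 MiB of bytes in the range 0x80..0xff; repr() renders each one as a
# four-character \xNN escape, so this is the worst case for expansion.
share = bytes(range(0x80, 0x100)) * 8192
wrapped = SummarizedBytes(share)

print("share size:        ", len(share))
print("repr() size:       ", len(repr(share)))
print("expansion factor:  ", repr_expansion(share))
print("wrapped repr():    ", repr(wrapped))
```

A dozen stack frames each holding a reference to `share` would each contribute a separate multi-megabyte string once their locals are repr'ed, whereas frames holding `wrapped` would contribute only the short summary. The sketch does not cover the Foolscap side of the workaround (serializing such an object via {{{ISliceable}}}).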