#1513 new defect

memory usage in MDMF publish

Reported by: warner
Owned by:
Priority: major
Milestone: eventually
Component: code-mutable
Version: 1.9.0a1
Keywords: mutable mdmf memory-leak performance docs
Cc:
Launchpad Bug:

Description (last modified by zooko)

I did a 'tahoe put --mutable --mutable-type=mdmf foo' of a 210MB file. The client process swelled to 1.15GB RSS, making my entire system pretty unresponsive. The publish eventually succeeded, and the memory usage went back to normal.

I'm guessing that either there's a design problem in which it's trying to upload all segments in parallel, or there's a failure in the Pipeline code such that it's holding all shares in memory at the same time.

Since MDMF is supposed to make it possible to work with large files, I think the memory usage should be similar to CHK files: capped at a small constant times the segsize.

It would be nice to fix this for 1.9, but since MDMF is still experimental, I'm willing to ship without it.

Change History (8)

comment:1 follow-up: Changed at 2011-08-28T22:41:19Z by warner

Hm, there's a tension between reliability and memory-footprint-performance here. When making changes, we want each share to atomically jump from version1 to version2, without it being left in any intermediate state. But that means all of the changes need to be held in memory and applied at the same time.

When we're jumping from "no such share" to version1, those changes are the entire file. The data needs to be buffered *somewhere*. If we were allowed to write one segment at a time to the server's disk, then a server failure or lost connection would leave us in an intermediate state, where the share only had a portion of version1, which would effectively be a corrupt share.

I can think of a couple of ways to improve this:

  • special-case the initial share creation: give the client an API to incrementally write blocks to the new share, and either allow the world to see the incomplete share early, or put the partial share in a separate incoming/ directory and figure out a way to only make it visible to the client that's building it.
  • create an API to build a new version of the share one change at a time, then a second API call to finalize the change (and make the new version visible to the world). It might look something like the immutable share-building API:
    • edithandle = share.start_editing()
    • edithandle.apply_delta(offset, newdata)
    • edithandle.finish()
    • edithandle.abort()
    • finish() is the test-and-set operation: it might fail if some other writer has completed their own start_editing()/apply_delta()/finish() sequence faster.
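
As a rough illustration of the second option, here is a minimal in-memory sketch. The method names start_editing(), apply_delta(), finish() and abort() come from the list above; the Share and EditHandle classes, the version counter, and StaleVersionError are hypothetical names invented for this sketch and are not part of Tahoe-LAFS. A real server would presumably spool deltas to disk rather than keep them in a list.

```python
class StaleVersionError(Exception):
    """Raised by finish() when another writer finalized first (hypothetical)."""


class Share:
    """Hypothetical stand-in for a mutable share held by a storage server."""

    def __init__(self, data=b""):
        self.data = bytes(data)
        self.version = 0  # bumped on every successful finish()

    def start_editing(self):
        return EditHandle(self)


class EditHandle:
    def __init__(self, share):
        self._share = share
        self._base_version = share.version  # what we expect at finish()
        self._deltas = []  # (offset, newdata) pairs, invisible until finish()
        self._closed = False

    def apply_delta(self, offset, newdata):
        if self._closed:
            raise ValueError("edit handle is closed")
        self._deltas.append((offset, bytes(newdata)))

    def finish(self):
        """Test-and-set: apply all deltas only if nobody else finished first."""
        if self._closed:
            raise ValueError("edit handle is closed")
        self._closed = True
        if self._share.version != self._base_version:
            raise StaleVersionError("another writer finished first")
        buf = bytearray(self._share.data)
        for offset, newdata in self._deltas:
            end = offset + len(newdata)
            if end > len(buf):
                buf.extend(b"\x00" * (end - len(buf)))  # zero-fill any gap
            buf[offset:end] = newdata
        self._share.data = bytes(buf)
        self._share.version += 1

    def abort(self):
        self._closed = True
        self._deltas.clear()
```

With two handles opened against the same share, the first finish() succeeds and the second raises StaleVersionError, leaving the share at the first writer's version.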

If we're willing to tolerate the disk-footprint, we could increase reliability against server crashes by making start_editing() create a full copy of the old share in a sibling directory (like incoming/, not visible to anyone but the edithandle). Then apply_delta() would do normal write()s to the copy, and finish() would atomically move the copy back into place. Everything in the incoming/ directory would be deleted at startup, and the temp copies would also be deleted when the connection to the client was lost. This would slow down the updates for large files (since a lot of data would need to be shuffled around before the edit could begin), and would consume more disk (twice the size of the share), but would allow edits to be spread across separate messages, which reduces the client's memory requirements. It would also reduce share corruption caused by the server being bounced during a mutable write.
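
The copy-then-rename variant could be sketched roughly as follows. This is only an illustration under assumptions: DiskEditHandle and clean_incoming are hypothetical names, incoming/ is assumed to live on the same filesystem as the share (so that os.replace() is an atomic rename on POSIX), and the test-and-set check, fsync calls, and cleanup on lost connections are all left out.

```python
import os
import shutil
import tempfile


class DiskEditHandle:
    """Hypothetical edit handle that works on a private copy of the share."""

    def __init__(self, share_path, incoming_dir):
        self._share_path = share_path
        os.makedirs(incoming_dir, exist_ok=True)
        # The temp copy lives in incoming/, visible only to this handle.
        fd, self._tmp_path = tempfile.mkstemp(dir=incoming_dir)
        os.close(fd)
        if os.path.exists(share_path):
            # Full copy of the old share: the extra disk footprint and the
            # up-front delay for large files both come from this step.
            shutil.copyfile(share_path, self._tmp_path)

    def apply_delta(self, offset, newdata):
        # Ordinary write() against the copy; the live share is untouched.
        with open(self._tmp_path, "r+b") as f:
            f.seek(offset)
            f.write(newdata)

    def finish(self):
        # Atomically move the copy into place (same filesystem assumed).
        os.replace(self._tmp_path, self._share_path)

    def abort(self):
        if os.path.exists(self._tmp_path):
            os.remove(self._tmp_path)


def clean_incoming(incoming_dir):
    """Delete leftover temp copies, as a server might do at startup."""
    for name in os.listdir(incoming_dir):
        os.remove(os.path.join(incoming_dir, name))
```

Because each apply_delta() is a separate call, the client only needs to hold one delta in memory at a time, which is the memory saving described above.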

comment:2 in reply to: ↑ 1 Changed at 2011-08-28T23:14:38Z by davidsarah

Replying to warner:

  • create an API to build a new version of the share one change at a time, then a second API call to finalize the change (and make the new version visible to the world). It might look something like the immutable share-building API:
    • edithandle = share.start_editing()
    • edithandle.apply_delta(offset, newdata)
    • edithandle.finish()
    • edithandle.abort()
    • finish() is the test-and-set operation: it might fail if some other writer has completed their own start_editing()/apply_delta()/finish() sequence faster.

I prefer this option: it allows the client to apply the deltas to all servers and confirm that those operations succeed, and only then send finish to all servers. But note that there needs to be an edithandle.truncate(new_size) operation, or alternatively .finish(new_size).
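
A rough sketch of that client-side flow, using the .finish(new_size) variant: deltas go to every server first, and finish is sent only once all of them have accepted every delta. FakeServerShare, FakeEditHandle and publish are hypothetical names for illustration only. As a simplification, every server receives the same bytes here, whereas a real publish would send each server its own share.

```python
class FakeServerShare:
    """Hypothetical in-memory server share, optionally failing on writes."""

    def __init__(self, data=b"", fail_on_delta=False):
        self.data = bytes(data)
        self._fail = fail_on_delta

    def start_editing(self):
        return FakeEditHandle(self, self._fail)


class FakeEditHandle:
    def __init__(self, share, fail):
        self._share = share
        self._staged = bytearray(share.data)  # private working copy
        self._fail = fail

    def apply_delta(self, offset, newdata):
        if self._fail:
            raise IOError("simulated server failure")
        end = offset + len(newdata)
        if end > len(self._staged):
            self._staged.extend(b"\x00" * (end - len(self._staged)))
        self._staged[offset:end] = newdata

    def finish(self, new_size):
        del self._staged[new_size:]  # truncate to the new share size
        self._share.data = bytes(self._staged)

    def abort(self):
        self._staged = None


def publish(shares, segments, segsize):
    """Send each segment to all servers, then finish everywhere or nowhere."""
    handles = [s.start_editing() for s in shares]
    new_size = 0
    try:
        # 'segments' may be a generator, so the client holds only one
        # segment in memory at a time.
        for i, seg in enumerate(segments):
            for h in handles:
                h.apply_delta(i * segsize, seg)
            new_size = i * segsize + len(seg)
    except Exception:
        for h in handles:
            h.abort()
        return False
    for h in handles:
        h.finish(new_size)
    return True
```

Note that this sketch does not handle a failure during the finish round itself, which is the case that a full two-phase commit would have to address.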

comment:3 Changed at 2011-09-22T22:06:50Z by davidsarah

  • Keywords memory-leak performance added
  • Priority changed from minor to major

There are some memory usage measurements on the duplicate #1523. Particularly concerning is that there seems to be a rather large memory leak; it's not just high transient memory usage.

comment:4 Changed at 2011-09-22T23:09:59Z by zooko

Let's call it a memory "leak" if doing some operation repeatedly results in progressively greater memory usage, such that if you do that operation enough times it will use up all the memory in your system. Let's not call it a memory "leak" if it uses up way too much RAM. Note that last time I heard, CPython never releases memory back to the operating system: http://www.evanjones.ca/memoryallocator/
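
To make that definition concrete, here is a small sketch of a repeated-operation check. It assumes a modern Python 3 with the standard tracemalloc module, which measures Python-level allocations rather than RSS; memory_after_each_run, looks_like_leak and the 64 KiB tolerance are hypothetical choices for illustration.

```python
import tracemalloc


def memory_after_each_run(operation, runs=5):
    """Run 'operation' repeatedly; record traced memory still held after each run."""
    tracemalloc.start()
    try:
        samples = []
        for _ in range(runs):
            operation()
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
        return samples
    finally:
        tracemalloc.stop()


def looks_like_leak(samples, tolerance=64 * 1024):
    """True if retained memory grows by more than 'tolerance' on every run."""
    return all(b > a + tolerance for a, b in zip(samples, samples[1:]))
```

An operation that briefly allocates a large buffer and then drops it would show a high peak but flat samples, so it would count as "uses way too much RAM" rather than as a leak under this definition.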

It sounds to me like there is a major problem here, which is that Tahoe-LAFS uses up way too much memory. I don't see evidence that there is a "leak" per se, and I don't consider it to be a major problem that CPython never releases memory back to the operating system.

comment:5 Changed at 2011-09-22T23:10:37Z by zooko

  • Keywords docs added

We need to document this in the 1.9 release's docs/performance.rst if it isn't fixed.

comment:6 Changed at 2011-09-23T00:11:06Z by davidsarah

Subject to fragmentation issues, CPython does return memory to the OS: http://bugs.python.org/issue1123430. I tried to test whether uploading a second file resulted in the same additional memory usage (suggesting a leak) or less (suggesting that not returning memory is part of the problem), but couldn't complete the test because my machine became unresponsive. I'll try again when I have more free memory.

Note that it's RSS that we're measuring, not virtual memory. Memory pages that aren't being used shouldn't be counted in RSS (eventually).

comment:7 Changed at 2011-10-13T17:07:00Z by warner

  • Milestone changed from 1.9.0 to 1.10.0

not making it into 1.9

comment:8 Changed at 2013-08-09T18:18:49Z by zooko

  • Description modified (diff)
  • Milestone changed from 1.11.0 to eventually

This isn't going to make it into 1.10.0. I think it requires a deep change. Ultimately I think it actually requires end-to-end two-phase-commit (#1755)!

Let's see, does docs/performance.rst already document this issue? trunk/docs/performance.rst?rev=514fb096be50464ce78933f4db48db4de40e7265#publishing-an-a-byte-mutable-file. Yes! Good.
