Opened at 2011-05-05T20:02:27Z
Closed at 2011-08-02T20:59:40Z
#1395 closed defect (fixed)
error doing a check --verify on files bigger than about 1Gbyte
Reported by: | sickness | Owned by: | zooko |
---|---|---|---|
Priority: | critical | Milestone: | 1.9.0 |
Component: | code-encoding | Version: | 1.8.2 |
Keywords: | memory verify error performance test-needed | Cc: | |
Launchpad Bug: |
Description (last modified by davidsarah)
on a grid with 10 storage nodes, all on win32, and a gateway/client on win32, with all the nodes up and all the shares ok, I can download all the files ok, and check all of them ok. only thing that doesn't work is check --verify of files bigger than 1Gbyte
here's the error log of the gateway, and after it of a storage node:
######################################################################
2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] [Failure instance: Traceback: <type 'exceptions.IOError'>: [Errno 22] invalid mode ('wb') or filename: u'C:\\tahoeclient\\logs\\incidents\\incident-2011-05-05--18:33:34Z-w2qn32q.flog' 2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\internet\defer.py:106:maybeDeferred 2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\src\allmydata\web\common.py:302:renderHTTP 2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\src\allmydata\web\root.py:124:render_POST 2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\support\Lib\site-packages\foolscap-0.6.1-py2.6.egg\foolscap\logging\log.py:210:msg 2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] --- <exception caught here> --- 2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\support\Lib\site-packages\foolscap-0.6.1-py2.6.egg\foolscap\logging\log.py:265:add_event 2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\support\Lib\site-packages\foolscap-0.6.1-py2.6.egg\foolscap\logging\incident.py:35:event 2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\support\Lib\site-packages\foolscap-0.6.1-py2.6.egg\foolscap\logging\log.py:278:declare_incident 2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\support\Lib\site-packages\foolscap-0.6.1-py2.6.egg\foolscap\logging\incident.py:85:incident_declared 2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] ] 2011-05-05 20:42:33+0200 [HTTPChannel,43,10.100.9.71] Unhandled Error Traceback (most recent call last): File "c:\tahoe-lafs\support\Lib\site-packages\nevow-0.10.0-py2.6.egg\nevow\rend.py", line 569, in _renderHTTP return self.flattenFactory(doc, ctx, writer, finisher) File "c:\tahoe-lafs\support\Lib\site-packages\nevow-0.10.0-py2.6.egg\nevow\rend.py", line 528, in <lambda> flattenFactory = lambda self, *args: flat.flattenFactory(*args) File "c:\tahoe-lafs\support\Lib\site-packages\nevow-0.10.0-py2.6.egg\nevow\flat\__init__.py", line 14, in flattenFactory return deferflatten(stan, ctx, writer).addCallback(finisher) File "c:\tahoe-lafs\support\Lib\site-packages\nevow-0.10.0-py2.6.egg\nevow\flat\twist.py", line 63, in deferflatten _drive(iterable, finished) --- <exception caught here> --- File "c:\tahoe-lafs\support\Lib\site-packages\nevow-0.10.0-py2.6.egg\nevow\flat\twist.py", line 24, in _drive next = iterable.next() File "c:\tahoe-lafs\support\Lib\site-packages\nevow-0.10.0-py2.6.egg\nevow\flat\ten.py", line 85, in iterflatten straccum.append(item) exceptions.MemoryError:
######################################################################
now for the storage log:
######################################################################
2011-05-05 20:29:01+0200 [Negotiation,8,10.100.26.102] Unhandled Error Traceback (most recent call last): File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\python\log.py", line 84, in callWithLogger return callWithContext({"system": lp}, func, *args, **kw) File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\python\log.py", line 69, in callWithContext return context.call({ILogContext: newCtx}, func, *args, **kw) File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\python\context.py", line 59, in callWithContext return self.currentContext().callWithContext(ctx, func, *args, **kw) File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\python\context.py", line 37, in callWithContext return func(*args,**kw) --- <exception caught here> --- File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\internet\selectreactor.py", line 146, in _doReadOrWrite why = getattr(selectable, method)() File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\internet\tcp.py", line 180, in doWrite return Connection.doWrite(self) File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\internet\tcp.py", line 431, in doWrite result = abstract.FileDescriptor.doWrite(self) File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\internet\abstract.py", line 101, in doWrite self.dataBuffer = buffer(self.dataBuffer, self.offset) + "".join(self._tempDataBuffer) exceptions.MemoryError:
###################################################################### the gateway/client process quickly grow over 1.5 Gbyte in ram, then it shrunk to 180 Mbyte, the storage process grow to 300 Mbyte and stayed there
Attachments (4)
Change History (42)
comment:1 Changed at 2011-05-05T20:13:44Z by sickness
- Description modified (diff)
comment:2 Changed at 2011-05-05T21:45:16Z by davidsarah
comment:3 Changed at 2011-05-05T21:55:00Z by sickness
yeah, I can confirm that C:\tahoeclient\logs\incidents\ exists
comment:4 follow-up: ↓ 5 Changed at 2011-05-06T15:50:59Z by davidsarah
I see the problem. ':' is not valid in a Windows filename. Will file another ticket for that.
comment:5 in reply to: ↑ 4 Changed at 2011-05-06T16:49:29Z by davidsarah
Replying to davidsarah:
I see the problem. ':' is not valid in a Windows filename. Will file another ticket for that.
comment:6 Changed at 2011-05-06T16:53:21Z by davidsarah
- Component changed from unknown to code-encoding
- Description modified (diff)
- Keywords memory verify added
- Summary changed from error doing a check --verify on files bigger than 1Gb on win32 to error doing a check --verify on files bigger than about 1Gbyte
comment:7 Changed at 2011-05-06T16:55:06Z by davidsarah
- Keywords error added
How was this reported by tahoe check --verify?
comment:8 Changed at 2011-05-06T18:14:56Z by sickness
it simply reported that 0 shares could be find for that file
comment:9 Changed at 2011-05-06T18:36:13Z by zooko
So both the gateway and the storage node had MemoryError. Can you reproduce it and tell what the Process Explorer says about how much memory those two different processes are using at different times in the operation? Are they using a large amount of memory continuously, or does it go up just before this failure, or what?
Thank you!
comment:10 Changed at 2011-05-06T19:18:10Z by davidsarah
#1229 is another checker-related memory issue. I've no evidence that it's the same issue, but it's possible that they might interact (i.e. the fact that we need memory proportional to the file size might increase the size of the #1229 memory leak, or the leak might increase the memory required for a single check).
Anyway, we should try to reproduce this on a non-Windows system to confirm that it's not win32-specific.
comment:11 Changed at 2011-05-06T19:23:33Z by davidsarah
Note that performance.rst claims:
memory footprint: N/K*S
for verifying a file (where S is the segment size). That seems to be contradicted by the behaviour in this bug.
comment:12 follow-up: ↓ 13 Changed at 2011-05-07T18:33:32Z by warner
Not that it's likely to be the problem here, but we should update that claim to be "N/K*S times a small multiple". I think the multiple is currently about 2 or 3. During encryption, we hold both a plaintext share and a ciphertext share in RAM at the same time (so 2*S), then we drop the plaintext. During erasure-coding, we hold a whole S of ciphertext in memory at the same time as the N/K*S shares, then we drop the ciphertext before pushing. We also pipeline the sends a little bit, I think 10kB or 50kB per server, to get better utilization out of a non-zero-latency wire.
Also Python's memory-management strategy interacts weirdly. Dropping the plaintext segment may not be enough: Python might not re-use that memory space for anything else right away. Although I'd expect it to de-fragment or coalesce free blocks before asking the OS for so much memory that it crashed.
comment:13 in reply to: ↑ 12 Changed at 2011-05-08T01:43:40Z by zooko
Replying to warner:
Not that it's likely to be the problem here, but we should update that claim to be "N/K*S times a small multiple". I think the multiple is currently about 2 or 3.
I would be willing to update these docs to be more precise or more correct, but I'm not entirely sure what you want them to say. (N/K)*S*3+50KB? (But only for immutable repair.)
Note that some of the other numbers in there are marked as approximate by a preceding tilde ~, e.g. performance.rst "Repairing an A-byte file". Maybe we should use the computer science tradition of ignoring constant factors which are independent of the variables (K, N, S, A, B, and G). However, I would rather follow that tradition only when the constant factors that we're ignoring are sufficiently small that our users will be willing to ignore them too. :-)
So in short: +0 from me, but you would need to write a patch for performance.rst yourself. Your attention to that document would be much appreciated by me because I would like for your admirable concern for precision in resource usage to be better represented there.
The pipeline size, which applies only to immutable objects and only to uploads, is 50 KB: immutable/layout.py WriteBucketProxy.
comment:14 Changed at 2011-05-08T12:47:51Z by zooko
I've opened #1398 (make docs/performance.rst more precise and accurate) for the documentation issues and attached a patch there. I do hope Brian looks at it, because my previous pass seems to have left a lot of incorrect stuff in that document and I would like the next version of it to be less wrong.
Changed at 2011-05-08T20:47:00Z by warner
patch to serialize block-fetches in Verifier. Multiple shares are still done in parallel.
comment:15 Changed at 2011-05-08T21:02:54Z by warner
I think I see the problem. The immutable Verifier code (here in checker.py) is overly parallelized. It uses a DeferredList to work on all shares in parallel, and each share worker uses a DeferredList to work on all blocks in parallel. The result is that every single byte of every single share is fetched at the same time, completely blowing our memory budget. As to why the server is crashing, I suspect that when the server gets a gigantic batch of requests for every single byte of the file, it responds to all of them, queueing a massive amount of data in the output buffers, which blows the memory space. A separate issue is to protect our servers against this sort of DoS, but I'm not sure how (we'd need to delay responding to a request if there were more than a certain number of bytes sitting in the output queue for that connection, which jumps wildly across the abstraction boundaries).
The Verifier should work just like the Downloader: one segment at a time, all blocks for a single segment being fetched in parallel. That approach gives a memory footprint of about S*N/k (whereas regular download is about S). We could reduce the footprint to S/k (at the expense of speed) by doing just one block at a time (i.e. completely verify share 1 before touching share 2, and within share 1 we completely verify block 1 before touching block 2), but I think that's too much.
I've attached a patch which limits parallelism to approximately the right thing, given the slightly funky design of the Verifier (the verifier iterates primarily over *shares*, not segments). The patch continues to verify all shares in parallel. However, within each share, it serializes the handling of blocks, so that each share-handler will only look at one block at a time.
The patch needs tests, which should verify a moderate-size artificially-small-segment (thus high-number-of-segment) file, probably with N=1 for simplicity. It needs to confirm that one block is completed before the next begins: I don't know an easy way to do that.. probably needs some instrumentation in checker.py. My manual tests just added some printfs, one just before the call to vrbp.get_block(), another inside _discard_result(), and noticed that there were lots of get_blocks without interleaved _discard_results.
comment:16 Changed at 2011-05-08T21:51:14Z by davidsarah
- Keywords performance added
Could this patch cause a performance regression for small files, or is each server going to serialize the requests to it anyway?
comment:17 Changed at 2011-05-28T21:09:36Z by zooko
- Milestone changed from undecided to 1.9.0
This is a major bug. I really think we should fix this for v1.9.0.
comment:18 Changed at 2011-05-28T21:58:57Z by davidsarah
+1 on fixing this for 1.9.0.
comment:19 Changed at 2011-05-28T21:59:21Z by davidsarah
- Priority changed from minor to major
comment:20 Changed at 2011-06-10T11:01:45Z by sickness
YAY, this patch solves my problem! (and also speeds up the testing and greatly reduces the memory usate of the storage node =)
comment:21 Changed at 2011-06-10T14:35:42Z by ChosenOne
Cannot reproduce the bug on my virtual machine: Windows 7 x64, 1G RAM. Test files were 1G and 2G.
comment:22 Changed at 2011-06-11T18:50:08Z by warner
ok, so sickness' report suggests this was the right problem to fix. Now to figure out how to write a test for it.
comment:23 Changed at 2011-06-11T19:39:11Z by warner
- Keywords review-needed added
Ok, that latest patch should be ready: it adds a test which includes a tiny bit of instrumentation on the immutable verifier (to measure how many block fetches are active at any one time), and it fails without the verifier fix, and passes once that fix is applied. Ready for review.
comment:24 Changed at 2011-06-11T19:44:25Z by zooko
- Owner changed from nobody to zooko
- Status changed from new to assigned
comment:25 Changed at 2011-06-12T14:59:27Z by davidsarah
comment:7 "How was this reported by tahoe check --verify?"
comment:8 "it simply reported that 0 shares could be find for that file"
That seems like an error-reporting bug that wouldn't be fixed by attachment:1395-overparallel.diff, even if it would happen in fewer cases.
comment:26 Changed at 2011-06-15T17:40:46Z by warner
davidsarah: hmm, if the storage server got clobbered by all the parallel fetches, then it would throw MemoryError (and either crash completely, or at least return Failures to the remote_read() calls). From the client's point of view, this is identical to there being no shares available: crashing servers are not providing shares.
Maybe the CLI command output could be extended to mention how many servers were contacted, and how many answered, so the user could distinguish crashing-servers (or no-network) from solid NAKs.
This patch is still important to apply, though.. zooko: please don't defer reviewing it because of the CLI deficiency. It'll be easier to test a CLI change with special-purpose test code (building a client with no servers, or with intentionally-failing servers) than by trying to provoke a MemoryError.
Changed at 2011-06-17T06:09:26Z by zooko
comment:27 Changed at 2011-06-17T06:11:20Z by zooko
Okay, I've reviewed the patch and it is good. Please review this alternate patch — attachment:not-too-parallel-test-by-patching.diff — in which the test patches the ValidatedReadBucketClass instead of the code under test (Checker) having code in it for testing purposes. Unfortunately this patch depends on a variant of mock.py named mockutil.py which knows what to do when a test case returns a Deferred instance, so I don't think we should commit this patch as is. I've submitted a patch to the mock project (http://code.google.com/p/mock/issues/detail?id=96 ) and I'll proceed to work up a patch for Tahoe-LAFS which uses a lower-tech patching tool. But please review this patch because it most clearly shows why I like this approach. (If you ignore the whole mockutil.py part.)
Changed at 2011-06-17T06:15:36Z by zooko
comment:28 Changed at 2011-06-17T06:18:12Z by zooko
- Owner changed from zooko to warner
- Status changed from assigned to new
Okay here's a version that doesn't depend on mockutil.py: attachment:not-too-parallel-test-by-low-tech-patching.diff. As you can see it makes the test code a little uglier and it doesn't handle exceptions (it'll leave the ValidatedReadBucketProxy patched out for the rest of the trial process if there is an exception raised from the test). I guess I'm a bit on the fence about which of these two patches I prefer. What's your preference?
comment:29 Changed at 2011-06-20T20:31:49Z by zooko
To clarify, the original patch attachment:1395-overparallel.diff passes my review and I approve of applying it to trunk. I also think that attachment:not-too-parallel-test-by-low-tech-patching.diff is better, and hope you (warner) will consider it as a replacement for attachment:1395-overparallel.diff. I also like attachment:not-too-parallel-test-by-patching.diff , but it has the flaw that it will probably break if someone uses it with a newer version of mock.py, so we should probably avoid it.
comment:30 Changed at 2011-06-23T21:39:09Z by warner
- Keywords review-needed removed
- Owner changed from warner to zooko
I'm -1 on a test that leaves the code in a broken/patched state after an exception.. too much opportunity for confusion and hair-tearing debugging of other problems later. And I'm -1 on a test that depends deeply upon internals of a library that we don't include a copy of (note that I'm historically inconsistent about this one, q.v. Foolscap vs Twisted, but I'm trying to mend my ways).
At first, I found your -low-tech-patching test hard to follow, but looking more closely, I'm growing fond of it. The original code isn't too hard to patch in the right way, so the mock subclass isn't too too weird. It might be nice if the new variable names were shorter (and since they only appear in the context of the TooParallel unit test, they don't need to be as fully-qualified as if they were in the original checker.py). But after getting used to it, I think I'm ok with the way it stands.
I'd be ok with simple partial fix to the cleanup-after-exception problem. Instead of using setUp/tearDown or pushing the whole middle of the test method into a sub-method, just put the set_up_grid part into a small function so you can get it into the Deferred chain. Something like:
d = defer.succeed(None) def _start(ign): self.set_up_grid(num_servers=4) c0 = self.g.clients[0] c0.DEFAULT_ENCODING_PARAMETERS = { "k": 1, "happy": 4, "n": 4, "max_segment_size": 5, } self.c0 = c0 self.uris = {} DATA = "data" * 100 # 400/5 = 80 blocks return c0.upload(Data(DATA, convergence="")) d.addCallback(_start) def _do_check(ur): n = self.c0.create_node_from_uri(ur.uri) return n.check(Monitor(), verify=True) d.addCallback(_do_check)
Since the "critical region" where exceptions could cause problems doesn't start until after the ...checker.ValidatedReadBucketProxy = make_mock_VRBP line, it's good enough to just capture the first chunk of the code after that in a function which is run from within the Deferred chain. Any exceptions in set_up_grid or c0.upload (which are fairly big and complex) will still get caught, so the cleanup can happen.
comment:31 Changed at 2011-06-30T14:59:54Z by zooko
- Status changed from new to assigned
I'll modify the low-tech patcher approach (attachment:not-too-parallel-test-by-low-tech-patching.diff) to clean up in case of exception, like Brian suggested.
Note that Michael Foord, author of mock.py, has proposed a patch for mock.py that would make it do everything we need. He just wants code-review and unit tests. See the mock.py ticket to help with that.
Once a version of mock.py is out with that functionality and we are willing to make Tahoe-LAFS depends on that version of mock then we can have the more succinct attachment:not-too-parallel-test-by-patching.diff style of mockery in our unit tests.
comment:32 Changed at 2011-07-07T05:46:25Z by zooko
- Keywords test-needed added
- Status changed from assigned to new
comment:33 Changed at 2011-07-18T21:41:21Z by zooko
- Status changed from new to assigned
comment:34 Changed at 2011-08-02T06:44:56Z by zooko@…
- Resolution set to fixed
- Status changed from assigned to closed
In f426e82287c11b11:
comment:35 Changed at 2011-08-02T18:58:09Z by david-sarah@…
In 9f8d34e63aa1aeeb:
(The changeset message doesn't reference this ticket)
comment:36 Changed at 2011-08-02T19:20:10Z by zooko
- Priority changed from major to critical
- Resolution fixed deleted
- Status changed from closed to reopened
comment:37 Changed at 2011-08-02T19:39:48Z by david-sarah@…
In c7f65ee8ad254f3f:
(The changeset message doesn't reference this ticket)
comment:38 Changed at 2011-08-02T20:59:40Z by davidsarah
- Resolution set to fixed
- Status changed from reopened to closed
c7f65ee8ad254f3f seems to have fixed the failures of allmydata.test.test_repairer.Verifier.test_corrupt_sharedata on the builders. (Some remaining nondeterministic failures are due to #1084.)
I suspect that this isn't specific to win32, although the limit on usable process address space for 32-bit Windows might be smaller than for other 32-bit platforms.
There was also an error while reporting the incident:
but it isn't clear whether that's due to the low-memory condition or an independent issue. 'wb' is a valid mode, so probably the path is incorrect. Does C:\tahoeclient\logs\incidents\ exist?