[tahoe-lafs-trac-stream] [tahoe-lafs] #1395: error doing a check --verify on files bigger than about 1Gbyte
tahoe-lafs
trac at tahoe-lafs.org
Sun May 8 14:02:54 PDT 2011
#1395: error doing a check --verify on files bigger than about 1Gbyte
-------------------------------+---------------------------------
Reporter: sickness | Owner: nobody
Type: defect | Status: new
Priority: minor | Milestone: undecided
Component: code-encoding | Version: 1.8.2
Resolution: | Keywords: memory verify error
Launchpad Bug: |
-------------------------------+---------------------------------
Comment (by warner):
I think I see the problem. The immutable Verifier code ([http://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/src/allmydata/immutable/checker.py?rev=5002#L620 here] in checker.py) is overly parallelized. It uses a !DeferredList to work on all shares in parallel, and each share worker uses a !DeferredList to work on all blocks in parallel. The result is that every single byte of every single share is fetched at the same time, completely blowing our memory budget. As to why the server is crashing, I suspect that when the server receives a gigantic batch of requests covering every single byte of the file, it responds to all of them at once, queueing a massive amount of data in its output buffers, which exhausts its memory as well. A separate issue is how to protect our servers against this sort of DoS, but I'm not sure how (we'd need to delay responding to a request when more than a certain number of bytes are already sitting in the output queue for that connection, which cuts wildly across the abstraction boundaries).
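For illustration, the doubly-parallel shape described above boils down to something like this (a minimal sketch with stand-in names like {{{share.get_block()}}}, not the actual {{{checker.py}}} code):
{{{
#!python
from twisted.internet import defer

def verify_all_shares(shares, num_blocks):
    # Outer DeferredList: every share is verified in parallel.
    return defer.DeferredList([verify_share(share, num_blocks)
                               for share in shares])

def verify_share(share, num_blocks):
    # Inner DeferredList: every block of this share is fetched in
    # parallel too, so all bytes of all shares are in flight at once.
    return defer.DeferredList([share.get_block(i)
                               for i in range(num_blocks)])
}}}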
The Verifier should work just like the Downloader: one segment at a time,
all blocks for a single segment being fetched in parallel. That approach
gives a memory footprint of about {{{S*N/k}}} (whereas regular download is
about {{{S}}}). We could reduce the footprint to {{{S/k}}} (at the expense
of speed) by doing just one block at a time (i.e. completely verifying
share 1 before touching share 2, and within share 1 completely verifying
block 1 before touching block 2), but I think that slows things down too
much.
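To put rough numbers on those footprints (assuming the usual defaults of a 128KiB max segment size and 3-of-10 encoding, i.e. {{{S=128KiB}}}, {{{k=3}}}, {{{N=10}}}; those parameters are my assumption, not stated in this ticket):
{{{
# assuming S = 128KiB, k = 3, N = 10
current Verifier:   ~filesize*N/k  (a 1GB file -> ~3.3GB in flight)
segment-at-a-time:  S*N/k = 128KiB*10/3 ~= 427KiB
regular download:   S     = 128KiB
block-at-a-time:    S/k  ~= 43KiB
}}}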
I've attached a patch which limits parallelism to approximately the right
thing, given the slightly funky design of the Verifier (it iterates
primarily over *shares*, not segments). The patch continues to verify all
shares in parallel, but within each share it serializes the handling of
blocks, so that each share-handler will only look at one block at a time,
as in the sketch below.
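Roughly speaking, the serialized version chains the block fetches on a single Deferred instead of firing them all at once (again a sketch with stand-in names, not the literal patch):
{{{
#!python
from twisted.internet import defer

def verify_share(share, num_blocks):
    # Chain the fetches so each block starts only after the previous
    # block's result has been checked and discarded.
    d = defer.succeed(None)
    for i in range(num_blocks):
        # i=i pins the loop variable; the lambda fires later, in order
        d.addCallback(lambda ign, i=i: share.get_block(i))
        d.addCallback(_discard_result)
    return d

def _discard_result(block):
    # Drop the block data immediately so memory stays flat.
    return None
}}}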
The patch needs tests, which should verify a moderate-size file with
artificially small segments (and thus a high segment count), probably
with N=1 for simplicity. The test needs to confirm that one block is
completed before the next begins; I don't know an easy way to do that,
so it probably needs some instrumentation in {{{checker.py}}}. My manual
tests just added some printfs, one just before the call to
{{{vrbp.get_block()}}} and another inside {{{_discard_result()}}}, and
noticed that there were lots of {{{get_block}}}s without interleaved
{{{_discard_result}}}s.
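One possible way to confirm that without printfs (purely hypothetical; the wrapper below is made up for illustration) is to have the test wrap {{{get_block()}}} and record how many fetches are ever in flight at once:
{{{
#!python
class InFlightCounter:
    """Records the peak number of simultaneous block fetches."""
    def __init__(self):
        self.inflight = 0
        self.max_seen = 0

    def wrap(self, get_block):
        def wrapped(i):
            self.inflight += 1
            self.max_seen = max(self.max_seen, self.inflight)
            d = get_block(i)
            def done(res):
                self.inflight -= 1
                return res
            d.addBoth(done)
            return d
        return wrapped

# in the test, something like:
#   counter = InFlightCounter()
#   vrbp.get_block = counter.wrap(vrbp.get_block)
#   ... run the verify, then assert counter.max_seen == 1
}}}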
--
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1395#comment:15>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage