[tahoe-lafs-trac-stream] [tahoe-lafs] #1395: error doing a check --verify on files bigger than about 1Gbyte

tahoe-lafs trac at tahoe-lafs.org
Sun May 8 14:02:54 PDT 2011


#1395: error doing a check --verify on files bigger than about 1Gbyte
-------------------------------+---------------------------------
     Reporter:  sickness       |      Owner:  nobody
         Type:  defect         |     Status:  new
     Priority:  minor          |  Milestone:  undecided
    Component:  code-encoding  |    Version:  1.8.2
   Resolution:                 |   Keywords:  memory verify error
Launchpad Bug:                 |
-------------------------------+---------------------------------

Comment (by warner):

 I think I see the problem. The immutable Verifier code ([http://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/src/allmydata/immutable/checker.py?rev=5002#L620 here]
 in checker.py) is overly parallelized. It uses a !DeferredList to work on
 all shares in parallel, and each share worker uses a !DeferredList to work
 on all blocks in parallel. The result is that every single byte of every
 single share is fetched at the same time, completely blowing our memory
 budget. As for why the server is crashing, I suspect that when the server
 gets a gigantic batch of requests for every single byte of the file, it
 responds to all of them at once, queueing a massive amount of data in its
 output buffers, which blows its memory budget too. Protecting our servers
 against this sort of DoS is a separate issue, and I'm not sure how to do
 it (we'd need to delay responding to a request when more than a certain
 number of bytes are already sitting in the output queue for that
 connection, an approach that jumps wildly across the abstraction
 boundaries).
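
 For illustration only, here is a minimal sketch of that nested-
 !DeferredList shape, with made-up fetch_block/verify_share names rather
 than the real checker.py code:

 {{{
 #!python
 # Sketch only (made-up names, not the real checker.py): the nested
 # DeferredList shape described above, which puts every block of every
 # share in flight at the same time.
 from twisted.internet import defer

 def fetch_block(share_num, block_num):
     return defer.succeed(b"x" * 131072)   # placeholder remote read

 def verify_share(share_num, num_blocks):
     # one outstanding request per block, all issued at once
     return defer.DeferredList([fetch_block(share_num, b)
                                for b in range(num_blocks)])

 def verify_all_shares(num_shares, num_blocks):
     # one worker per share, so num_shares * num_blocks requests in flight
     return defer.DeferredList([verify_share(s, num_blocks)
                                for s in range(num_shares)])
 }}}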

 The Verifier should work just like the Downloader: one segment at a time,
 with all blocks for a single segment fetched in parallel. That approach
 gives a memory footprint of about {{{S*N/k}}} (whereas regular download is
 about {{{S}}}). We could reduce the footprint to {{{S/k}}} (at the expense
 of speed) by fetching just one block at a time (i.e. completely verify
 share 1 before touching share 2, and within share 1 completely verify
 block 1 before touching block 2), but I think that gives up too much
 speed.
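
 To make the tradeoff concrete (rough numbers only, assuming the default
 128 KiB segment size and 3-of-10 encoding):

 {{{
 #!python
 # S = segment size in bytes, k = 3, N = 10 (defaults assumed here)
 S, k, N = 128 * 1024, 3, 10
 print(S * N // k)   # segment-at-a-time verify: ~427 KiB in flight
 print(S)            # regular download:          128 KiB
 print(S // k)       # fully serial, one block:   ~43 KiB
 }}}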

 I've attached a patch which limits parallelism to approximately the right
 thing, given the slightly funky design of the Verifier (the verifier
 iterates primarily over *shares*, not segments). The patch continues to
 verify all shares in parallel. However, within each share, it serializes
 the handling of blocks, so that each share-handler will only look at one
 block at a time.
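
 A minimal sketch (assumed names, not the actual patch) of the shape that
 serialization takes with Deferreds: shares stay parallel, but within a
 share each block request is chained behind the previous one:

 {{{
 #!python
 # Sketch only: fetch_block/check_block stand in for the real remote read
 # and hash check; the real change lives in checker.py.
 from twisted.internet import defer

 def fetch_block(share_num, block_num):
     return defer.succeed(b"x" * 131072)   # placeholder remote read

 def check_block(block_data):
     return None                           # placeholder check; data discarded

 def verify_share_serially(share_num, num_blocks):
     # Chain the blocks: block b+1 is not requested until block b has been
     # checked and its data dropped, so only one block is held at a time.
     d = defer.succeed(None)
     for b in range(num_blocks):
         d.addCallback(lambda ign, b=b: fetch_block(share_num, b))
         d.addCallback(check_block)
     return d

 def verify_all_shares(num_shares, num_blocks):
     # Shares are still verified in parallel.
     return defer.DeferredList([verify_share_serially(s, num_blocks)
                                for s in range(num_shares)])
 }}}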

 The patch needs tests, which should verify a moderate-size file with
 artificially small segments (and therefore many segments), probably with
 N=1 for simplicity. The test needs to confirm that one block is completed
 before the next begins. I don't know an easy way to do that... it probably
 needs some instrumentation in {{{checker.py}}}. My manual tests just added
 some printfs, one just before the call to {{{vrbp.get_block()}}} and
 another inside {{{_discard_result()}}}, and showed that there were lots of
 {{{get_block}}}s without interleaved {{{_discard_result}}}s.
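
 One possible way to turn that manual check into a test (hypothetical
 helper, nothing that exists in the tree today) is to wrap
 {{{get_block}}} with a counter and assert that a share never has more
 than one block request outstanding:

 {{{
 #!python
 # Hypothetical instrumentation sketch: wrap a Deferred-returning
 # get_block so a test can assert that a new block is never requested
 # before the previous one's result has been processed.
 class OverlapCounter:
     def __init__(self):
         self.outstanding = 0
         self.max_outstanding = 0

     def wrap(self, get_block):
         def instrumented(blocknum):
             self.outstanding += 1
             self.max_outstanding = max(self.max_outstanding,
                                        self.outstanding)
             d = get_block(blocknum)
             def _done(res):
                 self.outstanding -= 1
                 return res
             d.addBoth(_done)
             return d
         return instrumented

 # In the test: replace vrbp.get_block with counter.wrap(vrbp.get_block),
 # run the verify, then assert counter.max_outstanding == 1.
 }}}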

-- 
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1395#comment:15>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage

