[tahoe-dev] [tahoe-lafs] #616: bug in repairer causes sporadic hangs in unit tests

Tue Feb 10 00:48:40 PST 2009

#616: bug in repairer causes sporadic hangs in unit tests
---------------------------+------------------------------------------------
 Reporter:  zooko          |           Owner:       
     Type:  defect         |          Status:  new  
 Priority:  major          |       Milestone:  1.3.0
Component:  code-encoding  |         Version:  1.2.0
 Keywords:                 |   Launchpad_bug:       
---------------------------+------------------------------------------------
 There is a bug in {{{DownUpConnector._satisfy_reads_if_possible()}}}:

 [source:src/allmydata/immutable/repairer.py@20090112214120-e01fd-7d241072d30b14d3e243829e952e8c8440e6c461#L127]

 It should be putting {{{leftover}}} bytes back into {{{self.bufs}}} and
 the rest into the result, but instead it puts all-but-{{{leftover}}} bytes
 back and only the {{{leftover}}} bytes into the result!  In cases where
 the input chunks come in different sizes than the read requests, this bug
 could lead to a read request getting more or fewer bytes than it
 requested.  This could lead to data corruption (although not irreversibly
 so -- it would then upload the same sequence of bytes but in
 different-sized blocks, which would break the integrity checking code but
 leave the ciphertext intact).
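
 To make the failure mode concrete, here is a minimal sketch of the
 split, not the actual repairer code.  The function {{{take}}}, its
 {{{buggy}}} flag, and the use of a {{{deque}}} for the buffers are all
 hypothetical; the only thing taken from the ticket is the idea that a
 chunk larger than the remaining request gets split into a part for the
 result and a part that goes back into the buffers.

```python
from collections import deque


def take(bufs, size, buggy=False):
    """Pop `size` bytes from the front of `bufs`, a deque of bytes objects.

    Hypothetical stand-in for the read-satisfying loop described in the
    ticket.  With buggy=True the oversized chunk is split at `leftover`
    instead of at `needed`, which is one way to get the symptom reported.
    """
    if sum(len(b) for b in bufs) < size:
        raise ValueError("not enough buffered bytes to satisfy the read")
    result = []
    needed = size
    while needed > 0:
        chunk = bufs.popleft()
        if len(chunk) > needed:
            leftover = len(chunk) - needed
            # Correct: `needed` bytes go to the result, `leftover` go back.
            # Buggy:   `leftover` bytes go to the result, the rest go back.
            split = leftover if buggy else needed
            result.append(chunk[:split])
            bufs.appendleft(chunk[split:])
            needed = 0
        else:
            # The whole chunk fits inside the request, so no split is needed.
            result.append(chunk)
            needed -= len(chunk)
    return b"".join(result)
```

 In this sketch, a 3-byte read from a single 8-byte chunk gets 5 bytes
 back when {{{buggy=True}}}, and a 6-byte read gets only 2.  The bytes
 still come out in the same order, just in wrong-sized pieces.  When every
 chunk is exactly as long as the read, the split branch is never taken and
 both variants behave the same.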

 Fortunately, in our current code, the writes and the read requests are
 always of the same sizes (the block size), so this doesn't happen in
 practice.  I've added an assertion in
 [20090210054605-92b7f-81c751b4418ffa63b4b2b43a459318ea3659ad90] just to
 make it fail safely if this were to happen in practice.  I have started
 writing unit tests for
 {{{DownUpConnector._satisfy_reads_if_possible()}}} -- it turns out that we
 need unit tests in addition to the functional tests that I already wrote:
 [source:src/allmydata/test/test_repairer.py].

 This explains the sporadic "lost progress" failure in the functional
 tests.  Hm...  Could it also explain the "lost progress" behavior that
 Brian and I witnessed on the testgrid when this code was newly committed
 to trunk?  I hope not, because that would mean that I am wrong about the
 writes and reads always having the same sizes.  But I'm pretty sure I am
 right about that.

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/616>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid