#211 closed defect (fixed)

mutable: tolerate mixed corrupt/good shares from any given peer

Reported by: warner Owned by: warner
Priority: major Milestone: 0.7.0
Component: code Version: 0.6.1
Keywords: Cc:
Launchpad Bug:

Description

The current mutable file Retrieve code has a control flow problem that causes it to respond to a corrupt share by ignoring any remaining shares from the same peer. This causes unnecessary problems for small grids, because it makes fewer shares available for use. In the worst case, this could make files unavailable.

This worst case is only likely to be exercised in a unit test, but that's what is happening in our test_mutable, where we use 5 nodes, 10 shares (of which 7 are corrupt), 3-of-10 encoding.

To fix this, we need to modify the control flow in Retrieve._got_results so that a CorruptShareError does not abort the loop: the remaining shares from that peer should still be processed, and the exception should be raised at the end of the loop (to notify _query_failed, which cares about the peerid but not the share number).
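A minimal sketch of the intended control flow, not the actual Retrieve code: the function name got_results_sketch, the validate_share helper, the dict-of-shares argument, and the CorruptShareError constructor shown here are all hypothetical stand-ins for illustration. The point is only that a corrupt share is remembered rather than raised immediately, so good shares from the same peer are still collected.

```python
class CorruptShareError(Exception):
    """Stand-in exception; the real class's fields are an assumption here."""

    def __init__(self, peerid, shnum, reason):
        super().__init__(peerid, shnum, reason)
        self.peerid = peerid
        self.shnum = shnum
        self.reason = reason


def validate_share(peerid, shnum, data):
    # Hypothetical validator: in this sketch a share is "corrupt" if it is None.
    if data is None:
        raise CorruptShareError(peerid, shnum, "failed validation")
    return data


def got_results_sketch(peerid, shares, good_shares):
    """Process every share from one peer, then report the first corruption.

    shares: dict mapping share number -> share data (assumed shape).
    good_shares: dict that accumulates the shares that validated.
    """
    first_error = None
    for shnum, data in sorted(shares.items()):
        try:
            good_shares[shnum] = validate_share(peerid, shnum, data)
        except CorruptShareError as e:
            # Remember the failure, but keep going with the remaining shares.
            if first_error is None:
                first_error = e
    if first_error is not None:
        # Raised only after the loop, so the caller still learns that this
        # peer returned a bad share (the peerid matters, not the shnum).
        raise first_error
```

With a peer holding one corrupt share and one good share, the good share still ends up in good_shares even though the call raises, which is the behaviour a small grid needs.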

The current workaround is to use 10 nodes in that test instead of 5. Once we fix this control flow, test_system.SystemTest.test_mutable should be restored to using 5 nodes instead of 10, because the memory footprint of a 10-node test is considerably larger than that of a 5-node test (233MB instead of 77MB).

Change History (2)

comment:1 Changed at 2007-11-15T20:58:34Z by warner

The workaround was introduced in 59d6c3c8229d8457 to fix #209 in time for the 0.7.0 release.

comment:2 Changed at 2007-11-16T23:14:52Z by warner

  • Milestone changed from 0.7.1 to 0.7.0
  • Resolution set to fixed
  • Status changed from new to closed

Fixed, in e3037a7541d2a37c. I also reduced the test case back down to 5 nodes: to exercise the recent resource.setrlimit code in node.py, you'll want to raise that back up to 10 briefly.
