#661 closed enhancement (duplicate)

Dynamic share migration to maintain file health

Reported by: mmore
Owned by:
Priority: major
Milestone: undecided
Component: code-encoding
Version: 1.3.0
Keywords: repair preservation availability
Cc:
Launchpad Bug:

Description (last modified by daira)

Dynamic share repair to maintain file health. Based on the following features that already exist in Allmydata-Tahoe 1.3, we can improve automatic repair:

  1. Foolscap provides knowledge of which nodes are alive.
  1. Verification of file availability can be delegated to another node through a read-cap or a verify-cap without security risk (sketched below).
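
As a rough illustration of item 2, here is a minimal sketch of delegating a health check to a gateway using only a verify-cap. The `?t=check` webapi operation exists in Tahoe, but the exact endpoint details and JSON field names here are assumptions and may differ in 1.3:

{{{
#!python
import json
import urllib.request

def check_health(gateway_url, verify_cap):
    """Ask a Tahoe gateway to check the file named by verify_cap.

    A verify-cap grants integrity-checking access only, so it can be
    handed to another node without exposing the file's plaintext.
    """
    # 't=check' is the webapi check operation; 'output=JSON' requests
    # machine-readable results. Field names below are assumptions.
    url = "%s/uri/%s?t=check&output=JSON" % (gateway_url, verify_cap)
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        report = json.load(resp)
    return report["results"]["healthy"]
}}}

A repairer node given only verify-caps could poll files this way without being able to read them.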

The proposed auto repair process:

  1. Use a memory-based algorithm: since the client knows where the file's shares were placed, we can keep track of which shares are alive; for simplicity we infer a share's availability from the availability of the node that holds it (see the sketch after this list).
  1. The repair process is triggered automatically by the repairer. Responsibility for repair can be assigned using several techniques, depending on repair cost: network bandwidth and fault tolerance.
  1. Timeout: we can use a lazy repair technique to ride out temporary node failures, i.e. wait for a certain time before the repair process starts.
  1. Reintegration: a memory-based repair technique that remembers failed storage servers, so that servers which come back to life can be reintegrated, will help reduce consumption of Tahoe grid resources such as network bandwidth and storage space.
  1. Repairer: selecting who is responsible for repair takes several issues into consideration: security, repairer location, and repairer resources.
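
A minimal sketch of items 1, 3 and 4 (hypothetical names and a made-up grace period, not actual Tahoe code):

{{{
#!python
import time

GRACE_PERIOD = 6 * 60 * 60  # assumed: wait six hours before repairing

class ShareTracker:
    """In-memory map of which storage server holds which shares.

    Share availability is approximated by server availability: when a
    server's Foolscap connection drops, its shares are considered at
    risk, but repair is deferred for GRACE_PERIOD in case the failure
    is temporary (lazy repair). A server that reconnects within the
    grace period is reintegrated and no repair is done for its shares.
    """

    def __init__(self):
        self.shares = {}     # server_id -> set of share numbers held
        self.failed_at = {}  # server_id -> time its connection dropped

    def server_disconnected(self, server_id, now=None):
        self.failed_at[server_id] = now if now is not None else time.time()

    def server_reconnected(self, server_id):
        # Reintegration: the server's shares are usable again, so we
        # forget the failure instead of regenerating those shares.
        self.failed_at.pop(server_id, None)

    def shares_needing_repair(self, now=None):
        now = now if now is not None else time.time()
        for server_id, failed in self.failed_at.items():
            if now - failed >= GRACE_PERIOD:
                for shnum in self.shares.get(server_id, ()):
                    yield shnum
}}}

A repairer would periodically call shares_needing_repair() and regenerate only those shares whose servers stayed down past the grace period.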

Change History (5)

comment:1 Changed at 2009-03-11T22:07:16Z by zooko

  • Description modified (diff)

I reformatted the original description so that trac will represent the numbered items as a list.

comment:2 Changed at 2009-03-12T21:03:22Z by warner

  • Description modified (diff)

re-reformatted it: I think trac requires the leading space to trigger the "display as list" formatter

comment:3 Changed at 2009-06-12T00:56:32Z by warner

  • Component changed from dev-infrastructure to code-encoding
  • Owner somebody deleted

comment:4 Changed at 2010-03-25T03:27:24Z by davidsarah

  • Keywords repair preservation availability added

The following clump of tickets are closely related:

  • #450 Checker/repair agent
  • #483 Repairer service
  • #543 Rebalancing manager
  • #643 Automatically schedule repair service
  • #661 Dynamic share migration to maintain file health
  • #864 Automated migration of shares between storage servers

Actually there are probably too many overlapping tickets here.

Part of the redundancy is due to distinguishing repair from rebalancing. But when #614 and #778 are fixed, a healthy file will by definition be balanced across servers, so there's no need to make that distinction. Perhaps there will also be a "super-healthy" status that means shares are balanced across the maximum number of servers, i.e. N. (When we support geographic dispersal / rack-awareness, the definitions of "healthy" and "super-healthy" will presumably change again so that they also imply that shares have the desired distribution.)
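
As a rough illustration of that distinction, using the usual k/happy/N encoding parameters (the exact thresholds here are assumptions, not the checker's real logic):

{{{
#!python
def health_status(shares_by_server, k, happy, n):
    """Classify one file's share placement.

    shares_by_server maps server_id -> set of share numbers held;
    k/happy/n are the usual encoding parameters. The thresholds are
    illustrative assumptions, not Tahoe's actual checker logic.
    """
    present = set()
    for shnums in shares_by_server.values():
        present.update(shnums)
    servers = sum(1 for shnums in shares_by_server.values() if shnums)

    if len(present) < k:
        return "unrecoverable"
    if len(present) == n and servers >= n:
        # all N shares exist, spread over N distinct servers
        return "super-healthy"
    if len(present) == n and servers >= happy:
        # all N shares exist and placement meets the happiness threshold
        return "healthy"
    return "unhealthy"
}}}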

There are basically four options for how repair/rebalancing could be triggered:

  • a webapi operation performed by a gateway, and triggered by CLI commands. We already have this. Scheduling this operation automatically is #643.
  • triggered by write operations on a particular file. This is #232 and #699.
  • moving a server's shares elsewhere when it is about to be decommissioned or is running out of space. This is #864.
  • a more autonomous repair/rebalancing service that would run continuously.

The last option does not justify 4 tickets! (#450, #483, #543, #661) Unless anyone objects, I'm going to merge these all into #483.


comment:5 Changed at 2014-12-29T20:21:04Z by daira

  • Description modified (diff)
  • Resolution set to duplicate
  • Status changed from new to closed

Duplicate of #543.
