Opened at 2010-09-23T23:38:26Z
Last modified at 2013-06-18T00:49:19Z
#1209 assigned defect
repair of mutable files/directories should not increment the sequence number
Reported by: | gdt | Owned by: | davidsarah
---|---|---|---
Priority: | major | Milestone: | soon
Component: | code-mutable | Version: | 1.8.0
Keywords: | repair mutable preservation space-efficiency | Cc: |
Launchpad Bug: | | |
Description (last modified by daira)
Particularly with my root directory, I often find that 9 shares of seqN are available compared to the 10 desired. I do a repair, which results in 10 shares of seqN+1, and then the 9 seqN shares are deleted. Then the next day there are 9 shares of seqN+1 and 1 of seqN, and the file is again not healthy. This repeats daily.
It seems that the missing seqN shares should instead be generated and placed; then, even as servers churn, it's likely that 10 shares can still be found, with no unrecoverable versions. Perhaps I'm missing something, but the current behavior is not stable with intermittent servers.
I have observed this problem with directories, but it seems likely that it applies to all immutable files.
Change History (23)
comment:1 Changed at 2010-09-29T12:22:35Z by zooko
- Owner changed from somebody to gdt
- Version changed from 1.8β to 1.8.0
comment:2 Changed at 2010-09-29T13:35:29Z by gdt
- Description modified (diff)
- Summary changed from repair of immutable files should not increment the sequence number to repair of directories (all immutable files) should not increment the sequence number
Sorry, I really meant directories in particular. I have edited the summary and description.
comment:3 follow-up: ↓ 4 Changed at 2010-09-29T13:36:31Z by gdt
- Description modified (diff)
- Summary changed from repair of directories (all immutable files) should not increment the sequence number to repair of directories (all immutable files?) should not increment the sequence number
Clarify that the ticket is really about directories, but that it likely applies to all immutable files.
comment:4 in reply to: ↑ 3 Changed at 2010-09-30T00:19:16Z by zooko
Replying to gdt:
Clarify that the ticket is really about directories, but that it likely applies to all immutable files.
You mean it likely applies to all mutable files, right? :-) Directories are normally mutable, although it is also possible to have immutable directories. But immutable directories don't have sequence numbers. :-)
comment:5 Changed at 2010-10-09T17:02:50Z by davidsarah
- Keywords repair mutable added
- Milestone changed from undecided to soon
comment:6 Changed at 2010-10-09T17:04:13Z by davidsarah
- Description modified (diff)
- Summary changed from repair of directories (all immutable files?) should not increment the sequence number to repair of mutable files/directories should not increment the sequence number
comment:7 Changed at 2011-01-14T18:11:43Z by davidsarah
- Milestone changed from soon to 1.9.0
- Owner changed from gdt to davidsarah
- Status changed from new to assigned
comment:8 Changed at 2011-01-14T18:50:53Z by davidsarah
- Keywords preservation space-efficiency added
- Priority changed from minor to major
#1210 was a duplicate. Its description was:
If there is 1 share of seqN and 10 shares of seqM, with M > N, the file is not healthy. The fix should be to remove the seqN share and not molest the seqM shares, instead of incrementing the version. This contributes to instability.
comment:9 Changed at 2011-08-14T01:14:33Z by davidsarah
- Milestone changed from 1.9.0 to soon
comment:10 Changed at 2012-03-18T01:07:05Z by warner
- Component changed from code to code-mutable
comment:11 Changed at 2012-03-18T01:08:51Z by warner
This is a great idea, especially given one of the failure modes for mutable files: when a share is migrated to a new server, the write-enabler becomes wrong and the share gets "stuck" at some old version, so repair is never able to get rid of that old share and the file will always appear "unhealthy". In that case, constantly clobbering the perfectly-valid most-recent-version shares is obviously wrong.
comment:12 Changed at 2012-03-29T20:20:43Z by davidsarah
Brian wrote on tahoe-dev:
I haven't looked at that code for a long time, but it occurs to me that what it wants is a checker-results flag that says whether the unhealthy status is due to the presence of multiple versions or not. In terms of the ServerMap object, I think that's just:
len(sm.recoverable_versions()) + len(sm.unrecoverable_versions()) > 1
We only need to do the full download+upload cycle (which increments the version number) if there are multiple versions present. If we're just missing a couple of shares (or some are corrupted and could be replaced), then the number of versions ==1 and we should do a non-incrementing form of repair.
I think we'll need new Repairer code to do that, though. Something to set the new version equal to the existing version, to avoid sending updates to existing correct shares, and something to make sure the generated test-vectors are ok with all that. Not trivial, but not a huge task either.
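Brian's check can be sketched as a small helper. The wrapper function and the `ServerMapStub` below are illustrative assumptions for this ticket discussion, not code from the Tahoe codebase; only the `recoverable_versions()`/`unrecoverable_versions()` method names come from the comment above:

```python
class ServerMapStub:
    """Minimal stand-in for allmydata's ServerMap, for illustration only."""
    def __init__(self, recoverable, unrecoverable):
        self._recoverable = recoverable
        self._unrecoverable = unrecoverable
    def recoverable_versions(self):
        return self._recoverable
    def unrecoverable_versions(self):
        return self._unrecoverable

def repair_needs_version_bump(servermap):
    """Return True when repair must do the full download+upload cycle
    (which increments the sequence number): i.e. when more than one
    version, recoverable or not, is present on the grid."""
    num_versions = (len(servermap.recoverable_versions())
                    + len(servermap.unrecoverable_versions()))
    return num_versions > 1
```

With only one version present (just missing or corrupted shares), this returns False and a non-incrementing repair would suffice.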
comment:13 Changed at 2012-03-29T20:22:53Z by davidsarah
Greg Troxel wrote:
More than that - if we have 1 share of M and all shares of N, for N > M, then we really just want to purge (or ignore?) the M share, and not molest the N shares.
The test for this should be like:
    set up 5 servers S
    upload some files
    while 20 iterations:
        for s in S:
            take s offline
            run repair
            take s back
With the current code, you get repair every time and 100 increments, I think. With your proposal, I think it's still true.
The above test is how the pubgrid feels to me, or used to.
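To make the arithmetic concrete, the loop above can be modeled as a toy simulation (not real Tahoe code; server and repair behavior are idealized, and the function name is made up here) that just counts sequence-number increments:

```python
def simulate_churn(num_servers=5, iterations=20, bump_on_any_repair=True):
    """Toy model of the churn test above: each pass takes one server
    offline, runs repair, and brings it back.  With the current behavior
    (bump_on_any_repair=True), every repair bumps the sequence number."""
    seqnum = 1
    for _ in range(iterations):
        for _ in range(num_servers):
            # one server is offline, so a share is missing -> repair triggers
            if bump_on_any_repair:
                seqnum += 1   # current code: full download+upload cycle
            # proposed code: regenerate the missing share; seqnum unchanged
    return seqnum

# simulate_churn() -> 101, i.e. the 100 increments estimated above
```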
comment:14 follow-up: ↓ 15 Changed at 2012-03-29T20:25:49Z by davidsarah
Brian replied:
On 3/29/12 12:01 PM, Greg Troxel wrote:
More than that - if we have 1 share of M and all shares of N, for N > M, then we really just want to purge (or ignore?) the M share, and not molest the N shares.
Ah, good point. We really only need a new version if there are multiple competing versions with the same sequence number, and if that sequence number is the highest seen. Repair is tricky in that case anyways, since at the Tahoe level we can't do an automatic merge, so we're certainly losing information (if it's just a directory modification, then the directory.py code can re-apply the modification, so that one case might be safe).
Hm, ServerMap.needs_merge() is pretty close already, but it only looks at recoverable versions (it tells you that an update will lose information that would have been recoverable if you'd read the individual versions first.. there are alternate cases where it doesn't matter because the other versions weren't recoverable anyways).
We should add a method to ServerMap that tells us whether a new version is needed or not.
The above test is how the pubgrid feels to me, or used to.
Yup, that test looks right.
comment:15 in reply to: ↑ 14 Changed at 2012-03-29T20:33:59Z by davidsarah
Replying to Brian:
We really only need a new version if there are multiple competing versions with the same sequence number, and if that sequence number is the highest seen.
Counterexample: suppose there is a recoverable version with sequence number S, and an unrecoverable version with sequence number S+1. (There are no versions with sequence number >= S+2.) Then, the new sequence number needs to be S+2, in order for clients not to use the shares from S+1. Deleting the shares with sequence number S+1 is also possible but inferior, since sequence numbers should be monotonically increasing to minimize race conditions.
We should add a method to ServerMap that tells us whether a new version is needed or not.
+1.
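The counterexample can be made concrete with a toy share map (illustrative values only; `k` is the recovery threshold, and none of these names are Tahoe APIs):

```python
# Shares present on the grid: seqnum 5 is recoverable, seqnum 6 is not.
shares = {5: 10, 6: 1}   # seqnum -> number of shares found
k = 3                    # shares needed to recover a version

S = max(seq for seq, n in shares.items() if n >= k)  # highest recoverable: 5
highest = max(shares)                                # highest seen anywhere: 6

# The repaired version must outrank the stray seqnum-6 shares, so it gets
# sequence number highest + 1 = 7, i.e. S + 2 as argued above.
new_seqnum = highest + 1
assert new_seqnum == S + 2
```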
comment:16 Changed at 2012-03-29T20:38:22Z by davidsarah
- Milestone changed from soon to 1.10.0
comment:17 Changed at 2012-03-29T20:56:00Z by davidsarah
Proposed algorithm:
- Find the highest recoverable sequence number, R.
- Find the highest sequence number for which there exist any shares, S.
- If R == S, repair version R using the algorithm in ticket:1130#comment:12 . Otherwise, download version R and upload it to version S+1.
This would also fix #1004, #614, and #1130. IIUC, an implementation of the ticket:1130#comment:12 algorithm is being worked on in #1382.
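A minimal sketch of this decision logic, as a toy model (the share-map representation, `k`, and all names are illustrative assumptions for this ticket, not Tahoe APIs; the in-place branch stands in for the ticket:1130#comment:12 algorithm):

```python
def plan_repair(shares, k):
    """shares: dict mapping sequence number -> number of distinct shares.
    k: how many shares are needed to recover a version.
    Returns the repair action per the proposed algorithm."""
    recoverable = [seq for seq, n in shares.items() if n >= k]
    if not recoverable:
        return ("abort", None)            # nothing recoverable
    R = max(recoverable)                  # highest recoverable seqnum
    S = max(shares)                       # highest seqnum with any shares
    if R == S:
        # regenerate/replace shares of version R without incrementing
        return ("repair-in-place", R)
    # an unrecoverable higher version exists; supersede it at S+1
    return ("upload-new-version", S + 1)
```

For example, 9-of-10 shares of a single version yields an in-place repair, while a stray unrecoverable higher version forces an upload at S+1.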
comment:18 follow-up: ↓ 19 Changed at 2012-11-20T01:43:00Z by gdt
I think davidsarah's proposed algorithm is a good choice. A few comments:
- if there are shares of a version Q < R, then S = R, not Q. This follows from the algorithm, but in a design doc perhaps should be made more obvious: stray shares of a version less than the highest recoverable version are not a problem.
- In the case where R is repaired, stray shares of a lower version should be removed.
- in the case where S+1 is uploaded, shares of R, and actually shares of <=S should be removed.
- if R is recoverable and there are shares of S > R, then it's really not clear what should happen. One possibility is to wait for a while (days?), keep checking, and hope that enough shares of S appear. But this is probably a very unlikely case, and it's unclear what ought to happen, so I would defer that nuance to later.
comment:19 in reply to: ↑ 18 Changed at 2012-11-20T07:23:58Z by davidsarah
Replying to gdt:
I think davidsarah's proposed algorithm is a good choice. A few comments:
- if there are shares of a version Q < R, then S = R, not Q. This follows from the algorithm, but in a design doc perhaps should be made more obvious: stray shares of a version less than the highest recoverable version are not a problem.
- In the case where R is repaired, stray shares of a lower version should be removed.
- in the case where S+1 is uploaded, shares of R, and actually shares of <=S should be removed.
I agree. To make this more explicit:
- Find the highest recoverable sequence number, R. If there is no recoverable sequence number, abort.
- Find the highest sequence number for which there exist any shares, S.
- If R == S:
  - Repair version R using the algorithm in ticket:1130#comment:12 . Set Recovered := R.
- Otherwise:
  - Download version R and upload it to version S+1. Set Recovered := S+1.
- If the client's happiness threshold is met for shares of sequence number Recovered, remove all known shares with sequence numbers < Recovered.
(The reason to do the removal only when the Recovered version is happy is to handle a partition in which different clients can see different subsets of servers. In that case, removing shares is a bad idea unless we know that the Recovered version has been stored reliably, not just recoverably. Also, we shouldn't remove shares from servers that weren't considered in steps 1 and 2 if they have connected in the meantime.)
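The removal step can be sketched on its own; this toy helper (hypothetical name and representation, not a Tahoe API) takes the post-repair share counts and applies the happiness condition above:

```python
def shares_to_remove(shares, recovered, happy):
    """comment:19 cleanup step: only delete old shares once the Recovered
    version is stored happily (not merely recoverably), so that a client
    on the wrong side of a partition cannot make things worse.
    shares: dict seqnum -> share count AFTER the repair step.
    recovered: the sequence number the repair produced.
    happy: the client's happiness threshold."""
    if shares.get(recovered, 0) >= happy:
        return sorted(seq for seq in shares if seq < recovered)
    return []   # Recovered version not happy: leave old shares alone
```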
- if R is recoverable and there are shares of S > R, then it's really not clear what should happen. One possibility is to wait for a while (days?), keep checking, and hope that enough shares of S appear. But this is probably a very unlikely case, and it's unclear what ought to happen, so I would defer that nuance to later.
The algorithm says to upload to version S+1 in that case. I think this is the right thing.
comment:20 Changed at 2013-02-18T15:52:21Z by PRabahy
I believe that davidsarah's algorithm should help for most cases. Does it make sense to use a vector clock (http://en.wikipedia.org/wiki/Vector_clock) for even more robustness? I don't believe this should happen often (ever), but if there is a partial split-brain where several nodes can connect to more than S storage servers but fewer than half the total storage servers in the grid, there could still be significant churn. (Did that last sentence make any sense?) I'm probably over-thinking this solution, so feel free to ignore me.
comment:21 Changed at 2013-02-18T16:32:42Z by gdt
Regarding the vector clock, I wouldn't say overthinking, but I think that tahoe needs to pick a design subspace within the space of all network and user behaviors. To date, judging by what's been done, tahoe has been designed for the case where all servers are almost always connected. I'm talking about a case where at any given time most servers are connected, and most servers are connected almost all of the time. In that world, not being connected is still unusual and to be avoided, but when you have 20 servers, the odds that one is missing are still pretty high.
So I think this ticket should stay focused on the mostly-connected case. If you want to work on a far more distributed less available use case, good luck, but I think it's about 50x the work of fixing ticket:1209.
comment:22 Changed at 2013-02-19T06:25:41Z by davidsarah
The vector clock algorithm is designed to ensure causal ordering (assuming cooperative peers) in a general peer-to-peer message passing system. It's inefficient and overcomplicated for this case, and the assumptions it relies on aren't met in any case.
comment:23 Changed at 2013-06-18T00:49:19Z by daira
- Description modified (diff)
Incidentally, the comment:19 algorithm never makes things worse even in the case of partition, because it only deletes shares of a lower version than a version that is stored happily, even when run on an arbitrary partition of a grid.
gdt: did you mean "mutable" instead of "immutable"? Immutable files don't have a sequence number!