Context Navigation

if a stored share has a corrupt header, other shares held by that server for the file should still be accessible to clients

Reported by:	davidsarah	Owned by:	zooko
Priority:	major	Milestone:	1.10.1
Component:	code-storage	Version:	1.9.0b1
Keywords:	corruption preservation storage	Cc:
Launchpad Bug:

Description (last modified by markberger)

When a storage server receives a remote_get_buckets or remote_slot_testv_and_readv_and_writev request, it will try to create share objects for each of the shares it stores under that SI that are wanted by the client. If any of those shares have a corrupt header (typically resulting in a UnknownMutableContainerVersionError, UnknownImmutableContainerVersionError, or struct.error from the share class constructor), the whole request will fail, even though the server might hold other shares that are not corrupted.

Unfortunately there is no way in the current storage protocol to report success for some shares and a failure for others. The options are:

the status quo -- no shares in the shareset are accessible;
shares with corrupt headers are ignored on read requests;
if all shares are corrupted then report one of the errors, but if only some shares in a shareset have corrupted headers, ignore them and allow access to the rest.

I found this bug when working on the branch for #999, but I think it also applies to trunk.

Change History (15)

comment:1 Changed at 2011-10-18T18:14:46Z by davidsarah

Owner set to davidsarah
Status changed from new to assigned

[posted description as a comment by mistake]

Last edited at 2011-10-18T18:15:45Z by davidsarah (previous) (diff)

comment:2 Changed at 2011-10-18T18:16:23Z by davidsarah

Description modified (diff)

I will write a test.

comment:3 follow-up: ↓ 4 Changed at 2011-10-18T21:52:58Z by warner

I think the server should not report anything to the client about shares with corrupt headers. It should file a local corruption report (as if the remote client had called remote_advise_corrupt_share(), but with a flag that shows this was discovered locally, and probably in a part of the share that the client wouldn't be able to read anyways).

I'd be ok with reporting "nope, no shares here" if all shares are corrupt, rather than returning an error. I don't think the client is in a good position to do much with information about shares that are so corrupt that it can't even see part of their contents (which is what header corruption would mean). The only interaction I can think of is how the client might try to repair this file: they can't replace the corrupt-header share (the server admin will need to delete it locally), they can't modify it's contents, so they ought to just put a new copy somewhere else. They'll probably try to put a new copy on this server, which it needs to refuse (refusal doesn't always mean no-space-available, although the client might have bugs which makes it interpret refusal that way).

comment:4 in reply to: ↑ 3 ; follow-up: ↓ 6 Changed at 2011-10-18T23:36:01Z by davidsarah

Keywords design-review-needed added
Milestone changed from undecided to 1.10.0
Owner changed from davidsarah to warner
Status changed from assigned to new

Replying to warner:

I think the server should not report anything to the client about shares with corrupt headers. It should file a local corruption report (as if the remote client had called remote_advise_corrupt_share(), but with a flag that shows this was discovered locally, and probably in a part of the share that the client wouldn't be able to read anyways).

I'd be ok with reporting "nope, no shares here" if all shares are corrupt, rather than returning an error. I don't think the client is in a good position to do much with information about shares that are so corrupt that it can't even see part of their contents (which is what header corruption would mean). The only interaction I can think of is how the client might try to repair this file: they can't replace the corrupt-header share (the server admin will need to delete it locally), they can't modify it's contents, so they ought to just put a new copy somewhere else. They'll probably try to put a new copy on this server, which it needs to refuse (refusal doesn't always mean no-space-available, although the client might have bugs which makes it interpret refusal that way).

Good, that's essentially what I'd implemented in [5477/ticket999-S3-backend]. Please review! (I think that changeset is understandable independently of the other pluggable-backends changes. Comments particularly wanted on the behaviour when attempting to write shares with corrupted headers.)

comment:5 Changed at 2011-10-18T23:36:41Z by davidsarah

Keywords zookos-opinion-wanted added

comment:6 in reply to: ↑ 4 Changed at 2011-10-18T23:43:06Z by davidsarah

Replying to davidsarah:

Replying to warner:

I think the server should not report anything to the client about shares with corrupt headers. It should file a local corruption report (as if the remote client had called remote_advise_corrupt_share(), but with a flag that shows this was discovered locally, and probably in a part of the share that the client wouldn't be able to read anyways).

[...]

Good, that's essentially what I'd implemented in [5477/ticket999-S3-backend].

... except for filing a corruption report, which I agree is a good idea.

comment:7 Changed at 2012-03-30T23:37:08Z by kraeftig

I'm not going to re-assign this, as I don't feel completely confident that my review is sturdy enough to have properly understood this patch. None the less, I feel like I have a good understanding of the impact the patch is trying to achieve and the abstract approach (thanks to the readability of the code). I pretty much ignored syntax and looked only at the discourse that I could understand.

It looks as though the ability to access the server's other shares, even if a corrupt header exists for one of those shares, is functional. And one would be able to retrieve other files on that corrupted server, even with a return of corrupted headers.

AFAIK, the application now identifies corrupt headers, circumvents them for the action, and then reports what headers were corrupt.

Yay, my first review.

comment:8 Changed at 2012-03-31T00:40:29Z by davidsarah

Keywords design-review-needed zookos-opinion-wanted removed
Owner changed from warner to davidsarah
Status changed from new to assigned

kraeftig: thanks.

I understand what the behaviour should be now, and will make sure that it works that way before merging the pluggable backends code. (Currently on the ticket999-S3-backends branch, the disk backend works as described by warner, while the S3 backend does not. But the S3 backend is going to be rewritten for RAIC before it gets merged, in any case.)

comment:9 Changed at 2012-06-05T03:17:12Z by davidsarah

The tests on the cloud branch have been updated to reflect the desired behaviour. They pass for the disk backend, and are marked TODO for the cloud backend.

comment:10 Changed at 2013-05-01T02:46:30Z by daira

Fixed in fd819cea and ea335f74 on the cloud-rebased2 branch.

comment:11 Changed at 2013-05-01T02:46:58Z by daira

Keywords review-needed added
Owner changed from davidsarah to zooko
Status changed from assigned to new

comment:12 follow-up: ↓ 13 Changed at 2013-07-12T18:10:22Z by markberger

Description modified (diff)

It looks good to me, except for one line I have a question about. I left a comment on fd819cea about it.

Does the server file a local corruption report like Brian suggested? I can't seem to find code anywhere that does this.

comment:13 in reply to: ↑ 12 Changed at 2013-07-17T13:23:48Z by daira

Replying to markberger: