#303 closed enhancement (fixed)
mutable-file retrieval: choose highest available version, not first-seen; read k+epsilon
Reported by: | warner | Owned by: | warner |
---|---|---|---|
Priority: | major | Milestone: | 1.1.0 |
Component: | code-encoding | Version: | 0.7.0 |
Keywords: | mutable | Cc: | |
Launchpad Bug: |
Description
It is time to implement the "epsilon" part of mutable-file Retrieval's k+epsilon design.
The current Retrieval code keeps asking servers in succession until it gets enough shares (of any version) to download a complete+verified ciphertext. It stops as soon as it gets a complete set of 'k' shares for the same version.
We're currently vulnerable to rollback attacks: both intentional (a malicious storage server replaces the new shares with an older version, to cause the reader to see an older version of their directory, perhaps to delete a file that was just added), or accidental (a storage server was offline when a new version was published, and now it's back online again). The problem is worse with lower values of 'k'. I'd like to lower 'k' to 1 to speed up directory lookups, but I'm concerned about the rollback problem.
To avoid this, we need three things. The first is that we must talk to more servers than are strictly necessary: the minimum 'k' servers plus some number 'epsilon' for verification purposes. The second is that we should choose these extra servers somewhat at random (the attacker's job is harder if they don't know which servers they need to compromise). The third is that we should try to return the highest version found, rather than the first version which becomes available.
Zooko and I sketched out a refactoring of the mutable.Retrieve class that would make this easier to implement.
(note, of course, that the dirnode lookup speed is directly related to k+epsilon rather than just 'k', but I expect 1+epsilon to be smaller than 3+epsilon).
Change History (5)
comment:1 Changed at 2008-03-30T18:37:21Z by warner
- Milestone changed from undecided to 1.0.1
- Priority changed from major to critical
- Summary changed from mutable-file retrieval: choose highest available version, not first-seen to mutable-file retrieval: choose highest available version, not first-seen; read k+epsilon
comment:2 Changed at 2008-03-31T19:48:57Z by warner
Just documenting the failure we observed a bit more.. the unwriteable directory that Mike hit (prodnet SI b3h2hvaz4hkfhgcwbzz5cf64qe) reported the following sharemap during the "reconnaissance phase" of the Publish process:
15:27:570.861 L10 []#792 sharemap: sh0 on (5plhneww:#71:R=lmrs), sh1 on (odddkmp5:#71:R=lmrs), sh2 on (3fbcmylr:#72:R=iimc/p5raly75:#71:R=lmrs), sh3 on (3fbcmylr:#72:R=iimc), sh4 on (xy7xqa2y:#72:R=iimc/hlbmghg4:#71:R=lmrs), sh5 on (6ja7gns2:#71:R=lmrs/5txtefoo:#72:R=iimc), sh6 on (lc2u2rg7:#71:R=lmrs), sh7 on (xy7xqa2y:#72:R=iimc), sh8 on (5txtefoo:#72:R=iimc), sh9 on (pvndl3y2:#71:R=lmrs) 15:27:570.861 L10 []#793 we are planning to push new seqnum=#72 15:27:570.861 L30 []#794 the sequence number has changed unexpectedly
It becomes a bit easier to analyze if we convert this into the "servermap" form:
5plh: sh0:#71:lmrs oddd: sh1:#71:lmrs p5ra: sh2:#71:lmrs 3fbc: sh3:#72:iimc, sh2:#72:iimc hlbm: sh4:#71:lmrs 6ja7: sh5:#71:lmrs lc2u: sh6:#71:lmrs xy7x: sh7:#72:iimc, sh4:#72:iimc 5txt: sh8:#72:iimc, sh5:#72:iimc pvnd: sh9:#71:lmrs
Zooko and I concluded that the previous update (which put 72:iimc in place) was performed while the node was connected to only a subset of the servers. It was attached to 3 of the original 10 servers (specifically 3fbc/xy7x/5txt), because those servers had version 72 instead of 71. It was not attached to the other 7 original servers (5plh/oddd/p5ra/hlbm/6ja7/lc2u/pvnd), since those are still stuck at 71. It also had connections to between 2 and 4 other servers that were outside the original range, most likely 2, based upon the doubling-up of shares we observe here.
If it had connections to more than 4 other servers, it wouldn't have had to place multiple shares on 3fbc, xy7x, or 5txt. If it had fewer than 2 other servers, it would have had to place three shares on at least one of these nodes. We don't see these other nodes in the sharemap because the reconnaissance phase didn't bother to ask more than 10 servers about their current version.
I expect that there is a server out there with shares 0+6, and another one with 1+9.
comment:3 Changed at 2008-04-01T19:18:55Z by warner
- Owner set to warner
- Priority changed from critical to major
- Status changed from new to assigned
My plan:
- Retrieve needs to be arranged more like a state machine than a chain of Deferreds. The incoming events are responses to readv() messages.
- We'll create a new ServerMap object, which remembers which versions
are on which servers. This object will be used for both retrieval and
publishing purposes.
- MutableFileNode.update_servermap() will create one of these (or take an old one and update it). We might add some parameters later to say "update if it is too old" or something.
- ServerMap.highest_recoverable_version() will provide the version identifier for the "best" recoverable version. SDMF/MDMF uses (seqnum,roothash) tuples for the identifier.
- MutableFileNode.download_version() will take a servermap and a version identifier, and returns a string
- MutableFileNode.read_range() takes a servermap, a version identifier, and a byterange, and returns a string. Each time a segment is read from the remote server, the version checkstring should be read too.
- Publish operations also take a servermap, which means "do this publish if
and only if the servers mentioned in the map still have the same
versions". This provides test-and-set semantics. The layer above the
MutableFileNode should pass the servermap used by retrieve back into
publish, to avoid clobbering anyone else's changes.
- MutableFileNode.publish() takes a servermap and a string, and produces+publishes a new version. It updates the servermap.
- (someday) MutableFileNode.write_range() takes a servermap, a byterange, and a string.
- The MutableFileNode is allowed to cache response data, indexed by version identifier. This will allow two subsequent calls to read_range to avoid extra network traffic. We expect the node to only store one segment at a time.
- The servermap is passed into/outof API calls (rather than being cached inside the MutableFileNode for safety: if two uncoordinated parties hold a reference to the same node, they will need two distinct servermaps to avoid clobbering each other.
- keeping a MutableFileNode around for a while will improve performance (since the data cache can kept around for a limited time), at the expense of memory. For the immediate future, we're planning to have a weakref table in the Client (or in mutable.MutableWatcher) to avoid auto-collisions by ensuring that we only have a single node per write-cap. In the future we might consider adding a timer or LRU cache to this table.
- likewise, keeping a ServerMap around for a while can improve performance (since the map can be reused, avoiding redundant version queries), at the expense of updateness.
Also, since the #374 thundering-herd-reconnect fix is in, this problem is less likely to occur. So I'm lowering the priority, and giving myself more time to implement some longer-term refactoring improvements.
comment:4 Changed at 2008-04-23T18:50:17Z by warner
- Resolution set to fixed
- Status changed from assigned to closed
mutable file operations have now been refactored according to the plan described above, with the following exceptions:
- MutableFileNode has a small cache, but it only records the responses to a servermap update query, which means it will only contain a few kB of data. There is nothing present right now to record more than that.
- there is no weakref table to keep MutableFileNode instances around, and we have not yet implemented the auto-collision-avoidance
I've opened a new ticket for the auto-collision-avoidance issue (#391), and am closing the rest of this ticket.
comment:5 Changed at 2008-05-05T21:08:36Z by zooko
- Milestone changed from 1.0.1 to 1.1.0
Milestone 1.0.1 deleted
Mike hit this case, probably because his client was offline for a while, and the Reconnectors had ratched back to the point that some of them had reconnected and others hadn't.
I'm raising the priority of this one, since the failure is easier to hit than I'd previously thought.