[tahoe-lafs-trac-stream] [Tahoe-LAFS] #2875: Multiples storage servers can be registered with the same FURL (UncoordinatedWriteError on I2P grid)

Tue Jul 4 19:08:38 UTC 2017

#2875: Multiples storage servers can be registered with the same FURL
(UncoordinatedWriteError on I2P grid)
------------------------------+------------------------------
     Reporter:  nwks          |      Owner:
         Type:  defect        |     Status:  new
     Priority:  normal        |  Milestone:  undecided
    Component:  code-network  |    Version:  1.12.1
   Resolution:                |   Keywords:  i2p introduction
Launchpad Bug:                |
------------------------------+------------------------------

Comment (by warner):

 Thanks for the detailed investigation!

 Storage servers in Tahoe are primarily identified by their !Ed25519 public
 verifying key, and the FURL is supposed to be an attribute of the storage
 server (rather than an identifier). The idea going forward is that some
 storage servers won't even speak Foolscap, so they won't have FURLs
 (they'll probably have `http://` URLs instead).

 It wasn't always that way. In the beginning, we were so focused on
 Foolscap that we used the "tubid" portion of the Foolscap FURL as a node
 identifier (and we were so focused on P2P that we frequently called it a
 "peerid"). But when we started thinking about non-Foolscap nodes, we
 realized that was a mistake.

 In 1.12 we finally switched from using the FURL-derived tubid to using the
 new !Ed25519 pubkey. Servers have signed their Introducer announcements
 using the corresponding private signing key since the 1.10 release.
 Clients build local objects (instances of
 `allmydata.storage_client.NativeStorageServer`) to represent them, and
 those `NativeStorageServer` objects are indexed by a "serverid".

 I did a big pass to hand these server objects around instead of string-ish
 identifiers, and then to consistently use the word "serverid" in the code
 when we did need to talk about the server objects. And then we made a
 second change to build the serverid from the !ed25519 key instead of the tubid,
 except in a backwards-compatible case where the (older) server wasn't
 publishing an !ed25519 key. In 1.12, we removed support for these old
 unsigned announcements, so all the client-side server objects ought to
 have a pubkey-based serverid now, and this backwards-compatibility code
 was removed.

 So, misconfigured servers that announce the wrong FURL are a problem, but
 it should be only the server's problem (clients won't be able to connect
 to that server: it'd be like you giving me the wrong email address and
 then me wondering why my emails weren't getting through, or going to the
 wrong person). But if the client is getting confused by that, that's
 something we need to fix.

 So the client thinks it is talking to two distinct servers, but both
 connections happen to reach the same one. So, like you said, the client
 will send two shares to the same server, and it will also see both shares
 as appearing at both servers.

 One of those shares is wrong: the lease renew/cancel secrets will be
 computed for the wrong server, and the "write enabler" secret (which
 authorizes mutations of mutable shares) will be wrong. So when the client
 attempts to modify both of the shares that it sees, it will get an
 exception for one of them (or the request will be ignored: it's a critical
 distinction, but I forget how the code currently behaves).

 But most critically, as you discovered, writing to share 1 on "server A"
 will cause share 1 on "server B" to spontaneously change, which looks
 exactly like an `UncoordinatedWriteError`.
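
 A toy model may make the aliasing easier to see. This is only a sketch:
 none of these classes exist in Tahoe-LAFS, and the serverid strings are
 made up. It assumes two client-side handles with different serverids whose
 announced FURL reaches the same backing server.

```python
# Toy model only: these classes are invented for illustration and are
# not part of Tahoe-LAFS.

class ToyStorageServer:
    """One real server process, holding shares keyed by share number."""
    def __init__(self):
        self.shares = {}


class ToyServerHandle:
    """Client-side handle: a serverid plus whatever the announced FURL reaches."""
    def __init__(self, serverid, backend):
        self.serverid = serverid
        self._backend = backend

    def write_share(self, sharenum, data):
        self._backend.shares[sharenum] = data

    def read_share(self, sharenum):
        return self._backend.shares.get(sharenum)


real_server = ToyStorageServer()

# Two announcements with different keys but the same FURL: the client
# builds two handles, yet both end up at the same backend.
server_a = ToyServerHandle("key-A", real_server)
server_b = ToyServerHandle("key-B", real_server)

server_a.write_share(1, b"version 1")
before = server_b.read_share(1)

# The client only writes through "server A" ...
server_a.write_share(1, b"version 2")
after = server_b.read_share(1)

# ... but the share it believes lives on "server B" changed too.
share_changed_behind_our_back = before != after
```

 From the client's point of view, the share on "server B" changed without
 the client writing to it, which is what an uncoordinated write by someone
 else would look like.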

 The shallow solution might be to have the client watch for duplicate
 FURLs, and complain, or reject the second one as a duplicate. That's not
 very satisfying though, because a server that wants to cause problems
 could duplicate the FURL of any other known server, and then if it got
 lucky and was processed first, it could prevent you from reaching its
 victim server. The malicious server can't pretend to be someone else
 (because of the keys), but it *can* (sometimes) prevent you from talking
 to someone else, which isn't great.

 The deeper solution, which I guess I should have thought about when I
 first implemented the !ed25519 key scheme, is that the server needs to
 prove its control over that key when the client connects to it, instead of
 merely when it publishes the announcement. Or, at least the server needs a
 way to let the client know which server they just connected to, and the
 client should check this before sending any shares.

 I need to think about this more. Getting a strong two-way binding between
 serverid (the !ed25519 pubkey) and the FURL sounds tricky, but also I
 don't think we strictly need it. My immediate thought was to have the
 server (at the announced FURL) return a signed JSON blob which contains
 the actual FURL to use. But I don't think that solves the problem at all.

 A not-so-strong binding might be sufficient: put the server's !ed25519
 serverid in the VERSION blob that it hands out to all clients that connect
 to the FURL. The client would check this value against the serverid they
 were intending to connect to. So we know that the !ed25519 signing-key
 holder wanted to use this FURL, and we know that this FURL wants to be
 known by the matching verifying key.

 Actually, that second approach *does* sound like a strong bidirectional
 binding. The VERSION blob is controlled by the FURL-owner (the rogue
 server has no way to control the object pointed to by the victim server's
 FURL, it can merely send the client to somebody else's object), and
 Foolscap provides a transport channel bound to the FURL, so nobody else
 can see or modify that blob.

 Ok, so the task is then:

 * update the VERSION blob to return the !ed25519 serverid
 (`remote_get_version`, in src/allmydata/storage/server.py), probably in a
 key named `serverid`
 * update `NativeStorageServer._got_versioned_service` in
 src/allmydata/storage_client.py to compare `rref.version["serverid"]`
 against `self._server_id`, and if they don't match, then... fail somehow

 What about backwards compatibility? Most servers won't be publishing this
 `serverid` VERSION key yet. We want a way for grids that are having this
 problem to be able to fix it, but without causing a flag day for everyone.
 We could say that the client only checks for equality if the server
 actually publishes its own serverid. Then once the victim server (the one
 whose FURL is being copied by someone else) and the victim clients have
 upgraded, the new clients should fail to establish a working connection to
 the bogus server.
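
 The check described above, including the "only compare if the server
 publishes it" rule, could be sketched roughly as follows. This is a
 standalone illustration, not the real `_got_versioned_service`: the
 `check_serverid` and `may_use_server` helpers and the `ServeridCheck` enum
 are hypothetical, and the VERSION blob is modelled as a plain dict with the
 proposed `serverid` key.

```python
from enum import Enum


class ServeridCheck(Enum):
    MATCH = "match"
    MISMATCH = "mismatch"
    NOT_PUBLISHED = "not published"  # older server: no "serverid" key yet


def check_serverid(expected_serverid, version):
    """Compare the serverid we meant to reach with the one in the VERSION blob.

    `version` stands in for the blob returned by the server; the "serverid"
    key is the proposed (not yet existing) addition.
    """
    claimed = version.get("serverid")
    if claimed is None:
        return ServeridCheck.NOT_PUBLISHED
    if claimed == expected_serverid:
        return ServeridCheck.MATCH
    return ServeridCheck.MISMATCH


def may_use_server(expected_serverid, version):
    """Backwards-compatible policy: only an explicit mismatch is rejected."""
    return check_serverid(expected_serverid, version) is not ServeridCheck.MISMATCH


# An upgraded victim server reports its own serverid ("key-A" is made up).
victim_version = {"serverid": "key-A"}

ok_for_real_announcement = may_use_server("key-A", victim_version)
ok_for_copied_furl = may_use_server("key-B", victim_version)
ok_for_old_server = may_use_server("key-C", {})
```

 With this policy, an announcement that copies the victim's FURL under a
 different key is rejected once the victim server has upgraded, while
 servers that do not publish a serverid yet keep working as before.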

 The other interesting question is how it should fail. We could drop the
 TCP connection outright, but the client will treat that as a network error
 and begin to reconnect right away. We probably need to prevent
 reconnections until we get a new announcement (hopefully with a better
 FURL, although realistically that may never happen). Mainly we need to
 make sure the `StorageFarmBroker` never returns this incomplete/unusable
 `NativeStorageServer` object to the uploader/downloader code, which means
 not setting `self._is_connected = True` in `_got_versioned_service`.

 So maybe it'd be enough to let the connection remain up, but leave
 `_is_connected` at False, so the broker won't actually use it for
 anything. That wouldn't cause any reconnections to happen. It'd be a waste
 of a file descriptor, but probably easier than any other fix. With more
 effort, we could change `NativeStorageServer` to have an additional state
 (beyond "connected", "connecting", "waiting"), something like "no longer
 interested", which shuts down the `Reconnector`. We could also set the
 connection status to "serverid did not match", to tell the user what went
 wrong.
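
 A minimal sketch of the "leave the connection up but never mark it
 connected" idea, under the assumption that the broker only hands out
 servers whose connected flag is set. `SketchServer` and
 `get_connected_servers` are invented stand-ins for `NativeStorageServer`
 and the `StorageFarmBroker` lookup, not their real code.

```python
class SketchServer:
    """Invented stand-in for a client-side server object."""

    def __init__(self, serverid):
        self.serverid = serverid
        self.is_connected = False
        self.status = "connecting"

    def got_version(self, version):
        """Called once the (hypothetical) VERSION dict has been fetched."""
        claimed = version.get("serverid")
        if claimed is not None and claimed != self.serverid:
            # Leave is_connected at False so the broker never uses this
            # server, and record why, so the user can see what went wrong.
            self.status = "serverid did not match"
            return
        self.is_connected = True
        self.status = "connected"


def get_connected_servers(servers):
    """Stand-in for the broker: only connected servers are handed out."""
    return [s for s in servers if s.is_connected]


good = SketchServer("key-A")
bogus = SketchServer("key-B")  # announced with key-B but copied key-A's FURL

# Both connections reach the same server, which reports "key-A".
good.got_version({"serverid": "key-A"})
bogus.got_version({"serverid": "key-A"})

usable = [s.serverid for s in get_connected_servers([good, bogus])]
```

 Here `usable` contains only the server whose announced key matches, and the
 mismatched one carries a status string that could be shown to the user.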

 Open question: is there anything we can do to mitigate this *without* a
 server upgrade? The client can look for duplicate FURLs (the "shallow"
 fix), which will tell us that something is going wrong, but I don't think
 we have enough information to know which connection is the right one, so
 the best we could do is display a warning message somewhere.
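
 The warning-only "shallow" check could look something like this. It is a
 sketch under stated assumptions: announcements are reduced to
 `(serverid, furl)` pairs, `find_duplicate_furls` is a hypothetical helper,
 and the FURLs and keys shown are made up.

```python
from collections import defaultdict


def find_duplicate_furls(announcements):
    """Return {furl: [serverids]} for every FURL announced under more than
    one distinct serverid.

    `announcements` is an iterable of (serverid, furl) pairs.
    """
    by_furl = defaultdict(set)
    for serverid, furl in announcements:
        by_furl[furl].add(serverid)
    return {
        furl: sorted(serverids)
        for furl, serverids in by_furl.items()
        if len(serverids) > 1
    }


announcements = [
    ("key-A", "pb://tubid-1@example.invalid:1234/swissnum-1"),
    ("key-B", "pb://tubid-1@example.invalid:1234/swissnum-1"),  # copied FURL
    ("key-C", "pb://tubid-2@example.invalid:1234/swissnum-2"),
]

duplicates = find_duplicate_furls(announcements)
warnings = [
    "FURL %s is announced by multiple serverids: %s" % (furl, ", ".join(ids))
    for furl, ids in sorted(duplicates.items())
]
```

 Note that the result only says that several keys claim the same FURL. It
 cannot say which key is the legitimate one, so a warning is all the client
 can do with it.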

 We could have the *server* look for duplicate FURLs, but again the best it
 can do is display a warning somewhere.

 We could have each server subscribe to hear about other servers (they
 currently do anyways, but only because we haven't yet built a "server-
 only" node, which we totally want to do). Then if server A sees someone
 else announce its own FURL, it could complain somehow. But if we're
 changing server code to do that, then we could jump ahead to having the
 server publish its own serverid.

--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2875#comment:8>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage