#3457 new defect

The separate introducer servers represent unnecessary complexity in an overall Tahoe-LAFS deployment

Reported by: exarkun Owned by:
Priority: normal Milestone: undecided
Component: code-network Version: n/a
Keywords: Cc:
Launchpad Bug:

Description

A useful Tahoe-LAFS deployment consists of:

  • one or more clients
  • one or more storage servers
  • one or more introduction mechanisms
    • zero or more introducer servers; or
    • zero or more static storage server configuration blobs for clients

At least one client is required or no service is being consumed. At least one storage server is required or no service is being offered.

A client can be put in touch with a storage server either by:

  • receiving static configuration that gives it that storage server's storage announcement
  • receiving static configuration that lets it contact an introducer server; if the storage server is also configured to contact this introducer server then the introducer server will convey the storage server's announcement to the client.
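The two introduction paths can be modelled as a client merging announcements from two sources. This is a minimal sketch: `StorageAnnouncement`, `Introducer` and `known_servers` are hypothetical names, and real Tahoe-LAFS announcements carry more fields than an identifier and a fURL.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass(frozen=True)
class StorageAnnouncement:
    # Hypothetical minimal shape; real announcements carry more fields.
    server_id: str
    furl: str


@dataclass
class Introducer:
    # Stand-in for an introducer server: it holds whatever storage
    # servers have published to it and relays that to clients.
    announcements: Dict[str, StorageAnnouncement] = field(default_factory=dict)

    def publish(self, ann: StorageAnnouncement) -> None:
        self.announcements[ann.server_id] = ann


def known_servers(
    static_config: Iterable[StorageAnnouncement],
    introducers: Iterable[Introducer],
) -> Dict[str, StorageAnnouncement]:
    # Path 1: announcements given to the client directly in static configuration.
    servers = {ann.server_id: ann for ann in static_config}
    # Path 2: announcements relayed by each introducer the client contacts.
    # A storage server only shows up here if it also contacts that introducer.
    for intro in introducers:
        servers.update(intro.announcements)
    return servers
```

A client configured with one static announcement and one introducer ends up knowing about the union of both sources.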

A deployment with no introducer servers has at least two advantages over a deployment with introducers:

  • it is operationally simpler in that it involves fewer pieces of long-running software to operate
  • it does not introduce a point of failure into the system between clients and storage servers (though since multiple introducers are supported, it is possible to operate a collection of introducer servers so that they are not a *single* point of failure in the system, at the cost of additional operational complexity).

A deployment with one or more introducer servers has at least one advantage over a deployment with none:

  • a mechanism is provided to automatically inform clients of changes to the operating group of storage servers

If we can provide the automatic client updates without operating any introducer servers then this would seem to be a clear win, picking up all of the advantages from both of the current deployment options.

Change History (8)

comment:1 Changed at 2020-10-01T14:33:02Z by exarkun

Here are some tickets that are somewhat related as well:

comment:2 Changed at 2020-10-01T14:39:05Z by exarkun

Here's one idea. Take the introducer functionality of accepting announcements and delivering them to subscribers and allow it to be folded into a Tahoe-LAFS process of another sort, say, a storage server. Thus, operating a storage server would automatically provide introducer functionality. Since storage servers are already an essential component of a Tahoe-LAFS deployment, this does not add any new long-lived processes or operational components to a deployment. Since the introducer functionality is now available from storage servers, the dedicated introducer servers can be removed. This reduces the overall complexity of a deployment.

This does not solve all of the problems we have with introducers. For example, a client will still need a statically configured list of introducers. At least one of these introducers will have to remain online as long as clients with that configuration continue to operate.

It also does not reduce the load on the introducer component. Each introducer must handle all announcements relevant to the deployment. This adds a runtime cost of O(N), where N is the number of announcements, to each storage server that is also an introducer.

It does not defend against unreliable storage being announced, since such storage can now simply be announced to the introducer-in-storage-server instead of to the stand-alone introducer server.

It may mitigate the privacy concerns since clients are already going to maintain a long-lived connection to a storage server.

It does not address behavior in the face of misbehaving introducer clients.
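The storage-server-embedded introducer from this comment can be sketched as a small publish/subscribe hub living inside the storage server process. The class name and announcement shape are hypothetical, and the callbacks stand in for whatever remote notification mechanism a real implementation would use. The sketch makes the O(N) cost visible: the server holds every announcement, and each one is fanned out to every subscriber.

```python
from typing import Callable, Dict, List

# Hypothetical announcement shape, e.g. {"server_id": "...", "furl": "..."}.
Announcement = Dict[str, str]


class StorageServerWithIntroducer:
    """Sketch of a storage server that also accepts and relays announcements.

    Storage-serving behaviour is omitted; only the folded-in introducer
    role is modelled.
    """

    def __init__(self) -> None:
        # Every announcement in the deployment ends up held here: O(N) state.
        self.announcements: Dict[str, Announcement] = {}
        self.subscribers: List[Callable[[Announcement], None]] = []

    def subscribe(self, callback: Callable[[Announcement], None]) -> None:
        # A new subscriber is first sent everything already known...
        self.subscribers.append(callback)
        for ann in list(self.announcements.values()):
            callback(ann)

    def publish(self, ann: Announcement) -> None:
        # ...and then every later announcement, fanned out to all subscribers.
        self.announcements[ann["server_id"]] = ann
        for callback in list(self.subscribers):
            callback(ann)
```

Nothing here restricts who may call `publish`, which is the "unreliable storage being announced" problem carried over unchanged from the stand-alone introducer.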

comment:3 Changed at 2020-10-01T14:53:38Z by exarkun

Here's another idea. An entity creates and maintains a mutable list of announcements *on a grid*. It sets the encoding parameters for this object so that a full copy exists on every storage server which appears in the list of announcements. Storage clients are given bootstrap configuration which consists of a recent-enough copy of this list. If any storage server in the list is still reachable then the current copy of the list can be retrieved and the client can update its persisted configuration with that copy. Over time the storage client can continue to retrieve updates to this list. As long as the client retrieves an update before *all* storage servers it knew about become unreachable it can always find the latest configuration.
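The client-side update step of this idea can be sketched as follows, assuming a hypothetical `fetch` function that reads the on-grid list from one named server. The integer version loosely mirrors the sequence number carried by Tahoe-LAFS mutable files; comparing versions keeps a stale copy on one server from replacing a newer persisted list.

```python
from typing import Callable, List, Tuple

# (version, storage server identifiers); a hypothetical representation.
AnnouncementList = Tuple[int, List[str]]


class Unreachable(Exception):
    """Raised by the hypothetical fetch function when a server is offline."""


def refresh(
    persisted: AnnouncementList,
    fetch: Callable[[str], AnnouncementList],
) -> AnnouncementList:
    """Update the persisted list from any server named in it.

    Every listed server holds a full copy, so one reachable server suffices.
    If none is reachable, the persisted list is returned unchanged.
    """
    best = persisted
    for server in persisted[1]:
        try:
            candidate = fetch(server)
        except Unreachable:
            continue
        # Only accept a copy that is strictly newer than what we have.
        if candidate[0] > best[0]:
            best = candidate
    return best
```

The failure mode described above is the case where every call to `fetch` raises: the client keeps its old list and can no longer follow changes.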

This mostly solves the problem of the static introducer list. There is still a bootstrap configuration but there is a mechanism to update it over time.

It has ... some ... impact on overall load on the system. Every storage server in a grid now has to maintain a copy of the list of announcements. However, the list doesn't have to be updated as frequently anymore. It only needs to be changed when the actual list of storage servers changes (currently, by contrast, an announcement is sent every time a storage server process *starts*). This probably means the storage requirements are higher (but still essentially negligible) and the runtime requirements are lower. Since there is no notification mechanism, clients will have to poll the storage object to find updates; however, this polling can be quite low frequency, and there are a number of reasons we want to *add* a notification mechanism to Tahoe-LAFS anyway, at which point it could be leveraged to remove the polling here.

It defends against unreliable storage being announced since now only "an entity" can control the list of announcements. It does this by completely denying open participation, of course. However, *any* entity could choose to maintain one of these lists containing *any* announcements that entity likes. Clients can pick the list (or lists!) they want to follow.

It should remove all privacy concerns since there is now no longer any difference between consuming storage for normal purposes and obtaining the storage announcement list (as it is the same as any other data on the grid).

It also addresses behavior in the face of misbehaving introducer clients. The entity managing the list might misbehave, though. In some ways that entity becomes the single point of failure in the system. For example, they might lose their keys and become unable to distribute further storage server updates. Recovering from that would require reconfiguring all clients to follow a new mutable object.

comment:4 Changed at 2020-10-01T17:11:38Z by exarkun

I think the core of the latter idea above is that using Tahoe-LAFS storage itself to propagate this information represents the minimum complexity for solving the problem that the introducer servers currently solve. A solution which is simpler in isolation may exist but since the purpose is to let clients use the storage servers, total system complexity cannot fall below that required to make use of those storage servers.

comment:5 Changed at 2020-10-01T19:30:20Z by exarkun

Here's a more concrete elaboration of the latter idea above.

The introducer-v3 manages a mutable directory on a grid (typically the grid for which it is the manager). This is the grid service directory (GSD). It encodes the mutable directory so that any single share from any single server in the grid is sufficient to reconstruct its contents. It holds the writecap for the directory in secret and, for certain operations, shares the readcap.
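"Any single share is sufficient" means 1-of-N encoding with one share per storage server. The sketch below computes such parameters; the names loosely mirror the `shares.needed`, `shares.total` and `shares.happy` settings in `tahoe.cfg`, but the class and function are hypothetical and are not how introducer-v3 would actually configure the GSD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodingParameters:
    needed: int  # k: shares required to reconstruct the object
    total: int   # N: shares produced
    happy: int   # distinct servers that must receive a share

    @property
    def expansion(self) -> float:
        # Stored bytes divided by original bytes.
        return self.total / self.needed


def gsd_encoding(server_count: int) -> EncodingParameters:
    """1-of-N encoding: one share per server, any single share suffices."""
    if server_count < 1:
        raise ValueError("the grid needs at least one storage server")
    return EncodingParameters(needed=1, total=server_count, happy=server_count)
```

With `needed=1` the erasure coding degenerates into full replication, so the expansion factor equals the number of servers. For a small directory of announcements that overhead stays negligible, which matches the storage estimate in comment 3.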

Into this directory, introducer-v3 will link readcaps for grid service announcements (GSAs) it wishes to share with any of its clients. These readcaps yield objects containing, for example, a storage service announcement.

A client is granted access to a grid by receiving two pieces of information. First, the readcap for the GSD. Second, one or more storage service fURLs. The client connects to any one storage service using the fURLs supplied and reads the GSD and all GSAs. Among the GSAs should be storage service announcements for all storage servers related to the introducer-v3 granting access. At this point the client can connect to and use all of the necessary storage servers. The client will monitor the GSD and GSAs to remain up to date with any reconfigurations which might take place (addition of new GSAs, removal of old GSAs, modifications to existing GSAs).

A manual CLI for this workflow might go something like this.

First, the operator of the introducer-v3 produces the two pieces of information required by the client:

$ introducer-v3 get-introduction-text
<dir readcap>,<storage furl>[,<storage furl>,...]

Then the information is shared out-of-band with the client operator, who enters it when creating their node:

$ tahoe create-node --introducer-v3 <dir readcap>,<storage furl>[,<storage furl>,...] ...

The given configuration is recorded for use when the node is running.

When the node is started it:

  1. uses one of the <storage furl> values to connect to a storage service
  2. retrieves the GSD (using the <dir readcap>)
  3. fetches each GSA in the GSD
  4. rewrites its configuration to record all storage announcements found in the GSAs

Thereafter, it periodically repeats steps 2-4 to ensure it remains up-to-date with any changes made by its introducer.
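The startup sequence and periodic refresh can be sketched with the grid operations passed in as plain callables. `connect`, `read_dir`, `read_file` and `save_config` are hypothetical stand-ins, not Tahoe-LAFS client APIs, and the one-hour default interval is an arbitrary assumption.

```python
import time
from typing import Callable, Dict, List, Optional


def connect_any(storage_furls: List[str], connect: Callable[[str], object]):
    """Step 1: use whichever configured storage fURL answers first."""
    for furl in storage_furls:
        try:
            return connect(furl)
        except ConnectionError:
            continue
    raise ConnectionError("no configured storage service is reachable")


def refresh(conn, dir_readcap: str, read_dir, read_file, save_config) -> Dict:
    """Steps 2-4: read the GSD, fetch every GSA, persist the announcements."""
    gsd = read_dir(conn, dir_readcap)  # child name -> GSA readcap
    announcements = {name: read_file(conn, cap) for name, cap in gsd.items()}
    save_config(announcements)  # replaces the previously recorded set
    return announcements


def run(storage_furls, dir_readcap, connect, read_dir, read_file, save_config,
        interval_seconds: float = 3600.0, rounds: Optional[int] = None,
        sleep=time.sleep) -> None:
    """Connect once, then repeat steps 2-4 every interval_seconds."""
    conn = connect_any(storage_furls, connect)
    done = 0
    while rounds is None or done < rounds:
        refresh(conn, dir_readcap, read_dir, read_file, save_config)
        done += 1
        sleep(interval_seconds)
```

Because step 4 replaces the whole recorded set, a GSA unlinked from the GSD disappears from the client's configuration on the next round.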

A storage server can be taken out of service by unlinking its GSA from the GSD.

$ introducer-v3 remove-announcement <readcap>

A storage server can self-report configuration changes (such as its location hints) by rewriting its own GSA. It takes care to encode the GSA so that every storage server known to it receives a share that is sufficient to reconstruct the GSA.

A new storage server can be commissioned by:

  1. receiving access to the grid in the manner any client would
  2. writing its GSA to a mutable file
  3. sharing the readcap for its GSA with the introducer-v3
  4. the introducer-v3 operator links the readcap into the GSD:

     $ introducer-v3 add-announcement <readcap>
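The add and remove operations can be modelled with a toy GSD in which only the writecap holder may link or unlink GSAs. This is a hypothetical in-memory sketch: a real GSD is a Tahoe-LAFS mutable directory with named children, whereas here it is just a set of GSA readcaps and the writecap is a secret string checked on every change.

```python
import hmac
from typing import FrozenSet, Set


class GridServiceDirectory:
    """Toy GSD: a set of linked GSA readcaps guarded by a writecap."""

    def __init__(self, writecap: str) -> None:
        self._writecap = writecap
        self._gsa_readcaps: Set[str] = set()

    def _require_writecap(self, writecap: str) -> None:
        # Constant-time comparison of the presented secret.
        if not hmac.compare_digest(writecap.encode(), self._writecap.encode()):
            raise PermissionError("only the introducer-v3 may change the GSD")

    def add_announcement(self, writecap: str, gsa_readcap: str) -> None:
        # Commissioning step 4: link a new storage server's GSA.
        self._require_writecap(writecap)
        self._gsa_readcaps.add(gsa_readcap)

    def remove_announcement(self, writecap: str, gsa_readcap: str) -> None:
        # Decommissioning: unlink the GSA so later reads no longer include it.
        self._require_writecap(writecap)
        self._gsa_readcaps.discard(gsa_readcap)

    def read(self) -> FrozenSet[str]:
        # What any holder of the readcap sees.
        return frozenset(self._gsa_readcaps)
```

The new storage server only ever hands over a readcap for its GSA, so it keeps the ability to rewrite its own announcement while the introducer-v3 alone decides whether that announcement is linked.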

comment:6 Changed at 2020-10-01T19:35:39Z by exarkun

One thing I don't know is how easily a mutable object can be spread across N servers and then expanded to be available on N+1 servers.

comment:7 Changed at 2020-10-02T19:46:48Z by exarkun

Note that fURLs can contain commas (between multiple location hints), so the syntax examples above aren't literally possible.
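The problem shows up as soon as a fURL has more than one location hint: splitting the introduction text on commas cuts that fURL in two. The fURLs and readcap below are made-up placeholders, and the JSON encoding is only one hypothetical unambiguous alternative, not a format anyone has settled on.

```python
import json
from typing import List, Tuple


def encode_naive(dir_readcap: str, storage_furls: List[str]) -> str:
    # The "<dir readcap>,<storage furl>[,<storage furl>,...]" form.
    return ",".join([dir_readcap, *storage_furls])


def decode_naive(text: str) -> Tuple[str, List[str]]:
    # Splits on every comma, including those between a fURL's location hints.
    dir_readcap, *storage_furls = text.split(",")
    return dir_readcap, storage_furls


def encode_json(dir_readcap: str, storage_furls: List[str]) -> str:
    # Hypothetical alternative: each value is a separate JSON string.
    return json.dumps({"gsd-readcap": dir_readcap, "storage-furls": storage_furls})


def decode_json(text: str) -> Tuple[str, List[str]]:
    obj = json.loads(text)
    return obj["gsd-readcap"], obj["storage-furls"]


furls = [
    "pb://t1@tcp:a.example:1234,tcp:b.example:1234/s1",  # two location hints
    "pb://t2@tcp:c.example:1234/s2",
]
print(decode_naive(encode_naive("URI:DIR2-RO:aaaa:bbbb", furls))[1])
```

The naive decoder returns three "fURLs" for two inputs, the first two of which are fragments of the same fURL; the JSON form round-trips exactly.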

comment:8 Changed at 2020-10-30T14:37:32Z by exarkun

Here's a draft PR that just has some docs in it with the grid introducer idea - https://github.com/tahoe-lafs/tahoe-lafs/pull/882
