[tahoe-dev] Node correlations - [Was] best practice for wanting to setup multiple tahoe instances on a single node
Wim Lewis
wiml at hhhh.org
Tue Jan 17 02:50:40 UTC 2012
[Attention Conservation Notice: I'm new to Tahoe, so this may be naïve
prattling.]
On 1/16/12 12:37 PM, Nathan Eisenberg wrote:
> Don't forget the service provider/datacenter model should be accounted for in the naming convention
Several things that I might reasonably want to include in a selector
algorithm aren't really deducible from physical location:
- are these nodes served by the same network infrastructure?
(correlated connectivity failures)
- are these nodes operated by the same organization? (correlated
intentional outages: FooCorp suddenly goes out of business or decides to
stop providing storage to this grid)
- does a given node advertise certain availability/longevity/bandwidth
properties? (e.g., a grid may be composed of a mixture of fast but
failure-prone nodes on coworkers' or friends' disks, slow-but-reliable
nodes, long-lived but intermittently-available nodes, etc.)
These properties don't fit into a hierarchy, and they may or may not be
correlated with one another.
Here's another zero-point-one'th-cut proposal (two rough code sketches
follow the list):
- The introducer holds an unordered set of key-value pairs for each
storage node, and provides these to clients for use in their selection
algorithms. If this is provided via a Tahoe read-cap (one per introduced
storage node?), then it is up to the individual grid whether these are
immutable files or mutable; and if mutable, who has the write-caps; etc.
- The grid administration defines the set of keys and the set of values
each key may have. Keys might be "city", "racknumber", "AS#", etc.,
depending on the policy and use-cases of that particular grid. The grid
administration makes sure that each node that joins provides a value for
every key. "Unknown"/"undisclosed" may be an acceptable value for a
given key, or it may not, depending on the grid.
- Grid administration could maintain a list of hierarchical inferences
("if you're in datacenter Foo, then you're in this city, on this
network, etc."), and whatever tool is used to configure a storage node
might consult this list. That would be outside the realm of Tahoe
proper, though. Client nodes would just operate on the keys-and-values
provided by the introducer.
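To make the first two points concrete, here is a rough Python sketch of
what a grid-defined schema, a node's key-value announcement, and a toy
client-side selector filter might look like. This is not existing Tahoe
code or API; the names (GRID_SCHEMA, validate_announcement,
select_servers) and the example keys/values are all invented here for
illustration.

    # Hypothetical sketch -- not actual Tahoe-LAFS code or API.

    # The grid administration publishes the universe of keys and the
    # allowed values for each key; "unknown" is permitted for some keys
    # and not for others, per grid policy.
    GRID_SCHEMA = {
        "city":       {"portland", "seattle", "nuernberg", "unknown"},
        "operator":   {"foocorp", "barco", "volunteer"},
        "racknumber": {"r1", "r2", "r3", "unknown"},
        "asn":        {"as64500", "as64501"},
    }

    def validate_announcement(metadata):
        """Reject a storage node's key-value pairs unless every schema
        key is present and carries an allowed value."""
        for key, allowed in GRID_SCHEMA.items():
            if key not in metadata:
                raise ValueError("missing key: %s" % key)
            if metadata[key] not in allowed:
                raise ValueError("bad value %r for key %s"
                                 % (metadata[key], key))

    # What one introduced storage node might advertise:
    example_node = {
        "city": "nuernberg",
        "operator": "foocorp",
        "racknumber": "r2",
        "asn": "as64500",
    }
    validate_announcement(example_node)

    def select_servers(nodes, shares_needed, key="operator"):
        """Toy selector: spread shares across distinct values of `key`
        (e.g. never give two shares to the same operator) for as long
        as distinct values remain, then fall back to whatever is left."""
        chosen, seen = [], set()
        for node_id, md in nodes.items():
            if md[key] not in seen:
                chosen.append(node_id)
                seen.add(md[key])
            if len(chosen) == shares_needed:
                return chosen
        # Not enough distinct values; fill with the remaining nodes.
        for node_id in nodes:
            if node_id not in chosen:
                chosen.append(node_id)
            if len(chosen) == shares_needed:
                break
        return chosen

    servers = {
        "node-a": example_node,
        "node-b": {"city": "portland", "operator": "barco",
                   "racknumber": "r1", "asn": "as64501"},
    }
    print(select_servers(servers, shares_needed=2))  # ['node-a', 'node-b']

The point of the sketch is only that clients see a flat bag of
grid-validated pairs and can run whatever filter they like over it.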
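For the third point, a node-configuration tool could expand a
grid-maintained list of hierarchical inferences into flat key-value
pairs before the node ever announces itself, so clients never need to
know about the hierarchy. Again a hypothetical sketch; the inference
table and function names are made up.

    # Hypothetical sketch of the "hierarchical inferences" list that a
    # node-configuration tool (outside Tahoe proper) might apply.

    # "If you're in datacenter foo, then you're in this city, on this
    # network, etc."
    INFERENCES = {
        ("datacenter", "foo"): {"city": "portland",  "asn": "as64500"},
        ("datacenter", "bar"): {"city": "nuernberg", "asn": "as64501"},
    }

    def expand(metadata):
        """Fill in any keys implied by what the node operator entered,
        without overriding values they supplied explicitly."""
        expanded = dict(metadata)
        for (key, value), implied in INFERENCES.items():
            if expanded.get(key) == value:
                for ikey, ivalue in implied.items():
                    expanded.setdefault(ikey, ivalue)
        return expanded

    # The operator only says where the box physically is; the tool
    # derives the rest, and the introducer/clients see only flat pairs.
    print(expand({"datacenter": "foo", "operator": "foocorp"}))
    # -> {'datacenter': 'foo', 'operator': 'foocorp',
    #     'city': 'portland', 'asn': 'as64500'}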
The things I like about this setup:
- Having the universe of keys *and* values explicitly defined by the
grid avoids folksonomy-tag-clutter and ambiguous-naming problems
(Nuremberg/Nürnberg) and makes it easier for someone to write a selector
algorithm that will do what they want.
- Hierarchy, if it exists, can be handled by node-administration tools
(if they exist), without complicating the selector algorithms or the
Tahoe implementation.
- It's flexible enough to handle weird experimental selector rules,
paranoid or laissez-faire grid administration, etc., without pushing
that complexity into Tahoe itself.