[tahoe-dev] [tahoe-lafs] #599: maybe add share-metadata: "where-are-the-other-shares" hints
tahoe-lafs
trac at allmydata.org
Wed Jan 6 12:59:30 PST 2010
#599: maybe add share-metadata: "where-are-the-other-shares" hints
--------------------------+-------------------------------------------------
Reporter: warner | Owner:
Type: enhancement | Status: new
Priority: major | Milestone: undecided
Component: code-storage | Version: 1.2.0
Keywords: download | Launchpad_bug:
--------------------------+-------------------------------------------------
Comment(by warner):
I was thinking more about this last night, in the context of #872, #447,
#573, and other enhanced peer-selection goals. Many of these tickets
express
a desire to improve certain properties of the distributed files, typically
reliability (by improving geographical diversity), reduced repair cost
(getting k+1 shares into each datacenter), increased download speed
(placing
several shares near each downloader), and increased fairness of
reliability
over time (by trying to fill all servers at the same time instead of at
the
same rate, flattening out the entropy(peer-selection-process)-vs-time
curve
as described in comment:ticket:872:4).
All of these goals are hard to accomplish right now, because we want small
fixed readcaps for each file, and that doesn't leave the downloader with a
lot of information available to do the matching peer-selection algorithm.
The
fact that readcaps are immutable means that the repairer/rebalancer (which
need to add shares to new places, and sometimes remove them from old ones)
are changing the serverlist, and yet the downloader must be able to find
the
new shares efficiently using the old (fixed) readcap. Even if the file
isn't
repaired and all shares exist in the same places, the system needs to be
tolerant of server churn.
The Tahoe2 selection algorithm provides small fixed readcaps, handles
share
movement due to repair (assuming the repairer uses Tahoe2 too), and has
reasonable tolerance of server churn (adding 10% new servers will increase
the downloader's search distance by an average of 10%). And it provides
fill-at-same-rate load-balancing behavior, which is easy to understand and
happens to give you fill-at-same-time behavior if you have homogeneous
servers. But it is hard to accomodate any of the other goals contained in
the
tickets above, or to trade off some goals against others.
"All problems in computer science can be solved by introducing another
layer
of abstraction". In the context of this ticket, we could accomplish more
of
these goals simultaneously by introducing an intermediate
where-are-the-shares table. This table would contain a mutable mapping
from
storage index to:
* verifycap
* share locations: list of (serverid, confidence level)
* other serverlist locations: list of serverids
The serverid strings would merely need to be short prefixes of the full
serverids, perhaps as little as 4 bytes, since they would be expanded by
looking for matches in the full serverlist, and infrequent collisions
would
merely increase the search distance slightly. (this approach, rather than
including a full FURL, trades off space against continued dependence on an
Introducer and/or a notion of enumerable grid membership).
Each storage server would offer access to "location slots" in this table,
just like they currently offer mutable-share slots. However, the slots
contain plaintext, and would use have writecaps and readcaps. Instead, the
remote operations would look like:
* allocate location slot
* later, set storage index and verifycap (once only)
* propose locations (serverid1, serverid2, ...)
* propose removal (serverid1, serverid2, ...)
* fetch share location list
* fetch other serverlist location list
The storage servers would be responsible for confirming the presence or
absence of shares themselves. To this end, each table entry has a
"confidence
level", which the server uses to keep track of how much validation it has
performed. When a "propose location" message arrives, the serverids are
added
with the lowest confidence level (0). The server sends do-you-have-share
messages to those servers, and if they claim "YES", the level is raised to
1.
Later, when bandwidth allows, the server can fetch a couple of blocks at
random, and the hash chains, and make sure everything verifies up to the
locally-held verifycap, and then raises the level to 2. Eventually, the
server can fetch and verify the entire share, and raise the level to 3.
Upon receipt of a "propose removal" message, the server marks the entry as
"possibly removed", and queries the indicated server directly. If and only
if
the server response to the DYHB with a "NO", then the entry is actually
removed.
The server would need to run a service to manage this verification
process,
and to allocate a certain amount of bandwidth to it. This manager should
also
query the other serverlist holders, to update each other on the current
share
locations.
The idea is that shares would be uploaded to completely arbitrary
locations,
to accomplish whatever placement goals were desired (load balancing,
download
speed, datacenter diversity, etc). Then the location slots would be
allocated
on well-known servers (using the normal Tahoe2 algorithm), and filled with
the locations of the shares. The location slot holders would tell each
other
about the shares to flood this information out to all slots.
A downloader would use Tahoe2 to find at least one location slot, read the
serverlist out of it, and then query those servers for shares. They could
read from multiple location slots if they somehow got out of sync. The
likely
implementation would be to send "fetch share location list" messages in
parallel to the first N servers in the permuted list, and upon receipt of
the
first positive response, send DYHB messages in parallel to everything it
mentions.
After the repairer uploaded new shares, it would locate the location slots
and inform them about the new share. Through flooding, eventually all
location slots would learn about the new share. When a share is removed
from
a server (via lease expiration, or mutable rebalancing), the "propose
removal" message would be sent, and the location slot would confirm.
If the repairer notices that there are too few location slots, it can
allocate and fill new ones. It would be appropriate to have "N" location
slots. Since we're using 1-of-N encoding here (pure replication), the
reliability of the location slots should be much higher than the
reliability
of the target file, and the extra layer of indirection should not
significantly reduce reliability.
Having the location slot servers perform their own verification removes
the
need for an extra layer of writecaps (which would be needed by the
repairer
to update the location list, making it harder to have storage servers
drive
the repair process themselves). However it does increase their workload
significantly, making this look more like the old Tahoe1 scheme as
mentioned
above. On the other hand, this could provide the basis for server-driven
repair, where their periodic check on each other could provoke repair of
unhealthy files without client involvement.
What will prevent us from wanting to manipulate the choice of location-
slot
holders in the same way that those tickets want to manipulate the choice
of
share holders? Location slots are very small, so we'd be unlikely to want
to
balance the storage load as we do with (large) shares. They'd use 1-of-N
encoding, so certain performance and reliability concerns are reduced. But
for the others, we'd just have to hope that providing arbitrary location
of
shares is enough to get folks to tolerate these fixed location of
location-slot holders.
There would be a nontrivial performance hit to use location-slots.
Uploaders
would require one extra RTT, because the "propose location" messages could
not be sent until all shares had been closed. Servers would spend
considerable bandwidth and CPU checking up on each others shares. And
downloaders would incur an extra RTT to query the location-slot holders in
addition to the existing DYHB queries, however the DYHB queries would be
highly likely to succeed, making the combination potentially faster in
certain not-ideally-Tahoe2-distributed server selection schemes.
Other ideas:
* use a limited subset of servers for the location slots (i.e. nodes
might
offer share-storage, or location-slot-storage, but not both)
* alternatively, try to put the location-slots on the same servers as the
shares, so they share fate and we avoid the reliability hit
Anyways, this feels like a lot of moving parts (and the server load
worries
me), but a layer like this would let us use arbitrary server-selection
policies. It would also help us scale to millions of storage nodes, if the
Tahoe2 permutation and frequent queries only needed to be done on a few
hundred location-slot servers instead of the entire set of storage
servers.
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/599#comment:1>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid
More information about the tahoe-dev
mailing list