[tahoe-lafs-trac-stream] [Tahoe-LAFS] #1640: the mutable publisher should try harder to place all shares

Tahoe-LAFS trac at tahoe-lafs.org
Thu Sep 11 22:22:43 UTC 2014


#1640: the mutable publisher should try harder to place all shares
------------------------------------+----------------------------
     Reporter:  kevan               |      Owner:  nobody
         Type:  defect              |     Status:  new
     Priority:  major               |  Milestone:  soon
    Component:  code-peerselection  |    Version:  1.9.0
   Resolution:                      |   Keywords:  mutable upload
Launchpad Bug:                      |
------------------------------------+----------------------------
Changes (by warner):

 * component:  unknown => code-peerselection


New description:

 If a connection error is encountered while pushing a share to a storage
 server, the mutable publisher forgets about the writer object associated
 with that (share, server) placement. This matches the pre-1.9 publisher
 and, in high-level terms, means that the publisher treats the placement
 as probably invalid, attributing the error to a server failure or
 something similar. The pre-1.9 publisher then attempts to find another
 home for the share that was placed on the broken server; the current
 publisher doesn't.
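
 As a rough illustration, the current control flow looks something like
 the following sketch. All names here (Writer, Publisher,
 _handle_write_error) are hypothetical stand-ins; the real logic lives in
 Tahoe-LAFS's mutable publish code and differs in detail.

    class Writer(object):
        """Holds one share's full contents and its assigned server."""
        def __init__(self, shnum, server, share_data):
            self.shnum = shnum
            self.server = server
            self.share_data = share_data

    class Publisher(object):
        def __init__(self, writers):
            self.writers = dict((w.shnum, w) for w in writers)
            self.bad_servers = set()

        def _handle_write_error(self, shnum, error):
            # Treat the placement as invalid: forget the writer and
            # mark the server bad. No new home is sought for the share.
            writer = self.writers.pop(shnum)
            self.bad_servers.add(writer.server)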

 When I first wrote the publisher, I wanted to support streaming upload
 of mutable files. That made it hard to find a new home for a share
 placed on a broken storage server, since the parts of the share
 generated and pushed before the failure wouldn't necessarily still be
 available to upload to a new server. We ended up ditching streaming
 uploads due to other concerns; instead, we now write each share all at
 once, so everything we will send to a storage server is in hand at the
 moment we write. Given this, there's no compelling reason the publisher
 couldn't attempt to find a new home for shares placed on broken servers.
 Ensuring that all shares are placed whenever possible makes it more
 likely that a recoverable version of the mutable file will be available
 after an update.

 In practical terms, the current behavior increases the chance of data
 loss somewhat, in proportion to the number of servers that fail during a
 publish operation. If too many storage servers fail during the upload
 and too much of the initial share placement is lost to those failures,
 the newly placed mutable file might not be recoverable. A fix would
 involve a way to change the server associated with a writer after the
 writer is created, plus some control-flow changes to ensure that write
 failures result in shares being reassigned rather than dropped.
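
 Extending the sketch above, the fix might look roughly like this;
 _pick_new_server and the retry loop are again illustrative assumptions,
 not the actual Tahoe-LAFS API. The key point is that the full share data
 is still in memory, so a failed placement can simply be retried on a
 replacement server:

    class RepairingPublisher(Publisher):
        def __init__(self, writers, spare_servers):
            Publisher.__init__(self, writers)
            self.spare_servers = list(spare_servers)

        def _pick_new_server(self):
            # Return an unused, not-known-bad server, or None.
            while self.spare_servers:
                candidate = self.spare_servers.pop(0)
                if candidate not in self.bad_servers:
                    return candidate
            return None

        def _handle_write_error(self, shnum, error):
            writer = self.writers.pop(shnum)
            self.bad_servers.add(writer.server)
            new_server = self._pick_new_server()
            if new_server is None:
                raise error  # out of candidates; fail as before
            # Reassign the existing writer; its share_data is intact
            # and can be re-sent unchanged to the new server.
            writer.server = new_server
            self.writers[shnum] = writer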

--

--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1640#comment:5>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage

