#873 new defect

upload: tolerate lost or unacceptably slow servers

Reported by: warner
Owned by: kevan
Priority: major
Milestone: eventually
Component: code-encoding
Version: 1.5.0
Keywords: upload preservation availability performance hang error
Cc:
Launchpad Bug:

Description

As with download in #287, we'd like upload to cope gracefully with servers that silently disconnect during the upload process. This is more difficult than for download, because we don't have the option of switching to a different server. Giving up on a server during upload means giving up on the whole share, which reduces reliability. "Shares of happiness" is the current threshold used to decide how important this abandon-the-share event is.

To implement this, the upload code needs to use a timeout (to distinguish between slow-server and silently-lost-server) and we need some way to decide what that timeout should be.
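
A minimal sketch of the idea, not the actual Tahoe upload code: each share push is wrapped in a timeout, a timed-out or disconnected server causes its share to be abandoned, and the upload fails only if fewer than "shares of happiness" shares were placed. The names push_share, upload_shares, senders, timeout and happy are hypothetical, and asyncio stands in for Twisted so that the example is self-contained. How to choose the timeout value is left open, as in the ticket.

```python
import asyncio


async def push_share(send, timeout):
    """Run one share upload; return True on success, False if abandoned.

    `send` is a hypothetical zero-argument coroutine function that pushes
    one share to one server.
    """
    try:
        await asyncio.wait_for(send(), timeout)
        return True
    except (asyncio.TimeoutError, ConnectionError):
        # Too slow or silently lost: give up on this server, and therefore
        # on the share it was holding.
        return False


async def upload_shares(senders, timeout, happy):
    """Push all shares concurrently and apply the happiness threshold.

    `senders` maps share number -> coroutine function (assumed shape).
    Returns the list of share numbers that were placed.
    """
    results = await asyncio.gather(
        *(push_share(send, timeout) for send in senders.values())
    )
    placed = [shnum for shnum, ok in zip(senders, results) if ok]
    if len(placed) < happy:
        raise RuntimeError(
            "only %d shares placed, need at least %d" % (len(placed), happy)
        )
    return placed
```

With a threshold like this, one hung server costs a share but does not hang the whole upload; the open question from the ticket is still what value of timeout separates "slow" from "lost".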

Attachments (1)

logs.tgz (23.6 KB) - added by kmarkley86 at 2009-12-29T21:36:27Z.
Contents of Kyle's .tahoe/logs directory after noticing two hung tahoe backup operations.

Change History (12)

comment:1 Changed at 2009-12-27T04:52:18Z by warner

  • Keywords upload added

comment:2 Changed at 2009-12-27T16:06:42Z by davidsarah

  • Keywords preservation availability performance added

comment:3 Changed at 2009-12-29T19:07:26Z by davidsarah

  • Keywords hang added

Changed at 2009-12-29T21:36:27Z by kmarkley86

Contents of Kyle's .tahoe/logs directory after noticing two hung tahoe backup operations.

comment:4 Changed at 2009-12-29T21:44:25Z by kmarkley86

I noticed two 'tahoe backup' operations hang on my node, and attached my .tahoe/logs directory as logs.tgz. Here are my versions:

allmydata-tahoe: 1.5.0, foolscap: 0.4.2, pycryptopp: 0.5.17, zfec: 1.4.5, Twisted: 8.2.0, Nevow: 0.9.33-r17222, zope.interface: 3.5.2, python: 2.6.2, platform: OpenBSD-4.6-amd64-Genuine_Intel-R-_CPU_000_@_2.93GHz-64bit-ELF, sqlite: 3.6.13, simplejson: 2.0.9, argparse: 0.9.1, pyOpenSSL: 0.9, pyutil: 1.3.34, zbase32: 1.1.1, setuptools: 0.6c12dev, pysqlite: 2.4.1

comment:5 Changed at 2009-12-30T00:03:22Z by davidsarah

Kyle wrote:

My welcome page says "Connected to 89 of 105 known storage servers" but I don't know how to figure out which servers the hung operations are trying to contact. Here are the Storage Index values from the status pages, if they're worth anything:

  • twfhdmkbsoidlnf3zijrcut7jm (hung incremental backup)
  • dt5jrwb3ck2yt3tp7etuw6aply (hung backup of a large file; I can see sharemap 8 is missing)

(I'm on the allmydata.com production grid.)

comment:6 Changed at 2010-05-16T05:21:27Z by zooko

  • Milestone changed from undecided to 1.8.0
  • Owner set to zooko
  • Status changed from new to assigned

comment:7 Changed at 2010-06-12T23:44:48Z by davidsarah

  • Keywords error added

comment:8 Changed at 2010-07-24T05:38:14Z by zooko

  • Milestone changed from 1.8.0 to eventually

It was impulsive of me to put this ticket into the 1.8 Milestone. This ticket will probably get fixed in a complete rewrite of the upload code at some point.

comment:9 Changed at 2010-07-29T04:53:25Z by zooko

  • Summary changed from upload: tolerate lost or missing servers to upload: tolerate lost or unacceptably slow servers

comment:10 Changed at 2011-04-21T14:52:28Z by davidsarah

#1394 is a near-duplicate for the server selection stage of upload. There's a tension between this ticket and #362 ('enhance upload to search longer and more completely for shares'), which I'm not sure how to resolve.

comment:11 Changed at 2011-07-22T13:34:03Z by zooko

  • Owner changed from zooko to kevan
  • Status changed from assigned to new

Kevan: does #1382 affect this ticket? Also, if you know how to close tickets or clarify the relationships mentioned in comment:10, that might be good.
