#1590 closed defect (fixed)

S3 backend: intermittent "We encountered an internal error. Please try again." from S3

Reported by: davidsarah
Owned by: secorp
Priority: critical
Milestone: undecided
Component: code-storage
Version: 1.9.0
Keywords: s3-backend reliability availability preservation error
Cc:
Launchpad Bug:

Description

Traceback from CLI (tahoe backup):

Traceback (most recent call last):
  File "/home/zooko/playground/tahoe-lafs/trunk/support/bin/tahoe", line 9, in <module>
    load_entry_point('allmydata-tahoe==1.9.0-r5375', 'console_scripts', 'tahoe')()
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/runner.py", line 113, in run
    rc = runner(sys.argv[1:], install_node_control=install_node_control)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/runner.py", line 99, in runner
    rc = cli.dispatch[command](so)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/cli.py", line 569, in backup
    rc = tahoe_backup.backup(options)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 325, in backup
    return bu.run()
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 194, in process
    childcap, metadata = self.upload(childpath)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 305, in upload
    raise HTTPError("Error during file PUT", resp)
allmydata.scripts.common_http.HTTPError: Error during file PUT: 500 Internal Server Error
"Traceback (most recent call last):
  File \"/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/util/pipeline.py\", line 65, in _ebDeferred
    self.checkForFinished()
  File \"/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/util/pipeline.py\", line 53, in checkForFinished
    self.errback(self.failure)
  File \"/usr/local/lib/python2.7/dist-packages/Twisted-11.0.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py\", line 388, in errback
    self._startRunCallbacks(fail)
  File \"/usr/local/lib/python2.7/dist-packages/Twisted-11.0.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py\", line 455, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File \"/usr/local/lib/python2.7/dist-packages/Twisted-11.0.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py\", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/immutable/encode.py\", line 504, in _remove_shareholder
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed or found on only 0 server(s).
 We were asked to place shares on at least 1 server(s) such that any 1 of them have enough shares to recover the file.:
 [Failure instance: Traceback (failure with no frames): <class 'allmydata.util.pipeline.PipelineError'>:
 <PipelineError error=([Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.RemoteException'>:
 <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/internet/tcp.py\", line 277, in connectionLost
    protocol.connectionLost(reason)
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/web/client.py\", line 191, in connectionLost
    self.factory._disconnectedDeferred.callback(None)
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/internet/defer.py\", line 362, in callback
    self._startRunCallbacks(result)
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/internet/defer.py\", line 458, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/internet/defer.py\", line 545, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/usr/local/lib/python2.6/dist-packages/txAWS-0.2.1.post2-py2.6.egg/txaws/s3/client.py\", line 30, in s3_error_wrapper
    error_wrapper(error, S3Error)
  File \"/usr/local/lib/python2.6/dist-packages/txAWS-0.2.1.post2-py2.6.egg/txaws/client/base.py\", line 46, in error_wrapper
    raise fallback_error
txaws.s3.exception.S3Error: Error Message: We encountered an internal error. Please try again.
]'>
])>
]
"

Incident report: https://tahoe-lafs.org/~zooko/incident-2011-11-19--01%3a16%3a49Z-xrkocuy.flog.bz2

Change History (7)

comment:1 Changed at 2012-02-09T21:06:42Z by zooko

Summary: Judging from traffic on the AWS forum, 500 or 503 errors from S3 do happen, but they usually indicate a bug or failure on the AWS side rather than a "normal" transient error that should simply be ignored. One AWS tech gave a clue about the expected rate when he wrote: "Receiving this error more frequently than 1 in 5000 requests may indicate an error."
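To put that quoted rate in perspective, here is a quick back-of-the-envelope calculation. (The 1-in-5000 figure comes from the quote above; the request count of 10,000 is only an illustrative assumption.) Even at the "normal" error rate, a large backup run is quite likely to hit at least one 500:

    # Illustrative arithmetic only; the request count is an assumption.
    p = 1.0 / 5000                    # per-request 5xx probability (AWS forum quote)
    requests = 10000                  # e.g. one large `tahoe backup` run (assumed)
    print("expected errors: %.1f" % (p * requests))             # -> 2.0
    print("P(>= one error): %.2f" % (1 - (1 - p) ** requests))  # -> ~0.86

So even healthy S3 behavior argues for some amount of retrying.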

Conclusion for Least Authority Enterprises' purposes: we should log as much data as we can about each failure, aggregate the occurrences to generate statistics and look for patterns, and have monitoring and alerting in place to show us the historical record of these failures and to call us if they get worse.

(In addition to all that, we should probably go ahead and retry the failed request...)
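As a rough sketch of that plan (not the actual backend code; S3TransientError, FailureStats, and call_with_retry are all hypothetical names), a wrapper around an S3 request could log each 5xx, feed a counter that monitoring can graph and alert on, and retry with backoff:

    import logging
    import time
    from collections import Counter

    log = logging.getLogger("s3.failures")

    class S3TransientError(Exception):
        """Hypothetical stand-in for an HTTP error response from S3."""
        def __init__(self, status, message):
            Exception.__init__(self, message)
            self.status = status

    class FailureStats(object):
        """Aggregate failure occurrences so monitoring can graph/alert on them."""
        def __init__(self):
            self.counts = Counter()

        def record(self, status):
            self.counts[status] += 1
            log.warning("S3 returned %d (seen %d times so far)",
                        status, self.counts[status])

    def call_with_retry(operation, stats, retries=3, delay=1.0):
        """Call operation(); log and count each 5xx, retrying with backoff."""
        for attempt in range(retries + 1):
            try:
                return operation()
            except S3TransientError as e:
                if not (500 <= e.status < 600):
                    raise                       # 4xx etc. are not transient
                stats.record(e.status)          # log it and update statistics
                if attempt == retries:
                    raise                       # out of retries: surface the error
                time.sleep(delay * (2 ** attempt))  # simple exponential backoff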

I searched the AWS forums' S3 sub-forum for reports of these errors, with results constrained to the year 2012, and then repeated the search for the year 2011.

comment:2 Changed at 2012-02-18T06:16:58Z by davidsarah

#1589 should improve the logging of these errors.

comment:3 Changed at 2012-02-24T05:56:20Z by zooko

  • Priority changed from major to critical
  • Version changed from 1.9.0b1 to 1.9.0

comment:4 Changed at 2012-03-09T05:40:45Z by davidsarah

These errors are transient, at least in some cases on secorp's server. I've updated that server (and only that server) to retry once after each 5xx failure; we'll see how that goes. The original error is still logged and still causes an incident in any case.
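For reference, the retry-once behavior described here might look roughly like the following Twisted-style sketch. (do_put and TransientS3Error are hypothetical stand-ins; the real change deployed on secorp's server is not shown in this ticket.)

    from twisted.internet import reactor
    from twisted.internet.task import deferLater
    from twisted.python import log

    class TransientS3Error(Exception):
        """Hypothetical stand-in for a 5xx error raised by the S3 client."""

    def put_with_one_retry(do_put):
        """Run the S3 PUT; on a 5xx failure, log it and retry exactly once."""
        d = do_put()

        def _maybe_retry(failure):
            failure.trap(TransientS3Error)  # any other failure propagates
            log.err(failure, "transient S3 5xx; retrying once")  # still logged
            return deferLater(reactor, 1.0, do_put)  # one retry, then give up

        d.addErrback(_maybe_retry)
        return d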

comment:5 Changed at 2012-03-09T06:16:01Z by davidsarah

  • Owner set to secorp

Assigning to secorp to report on how well that retrying change goes.

comment:6 Changed at 2012-05-25T06:27:29Z by zancas

Here's a link to the code that probably fixed this issue:

https://tahoe-lafs.org/trac/tahoe-lafs/changeset/5647/ticket999-S3-backend

comment:7 Changed at 2012-07-02T09:14:34Z by zooko

  • Resolution set to fixed
  • Status changed from new to closed

We're pretty satisfied with this fix/workaround ([20120510215310-93fa1-a61c9bd9f506b1f2366233e6e338683029c78e94]), which is in production use in the Tahoe-LAFS-on-S3 product of Least Authority Enterprises.
