Opened at 2011-11-19T02:43:28Z
Closed at 2012-07-02T09:14:34Z
#1590 closed defect (fixed)
S3 backend: intermittent "We encountered an internal error. Please try again." from S3
Reported by: | davidsarah | Owned by: | secorp |
---|---|---|---|
Priority: | critical | Milestone: | undecided |
Component: | code-storage | Version: | 1.9.0 |
Keywords: | s3-backend reliability availability preservation error | Cc: | |
Launchpad Bug: | | | |
Description
Traceback from CLI (tahoe backup):
Traceback (most recent call last):
  File "/home/zooko/playground/tahoe-lafs/trunk/support/bin/tahoe", line 9, in <module>
    load_entry_point('allmydata-tahoe==1.9.0-r5375', 'console_scripts', 'tahoe')()
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/runner.py", line 113, in run
    rc = runner(sys.argv[1:], install_node_control=install_node_control)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/runner.py", line 99, in runner
    rc = cli.dispatch[command](so)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/cli.py", line 569, in backup
    rc = tahoe_backup.backup(options)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 325, in backup
    return bu.run()
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 194, in process
    childcap, metadata = self.upload(childpath)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 305, in upload
    raise HTTPError("Error during file PUT", resp)
allmydata.scripts.common_http.HTTPError: Error during file PUT: 500 Internal Server Error
"Traceback (most recent call last):
  File \"/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/util/pipeline.py\", line 65, in _ebDeferred
    self.checkForFinished()
  File \"/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/util/pipeline.py\", line 53, in checkForFinished
    self.errback(self.failure)
  File \"/usr/local/lib/python2.7/dist-packages/Twisted-11.0.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py\", line 388, in errback
    self._startRunCallbacks(fail)
  File \"/usr/local/lib/python2.7/dist-packages/Twisted-11.0.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py\", line 455, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File \"/usr/local/lib/python2.7/dist-packages/Twisted-11.0.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py\", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/immutable/encode.py\", line 504, in _remove_shareholder
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed or found on only 0 server(s).
We were asked to place shares on at least 1 server(s) such that any 1 of them have enough shares to recover the file.:
[Failure instance: Traceback (failure with no frames): <class 'allmydata.util.pipeline.PipelineError'>:
 <PipelineError error=([Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.RemoteException'>:
 <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/internet/tcp.py\", line 277, in connectionLost
    protocol.connectionLost(reason)
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/web/client.py\", line 191, in connectionLost
    self.factory._disconnectedDeferred.callback(None)
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/internet/defer.py\", line 362, in callback
    self._startRunCallbacks(result)
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/internet/defer.py\", line 458, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/internet/defer.py\", line 545, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/usr/local/lib/python2.6/dist-packages/txAWS-0.2.1.post2-py2.6.egg/txaws/s3/client.py\", line 30, in s3_error_wrapper
    error_wrapper(error, S3Error)
  File \"/usr/local/lib/python2.6/dist-packages/txAWS-0.2.1.post2-py2.6.egg/txaws/client/base.py\", line 46, in error_wrapper
    raise fallback_error
txaws.s3.exception.S3Error: Error Message: We encountered an internal error. Please try again. ]'> ])> ] "
Incident report: https://tahoe-lafs.org/~zooko/incident-2011-11-19--01%3a16%3a49Z-xrkocuy.flog.bz2
Change History (7)
comment:1 Changed at 2012-02-09T21:06:42Z by zooko
comment:2 Changed at 2012-02-18T06:16:58Z by davidsarah
#1589 should improve the logging of these errors.
comment:3 Changed at 2012-02-24T05:56:20Z by zooko
- Priority changed from major to critical
- Version changed from 1.9.0b1 to 1.9.0
comment:4 Changed at 2012-03-09T05:40:45Z by davidsarah
These errors are transient, at least in some cases on secorp's server. I've updated that server (only) to retry once after each 5xx failure; we'll see how that goes. The original error is logged and still causes an incident in any case.
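For reference, here is a minimal sketch, not the change actually deployed on that server, of what "retry once after a 5xx failure" can look like around a Deferred-returning txAWS S3 call. The predicate `_is_transient_s3_error` and the way it classifies errors are assumptions for illustration only.
```python
from twisted.internet import defer
from txaws.s3.exception import S3Error

def _is_transient_s3_error(failure):
    # Illustrative check: treat any txAWS S3Error as potentially transient.
    # Real code would want to inspect the HTTP status (500/503) before
    # deciding that a retry is appropriate.
    return failure.check(S3Error) is not None

def retry_once(operation, *args, **kwargs):
    """Call operation(*args, **kwargs), which returns a Deferred; if it
    fails with what looks like a transient S3 error, try exactly one more
    time, otherwise propagate the original failure."""
    d = defer.maybeDeferred(operation, *args, **kwargs)
    def _maybe_retry(f):
        if _is_transient_s3_error(f):
            return defer.maybeDeferred(operation, *args, **kwargs)
        return f
    d.addErrback(_maybe_retry)
    return d
```
The idea is that the backend could invoke something like `retry_once(s3_client.put_object, bucket, key, data)` instead of a bare PUT, so a single transient 500 no longer fails the whole share upload.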
comment:5 Changed at 2012-03-09T06:16:01Z by davidsarah
- Owner set to secorp
Assigning to secorp to report how well that retrying change went.
comment:6 Changed at 2012-05-25T06:27:29Z by zancas
Here's a link to the code that probably fixed this issue:
https://tahoe-lafs.org/trac/tahoe-lafs/changeset/5647/ticket999-S3-backend
comment:7 Changed at 2012-07-02T09:14:34Z by zooko
- Resolution set to fixed
- Status changed from new to closed
We're pretty satisfied with this fix/workaround ([20120510215310-93fa1-a61c9bd9f506b1f2366233e6e338683029c78e94]), which is in production use in the Tahoe-LAFS-on-S3 product of Least Authority Enterprises.
Summary: judging from traffic on the AWS forum, 500 or 503 errors from S3 do happen, but they usually indicate a bug or failure on the AWS side rather than a "normal" transient error that should just be ignored. One AWS tech gave a clue when he wrote "Receiving this error more frequently than 1 in 5000 requests may indicate an error."
Conclusion for Least Authority Enterprises' purposes: we should log as much data as we can about each failure, aggregate the occurrences to generate statistics and look for patterns, and have monitoring and alerting in place to show us the historical record of these failures and to alert us if the rate gets worse.
(In addition to all that, we should probably go ahead and retry the failed request...)
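As a sketch of that combined policy (record every S3 failure for later aggregation, then retry a bounded number of times before giving up): this is illustrative only, not the code referenced in comment:6, and `MAX_TRIES`, `RETRY_DELAY`, `record_s3_failure`, and `failure_log` are assumed names and policies.
```python
import time

from twisted.internet import defer, reactor, task
from txaws.s3.exception import S3Error

MAX_TRIES = 3       # assumed policy: original attempt plus two retries
RETRY_DELAY = 1.0   # assumed policy: seconds to wait between attempts

failure_log = []    # stand-in for whatever feeds the stats/monitoring system

def record_s3_failure(failure):
    # Keep as much data as we can about each failure so occurrences can be
    # aggregated and examined for patterns later.
    failure_log.append((time.time(), repr(failure.value)))

def retry_s3_operation(operation, *args, **kwargs):
    """Run a Deferred-returning S3 operation; log every S3Error and retry
    up to MAX_TRIES times with a fixed delay before giving up."""
    def _attempt(tries_left):
        d = defer.maybeDeferred(operation, *args, **kwargs)
        def _on_failure(f):
            f.trap(S3Error)              # only S3 errors are logged/retried
            record_s3_failure(f)
            if tries_left <= 1:
                return f                 # exhausted: propagate the failure
            return task.deferLater(reactor, RETRY_DELAY,
                                   _attempt, tries_left - 1)
        d.addErrback(_on_failure)
        return d
    return _attempt(MAX_TRIES)
```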
I searched the S3 sub-forum of the AWS forums for the following search terms, constrained to the year 2012:
Searching for the year 2011: