#1590 closed defect (fixed)

S3 backend: intermittent "We encountered an internal error. Please try again." from S3

Reported by: davidsarah
Owned by: secorp
Priority: critical
Milestone: undecided
Component: code-storage
Version: 1.9.0
Keywords: s3-backend reliability availability preservation error
Cc:
Launchpad Bug:

Description

Traceback from CLI (tahoe backup):

Traceback (most recent call last):
  File "/home/zooko/playground/tahoe-lafs/trunk/support/bin/tahoe", line 9, in <module>
    load_entry_point('allmydata-tahoe==1.9.0-r5375', 'console_scripts', 'tahoe')()
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/runner.py", line 113, in run
    rc = runner(sys.argv[1:], install_node_control=install_node_control)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/runner.py", line 99, in runner
    rc = cli.dispatch[command](so)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/cli.py", line 569, in backup
    rc = tahoe_backup.backup(options)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 325, in backup
    return bu.run()
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 194, in process
    childcap, metadata = self.upload(childpath)
  File "/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/scripts/tahoe_backup.py", line 305, in upload
    raise HTTPError("Error during file PUT", resp)
allmydata.scripts.common_http.HTTPError: Error during file PUT: 500 Internal Server Error
"Traceback (most recent call last):
  File \"/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/util/pipeline.py\", line 65, in _ebDeferred
    self.checkForFinished()
  File \"/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/util/pipeline.py\", line 53, in checkForFinished
    self.errback(self.failure)
  File \"/usr/local/lib/python2.7/dist-packages/Twisted-11.0.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py\", line 388, in errback
    self._startRunCallbacks(fail)
  File \"/usr/local/lib/python2.7/dist-packages/Twisted-11.0.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py\", line 455, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File \"/usr/local/lib/python2.7/dist-packages/Twisted-11.0.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py\", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/home/zooko/playground/tahoe-lafs/trunk/src/allmydata/immutable/encode.py\", line 504, in _remove_shareholder
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed or found on only 0 server(s).
 We were asked to place shares on at least 1 server(s) such that any 1 of them have enough shares to recover the file.:
 [Failure instance: Traceback (failure with no frames): <class 'allmydata.util.pipeline.PipelineError'>:
 <PipelineError error=([Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.RemoteException'>:
 <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/internet/tcp.py\", line 277, in connectionLost
    protocol.connectionLost(reason)
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/web/client.py\", line 191, in connectionLost
    self.factory._disconnectedDeferred.callback(None)
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/internet/defer.py\", line 362, in callback
    self._startRunCallbacks(result)
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/internet/defer.py\", line 458, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File \"/usr/local/lib/python2.6/dist-packages/Twisted-11.1.0-py2.6-linux-i686.egg/twisted/internet/defer.py\", line 545, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/usr/local/lib/python2.6/dist-packages/txAWS-0.2.1.post2-py2.6.egg/txaws/s3/client.py\", line 30, in s3_error_wrapper
    error_wrapper(error, S3Error)
  File \"/usr/local/lib/python2.6/dist-packages/txAWS-0.2.1.post2-py2.6.egg/txaws/client/base.py\", line 46, in error_wrapper
    raise fallback_error
txaws.s3.exception.S3Error: Error Message: We encountered an internal error. Please try again.
]'>
])>
]
"

Incident report: https://tahoe-lafs.org/~zooko/incident-2011-11-19--01%3a16%3a49Z-xrkocuy.flog.bz2

Change History (7)

comment:1 Changed at 2012-02-09T21:06:42Z by zooko

Summary: Judging from traffic on the AWS forum, 500 or 503 errors from S3 do happen, but they usually indicate a bug or failure on the AWS side rather than a "normal" transient error that should simply be ignored. One AWS tech gave a clue about the expected rate when he wrote: "Receiving this error more frequently than 1 in 5000 requests may indicate an error."
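To put that quoted rate in perspective, here is a quick back-of-the-envelope calculation. (The 1-in-5000 figure comes from the quote above; the request count of 10,000 is only an illustrative assumption.) Even at the "normal" error rate, a large backup run is quite likely to hit at least one 500:

    # Illustrative arithmetic only; the request count is an assumption.
    p = 1.0 / 5000                    # per-request 5xx probability (AWS forum quote)
    requests = 10000                  # e.g. one large `tahoe backup` run (assumed)
    print("expected errors: %.1f" % (p * requests))             # -> 2.0
    print("P(>= one error): %.2f" % (1 - (1 - p) ** requests))  # -> ~0.86

So even healthy S3 behavior argues for some amount of retrying.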

Conclusion for Least Authority Enterprises' purposes: we should log as much data as we can about each failure, aggregate the occurrences to generate statistics and look for patterns, and have monitoring and alerting in place to show us the historical record of these failures and to call us if they get worse.

(In addition to all that, we should probably go ahead and retry the failed request...)
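As a rough sketch of that plan (not the actual backend code; S3TransientError, FailureStats, and call_with_retry are all hypothetical names), a wrapper around an S3 request could log each 5xx, feed a counter that monitoring can graph and alert on, and retry with backoff:

    import logging
    import time
    from collections import Counter

    log = logging.getLogger("s3.failures")

    class S3TransientError(Exception):
        """Hypothetical stand-in for an HTTP error response from S3."""
        def __init__(self, status, message):
            Exception.__init__(self, message)
            self.status = status

    class FailureStats(object):
        """Aggregate failure occurrences so monitoring can graph/alert on them."""
        def __init__(self):
            self.counts = Counter()

        def record(self, status):
            self.counts[status] += 1
            log.warning("S3 returned %d (seen %d times so far)",
                        status, self.counts[status])

    def call_with_retry(operation, stats, retries=3, delay=1.0):
        """Call operation(); log and count each 5xx, retrying with backoff."""
        for attempt in range(retries + 1):
            try:
                return operation()
            except S3TransientError as e:
                if not (500 <= e.status < 600):
                    raise                       # 4xx etc. are not transient
                stats.record(e.status)          # log it and update statistics
                if attempt == retries:
                    raise                       # out of retries: surface the error
                time.sleep(delay * (2 ** attempt))  # simple exponential backoff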

I searched the AWS forums' S3 sub-forum for reports of these errors, with results constrained to the year 2012, and then repeated the search for the year 2011.

comment:2 Changed at 2012-02-18T06:16:58Z by davidsarah

#1589 should improve the logging of these errors.

comment:3 Changed at 2012-02-24T05:56:20Z by zooko

  • Priority changed from major to critical
  • Version changed from 1.9.0b1 to 1.9.0

comment:4 Changed at 2012-03-09T05:40:45Z by davidsarah

These errors are transient, at least in some cases on secorp's server. I've updated that server (and only that server) to retry once after each 5xx failure; we'll see how that goes. The original error is still logged and still causes an incident in any case.
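For reference, the retry-once behavior described here might look roughly like the following Twisted-style sketch. (do_put and TransientS3Error are hypothetical stand-ins; the real change deployed on secorp's server is not shown in this ticket.)

    from twisted.internet import reactor
    from twisted.internet.task import deferLater
    from twisted.python import log

    class TransientS3Error(Exception):
        """Hypothetical stand-in for a 5xx error raised by the S3 client."""

    def put_with_one_retry(do_put):
        """Run the S3 PUT; on a 5xx failure, log it and retry exactly once."""
        d = do_put()

        def _maybe_retry(failure):
            failure.trap(TransientS3Error)  # any other failure propagates
            log.err(failure, "transient S3 5xx; retrying once")  # still logged
            return deferLater(reactor, 1.0, do_put)  # one retry, then give up

        d.addErrback(_maybe_retry)
        return d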

comment:5 Changed at 2012-03-09T06:16:01Z by davidsarah

  • Owner set to secorp

Assigning to secorp to report on how well that retrying change goes.

comment:6 Changed at 2012-05-25T06:27:29Z by zancas

Here's a link to the code that probably fixed this issue:

https://tahoe-lafs.org/trac/tahoe-lafs/changeset/5647/ticket999-S3-backend

comment:7 Changed at 2012-07-02T09:14:34Z by zooko

  • Resolution set to fixed
  • Status changed from new to closed

We're pretty satisfied with this fix/workaround ([20120510215310-93fa1-a61c9bd9f506b1f2366233e6e338683029c78e94]), which is in production use in the Tahoe-LAFS-on-S3 product of Least Authority Enterprises.
