[tahoe-lafs-trac-stream] [tahoe-lafs] #1590: S3 backend: intermittent "We encountered an internal error. Please try again." from S3
tahoe-lafs
trac at tahoe-lafs.org
Thu Feb 9 21:06:42 UTC 2012
----------------------------+----------------------------------------------
 Reporter:  davidsarah      |  Owner:
 Type:  defect              |  Status:  new
 Priority:  major           |  Milestone:  undecided
 Component:  code-storage   |  Version:  1.9.0b1
 Resolution:                |  Keywords:  s3-backend reliability availability preservation error
 Launchpad Bug:             |
----------------------------+----------------------------------------------
Comment (by zooko):
Summary: judging from traffic on the AWS forum, 500 or 503 errors from S3
do happen, but usually indicate a bug or failure on the AWS side and not a
"normal" transient error that should just be ignored. One AWS tech gave a
clue when he wrote "Receiving this error more frequently than 1 in 5000
requests may indicate an error.".
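To make the "1 in 5000" figure concrete, here is a minimal sketch of how an observed error rate could be compared against it. The function name `exceeds_threshold` and the constant name are made up for illustration; the only number taken from the forum is the 1/5000 ratio, and whether AWS meant it as a hard limit is not known.

```python
from fractions import Fraction

# Ratio quoted by the AWS tech: more than 1 error in 5000 requests
# "may indicate an error". Treating it as a strict cutoff is an assumption.
AWS_QUOTED_THRESHOLD = Fraction(1, 5000)


def exceeds_threshold(errors, requests, threshold=AWS_QUOTED_THRESHOLD):
    """Return True if the observed 500/503 rate is strictly above threshold."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    # Fractions avoid floating-point surprises right at the boundary.
    return Fraction(errors, requests) > threshold
```

With few requests a single failure will always exceed the ratio, so a check like this probably only means something over a reasonably large window.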
Conclusion for Least Authority Enterprise's purposes: we should log as
much data as we can about each failure, aggregate the occurrences of
these failures to generate statistics and look for patterns, and have
monitoring and alerting in place to show us the historical record of
these failures and to call us if it gets worse.
(In addition to all that, we should probably go ahead and retry the failed
request...)
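A rough sketch of the "log, count, and retry" idea, not the actual S3 backend code. It is synchronous for brevity (the real backend would presumably be Twisted-based), and `send_request`, `FailureStats`, and `request_with_retry` are hypothetical names. `send_request` is assumed to be a callable returning a `(status, body)` tuple. The retry count and backoff delays are arbitrary choices, not something AWS recommended in the threads.

```python
import logging
import time
from collections import Counter

log = logging.getLogger("s3_retry_sketch")

RETRYABLE = (500, 503)


class FailureStats:
    """Counts requests and 500/503 responses so they can be aggregated."""

    def __init__(self):
        self.requests = 0
        self.failures = Counter()  # status code -> number of occurrences

    def record(self, status):
        self.requests += 1
        if status in RETRYABLE:
            self.failures[status] += 1

    def failure_rate(self):
        if self.requests == 0:
            return 0.0
        return sum(self.failures.values()) / self.requests


def request_with_retry(send_request, stats, max_attempts=3,
                       base_delay=1.0, sleep=time.sleep):
    """Call send_request(), logging and retrying on 500/503 responses.

    Returns the last (status, body) seen, whether or not it succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send_request()
        stats.record(status)
        if status not in RETRYABLE:
            return status, body
        # Log as much as we have about the failure before retrying.
        log.warning("S3 returned %d on attempt %d/%d: %r",
                    status, attempt, max_attempts, body)
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return status, body
```

Every attempt is counted as a request, so `stats.failure_rate()` reflects what S3 actually returned rather than what the caller eventually saw. Monitoring and alerting would sit on top of those counters.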
I searched the S3 sub-forum of [https://forums.aws.amazon.com the AWS forums]
for the following search terms, restricted to posts from 2012:
* "503": 0 hits
* "500": 3 hits (that were about the 500 error code instead of, say, 500
ms):
  * https://forums.aws.amazon.com/thread.jspa?messageID=313830 -- AWS
    tech says you oughta retry, and asks for more information about the
    pattern of the failure
  * https://forums.aws.amazon.com/thread.jspa?messageID=313074 -- not
    clear to me if that is a 500 from S3 or from !CloudFront or something
Searching for the year 2011:
* "503": 2 hits
  * https://forums.aws.amazon.com/thread.jspa?messageID=260249 -- Code:
    !SlowDown / Message: Please reduce your request rate.
  * https://forums.aws.amazon.com/thread.jspa?messageID=236572 -- sudden
    storm of 503s, concentrated in the Europe region, no explanation, but
    the users didn't post follow-ups complaining more, so it was
    presumably resolved
  * https://forums.aws.amazon.com/thread.jspa?messageID=260617 -- Code:
    !SlowDown / Message: Please reduce your request rate.
* "500": 12 hits
  * https://forums.aws.amazon.com/thread.jspa?messageID=297376 -- storm
    of failures including spurious "access denied" and objects just
    disappearing after upload
  * https://forums.aws.amazon.com/thread.jspa?messageID=284078 -- turned
    out to be wrong credentials
  * https://forums.aws.amazon.com/thread.jspa?messageID=260843 -- was an
    internal error in S3 that was subsequently fixed by AWS
  * https://forums.aws.amazon.com/thread.jspa?messageID=265875 --
    unexplained
  * https://forums.aws.amazon.com/thread.jspa?messageID=217704 --
    unexplained
  * https://forums.aws.amazon.com/thread.jspa?messageID=313830 --
    unexplained, AWS tech says "Receiving this error more frequently than
    1 in 5000 requests may indicate an error."
  * https://forums.aws.amazon.com/thread.jspa?messageID=227851 --
    "current service issue"
  * https://forums.aws.amazon.com/thread.jspa?messageID=249349 -- trying
    to upload large file, unexplained
  * https://forums.aws.amazon.com/thread.jspa?messageID=260788 --
    temporary service failure of AWS
  * https://forums.aws.amazon.com/thread.jspa?messageID=215866 -- bug in
    AWS triggered by deleting many objects
  * https://forums.aws.amazon.com/thread.jspa?messageID=272033 --
    apparently the same bug
  * https://forums.aws.amazon.com/thread.jspa?messageID=313074 --
    unclear, possibly service failure
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1590#comment:1>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage