[tahoe-lafs-trac-stream] [tahoe-lafs] #1590: S3 backend: intermittent "We encountered an internal error. Please try again." from S3

Thu Feb 9 21:06:42 UTC 2012

#1590: S3 backend: intermittent "We encountered an internal error. Please try
again." from S3
-------------------------+-------------------------------------------------
     Reporter:           |      Owner:
  davidsarah             |     Status:  new
         Type:  defect   |  Milestone:  undecided
     Priority:  major    |    Version:  1.9.0b1
    Component:  code-    |   Keywords:  s3-backend reliability availability
  storage                |  preservation error
   Resolution:           |
Launchpad Bug:           |
-------------------------+-------------------------------------------------

Comment (by zooko):

 Summary: judging from traffic on the AWS forum, 500 or 503 errors from S3
 do happen, but usually indicate a bug or failure on the AWS side and not a
 "normal" transient error that should just be ignored. One AWS tech gave a
 clue when he wrote "Receiving this error more frequently than 1 in 5000
 requests may indicate an error."

 Conclusion for Least Authority Enterprise's purposes: we should log as
 much data as we can about each failure, aggregate the occurrences of
 these failures to generate statistics and look for patterns, and have
 monitoring and alerting in place that shows us the historical record of
 these failures and calls us if they get worse.
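
 As a rough illustration of the aggregation and alerting idea, here is a
 minimal Python sketch. The class name S3ErrorStats and its methods are
 hypothetical and are not part of the Tahoe-LAFS S3 backend; the only
 figure taken from the source is the "1 in 5000 requests" remark quoted
 above, which is used here as an assumed alert threshold.

```python
from collections import Counter

# Assumed threshold, taken from the AWS tech's remark: 1 error in 5000 requests.
ALERT_THRESHOLD = 1 / 5000


class S3ErrorStats:
    """Hypothetical helper: counts requests and 5xx failures per status code."""

    def __init__(self, threshold=ALERT_THRESHOLD):
        self.threshold = threshold
        self.requests = 0
        self.failures = Counter()  # maps HTTP status code -> failure count

    def record(self, status):
        """Record the HTTP status of one completed S3 request."""
        self.requests += 1
        if 500 <= status <= 599:
            self.failures[status] += 1

    def failure_rate(self):
        """Fraction of recorded requests that ended in a 5xx response."""
        if self.requests == 0:
            return 0.0
        return sum(self.failures.values()) / self.requests

    def should_alert(self):
        """True once the observed failure rate exceeds the threshold."""
        return self.failure_rate() > self.threshold
```

 A real deployment would presumably also persist timestamps, request IDs
 and response bodies so that patterns can be examined later; this sketch
 only keeps counters in memory.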

 (In addition to all that, we should probably go ahead and retry the failed
 request...)
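
 A retry could look roughly like the following Python sketch, which logs
 every failure and retries 5xx responses with exponential backoff and
 jitter. The names S3ServerError and retry_on_5xx, the attempt limit and
 the delays are all assumptions made for illustration; the code is
 synchronous for brevity and is not the backend's actual implementation.

```python
import random
import time


class S3ServerError(Exception):
    """Hypothetical exception carrying the HTTP status of a failed request."""

    def __init__(self, status, message=""):
        super().__init__("%d %s" % (status, message))
        self.status = status


def retry_on_5xx(request, max_attempts=5, base_delay=0.5,
                 sleep=time.sleep, log=print):
    """Call request(); retry on 5xx errors, logging each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except S3ServerError as e:
            # Log every failure so occurrences can be aggregated later.
            log("S3 request failed: attempt=%d status=%d" % (attempt, e.status))
            # Non-5xx errors (e.g. wrong credentials) are not retried,
            # and the last attempt re-raises instead of sleeping.
            if not (500 <= e.status <= 599) or attempt == max_attempts:
                raise
            # Exponential backoff with random jitter before the next try.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

 Backing off between attempts also seems a sensible response to the
 !SlowDown ("Please reduce your request rate") variety of 503 listed
 below.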

 I searched the S3 sub-forum of [https://forums.aws.amazon.com the AWS
 forums] for the following search terms, constrained to the year 2012:
 * "503": 0 hits
 * "500": 3 hits (that were about the 500 error code instead of, say, 500
 ms):
  * https://forums.aws.amazon.com/thread.jspa?messageID=313830 -- AWS tech
 says you should retry, and asks for more information about the pattern of
 the failure
  * https://forums.aws.amazon.com/thread.jspa?messageID=313074 -- not clear
 to me if that is a 500 from S3 or from !CloudFront or something

 Searching for the year 2011:
 * "503": 2 hits
   * https://forums.aws.amazon.com/thread.jspa?messageID=260249 -- Code:
 !SlowDown / Message: Please reduce your request rate.
   * https://forums.aws.amazon.com/thread.jspa?messageID=236572 -- sudden
 storm of 503s, concentrated in the Europe region; no explanation, but the
 users didn't post any further complaints, so it was presumably resolved
   * https://forums.aws.amazon.com/thread.jspa?messageID=260617 -- Code:
 !SlowDown / Message: Please reduce your request rate.
 * "500": 12 hits
   * https://forums.aws.amazon.com/thread.jspa?messageID=297376 -- storm of
 failures including spurious "access denied" and objects just disappearing
 after upload
   * https://forums.aws.amazon.com/thread.jspa?messageID=284078 -- turned
 out to be wrong credentials
   * https://forums.aws.amazon.com/thread.jspa?messageID=260843 -- was an
 internal error in S3 that was subsequently fixed by AWS
   * https://forums.aws.amazon.com/thread.jspa?messageID=265875 --
 unexplained
   * https://forums.aws.amazon.com/thread.jspa?messageID=217704 --
 unexplained
   * https://forums.aws.amazon.com/thread.jspa?messageID=313830 --
 unexplained, AWS tech says "Receiving this error more frequently than 1 in
 5000 requests may indicate an error."
   * https://forums.aws.amazon.com/thread.jspa?messageID=227851 -- "current
 service issue"
   * https://forums.aws.amazon.com/thread.jspa?messageID=249349 -- trying
 to upload large file, unexplained
   * https://forums.aws.amazon.com/thread.jspa?messageID=260788 --
 temporary service failure of AWS
   * https://forums.aws.amazon.com/thread.jspa?messageID=215866 -- bug in
 AWS triggered by deleting many objects
   * https://forums.aws.amazon.com/thread.jspa?messageID=272033 --
 apparently the same bug
   * https://forums.aws.amazon.com/thread.jspa?messageID=313074 -- unclear,
 possibly service failure

-- 
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1590#comment:1>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage