Opened at 2013-11-26T04:32:53Z
Last modified at 2014-04-15T00:49:57Z
#2116 closed defect
xml parse error from S4 service — at Version 18
Reported by: | zooko | Owned by: | daira |
---|---|---|---|
Priority: | major | Milestone: | undecided |
Component: | code-storage | Version: | 1.9.0-s3branch |
Keywords: | s3 error txaws retry LeastAuthority | Cc: | |
Launchpad Bug: |
Description (last modified by daira)
zooko@spark ~/research/papers $ tahoe backup . $WRITABLE_DIRCAP: Traceback (most recent call last): File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/runner.py", line 156, in run rc = runner(sys.argv[1:], install_node_control=install_node_control) File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/runner.py", line 141, in runner rc = cli.dispatch[command](so) File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/cli.py", line 574, in backup rc = tahoe_backup.backup(options) File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 325, in backup return bu.run() File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 118, in run new_backup_dircap = self.process(options.from_dir) File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 188, in process childcap = self.process(childpath) File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 194, in process childcap, metadata = self.upload(childpath) File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 305, in upload raise HTTPError("Error during file PUT", resp) HTTPError: Error during file PUT: 500 Internal Server Error "Traceback (most recent call last): File \"/home/zooko/playground/LAFS/pristine/support/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py\", line 753, in receiveClose self.request.fail(f) File \"/home/zooko/playground/LAFS/pristine/support/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py\", line 95, in fail self.deferred.errback(why) File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 421, in errback self._startRunCallbacks(fail) File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 488, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 575, in _runCallbacks current.result = callback(current.result, *args, **kw) File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 604, in _got_response return self._loop() File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 516, in _loop return self._failed(msg) File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 617, in _failed raise UploadUnhappinessError(msg) allmydata.interfaces.UploadUnhappinessError: server selection failed for <Tahoe2ServerSelector for upload qd7vq>: shares could be placed or found on only 0 server(s). We were asked to place shares on at least 1 server(s) such that any 1 of them have enough shares to recover the file. (placed 0 shares out of 1 total (1 homeless), want to place shares on at least 1 servers such that any 1 of them have enough shares to recover the file, sent 1 queries to 1 servers, 0 queries placed some shares, 1 placed none (of which 0 placed none due to the server being full and 1 placed none due to an error)) (last failure (from <ServerTracker for server 6dwd2q and SI qd7vq>) was: [Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.RemoteException'>: <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/tcp.py\", line 287, in connectionLost protocol.connectionLost(reason) File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/web/client.py\", line 193, in connectionLost self.factory._disconnectedDeferred.callback(None) File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 368, in callback self._startRunCallbacks(result) File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 464, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 551, in _runCallbacks current.result = callback(current.result, *args, **kw) File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/s3/client.py\", line 153, in _parse_get_bucket root = XML(xml_bytes) File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/util.py\", line 61, in XML return parser.close() File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1655, in close self._raiseerror(v) File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1507, in _raiseerror raise err xml.etree.ElementTree.ParseError: no element found: line 1, column 0 ]'> ]) "
Change History (18)
comment:1 Changed at 2013-11-26T04:34:59Z by zooko
comment:2 Changed at 2013-11-26T17:00:40Z by daira
comment:3 Changed at 2013-11-26T17:01:10Z by daira
Actually the response from S3 seems to have been empty.
comment:4 follow-up: ↓ 6 Changed at 2013-11-26T17:08:41Z by daira
- Keywords s3 error txaws retry LeastAuthority added
- Owner set to daira
- Status changed from new to assigned
- Version changed from 1.10.0 to 1.9.0-s3branch
Hmm, maybe an empty S3 response should trigger a retry?
comment:5 Changed at 2013-11-26T17:13:13Z by daira
Sigh, I wish we could reduce that error report to something more concise -- although I suppose we should be grateful that the actual error was in there somewhere.
comment:6 in reply to: ↑ 4 Changed at 2013-11-27T19:59:27Z by zooko
Replying to daira:
Hmm, maybe an empty S3 response should trigger a retry?
I guess. :-/
This is what I think of as "snowshoes". Having our upper layer perform some redundancy or retrying just to work-around what appear to be bugs or unreliability from the lower layer. Ick.
comment:7 follow-up: ↓ 8 Changed at 2013-11-27T20:01:57Z by daira
Well, I'm philosophically resistant to that as well, but I have no confidence that we can get the underlying problem with S3 fixed.
comment:8 in reply to: ↑ 7 Changed at 2013-11-27T20:03:38Z by zooko
Replying to daira:
Well, I'm philosophically resistant to that as well, but I have no confidence that we can get the underlying problem with S3 fixed.
Yeah, I agree that we should go ahead and add a limited number of exponential-backoff-retries for this.
comment:9 Changed at 2013-11-27T20:04:42Z by daira
Note that this isn't the first time I've seen an empty response from S3. They are rare, but they've come up before - we just haven't recorded that problem until now.
comment:10 Changed at 2013-11-28T12:52:35Z by zooko
I just a similar failure again:
Traceback (most recent call last): File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/runner.py", line 156, in run rc = runner(sys.argv[1:], install_node_control=install_node_control) File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/runner.py", line 141, in runner rc = cli.dispatch[command](so) File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/cli.py", line 574, in backup rc = tahoe_backup.backup(options) File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 325, in backup return bu.run() File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 118, in run new_backup_dircap = self.process(options.from_dir) File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 188, in process childcap = self.process(childpath) File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 194, in process childcap, metadata = self.upload(childpath) File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 305, in upload raise HTTPError("Error during file PUT", resp) HTTPError: Error during file PUT: 500 Internal Server Error "Traceback (most recent call last): File \"/home/zooko/playground/LAFS/pristine/support/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py\", line 753, in receiveClose self.request.fail(f) File \"/home/zooko/playground/LAFS/pristine/support/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py\", line 95, in fail self.deferred.errback(why) File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 421, in errback self._startRunCallbacks(fail) File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 488, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 575, in _runCallbacks current.result = callback(current.result, *args, **kw) File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 604, in _got_response return self._loop() File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 516, in _loop return self._failed(msg) File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 617, in _failed raise UploadUnhappinessError(msg) allmydata.interfaces.UploadUnhappinessError: server selection failed for <Tahoe2ServerSelector for upload s7uqw>: shares could be placed or found on only 0 server(s). We were asked to place shares on at least 1 server(s) such that any 1 of them have enough shares to recover the file. (placed 0 shares out of 1 total (1 homeless), want to place shares on at least 1 servers such that any 1 of them have enough shares to recover the file, sent 1 queries to 1 servers, 0 queries placed some shares, 1 placed none (of which 0 placed none due to the server being full and 1 placed none due to an error)) (last failure (from <ServerTracker for server 6dwd2q and SI s7uqw>) was: [Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.RemoteException'>: <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/tcp.py\", line 287, in connectionLost protocol.connectionLost(reason) File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/web/client.py\", line 193, in connectionLost self.factory._disconnectedDeferred.callback(None) File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 368, in callback self._startRunCallbacks(result) File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 464, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 551, in _runCallbacks current.result = callback(current.result, *args, **kw) File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/s3/client.py\", line 153, in _parse_get_bucket root = XML(xml_bytes) File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/util.py\", line 61, in XML return parser.close() File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1655, in close self._raiseerror(v) File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1507, in _raiseerror raise err xml.etree.ElementTree.ParseError: no element found: line 1, column 0 ]'> ]) "
comment:11 Changed at 2013-11-29T05:42:21Z by zooko
Hm, I seem to get this every time that I run a full tahoe backup on my data now. ☹ It seems like the LeastAuthority S4 service is suffering badly from this.
File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/s3/client.py\", line 153, in _parse_get_bucket root = XML(xml_bytes) File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/util.py\", line 61, in XML return parser.close() File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1655, in close self._raiseerror(v) File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1507, in _raiseerror raise err xml.etree.ElementTree.ParseError: no element found: line 1, column 0 ]'> ]) "
comment:12 Changed at 2013-11-30T20:55:30Z by daira
- Priority changed from normal to major
comment:13 Changed at 2013-12-01T16:21:22Z by zooko
I just got this same error again, but this time on read instead of on upload. Manually retrying a couple of times (by hitting C-r in my web browser) got it to work.
comment:14 Changed at 2014-03-04T19:29:52Z by daira
CyberAxe just saw the same error: http://sebsauvage.net/paste/?e301277924eb38dc#6W5WPJkf1HgW3+rdFVvQ/2BqzYIZ0ZwQyPUb08+kY0A=
comment:15 Changed at 2014-03-04T19:30:43Z by daira
I'm going to make it retry.
comment:16 follow-up: ↓ 17 Changed at 2014-03-10T21:45:28Z by daira
Note that the error occurs in txAWS when it tries to parse an empty response body for a 500 error as XML. The fix might need to be in txAWS as well.
It's a bit difficult to test this, because the tests intentionally do not depend on txAWS (currently they only test the mock cloud backend, not the service container implementations).
comment:17 in reply to: ↑ 16 Changed at 2014-03-11T15:55:09Z by daira
Replying to daira:
Note that the error occurs in txAWS when it tries to parse an empty response body for a 500 error as XML.
No, this is a wrong inference. The error occurs in txaws.s3.client._parse_get_bucket, which should not be reached if the S3 server responded with a 500 error. (The 500 error in the stack trace is from the Tahoe-LAFS gateway, not S3.) So it looks like S3 returned success with an empty body.
comment:18 Changed at 2014-03-17T22:03:07Z by daira
- Description modified (diff)
I tried the same command-line again, and this time it worked. I had started by tahoe-lafs gateway a few second before I got this xml parse error, so I suspect that this is a case of #719.