#2116 closed defect

xml parse error from S4 service

Reported by: zooko
Owned by: daira
Priority: major
Milestone: undecided
Component: code-storage
Version: 1.9.0-s3branch
Keywords: s3 error txaws retry LeastAuthority
Cc:
Launchpad Bug:

Description (last modified by daira)

zooko@spark ~/research/papers $ tahoe backup . $WRITABLE_DIRCAP:
Traceback (most recent call last):
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/runner.py", line 156, in run
    rc = runner(sys.argv[1:], install_node_control=install_node_control)
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/runner.py", line 141, in runner
    rc = cli.dispatch[command](so)
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/cli.py", line 574, in backup
    rc = tahoe_backup.backup(options)
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 325, in backup
    return bu.run()
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 194, in process
    childcap, metadata = self.upload(childpath)
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 305, in upload
    raise HTTPError("Error during file PUT", resp)
HTTPError: Error during file PUT: 500 Internal Server Error
"Traceback (most recent call last):
  File \"/home/zooko/playground/LAFS/pristine/support/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py\", line 753, in receiveClose
    self.request.fail(f)
  File \"/home/zooko/playground/LAFS/pristine/support/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py\", line 95, in fail
    self.deferred.errback(why)
  File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 421, in errback
    self._startRunCallbacks(fail)
  File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 488, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 575, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 604, in _got_response
    return self._loop()
  File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 516, in _loop
    return self._failed(msg)
  File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: server selection failed for 
<Tahoe2ServerSelector for upload qd7vq>: shares could be placed or found on 
only 0 server(s). We were asked to place shares on at least 1 server(s) such 
that any 1 of them have enough shares to recover the file. (placed 0 shares out 
of 1 total (1 homeless), want to place shares on at least 1 servers such that 
any 1 of them have enough shares to recover the file, sent 1 queries to 1 
servers, 0 queries placed some shares, 1 placed none (of which 0 placed none 
due to the server being full and 1 placed none due to an error)) (last failure 
(from <ServerTracker for server 6dwd2q and SI qd7vq>) was:
[Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.RemoteException'>:
<RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/tcp.py\", line 287, in connectionLost
    protocol.connectionLost(reason)
  File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/web/client.py\", line 193, in connectionLost
    self.factory._disconnectedDeferred.callback(None)
  File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 368, in callback
    self._startRunCallbacks(result)
  File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 464, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> --- 
  File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 551, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/s3/client.py\", line 153, in _parse_get_bucket
    root = XML(xml_bytes)
  File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/util.py\", line 61, in XML
    return parser.close()
  File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1655, in close
    self._raiseerror(v)
  File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1507, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
]'>
])
"

Change History (18)

comment:1 Changed at 2013-11-26T04:34:59Z by zooko

I tried the same command-line again, and this time it worked. I had started my tahoe-lafs gateway a few seconds before I got this xml parse error, so I suspect that this is a case of #719.

comment:2 Changed at 2013-11-26T17:00:40Z by daira

I don't think it is a case of #719: the S3 container code in the server got as far as making an S3 request and trying to parse the result, which was not valid XML. #719 refers to the case where the client can't reach enough servers.

Last edited at 2013-11-26T19:18:17Z by daira

comment:3 Changed at 2013-11-26T17:01:10Z by daira

Actually the response from S3 seems to have been empty.

comment:4 follow-up: Changed at 2013-11-26T17:08:41Z by daira

  • Keywords s3 error txaws retry LeastAuthority added
  • Owner set to daira
  • Status changed from new to assigned
  • Version changed from 1.10.0 to 1.9.0-s3branch

Hmm, maybe an empty S3 response should trigger a retry?

comment:5 Changed at 2013-11-26T17:13:13Z by daira

Sigh, I wish we could reduce that error report to something more concise -- although I suppose we should be grateful that the actual error was in there somewhere.

Last edited at 2013-11-26T17:13:26Z by daira

comment:6 in reply to: ↑ 4 Changed at 2013-11-27T19:59:27Z by zooko

Replying to daira:

Hmm, maybe an empty S3 response should trigger a retry?

I guess. :-/

This is what I think of as "snowshoes": having our upper layer perform redundancy or retrying just to work around what appear to be bugs or unreliability in the lower layer. Ick.

comment:7 follow-up: Changed at 2013-11-27T20:01:57Z by daira

Well, I'm philosophically resistant to that as well, but I have no confidence that we can get the underlying problem with S3 fixed.

comment:8 in reply to: ↑ 7 Changed at 2013-11-27T20:03:38Z by zooko

Replying to daira:

Well, I'm philosophically resistant to that as well, but I have no confidence that we can get the underlying problem with S3 fixed.

Yeah, I agree that we should go ahead and add a limited number of exponential-backoff retries for this.
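
For concreteness, here is a minimal sketch of what such exponential-backoff retries could look like, built on Twisted's deferLater. The operation callable, the parameter defaults, and treating every failure as retryable are illustrative assumptions, not the actual cloud-backend API:

    from twisted.internet import reactor
    from twisted.internet.task import deferLater

    def retry_with_backoff(operation, max_tries=4, first_delay=1.0):
        # 'operation' is assumed to be a no-argument callable returning a
        # Deferred. On failure, wait, double the delay, and try again,
        # up to max_tries attempts in total.
        def _attempt(tries_left, delay):
            d = operation()
            def _failed(f):
                if tries_left <= 1:
                    return f  # out of retries: propagate the original Failure
                return deferLater(reactor, delay, _attempt, tries_left - 1, delay * 2)
            d.addErrback(_failed)
            return d
        return _attempt(max_tries, first_delay)

In the cloud backend this would wrap the individual S3 requests, so a transient empty response costs a short delay instead of failing the whole upload. In practice we would presumably only retry failures we believe are transient (like the empty-body case) rather than every error.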

comment:9 Changed at 2013-11-27T20:04:42Z by daira

Note that this isn't the first time I've seen an empty response from S3. They are rare, but they've come up before; we just hadn't recorded the problem until now.

comment:10 Changed at 2013-11-28T12:52:35Z by zooko

I just got a similar failure again:

Traceback (most recent call last):
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/runner.py", line 156, in run
    rc = runner(sys.argv[1:], install_node_control=install_node_control)
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/runner.py", line 141, in runner
    rc = cli.dispatch[command](so)
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/cli.py", line 574, in backup
    rc = tahoe_backup.backup(options)
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 325, in backup
    return bu.run()
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 194, in process
    childcap, metadata = self.upload(childpath)
  File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 305, in upload
    raise HTTPError("Error during file PUT", resp)
HTTPError: Error during file PUT: 500 Internal Server Error
"Traceback (most recent call last):
  File \"/home/zooko/playground/LAFS/pristine/support/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py\", line 753, in receiveClose
    self.request.fail(f)
  File \"/home/zooko/playground/LAFS/pristine/support/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py\", line 95, in fail
    self.deferred.errback(why)
  File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 421, in errback
    self._startRunCallbacks(fail)
  File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 488, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 575, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 604, in _got_response
    return self._loop()
  File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 516, in _loop
    return self._failed(msg)
  File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: server selection failed for <Tahoe2ServerSelector 
for upload s7uqw>: shares could be placed or found on only 0 server(s). We were asked to place 
shares on at least 1 server(s) such that any 1 of them have enough shares to recover the file. 
(placed 0 shares out of 1 total (1 homeless), want to place shares on at least 1 servers such 
that any 1 of them have enough shares to recover the file, sent 1 queries to 1 servers, 0 
queries placed some shares, 1 placed none (of which 0 placed none due to the server being full 
and 1 placed none due to an error)) (last failure (from <ServerTracker for server 6dwd2q and SI 
s7uqw>) was:
[Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.RemoteException'>: <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/tcp.py\", line 287, in connectionLost
    protocol.connectionLost(reason)
  File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/web/client.py\", line 193, in connectionLost
    self.factory._disconnectedDeferred.callback(None)
  File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 368, in callback
    self._startRunCallbacks(result)
  File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 464, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 551, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/s3/client.py\", line 153, in _parse_get_bucket
    root = XML(xml_bytes)
  File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/util.py\", line 61, in XML
    return parser.close()
  File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1655, in close
    self._raiseerror(v)
  File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1507, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
]'>
])
"

Last edited at 2014-03-17T22:00:44Z by daira

comment:11 Changed at 2013-11-29T05:42:21Z by zooko

Hm, I seem to get this every time that I run a full tahoe backup on my data now. ☹ It seems like the LeastAuthority S4 service is suffering badly from this.

  File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/s3/client.py\", line 153, in _parse_get_bucket
    root = XML(xml_bytes)
  File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/util.py\", line 61, in XML
    return parser.close()
  File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1655, in close
    self._raiseerror(v)
  File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1507, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
]'>
])
"

comment:12 Changed at 2013-11-30T20:55:30Z by daira

  • Priority changed from normal to major

comment:13 Changed at 2013-12-01T16:21:22Z by zooko

I just got this same error again, but this time on read instead of on upload. Manually retrying a couple of times (by hitting C-r in my web browser) got it to work.

comment:15 Changed at 2014-03-04T19:30:43Z by daira

I'm going to make it retry.

comment:16 follow-up: Changed at 2014-03-10T21:45:28Z by daira

Note that the error occurs in txAWS when it tries to parse an empty response body for a 500 error as XML. The fix might need to be in txAWS as well.

It's a bit difficult to test this, because the tests intentionally do not depend on txAWS (currently they only test the mock cloud backend, not the service container implementations).
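
One way to exercise the retry behaviour without depending on txAWS might be a small test double wrapped around the existing mock container. Everything in this sketch (the class name, the list_objects method, the wrapped container's interface) is hypothetical, just to illustrate the shape of such a test:

    from xml.etree.ElementTree import ParseError
    from twisted.internet import defer

    class OneBadListContainer(object):
        # Wraps a working mock container; the first list call fails with the
        # same ParseError this ticket shows, subsequent calls pass through.
        def __init__(self, wrapped):
            self._wrapped = wrapped
            self._tripped = False

        def list_objects(self, prefix=""):
            if not self._tripped:
                self._tripped = True
                return defer.fail(ParseError("no element found: line 1, column 0"))
            return self._wrapped.list_objects(prefix)

A retry test would then assert that an upload through a backend using this container still succeeds, and that it fails once the retry limit is exhausted.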

comment:17 in reply to: ↑ 16 Changed at 2014-03-11T15:55:09Z by daira

Replying to daira:

Note that the error occurs in txAWS when it tries to parse an empty response body for a 500 error as XML.

No, this is a wrong inference. The error occurs in txaws.s3.client._parse_get_bucket, which should not be reached if the S3 server responded with a 500 error. (The 500 error in the stack trace is from the Tahoe-LAFS gateway, not S3.) So it looks like S3 returned success with an empty body.
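
If part of the fix does land on the txAWS side (or in a wrapper around it), the minimal guard would be to reject an empty body before handing it to ElementTree, so callers see a distinct, retryable error instead of the unhelpful "no element found: line 1, column 0". A sketch follows; EmptyResponseError is a made-up exception name and parse_bucket_listing is a stand-in for whatever function ends up doing this, not the real _parse_get_bucket:

    from xml.etree import ElementTree

    class EmptyResponseError(Exception):
        # S3 returned a successful status but an empty response body.
        pass

    def parse_bucket_listing(xml_bytes):
        # Guard: an empty 200 response is not valid XML; surface it as a
        # distinct error that retry logic can recognise and handle.
        if not xml_bytes or not xml_bytes.strip():
            raise EmptyResponseError("empty response body from S3 bucket listing")
        return ElementTree.XML(xml_bytes)

The retry code could then catch EmptyResponseError specifically rather than retrying on every failure.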

comment:18 Changed at 2014-03-17T22:03:07Z by daira

  • Description modified