Help understanding an UploadUnhappinessError
Kyle Markley
kyle at arbyte.us
Fri Jan 23 08:16:43 UTC 2015
On 12/11/14 14:39, Zooko O'Whielacronx wrote:
> Hi Kyle!
>
> Sorry that I didn't see this before now.
And I apologize for taking more than a month to finally follow up on it! :)
> I'm sorry that the error message is so frustrating. Fixing that is the
> topic of https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2101 (improve
> error messages from failed uploads). The best solution would be for it
> to just show, visually, a complete map of which shares went to which
> servers, and which shares failed when it attempted to send them to
> which servers, and how each one failed.
For what it's worth, I've tended to be satisfied by error messages that
(1) identify the specific thing that failed (e.g. a system call), (2)
provide the specific error (e.g. errno or strerror(errno)), and (3) for
extra credit, provide arguments that may have contributed to the error.
Example poor error message:
    Bad command or file name
Example good error message:
    open() path "/tmp/yQsOve" error #2: No such file or directory
So for example, when you fix the error messages, if there's a hiccup in
the network, I want to know that my socket got an ECONNRESET
communicating with some specific server. Don't just say "network
problem". :)
> The issue sounds like #1382. Perhaps you could try the #1382 branch
> (https://github.com/zooko/tahoe-lafs/tree/1382-rewrite-3) and see if
> that fixes it! Since it is so nicely reproducible for you.
I have not tried this yet. I didn't think this sounded like #1382 because
it happens for files that have no existing shares anywhere on the grid.
The problem seems connected to file size, not to share distribution.
>
>> When this error occurs, it has a side-effect of killing any other operation
>> occurring on the same node at the same time. (If I'm doing two "tahoe
>> backup" operations, they /both/ die.) But the problem is 100% reproducible
>> with a simple "tahoe put" operation running by itself.
> This is really interesting! I really want to understand this. I can't
> think of any reason why separate operations would affect one another.
>
> Could you please trigger an incident report on the gateway and also on
> the server?
Attached. This is the incident report from the gateway, which is one of
the 5 storage servers on the grid.
Here's the relevant excerpt from twistd.log:
2015-01-22 18:38:43-0800 [-] Log opened.
2015-01-22 18:38:43-0800 [-] twistd 13.0.0 (/usr/local/bin/python 2.7.8) starting up.
2015-01-22 18:38:43-0800 [-] reactor class: twisted.internet.pollreactor.PollReactor.
2015-01-22 18:38:43-0800 [-] Listener starting on 12756
2015-01-22 18:38:43-0800 [-] NevowSite starting on 2000
2015-01-22 18:38:43-0800 [-] Starting factory <nevow.appserver.NevowSite instance at 0x1077dbf271b8>
2015-01-22 18:38:43-0800 [-] My pid: 29283
2015-01-22 18:38:43-0800 [-] DatagramProtocol starting on 33814
2015-01-22 18:38:43-0800 [-] Starting protocol <twisted.internet.protocol.DatagramProtocol instance at 0x1077dbda9710>
2015-01-22 18:38:43-0800 [-] (UDP Port 33814 Closed)
2015-01-22 18:38:43-0800 [-] Stopping protocol <twisted.internet.protocol.DatagramProtocol instance at 0x1077dbda9710>
2015-01-22 23:06:49-0800 [-] disconnectTimeout, no data for 15 seconds
2015-01-22 23:06:49-0800 [-] disconnectTimeout, no data for 17 seconds
2015-01-22 23:06:49-0800 [-] disconnectTimeout, no data for 15 seconds
2015-01-22 23:06:49-0800 [-] disconnectTimeout, no data for 16 seconds
2015-01-22 23:06:49-0800 [-] disconnectTimeout, no data for 15 seconds
And here are the command line error messages. The first is from the
tahoe backup sending the gigantic file. The second is from a parallel,
independent backup operation that somehow gets killed along with the
gigantic file operation.
...
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 194, in process
    childcap, metadata = self.upload(childpath)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 305, in upload
    raise HTTPError("Error during file PUT", resp)
HTTPError: Error during file PUT: 500 Internal Server Error
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/foolscap-0.7.0-py2.7.egg/foolscap/call.py", line 679, in _done
    self.request.complete(res)
  File "/usr/local/lib/python2.7/site-packages/foolscap-0.7.0-py2.7.egg/foolscap/call.py", line 60, in complete
    self.deferred.callback(res)
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.6-amd64.egg/twisted/internet/defer.py", line 380, in callback
    self._startRunCallbacks(result)
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.6-amd64.egg/twisted/internet/defer.py", line 488, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.6-amd64.egg/twisted/internet/defer.py", line 575, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed or found on only 1 server(s). We were asked to place shares on at least 4 server(s) such that any 4 of them have enough shares to recover the file. (placed all 10 shares, want to place shares on at least 4 servers such that any 4 of them have enough shares to recover the file, sent 6 queries to 5 servers, 2 queries placed some shares, 4 placed none (of which 0 placed none due to the server being full and 4 placed none due to an error))
...
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 211, in process
    newdircap = mkdir(create_contents, self.options)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 47, in mkdir
    raise HTTPError("Error during mkdir", resp)
HTTPError: Error during mkdir: 500 Internal Server Error
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/foolscap-0.7.0-py2.7.egg/foolscap/call.py", line 679, in _done
    self.request.complete(res)
  File "/usr/local/lib/python2.7/site-packages/foolscap-0.7.0-py2.7.egg/foolscap/call.py", line 60, in complete
    self.deferred.callback(res)
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.6-amd64.egg/twisted/internet/defer.py", line 380, in callback
    self._startRunCallbacks(result)
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.6-amd64.egg/twisted/internet/defer.py", line 488, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.6-amd64.egg/twisted/internet/defer.py", line 575, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed or found on only 3 server(s). We were asked to place shares on at least 4 server(s) such that any 4 of them have enough shares to recover the file. (placed all 10 shares, want to place shares on at least 4 servers such that any 4 of them have enough shares to recover the file, sent 9 queries to 5 servers, 5 queries placed some shares, 4 placed none (of which 0 placed none due to the server being full and 4 placed none due to an error))
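For reference, here is a minimal sketch of how I read the happiness numbers in
those two messages. This is my own illustration, not Tahoe's implementation;
the server names, the share layout, and the happy=4 setting are assumed from the
error text. As I understand it, "servers of happiness" is the size of a maximum
matching between servers and the shares they hold, and the upload fails when it
falls below the configured shares.happy.

# Minimal sketch of the servers-of-happiness check (not Tahoe's code).
# happiness = size of a maximum matching between servers and shares.

def happiness(placements):
    """placements: dict mapping server name -> set of share numbers it holds."""
    matched = {}  # share number -> server it is currently matched to

    def try_match(server, visited):
        # Kuhn's augmenting-path step: try to give `server` a share,
        # re-matching other servers if necessary.
        for share in placements[server]:
            if share in visited:
                continue
            visited.add(share)
            if share not in matched or try_match(matched[share], visited):
                matched[share] = server
                return True
        return False

    return sum(1 for server in placements if try_match(server, set()))

# Roughly the first failure above: 5 servers, but the shares that actually
# landed all ended up on one server (the other queries errored out).
placements = {"srv1": set(range(10)), "srv2": set(), "srv3": set(),
              "srv4": set(), "srv5": set()}
happy = 4  # assumed shares.happy for this grid, per the error message
print("happiness = %d, needed >= %d" % (happiness(placements), happy))
# happiness = 1, so the upload is reported as unhappy.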
After the failures, the "Recent and Active Operations" page shows two
uploads in progress:
Started: 23:06:33 22-Jan-2015   Type: upload   Storage Index: gb2l3m5nflgkkbzbdx4tkgspfy
Helper?: No   Total Size: 1.16GB
Progress: hash: 100.0%, ciphertext: 0.0%, encode: 0.0%
Status: Contacting Servers [ylkbcys5] (second query), 0 shares left..

Started: 23:06:33 22-Jan-2015   Type: upload   Storage Index: nu7bxorkeiz7ziwe6loaouyxla
Helper?: No   Total Size: 269B
Progress: hash: 100.0%, ciphertext: 0.0%, encode: 0.0%
Status: Contacting Servers [ylkbcys5] (second query), 0 shares left..
--
Kyle Markley
-------------- next part --------------
A non-text attachment was scrubbed...
Name: incident-2015-01-23--07-36-03Z-vjl6kta.flog.bz2
Type: application/octet-stream
Size: 32331 bytes
Desc: not available
URL: <http://tahoe-lafs.org/pipermail/tahoe-dev/attachments/20150123/b6e437f3/attachment-0001.obj>