Help understanding an UploadUnhappinessError
Kyle Markley
kyle at arbyte.us
Fri Jan 23 08:16:43 UTC 2015
On 12/11/14 14:39, Zooko O'Whielacronx wrote:
> Hi Kyle!
>
> Sorry that I didn't see this before now.
And I apologize for taking more than a month to finally follow up on it! :)
> I'm sorry that the error message is so frustrating. Fixing that is the
> topic of https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2101 (improve
> error messages from failed uploads). The best solution would be for it
> to just show, visually, a complete map of which shares went to which
> servers, and which shares failed when it attempted to send them to
> which servers, and how each one failed.
For what it's worth, I've tended to be satisfied by error messages that
(1) identify the specific thing that failed (e.g. a system call), (2)
provide the specific error (e.g. errno or strerror(errno)), and (3) for
extra credit, provide arguments that may have contributed to the error.
Example poor error message:
    Bad command or file name
Example good error message:
    open() path "/tmp/yQsOve" error #2: No such file or directory
So for example, when you fix the error messages, if there's a hiccup in
the network, I want to know that my socket got an ECONNRESET
communicating with some specific server. Don't just say "network
problem". :)
> The issue sounds like #1382. Perhaps you could try the #1382 branch
> (https://github.com/zooko/tahoe-lafs/tree/1382-rewrite-3) and see if
> that fixes it! Since it is so nicely reproducible for you.
I have not tried this yet. I didn't think this sounded like #1382 because
it happens for files that have no existing shares anywhere on the grid.
The problem seems connected to file size, not to share distribution.
>
>> When this error occurs, it has a side-effect of killing any other operation
>> occurring on the same node at the same time. (If I'm doing two "tahoe
>> backup" operations, they /both/ die.) But the problem is 100% reproducible
>> with a simple "tahoe put" operation running by itself.
> This is really interesting! I really want to understand this. I can't
> think of any reason why separate operations would affect one another.
>
> Could you please trigger an incident report on the gateway and also on
> the server?
Attached. This is the incident report from the gateway, which is one of
the 5 storage servers on the grid.
Here's the relevant excerpt from twistd.log:
2015-01-22 18:38:43-0800 [-] Log opened.
2015-01-22 18:38:43-0800 [-] twistd 13.0.0 (/usr/local/bin/python 2.7.8) starting up.
2015-01-22 18:38:43-0800 [-] reactor class: twisted.internet.pollreactor.PollReactor.
2015-01-22 18:38:43-0800 [-] Listener starting on 12756
2015-01-22 18:38:43-0800 [-] NevowSite starting on 2000
2015-01-22 18:38:43-0800 [-] Starting factory <nevow.appserver.NevowSite instance at 0x1077dbf271b8>
2015-01-22 18:38:43-0800 [-] My pid: 29283
2015-01-22 18:38:43-0800 [-] DatagramProtocol starting on 33814
2015-01-22 18:38:43-0800 [-] Starting protocol <twisted.internet.protocol.DatagramProtocol instance at 0x1077dbda9710>
2015-01-22 18:38:43-0800 [-] (UDP Port 33814 Closed)
2015-01-22 18:38:43-0800 [-] Stopping protocol <twisted.internet.protocol.DatagramProtocol instance at 0x1077dbda9710>
2015-01-22 23:06:49-0800 [-] disconnectTimeout, no data for 15 seconds
2015-01-22 23:06:49-0800 [-] disconnectTimeout, no data for 17 seconds
2015-01-22 23:06:49-0800 [-] disconnectTimeout, no data for 15 seconds
2015-01-22 23:06:49-0800 [-] disconnectTimeout, no data for 16 seconds
2015-01-22 23:06:49-0800 [-] disconnectTimeout, no data for 15 seconds
And here are the command line error messages. The first is from the
tahoe backup sending the gigantic file. The second is from a parallel,
independent backup operation that somehow gets killed along with the
gigantic file operation.
...
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 194, in process
    childcap, metadata = self.upload(childpath)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 305, in upload
    raise HTTPError("Error during file PUT", resp)
HTTPError: Error during file PUT: 500 Internal Server Error
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/foolscap-0.7.0-py2.7.egg/foolscap/call.py", line 679, in _done
    self.request.complete(res)
  File "/usr/local/lib/python2.7/site-packages/foolscap-0.7.0-py2.7.egg/foolscap/call.py", line 60, in complete
    self.deferred.callback(res)
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.6-amd64.egg/twisted/internet/defer.py", line 380, in callback
    self._startRunCallbacks(result)
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.6-amd64.egg/twisted/internet/defer.py", line 488, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.6-amd64.egg/twisted/internet/defer.py", line 575, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed or found on only 1 server(s). We were asked to place shares on at least 4 server(s) such that any 4 of them have enough shares to recover the file. (placed all 10 shares, want to place shares on at least 4 servers such that any 4 of them have enough shares to recover the file, sent 6 queries to 5 servers, 2 queries placed some shares, 4 placed none (of which 0 placed none due to the server being full and 4 placed none due to an error))
...
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 211, in process
    newdircap = mkdir(create_contents, self.options)
  File "/usr/local/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 47, in mkdir
    raise HTTPError("Error during mkdir", resp)
HTTPError: Error during mkdir: 500 Internal Server Error
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/foolscap-0.7.0-py2.7.egg/foolscap/call.py", line 679, in _done
    self.request.complete(res)
  File "/usr/local/lib/python2.7/site-packages/foolscap-0.7.0-py2.7.egg/foolscap/call.py", line 60, in complete
    self.deferred.callback(res)
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.6-amd64.egg/twisted/internet/defer.py", line 380, in callback
    self._startRunCallbacks(result)
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.6-amd64.egg/twisted/internet/defer.py", line 488, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.6-amd64.egg/twisted/internet/defer.py", line 575, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed or found on only 3 server(s). We were asked to place shares on at least 4 server(s) such that any 4 of them have enough shares to recover the file. (placed all 10 shares, want to place shares on at least 4 servers such that any 4 of them have enough shares to recover the file, sent 9 queries to 5 servers, 5 queries placed some shares, 4 placed none (of which 0 placed none due to the server being full and 4 placed none due to an error))
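For reference, here is a minimal sketch of how I read the happiness numbers in
those two messages. This is my own illustration, not Tahoe's implementation;
the server names, the share layout, and the happy=4 setting are assumed from the
error text. As I understand it, "servers of happiness" is the size of a maximum
matching between servers and the shares they hold, and the upload fails when it
falls below the configured shares.happy.

# Minimal sketch of the servers-of-happiness check (not Tahoe's code).
# happiness = size of a maximum matching between servers and shares.

def happiness(placements):
    """placements: dict mapping server name -> set of share numbers it holds."""
    matched = {}  # share number -> server it is currently matched to

    def try_match(server, visited):
        # Kuhn's augmenting-path step: try to give `server` a share,
        # re-matching other servers if necessary.
        for share in placements[server]:
            if share in visited:
                continue
            visited.add(share)
            if share not in matched or try_match(matched[share], visited):
                matched[share] = server
                return True
        return False

    return sum(1 for server in placements if try_match(server, set()))

# Roughly the first failure above: 5 servers, but the shares that actually
# landed all ended up on one server (the other queries errored out).
placements = {"srv1": set(range(10)), "srv2": set(), "srv3": set(),
              "srv4": set(), "srv5": set()}
happy = 4  # assumed shares.happy for this grid, per the error message
print("happiness = %d, needed >= %d" % (happiness(placements), happy))
# happiness = 1, so the upload is reported as unhappy.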
After the failures, the "Recent and Active Operations" page shows two
uploads in progress:
Started: 23:06:33 22-Jan-2015   Type: upload   Storage Index: gb2l3m5nflgkkbzbdx4tkgspfy
Helper?: No   Total Size: 1.16GB
Progress: hash: 100.0%, ciphertext: 0.0%, encode: 0.0%
Status: Contacting Servers [ylkbcys5] (second query), 0 shares left..

Started: 23:06:33 22-Jan-2015   Type: upload   Storage Index: nu7bxorkeiz7ziwe6loaouyxla
Helper?: No   Total Size: 269B
Progress: hash: 100.0%, ciphertext: 0.0%, encode: 0.0%
Status: Contacting Servers [ylkbcys5] (second query), 0 shares left..
--
Kyle Markley
-------------- next part --------------
A non-text attachment was scrubbed...
Name: incident-2015-01-23--07-36-03Z-vjl6kta.flog.bz2
Type: application/octet-stream
Size: 32331 bytes
Desc: not available
URL: <http://tahoe-lafs.org/pipermail/tahoe-dev/attachments/20150123/b6e437f3/attachment-0001.obj>