[tahoe-dev] [tahoe-lafs] #529: Implement Halt and Catch Fire

Tue Feb 24 19:37:09 PST 2009

#529: Implement Halt and Catch Fire
---------------------+------------------------------------------------------
 Reporter:  zandr    |           Owner:  nobody   
     Type:  defect   |          Status:  new      
 Priority:  major    |       Milestone:  undecided
Component:  unknown  |         Version:  1.2.0    
 Keywords:           |   Launchpad_bug:           
---------------------+------------------------------------------------------

Comment(by warner):

 Zandr and I were just talking about this one. The basic idea is that it
 would be nice if an HTTP load-balancer (which is sitting in front of a farm
 of webapi nodes) could cheaply detect that a given webapi node was not in a
 good state, and switch traffic to other nodes instead.

 To begin with, we could define what it means to be in a good state. We could
 put a bit of code inside the node, maybe {{{client.is_fully_functional()}}},
 with some configurable criteria, maybe one or more of the following:

  * connected to Introducer
  * connected to at least N storage servers
  * connected to all blessed (#466) storage servers
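
 A minimal sketch of such a check, assuming the three criteria above. The
 names HealthCriteria and NodeState, and the choice to return the list of
 unmet criteria alongside the boolean, are hypothetical and not part of any
 existing Tahoe API:

```python
from dataclasses import dataclass


@dataclass
class HealthCriteria:
    """Hypothetical configuration for the 'good state' test."""
    require_introducer: bool = True
    min_storage_servers: int = 1
    blessed_servers: frozenset = frozenset()


@dataclass
class NodeState:
    """Hypothetical snapshot of the node's current connections."""
    introducer_connected: bool = False
    connected_servers: frozenset = frozenset()


def is_fully_functional(state, criteria):
    """Return (ok, reasons); reasons names every criterion that is not met."""
    reasons = []
    if criteria.require_introducer and not state.introducer_connected:
        reasons.append("not connected to Introducer")
    if len(state.connected_servers) < criteria.min_storage_servers:
        reasons.append("connected to %d storage servers, need %d"
                       % (len(state.connected_servers),
                          criteria.min_storage_servers))
    missing = criteria.blessed_servers - state.connected_servers
    if missing:
        reasons.append("missing blessed servers: " + ", ".join(sorted(missing)))
    return (not reasons, reasons)
```

 Returning the reasons as well as the boolean would let a status page say why
 the node considers itself impaired.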

 Then, we could define how we want the webapi interface to behave when these
 criteria are not met, one of:

  * webapi port stops listening completely
  * webapi port returns errors on all /uri requests (both reads and writes)
  * return error on all writes (POST or PUT to /uri)
  * return some special value to GETs of one specific status url

 The first (stop listening entirely) is most useful for the load-balancer,
 because these devices typically assume that if a server responds at all,
 then it will be able to respond correctly. It would, however, make it
 difficult for us to solve the problem, since many of the diagnostic tools we
 would use are themselves pages in the webapi. Any of the other options would
 improve diagnosability, but would obligate the load-balancer to either look
 more carefully at the response (start diverting traffic when it sees 500
 Internal Server Errors coming back) or use special probe requests to hit the
 status URL on a periodic basis.
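
 To illustrate the status-URL option, here is a rough sketch using only the
 Python standard library rather than the webapi's own server code. The path
 /status/health, the JSON body, and the use of 503 for "not fully functional"
 are all assumptions made for the example; check is any callable returning
 (ok, reasons):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

STATUS_PATH = "/status/health"  # hypothetical URL, not an existing webapi page


def make_handler(check):
    """Build a handler class whose status page reflects check()."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != STATUS_PATH:
                self.send_error(404)
                return
            ok, reasons = check()
            body = json.dumps({"ok": ok, "reasons": reasons}).encode("utf-8")
            # 200 tells the probe to keep sending traffic, 503 to divert it.
            self.send_response(200 if ok else 503)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep periodic probe traffic out of stderr

    return Handler
```

 A load-balancer that supports HTTP probes could then poll this one URL and
 treat anything other than 200 as "divert traffic", while the diagnostic pages
 stay reachable.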

 We also kicked around the idea of having two webapi ports, one which turns
 itself off if the node were not fully functional, and a second which stays
 on all the time. With this sort of scheme, the load-balancer could point at
 the first port, and we'd use the second port for diagnostics.
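
 The port that turns itself off could be sketched roughly as below, with plain
 sockets standing in for the real webapi listener. GatedListener and its
 update() method are hypothetical names; something in the node would have to
 call update() whenever the health check changes its answer:

```python
import socket


class GatedListener:
    """Listening socket that exists only while the node is healthy."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.sock = None

    def update(self, healthy):
        """Open the port when healthy, close it when not."""
        if healthy and self.sock is None:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind((self.host, self.port))
            s.listen(16)
            # Remember the real port in case 0 (any free port) was requested.
            self.port = s.getsockname()[1]
            self.sock = s
        elif not healthy and self.sock is not None:
            # New connections are refused from here on, which is the signal
            # a simple load-balancer understands.
            self.sock.close()
            self.sock = None
```

 The always-on diagnostics port would be an ordinary listener that never goes
 through update().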

 A tangentially-related issue is that sometimes the node can appear to start:
 'tahoe start' returns with success, but the node is in fact impaired in some
 fatal way. I believe that a node which is unable to open the webapi
 listening port will exhibit this behavior. I think there was a change to
 node startup recently (the implementation of 'tahoe start') which makes this
 troublesome, in which the bind() call is taking place after the fork(),
 whereas it used to be before the fork(). #602 and #71 probably relate to
 this one, as well as #371.
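
 To show why the ordering matters, here is a simplified POSIX-only sketch of
 binding before forking. It is not the actual 'tahoe start' code: the function
 names are made up for the example, and real daemonization (setsid, a second
 fork, redirecting stdio) is left out:

```python
import os
import socket
import sys


def open_listener(host, port):
    """Bind and listen, raising OSError if the port cannot be opened."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        s.listen(16)
    except OSError:
        s.close()
        raise
    return s


def start(host, port, serve):
    """Return 0 if the child was launched, 1 if the port could not be opened."""
    try:
        listener = open_listener(host, port)  # bind() happens before fork()
    except OSError as e:
        sys.stderr.write("cannot listen on %s:%d: %s\n" % (host, port, e))
        return 1  # the caller sees the failure in the exit status
    if os.fork() > 0:
        listener.close()  # parent: the child keeps its own copy of the socket
        return 0
    serve(listener)  # child: runs with the already-bound socket
    os._exit(0)
```

 If the bind() were moved into the child, the parent would have already
 returned success by the time the failure happened.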

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/529#comment:1>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid