#502 closed defect (fixed)

remove peers from client's table upon DeadReferenceError

Reported by: warner
Owned by:
Priority: major
Milestone: 1.3.0
Component: code-network
Version: 1.2.0
Keywords:
Cc:
Launchpad Bug:

Description

It looks like, once a peer has disconnected, the client never removes the RemoteReference from its table. As a result, we keep getting new DeadReferenceErrors forever, which creates a lot of Incidents and log traffic.
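
As a rough sketch of what the fix might look like (the foolscap.api import path is real, but the class, attribute, and method names here are assumptions, not the actual Tahoe code): attach an errback to each remote call that traps DeadReferenceError and drops the RemoteReference from the table, so later operations skip that peer.

    from foolscap.api import DeadReferenceError

    class PeerTable:
        """Hypothetical sketch: forget a peer the first time a call to it
        fails with DeadReferenceError, rather than retrying it forever."""
        def __init__(self):
            self.connections = {}  # peerid -> RemoteReference (assumed layout)

        def send_query(self, peerid, methname, *args):
            rref = self.connections[peerid]
            d = rref.callRemote(methname, *args)
            d.addErrback(self._lost_peer, peerid)
            return d

        def _lost_peer(self, failure, peerid):
            failure.trap(DeadReferenceError)
            # drop the RemoteReference so future operations skip this peer
            self.connections.pop(peerid, None)
            return failure  # let the caller see the error too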

There is a related problem, in which a silent partition (one in which you have to wait for TCP to notice the problem) can result in dozens of remote messages being queued (especially if these are the readvs from a deep-stats or deep-check). All of these messages will errback at about the same time (in sequential turns) when the disconnect is finally discovered, and removing the peer from the table won't help with this. We'd need some significant changes to the connection model (perhaps a completely non-connection-oriented model), and some layering violations, to allow the higher level to say "never mind: I don't need this message delivered anymore".
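
Very roughly, the "never mind" idea could look like the following (purely hypothetical; no such API exists in foolscap today): the higher layer keeps a handle on the pending call and marks it as no longer interesting, so the eventual errback is discarded instead of producing log traffic.

    class OptionalQuery:
        """Hypothetical sketch of the 'never mind' idea: the caller can
        declare that it no longer cares about a pending remote call, so
        the DeadReferenceError that arrives when TCP finally gives up is
        swallowed quietly instead of being logged."""
        def __init__(self, rref, methname, *args):
            self.interested = True
            self.d = rref.callRemote(methname, *args)
            self.d.addErrback(self._maybe_ignore)

        def never_mind(self):
            self.interested = False

        def _maybe_ignore(self, failure):
            if not self.interested:
                return None  # swallow: nobody is waiting for this answer
            return failure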

Change History (4)

comment:1 Changed at 2008-08-25T22:04:29Z by warner

Also, DeadReferenceError should probably be logged below the level of WEIRD. That level was really there to catch even stranger stuff, like Violations or non-partitioning exceptions.

comment:2 Changed at 2008-08-26T02:05:03Z by warner

I changed the logging to put DeadReferenceError at UNUSUAL, which doesn't trigger an incident.

It looks like the introducer client code has the necessary notifyOnDisconnect stuff to forget about a server that we've lost the connection to, so I need to re-examine the logs and see whether that was really what was happening, or whether there is some other problem.

comment:3 Changed at 2008-08-27T01:02:01Z by warner

  • Milestone changed from undecided to 1.2.1
  • Resolution set to fixed
  • Status changed from new to closed

ok, I think the actual problem was just a large number of outstanding queries that were waiting for a lost peer to finally be disconnected. The peer in question (at least in the last batch of Incidents that I examined) happened to be zooko's laptop 'ootles', and laptops are just the sort of node that tends to disappear silently (when they get unplugged, suspended, or powered down quickly). It can take 15-30 minutes for TCP to notice that a peer hasn't responded in a while.

Once TCP gives up on a connection, our IntroducerClient (via a foolscap notifyOnDisconnect hook) will remove the peer from the list of known peers, so operations that occur after this point will not attempt to use the peer.
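
A minimal sketch of that notifyOnDisconnect arrangement (the hook itself is real foolscap API; the method and attribute names are illustrative, not the actual IntroducerClient code):

    def _got_peer(self, peerid, rref):
        # remember the peer, and arrange to forget it when foolscap
        # notices (via TCP) that the connection is gone
        self.connections[peerid] = rref
        rref.notifyOnDisconnect(self._lost_peer, peerid)

    def _lost_peer(self, peerid):
        # may fire 15-30 minutes after a silent partition
        self.connections.pop(peerid, None)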

While we're in that window (after the silent partition, but before TCP gives up), the following Tahoe messages will be sent and remain outstanding until TCP finally gives up on them:

  • immutable do-you-have-share queries (download will stall until all queries are retired)
  • immutable allocate_buckets queries (upload will stall)
  • mutable readv, for servermap update (mapupdate will complete as soon as enough shares are found)

For mutable operations, the mapupdate phase is the most vulnerable part: for 15-30 minutes after a partition, every mapupdate operation will result in an outstanding message for each lost peer. If there are a lot of directory reads or writes (say, because of a deep-size operation), there will be a lot of outstanding messages (I counted 102 in this most recent batch). These messages will finally be retired in one big batch when TCP gives up. Until change 3b06aa6a79dd26cc, each retired message would trigger a log.WEIRD-level event, which triggered an Incident, which took two seconds to write out to disk. As a result, there was about 200 seconds of intense CPU and disk activity (and no other messages being received) between foolscap noticing the connection loss and the Tahoe IntroducerClient seeing it. This was the source of all the incidents.

Note that the mapupdate frequently completes promptly even when a few of the servers are slow in responding (or giving up). This is particularly true for retrieve, which only requires k+epsilon shares and will simply ignore the slowpokes.

Mutable publish/retrieve is less vulnerable, since the only way to get it to use a partitioned server is to lose the connection after the mapupdate and before the main operation, and that is a much smaller time window (certainly less than a second).

For immutable publish, if the peer selector tries to use a partitioned server, the peer selection phase will stall until that message is retired. For download, the _get_all_shareholders call will query *all* servers, so it will stall if *any* servers are in this partitioned state.
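
To illustrate why download stalls, here is a simplified sketch of an all-servers query phase (the remote method name and surrounding names are approximations, not the exact Tahoe code): the DeferredList only fires once every query has been answered or errbacked, so a single partitioned server delays the whole phase for as long as TCP takes to give up on it.

    from twisted.internet.defer import DeferredList

    def _get_all_shareholders(self):
        # one do-you-have-share query per known server; nothing proceeds
        # until *every* Deferred here has been retired
        dl = []
        for peerid, rref in self.connections.items():
            d = rref.callRemote("get_buckets", self._storage_index)
            d.addCallbacks(self._got_response, self._got_error,
                           callbackArgs=(peerid,), errbackArgs=(peerid,))
            dl.append(d)
        return DeferredList(dl)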

Change 3b06aa6a79dd26cc does two things:

  • reduces the severity of mutable queries that fail with DeadReferenceError to log.UNUSUAL, so that they do not trigger incidents (a sketch of this follows the list)
  • removes logging of responses that arrive after the operation has completed
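
A rough sketch of the first change (illustrative only; the real change is in 3b06aa6a79dd26cc, and the errback and attribute names here are assumptions): trap DeadReferenceError and log it at UNUSUAL, reserving WEIRD (and therefore incidents) for genuinely unexpected failures.

    from foolscap.api import DeadReferenceError
    from foolscap.logging import log

    def _query_failed(self, failure, peerid):
        if failure.check(DeadReferenceError):
            # losing a connection is expected from time to time: log it
            # below WEIRD so it does not trigger an incident
            log.msg("lost connection to %s during mapupdate" % peerid,
                    level=log.UNUSUAL)
            return None
        # anything else really is unexpected
        log.msg("unexpected error from %s: %s" % (peerid, failure),
                level=log.WEIRD)
        return failure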

I think this will solve the too-many-incidents problem. Other things that need to be done:

  • change foolscap to collapse multiple simultaneous incidents into one
  • change immutable publish to give up on peers after a reasonable delay (perhaps computed by noticing how quickly the other peers respond)
  • change immutable download to start the download process as soon as k shares have been located. Lingering queries should be used to populate an 'alternates list', and slow responses to get_block should cause the downloader to switch to an alternate. This would also help fix stalls when we lose a server in the middle of a download.
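
As a very rough sketch of that last item (a hypothetical helper class, not existing downloader code): gather answers as they arrive, fire as soon as k shares have been located, and stash anything that comes in later as an alternate.

    from twisted.internet.defer import Deferred

    class ShareGatherer:
        """Hypothetical sketch of 'start downloading once k shares are
        found': fire as soon as k servers have answered, and keep any
        later answers as alternates instead of waiting for every query
        to be retired."""
        def __init__(self, k):
            self.k = k
            self.found = {}        # peerid -> buckets
            self.alternates = {}   # late answers, usable if a server dies
            self.done = Deferred() # fires with self.found once len >= k

        def add_query(self, peerid, d):
            d.addCallbacks(self._got, lambda f: None, callbackArgs=(peerid,))

        def _got(self, buckets, peerid):
            if self.done.called:
                self.alternates[peerid] = buckets
            elif buckets:
                self.found[peerid] = buckets
                if len(self.found) >= self.k:
                    self.done.callback(self.found)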

comment:4 Changed at 2008-09-03T01:17:59Z by warner

  • Milestone changed from 1.3.1 to 1.3.0