[tahoe-dev] [tahoe-lafs] #653: introducer client: connection count is wrong, !VersionedRemoteReference needs EQ

Fri Jul 17 06:07:29 PDT 2009

#653: introducer client: connection count is wrong, !VersionedRemoteReference
needs EQ
--------------------------+-------------------------------------------------
 Reporter:  warner        |           Owner:  warner  
     Type:  defect        |          Status:  assigned
 Priority:  major         |       Milestone:  1.5.0   
Component:  code-network  |         Version:  1.3.0   
 Keywords:                |   Launchpad_bug:          
--------------------------+-------------------------------------------------

Comment(by zooko):

 > I guess something may still be broken. We'll probably have to wait for
 > it to become weird again and then look at the introducerclient's internal
 > state.

 :-(  I guess I shouldn't have rebooted it then.  Does that mean we should
 go ahead and boot this issue out of v1.5 Milestone?  I have an idea --
 let's add an assertion in the code that the number of connected storage
 servers <= number of known storage servers.  Perhaps even a stronger, more
 detailed assertion if that's feasible.
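
 A minimal sketch of what such an assertion could look like.  The function
 name {{{check_connection_invariants}}} and its two arguments are
 hypothetical, not part of the introducer client's actual API; it assumes
 only that each server can be identified by some hashable id.  The first two
 checks are the "stronger, more detailed" form, and the last is the simple
 count comparison.

```python
def check_connection_invariants(known_ids, connected_ids):
    """Hypothetical sanity check; not the real introducer client code.

    known_ids:     iterable of ids for servers we have heard announcements for
    connected_ids: iterable of ids for servers we believe we are connected to
    """
    known = set(known_ids)
    connected = list(connected_ids)

    # Stronger check 1: no server is counted as connected more than once.
    duplicates = sorted(x for x in set(connected) if connected.count(x) > 1)
    assert not duplicates, "duplicate connections: %r" % (duplicates,)

    # Stronger check 2: every connected server is one we know about.
    unknown = sorted(set(connected) - known)
    assert not unknown, "connected to unknown servers: %r" % (unknown,)

    # The basic check proposed above: connected <= known.
    assert len(connected) <= len(known), (
        "connected=%d exceeds known=%d" % (len(connected), len(known)))
```

 With this sketch, {{{check_connection_invariants(["a", "b"], ["a"])}}}
 passes silently, while a doubled entry such as {{{["a", "a", "b"]}}} for
 the connected list raises {{{AssertionError}}}.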

 Here's what evidence I can gather about the problem that was exhibited.

 I upgraded and restarted testgrid.allmydata.org:3567 at 2009-07-16
 16:17:18.835Z (on testgrid.allmydata.org's clock).  There was nothing that
 looked too unusual in the {{{twistd.log}}} that day.  There are two
 incidents reported, attached, from that day:
 {{{incident-2009-07-16-002829-gc4xv5y.flog.bz2}}} and
 {{{incident-2009-07-16-002846-pypgfay.flog.bz2}}}.

 Here are foolscap log-viewer web services, one for each of those logfiles:
 http://testgrid.allmydata.org:10000/ and http://testgrid.allmydata.org:10001.
 I have a hard time learning what I want to know from these logfiles.  What
 I want to know (at least at first) is mostly about temporal coincidence.
 For starters, I'd like to be sure that these incidents occurred before I
 rebooted and upgraded the server, not after.  However, the timestamps,
 such as "# 19:19:47.489 [23537270]: UNUSUAL excessive reactor delay
 ({'args': (25.734021902084351,), 'format': 'excessive reactor delay
 (%ss)', 'incarnation': ('\xf5\x16\x1dl\xb2\xf5\x85\xf9', None), 'num':
 23537270, 'time': 1247710787.4896951, 'level': 23}s)" don't tell me what
 day it was nor what timezone the timestamp is in.  Checking the status of
 http://foolscap.lothar.com/trac/ticket/90 suggests that the timestamps
 might be in PDT, which is UTC-7.  If that's the case then ... No, some of
 the (causally) earliest events in the log are from 18:01.  Perhaps they
 were from 2009-07-15?  Argh, I give up.  Please tell me how to understand
 the timing of events in foolscap incident report files.  I updated
 http://foolscap.lothar.com/trac/ticket/90 to plead for fully-qualified,
 unambiguous timestamps.
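
 One possible way to pin down the date, sketched under an assumption: the
 event quoted above carries a {{{'time'}}} field, and if that field is
 seconds since the Unix epoch (an assumption, not something confirmed
 here), it can be converted to an unambiguous UTC timestamp and compared
 with the restart time.  The variable names are illustrative, and the fixed
 UTC-7 offset (Pacific Daylight Time) is likewise assumed.

```python
from datetime import datetime, timedelta, timezone

# 'time' value copied from the log event quoted above.
# Assumption: it is seconds since the Unix epoch.
event_time = 1247710787.4896951

# Restart time given above, on testgrid.allmydata.org's clock (UTC).
restart_utc = datetime(2009, 7, 16, 16, 17, 18, 835000, tzinfo=timezone.utc)

# Convert the event time to an explicit UTC timestamp.
event_utc = datetime.fromtimestamp(event_time, tz=timezone.utc)

# Same instant at a fixed UTC-7 offset (assumed: Pacific Daylight Time).
pdt = timezone(timedelta(hours=-7), "PDT")
event_pdt = event_utc.astimezone(pdt)

print("event (UTC):  ", event_utc.isoformat())
print("event (UTC-7):", event_pdt.isoformat())
print("event before restart:", event_utc < restart_utc)
```

 If the assumption holds, the UTC-7 rendering of this event should show the
 same wall-clock time as the log line itself, which would settle both the
 day and the timezone for that entry.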

 The triggering incident is "download failure", but scarier to me is that
 there was a 26 second reactor delay.

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/653#comment:10>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid