[tahoe-dev] [tahoe-lafs] #653: introducer client: connection count is wrong, !VersionedRemoteReference needs EQ
tahoe-lafs
trac at allmydata.org
Fri Jul 17 06:07:29 PDT 2009
#653: introducer client: connection count is wrong, !VersionedRemoteReference
needs EQ
--------------------------+-------------------------------------------------
 Reporter:  warner        |           Owner:  warner
     Type:  defect        |          Status:  assigned
 Priority:  major         |       Milestone:  1.5.0
Component:  code-network  |         Version:  1.3.0
 Keywords:                |   Launchpad_bug:
--------------------------+-------------------------------------------------
Comment(by zooko):
> I guess something may still be broken. We'll probably have to wait for
> it to become weird again and then look at the introducerclient's internal
> state.
:-( I guess I shouldn't have rebooted it then. Does that mean we should
go ahead and boot this issue out of the v1.5 milestone? I have an idea --
let's add an assertion in the code that the number of connected storage
servers <= number of known storage servers. Perhaps even a stronger, more
detailed assertion if that's feasible.
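
The proposed assertion could look roughly like the sketch below. This is only an illustration, not the introducer client's actual code: the function name `check_connection_invariant` and the idea of passing plain peer identifiers are assumptions. The weak check is the count comparison; the stronger checks are that no peer is counted twice and that every connected peer is also a known one.

```python
from collections import Counter


def check_connection_invariant(known_ids, connected_ids):
    """Hypothetical helper, not part of the Tahoe-LAFS code base.

    known_ids:     identifiers of the storage servers we have heard about
    connected_ids: one identifier per connection we currently count
    """
    known = set(known_ids)
    connected = list(connected_ids)

    # Weak form: we can never be connected to more servers than we know.
    assert len(connected) <= len(known), (
        "connected=%d > known=%d" % (len(connected), len(known)))

    # Stronger form 1: no server is counted more than once.
    counts = Counter(connected)
    duplicates = sorted(p for p, n in counts.items() if n > 1)
    assert not duplicates, "servers counted twice: %r" % (duplicates,)

    # Stronger form 2: every connected server is a known server.
    unknown = sorted(set(connected) - known)
    assert not unknown, "connected to unknown servers: %r" % (unknown,)

    return True
```

A duplicate entry, such as `check_connection_invariant({"a", "b"}, ["a", "a"])`, passes the weak count check but trips the duplicate check.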
Here's what evidence I can gather about the problem that was exhibited.
I upgraded and restarted testgrid.allmydata.org:3567 at 2009-07-16
16:17:18.835Z (on testgrid.allmydata.org's clock). There was nothing that
looked too unusual in the {{{twistd.log}}} that day. There are two
incidents reported, attached, from that day:
{{{incident-2009-07-16-002829-gc4xv5y.flog.bz2}}} and
{{{incident-2009-07-16-002846-pypgfay.flog.bz2}}}.
Here is a foolscap log-viewer web service showing each of those logfiles:
http://testgrid.allmydata.org:10000/ and http://testgrid.allmydata.org:10001 .
I have a hard time learning what I want to know from these logfiles. What
I want to know (at least at first) is mostly about temporal coincidence.
For starters, I'd like to be sure that these incidents occurred before I
rebooted and upgraded the server, not after. However, the timestamps,
such as "# 19:19:47.489 [23537270]: UNUSUAL excessive reactor delay
({'args': (25.734021902084351,), 'format': 'excessive reactor delay
(%ss)', 'incarnation': ('\xf5\x16\x1dl\xb2\xf5\x85\xf9', None), 'num':
23537270, 'time': 1247710787.4896951, 'level': 23}s)" don't tell me what
day it was nor what timezone the timestamp is in. Checking the status of
http://foolscap.lothar.com/trac/ticket/90 suggests that the timestamps
might be in PDT, which is UTC-7. If that's the case then ... No, some of
the (causally) earliest events in the log are from 18:01. Perhaps they
were from 2009-07-15? Argh, I give up. Please tell me how to understand
the timing of events in foolscap incident report files. I updated
http://foolscap.lothar.com/trac/ticket/90 to plead for fully-qualified,
unambiguous timestamps.
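
One way to get an unambiguous time out of such an entry is to convert its 'time' field directly. The sketch below assumes that the field is a Unix epoch value in seconds, which the log entry itself does not state, and it uses a fixed UTC-7 offset as a stand-in for the server's local time.

```python
from datetime import datetime, timedelta, timezone

# 'time' value copied from the log entry quoted above.
# Assumption: it is seconds since the Unix epoch.
event_time = 1247710787.4896951

# Fully qualified UTC timestamp.
utc = datetime.fromtimestamp(event_time, tz=timezone.utc)

# The same instant at a fixed UTC-7 offset (labelled PDT here for display).
pdt = utc.astimezone(timezone(timedelta(hours=-7), "PDT"))

print(utc.isoformat())
print(pdt.strftime("%Y-%m-%d %H:%M:%S %Z"))
```

Under that assumption the event falls at 02:19:47 UTC on 2009-07-16, which is 19:19:47 on 2009-07-15 at UTC-7. That matches the displayed "19:19:47.489" and would place the event before the restart at 2009-07-16 16:17:18.835Z.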
The triggering incident is "download failure", but scarier to me is that
there was a reactor delay of about 26 seconds (25.73 s in the entry above).
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/653#comment:10>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid