#1138 new defect

Timeout of Servermap Update

Reported by: eurekafag
Owned by: nobody
Priority: major
Milestone: soon
Component: code-network
Version: 1.7.1
Keywords: servermap update timeout mutable download availability hang foolscap
Cc:
Launchpad Bug:

Description (last modified by eurekafag)

I wrote to the mailing list but no one answered. I hope this is a better place for a report like this.

I have a problem with updating directories. Sometimes a server drops off the network for various reasons but still shows as connected on the welcome page. When I try to access any directory, my node starts a map update, which can be a very long operation during which all work with directories hangs. It takes from 10 to 14 minutes, which I think is unacceptable for such a system. Where can I set the timeout for this? I set:

timeout.keepalive = 60
timeout.disconnect = 300

But this doesn't help. One of the servers suddenly lost its internet connection, and this is what I got when accessing one of the directories (there were 4 servers and only 3 requests succeeded):

Started: 15:11:01 26-Jul-2010
Finished: 15:25:51 26-Jul-2010
Storage Index: o4wudooveaakzeqdhvspjsjgtm
Helper?: No
Progress: 100.0%
Status: Finished
Update Results

Timings:
Total: 14 minutes
Initial Queries: 4.5ms
Cumulative Verify: 1.0ms
Per-Server Response Times:
[nfyml6h3]: 3.4ms
[t726lxet]: 67ms
[35jflwuq]: 32ms

Almost 15 minutes! This becomes critical if you mount a directory via sshfs+sftpd: any process may get stuck when it lists the mounted directory, and the only options are to kill sshfs (you can't kill the process itself, since it's in the D+ state) or to wait (without knowing how long). Please point me to the right timeout option for such requests! 10 to 30 seconds would be very nice.

Change History (4)

comment:1 Changed at 2010-07-28T06:30:41Z by eurekafag

  • Description modified (diff)

comment:2 follow-up: ↓ 3 Changed at 2010-07-29T04:57:20Z by zooko

Thank you for the bug report. I think you are right that a ticket is a better way to get answers about this kind of thing than a mailing list message. (Although sending a mailing list message first is often a good way to start.)

Our long-term plan to fix this problem is to make it so that uploads don't wait for a long time to get a response from a specific server but instead fail-over to another server. That's #873 (upload: tolerate lost or unacceptably slow servers).

In the short term, I'm not sure why setting the foolscap timeouts did not cause the upload to complete (whether it succeeds or fails) more quickly. It is a mystery to me. Perhaps someone else could dig into the upload code and figure it out. One potentially productive way to do that would be to add more diagnostics to the status page, showing which requests to servers are currently outstanding and how long they have been outstanding.

comment:3 in reply to: ↑ 2 Changed at 2010-07-30T06:19:02Z by davidsarah

  • Component changed from unknown to code-network
  • Keywords mutable download availability hang foolscap added
  • Milestone changed from undecided to soon

Replying to zooko:

Our long-term plan to fix this problem is to make it so that uploads don't wait for a long time to get a response from a specific server but instead fail-over to another server. That's #873 (upload: tolerate lost or unacceptably slow servers).

From the description, the case at hand seems to be mutable download.

Will this be addressed by the New Downloader, or does that only handle immutable download?

comment:4 Changed at 2010-08-06T06:26:38Z by warner

The #798 new-downloader is only for immutable files, sorry.

The timeout.disconnect timer, due to the low-overhead way it is implemented, may take up to twice the value to finally sever a connection. So a value of 300 could take up to 10 minutes to disconnect the server connection. But it shouldn't have let the connection stay up for 14 minutes. Two ideas come to mind: the timeout.disconnect clause might have been in the wrong section (it should be in the [node] section), or there might have been other traffic on that connection that kept it alive (but not the response to the mutable read query). Neither seems likely. The only way I can imagine traffic keeping it alive is if the server were having weird out-of-memory or hardware errors and dropped one request while accepting others (we've seen things like this happen before, but it was on a server that had run out of memory).
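
The section check and the timing arithmetic above can be sketched in Python. This is only an illustration, not Tahoe-LAFS code: the sample tahoe.cfg text and the helper names (sections_with_option, worst_case_disconnect, elapsed_seconds) are made up for the example, and the timestamps are copied from the status output in the description. It reports which section defines timeout.disconnect and compares the observed update time with the "up to twice the value" bound.

```python
import configparser
from datetime import datetime

# Hypothetical tahoe.cfg contents; only the two timeout lines and the
# [node] section name come from this ticket.
SAMPLE_CFG = """\
[node]
timeout.keepalive = 60
timeout.disconnect = 300
"""


def sections_with_option(cfg_text, option):
    """Return the names of all sections in cfg_text that define `option`."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return [name for name in parser.sections() if parser.has_option(name, option)]


def worst_case_disconnect(timeout_seconds):
    """Upper bound described in this comment: up to twice the configured value."""
    return 2 * timeout_seconds


def elapsed_seconds(started, finished, fmt="%H:%M:%S %d-%b-%Y"):
    """Whole seconds between two timestamps in the status page's format."""
    delta = datetime.strptime(finished, fmt) - datetime.strptime(started, fmt)
    return int(delta.total_seconds())


# Check 1: is timeout.disconnect in the [node] section?
sections = sections_with_option(SAMPLE_CFG, "timeout.disconnect")
print("timeout.disconnect defined in sections:", sections)

# Check 2: does the observed hang exceed the worst-case disconnect time?
worst_case = worst_case_disconnect(300)
observed = elapsed_seconds("15:11:01 26-Jul-2010", "15:25:51 26-Jul-2010")
print("worst case:", worst_case, "s; observed:", observed, "s")
print("observed exceeds worst case:", observed > worst_case)
```

For the reported run the elapsed time is 890 seconds against a 600-second bound, so the timer alone does not account for the hang.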

It might help to collect some log information from your box after it does this. If you go to the front "Welcome" page, there's a button at the bottom that says "Report An Incident". Push that, and a few seconds later, a new "flogfile" will appear in your BASEDIR/logs/incidents/ directory. Upload and attach that file here: it will contain a record of the important events that occurred up to the moment you hit the button. We're looking for information about any messages sent to the lost server. If there's something weird like an out-of-memory condition, this might show up in the logs.
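
To find the file to attach, something like the following could be used. It is a rough sketch, not part of Tahoe-LAFS: the function name newest_incident_file is hypothetical, and the only thing taken from the instructions above is the BASEDIR/logs/incidents/ location. It makes no assumption about how the file is named and just returns the most recently modified file in that directory.

```python
from pathlib import Path


def newest_incident_file(basedir):
    """Return the most recently modified file in BASEDIR/logs/incidents/.

    Returns None if the directory is missing or contains no files.
    """
    incidents = Path(basedir) / "logs" / "incidents"
    if not incidents.is_dir():
        return None
    files = [p for p in incidents.iterdir() if p.is_file()]
    # max() with default=None handles the empty-directory case.
    return max(files, key=lambda p: p.stat().st_mtime, default=None)
```

Call it with the node's base directory a few seconds after pressing the button, for example newest_incident_file("/path/to/BASEDIR"), where the path is a placeholder for your own node directory.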
