Opened at 2010-07-28T06:29:57Z
Last modified at 2010-08-06T06:26:38Z
#1138 new defect
Timeout of Servermap Update
Reported by: | eurekafag | Owned by: | nobody |
---|---|---|---|
Priority: | major | Milestone: | soon |
Component: | code-network | Version: | 1.7.1 |
Keywords: | servermap update timeout mutable download availability hang foolscap | Cc: | |
Launchpad Bug: |
Description (last modified by eurekafag)
I wrote to the mailing list but no one answered. I hope this is a better place for such a report.
I have a problem with updating directories. Sometimes a server drops off the network for various reasons but still shows as connected on the welcome page. When I try to access any directory, my node starts a servermap update, which can be a very long operation during which all work with directories hangs. It takes from 10 to 14 minutes, which I think is unacceptable for such a system. Where can I set the timeout for this? I set:
timeout.keepalive = 60
timeout.disconnect = 300
But this doesn't help. One of the servers suddenly lost its internet connection, and this is what I got when accessing one of the directories (there were 4 servers and only 3 requests succeeded):
Started: 15:11:01 26-Jul-2010
Finished: 15:25:51 26-Jul-2010
Storage Index: o4wudooveaakzeqdhvspjsjgtm
Helper?: No
Progress: 100.0%
Status: Finished
Update Results
Timings:
  Total: 14 minutes
  Initial Queries: 4.5ms
  Cumulative Verify: 1.0ms
  Per-Server Response Times:
    [nfyml6h3]: 3.4ms
    [t726lxet]: 67ms
    [35jflwuq]: 32ms
Almost 15 minutes! This becomes critical if you mount a directory via sshfs+sftpd: any process that lists the mounted directory may get stuck, and the only options are to kill sshfs (you can't kill the process itself, since it's in D+ state) or to wait, not knowing how long. Please point me to the right timeout option for such requests! 10 to 30 seconds would be very nice.
Change History (4)
comment:1 Changed at 2010-07-28T06:30:41Z by eurekafag
- Description modified (diff)
comment:2 follow-up: ↓ 3 Changed at 2010-07-29T04:57:20Z by zooko
comment:3 in reply to: ↑ 2 Changed at 2010-07-30T06:19:02Z by davidsarah
- Component changed from unknown to code-network
- Keywords mutable download availability hang foolscap added
- Milestone changed from undecided to soon
Replying to zooko:
Our long-term plan to fix this problem is to make it so that uploads don't wait for a long time to get a response from a specific server but instead fail-over to another server. That's #873 (upload: tolerate lost or unacceptably slow servers).
From the description, the case at hand seems to be mutable download.
Will this be addressed by the New Downloader, or does that only handle immutable download?
comment:4 Changed at 2010-08-06T06:26:38Z by warner
The #798 new-downloader is only for immutable files, sorry.
The timeout.disconnect timer, due to the low-overhead way it is implemented, may take up to twice the configured value to finally sever a connection. So a value of 300 could take up to 10 minutes to disconnect the server connection. But it shouldn't have let the connection stay up for 14 minutes. Two ideas come to mind: the timeout.disconnect clause might have been in the wrong section (it should be in the [node] section), or there might have been other traffic on that connection that kept it alive (but not the response to the mutable read query). Neither seems likely. The only way I can imagine traffic keeping it alive is if the server were having weird out-of-memory or hardware errors and dropped one request while accepting others (we've seen things like this happen before, but it was on a server that had run out of memory).
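A rough way to check the first idea, and to see the "up to twice the value" arithmetic, is sketched below. This is not Tahoe-LAFS code: the function name `check_timeouts` and the two sample config strings are made up for illustration, and it only assumes that tahoe.cfg is an INI-style file that Python's standard `configparser` can read.

```python
import configparser

def check_timeouts(cfg_text):
    """Return (keepalive, disconnect, worst_case_disconnect) read from [node].

    All values are in seconds. A key that is missing from the [node]
    section (for example because it was put in another section) yields None.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    keepalive = parser.getint("node", "timeout.keepalive", fallback=None)
    disconnect = parser.getint("node", "timeout.disconnect", fallback=None)
    # Per the comment above, the timer may need up to twice the configured value.
    worst_case = None if disconnect is None else 2 * disconnect
    return keepalive, disconnect, worst_case

# Hypothetical config contents, for illustration only.
GOOD = "[node]\ntimeout.keepalive = 60\ntimeout.disconnect = 300\n"
MISPLACED = "[client]\ntimeout.keepalive = 60\ntimeout.disconnect = 300\n"

print(check_timeouts(GOOD))       # keys found in [node]
print(check_timeouts(MISPLACED))  # keys not in [node], so all None
```

With the reporter's values in the right section, the worst case comes out at 600 seconds (10 minutes), which still does not account for the 14 minutes observed.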
It might help to collect some log information from your box after it does this. If you go to the front "Welcome" page, there's a button at the bottom that says "Report An Incident". Push that, and a few seconds later, a new "flogfile" will appear in your BASEDIR/logs/incidents/ directory. Upload and attach that file here: it will contain a record of the important events that occurred up to the moment you hit the button. We're looking for information about any messages sent to the lost server. If there's something weird like an out-of-memory condition, this might show up in the logs.
Thank you for the bug report. I think you are right that a ticket is a better way to get answers about this kind of thing than a mailing list message. (Although sending a mailing list message first is often a good way to start.)
Our long-term plan to fix this problem is to make it so that uploads don't wait for a long time to get a response from a specific server but instead fail-over to another server. That's #873 (upload: tolerate lost or unacceptably slow servers).
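The fail-over idea can be sketched in a few lines. This is only an illustration of the general pattern, not the design in #873 and not Tahoe-LAFS code (which is built on Twisted and Foolscap rather than asyncio); `query_with_failover`, its parameters, and the `query` callable are hypothetical names.

```python
import asyncio

async def query_with_failover(servers, query, per_server_timeout):
    """Try each server in turn, skipping any that does not answer in time.

    `query` is a hypothetical coroutine function taking a server and
    returning its response. Returns (server, response) for the first
    server that answers within `per_server_timeout` seconds.
    """
    for server in servers:
        try:
            response = await asyncio.wait_for(query(server), per_server_timeout)
            return server, response
        except asyncio.TimeoutError:
            continue  # unresponsive server: fail over to the next one
    raise RuntimeError("no server responded in time")
```

The point of the pattern is that one lost server costs at most `per_server_timeout` seconds instead of stalling the whole operation until the connection is finally dropped.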
In the short term, I'm not sure why setting the foolscap timeouts did not cause the upload to complete (whether it succeeded or failed) more quickly. It is a mystery to me. Perhaps someone else could dig into the upload code and figure it out. One potentially productive way to do that would be to add more diagnostics to the status page, showing which requests to servers are currently outstanding and how long each has been pending.
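The kind of bookkeeping such a diagnostic would need might look roughly like the following. This is a sketch under assumptions, not the existing status-page code: the class `OutstandingRequests` and its method names are invented for illustration.

```python
import time

class OutstandingRequests:
    """Track in-flight requests per server and how long each has been pending."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._started = {}  # (server_id, request_id) -> start time in seconds

    def sent(self, server_id, request_id):
        """Record that a request was just sent to a server."""
        self._started[(server_id, request_id)] = self._clock()

    def answered(self, server_id, request_id):
        """Forget a request once its response (or error) has arrived."""
        self._started.pop((server_id, request_id), None)

    def report(self):
        """Return [(server_id, request_id, seconds_pending)], longest first."""
        now = self._clock()
        rows = [(s, r, now - t0) for (s, r), t0 in self._started.items()]
        return sorted(rows, key=lambda row: row[2], reverse=True)
```

A status page could render `report()` as a table, so a query to a lost server would stand out as the one entry whose pending time keeps growing.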