[tahoe-dev] [tahoe-lafs] #928: start downloading as soon as you know where to get K shares

Wed Jan 27 12:53:55 PST 2010

#928: start downloading as soon as you know where to get K shares
-----------------------------------------------+----------------------------
 Reporter:  zooko                              |           Owner:  zooko
     Type:  defect                             |          Status:  new  
 Priority:  major                              |       Milestone:  1.6.0
Component:  code-peerselection                 |         Version:  1.5.0
 Keywords:  download availability performance  |   Launchpad_bug:       
-----------------------------------------------+----------------------------

Comment(by zooko):

 I figured your Downloader rewrite would.  Also, probably lots of other
 improvements such as more pipelining.  Is there a list of your intended
 improvements to Downloader?  I have vague recollections of lots of good
 ideas you had for Downloader improvements.

 I still think, however, that it is a good idea to apply this patch now
 (after thorough tests and code review), because:

 1.  Now I understand that the bad behavior I've been seeing (especially on
 the allmydata.com prod grid) in which downloads hang is ''not'' caused
 solely by a server failing ''during'' a download, as I formerly thought
 (#287), but is caused by there being any server connected to the network
 which is in the hung state (such that it maintains its TCP connections but
 refuses to answer {{{get_buckets()}}}).  With current trunk, as long as
 there is any such server connected to a grid then all downloads from that
 grid will hang.
 2.  Likewise, with current trunk, the slowest server (even if it isn't
 completely hung) determines the alacrity of beginning an immutable file
 download.  This explains the behavior that I've observed in which all
 downloads take a few seconds to start (because there is one server on that
 grid which is slow or overloaded).
 3.  With this patch, you'll download from the K servers that answered
 {{{get_buckets()}}} first (assuming only one share per server) instead of
 the K servers that have primary shares (or, in the case that you don't get
 K servers with primary shares, random servers with secondary shares).
 This sounds like a potentially nice performance improvement, especially
 for heterogeneous and geographically spread-out grids.
 4.  This patch is nicely self-contained, as I hope you (Brian) will take
 the time to verify by reviewing it.  It could be made ''more'' self-
 contained by changing it to callback instead of errback when K buckets
 couldn't be located (as described in comment:4), and I should probably do
 so out of an abundance of caution, but I intend to first examine why the
 errback doesn't do what I expect.  I guess it could also be made smaller
 by taking out the part that changes reporting of status from "responses
 received/queries sent" to "responses received+queries failed/queries
 sent".  I changed that only because it seemed slightly inaccurate to omit
 the queries failed in the reporting, but it isn't really necessary for
 this patch.
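
 To make points 1 through 3 concrete, here is a minimal sketch of the
 "proceed as soon as K shares are located" idea.  It is not the patch
 itself: it uses asyncio rather than Tahoe's Twisted-based code, and
 query_server, locate_k_shares, status_line, the server names, and the
 delays are all hypothetical stand-ins invented for illustration.  It
 assumes each query simply returns the share numbers a server holds.

```python
import asyncio


async def query_server(name, delay, shares):
    """Hypothetical stand-in for a get_buckets() query.

    delay=None models a hung server: the connection stays up but the
    query is never answered.
    """
    if delay is None:
        await asyncio.Event().wait()  # never set, so this never returns
    await asyncio.sleep(delay)
    return name, shares


def status_line(responded, failed, sent):
    # Count failed queries as resolved, i.e.
    # "responses received+queries failed/queries sent".
    return f"{responded + failed}/{sent}"


async def locate_k_shares(servers, k, timeout=None):
    """Return as soon as k distinct shares have been located.

    servers is a list of (name, delay, shares) tuples (illustrative only).
    Slow or hung servers are not waited for once k shares are known.
    """
    tasks = [asyncio.ensure_future(query_server(*s)) for s in servers]
    found = {}  # share number -> first server that reported it
    responded = failed = 0
    try:
        for next_done in asyncio.as_completed(tasks, timeout=timeout):
            try:
                name, shares = await next_done
                responded += 1
            except Exception:
                # A query error or a timeout counts as a failed query.
                failed += 1
                continue
            for sharenum in shares:
                found.setdefault(sharenum, name)
            if len(found) >= k:
                return found, (responded, failed, len(tasks))
        raise LookupError(f"located only {len(found)} of {k} shares")
    finally:
        for task in tasks:
            task.cancel()  # stop waiting on stragglers and hung servers
```

 With one hung server among four and K=3, this sketch returns once the
 three responsive servers have answered, rather than waiting on the hung
 one.  The timeout argument only matters when fewer than K shares can be
 located, in which case the sketch raises instead of hanging.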

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/928#comment:6>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid