[tahoe-dev] [tahoe-lafs] #1170: new-downloader performs badly when downloading a lot of data from a file
tahoe-lafs
trac at tahoe-lafs.org
Mon Aug 16 18:14:04 UTC 2010
#1170: new-downloader performs badly when downloading a lot of data from a file
------------------------------+---------------------------------------------
Reporter: zooko | Owner:
Type: defect | Status: new
Priority: critical | Milestone: 1.8.0
Component: code-network | Version: 1.8β
Resolution: | Keywords: immutable download performance regression
Launchpad Bug: |
------------------------------+---------------------------------------------
Comment (by warner):
Responses to earlier comments:
* spans.py definitely looks O()-suspicious. Does
  attachment:prof-run-110-dump.txt suggest that in a 100MB file download,
  we spent half of the total time in {{{Spans.__init__}}}?
* yes, the comment about {{{_wanted, _requested, _received}}} is stale.
* the failure to remove data from {{{_pending}}} upon errback is a bug
* {{{LayoutInvalid}}} is better than {{{assert}}}, yeah
* the {{{self._received.add(start,data)}}} is correct: {{{_received}}} is a
  {{{DataSpans}}} instance, not {{{Spans}}}, and it holds strings, not
  booleans. {{{_received}}} holds the data that comes back from the server
  until the "satisfy" code consumes it. It has methods like {{{get}}} and
  {{{pop}}}, whereas the simpler {{{Spans}}} class merely has methods for
  is-range-in-span. (A toy sketch of the two interfaces follows this list.)
* that said, I don't understand how assertions added to spans.py#L47 would
  ever fire. If those same assertions were added to spans.py#L295, I'd get
  it. What types were start/length in your observed assertion failures?
  And what was the stack trace?
* The patch to call {{{self._received.add(start, length)}}} is wrong; it
must be called with (int,str).
* all the comments about efficiency improvements in Spans are probably
correct
* adding data to {{{_unavailable}}} should be benign: the amount of
  unavailable data is small and constant (if the share is intact, we should
  only add to {{{_unavailable}}} during the first few reads, if we've
  guessed the segsize wrong).
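To make the {{{Spans}}}-vs-{{{DataSpans}}} distinction above concrete, here
is a toy sketch (Python, not the real spans.py classes, just the shape of
the two interfaces): {{{Spans}}} only answers "is this range present?",
while {{{DataSpans}}} maps ranges to the actual bytes, which is why its
{{{add}}} takes (int, str) and why the satisfy code can later {{{get}}} or
{{{pop}}} the data.

{{{
# Toy illustration only -- not the real allmydata spans.py classes.

class ToySpans:
    """Records which (start, length) ranges are present, nothing more."""
    def __init__(self):
        self._ranges = []                  # list of (start, end), unmerged
    def add(self, start, length):
        self._ranges.append((start, start + length))
    def contains(self, start, length):
        end = start + length
        return any(s <= start and end <= e for (s, e) in self._ranges)

class ToyDataSpans:
    """Records ranges *and* their bytes, so data can be consumed later."""
    def __init__(self):
        self._chunks = {}                  # start offset -> string of bytes
    def add(self, start, data):            # note: (int, str), not (int, int)
        self._chunks[start] = data
    def get(self, start, length):
        data = self._chunks.get(start)
        if data is not None and len(data) >= length:
            return data[:length]
        return None                        # toy behavior: exact-start lookups only
    def pop(self, start, length):
        data = self.get(start, length)
        if data is not None:
            del self._chunks[start]
        return data
}}}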
Now some new ideas. I've found a couple of likely issues.
* looking at timestamps from flog-run-106, the inter-segment timing
  definitely '''is''' growing over the course of the download. It's noisy,
  but it goes from about 0.8s at the start (seg0) to about 1.5s-2.0s at the
  end (seg762). I haven't looked at smaller deltas (i.e. only inside the
  "desire" code) to rule out network variations, but it certainly points to
  a leak or complexity increase of some sort that gets worse as the
  download progresses.
* comparing downloads of the first 100MB, the middle 100MB, and the last
  100MB would rule out anything that's influenced by the absolute segment
  number.
* looking at the {{{Spans.dump()}}} strings in the flog, I see that two of
  the three shares (sh7+sh8) have an ever-growing {{{.received}}}
  {{{DataSpans}}} structure. A cursory check suggests they are growing by
  64 bytes and one range per segment. By the end of the download (seg762),
  sh7 is holding 37170 bytes in 378 ranges (whereas sh3 only has 1636 bytes
  in 22 ranges, and remains mostly constant).
* This means that we're asking for data which we then don't end up using.
  We keep it around in {{{_received}}} because it might be useful later:
  maybe we ask for the wrong data because our guess of the segsize (and
  thus numsegs, and thus the size/placement of the hashtrees) was wrong.
  But later we might take advantage of whatever we fetched by mistake.
* I have two theories:
  * the {{{IncompleteHashTree.needed_hashes()}}} call, when asked what
    hashes we need to validate leaf 0, might tell us we need the hash for
    leaf 0 too. However, that hash can be computed from the block of data
    that makes up the leaf, so we don't really need to fetch it (whereas we
    *do* need the hash for leaf 1, since it sits on the "uncle chain" for
    leaf0). If the desire-side code is conservatively/incorrectly asking
    for the leaf0 hash, but the satisfy-side code doesn't use it, then
    we'll add a single 32-byte hash node per segment. (The sketch after
    this list shows which hashes a leaf actually needs.)
  * ciphertext hash tree nodes: when we start working on a segment, the
    desire-side code will ask for ciphertext hash tree nodes from each
    share we're using. However, the satisfy-side code will only use the
    hashes from the first response: by the time the second response
    arrives, the ciphertext hash tree is satisfied, so that clause isn't
    reached. This means that we'll leave that data in {{{._received}}}
    forever. This seems most likely: it would explain why the first share
    (sh3) doesn't grow, whereas the later two shares do, and why I saw a
    64-byte increment (the actual growth would depend upon the segment
    number, and how many new uncle-chain nodes are needed, but 2 nodes is
    a fairly common value).
* The {{{.received}}} leftover-data issue shouldn't be such a big deal:
  N=378 is not a huge number, but the measured increase in inter-segment
  time suggests that whatever the O() complexity is, N=378 is enough to
  cause problems.
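To illustrate the first theory, here is a small self-contained sketch (not
{{{IncompleteHashTree}}} itself; the indexing is simplified to a complete
binary tree with the root assumed already known) of which hash-tree nodes
are actually required to validate leaf 0: the sibling (leaf 1) plus its
uncles, but never leaf 0's own hash, since that one is recomputed from the
downloaded block.

{{{
# Toy sketch: which hashes does checking one leaf actually require?
# The tree is stored breadth-first in an array, root at index 0, children
# of node i at 2i+1 and 2i+2, leaves in the last row.

def uncle_chain(leafnum, num_leaves):
    assert (num_leaves & (num_leaves - 1)) == 0, "power of two, for simplicity"
    i = num_leaves - 1 + leafnum       # array index of the leaf node
    needed = []
    while i > 0:
        sibling = i + 1 if i % 2 == 1 else i - 1
        needed.append(sibling)         # the sibling/uncle at this level
        i = (i - 1) // 2               # climb to the parent
    return needed

# With 8 leaves (array indices 7..14), checking leaf 0 (node 7) needs
# nodes [8, 4, 2]: leaf 1's hash plus two uncles. Node 7 itself is not
# in the list, because we hash the freshly fetched block to get it.
print(uncle_chain(0, 8))               # -> [8, 4, 2]
}}}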
So I think the next directions to pursue are:
* build some kind of test framework to exercise a large download without
  using real remote_read calls, ideally 100MB or 1GB in a few seconds. This
  would use a {{{Share}}} subclass that returns data immediately (well,
  after an eventual-send) rather than ever touching a server. It might also
  need to stub out some of the hashtree checks, but does need real
  {{{needed_hashes}}} computations. Then we fix the code until this test
  finishes in a reasonable amount of time. While I wouldn't have the test
  case assert anything about runtime, I would have it assert things like
  {{{._received}}} doesn't grow over the course of the test.
* measure the CPU seconds needed to download a 100MB file with both
  old-downloader and new-downloader: if we're looking at an O(n^3^)
  problem, it will manifest as a much heavier CPU load. (If we were merely
  looking at a pipelining failure, the CPU time would be the same, but
  wallclock time would be higher.) A rough way to separate the two is
  sketched after this list.
* stare at {{{Spans}}} (and specifically {{{DataSpans}}}) for
  computational-complexity problems. Build some tests of these with
  N=400ish and see how efficient they are (a micro-benchmark sketch also
  follows this list). They're supposed to be linear wrt number-of-ranges,
  but either they aren't, or they're being called in a way which makes it
  worse.
* consider commenting out the dump() calls for a test, or some
  {{{assert_invariants}}} calls, to see if we're hitting that old problem
  where the data structure is efficient unless we leave in the self-checks
  or debugging messages
* (hard) figure out how to drop the unused ciphertext-hash-tree nodes from
  the second and later shares
* look at {{{IncompleteHashTree.needed_hashes}}} and see if we're actually
  requesting the leaf node that we don't really need.
* consider an entirely different {{{DataSpans}}} structure. The
  perhaps-too-clever overlap/merge behavior is mostly just exercised during
  the fetch of the first segment, before we're sure about the correct
  number of segments (we fetch some data speculatively to reduce
  roundtrips; if we guess wrong, we'll get the wrong data, but
  {{{DataSpans}}} lets us easily use that data later if it turns out to be
  what we needed for some other purpose). Perhaps a data structure which
  was less tuned for merging adjacent ranges would be better, maybe one
  which has an explicit {{{merge()}}} method that's only called just before
  the requests are sent out (a rough sketch of that shape appears after
  this list). Or maybe the value of holding on to that data isn't enough to
  justify the complexity.
* a related thought (mostly to remind me of the idea later): for pipelining
  purposes, I'd like to be able to label the bits in a {{{Spans}}} with
  their purpose: if we send parallel requests for both seg2 and seg3, I'd
  like the seg2 data to arrive first, so e.g. the hashes needed to validate
  seg2 should arrive before the bulk block data for seg3. A label on the
  bits like "this is for seg2" would let us order the requests in such a
  way as to reduce our memory footprint. A label like this might also be
  useful for handling the unused-ciphertext-hash-tree-nodes problem, if we
  could remove data from a {{{DataSpans}}} that's labelled with an
  already-complete segnum.
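As a rough illustration of the CPU-seconds measurement above: compare
process CPU time against wallclock time around the same download. The
{{{download_100MB}}} callable here is a stand-in for whatever actually
drives the client (e.g. an HTTP GET against the gateway), not an existing
function.

{{{
# Separate "CPU-bound complexity blowup" from "pipelining stall":
# heavy CPU growth points at an O()-problem; flat CPU with long wallclock
# points at pipelining.

import time, resource

def measure(download_100MB):
    cpu0 = resource.getrusage(resource.RUSAGE_SELF).ru_utime
    wall0 = time.time()
    download_100MB()
    cpu1 = resource.getrusage(resource.RUSAGE_SELF).ru_utime
    wall1 = time.time()
    print("CPU seconds: %.1f   wallclock seconds: %.1f"
          % (cpu1 - cpu0, wall1 - wall0))
}}}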
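And a micro-benchmark sketch for the N=400ish {{{DataSpans}}} test. This
assumes spans.py is importable as {{{allmydata.util.spans}}}; it just adds
one small, non-adjacent range per simulated segment (mimicking the leftover
ciphertext-hash-tree nodes) and reports whether each batch of 100 segments
stays flat or gets slower as ranges accumulate.

{{{
import time
from allmydata.util.spans import DataSpans   # assumed import path

def bench(num_segments=800, chunk=32):
    ds = DataSpans()
    t0 = time.time()
    for seg in range(num_segments):
        # leave a gap between ranges so nothing merges
        ds.add(seg * 1000, "x" * chunk)
        ds.get(seg * 1000, chunk)
        if seg % 100 == 99:
            t1 = time.time()
            print("after %d ranges: %.3fs for the last 100 segments"
                  % (seg + 1, t1 - t0))
            t0 = t1

if __name__ == "__main__":
    bench()
}}}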
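Lastly, a toy sketch of the explicit-{{{merge()}}} shape mentioned above.
The {{{PendingChunks}}} name and methods are purely illustrative, not
existing code: adds stay cheap and unmerged, and coalescing happens once,
just before a batch of requests goes out.

{{{
class PendingChunks:
    """Toy alternative: no merging on add(), explicit merge() on demand."""
    def __init__(self):
        self._chunks = []                   # list of (start, data), unordered
    def add(self, start, data):
        self._chunks.append((start, data))  # O(1) per add, no merging
    def merge(self):
        """Coalesce adjacent/overlapping chunks; call once per request batch."""
        merged = []
        for start, data in sorted(self._chunks):
            if merged:
                prev_start, prev_data = merged[-1]
                prev_end = prev_start + len(prev_data)
                if start <= prev_end:
                    # overlaps or touches the previous chunk: keep whichever
                    # copy reaches further
                    tail = data[max(0, prev_end - start):]
                    merged[-1] = (prev_start, prev_data + tail)
                    continue
            merged.append((start, data))
        self._chunks = merged
        return merged
}}}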
Finally, the bug zooko mentioned in comment:42 is real. I'm still working
on it, but basically it prevents us from using shares that arrive after the
initial batch of requests: they are not initialized properly and don't get
a correct block hash tree. I'm working on a fix. The symptom is that we
fall back to the initial shares, but if those have died, the download will
fail, which is wrong.
And I'm still working on the new share-selection algorithm. The code works,
and my basic unit tests work, but certain ones require the comment:42 bug
to be fixed before it is safe to use (the bug will hurt current downloads,
but occurs less frequently).
--
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1170#comment:43>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage