[tahoe-dev] [tahoe-lafs] #1170: new-downloader performs badly when downloading a lot of data from a file
tahoe-lafs
trac at tahoe-lafs.org
Mon Aug 16 18:14:04 UTC 2010
#1170: new-downloader performs badly when downloading a lot of data from a file
------------------------------+---------------------------------------------
Reporter: zooko | Owner:
Type: defect | Status: new
Priority: critical | Milestone: 1.8.0
Component: code-network | Version: 1.8β
Resolution: | Keywords: immutable download performance regression
Launchpad Bug: |
------------------------------+---------------------------------------------
Comment (by warner):
Responses to earlier comments:
* spans.py definitely looks O()-suspicious. Does
  attachment:prof-run-110-dump.txt suggest that in a 100MB file download,
  we spent half of the total time in {{{Spans.__init__}}}?
* yes, the comment about {{{_wanted, _requested, _received}}} is stale.
* the failure to remove data from {{{_pending}}} upon errback is a bug
* {{{LayoutInvalid}}} is better than {{{assert}}}, yeah
* the {{{self._received.add(start,data)}}} is correct: {{{_received}}} is a
  {{{DataSpans}}} instance, not {{{Spans}}}, and it holds strings, not
  booleans. {{{_received}}} holds the data that comes back from the server
  until the "satisfy" code consumes it. It has methods like {{{get}}} and
  {{{pop}}}, whereas the simpler {{{Spans}}} class merely has methods for
  is-range-in-span. (A toy sketch of the two interfaces follows this list.)
* that said, I don't understand how assertions added to spans.py#L47 would
  ever fire. If those same assertions were added to spans.py#L295, I'd get
  it. What types were start/length in your observed assertion failures?
  And what was the stack trace?
* The patch to call {{{self._received.add(start, length)}}} is wrong; it
must be called with (int,str).
* all the comments about efficiency improvements in Spans are probably
correct
* adding data to {{{_unavailable}}} should be benign: the amount of
  unavailable data is small and constant (if the share is intact, we should
  only add to {{{_unavailable}}} during the first few reads, if we've
  guessed the segsize wrong).
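To make the {{{Spans}}}-vs-{{{DataSpans}}} distinction above concrete, here
is a toy sketch (Python, not the real spans.py classes, just the shape of
the two interfaces): {{{Spans}}} only answers "is this range present?",
while {{{DataSpans}}} maps ranges to the actual bytes, which is why its
{{{add}}} takes (int, str) and why the satisfy code can later {{{get}}} or
{{{pop}}} the data.

{{{
# Toy illustration only -- not the real allmydata spans.py classes.

class ToySpans:
    """Records which (start, length) ranges are present, nothing more."""
    def __init__(self):
        self._ranges = []                  # list of (start, end), unmerged
    def add(self, start, length):
        self._ranges.append((start, start + length))
    def contains(self, start, length):
        end = start + length
        return any(s <= start and end <= e for (s, e) in self._ranges)

class ToyDataSpans:
    """Records ranges *and* their bytes, so data can be consumed later."""
    def __init__(self):
        self._chunks = {}                  # start offset -> string of bytes
    def add(self, start, data):            # note: (int, str), not (int, int)
        self._chunks[start] = data
    def get(self, start, length):
        data = self._chunks.get(start)
        if data is not None and len(data) >= length:
            return data[:length]
        return None                        # toy behavior: exact-start lookups only
    def pop(self, start, length):
        data = self.get(start, length)
        if data is not None:
            del self._chunks[start]
        return data
}}}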
Now some new ideas. I've found a couple of likely issues.
* looking at timestamps from flog-run-106, the inter-segment timing
  definitely '''is''' growing over the course of the download. It's noisy,
  but it goes from about 0.8s at the start (seg0) to about 1.5s-2.0s at the
  end (seg762). I haven't looked at smaller deltas (i.e. only inside the
  "desire" code) to rule out network variations, but it certainly points to
  a leak or complexity increase of some sort that gets worse as the
  download progresses.
* comparing downloads of the first 100MB, the middle 100MB, and the last
  100MB would rule out anything that's influenced by the absolute segment
  number.
* looking at the {{{Spans.dump()}}} strings in the flog, I see that two of
  the three shares (sh7+sh8) have an ever-growing {{{.received}}}
  {{{DataSpans}}} structure. A cursory check suggests they are growing by
  64 bytes and one range per segment. By the end of the download (seg762),
  sh7 is holding 37170 bytes in 378 ranges (whereas sh3 only has 1636 bytes
  in 22 ranges, and remains mostly constant).
* This means that we're asking for data which we then don't end up using.
  We keep it around in {{{_received}}} because it might be useful later:
  maybe we ask for the wrong data because our guess of the segsize (and
  thus numsegs, and thus the size/placement of the hashtrees) was wrong.
  But later we might take advantage of whatever we fetched by mistake.
* I have two theories:
  * the {{{IncompleteHashTree.needed_hashes()}}} call, when asked what
    hashes we need to validate leaf 0, might tell us we need the hash for
    leaf 0 too. However, that hash can be computed from the block of data
    that makes up the leaf, so we don't really need to fetch it (whereas we
    *do* need the hash for leaf 1, since it sits on the "uncle chain" for
    leaf0). If the desire-side code is conservatively/incorrectly asking
    for the leaf0 hash, but the satisfy-side code doesn't use it, then
    we'll add a single 32-byte hash node per segment. (The sketch after
    this list shows which hashes a leaf actually needs.)
  * ciphertext hash tree nodes: when we start working on a segment, the
    desire-side code will ask for ciphertext hash tree nodes from each
    share we're using. However, the satisfy-side code will only use the
    hashes from the first response: by the time the second response
    arrives, the ciphertext hash tree is satisfied, so that clause isn't
    reached. This means that we'll leave that data in {{{._received}}}
    forever. This seems most likely: it would explain why the first share
    (sh3) doesn't grow, whereas the later two shares do, and why I saw a
    64-byte increment (the actual growth would depend upon the segment
    number, and how many new uncle-chain nodes are needed, but 2 nodes is
    a fairly common value).
* The {{{.received}}} leftover-data issue shouldn't be such a big deal:
  N=378 is not a huge number, but the measured increase in inter-segment
  time suggests that whatever the O() complexity is, N=378 is enough to
  cause problems.
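To illustrate the first theory, here is a small self-contained sketch (not
{{{IncompleteHashTree}}} itself; the indexing is simplified to a complete
binary tree with the root assumed already known) of which hash-tree nodes
are actually required to validate leaf 0: the sibling (leaf 1) plus its
uncles, but never leaf 0's own hash, since that one is recomputed from the
downloaded block.

{{{
# Toy sketch: which hashes does checking one leaf actually require?
# The tree is stored breadth-first in an array, root at index 0, children
# of node i at 2i+1 and 2i+2, leaves in the last row.

def uncle_chain(leafnum, num_leaves):
    assert (num_leaves & (num_leaves - 1)) == 0, "power of two, for simplicity"
    i = num_leaves - 1 + leafnum       # array index of the leaf node
    needed = []
    while i > 0:
        sibling = i + 1 if i % 2 == 1 else i - 1
        needed.append(sibling)         # the sibling/uncle at this level
        i = (i - 1) // 2               # climb to the parent
    return needed

# With 8 leaves (array indices 7..14), checking leaf 0 (node 7) needs
# nodes [8, 4, 2]: leaf 1's hash plus two uncles. Node 7 itself is not
# in the list, because we hash the freshly fetched block to get it.
print(uncle_chain(0, 8))               # -> [8, 4, 2]
}}}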
So I think the next directions to pursue are:
* build some kind of test framework to exercise a large download without
  using real remote_read calls, ideally 100MB or 1GB in a few seconds. This
  would use a {{{Share}}} subclass that returns data immediately (well,
  after an eventual-send) rather than ever touching a server. It might also
  need to stub out some of the hashtree checks, but does need real
  {{{needed_hashes}}} computations. Then we fix the code until this test
  finishes in a reasonable amount of time. While I wouldn't have the test
  case assert anything about runtime, I would have it assert things like
  {{{._received}}} doesn't grow over the course of the test.
* measure the CPU seconds needed to download a 100MB file with both
  old-downloader and new-downloader: if we're looking at an O(n^3^)
  problem, it will manifest as a much heavier CPU load. (If we were merely
  looking at a pipelining failure, the CPU time would be the same, but
  wallclock time would be higher.) A rough way to separate the two is
  sketched after this list.
* stare at {{{Spans}}} (and specifically {{{DataSpans}}}) for
  computational-complexity problems. Build some tests of these with
  N=400ish and see how efficient they are (a micro-benchmark sketch also
  follows this list). They're supposed to be linear wrt number-of-ranges,
  but either they aren't, or they're being called in a way which makes it
  worse.
* consider commenting out the dump() calls for a test, or some
  {{{assert_invariants}}} calls, to see if we're hitting that old problem
  where the data structure is efficient unless we leave in the self-checks
  or debugging messages
* (hard) figure out how to drop the unused ciphertext-hash-tree nodes from
  the second and later shares
* look at {{{IncompleteHashTree.needed_hashes}}} and see if we're actually
  requesting the leaf node that we don't really need.
* consider an entirely different {{{DataSpans}}} structure. The
  perhaps-too-clever overlap/merge behavior is mostly just exercised during
  the fetch of the first segment, before we're sure about the correct
  number of segments (we fetch some data speculatively to reduce
  roundtrips; if we guess wrong, we'll get the wrong data, but
  {{{DataSpans}}} lets us easily use that data later if it turns out to be
  what we needed for some other purpose). Perhaps a data structure which
  was less tuned for merging adjacent ranges would be better, maybe one
  which has an explicit {{{merge()}}} method that's only called just before
  the requests are sent out (a rough sketch of that shape appears after
  this list). Or maybe the value of holding on to that data isn't enough to
  justify the complexity.
* a related thought (mostly to remind me of the idea later): for pipelining
  purposes, I'd like to be able to label the bits in a {{{Spans}}} with
  their purpose: if we send parallel requests for both seg2 and seg3, I'd
  like the seg2 data to arrive first, so e.g. the hashes needed to validate
  seg2 should arrive before the bulk block data for seg3. A label on the
  bits like "this is for seg2" would let us order the requests in such a
  way as to reduce our memory footprint. A label like this might also be
  useful for handling the unused-ciphertext-hash-tree-nodes problem, if we
  could remove data from a {{{DataSpans}}} that's labelled with an
  already-complete segnum.
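As a rough illustration of the CPU-seconds measurement above: compare
process CPU time against wallclock time around the same download. The
{{{download_100MB}}} callable here is a stand-in for whatever actually
drives the client (e.g. an HTTP GET against the gateway), not an existing
function.

{{{
# Separate "CPU-bound complexity blowup" from "pipelining stall":
# heavy CPU growth points at an O()-problem; flat CPU with long wallclock
# points at pipelining.

import time, resource

def measure(download_100MB):
    cpu0 = resource.getrusage(resource.RUSAGE_SELF).ru_utime
    wall0 = time.time()
    download_100MB()
    cpu1 = resource.getrusage(resource.RUSAGE_SELF).ru_utime
    wall1 = time.time()
    print("CPU seconds: %.1f   wallclock seconds: %.1f"
          % (cpu1 - cpu0, wall1 - wall0))
}}}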
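And a micro-benchmark sketch for the N=400ish {{{DataSpans}}} test. This
assumes spans.py is importable as {{{allmydata.util.spans}}}; it just adds
one small, non-adjacent range per simulated segment (mimicking the leftover
ciphertext-hash-tree nodes) and reports whether each batch of 100 segments
stays flat or gets slower as ranges accumulate.

{{{
import time
from allmydata.util.spans import DataSpans   # assumed import path

def bench(num_segments=800, chunk=32):
    ds = DataSpans()
    t0 = time.time()
    for seg in range(num_segments):
        # leave a gap between ranges so nothing merges
        ds.add(seg * 1000, "x" * chunk)
        ds.get(seg * 1000, chunk)
        if seg % 100 == 99:
            t1 = time.time()
            print("after %d ranges: %.3fs for the last 100 segments"
                  % (seg + 1, t1 - t0))
            t0 = t1

if __name__ == "__main__":
    bench()
}}}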
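Lastly, a toy sketch of the explicit-{{{merge()}}} shape mentioned above.
The {{{PendingChunks}}} name and methods are purely illustrative, not
existing code: adds stay cheap and unmerged, and coalescing happens once,
just before a batch of requests goes out.

{{{
class PendingChunks:
    """Toy alternative: no merging on add(), explicit merge() on demand."""
    def __init__(self):
        self._chunks = []                   # list of (start, data), unordered
    def add(self, start, data):
        self._chunks.append((start, data))  # O(1) per add, no merging
    def merge(self):
        """Coalesce adjacent/overlapping chunks; call once per request batch."""
        merged = []
        for start, data in sorted(self._chunks):
            if merged:
                prev_start, prev_data = merged[-1]
                prev_end = prev_start + len(prev_data)
                if start <= prev_end:
                    # overlaps or touches the previous chunk: keep whichever
                    # copy reaches further
                    tail = data[max(0, prev_end - start):]
                    merged[-1] = (prev_start, prev_data + tail)
                    continue
            merged.append((start, data))
        self._chunks = merged
        return merged
}}}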
Finally, the bug zooko mentioned in comment:42 is real. I'm still working
on it, but basically it prevents us from using shares that arrive after the
initial batch of requests: they are not initialized properly and don't get
a correct block hash tree. I'm working on a fix. The symptom is that we
fall back to the initial shares, but if those have died, the download will
fail, which is wrong.
And I'm still working on the new share-selection algorithm. The code works,
and my basic unit tests work, but certain ones require the comment:42 bug
to be fixed before it is safe to use (the bug will hurt current downloads,
but occurs less frequently).
--
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1170#comment:43>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage