[tahoe-dev] Observations on Tahoe performance

Andrej Falout andrej at falout.org
Mon Aug 24 22:48:31 PDT 2009


Hello,

I was very happy to see this elephant in the room finally discussed.

> Its inability to saturate a good upstream pipe is probably the biggest
> reason that I sometimes hesitate to recommend it as a serious backup
> solution to my friends.

This is exactly what I was forced to admit, except that in my case the
recommendations go to my consulting clients. I am still here because I
see enormous potential in Tahoe, but for that potential to be realized,
the project will need more users in real-world scenarios.

And for that to happen, this performance issue should be given the
absolute highest priority.

Thanks,
Andrej Falout



On Thu, Aug 20, 2009 at 6:20 AM, Brian Warner <warner at lothar.com> wrote:

> Shawn Willden wrote:
> > On Wednesday 12 August 2009 08:02:13 am Francois Deppierraz wrote:
> >> Hi Shawn,
> >>
> >> Shawn Willden wrote:
> >>> The first observation is that when uploading small files it is very
> >>> hard to keep the upstream pipe full, regardless of my configuration.
> >>> My upstream connection is only about 400 kbps, but I can't sustain
> >>> any more than about 200 kbps usage regardless of whether I'm using a
> >>> local node that goes through a helper, direct to the helper, or a
> >>> local node without the helper. Of course, not using the helper
> >>> produces the worst effective throughput, because the uploaded data
> >>> volume stays about the same, but the data is FEC-expanded.
> >> This is probably because Tahoe doesn't use a windowing protocol, which
> >> makes it especially sensitive to the latency between nodes.
> >>
> >> You should have a look at Brian's description in bug #397.
> >
> > Very interesting, but I don't think #397 explains what I'm seeing. I
> > can pretty well saturate my upstream connection with Tahoe when
> > uploading a large file.
>
> First off, I must admit that I'm somewhat embarrassed by Tahoe's
> relatively poor throughput+latency performance. Its inability to
> saturate a good upstream pipe is probably the biggest reason that I
> sometimes hesitate to recommend it as a serious backup solution to my
> friends. In the allmydata.com world, we deferred performance analysis
> and improvement for two reasons: most consumer upload speeds were really
> bad (so we could hide behind that), and we didn't have any
> bandwidth-management tools to avoid saturating their upstream and thus
> causing problems for other users (so e.g. *not* having a windowing
> protocol could actually be considered a feature, since it left some
> bandwidth available for the HTTP requests to get out).
>
> That said, I can think of a couple of likely slowdowns.
>
> 1: The process of uploading an immutable file involves a *lot* of
> roundtrips, which will hurt small files a lot more than large ones. Peer
> selection is done serially: we ask server #1, wait for a response, then
> ask server #2, wait, etc. This could be done in parallel, to a certain
> extent (it requires being somewhat optimistic about the responses you're
> going to get, and code to recover when you guess wrong). We then send
> share data and hashes in a bunch of separate messages (there is some
> overhead per message, but 1.5.0 has code to pipeline them [see #392], so
> I don't think this will introduce a significant number of roundtrips). I
> think that a small file (1kb) can be uploaded in peer-selection plus two
> RTT (one for a big batch of write() messages, the second for the final
> close() message), but the peer-selection phase will take a minimum of 10
> RTT (one per server contacted).
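>
> To make the serial-vs-parallel difference concrete, here is a rough,
> purely illustrative sketch (not the real uploader code; the
> ask_to_hold_shares() method is a made-up stand-in for the
> allocate_buckets query):
>
>   from twisted.internet import defer
>
>   @defer.inlineCallbacks
>   def select_peers_serially(servers, storage_index):
>       # one query at a time: total latency is roughly len(servers) * RTT
>       placements = {}
>       for server in servers:
>           answer = yield server.ask_to_hold_shares(storage_index)
>           placements[server] = answer
>       defer.returnValue(placements)
>
>   def select_peers_in_parallel(servers, storage_index):
>       # fire all the queries at once and collect the answers: roughly one
>       # RTT total, at the cost of being optimistic about who will say yes
>       queries = [server.ask_to_hold_shares(storage_index)
>                  for server in servers]
>       d = defer.gatherResults(queries)
>       d.addCallback(lambda answers: dict(zip(servers, answers)))
>       return d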
>
> I've been contemplating a rewrite of the uploader code for a while now.
> The main goals would be:
>
>  * improve behavior in the repair case, where there are already shares
>   present. Currently repair will blindly place multiple shares on the
>   same server, and wind up with multiple copies of the same share.
>   Instead, once upload notices any evidence of existing shares, it
>   should query lots of servers to find all the existing shares, and
>   then generate the missing ones and put them on servers that don't
>   already have any (#610, #362, #699; see the sketch after this list).
>  * improve tolerance to servers which disappear silently during upload,
>   eventually giving up on the server instead of stalling for 10-30
>   minutes
>  * improve parallelism during peer-selection
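>
> The placement half of the repair item above might look roughly like this
> (purely illustrative; shares_already_held() and the other names are
> placeholders, not the real repairer API):
>
>   def place_missing_shares(servers, total_shares):
>       # total_shares is N (e.g. 10); shares are numbered 0..N-1
>       existing = {}                       # share number -> server holding it
>       for server in servers:
>           for shnum in server.shares_already_held():
>               existing.setdefault(shnum, server)
>       missing = [sh for sh in range(total_shares) if sh not in existing]
>       # only hand new shares to servers that hold nothing yet, instead of
>       # blindly piling several copies onto the same box
>       empty_servers = [s for s in servers if s not in existing.values()]
>       return dict(zip(missing, empty_servers))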
>
> 2: Foolscap is doing an awful lot of work to serialize and deserialize
> the data down at the wire level. This part of the work is roughly
> proportional to the number of objects being transmitted (so a list of
> 100 32-byte hashes is a lot slower than a single 3200-byte string).
> http://foolscap.lothar.com/trac/ticket/117 has some notes on how we
> might speed up Foolscap. And the time spent in foolscap gets added to
> your round-trip times (we can't parallelize over that time), so it gets
> multiplied by the 12ish RTTs per upload. I suspect the receiving side is
> slower than the sending side, which means the uploader will spend a lot
> of time twiddling its thumbs while the servers think hard about the
> bytes they've just received.
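>
> A tiny illustration of that cost difference (not foolscap-specific code,
> just the packing idea):
>
>   import hashlib
>   hashes = [hashlib.sha256(str(i).encode()).digest() for i in range(100)]
>   as_many_objects = hashes            # 100 separate 32-byte strings
>   as_one_string = b"".join(hashes)    # a single 3200-byte string
>   # the receiver slices the big string back apart:
>   unpacked = [as_one_string[i*32:(i+1)*32] for i in range(100)]
>   assert unpacked == hashes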
>
> We're thinking about switching away from Foolscap for share-transfer and
> instead using something closer to HTTP (#510). This would be an
> opportunity to improve the RTT behavior as well: we don't really need to
> wait for an ACK before we send the next block, we just need confirmation
> that the whole share was received correctly, and we need to avoid
> buffering too much data in the outbound socket buffer. In addition, we
> could probably trim off an RTT by changing the semantics of the initial
> message, to combine a do-you-have-share query with a
> please-prepare-to-upload query. Or, we might decide to give up on
> grid-side convergence and stop doing the do-you-have-share query first,
> to speed up the must-upload case at the expense of the
> might-not-need-to-upload case. This involves a lot of other projects,
> though:
>
>  * change the immutable share-transfer protocol to be less object-ish
>   (allocate_buckets, bucket.write) and more send/more/more/done-ish.
>  * change the mutable share design to make shares fully self-validating,
>   with the storage-index as the mutable share pubkey (or its hash)
>  * make the server responsible for validating shares as they arrive,
>   replace the "write-enabler" with a rule that the server accepts a
>   mutable delta if it results in a new share that validates against the
>   pubkey and has a higher seqnum (this will also increase the server
>   load a bit) (this is the biggest barrier to giving up Foolscap's
>   link-level encryption)
>  * replace the renew-lease/cancel-lease shared secrets with DSA pubkeys,
>   again to tolerate the lack of link-level encryption
>  * interact with Accounting: share uploads will need to be signed by a
>   who-is-responsible-for-this-space key, and doing this over HTTP will
>   be significantly different than doing it over foolscap.
>
> So this project is waiting for DSA-based lease management, new DSA-based
> mutable files, and some Accounting design work. All of these are waiting
> on getting ECDSA into pycryptopp.
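>
> To illustrate the earlier point about not waiting for a per-block ACK,
> here is a rough sketch (placeholder names throughout: send_block(),
> drain(), and wait_for_final_ack() are not real Tahoe or Foolscap APIs,
> and the depth is an arbitrary example value):
>
>   PIPELINE_DEPTH = 50000   # bytes allowed in flight before we pause
>
>   def upload_share(blocks, connection):
>       in_flight = 0
>       for block in blocks:
>           connection.send_block(block)   # fire and forget, no per-block ACK
>           in_flight += len(block)
>           if in_flight >= PIPELINE_DEPTH:
>               connection.drain()         # bound the outbound socket buffer
>               in_flight = 0
>       # the only thing we really wait for: confirmation that the whole
>       # share arrived intact
>       return connection.wait_for_final_ack()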
>
> 3: Using a nearby Helper might help and might hurt. You spend a lot more
> RTTs by using the helper (it effectively performs peer-selection twice,
> plus an extra couple RTTs to manage the client-to-helper negotiation),
> but now you're running encryption and FEC in separate processes, so if
> you have multiple cores or hyperthreading or whatever you can utilize
> the hardware better. The Helper (which is doing more foolscap messages
> per file than the client) is doing slightly less CPU work, so those
> message RTTs might be smaller, which might help. It would be an
> interesting exercise to graph throughput/latency against filesize using
> a local helper (probably on a separate machine but connected via LAN)
> and see where the break-even point is. We've got a ticket (#398) about
> having the client *not* use the Helper sometimes... maybe a tahoe.cfg
> option to set a threshold filesize could be useful.
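>
> Something along these lines, say (hypothetical: helper.min-filesize is
> not an existing option, only helper.furl is real today):
>
>   [client]
>   helper.furl = pb://example-tubid@helper.example.org:1234/helper-swissnum
>   # hypothetical: skip the helper for files smaller than this many bytes
>   helper.min-filesize = 1000000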
>
> > For example, if the local node supplies the SID to the helper, the
> > helper can do peer selection before the upload is completed.
>
> Yeah, currently the Helper basically tries to locate shares, reports
> back the results of that attempt, then forgets everything it's learned
> and starts a new upload. We could probably preserve some information
> across that boundary to speed things up. Maybe the first query should be
> allocate_buckets(), and if the file was already in place, we abort the
> upload, but if it wasn't, we use the returned bucket references. That
> could shave off N*RTT. We should keep this in mind during the Uploader
> overhaul; it'll probably influence the design.
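>
> Roughly (illustrative only, with a simplified allocate_buckets signature
> rather than the real RIStorageServer interface):
>
>   def helper_first_pass(storage_index, servers, wanted_sharenums):
>       already_present = set()
>       writers = {}                    # share number -> write handle
>       for server in servers:
>           # one query doubles as "do you have shares?" and "reserve space"
>           have, allocated = server.allocate_buckets(storage_index,
>                                                     wanted_sharenums)
>           already_present.update(have)
>           writers.update(allocated)
>           if already_present.issuperset(wanted_sharenums):
>               return None             # file already in the grid: abort upload
>       return writers                  # reuse these, instead of re-querying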
>
> cheers,
>  -Brian
> _______________________________________________
> tahoe-dev mailing list
> tahoe-dev at allmydata.org
> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
>