#3787 new task

Batch sizes when uploading immutables are hardcoded

Reported by: itamarst
Owned by:
Priority: normal
Milestone: HTTP Storage Protocol v2
Component: unknown
Version: n/a
Keywords:
Cc:
Launchpad Bug:

Description (last modified by itamarst)

Updated issue description: there is a single hardcoded value for batching (formerly known as pipelining) immutable uploads, and it might be better for that value to be dynamic, or at least higher.


Initial issue description:

The Pipeline class was added in #392, but I really don't understand the reasoning.

It makes a bit more sense if you replace the word "pipeline" with "batcher" when reading the code, but I still don't understand why round-trip time is improved by this approach.

Change History (5)

comment:1 Changed at 2021-10-04T13:27:33Z by itamarst

Brian provided this highly detailed explanation:

If my dusty memory serves, the issue was that uploads have a number of
small writes (headers and stuff) in addition to the larger chunks
(output of the erasure coding). Also, the "larger" chunks are still
pretty small. And the code that calls _write() is going to wait for the
returned Deferred to fire before starting on the next step. So the
client will send a tiny bit of data, wait a roundtrip for it to be
accepted, then start on the next bit, wait another roundtrip, etc. This
limits your network utilization (the percentage of your continuous
upstream bandwidth that you're actually using): the wire is sitting idle
most of the time. It gets massively worse with the round trip time.

The general fix is to use a windowed protocol that optimistically sends
lots of data, well in advance of what's been acknowledged. But you don't
want to send too much, because then you're just bloating the transmit
buffer (it all gets held up in the kernel, or in the userspace-side
socket buffer). So you send enough data to keep X bytes "in the air",
unacked, and each time you see another ack, you send out more. If you
can keep your local socket/kernel buffer from ever draining to zero,
you'll get 100% utilization of the network.
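
Roughly, the windowing idea looks something like this minimal sketch
(purely illustrative, not Tahoe-LAFS code; the class, the send()
callback, and the acked() hook are all made up):

    from collections import deque

    class Window:
        """Keep at most `window` bytes un-acked; queue the rest."""

        def __init__(self, window=50 * 1024):
            self.window = window      # max un-acked bytes "in the air"
            self.in_flight = 0        # bytes sent but not yet acked
            self.pending = deque()    # writes waiting for window space

        def write(self, data, send):
            # Send immediately if there is window room (and nothing is
            # queued ahead of us, to preserve ordering); otherwise queue.
            if not self.pending and self.in_flight + len(data) <= self.window:
                self.in_flight += len(data)
                send(data)
            else:
                self.pending.append(data)

        def acked(self, nbytes, send):
            # Called when the peer acknowledges nbytes: refill the window.
            self.in_flight -= nbytes
            while self.pending and self.in_flight + len(self.pending[0]) <= self.window:
                data = self.pending.popleft()
                self.in_flight += len(data)
                send(data)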

IIRC the Pipeline class was a wrapper that attempted to do something
like this for a RemoteReference. Once wrapped, the caller doesn't need
to know about the details, it can just do a bunch of tiny writes, and
the Deferred it gets back will lie and claim the write was complete
(i.e. it fires right away), when in fact the data has been sent but not
yet acked. It keeps doing this until the sent-but-not-acked data exceeds
the size limit (looks like 50kB, OMG networks were slow back then), at
which point it waits to fire the Deferreds until something actually gets
acked. Then, at the end, to make sure all the data really *did* get
sent, you have to call .flush(), which waits until the last real call's
Deferred fires before firing its own returned Deferred.
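
In code terms, that behaviour is roughly the following (a hedged sketch
only, assuming a Twisted-style write callable that returns a Deferred
which fires on ack; this is not the real Pipeline class):

    from twisted.internet.defer import Deferred, succeed

    class PipelineSketch:
        def __init__(self, write, limit=50 * 1024):
            self._write = write       # write(data) -> Deferred, fires on ack
            self._limit = limit       # max sent-but-not-acked bytes
            self._unacked = 0
            self._blocked = []        # callers waiting for window space
            self._flushers = []       # callers waiting for a full drain

        def write(self, data):
            self._unacked += len(data)
            self._write(data).addCallback(self._acked, len(data))
            if self._unacked <= self._limit:
                return succeed(None)  # the "lie": claim the write completed
            d = Deferred()            # over the limit: caller really waits
            self._blocked.append(d)
            return d

        def flush(self):
            # Fires only once every earlier write has actually been acked.
            if self._unacked == 0:
                return succeed(None)
            d = Deferred()
            self._flushers.append(d)
            return d

        def _acked(self, result, size):
            self._unacked -= size
            while self._blocked and self._unacked <= self._limit:
                self._blocked.pop(0).callback(None)
            if self._unacked == 0:
                flushers, self._flushers = self._flushers, []
                for d in flushers:
                    d.callback(None)
            return result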

So it doesn't reduce the number of round trips, but it reduces the
waiting for them, which should increase utilization significantly.

Or, it would, if the size limit were appropriate for the network speed.
There's a thing in TCP flow control called "bandwidth delay product"[1],
I forget the details, but I think the rule is that bandwidth times round
trip time is the amount of unacked data you can have outstanding "on the
wire" without 1: buffering anything on your end (consumes memory, causes
bufferbloat) or 2: letting the pipe run dry (reducing utilization). I'm
pretty sure the home DSL line I cited in that ticket was about 1.5Mbps
upstream, and I bet I had RTTs of 100ms or so, for a BxD of 150kbits, or about
20kB. These days I've got gigabit fiber, and maybe 50ms latency, for a
BxD of 6MB.
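
A quick sanity check of those numbers (bandwidth times RTT, divided by
8 to get bytes):

    def bdp_bytes(bandwidth_bps, rtt_seconds):
        # Bandwidth-delay product: bytes that can be un-acked without
        # either buffering locally or letting the pipe run dry.
        return bandwidth_bps * rtt_seconds / 8

    print(bdp_bytes(1.5e6, 0.100))   # old DSL: 18750 bytes, roughly 20kB
    print(bdp_bytes(1e9, 0.050))     # gigabit fiber: 6250000 bytes, ~6MB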

As the comments say, we're overlapping multiple shares during the same
upload, so we don't need to pipeline the full 6MB, but I think if I were
using modern networks, I'd increase that 50kB to at least 500kB and
maybe 1MB or so. I'd want to run upload-speed experiments with a couple
of different networking configurations (apparently there's a macOS thing
called "Network Link Conditioner" that simulates slow/lossy network
connections) to see what the effects would be, to choose a better value
for that pipelining depth.

And of course the "right" way to do it would be to actively track how
fast the ACKs are returning, and somehow adjust the pipeline depth until
the pipe was optimally filled. Like how TCP does congestion/flow
control, but in userspace. But that sounds like way too much work.
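
Purely to illustrate what that might look like, an AIMD-style adjustment
in the spirit of TCP congestion control (entirely hypothetical; nothing
like this exists in the code):

    class AdaptiveDepth:
        def __init__(self, depth=50 * 1024, max_depth=8 * 1024 * 1024):
            self.depth = depth            # current pipeline depth in bytes
            self.max_depth = max_depth

        def on_ack(self, rtt, baseline_rtt):
            if rtt < baseline_rtt * 1.2:
                # ACKs returning quickly: the pipe isn't full yet, so grow
                # the window additively.
                self.depth = min(self.depth + 64 * 1024, self.max_depth)
            else:
                # RTT is inflating: we're just filling buffers somewhere,
                # so back off multiplicatively.
                self.depth = max(self.depth // 2, 64 * 1024)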

comment:2 Changed at 2021-10-04T14:01:02Z by itamarst

From the above we can extract two problems:

  1. A need for backpressure.
  2. _write() waiting for the Deferred to fire before continuing. If the need for backpressure didn't exist, this would be bad. Given that backpressure is necessary, this might be OK. Or not; perhaps there is a better mechanism.

So step 1 is probably to figure out how to implement backpressure.
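
One possible backpressure mechanism (an assumption, not a decision): cap
the number of in-flight writes with Twisted's DeferredSemaphore, assuming
write(data) returns a Deferred that fires when the write is acked:

    from twisted.internet.defer import DeferredSemaphore

    sem = DeferredSemaphore(tokens=4)   # at most 4 writes outstanding

    def bounded_write(write, data):
        # acquire() returns a Deferred that fires when a token is free;
        # the token is released only once the underlying write is acked,
        # so callers are naturally slowed to the server's pace.
        d = sem.acquire()
        d.addCallback(lambda _ign: write(data))
        d.addBoth(lambda result: (sem.release(), result)[1])
        return d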

comment:3 Changed at 2021-10-04T14:13:42Z by itamarst

Instead of hardcoding the buffer size, we could (rough sketch after this list)...

  1. Figure out latency by sending HTTP echo to server.
  2. Start with some reasonable batch buffer size.
  3. Keep increasing buffer size until the latency from sending a batch is higher than minimal expected latency from step 1. This implies that we've hit the bandwidth limit.
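
A rough sketch of that probing loop (the echo() helper, which returns a
measured round-trip time in seconds, and the send_batch() helper are
both hypothetical):

    import time

    def choose_batch_size(echo, send_batch, start=64 * 1024,
                          maximum=8 * 1024 * 1024):
        # Step 1: estimate minimal round-trip latency with a few tiny echoes.
        baseline = min(echo() for _ in range(5))

        # Step 2: start with some reasonable batch size.
        size = start

        # Step 3: grow until sending a batch takes noticeably longer than
        # the baseline RTT, i.e. until we are bandwidth-limited rather
        # than latency-limited.
        while size < maximum:
            started = time.monotonic()
            send_batch(size)
            elapsed = time.monotonic() - started
            if elapsed > baseline * 2:
                break
            size *= 2
        return size
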
Last edited at 2021-10-04T14:14:24Z by itamarst

comment:4 Changed at 2022-11-23T15:19:43Z by itamarst

  • Description modified (diff)
  • Milestone changed from HTTP Storage Protocol to HTTP Storage Protocol v2

Once #3939 is fixed, the Pipeline class will no longer be used. However, there will still be a batching mechanism via allmydata.immutable.layout._WriteBuffer, which suffers from basically the same issue of having a single hardcoded number that isn't necessarily adapted to network conditions.

So this should still be thought about, based on the discussion above, but I'm changing the summary and description in the meantime.

comment:5 Changed at 2022-11-28T16:15:33Z by itamarst

  • Summary changed from Is the use of Pipeline for write actually necessary? to Batch sizes when uploading immutables are hardcoded