[tahoe-dev] Observations on Tahoe performance

Brian Warner warner at lothar.com
Tue Aug 25 12:05:32 PDT 2009


Shawn Willden wrote:
> 
> Can't you pass a disk-backed buffer-like object to transport.write()?
> Perhaps an mmap object? If there's a reason that doesn't work, then
> transport.write() needs to either accept a file-like object or
> implement disk-based buffering itself.

The short answer is no, transport.write() accepts a bytestring and
nothing else. However, it's pretty easy to build a disk-backed
"Producer" object and attach it to the transport (which implements the
"Consumer" interface), which will be pulled from whenever the buffer has
enough space; it'd take maybe 10 lines of code (this is what the
Downloader does when invoked by an HTTP GET).
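
Something along these lines would do it (an illustrative sketch, not
the actual Downloader code; the class name and chunk size are made up):

    from zope.interface import implementer
    from twisted.internet.interfaces import IPullProducer

    @implementer(IPullProducer)
    class FileSendingProducer:
        # Feeds an on-disk file to a consumer (e.g. a transport) in
        # chunks. The consumer calls resumeProducing() whenever its
        # buffer has room for more.
        CHUNK_SIZE = 65536

        def __init__(self, fileobj, consumer):
            self.fileobj = fileobj
            self.consumer = consumer
            consumer.registerProducer(self, streaming=False)

        def resumeProducing(self):
            data = self.fileobj.read(self.CHUNK_SIZE)
            if data:
                self.consumer.write(data)
            else:
                self.consumer.unregisterProducer()
                self.fileobj.close()

        def stopProducing(self):
            self.fileobj.close()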

But, now think about where the data goes: you have to write N/k*1GB to
disk, and then read it all back out again. Sure, you get to retire the
input stream quickly (well, as quickly as you're able to process and
write gigabytes to your disk, which could still take an hour). But now
you add the overhead of doing all of that disk IO, which will probably
result in a slower overall upload as it stalls everything else you're
trying to do (disk IO always blocks the process, since nobody has a
usable async filesystem API), as well as slamming the host and slowing
down other processes.
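
(For scale, assuming the usual 3-of-10 encoding: N/k = 10/3, so a 1GB
file becomes roughly 3.3GB of share data, every byte of which would
have to be written to disk and then read back out.)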

The "Mountain View" system which predated tahoe (derived from Mnet and
the original MojoNation codebase) did this, and performance for large
files was horrible. One problem was that it encoded everything before
starting any part of the upload (so it could retire the input file
faster, and to simplify the "Pusher" uploading code). This hurt small
files: lots of extra steps and disk IO for something that could
otherwise be finished in RAM and retired before it fell out of the CPU
cache.

But the biggest problem was that this approach performs the share
transposition on disk. As you know, each share contains one block per
segment (otherwise we'd hit the "chunking problem", where reliability
drops quickly with the size of the file). If you draw the shares as
short+wide horizontal boxes and stack them on top of each other, then
the encoding process is filling in this rectangle from the left
(segment[0], then segment[1], etc). However, the upload process
(depending upon how you do it) is *reading* this rectangle from the top:
sending out share[0], then share[1], etc. The matrix is being transposed
in between the encoding step and the upload step.
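
In code terms, the mismatch looks something like this (a toy
illustration with made-up sizes and a fake encoder, just to show the
two access orders):

    NUM_SEGMENTS = 4   # columns: filled left-to-right by the encoder
    NUM_SHARES = 10    # rows: read top-to-bottom by the uploader

    def encode(seg):
        # stand-in for the real FEC: one block per share per segment
        return ["seg%d-block-for-share%d" % (seg, sh)
                for sh in range(NUM_SHARES)]

    # encoding order: write one column (segment) at a time
    share_files = dict((sh, []) for sh in range(NUM_SHARES))
    write_order = []
    for seg in range(NUM_SEGMENTS):
        for sh, block in enumerate(encode(seg)):
            share_files[sh].append(block)
            write_order.append((sh, seg))

    # upload order: read one row (share) at a time
    read_order = [(sh, seg) for sh in range(NUM_SHARES)
                            for seg in range(NUM_SEGMENTS)]

    assert sorted(write_order) == read_order   # same blocks...
    assert write_order != read_order           # ...completely different order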

This kills the disk: you write a huge amount of data in one order, then
(from the disk's point of view) do a bazillion seeks and read it out in
a completely different order. All of the disk and local fs's readahead
optimizations fail. You could literally hear the disk crunching as the
heads were flying around like mad. The computer sounded angry. People
didn't like this :).

So one of the big early decisions in Tahoe was to move this
transposition into the network. By uploading all the shares at the same
time (instead of storing them to disk first and uploading them one at a
time, as peers became available), we can hide this transposition in the
wire, and allow both sides to do nice clean linear reads/writes to the
source and share files respectively. This was dramatically less painful.
Disk seeks are the most expensive thing you can do (even worse than TCP
setup time).

> Expecting the data to be small enough to be queueable in RAM isn't a
> good idea, even with ubiquitous virtualized memory and gigabytes of
> physical RAM.

Well, this touches on a deeper issue: the "leaky abstraction" of where
the buffering lives. The layers of the network stack like to pretend
that they're independent, and that socket.write() is fire-and-forget
(i.e. once socket.write() returns, the kernel has taken responsibility
and the application doesn't need to think about it anymore). But it's a
big multi-party system, each layer with its own limitations, and the layers
can't accept unconditional responsibility without incurring unbounded
costs.

There's "backpressure" from one layer to another, like when a twisted
Consumer stalls its Producer, or when socket.write() doesn't accept all
of your data, or when the remote TCP stack doesn't ACK your segment and
open up the window until *their* userspace has socket.read() some of the
inbound data.
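
The socket case, for instance, looks like this with a non-blocking
socket (a minimal sketch; in Python the call is sock.send() rather than
write(), and a short count or EAGAIN is the kernel's way of pushing
back):

    import errno
    import socket

    def try_to_send(sock, data):
        # sock is non-blocking: the kernel takes only as much as fits in
        # its send buffer right now
        try:
            n = sock.send(data)
        except socket.error as e:
            if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                n = 0  # buffer completely full: nothing accepted
            else:
                raise
        return data[n:]  # the unsent tail is still the application's problem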

Naive programs, with blocking IO calls, usually ignore this
backpressure: things work fine at low volumes, but stall in
unpredictable ways when usually-quick remote calls start taking a long
time.

To avoid unbounded storage usage, you always want to convey this
backpressure as far back as possible (ideally by making the human not
type until there's room for it). But that's both complicated and
annoying: many layers don't like to be told to wait, and are unwilling
to accept an API in which all of their data is not instantly accepted.
To accommodate those layers, the layer below must buffer.

If the insistent layer is not using some sort of backpressure of its own
(e.g. the Encoder waiting for all servers to ACK block receipt before
encoding the next segment), then that buffering can involve an unbounded
amount of storage. Usually what happens is that somebody imposes a limit
on the buffer size and starts throwing errors when it gets full, in the
hopes that this will make the sender slow down. However, without careful
design, it is easy to wind up with runaway buffering issues: the system
gets so busy handling these errors that it can't make any progress on
the existing queue. Network driver and TCP stack design encourages a
"discard early" policy: if you're going to have to drop that inbound
packet (because you don't have anywhere to put it), make the decision
with as little work as possible, so you can conserve your CPU (or memory
bandwidth, or whatever) for working through the data that you *don't*
discard.

Regardless of how it's limited, the buffer storage can be hosted in
memory, or on disk, or on some clever combination of both. On modern
systems, it frequently doesn't matter which an application uses, because
the kernel will effectively combine the two (use enough memory and it'll
get swapped out to disk; or write to disk but really your data gets held
in a RAM-based writecache until it's convenient to write, and reads will
come from RAM, and if you delete it quick enough it'll never actually
touch the spindle). On Windows, this didn't work so well (the kernel
writecache wasn't too bright). And when the data size gets large, it
stops working out so well (the kernel wants to avoid unbounded
disk-cache storage, so once the cache gets too big it will stall your
file.write() and flush to disk, and all of that N/k*1GB will actually
touch the disk). But it has to go somewhere, and the larger it gets,
the more
trouble it will cause.

So the one-segment-at-a-time design is there to provide that continuous
chain of backpressure: keep the pipe full, but don't encode (too much)
more data than we can currently use. Don't encrypt more data than we can
encode. Don't read more data than we can encrypt. Keep the buffering
requirements bounded. We currently use something like 3x or 4x the
segment size (so probably 0.5MB max) per active upload or download.
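
Roughly (a sketch only: the helper names here are invented, and the
real Encoder/Uploader has a lot more going on), that loop looks like:

    from twisted.internet import defer

    @defer.inlineCallbacks
    def upload(source_file, servers, segment_size, encrypt, fec_encode):
        # sketch: encrypt, fec_encode, and put_block are stand-ins for
        # the real machinery
        while True:
            plaintext = source_file.read(segment_size)
            if not plaintext:
                break
            crypttext = encrypt(plaintext)   # only one segment's worth
            blocks = fec_encode(crypttext)   # one block per share
            # send this segment's block to every server, then wait for
            # them all to ACK before touching the next segment: this is
            # the backpressure that keeps buffering bounded
            yield defer.gatherResults(
                [server.put_block(blocks[shnum])
                 for shnum, server in enumerate(servers)])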

To improve our streaming story, we need to push it back one step
further. Using a randomly-generated key (or changes to the way we encode
files, which we can discuss on NewImmutableEncodingDesign) would get us
down to one IO pass. Then the "don't read more data than we can encrypt"
would mean reading from the user (i.e. from the HTTP PUT pipe), rather
than reading from the tempfile where the PUT stashed the request body so
we could do the CHK pass before starting the encoding and upload. This
would finally apply backpressure to the webapi client, which would
probably then avoid reading (from local disk) until needed. This would
complete the job: the plaintext sits nice and quiet on the original
uploader's disk until the last possible moment. The intermediate layers
buffer just enough to keep the pipeline full and avoid delays due to
non-zero round-trip times. Everything happens "Just In Time" and storage
requirements (including the tempfile) are dropped to a minimum.


cheers,
 -Brian

