[tahoe-dev] Random access to files

Shawn Willden shawn-tahoe at willden.org
Wed Jan 28 07:22:28 PST 2009


Brian Warner <warner-tahoe at allmydata.com> writes:
> On Tue, 27 Jan 2009 06:59:49 -0700
> Shawn Willden <shawn-tahoe at willden.org> wrote:
>> How hard would it be to add support for random access?
> Not impossible, but not trivial.

Dang.

> In current Tahoe trunk, a webapi GET call with a "Range:"
> header will work as expected (for both mutable and immutable
> files). However, the latency (what we refer to as "alacrity") for
> the immutable file will be equal to the time it takes for the webapi
> node to retrieve all bytes up-to-and-including the end of the range
> from the storage servers (rounded up to the end of a segment).

Hmm.  What I want to do is binary search a sorted file.  Assuming
caching in the node, this is a small win, but only a small one: the
very first probe hits the middle of the file, so the node has already
fetched half of it before the search narrows at all.
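
For concreteness, here's roughly what I have in mind -- a sketch in
Python, with a made-up node URL, read cap, and record size, assuming
the webapi honors standard HTTP Range requests as you describe:

import urllib.request

NODE = "http://127.0.0.1:3456/uri/"
CAP = "URI:CHK:..."   # hypothetical immutable read cap
RECSIZE = 64          # assume fixed-size sorted records

def read_record(i):
    # One ranged GET per probe; each probe's latency is proportional
    # to everything up to the end of the range, per the above.
    hdrs = {"Range": "bytes=%d-%d" % (i * RECSIZE, (i + 1) * RECSIZE - 1)}
    req = urllib.request.Request(NODE + CAP, headers=hdrs)
    return urllib.request.urlopen(req).read()

def search(key, nrecords):
    lo, hi = 0, nrecords
    while lo < hi:
        mid = (lo + hi) // 2
        if read_record(mid)[:len(key)] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo          # index of the first record >= key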

> (note that our SDMF rules let us assume that mutable files are
> relatively small, so everything is kept in a single segment, and
> random access always grabs the whole file.)

A single segment?  Or a single bucket?  Is this what mutable.txt
refers to when it says:

    "i.e. the write enabler is associated with the "bucket", rather
     than the individual shares."

Let me see if I have the terminology straight here (maybe I should
start a glossary.txt file!):

* slot:     A persistent reference to a mutable file

* segment:  A chunk of an encrypted file.  Segments are the units for
            erasure coding.

* share:    The output of erasure coding is a set of N shares, k of
            which are needed to reconstruct the file.  A share is also
            the unit of upload/download.

* bucket:   A set of shares, each from a different segment of a single
            file, that is stored on a given server.  A bucket is
            usually uploaded and downloaded as a unit, though I don't
            believe that's necessary.

I also see the word "block" floating around, and I'm not sure what
that is.  A synonym for segment, maybe?
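
For concreteness, here's a toy 2-of-3 code in Python -- plain XOR
parity, nothing like the real Reed-Solomon coding Tahoe's zfec does --
just to pin the terms down as I understand them (k=2, N=3).  If
"block" means one segment's piece within a share, that would explain
the term, but that's a guess:

def encode(segment):
    # Split one segment into two halves, pad, and add a parity piece.
    half = (len(segment) + 1) // 2
    a = segment[:half]
    b = segment[half:].ljust(half, b"\0")
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]    # N=3 pieces; any k=2 recover the segment

def decode(a=None, b=None, parity=None):
    # Reconstruct from any two pieces (a real code would record the
    # true segment length rather than stripping pad bytes).
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, parity))
    if b is None:
        b = bytes(x ^ y for x, y in zip(a, parity))
    return (a + b).rstrip(b"\0")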

> Your code analysis is correct. I know that the AES class didn't
> always provide a way to start the counter at something other than
> zero, but I believe it does now.

It definitely does.  This method from modes.cpp does the job:

// Add iterationCount into the big-endian counter block, byte by byte
// from the least-significant end, propagating carries; m_register is
// the initial counter and m_counterArray receives the seeked value.
void CTR_ModePolicy::SeekToIteration(lword iterationCount)
{
    int carry=0;
    for (int i=BlockSize()-1; i>=0; i--)
    {
        // low byte of the remaining count, plus carry from the byte below
        unsigned int sum = m_register[i] + byte(iterationCount) + carry;
        m_counterArray[i] = (byte) sum;
        carry = sum >> 8;
        iterationCount >>= 8;   // move on to the next more-significant byte
    }
}

Kinda complicated for what it does, but I suppose that's the price of
not making any assumptions about block size.
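
Mapping a plaintext byte offset to an iteration count is then just a
divide by the AES block size.  A quick Python illustration of the same
arithmetic (my own sketch, not Tahoe or Crypto++ code):

def seek_counter(register, iteration_count):
    # Treat the counter block as one big-endian integer and add the
    # iteration count to it, wrapping on overflow -- exactly what the
    # byte-at-a-time carry loop above computes.
    width = 8 * len(register)
    n = (int.from_bytes(register, "big") + iteration_count) % (1 << width)
    return n.to_bytes(len(register), "big")

# For AES (16-byte blocks), starting at byte offset `off` means seeking
# to iteration off // 16 and then discarding off % 16 keystream bytes.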


> The client-side helper code, in immutable.upload.AssistedUploader or
> EncryptAnUploadable, is another place that needs that sort of
> functionality, since the server-side Helper might tell it to send
> ciphertext starting from the middle of the file.

Yeah, but I don't care about that, I only want to write to the ends of
files anyway :-)

> I'm not sure how deeply you'd like to get into this.

Hmm.  Not that deeply, not now anyway.

> Reducing the number of segments that get fetched is basically enough
> to reduce the amount of work done for a single random-access
> read. You could reduce it further by fetching a subset of the hash
> tree: to validate one segment, you only need to fetch log2(Nsegs)
> hashes, but since the current code assumes it's going to download
> the whole thing, it fetches all (Nsegs) of them ahead of time.
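
Here's a little Python sketch (my own toy, not Tahoe's hashtree code)
of which nodes that minimal fetch would be, numbering the tree like a
binary heap with the root at 0:

def needed_hashes(leaf, nleaves):
    # Sibling of every node on the path from the leaf up to the root;
    # combined with the segment itself, these recompute the root hash.
    node = nleaves - 1 + leaf      # leaves occupy the last level
    needed = []
    while node > 0:
        needed.append(node + 1 if node % 2 else node - 1)
        node = (node - 1) // 2
    return needed

# For a 64-segment file, needed_hashes(5, 64) has log2(64) == 6
# entries, versus the full set fetched ahead of time today.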

I don't think I see the issue here.  Each segment (rather, the shares
of each segment) contains all the hash tree nodes needed to verify
that segment back to the root, right?

> Note that this basic approach means that two successive random-accesses (even
> of the same byte) will take equal amounts of time. The full project described
> above would fix this, and would allow random-access to be the primary access
> mode (since full-file access could be implemented efficiently in terms of a
> sequence of regular reads).

You could go either way there.  The caching can either be in the
client or in the server.  The truly naive clients will be those that
don't even know they're using a DFS -- normal programs reading through
a FUSE layer, for example.  But in that case, the FUSE layer is the
real client and it knows to be smarter.

>> Actually, it may be better to return the whole sequence of segments and an
>> offset to tell the client where its range begins. That way the client can
>> cache segments.
>
> Eh, I'd be worried about confusing the client: "I asked for 500-600, and
> somehow I got back 400-700".

Again, I think it's reasonable to assume that the client knows what's
going on.  But I like your suggested API better.  It's cleaner.  In
particular my suggestion wouldn't work well at all with the web API.
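
A smart client (a FUSE layer, say) could still get segment-granular
caching on top of the exact-range API just by aligning its own Range
requests.  A sketch, assuming a 128 KiB segment size:

SEGSIZE = 128 * 1024   # assumption; actual segment size may differ

def aligned_range(offset, length):
    # Widen [offset, offset+length) to whole segments, cache what
    # comes back, and serve the caller's exact bytes from the cache.
    start = (offset // SEGSIZE) * SEGSIZE
    end = -(-(offset + length) // SEGSIZE) * SEGSIZE - 1
    return "bytes=%d-%d" % (start, end)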

      Shawn.

