[tahoe-dev] 1,000,000+ files?

Brian Warner warner-tahoe at allmydata.com
Tue May 6 19:14:26 PDT 2008


> Thanks. Do you know the exact limitation per file? So I can restrict  
> an upload size.

The relevant limit is at:

 http://allmydata.org/trac/tahoe/browser/src/allmydata/mutable/publish.py?rev=2498#L97

and is 3.5MB (that's 3.5e6 bytes) per mutable file. This limit is fairly
arbitrary. One reason to keep it below infinity is to keep the memory
footprint to a reasonable level: the node will use something like 4x the
segment size to upload a file, and something like 2x to download it. The
other reason is to discourage people from using very large mutable files
while they are in the current inefficient state.
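
As a rough back-of-the-envelope check (treating the 4x/2x factors as
approximations rather than exact accounting), a single-segment 3.5MB
mutable file costs roughly:

    # rough memory estimate; the 4x / 2x multipliers are approximate
    SEGMENT_SIZE = 3.5e6                    # one segment per SDMF file today
    upload_footprint = 4 * SEGMENT_SIZE     # ~14MB while publishing
    download_footprint = 2 * SEGMENT_SIZE   # ~7MB while retrieving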

Dirnodes are a particular application of mutable files, which serializes a
table of (name, child URI, other stuff) entries and stuffs the resulting
string in a mutable file. We're seeing each entry consume roughly 330 bytes,
but the exact figure depends upon the length of the child name and the length
of the child URI, so that 330 isn't a hard-and-fast number. As a result,
dirnodes can hold roughly 3.5MB/330 = 10606 children, but the actual limit
will vary. So we tell people to keep it under 10k children and hope they
aren't in the habit of using kilocharacter filenames.
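
To make the arithmetic concrete (the 330-byte figure is an observed
average, not a guarantee), the estimate works out like this:

    # rough capacity estimate for one dirnode; all numbers are approximate
    MAX_MUTABLE_SIZE = 3.5e6    # bytes, the current publish.py limit
    BYTES_PER_ENTRY = 330       # observed average; grows with name/URI length
    max_children = int(MAX_MUTABLE_SIZE / BYTES_PER_ENTRY)   # ~10606
    # longer filenames or child URIs push the per-entry cost up
    # and the usable child count down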

> Right now the mapping is done in a database, then we have a high  
> performance caching layer that uses a distributed hash table for fast  
> lookups based on URL (pretty url -> Tahoe URI).

That sounds ideal. Tahoe dirnodes are a convenience, so that people can use
Tahoe like a traditional filesystem without being forced to maintain their
own tables like your database. If some alternative table suits your needs
better, then by all means bypass tahoe's native dirnodes in favor of that
table.
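
For illustration only (the names here are made up, not part of tahoe), the
kind of table I mean can be as simple as:

    # hypothetical sketch: your own (pretty URL -> Tahoe URI) mapping,
    # used instead of walking tahoe dirnodes
    url_to_cap = {
        "/photos/2008/cat.jpg": "URI:CHK:...",   # caps elided
        "/docs/readme.txt":     "URI:SSK:...",
    }

    def resolve(pretty_url):
        # look up the Tahoe URI, then fetch it through the node's web API
        return url_to_cap.get(pretty_url)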

(Out of curiosity, what counts as a "pretty URL" in your scheme? We've been
thinking about ways to make tahoe URIs smaller, but we've gone back and
forth about how small we should try to make them, and about how many bits
of unguessability we need to maintain. I'd be interested to hear where in
the design space you've landed.)

The only thing you'll need to be aware of is accounting / garbage collection
(ticket #119). We haven't implemented it yet, but eventually each share will
need to have a lease on it that identifies an account of some sort, and
clients will be responsible for cancelling leases on shares that they are no
longer using. There are a couple of different ways we might approach this,
but most of them require the client to be able to enumerate all of the files
that they wish to keep alive (and either periodically renew a lease on each
one, or keep track of the deltas and cancel a lease on everything that's been
deleted).

For the allmydata.com case, we'll probably wind up doing a simple recursive
walk of each customer's root dirnode. But if you're keeping track of URIs
outside of directory nodes, then you'll need some way to get those URIs into
the set of "things that should be kept alive". You could probably implement
this by simply taking the list of all the values in your table.
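
A minimal sketch of that approach (renew_lease here is a stand-in; the real
lease-renewal API doesn't exist yet, per ticket #119) might be:

    # hypothetical sketch: periodically renew leases on every URI
    # you still care about
    def renew_all_leases(url_to_cap, renew_lease):
        for cap in url_to_cap.values():
            renew_lease(cap)   # placeholder for the eventual lease API
    # anything removed from the table stops being renewed,
    # and its lease eventually expires on its own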

> Also, for MDMF - would this allow a client to locally diff the changes
> made, so that only that segment could be pushed to the DFS instead of
> the entire file being reuploaded? Was curious about the situations where
> you have large segments. Or would LDMF support that?

One of the goals for the MDMF milestone is exactly that. A one-byte change
(replacement) to a mutable file should not require the node to push more than
one segment (say 128KiB) into the grid (plus an updated hash chain). Making
an N byte change to a file of size M should result in roughly N bytes being
uploaded rather than all M.
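
Purely illustrative arithmetic (assuming 128KiB segments): a small write
only touches the segments it overlaps, so only those get re-pushed:

    # which segments does a write of n bytes at offset touch?
    SEGMENT = 128 * 1024
    offset, n = 1000000, 1                 # a one-byte change
    first = offset // SEGMENT              # segment 7
    last = (offset + n - 1) // SEGMENT     # segment 7
    segments_to_push = last - first + 1    # 1 segment (~128KiB), not the whole file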

We allow large segments (>128KiB) for mutable files right now so that
mutable files can be large (up to 3.5MB). The "we haven't implemented
everything that we've designed" limitation of our SDMF code is that we don't
yet support multiple segments: there is code that needs to loop over each
segment and hash/send each one separately, and I just haven't written that
code yet. As a result, we can only handle a single segment, so the only way
to get >128KiB mutable files (and therefore dirnodes with more than ~397
children) is to let the segment size grow. The next SDMF development
milestone is to implement that code and allow multiple segments.
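
The ~397 figure just comes from the default segment size and the same rough
per-entry cost as before:

    # children that fit in a single default-sized (128KiB) segment
    DEFAULT_SEGMENT = 128 * 1024
    BYTES_PER_ENTRY = 330                                        # rough average
    children_per_segment = DEFAULT_SEGMENT // BYTES_PER_ENTRY    # ~397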

One of the two LDMF goals is to allow that one-byte change to be an insert
or a delete, rather than just a replace. This is well beyond what POSIX or
(to my limited knowledge) any other traditional disk-based filesystem
offers. Clearly, applications will need something more than the FUSE API to
take advantage of this.

In the long run (once we remove this one-segment SDMF limitation), we really
want to keep the segment size low (128KiB or 1MiB), both because it keeps the
memory footprint low, and because it improves the "alacrity": basically the
amount of extra data that you need to hash or verify in order to get a single
byte of plaintext. The reason for not making the segment size very low (1KiB)
is the overhead, which is on the order of 128 bytes per segment.
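
To see why 1KiB segments would hurt (treating 128 bytes/segment as an
order-of-magnitude figure), compare the relative overhead at a few sizes:

    # per-segment overhead as a fraction of the data it carries
    OVERHEAD = 128.0                       # bytes per segment, roughly
    for seg in (1024, 128 * 1024, 1024 * 1024):
        print(seg, OVERHEAD / seg)         # 12.5%, ~0.1%, ~0.01%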

cheers,
 -Brian

