[tahoe-dev] 1,000,000+ files?
Chris Goffinet
cg at chrisgoffinet.com
Tue May 6 16:47:57 PDT 2008
Brian:
Thanks. Do you know the exact size limit per file, so that I can
restrict the upload size?
Right now the mapping is stored in a database, fronted by a
high-performance caching layer that uses a distributed hash table for
fast lookups by URL (pretty URL -> Tahoe URI).
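
For illustration, the lookup path described here (cache first, database on a miss) could be sketched roughly as below. Everything in it is an assumption made for the example: the table name url_map, the class UriLookup, the placeholder URI, and the plain dict standing in for the distributed hash table.

```python
import sqlite3

class UriLookup:
    """Read-through cache in front of a database (illustrative only)."""

    def __init__(self, db):
        self.db = db
        self.cache = {}  # hypothetical stand-in for the DHT caching layer

    def get(self, pretty_url):
        # Fast path: the cache already knows this URL.
        if pretty_url in self.cache:
            return self.cache[pretty_url]
        # Slow path: ask the database, then remember the answer.
        row = self.db.execute(
            "SELECT tahoe_uri FROM url_map WHERE pretty_url = ?",
            (pretty_url,),
        ).fetchone()
        if row is None:
            return None
        self.cache[pretty_url] = row[0]
        return row[0]

# Hypothetical schema and data, held in memory for the example.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE url_map (pretty_url TEXT PRIMARY KEY, tahoe_uri TEXT)")
db.execute(
    "INSERT INTO url_map VALUES (?, ?)",
    ("/photos/cat.jpg", "URI:CHK:placeholder-1"),  # not a real Tahoe URI
)
lookup = UriLookup(db)
print(lookup.get("/photos/cat.jpg"))
```

With this arrangement Tahoe never needs to list a directory; it only has to serve the URI that the lookup returns.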
Also, for MDMF - would this allow a client to diff the changes locally,
so that only the changed segment is pushed to the DFS instead of
re-uploading the entire file? I was curious about situations where you
have large segments. Or would LDMF be the one to support that?
On May 6, 2008, at 4:13 PM, Brian Warner wrote:
> On Sun, 4 May 2008 13:31:01 -0700
> Chris Goffinet <cg at chrisgoffinet.com> wrote:
>
>> I remember reading in a ticket that if you had over 9600 files in a
>> folder, you started to notice high CPU load. Has this issue been
>> addressed? If I were to start storing over 1,000,000 files, are there
>> any techniques you recommend? I originally was hoping to just store
>> them in 1 directory as it was the easiest (I wouldn't be doing lookups
>> asking for a list of all files).
>
> Yeah, tahoe directories can't handle massive numbers of children.
>
> The specific memory-consumption problem we observed (#379) was an
> edge-effect: two different size limits, such that a directory got past
> one size check but got stopped on the other one. (also, it interacted
> with a couple of other problems, the most fundamental of which is
> inside Twisted, ticket Twisted#2466).
>
> But the limit remains. Our mutable file implementation is still in its
> infancy: we haven't built everything that we've designed, and we haven't
> designed everything that we plan to build. Our mutable-file roadmap
> specifies three levels:
>
> SDMF : small distributed mutable files : read/write all-at-once
> MDMF : medium : read/write individual segments
> LDMF : large : versioning, insert/delete span
>
> We've only implemented SDMF so far, and we've only implemented a subset
> of it (limited to a single segment), which limits the size of a mutable
> file to a few megabytes. This is enough for our target requirement of
> about 10000 children in a directory, but when you try to add children
> beyond that, you'll get size-exceeded errors.
>
> The next step for mutable files (now that the ServerMap refactoring is
> in place) is to finish multi-segment SDMF, which will remove that size
> limit, but which will still obligate users to read and write the whole
> file at once, which could be a drag if you have a million entries in a
> 300MB dirnode and you only care about reading the first one.
>
> MDMF will let you read just a small portion of it instead of the whole
> thing. But dirnodes will need to be revamped to make this useful, by
> adding some sort of index at the beginning of the dirnode, so you can
> figure out *which* portion to retrieve.
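
To make the index idea concrete, here is one way such a lookup could work. This is only a sketch under stated assumptions, not Tahoe's design: it assumes child names are kept sorted, that each segment holds a fixed number of entries, and that the index is simply the first name in each segment. The function names build_index and segment_for are invented for the example.

```python
import bisect

def build_index(sorted_names, entries_per_segment):
    """Return the first child name stored in each segment."""
    return [
        sorted_names[i]
        for i in range(0, len(sorted_names), entries_per_segment)
    ]

def segment_for(index, name):
    """Return the number of the segment that would hold `name`."""
    # bisect_right counts the segments whose first name is <= name;
    # the last of those is the segment to fetch. Names that sort
    # before every entry fall into segment 0.
    return max(bisect.bisect_right(index, name) - 1, 0)

# A million zero-padded names, so lexical order matches numeric order.
names = sorted("file-%07d" % i for i in range(1000000))
index = build_index(names, 1000)

print(len(index))
print(segment_for(index, "file-0123456"))
```

A reader would fetch the small index first, then only the one segment that segment_for points at, instead of the whole 300MB dirnode.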
>
>
> So, if you have a million files that you need to put somewhere in
> Tahoe 1.0, I can suggest two things:
>
> * use a multiple-level directory hierarchy, to reduce the size of any
>   individual directory below the size limits. A two-level structure
>   would be plenty (using 1001 directories total: a single root dir that
>   contains 1000 subdirectories, then 1000 files in each subdir).
>
> * serialize the mapping from child name to child URI into a single file
>   yourself, then upload this file. This would be immutable, and you'd
>   have to handle the serialization/deserialization and interpretation of
>   this file by yourself. But we can handle foolishly large immutable
>   files, so it ought to at least work.
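
The first suggestion (a root dir with 1000 subdirectories) needs some rule for deciding which subdirectory a file goes in. One possible rule, sketched here as an assumption rather than anything Tahoe provides, is to hash the file name; subdir_for and sharded_path are hypothetical helper names.

```python
import collections
import hashlib

NUM_SUBDIRS = 1000  # one root dir containing 1000 subdirectories

def subdir_for(name, num_subdirs=NUM_SUBDIRS):
    """Pick a subdirectory name ("000".."999") from a hash of the file name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_subdirs
    return "%03d" % bucket

def sharded_path(name):
    """Return the two-level path (subdir/name) for a file name."""
    return "%s/%s" % (subdir_for(name), name)

print(sharded_path("photo.jpg"))

# Check how evenly a batch of hypothetical names spreads out.
counts = collections.Counter(
    subdir_for("file-%07d" % i) for i in range(100000)
)
print(max(counts.values()))
```

Hashing spreads files only approximately evenly, so a million files will not land as exactly 1000 per subdirectory, but each subdirectory should stay far below the roughly 10000-child limit mentioned earlier.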
>
> Remember, a directory is really just a convenient tool that maps names
> to URIs. Any tool that helps you organize your URIs will work, and if
> your tool's state can be serialized into a sequence of bytes, then it
> can be serialized and stored in tahoe somewhere.
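
As a minimal sketch of the "serialize it yourself" approach, a name-to-URI mapping can be turned into bytes and back with JSON. The choice of JSON, the function names, and the placeholder URIs are all assumptions made for the example; the upload step itself is left out.

```python
import json

def serialize_mapping(mapping):
    """Turn a {child name: child URI} dict into bytes ready to upload."""
    return json.dumps(mapping, sort_keys=True).encode("utf-8")

def deserialize_mapping(data):
    """Rebuild the {child name: child URI} dict from downloaded bytes."""
    return json.loads(data.decode("utf-8"))

# Placeholder values, not real Tahoe URIs.
mapping = {
    "photos/cat.jpg": "URI:CHK:placeholder-1",
    "photos/dog.jpg": "URI:CHK:placeholder-2",
}

blob = serialize_mapping(mapping)
restored = deserialize_mapping(blob)
print(restored["photos/cat.jpg"])
```

The resulting bytes would be uploaded as a single immutable file, so any change to the mapping means uploading a new file and keeping track of its new URI.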
>
> hope that helps,
> -Brian