[tahoe-lafs-trac-stream] [tahoe-lafs] #1471: Make Crawlers Compatible With Pluggable Backends (was: Make Crawler's Compatible With Pluggable Backends)

tahoe-lafs trac at tahoe-lafs.org
Mon Aug 8 11:26:02 PDT 2011


#1471: Make Crawlers Compatible With Pluggable Backends
------------------------------+------------------------
     Reporter:  Zancas        |      Owner:  Zancas
         Type:  enhancement   |     Status:  new
     Priority:  major         |  Milestone:  undecided
    Component:  code-storage  |    Version:  1.8.2
   Resolution:                |   Keywords:  backend S3
Launchpad Bug:                |
------------------------------+------------------------

Comment (by warner):

 (fixed title: http://www.angryflower.com/bobsqu.gif)

 I'd like to point out that the use of a {{{Crawler}}} at all is deeply
 intertwined with the way the shares are being stored. We decided early
 on that we'd prefer a storage scheme in which the share files are the
 primary source of truth, and that anything else is merely a volatile
 performance-enhancing cache that could be deleted at any time without
 long-term information loss. The idea was to keep the storage model
 simple for server-admins, letting them correctly assume that shares
 could be migrated by merely copying sharefiles from one box to another.
 (write-enablers violate this assumption, but we're working on that).

 Those Crawlers exist to manage things like lease-expiration and
 stats-gathering from a bunch of independent sharefiles, both handling
 the initial bootstrap case (i.e. you've just upgraded your storage
 server to a version that knows how to expire leases) and later recovery
 cases (i.e. you've migrated some shares into your server, or you
 manually deleted shares for some reason). They assume that share
 metadata can be retrieved quickly (i.e. from fast local disk).
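
 To make the pattern concrete, here's a minimal sketch of that kind of
 crawler: one full pass over sharefiles on local disk, removing shares
 whose leases have all expired. The names ({{{crawl_once}}},
 {{{read_leases}}}, the directory layout, the 31-day expiry) are
 illustrative assumptions, not Tahoe-LAFS's actual code or on-disk
 format.

```python
import os
import time

LEASE_MAX_AGE = 31 * 24 * 3600  # 31 days; illustrative policy, not Tahoe's

def lease_expired(lease_time, now):
    return now - lease_time > LEASE_MAX_AGE

def crawl_once(share_root, read_leases, remove_share):
    """One full pass over every sharefile under share_root.

    Assumes fast local disk: every share's metadata is read on every
    pass.  read_leases(path) returns a list of lease timestamps;
    remove_share(path) deletes a share whose leases all expired.
    """
    now = time.time()
    expired = []
    for prefix in sorted(os.listdir(share_root)):       # hypothetical prefix dirs
        prefix_dir = os.path.join(share_root, prefix)
        if not os.path.isdir(prefix_dir):
            continue
        for si in sorted(os.listdir(prefix_dir)):       # one dir per storage index
            share_dir = os.path.join(prefix_dir, si)
            for shnum in os.listdir(share_dir):
                path = os.path.join(share_dir, shnum)
                leases = read_leases(path)
                if leases and all(lease_expired(t, now) for t in leases):
                    remove_share(path)
                    expired.append(path)
    return expired
```

 Note that the whole pass has to re-read every share's metadata, which
 is why this is only reasonable when that metadata lives on fast local
 disk.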

 If a server is using a different backend, these rules and goals might
 not apply. For example, if shares are being stored in S3, are shares
 stored in a single S3 object each? How important is it that you be able
 to add or remove objects without going through the storage server? It
 may be a lot easier/faster to use a different approach:

  * all shares must be added/removed through the server: manual tinkering
    can knock things out of sync
  * canonical share metadata could live in a separate database, updated
    by the server upon each change (maybe AWS's SimpleDB?)
  * upgrade (introducing a new feature like lease-expiry) could be
    accomplished with an offline process: to upgrade the server, first
    stop the server, then run a program against the backing store and DB,
    then launch the new version of the server. That would reduce the
    size, complexity, and runtime cost of the actual server code.
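
 As a rough sketch of that last point: the offline upgrade step could
 be a standalone program that, with the server stopped, scans the
 backing store once and rebuilds the canonical metadata DB. This uses
 sqlite3 purely as a stand-in for whatever store is actually chosen
 (SimpleDB or otherwise); {{{list_shares}}}, its output shape, and the
 schema are all hypothetical.

```python
import sqlite3

def rebuild_metadata_db(db_path, list_shares):
    """Offline rebuild of the share-metadata DB from the backing store.

    list_shares() yields (storage_index, shnum, size, lease_time)
    tuples, one per share found in the backing store.  Returns the
    number of shares recorded.
    """
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS shares (
                    storage_index TEXT NOT NULL,
                    shnum INTEGER NOT NULL,
                    size INTEGER NOT NULL,
                    lease_time REAL NOT NULL,
                    PRIMARY KEY (storage_index, shnum))""")
    # The DB is rebuilt wholesale from the store, so it stays a
    # derived artifact: dropping it loses nothing permanent.
    db.execute("DELETE FROM shares")
    db.executemany("INSERT INTO shares VALUES (?, ?, ?, ?)", list_shares())
    db.commit()
    (total,) = db.execute("SELECT COUNT(*) FROM shares").fetchone()
    db.close()
    return total
```

 Because this runs while the server is down, the running server never
 needs crawler-style background passes at all; it just keeps the DB in
 sync on each add/remove.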

 Anyway, my point is that you shouldn't assume a Crawler is the best way
 to do things, or that you must therefore find a way to port the Crawler
 code to a new backend. It fit a specific use-case for local disk, but
 it's pretty slow and resource-intensive, and for new uses (i.e.
 Accounting) I'm seriously considering finding a different approach.
 Don't be constrained by that particular design choice for new backends.

-- 
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1471#comment:1>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage


More information about the tahoe-lafs-trac-stream mailing list