[tahoe-dev] Caching and versioning for hypertexts and networked data

Johannes Nix Johannes.Nix at gmx.net
Tue Feb 14 23:18:14 UTC 2012


Hello,

I'd like to share a few thoughts and ask what implementations
might exist along these lines. Perhaps this gives a good extension
to Tahoe.

First, I have used git for some time, and I have tried Coda,
a distributed file system which supports a cached disconnected mode.

I came to the conclusion that versioning and caching are
closely related and both essential. Automatic caching could
be very useful for Tahoe, but it can't be solved without
versioning. Git is a kind of distributed file system with
superb versioning, and it can be used for caching. Its main
disadvantage is that one has to pull the whole repository
at once. Coda has extremely good caching, but I found its
versioning so unusable and broken that I gave up on it.
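
As a rough illustration of why a cache needs version information
to sync back safely, here is a minimal sketch in Python. The Grid
and Cache classes are hypothetical toy stand-ins, not Tahoe APIs:
the cache remembers which version each file had when it was
fetched, and only pushes an edit if the remote version is still
the same.

```python
from dataclasses import dataclass, field


@dataclass
class Grid:
    """Toy stand-in for a remote store: path -> (version, content)."""
    files: dict = field(default_factory=dict)

    def read(self, path):
        return self.files[path]

    def write(self, path, content):
        # Every write bumps the per-file version counter.
        version = self.files.get(path, (0, None))[0] + 1
        self.files[path] = (version, content)
        return version


@dataclass
class Cache:
    """Local cache of a subset of the grid, remembering base versions."""
    # path -> [base_version, content, dirty]
    entries: dict = field(default_factory=dict)

    def checkout(self, grid, path):
        version, content = grid.read(path)
        self.entries[path] = [version, content, False]

    def edit(self, path, content):
        entry = self.entries[path]
        entry[1] = content
        entry[2] = True

    def sync(self, grid):
        """Push local edits; return paths that changed on both sides."""
        conflicts = []
        for path, entry in self.entries.items():
            base, content, dirty = entry
            remote_version, remote_content = grid.read(path)
            if not dirty:
                # No local edit: just refresh the cached copy.
                entry[0], entry[1] = remote_version, remote_content
            elif remote_version == base:
                # Remote unchanged since checkout: safe to push.
                entry[0] = grid.write(path, content)
                entry[2] = False
            else:
                # Both sides changed: needs a merge or a user decision.
                conflicts.append(path)
        return conflicts
```

Without the stored base version, the cache could not tell a safe
push from an overwrite of somebody else's change.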


Importantly, the point is not to re-invent a DVCS for
source code; solutions like git are highly optimized and
it would be hard to improve on them.

But I have some use cases in mind which do not fit git
but could match Tahoe's qualities extremely well:

1) Using a cached copy of some voluminous data on the grid. Say
I am going to travel a few hours by train and I want
to sort and edit some photos which I have on the grid.
For reasons of space and time, I cannot pull my whole
30-GB photo collection, and I need some way to synchronize
my changes afterwards. Git isn't made for such usage.


2) Collaborative editing of mixed private/shared data.
Say I am editing a collection of research articles
and want some people to review each chapter.
But because this is valuable unpublished material, I want to
make each chapter accessible only to certain people
before things go to print. Git is not made for that:
access is all-or-nothing.



3) I think Tahoe can be great for collaborative
editing of distributed hypertexts. Git and other
DVCSs are made for a hierarchically organized
monolithic source tree where strict control is a must. 
Hypertexts without clear limits cannot be
put into such a repository.

But hypertexts like Wikipedia or domain-specific wikis 
could benefit enormously from a distributed peer-to-peer backend. 
In fact, decentralization could solve a lot of problems.

Tim Weber of the German CCC started a project along
these lines, called "Levitation"; see
http://scytale.name/blog/2009/11/announcing-levitation.

It would be great to be able to pull an article collection
from Wikipedia to a smartphone, read and edit one page or
another offline, and sync the changes when the device is
connected again. This has much in common with
use case (1).


I think it would not be too difficult to write some Tahoe
commands which do check-out, pull, commit, and push
with semantics similar to Mercurial. The backup command
already implements working versioning, and I am sure
some people would be pleased by an easy version retrieval tool.
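
To make the idea of a version retrieval tool a bit more concrete,
here is a small Python sketch that picks the snapshot which was
current at a given time. It assumes the snapshots are
directories named by UTC timestamp, such as 2012-02-14_23:18:14Z;
that naming, and the function names, are assumptions for
illustration, so check what your Tahoe version actually writes.

```python
from datetime import datetime, timezone

# Assumed snapshot naming, e.g. "2012-02-14_23:18:14Z" (hypothetical).
SNAPSHOT_FORMAT = "%Y-%m-%d_%H:%M:%SZ"


def parse_snapshot(name):
    """Turn a snapshot directory name into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(name, SNAPSHOT_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def snapshot_at(names, when):
    """Return the newest snapshot name not later than `when`, or None.

    `when` must be a timezone-aware datetime.
    """
    candidates = [n for n in names if parse_snapshot(n) <= when]
    if not candidates:
        return None
    return max(candidates, key=parse_snapshot)
```

A retrieval command could list the snapshot directory, call
snapshot_at() with the requested date, and then fetch the file
from the snapshot it returns.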

The more difficult question is how to scale and
adapt this to large projects and mesh networks which
have no fixed root directory.

There are surely a lot of projects I do not even
know about.

Any hints?

Johannes

