[tahoe-dev] Caching and versioning for hypertexts and networked data
Johannes Nix
Johannes.Nix at gmx.net
Tue Feb 14 23:18:14 UTC 2012
Hello,
I'd like to share a few thoughts and ask what implementations
might exist along these lines. Perhaps this gives a good extension
to Tahoe.
First, I have used git for some time and I have tried CODA, a distributed
file system which supports a cached disconnected mode.
I came to the conclusion that versioning and caching are
closely related and both essential. Automatic caching could
be very useful for Tahoe but it can't be solved without
versioning. Git is kind of a distributed file system with
superb versioning and can be used for caching. Its main
disadvantage is that one has to pull the whole repository
at once. CODA has extremely good caching but the versioning
is so unusable and broken that I gave up on it.
Importantly, the point is not to re-invent a DVCS for
source code; solutions like git are highly optimized and
it would be hard to improve them.
But I have some use cases in mind which do not fit git
but could match Tahoe's qualities extremely well:
1) Using a cached copy of some voluminous data on the grid. Say
I am going to travel a few hours by train and I want
to sort and edit some photos which I have on the grid.
For reasons of space and time, I cannot pull my whole
30-GB photo collection, and I need some way to synchronize
my changes afterwards. Git isn't made for such usage.
2) Collaborative editing of mixed private/shared data.
Say I am editing a collection of research articles
and want some people to review each chapter.
But because this is valuable unpublished material, I want to
make each chapter only accessible to certain persons
before things go to print. Git is not made for that
- access is all-or-nothing.
3) I think Tahoe can be great for collaborative
editing of distributed hypertexts. Git and other
DVCS are made for a hierarchically organized
monolithic source tree where strict control is a must.
Hypertexts without clear limits cannot be
put into such a repository.
But hypertexts like Wikipedia or domain-specific wikis
could benefit enormously from a distributed peer-to-peer backend.
In fact, decentralization could solve a lot of problems.
Tim Weber of the German CCC started
a project along these lines, called "Levitation",
see http://scytale.name/blog/2009/11/announcing-levitation.
It would be great to be able to pull an article collection
from Wikipedia to a smartphone, read and edit some pages
offline, and sync changes when the device is
connected again. You see the common points with
use case (1).
I think it would not be too difficult to write some Tahoe
commands which do checkout, pull, commit, and push
with semantics similar to Mercurial. The backup command
already implements working versioning and I am sure
some people would be pleased by an easy version retrieval tool.
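To make the idea a bit more concrete, here is a minimal sketch in Python
of a partial check-out, offline edit, and push cycle with per-file version
numbers. It is only an illustration under strong assumptions: Grid, Cache,
Entry and all method names are hypothetical, the "grid" is an in-memory
dict, and nothing here uses Tahoe's actual API or the backup command.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One file plus the version number it had when it was stored."""
    data: bytes
    version: int


class Grid:
    """In-memory stand-in for remote storage (hypothetical, not Tahoe's API)."""

    def __init__(self):
        self.files = {}  # path -> Entry

    def put(self, path, data):
        # Every write bumps the per-file version number.
        old = self.files.get(path)
        version = old.version + 1 if old else 1
        self.files[path] = Entry(data, version)
        return version


@dataclass
class Cache:
    """Local partial copy that remembers the base version of each file."""
    grid: Grid
    files: dict = field(default_factory=dict)  # path -> Entry at base version
    dirty: set = field(default_factory=set)    # paths edited while offline

    def checkout(self, paths):
        # Fetch only the requested paths, not the whole collection.
        for path in paths:
            remote = self.grid.files[path]
            self.files[path] = Entry(remote.data, remote.version)

    def edit(self, path, data):
        # Offline edit: change the cached copy and mark it as modified.
        self.files[path].data = data
        self.dirty.add(path)

    def push(self):
        # Upload edits whose base version still matches the grid;
        # report the others as conflicts instead of overwriting them.
        conflicts = []
        for path in sorted(self.dirty):
            local = self.files[path]
            remote = self.grid.files.get(path)
            if remote is not None and remote.version != local.version:
                conflicts.append(path)
                continue
            local.version = self.grid.put(path, local.data)
        self.dirty = set(conflicts)
        return conflicts
```

The comparison of the base version against the grid's version in push() is
the part where caching depends on versioning: without it, a cached copy
edited offline would silently overwrite changes made elsewhere.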
The more difficult question is how to scale and
adapt this to large projects and mesh networks which
have no fixed root directory.
There are surely a lot of projects I do not even
know about.
Any hints?
Johannes