[tahoe-dev] source control in LAFS (was: split brain/partition tolerance? how handled in tahoe -- docs?)

Thu Aug 9 01:13:18 UTC 2012

[Note: I didn't read the former thread, but my interest filters
triggered on the revision control on top of LAFS topic.]

On Wed, Aug 8, 2012 at 4:19 PM, Zooko Wilcox-O'Hearn <zooko at zooko.com> wrote:
> There are several ways to put your revision control repository into
> LAFS. The obvious one is mount LAFS with FUSE or pyfilesystem and
> point your revision control tool at the mount point. That will
> probably work with every modern revision control tool. If anyone has
> tried that, I would like to hear how it performed.
>
> There are also hacks for specific revision control tools:
>
> • mercurial: https://tahoe-lafs.org/pipermail/tahoe-dev/2010-June/004559.html

I haven't touched this tool for years, but I believe it should still
work for the single feature I had implemented.

That feature is to "publish" a repository into LAFS in a "transparent"
format.  That is, every file and directory in every revision, and the
revision DAG, and the revision metadata are all plain text files
inside LAFS.  Using only a web browser (or tahoe commandline tools) it
is possible to traverse the repository.

What's lacking is the ability to *retrieve* from such a published
repository into a local hg repository.  Because of the "transparent"
format you can do a wget --recursive if you need a DRCS-less
"checkout" of a revision (or many revisions).

I'll discuss my vision for this below.

> • bzr: https://tahoe-lafs.org/trac/tahoe-lafs/wiki/TipsTricks#HostBazaarrepositories
> • perforce: https://tahoe-lafs.org/trac/tahoe-lafs/wiki/TipsTricks#Perforcebackendstorage
>
> Now once you have your revision control repository stored in LAFS,
> there are two basic structures of access control you could have:
>
> Access control pattern 1:
>
> Alice has read and write access to repository A
> Bob has read and write access to repository A
>
> Access control pattern 2:
>
> Alice has read-and-write access to repository A
> Alice has read-only access to repository B
> Bob has read-and-write access to repository B
> Bob has read-only access to repository A

> The latter is a lot like a common pattern that you see nowadays on
> github, bitbucket, and launchpad, where each person has their own
> personal repository and they are the only one with write access to it,
> but they can choose to accept pull request from other people in order
> to let other people's code into their repository.
>
> If you choose access control pattern 1, then LAFS has to decide what
> to do when Alice and Bob both write different contents to the repo at
> the same time. LAFS is not good at that. It will at best just blow
> away one person's changes in favor of the other person's changes. It
> will at worst eat your entire repository.
>
> If you choose access control pattern 2, then LAFS doesn't ever have to
> decide what to do about that, and instead your revision control tool
> has to decide what to do about it. Your revision control isn't good at
> it, either, but at least that's not my problem.
>

Let's make Access Control Pattern 2 our problem by brainstorming
implementation details.

The quick'n'dirty approach is to sync the native, local filesystem
repository state into Tahoe-LAFS.  If users share a write capability
to this kind of sync'd state, this is similar to a local multi-user
DRCS repository.  Most DRCS tools (if they even support this use case)
will get tripped up by the impedance mismatch between "classic" file
system APIs (such as POSIX) and Tahoe-LAFS.  For example, they may
rely on lock files and sticky permission bits or other such tomfoolery.

Let's consider a different approach:  A large family of "CHK-like"
DRCS tools, such as git and mercurial (hg), have this abstraction:

put( blob ) -> hashkey
get( hashkey ) -> blob

Tahoe-LAFS has the same abstraction!  This seems useful.
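
A minimal sketch of that shared abstraction, as a toy in-memory store.
The BlobStore class and its use of SHA-256 are assumptions made for
illustration only.  Real Tahoe-LAFS capabilities carry more than a hash,
and the DRCS tools use their own hash formats.

```python
import hashlib


class BlobStore:
    """Toy in-memory content-addressed store (hypothetical, not a real API)."""

    def __init__(self):
        self._blobs = {}

    def put(self, blob: bytes) -> str:
        # The key depends only on the content, so storing the same blob
        # twice yields the same key and keeps a single copy.
        hashkey = hashlib.sha256(blob).hexdigest()
        self._blobs[hashkey] = blob
        return hashkey

    def get(self, hashkey: str) -> bytes:
        return self._blobs[hashkey]


store = BlobStore()
key = store.put(b"revision contents")
```

Because put() is deterministic, two parties who store the same blob end
up holding the same key without coordinating.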

In *addition* the DRCS tools have implicit or explicit abstractions
for managing mutable local state and transmitting both mutable state
and immutable state.  For example, git references (branches and maybe
tags) are local mutable state, and git helps you transmit these to
other repositories.

Both git and hg (and the others I'm less familiar with) have a
mechanism to transmit the immutable revision state.  An important
feature to notice is that pushing or pulling immutable revisions is
*merging immutable / append only data structures*, which has little
merge complexity and no possibility of conflicts / ambiguity.

What happens if we store each immutable DRCS blob as a separate
Tahoe-LAFS immutable?

In the CHK-like family of DRCS, a revision points to parent revisions
and the immutable file / directory contents and other immutable
metadata.  Therefore, if you have a read-cap to a revision, you can
transitively read all parents.  However, you cannot find *children* of
a revision.  (Parents come first in time; children later as I use
these terms.)
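
To illustrate the one-way reachability, here is a sketch in which each
revision is a blob naming its parents by key.  The JSON layout and the
helper names (put_revision, parents_of, ancestors) are hypothetical and
match neither hg's nor git's actual formats.

```python
import hashlib
import json

blobs = {}  # key -> serialized revision (stand-in for immutable storage)


def put_revision(parents, content):
    # A revision names its parents by key, so its own key depends on them.
    data = json.dumps({"parents": sorted(parents), "content": content}).encode()
    key = hashlib.sha256(data).hexdigest()
    blobs[key] = data
    return key


def parents_of(key):
    return json.loads(blobs[key])["parents"]


def ancestors(key):
    """Return every revision reachable from `key` by following parent links."""
    seen = set()
    stack = [key]
    while stack:
        current = stack.pop()
        for parent in parents_of(current):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


root = put_revision([], "initial")
child = put_revision([root], "second")
grandchild = put_revision([child], "third")
```

Holding only the key of `grandchild` is enough to reach `child` and
`root`, but nothing stored in `root` points forward to its children.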

Therefore this simple storage format does not give us the standard
DRCS feature of querying the storage for all heads.  I use "head" in
the hg sense which is any revision node without children.

What if we store a set of heads in a Tahoe-LAFS immutable?

A capability to such a "heads snapshot" is sufficient to recover the
entire immutable state of a DRCS repository.  Because what is stored
in this proposal is all immutable data, there is no concept of
"pushing" or "pulling" (which imply mutating a repository).  Instead
there is merging two repositories to produce a third.

The merge is a straightforward DAG union: Take the two head snapshots,
concatenate the head references, then follow them all to remove any
former head which now has children.
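
The merge described above can be sketched as follows.  The function
name merge_heads and the parents_of callback are assumptions for
illustration; a real tool would read parent links from stored revisions.

```python
def merge_heads(heads_a, heads_b, parents_of):
    """Union two head sets and drop any former head that now has a child."""
    candidates = set(heads_a) | set(heads_b)
    # Collect every proper ancestor of any candidate head.
    ancestors = set()
    stack = list(candidates)
    while stack:
        node = stack.pop()
        for parent in parents_of(node):
            if parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)
    # A candidate that is an ancestor of another candidate has children,
    # so it is no longer a head.
    return candidates - ancestors


# Toy DAG: r1 <- r2 <- r3, and r2 <- r4 (a fork).
dag = {"r1": [], "r2": ["r1"], "r3": ["r2"], "r4": ["r2"]}
alice = {"r3"}
bob = {"r2"}    # Bob has not seen r3 yet
carol = {"r4"}  # Carol forked from r2

merged_ab = merge_heads(alice, bob, dag.__getitem__)
merged_ac = merge_heads(alice, carol, dag.__getitem__)
```

In this toy DAG, merging Alice with Bob leaves only r3 as a head, while
merging Alice with Carol keeps both r3 and r4, since neither descends
from the other.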

This proposal doesn't store the working directory or other local
mutable state, keeping the same separation between those and the
immutable blob store that the DRCS tools already have.

Now we tie in this scheme to a real live DRCS tool by making an
abstraction which looks like a DRCS's remote blob store, except the
implementation of put() and get() use Tahoe-LAFS immutables.

In addition to the normal local mutable state of the DRCS tool we add
a file containing the "heads snapshot" capability of the "remote
repository".  We hook the blob store abstraction so that every time we
push a new set of revisions, we perform the merge operation described
above.  Then, we upload the merged result and write its capability to
our local heads snapshot file.
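
A sketch of that push hook, under stated assumptions: the `grid` dict
and upload_immutable() stand in for Tahoe-LAFS immutable uploads, the
"fake-cap:" strings are not real capabilities, and after_push() is a
hypothetical hook name rather than anything hg or git provides.

```python
import hashlib
import json
import os
import tempfile

grid = {}  # stand-in for the grid: fake capability -> immutable bytes


def upload_immutable(data: bytes) -> str:
    # Hypothetical upload; a real client would return a Tahoe-LAFS read-cap.
    cap = "fake-cap:" + hashlib.sha256(data).hexdigest()
    grid[cap] = data
    return cap


def merge_heads(heads_a, heads_b, parents_of):
    # Union the head sets, then drop any head that is another's ancestor.
    candidates = set(heads_a) | set(heads_b)
    ancestors, stack = set(), list(candidates)
    while stack:
        for parent in parents_of[stack.pop()]:
            if parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)
    return candidates - ancestors


def load_heads(snapshot_path):
    # The local file holds only the capability; the head set lives in the grid.
    if not os.path.exists(snapshot_path):
        return set()
    with open(snapshot_path) as f:
        return set(json.loads(grid[f.read().strip()]))


def after_push(snapshot_path, pushed_heads, parents_of):
    # Merge the previous snapshot with the pushed heads, upload the result
    # as a new immutable, and record its capability locally.
    merged = merge_heads(load_heads(snapshot_path), pushed_heads, parents_of)
    cap = upload_immutable(json.dumps(sorted(merged)).encode())
    with open(snapshot_path, "w") as f:
        f.write(cap + "\n")
    return cap


dag = {"r1": [], "r2": ["r1"], "r3": ["r2"]}
snapshot_path = os.path.join(tempfile.mkdtemp(), "heads-snapshot")
first_cap = after_push(snapshot_path, {"r2"}, dag)
second_cap = after_push(snapshot_path, {"r3"}, dag)
```

Each push yields a fresh capability.  The earlier snapshot stays
readable and unchanged, which is what makes a shared snapshot capability
a stable thing to hand out.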

When we want to "pull" another repository, in addition to the DRCS's
other addressing schemes it now accepts a new repository address
format which contains a "heads snapshot" capability.  Such a
"repository" is completely immutable, so after one pull, there's no
reason to ever pull again.

What do we have so far?  We now have something that looks a lot like a
classic DRCS situation with the constraint that no code is shared
without first sharing, out-of-band, the heads snapshot capability.
This seems a little bit like a pull request.  I have a new branch I
want the others to use, so I email my new heads snapshot saying
"please pull this".

It is common and convenient to pull from another repository *without*
receiving a pull request.  We can now do this easily:

In addition to an immutable heads snapshot, each repository has a
mutable heads snapshot.  Importantly, *each* repository must have its
own (to avoid LAFS write collisions).  We can "push" to this by doing
a merge-then-update:  $a = merge($a, $b)
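
A sketch of the merge-then-update step with one mutable slot per
repository.  The MutableHeads class is a hypothetical stand-in for a
Tahoe-LAFS mutable file, and its owner check only imitates the effect of
holding a write cap.

```python
class MutableHeads:
    """Toy mutable head set; only its owner may write (stand-in for a write cap)."""

    def __init__(self, owner):
        self.owner = owner
        self._heads = frozenset()

    def read(self):
        return self._heads

    def write(self, writer, heads):
        if writer != self.owner:
            raise PermissionError(f"{writer} holds no write cap for {self.owner}")
        self._heads = frozenset(heads)


def merge(heads_a, heads_b, parents_of):
    # Union the head sets, then drop any head that is another's ancestor.
    candidates = set(heads_a) | set(heads_b)
    ancestors, stack = set(), list(candidates)
    while stack:
        for parent in parents_of[stack.pop()]:
            if parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)
    return candidates - ancestors


def pull_into(own, writer, other, parents_of):
    # $a = merge($a, $b): read the other slot, write only our own.
    own.write(writer, merge(own.read(), other.read(), parents_of))


dag = {"r1": [], "r2": ["r1"], "r3": ["r1"]}
alice = MutableHeads("alice")
bob = MutableHeads("bob")
alice.write("alice", {"r2"})
bob.write("bob", {"r3"})
pull_into(alice, "alice", bob, dag)
```

Since every slot has exactly one writer, two developers never write to
the same mutable file, so the collision case does not arise.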

Now we share the *read* cap with other developers.  Since our supposed
plugin/patch adds a newfangled repository addressing scheme,
developers add these readcaps like they would any other remote
repository (with the bonus that the DRCS tool's config file gives us
pet names).

We now pull from those on demand.

If all the developers use the same convergence secret, then even when
independent devs push identical revisions, only one copy will be
stored.  This also helps because when Alice pulls from Bob, she won't
re-upload all of those revisions.  This leads to the interesting case
where there are forks of a project, which keep their own revisions
secret from each other, but share the storage of common revisions.
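
The deduplication effect can be sketched with a keyed hash.  This is an
illustration only: deriving the storage key with HMAC-SHA256 is an
assumption made here, not Tahoe-LAFS's actual derivation, and the Grid
class is hypothetical.

```python
import hashlib
import hmac


def convergent_key(secret: bytes, blob: bytes) -> str:
    # Same secret and same content always give the same key.
    return hmac.new(secret, blob, hashlib.sha256).hexdigest()


class Grid:
    """Toy storage that keeps one copy per distinct key."""

    def __init__(self):
        self.stored = {}

    def upload(self, secret: bytes, blob: bytes):
        key = convergent_key(secret, blob)
        already_present = key in self.stored
        self.stored.setdefault(key, blob)
        return key, already_present


grid = Grid()
revision = b"identical revision bytes"

alice_key, alice_dup = grid.upload(b"team secret", revision)
bob_key, bob_dup = grid.upload(b"team secret", revision)
eve_key, eve_dup = grid.upload(b"other secret", revision)
```

Alice and Bob share a secret, so Bob's upload lands on the key Alice
already stored.  Eve uses a different secret, gets an unrelated key, and
shares nothing with them.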

This was my vision for hglafs.  So far, instead of implementing
put()/get(), I implemented a separate "publish" that walks the
repository and uploads any new revisions.  This was mainly because I
couldn't cleanly pick out the put()/get() abstraction from the
mercurial source.

A tangential feature/goal of hglafs is to sacrifice efficiency for
transparency, storing every file and directory in every revision as a
separate LAFS node.  This allows fine-grained sharing / linking into
repositories without having to know anything about hg or hglafs, with
standard Tahoe-LAFS clients.

One final note on git:

I don't understand git mutable references very well, so I'm not sure
if there's more complexity this proposal misses.

I'm more familiar with hg, in which there is no mutable (non
append-only) state in the repository which is shared across
repositories.  Push and pull are merges as described in the proposal.
(They literally just append new revisions to an append-only file, so
long as the revisions aren't already stored, I believe.)  Therefore
the "heads" are in the append only DAG, and all metadata such as
branches and tags are also stored there, so if you can traverse from
heads to all parents, you have the complete state.

With git branches, there may be merge conflicts when pushing or
pulling, right?  Alice could say "the tip of branch Foo is now $A" and
Bob could assert it is $B, and those two revisions may not be related
by a one-way DAG path.  They could be forks from a common ancestor,
for example.  I think the way git handles this is by storing "Alice's
Foo" and "Bob's Foo" and then making it easy for Alice to set "Alice's
Foo" to whatever she wants and, by convention, only setting Bob's Foo
when receiving from Bob's repository.

Whatever the case, I *think* it's possible to store the refs just like
the "heads" in the proposal above.  Can anyone confirm or refute this?
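
One way to picture this, as a sketch and not a confirmation: if each
developer's refs are stored under that developer's name (roughly like
git's remote-tracking branches), then combining two ref sets never has
to choose between $A and $B.  The merge_refs function and the
"name/branch" key format are hypothetical.

```python
def merge_refs(mine, theirs, their_name):
    """Record another repository's refs under its own namespace."""
    merged = dict(mine)
    for branch, revision in theirs.items():
        # Their "Foo" becomes "<their_name>/Foo"; our own "Foo" is untouched.
        merged[f"{their_name}/{branch}"] = revision
    return merged


alice_refs = {"Foo": "A"}
bob_refs = {"Foo": "B"}

alice_view = merge_refs(alice_refs, bob_refs, "bob")
```

After the merge Alice sees both "Foo" (hers, still $A) and "bob/Foo"
($B), and can decide separately whether to move her own ref.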

>
> Regards,
>
> Zooko

nejucomo
Real names don't exist.  ;-p