[tahoe-dev] Tahoe-LAFS Weekly Dev Chat, 2012-10-09 (was: Tahoe-LAFS Weekly Dev Chat, 2012-10-02)

Tue Oct 9 18:07:46 UTC 2012

I forgot to take notes during, so this is ex post facto. Other
participants, please feel free to reply and add your recollections!

In attendance: Zooko, CodesInChaos, Andrew, Brian, Ali who popped in
briefly because the hangout was posted as a public chat on G+,
David-Sarah

The topic was mostly dynamic Merkle Trees and LDMFs again. (For those
following along at home, Dynamic Merkle Trees are this idea of using
something like a Red-Black Tree instead of a static Binary Tree as the
structure for your secure-hash-function based "authenticated data
structure". Andrew Miller claims that this is a unifying abstraction
that describes both git and Bitcoin, as well as any future Tahoe-LAFS
Large Distributed Mutable File. See previous week's notes ¹.)

I said that I had been looking at using Tahoe-LAFS as the backend for
other systems, to give them decentralization, fault-tolerance,
integrity-checking, access control, and encryption. Those systems that
I've looked at a little bit include Ward Cunningham's Smallest
Federated Wiki, Dan Whaley's Hypothesis, and Jeff Garzik's
gleam-in-the-eye of a fault-tolerant form to make the Bitcoin forums
DoS-resistant. In each of these cases as well as in others, I have the
feeling that programmers want some sort of search or query language,
and when they find that LAFS doesn't natively offer that, they give up
on using it.

Brian replied that what you have to do is maintain your own index. He
called it "voluntary search", which is a turn of phrase that I like.

MK_FG's recent attempt to make a Skydrive plugin for the Cloud backend
is another example where voluntary search is needed.

Andrew asked for a specific use case for LDMF. How and why anyone
would use LDMFs even if we had them? After all, there's no way to use
the features of an LDMF, such as efficiently inserting bytes into the
middle of a file, through the standard POSIX file API.

I groped around for a while trying to answer that. There are a few
half-formed ideas about things that LDMFs *might* turn out to be
useful for, such as "filesystem in a file", which is basically what
the authors of virtual machines are doing to store their virtual
machine images in POSIX filesystems. Other half-formed ideas are to
use LDMFs as the backend storage for git.

But I finally got a hold of a use case that seems clear enough:
scalable directories. Directories ought to be able to hold arbitrarily
many entries, support efficiently adding or removing children, ought
to be able to maintain a sort order on the children, and support range
queries on the children. That *is* something that you can express
through the POSIX filesystem API, it seems clear that it could be
useful, and very importantly for the purposes of Andrew's research, it
is easy to evaluate how well your solution satisfies it.

Such directories might make a good building block for search. I
remember thinking when I unsuccessfully pitched LAFS as a possible
backend for Singly/LockerProject that if only directories had been
scalable, sorted, and range-queriable then LAFS might have sufficed
for their needs.

Along the way I pointed people at the performance.rst file:

https://tahoe-lafs.org/trac/tahoe-lafs/browser/git/docs/performance.rst

Only to see that it appears to be describing the performance of old
SDMFs, not of MDMFs! For example it says:

"""
Downloading B bytes of an A-byte mutable file

cpu: ~A

network: A

memory footprint: A

notes: As currently implemented, mutable files must be downloaded in
their entirety before any part of them can be read. We are exploring
fixes for this; see ticket #393 for more information.
"""

Which is very stale information. We've closed #393 and made it so that
you only need to download about B bytes to read B bytes of an A-byte
mutable file. I thought we had updated performance.rst to reflect
that. What gives? Do I misremember or did we somehow regress
performance.rst?

There are a few tickets about updating performance.rst to reflect the
existence of MDMF: #1497, #1772.

Andrew mentioned that since last week he implemented a new Red-Black
Merkle Tree in C++, and it is 400X faster than the first Python-based
prototype. I questioned why the language would make that much
difference and suggested he try PyPy.

Regards,

Zooko

https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1497# update
docs/performance.rst to explain the performance of MDMFs
https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1772# update docs to include MDMF

¹ https://tahoe-lafs.org/pipermail/tahoe-dev/2012-October/007750.html