[tahoe-dev] FUSE and Tahoe

rob kinninmont robk-tahoe at allmydata.com
Thu Oct 2 15:04:29 PDT 2008


I know there are numerous people who've expressed interest in FUSE
support for tahoe, and it seems many of those people were unaware of
the third implementation, which supports write operations and which
lurked in the 'mac' subdirectory of the source tree.

Since I'm working on improving fuse support for tahoe, I'm writing  
this to raise awareness thereof, to summarise what we've got, insofar  
as I understand it, and to rally support from anyone interested in  
helping out.


If I've made any mistakes in the following, please let me know about  
them.  I'm also happy to field any questions you might have.


Bindings
--------

First off, a note about bindings, since there are two distinct fuse  
bindings for python that I know of, and we have implementations built  
on each of them.

'python-fuse' provides a python package named 'fuse', and is
distributed as part of the fuse package at sourceforge [1].  In usage
one provides a series of callback functions (or objects implementing
same as methods) and passes control into the 'fuse_main()' function
implemented as part of libfuse.  fuse_main then dispatches callbacks
into one's python code as operations are made on the filesystem,
potentially from numerous threads.

This works ok up to a point with numerous threads each blocking in
their handling of each request, but I would really like to use a
twisted reactor based framework to handle the more sophisticated
logic of multiple interacting concurrent operations, e.g.
downloading a file from tahoe, caching it to local disk, and handling
individual fuse read operations from the partially retrieved data.
The ability to handle fuse requests in a twisted-based environment
would also open up the possibility of using foolscap rather than the
webapi for RPC to the tahoe node.

In my experiments, when I tried to integrate a twisted reactor into
the fuse_main based tahoefuse, it displayed a dismaying tendency to
explode in one's face.  My best guess is that something about how
fuse_main handles its threads and dispatches into python code does
not play well with having python start another thread to run a
reactor.


An alternative set of python bindings for fuse, 'pyfuse' [2], was
submitted by Armin Rigo ('arigo').  It dispenses with fuse_main,
relying on libfuse only to initially mount the filesystem and
thereafter using pure python to implement the user-space side of the
fuse protocol.  It seems that this would likely be much more amenable
to integration with a twisted-based process.  Indeed, chatting with
some folks more knowledgeable than I turned up an example [3] using
pyfuse and twisted to implement an sshfs.

However, I haven't been able to get these bindings to work on the Mac
as yet.  They work fine on linux (see 'impl_b'), but alas something
seems to be awry in their interaction with macfuse.


I'm currently toying with a somewhat obnoxious construction of  
introducing yet another layer of indirection, another process  
boundary, whereby a fuse_main based threaded 'shim' process brokers  
requests to another twisted based process which implements the caching  
and logic layer, impedance matching fine grained fuse operations to  
coarse grained tahoe webapi requests.  This approach is odious enough  
that it is without doubt intended only as a stopgap workaround until  
some cross-platform twisted compatible fuse bindings can be nailed down.



Implementations
---------------

There are currently three distinct implementations of fuse layers for  
tahoe, and each can be found in a subdirectory of contrib/fuse within  
the tahoe source tree.

1. 'impl_a' aka 'tahoe_fuse', contributed by Nathan 'nejucomo'

This implementation provides read-only access to tahoe.  It is built  
on python-fuse (fuse_main) bindings, and provides no caching, i.e.  
each read() request results in a synchronous call to the tahoe webapi  
to fetch the file contents.  It supports Linux.


2. 'impl_b' aka 'tfuse', contributed by Armin 'arigo'

This implementation provides read-only access to tahoe.  It is built
on pyfuse bindings, and provides 'load-once' directory caching but no
file caching.  I.e. when a directory is first read, it is loaded into
memory and remembered indefinitely, whereas files are loaded into
memory from tahoe each time they are open()ed.  It supports Linux.


3. 'impl_c' aka 'blackmatch' née 'tahoefuse', contributed by myself,  
Rob 'robk'

In a recent cleanup, I moved this lesser-known 'tahoefuse'  
implementation out of the 'mac/' subtree of the tahoe source into the  
more easily discovered contrib/fuse, and renamed it to try and avoid  
confusion amongst like-named implementations.

This implementation provides read-write access to tahoe.  It is built  
on python-fuse bindings and provides caching of both directories and  
files.  It supports both Linux and Mac.

This implementation originally pre-fetched and cached in memory the  
entire directory structure of the mounted tahoe filesystem.  This was  
changed so that directories are loaded on demand and cached in memory  
for a (configurable) limited time, 20s by default.  This was motivated  
on the one hand by a desire to reflect the current state of the  
underlying tahoe directories, and on the other hand performance  
requirements in the face of the Finder's tendency to make large  
numbers of requests in quick succession in normal browsing.
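
To illustrate the load-on-demand, time-limited directory caching
described above, here is a minimal sketch.  It is not the blackmatch
code: the class name, the 'fetch' callable (standing in for a webapi
directory request) and the injectable 'clock' are all assumptions made
for the example; only the 20s default comes from the text.

```python
import time


class DirectoryCache:
    """Load directories on demand and remember them for a limited time."""

    def __init__(self, fetch, ttl=20.0, clock=time.monotonic):
        self.fetch = fetch    # hypothetical callable: path -> listing
        self.ttl = ttl        # seconds; 20s mirrors the default above
        self.clock = clock    # injectable so that expiry can be tested
        self._entries = {}    # path -> (expiry time, listing)

    def listdir(self, path):
        now = self.clock()
        entry = self._entries.get(path)
        if entry is not None and now < entry[0]:
            # Still fresh: answer from memory, no request to tahoe.
            return entry[1]
        # Missing or expired: reload and restart the timer.
        listing = self.fetch(path)
        self._entries[path] = (now + self.ttl, listing)
        return listing
```

With something like this, a burst of Finder requests for the same
directory costs one fetch, while a change made elsewhere becomes
visible once the entry expires.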

Files, in contrast, are downloaded from tahoe and cached on disk upon  
first read, and all subsequent reads from that file are satisfied from  
disk.  Note that the download cache is keyed by uri (hence hash of  
file contents) not by filesystem path.  In the current code, this on  
disk cache grows without bound in use.  File writes are handled by  
opening a shadow tmp file on disk within the cache dir, and applying  
each write request thereto - upon close (strictly, 'release') the tmp  
file is uploaded to tahoe and the results recorded in the directory.
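
The read and write paths just described can be sketched roughly as
follows.  This is an illustration under assumptions, not the
blackmatch source: 'download' and 'upload' are hypothetical callables
standing in for the webapi calls, and naming cache files by a sha256
of the uri is my choice for the example.  Recording the new uri in the
parent directory is left to the caller.

```python
import hashlib
import os
import tempfile


class FileCache:
    def __init__(self, cache_dir, download, upload):
        self.cache_dir = cache_dir
        self.download = download    # hypothetical: uri -> bytes
        self.upload = upload        # hypothetical: bytes -> new uri
        os.makedirs(cache_dir, exist_ok=True)

    def _path_for(self, uri):
        # Keyed by uri, not by filesystem path.  Hashing the uri is
        # only a way to get a safe file name (an assumption).
        name = hashlib.sha256(uri.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, name)

    def open_for_read(self, uri):
        path = self._path_for(uri)
        if not os.path.exists(path):
            # First read: blocks while the whole file is fetched.
            with open(path, "wb") as f:
                f.write(self.download(uri))
        return open(path, "rb")     # later reads come from disk

    def open_for_write(self):
        fd, path = tempfile.mkstemp(dir=self.cache_dir, prefix="shadow-")
        return ShadowFile(os.fdopen(fd, "r+b"), path, self.upload)


class ShadowFile:
    """Temporary on-disk file that collects writes until release."""

    def __init__(self, fileobj, path, upload):
        self.fileobj = fileobj
        self.path = path
        self.upload = upload

    def write(self, offset, data):
        self.fileobj.seek(offset)
        self.fileobj.write(data)

    def release(self):
        # The time-consuming step: upload everything in one go.
        self.fileobj.seek(0)
        data = self.fileobj.read()
        self.fileobj.close()
        os.remove(self.path)
        return self.upload(data)
```

Note that nothing in this sketch ever removes a downloaded file, which
is exactly the unbounded growth mentioned above.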

Note that in this scenario the open() of a file (for read) and the  
close() of a file (for write) are where the time-consuming operations  
to down/upload the file via the tahoe webapi take place, and hence  
these calls will block synchronously for an extended time.  Subsequent  
read / preceding write operations are performed on locally cached files.



Future Directions
-----------------

I am actively working on 'blackmatch' (impl_c) to implement a variety  
of performance and usability enhancements, from a general user's  
perspective, e.g. cache management, concurrency.  Please drop me a  
line if you're interested in getting involved.

1. Bindings to support twisted on the mac.  I'm afraid I'm currently a  
bit stumped by the pyfuse/mac, and python-fuse/threads/twisted issues,  
so I'm working on the aforementioned rather ugly workaround and  
focussing on the other enhancements below.  If anyone has any good  
ideas to offer on these issues, I'm all ears.

2. Cache management.  It's obviously not acceptable to have an
append-only cache which grows without bound, so there needs to be
some logic to manage the cache, maintain its size within limits and
expire old data etc.

3. Concurrency, or non-blocking open.  Currently, if a file is not  
already cached, blackmatch blocks the open() operation and downloads  
the entire file from tahoe to disk cache.  Any operation which blocks  
for 'too long' (i.e. in excess of the 'daemon_timeout' option to  
fuse_main) will result in the filesystem being spontaneously unmounted  
and closed, at least on the mac.  Hence a goal is to have the download
of the file, from tahoe to cache, proceed asynchronously and have
individual read() operations block until the data being read has
arrived.  This will provide much better behaviour for large files,
and could even support e.g. playing video or audio streams from
files that are still actively being downloaded.

4. Streaming upload.  A similar scenario applies to uploads, though  
the implications for tahoe (in re convergence of congruent files into  
single storage instance) are more subtle, and the complexities in the  
face of e.g. overlapping writes present more challenges for the logic  
managing data integrity.  But potentially a tahoe fuse implementation  
could commence upload when a file is opened for write, and stream the  
data, assuming linear non-overlapping writes, write the data directly  
into tahoe, thereby eliminating the upload delay upon close().  Other  
possible strategies include a prompt close() with a background upload,
though this has important user interface ramifications.
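
As a rough illustration of the non-blocking open in item 3, the sketch
below starts the download in a background thread and makes each read()
wait only until the bytes it needs have arrived.  It is a thread-based
toy, not a twisted design and not existing blackmatch code:
'fetch_chunks' is a hypothetical callable yielding the file's bytes in
order, and the data is buffered in memory rather than in the disk
cache purely to keep the example short.

```python
import threading


class ProgressiveFile:
    """Buffer filled by a background download; reads wait for data."""

    def __init__(self, fetch_chunks):
        self._buf = bytearray()
        self._done = False
        self._cond = threading.Condition()
        worker = threading.Thread(
            target=self._download, args=(fetch_chunks,), daemon=True)
        worker.start()    # the constructor (the 'open') returns at once

    def _download(self, fetch_chunks):
        try:
            for chunk in fetch_chunks():
                with self._cond:
                    self._buf.extend(chunk)
                    self._cond.notify_all()    # wake waiting readers
        finally:
            # Mark completion even on failure so readers never hang.
            # A real implementation would also report the error.
            with self._cond:
                self._done = True
                self._cond.notify_all()

    def read(self, offset, size):
        end = offset + size
        with self._cond:
            # Block until the requested range is present, or until the
            # download has finished (giving a short read at EOF).
            self._cond.wait_for(
                lambda: self._done or len(self._buf) >= end)
            return bytes(self._buf[offset:end])
```

A read near the start of a large file returns as soon as the first
chunks are in, which is what would make playing a partially
downloaded stream possible.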


At any rate, if you have any questions about fuse support for tahoe,  
feel free to drop me a line.

cheers,
rob



Notes:

[1] http://fuse.sourceforge.net/wiki/index.php/FusePython  (at time of  
writing, this web page is unavailable.)
[2] http://codespeak.net/svn/user/arigo/hack/pyfuse/
[3] http://twistedmatrix.com/trac/browser/sandbox/exarkun/sshfs.py



