= Tahoe-LAFS Summer-of-Code Projects =

This page contains specific suggestions for projects we would like to see in the Summer of Code. Note that they vary a lot in required skills and difficulty. We hope to get a broad spectrum of applications.

If you are interested in working on any of these projects, please contact the Mentors listed at the bottom of the page. In addition, you may wish to discuss your proposal on IRC: join us on #tahoe-lafs on irc.freenode.net.

We encourage you to come up with your own suggestions if you cannot find a suitable project here. You can find more project ideas by [wiki:ViewTickets exploring the issue tracker]. Especially see [http://allmydata.org/trac/tahoe-lafs/query?status=!closed&order=priority&keywords=~gsoc tickets labelled 'gsoc'] (developers: please add this label to any tickets that might make a good GSoC project). You may also want to read [http://allmydata.org/pipermail/tahoe-dev/2010-March/004117.html this mailing list thread] about GSoC ideas.

Deadlines and directions for students' applications to the Google Summer-of-Code can be found on [http://code.google.com/soc/ the Google pages].
||''Project''||''Difficulty''||''Contact''||
||[#Medium-SizedDistributedMutableFilesMDMF Medium-Sized Distributed Mutable Files]||Medium||[mailto:warner-tahoe@lothar.com Brian Warner] or any mentor||
||[#RedundantArrayofIndependentClouds Redundant Array of Independent Clouds]||Medium||[mailto:zooko@zooko.com Zooko Wilcox-O'Hearn] or any mentor||
||[#ShareMigration Share Migration]||Medium||[mailto:warner-tahoe@lothar.com Brian Warner] or any mentor||
||[#SecureDecentralizedWiki Secure Decentralized Wiki]||Medium||[mailto:zooko@zooko.com Zooko Wilcox-O'Hearn] or any mentor||
||[#CloudApps Cloud Apps]||Easy–Hard||[mailto:lloyd@randombit.net Jack Lloyd] or any mentor||
||[#WebDAVSupport WebDAV Support]||Medium-Hard||[mailto:david-sarah@jacaranda.org David-Sarah Hopwood] or any mentor||
||[#DistributedIntroduction Distributed Introduction]||Easy||[mailto:gsoc2010@ndurner.de Nils Durner] or any mentor||
||[#DVCSIntegration DVCS Integration]||Medium||[mailto:lloyd@randombit.net Jack Lloyd] or any mentor||

----

== Medium-Sized Distributed Mutable Files (MDMF) ==

Mutable files in Tahoe-LAFS have some significant limitations and performance issues, as discussed in [http://allmydata.org/trac/tahoe-lafs/browser/docs/performance.txt docs/performance.txt]. Users who aren't aware of these limitations are surprised when they find out that mutable files can't scale to large sizes without using unacceptable levels of memory, and that reading one byte of the file costs as much as reading the entire file.

A fix for this issue would essentially be fixing #393. That is:
 * Developing mutable files that are segmented on upload, as with immutable files. Part of this would involve making sure that the way we currently ensure the integrity of the parts of mutable files stored on servers is adequate for your new design, and altering it if it isn't.
 * Implementing efficient reading and writing of arbitrary spans of those mutable files.
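
As a rough illustration of the second point: once a file is split into fixed-size segments, a read of an arbitrary span only needs the segments that overlap it. The sketch below is not Tahoe-LAFS code; the function name {{{segments_for_span}}} and the 128 KiB segment size are assumptions chosen for the example.

{{{
#!python
def segments_for_span(offset, length, segment_size, file_size):
    """Return (segment_index, start, end) triples covering a byte span.

    start and end are byte offsets inside the segment; end is exclusive.
    All names and the segment layout are hypothetical, for illustration only.
    """
    if offset < 0 or length < 0 or segment_size <= 0:
        raise ValueError("offset and length must be >= 0, segment_size > 0")
    # Clamp the requested span to the end of the file.
    end = min(offset + length, file_size)
    pieces = []
    pos = offset
    while pos < end:
        index = pos // segment_size
        seg_start = index * segment_size
        # The last segment may be shorter than segment_size.
        seg_end = min(seg_start + segment_size, file_size)
        stop = min(end, seg_end)
        pieces.append((index, pos - seg_start, stop - seg_start))
        pos = stop
    return pieces


# Reading one byte at offset 300000 of a 1000000-byte file
# (assumed 128 KiB segments) touches only segment 2.
print(segments_for_span(300000, 1, 131072, 1000000))  # -> [(2, 37856, 37857)]
}}}

With a layout like this, reading one byte touches a single segment rather than the whole file, which is the property the current mutable-file format lacks.
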
This would make Tahoe-LAFS less surprising to users, and allow mutable files to be used in more ways than they currently are. If successful enough, this might allow Tahoe-LAFS to support range queries or "graph database"-style access, in the style of the "NoSQL" projects.

To learn more about this issue, you should first read [http://allmydata.org/trac/tahoe-lafs/browser/docs/performance.txt docs/performance.txt], so you're familiar with the performance problems with mutable files as currently implemented. You should also look at the [http://allmydata.org/trac/tahoe-lafs/browser/docs/specifications/file-encoding.txt file encoding specification], to understand how immutable files are segmented (since you'll be doing something similar with this project). [http://allmydata.org/trac/tahoe-lafs/browser/docs/specifications/mutable.txt The mutable file specification] may be informative as well. The mutable file upload and download code is in [http://allmydata.org/trac/tahoe-lafs/browser/src/allmydata/mutable mutable], and, for comparison, the immutable file upload and download code is in [http://allmydata.org/trac/tahoe-lafs/browser/src/allmydata/immutable immutable].

== Redundant Array of Independent Clouds ==

Add backends to the storage servers so that they store their shares on a cloud storage system instead of on their local filesystem. This means that you can get all of the availability and scalability of services such as Amazon S3 or Rackspace !CloudFiles combined with the security properties of Tahoe-LAFS. See [http://allmydata.org/~zooko/RAIC.png the RAIC diagram]. For details, read ticket #999, which includes pointers to the relevant source code and instructions on how to begin writing the code.

== Share Migration ==

When uploading a file to a grid, Tahoe-LAFS will make sure that the file is healthy (a good discussion of what healthy means is found in #778) before reporting that the file is uploaded successfully.
Tools to effectively maintain file health (or to adapt to new definitions of health) aren't quite complete, however; our users have had several use cases that aren't easily addressed with what we have. Students taking this project would be building tools to address those use cases.

A good starting point would be to become familiar with how files are placed on a grid. [source:docs/architecture.txt architecture.txt], [source:docs/specifications/file-encoding.txt file-encoding.txt], [source:docs/specifications/mutable.txt mutable.txt], [source:src/allmydata/immutable/upload.py the immutable file upload code], and [source:src/allmydata/mutable/publish.py the mutable file upload code] are good places to do that. Also, you might want to look at the [source:src/allmydata/storage/server.py storage server code] to understand how servers store shares. Some good tickets to start looking at are #699, #543, and #232; you'll find that those link to other tickets.

There are many ways to help address these issues. Some ideas:
 * Alter the CLI and the WUI to give users the ability to rebalance files that they've uploaded already. (#699)
 * Build tools that allow node administrators to move shares around a grid. (#543, #864)
 * Alter Tahoe-LAFS to rebalance mutable files when uploading a new version of them. (#232)

Any one of these projects is probably too small to fill a summer, but combining a few of them would be a big usability improvement for Tahoe-LAFS.

Depending on how you approach it, this project is tightly coupled to the ideas of file health and accounting, so prospective students would do well to explore those open issues, too. A good jumping-off point for accounting is #666. A good jumping-off point for health is #778.

== Secure Decentralized Wiki ==

Write a wiki in Google's [http://code.google.com/p/google-caja/ "caja"] dialect of !JavaScript. This wiki will load and store data directly on a Tahoe-LAFS storage grid so that it is a full "Cloud App": there is no server.
All computation is done in the user's web browser in caja and all of the storage is done by the decentralized Tahoe-LAFS storage grid. This wiki should leverage Tahoe-LAFS's secure sharing features to offer fine-grained, dynamic, and easy transclusion or client-side mashups.

This project is intended to be the successor to [http://allmydata.org/trac/tiddly_on_tahoe the TiddlyWiki-on-Tahoe-LAFS project], which is a wiki written in !JavaScript and hosted on Tahoe-LAFS, but one that has been "bolted on" to Tahoe-LAFS instead of designed for Tahoe-LAFS, and is currently incapable of good transclusions or mashups.

To get started, play with [http://testgrid.allmydata.org:3567/uri/URI:DIR2-RO:7h7syiurogz5erc2au74tjwguu:h7bdxvjtvidlkcdbld3j2d5sbgyzsbqs7wdnu6yznqrejzssc5za/wiki.html the TiddlyWiki-on-Tahoe-LAFS quick start], read the source code of [http://allmydata.org/trac/tiddly_on_tahoe/browser/tahoe_tiddly/HTTPSavingPlugin.js the HTTPSavingPlugin] and [http://allmydata.org/trac/tiddly_on_tahoe/browser/tahoe_tiddly/TahoePlugin.js the TahoePlugin] for !TiddlyWiki, and experiment with [http://caja.appspot.com/ writing live caja applets].

== Cloud Apps ==

Difficulty: easy to hard, depending on project choice and how far you want to push it.

Invent your own Summer-of-Code project by building a new web app on top of Tahoe-LAFS. The [#SecureDecentralizedWiki Secure Decentralized Wiki] is one example of a Cloud App. See [wiki:GSoCIdeas/CloudApps] for other ideas.

== WebDAV Support ==

Difficulty: medium to hard, depending on how much of an existing WebDAV implementation you are able to reuse.

Implement a WebDAV front-end for Tahoe-LAFS so that files and directories stored in a distributed grid can be accessed by operating systems and applications that speak the WebDAV protocol.
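
To give a feel for what a WebDAV client sends, the sketch below builds (but does not send) a {{{PROPFIND}}} request, the method WebDAV clients use to list a collection and its properties. The helper name {{{build_propfind}}} and the path {{{/dav/docs/}}} are hypothetical and not part of Tahoe-LAFS; the {{{Depth}}} header and the {{{allprop}}} body element come from the WebDAV specification.

{{{
#!python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"
# Serialize elements in the DAV: namespace with the conventional "D" prefix.
ET.register_namespace("D", DAV_NS)


def build_propfind(path, depth="1"):
    """Build (method, path, headers, body) for a PROPFIND asking for all properties.

    Hypothetical helper for illustration; nothing is sent over the network.
    """
    if depth not in ("0", "1", "infinity"):
        raise ValueError("Depth must be '0', '1' or 'infinity'")
    root = ET.Element("{%s}propfind" % DAV_NS)
    ET.SubElement(root, "{%s}allprop" % DAV_NS)
    body = ET.tostring(root, encoding="utf-8")
    headers = {
        "Depth": depth,  # "1" means the collection plus its direct children
        "Content-Type": 'application/xml; charset="utf-8"',
        "Content-Length": str(len(body)),
    }
    return "PROPFIND", path, headers, body


method, path, headers, body = build_propfind("/dav/docs/")
print(method, path, headers["Depth"])
print(body.decode("utf-8"))
}}}

A conforming server answers such a request with a {{{207 Multi-Status}}} response whose XML body describes each resource, so mapping a Tahoe-LAFS directory listing onto that response would likely be one of the first things a front-end needs to do.
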
WebDAV is specified in [http://tools.ietf.org/html/rfc2518.html RFC 2518] and [http://www.ics.uci.edu/~ejw/authoring/ a few other documents]; it essentially extends HTTP to act as a filesystem access protocol. For details, see #451, which describes what the Tahoe-LAFS web server does now, how this differs from what a WebDAV web server does, and how to get started experimenting with the relevant source code.

The main attraction of implementing a WebDAV interface is that several operating systems have bundled and somewhat integrated support for it, including Windows, Mac OS X, and most distributions of Linux. In fact, WebDAV may turn out to be an easier alternative to [http://en.wikipedia.org/wiki/Server_Message_Block SMB/CIFS] for allowing filesystem access from Windows.

However, there is currently no working WebDAV implementation in Twisted Python. There used to be one (the {{{web2.dav}}} package), [http://twistedmatrix.com/trac/ticket/3081 but it bitrotted]. You'll have to decide whether to help fix that implementation, use a non-Twisted implementation such as [http://code.google.com/p/wsgidav/ WsgiDAV] that might be more difficult to integrate with the existing Tahoe code, or write your own. In any case, WebDAV is a complicated protocol and you will need to decide what subset of it gives the most "bang for the buck" and is practical to support in the time available. For example, locking is optional in the WebDAV spec; is it needed to interoperate with commonly used WebDAV clients?

Unlike most filesystems, which are constrained to be trees, the structure of a Tahoe-LAFS filesystem is in general a graph that may contain cycles. [http://tools.ietf.org/html/draft-ietf-webdav-bind draft-ietf-webdav-bind] is an Internet Draft that clarifies how WebDAV servers should handle cycles.

[http://savannah.nongnu.org/projects/davfs2 davfs2] is a FUSE-based WebDAV filesystem client for Linux.
To ensure that davfs2 runs correctly over your implementation of WebDAV, you'll probably need to adapt the tests for the existing Tahoe [source:contrib/fuse/impl_c/blackmatch.py "blackmatch"] FUSE interface (this would not be redundant since the blackmatch implementation has limitations, especially for write access, that davfs2 would not have).

The [http://en.wikipedia.org/wiki/WebDAV#Microsoft_Windows WebDAV mini-redirector] is the component of Windows providing its WebDAV filesystem support. It is actually the less buggy of [http://www.zorched.net/2006/03/01/more-webdav-tips-tricks-and-bugs/ two implementations], but it has still had [http://greenbytes.de/tech/webdav/webdav-redirector-list.html bugs] and [http://www.microsoft.com/technet/security/bulletin/MS08-007.mspx security vulnerabilities] that you may need to take into account.

See also [http://allmydata.org/trac/tahoe-lafs/query?status=!closed&order=priority&keywords=~webdav tickets labelled 'webdav'].

== Distributed Introduction ==

Implement a protocol for distributed introduction, thus removing the only remaining Single Point of Failure (SPoF) in the Tahoe-LAFS system. For details, see [comment:11:ticket:68 ticket #68], which describes the distributed notification algorithm and points to the relevant source code.

== DVCS Integration ==

Write patches for the [http://git-scm.com/ git] or [http://darcs.net darcs] distributed revision control tool so that it reads and writes directly to a Tahoe-LAFS storage grid instead of its local filesystem. This creates a "revision control repository in the sky": a repository that is distributed, fault-tolerant, and highly available. It also lends Tahoe-LAFS's unique security and access-control properties to your revision control system: you can share read-only access or read-write access with specific people through Tahoe-LAFS's capability access control system, and you can rely on the integrated digital signatures to verify that you are reading an authorized version of the repository.
When Zooko was at the RSA 2010 security conference in March 2010, an employee of the U.S. National Security Agency told him that they were interested in integrating git with Tahoe-LAFS.

There is already a simple kind of integration for Tahoe-LAFS with the [http://bazaar.canonical.com/en/ bzr] distributed revision control tool. Bzr can be configured to write its repositories through FTP, and Tahoe-LAFS offers an FTP front-end. Here are [http://www.mail-archive.com/tahoe-dev@allmydata.org/msg01533.html instructions] on how to use the combination of bzr and Tahoe-LAFS. Improving the bzr+Tahoe-LAFS integration to be faster, more flexible, and easier to use would be an alternative to integrating git or darcs.

Required skills: for git you need to know some C and understand git's behavior. For darcs you need to know some Haskell and understand darcs's behavior. For bzr you need to know some Python (or actually, forget it: you can just learn Python as you go because it is so easy) and understand bzr's behavior.

----

= Mentors =

''Who is willing to spend about five hours a week (estimated) helping a student do it right?'' [[br]]

 * [mailto:zooko@zooko.com Zooko Wilcox-O'Hearn] [http://testgrid.allmydata.org:3567/uri/URI:DIR2-RO:j74uhg25nwdpjpacl6rkat2yhm:kav7ijeft5h7r7rxdp5bgtlt3viv32yabqajkrdykozia5544jqa/wiki.html blog] (Python/C/C++/JavaScript, security+cryptography)
 * [mailto:lloyd@randombit.net Jack Lloyd] [http://www.randombit.net blog] (C/C++/Python, security+cryptography)
 * [mailto:david-sarah@jacaranda.org David-Sarah Hopwood] (Python/C/JavaScript, SFTP frontend, security+cryptography)
 * [mailto:warner-tahoe@lothar.com Brian Warner] (Python/C/JavaScript, security+cryptography)
 * [mailto:gsoc2010@ndurner.de Nils Durner] (C/C++, security+cryptography, P2P)