#1374 new enhancement

"walk through" or guide for people who want to read some code

Reported by: zooko Owned by: nobody
Priority: major Milestone: undecided
Component: documentation Version: 1.8.2
Keywords: docs Cc:
Launchpad Bug:

Description (last modified by warner)

Riastradh writes on IRC: "in an afternoon I was able to understand basically everything Tarsnap does, and had time to spare to take a closer look at some details"

In doing so, he found the major bug in tarsnap's encryption which exposed all plaintext of all users until his bug report led to tarsnap fixing it.

So: I would like it if people like Riastradh were more likely to read the source code of Tahoe-LAFS in an afternoon!

I asked how many lines of code in tarsnap, and he replied "The core is twenty thousand lines of heavily commented C (probably half the lines are comments). A good chunk of that, about a thousand lines, is just a modified front end to bsdtar."

Note that tarsnap distributes the source code only for the client (the source code of the server is secret).

I ran the following command to get a rough estimate of the number of lines of code you would want to read to get a good idea of the behavior of the Tahoe-LAFS client:

find client.py codec.py control.py dirnode.py hashtree.py immutable/ interfaces.py mutable/ node.py nodemaker.py  storage_client.py uri.py scripts -type f -print0 | xargs -0 wc -l

Result:

   21767 total

find client.py codec.py control.py dirnode.py hashtree.py immutable/ interfaces.py mutable/ node.py nodemaker.py  storage_client.py uri.py scripts -print0 | xargs -0 sloccount

Result:

SLOC    Directory       SLOC-by-Language (Sorted)
3906    scripts         python=3906
3199    immutable       python=3199
2885    top_dir         python=2885
2453    mutable         python=2453
1446    downloader      python=1446


Totals grouped by language (dominant language first):
python:       13889 (100.00%)

So it is approximately the same amount of code and comments as tarsnap! (Although perhaps you would want to include or exclude different files or directories in your reading of the client.) Of course, Tahoe-LAFS is written in Python and tarsnap is written in C.
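For readers without sloccount installed, the same kind of rough estimate can be sketched in Python. This is only an approximation under stated assumptions: the helper names (iter_files, count_lines) are hypothetical, only files ending in ".py" are counted, and a "code line" is taken to be any line that is neither blank nor starts with "#", so docstrings still count as code and the numbers will not match sloccount exactly.

```python
import os


def iter_files(paths):
    """Yield every regular file named in `paths` or found under a directory in it."""
    for path in paths:
        if os.path.isdir(path):
            for root, _dirs, names in os.walk(path):
                for name in names:
                    yield os.path.join(root, name)
        elif os.path.isfile(path):
            yield path


def count_lines(paths, suffix=".py"):
    """Return (total_lines, code_lines) over all files ending in `suffix`.

    total_lines is comparable to `wc -l`; code_lines skips blank lines and
    lines whose first non-space character is '#'. Docstrings are not
    excluded, so this is only a rough stand-in for sloccount.
    """
    total = 0
    code = 0
    for filename in iter_files(paths):
        if not filename.endswith(suffix):
            continue
        with open(filename, encoding="utf-8", errors="replace") as f:
            for line in f:
                total += 1
                stripped = line.strip()
                if stripped and not stripped.startswith("#"):
                    code += 1
    return total, code
```

Calling count_lines(["client.py", "immutable", "mutable", "scripts"]) from the directory where the find commands above were run would give a (total, code) pair to compare against the wc and sloccount figures.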

This ticket is to create a short document which describes for a newcomer how to find the source code which implements just the client, and then the source code which implements just the server, and what the different parts of the client source code are for, etc.

Alternately, instead of making a separate document for this, take extant documents and add hyperlinks into them pointing to the relevant source code. Documents which might benefit from pointers to source code:

Hm, and at the same time, the source files which are pointed to from these doc files should have links in them pointing back to these doc files.

Change History (8)

comment:1 Changed at 2011-03-01T21:36:46Z by davidsarah

Bear in mind that Riastradh was reading only the client code of tarsnap. Tahoe's client code is really quite small, but it isn't well factored-out from the server code. #1338 would fix that.

comment:2 in reply to: ↑ description Changed at 2011-03-01T21:47:55Z by davidsarah

Replying to zooko:

I ran the following command to get a rough estimate of the number of lines of code you would want to read to get a good idea of the behavior of the Tahoe-LAFS client:

find client.py codec.py control.py dirnode.py hashtree.py immutable/ interfaces.py mutable/ node.py nodemaker.py  storage_client.py uri.py scripts -type f -print0 | xargs -0 wc -l

Result:

   21767 total

That includes the storage client. Isn't the equivalent of the tarsnap client just the Tahoe command-line client? That isn't much more than scripts/.

From the experiment I did for #1338:

$ wc -l bin/*.py src/allmydata/scripts/*.py
   760 bin/cli.py
   196 bin/common.py
   275 bin/encodingutil.py
   253 bin/fileutil.py
   201 bin/fixups.py
    77 bin/registry.py
   118 bin/runner.py
    60 bin/tahoe-test.py
   760 bin/uri.py
     0 src/allmydata/scripts/__init__.py
   371 src/allmydata/scripts/backupdb.py
   615 src/allmydata/scripts/cli.py
   196 src/allmydata/scripts/common.py
    86 src/allmydata/scripts/common_http.py
   208 src/allmydata/scripts/create_node.py
   892 src/allmydata/scripts/debug.py
    62 src/allmydata/scripts/keygen.py
   114 src/allmydata/scripts/runner.py
    83 src/allmydata/scripts/slow_operation.py
   202 src/allmydata/scripts/startstop_node.py
    59 src/allmydata/scripts/stats_gatherer.py
   119 src/allmydata/scripts/tahoe_add_alias.py
   325 src/allmydata/scripts/tahoe_backup.py
   306 src/allmydata/scripts/tahoe_check.py
   782 src/allmydata/scripts/tahoe_cp.py
    44 src/allmydata/scripts/tahoe_get.py
   200 src/allmydata/scripts/tahoe_ls.py
   147 src/allmydata/scripts/tahoe_manifest.py
    44 src/allmydata/scripts/tahoe_mkdir.py
    74 src/allmydata/scripts/tahoe_mv.py
    85 src/allmydata/scripts/tahoe_put.py
    39 src/allmydata/scripts/tahoe_rm.py
    32 src/allmydata/scripts/tahoe_webopen.py
  7785 total
Last edited at 2011-03-01T21:48:20Z by davidsarah

comment:3 Changed at 2011-03-02T13:20:50Z by gdt

My usual plea: all of these documents and especially the guide to the code should be in the same repository with the code. I work offline sometimes for various reasons, and those are exactly the times I tend to have spare time and the ability to concentrate on code reading.

comment:4 follow-up: Changed at 2011-03-02T22:43:19Z by warner

Actually, I'd say that tahoe's *server* is quite small, and it is the client (i.e. "gateway") that represents the bulk of the code. We try to keep the server dumb, and keep the smarts on the edges.

Also, assuming tarsnap's client is encrypting user data before sending it to the server, I think tahoe's encoding/encrypting code is the most direct equivalent. Probably everything from Filenodes down to the server-accessing code in storage_client.py. The webapi code has no equivalent in tarsnap (since their only frontend, as I understand it, is a 'tar'-like command), hence the HTTP-client CLI scripts don't directly correspond either.

I've tried a handful of times (once per presentation, really) to come up with a good sequence in which to teach how the tahoe system+codebase works. I usually try a high-level overview followed by explaining the details from the inside out, aiming to present a simplified version before adding more details. Something like the following:

  • diagram of client, gateway, servers
  • coarse dataflow of (file goes to gateway, gateway creates shares, shares go to servers), and back again
  • then zoom in to show immutable encoding process: refine "creates shares" to reveal random-key generation, encryption, then erasure-coding, then add a flat hash, show filecap generation, then add segmentation and merkle trees, then show CHK-key generation
  • zoom out to show decoding process: hash checking, erasure-decoding, decryption
  • zoom out to show share placement, hash-permutation, server allocation queries, download-time DYHB queries, pipelining/impatience
  • zoom out to show corresponding server-side methods: remote_allocate_buckets, RIBucketWriter.write, remote_get_buckets, RIBucketReader.read.
  • zoom out to show internal Filenode objects, .read() method, NodeMaker methods
  • zoom out to show webapi frontend, describe URL syntax. Show FTP, SFTP frontends next to it, but maybe defer description
  • zoom out to show CLI command scripts
  • zoom out to high-level client/gateway/servers diagram, add Introducer
  • zoom in to show IntroducerClient, show how messages are distributed, how clients learn about servers, connect those IServer objects in storage_client.py to the diagram of server-allocation code
  • introduce mutable files, refine "create shares" process in stages like with immutable files: start with SMDF, show signed roothash, then introduce segmentation and MDMF
  • show server-side mutable methods: testv-and-readv-and-writev
  • explain UCWE, recovery methods, .modify retry mechanism
  • explain directories, encoding format, DirectoryNode objects and methods, webapi syntax
  • then figure out what's left: backupdb, leases, deep-check/verify/renew, FTP/SFTP, admin node-creation/start/stop commands, stats-gatherer, key-generator, upload-download history/status, web utils like welcome/provisioning/reliability pages
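
Two of the steps in the list above, the Merkle tree over segments and the hash-permutation used for share placement, can be illustrated with a small sketch. This is not Tahoe-LAFS's actual code: the function names, the "leaf:"/"node:" tags, the use of plain SHA-256, and the rule of duplicating the last hash on odd levels are all assumptions made for the illustration. The real implementations live in hashtree.py and storage_client.py and may differ in every one of those details.

```python
import hashlib


def merkle_root(segments):
    """Return a root hash over a non-empty list of byte-string segments.

    Hypothetical scheme: leaves and interior nodes are SHA-256 hashes with
    distinct prefixes, and the last hash of an odd-length level is duplicated.
    """
    if not segments:
        raise ValueError("need at least one segment")
    level = [hashlib.sha256(b"leaf:" + seg).digest() for seg in segments]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pad odd levels by repeating the last hash
        level = [
            hashlib.sha256(b"node:" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def permuted_servers(storage_index, server_ids):
    """Order servers by SHA-256(storage_index + server_id).

    Each file (storage index) sees the servers in a different but
    reproducible order, so share placement is spread across the grid.
    """
    return sorted(
        server_ids,
        key=lambda sid: hashlib.sha256(storage_index + sid).digest(),
    )
```

The point of the first function is that changing any one segment changes the root, so a downloader holding the root can check segments individually; the point of the second is that uploader and downloader can compute the same server order independently from the storage index alone.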

comment:5 in reply to: ↑ 4 Changed at 2011-03-05T04:52:22Z by zooko

Wow. What an awesome roadmap for a comprehensive description. To avoid being too intimidating, I suggest that each of those bullet points in turn would make a document complete enough that it could be added to the repo and published. :-) We already have the first one!

How about the second one?

Okay, so the next step on this ticket is to select some extant documents and update them or write new documents to serve as the first and second bullet points.

comment:6 Changed at 2011-03-05T21:49:33Z by riastradh

Three random notes:

  1. The Tarsnap nonce reuse bug violated the security model, but not every part of it: it didn't expose any plaintext to eavesdroppers or men in the middle on the network, for example, or the plaintext of one user to another user. If Amazon had known plaintexts, then the bug exposed more plaintext to Amazon if it was uploaded in the same session as but in separate files from the known plaintext. (Bad? Yes. Exposure of all plaintext of all users? A little overstated.)
  2. The sort of 'semantic density' of Python code is higher than that of C code, and Python is much harder to cross-reference than C, so twenty thousand lines of Python is generally going to take me much longer to read and digest than twenty thousand lines of C.
  3. The structure of Tarsnap is not quite analogous to that of Tahoe-LAFS: while the Tahoe 'client' (implementing the 'tahoe cp' &c. commands) is mostly a trivial shim that just talks HTTP to a local server, the Tarsnap client also handles all the chunkification, encryption, &c. This difference matters: I want to be able to ask, 'What does Amazon get to see?', and I can find the answer for Tarsnap in the client source code users receive, but I don't think the answer for Tahoe-LAFS lies in src/allmydata/scripts/.
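
The first note, on why known plaintext matters under nonce reuse, can be illustrated generically. The sketch below assumes a stream-cipher-style construction in which ciphertext = plaintext XOR keystream(key, nonce). The keystream here is a toy built from SHA-256 purely for the demonstration, and none of it reproduces Tarsnap's actual cipher, framing, or code.

```python
import hashlib


def keystream(key, nonce, length):
    """Toy counter-mode keystream: SHA-256(key + nonce + counter) blocks."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def xor(a, b):
    """XOR two byte strings, truncating to the shorter one."""
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(key, nonce, plaintext):
    """Encrypt (or decrypt) by XORing with the keystream for (key, nonce)."""
    return xor(plaintext, keystream(key, nonce, len(plaintext)))


key = b"k" * 32
nonce = b"\x00" * 8  # the bug being illustrated: same nonce used twice

known = b"a file whose contents the observer already has"
secret = b"a different file uploaded in the same session"

c_known = encrypt(key, nonce, known)
c_secret = encrypt(key, nonce, secret)

# The keystream cancels out: c_known XOR c_secret == known XOR secret,
# so XORing in the known plaintext reveals the other one without the key.
recovered = xor(xor(c_known, c_secret), known)
```

With a fresh nonce per file the two keystreams differ and the same XOR yields nothing useful, which is why the exposure depends on the observer holding both ciphertexts and one of the plaintexts.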
Last edited at 2011-03-05T21:52:57Z by davidsarah

comment:7 Changed at 2012-05-16T22:00:08Z by zooko

han zheng asked on the mailing list how to study the source code. My answer included this idea:

Here's an idea:

To dive into the source code, start at the server side. The Tahoe-LAFS storage server is prevented by the architecture from knowing anything about the encryption/decryption or integrity-checking. It also doesn't know anything about the erasure coding. Therefore, it has less complexity than the Tahoe-LAFS gateway does, and it is easier to read the whole source code and understand what it does.

To understand everything that the storage server does, you would need to read the files in this directory:

https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/src/allmydata/storage/

Version 0, edited at 2012-05-16T22:00:08Z by zooko

comment:8 Changed at 2014-09-11T22:21:23Z by warner

  • Component changed from unknown to documentation
  • Description modified (diff)