#1425 closed enhancement (fixed)

blacklist support

Reported by: warner Owned by: warner
Priority: major Milestone: 1.9.0
Component: code-frontend-web Version: 1.8.2
Keywords: blacklist review-needed docs usability news-needed Cc:
Launchpad Bug:

Description

For various reasons, webapi gateway operators might want to have the ability to deny access to specific files. Putting this directly in tahoe, rather than obligating these operators to run a frontend proxy (like apache or nginx or something), will make it easier for everyone to use.

The attached patch creates a blacklist file, with a list of storage-index strings and a reason for each. Any webapi operation (indeed *any* operation, so FTP/SFTP too) that tries to access a node with one of the given SIs will throw an exception that contains the reason. The webapi frontend translates this exception into an HTTP "403 Forbidden" response.
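The mechanism described above can be sketched roughly as follows (class and method names here are illustrative, not necessarily those used in the attached patch):

```python
# Hypothetical sketch of the blacklist lookup described above; Blacklist and
# FileProhibited are illustrative names, not necessarily the patch's API.
class FileProhibited(Exception):
    def __init__(self, reason):
        Exception.__init__(self, reason)
        self.reason = reason

class Blacklist:
    def __init__(self):
        # maps storage-index string -> reason string
        self.entries = {}

    def check_storageindex(self, si):
        reason = self.entries.get(si)
        if reason is not None:
            # the webapi frontend would translate this into "403 Forbidden"
            raise FileProhibited(reason)

bl = Blacklist()
bl.entries["mysi123"] = "my puppy told me to"
try:
    bl.check_storageindex("mysi123")
except FileProhibited as e:
    print("403 Forbidden: %s" % e.reason)
```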

Attachments (6)

blacklist.diff (19.4 KB) - added by warner at 2011-08-01T03:52:20Z.
updated patch: block directories too, name 'access.blacklist'
blacklist2.diff (19.8 KB) - added by warner at 2011-08-21T01:07:37Z.
updated patch
1425-davidsarah.darcs.patch (60.8 KB) - added by davidsarah at 2011-08-23T02:43:17Z.
Tests, implementation and docs for blacklists, equivalent to blacklist2.diff but rebased for trunk, and with an extra test that we can list a directory containing a blacklisted file. refs #1425
blacklist3.diff (20.6 KB) - added by warner at 2011-08-24T00:09:02Z.
updated: handle directories properly, allow listing, incorporate recommendations
blacklist4.darcs.patch (68.0 KB) - added by davidsarah at 2011-08-24T05:39:22Z.
Implementation, tests and docs for blacklists. This version allows listing directories containing a blacklisted child. fixes #1425
blacklist5.darcs.patch (69.6 KB) - added by davidsarah at 2011-08-24T17:12:56Z.
Implementation, tests and docs for blacklists. This version allows listing directories containing a blacklisted child. Inclusion of blacklist.py fixed. fixes #1425


Change History (33)

comment:1 Changed at 2011-06-27T17:47:01Z by warner

  • Keywords review-needed added
  • Owner set to zooko

It turned out that the easiest implementation puts the blacklist right in IClient.create_node_from_uri, which means it's not limited to the webapi: FTP/SFTP (and any internal operations) will honor the blacklist too. So I'm looking for a better name for the config file than $NODEDIR/webapi.blacklist. I think just "blacklist" is a bit short, but could be convinced otherwise. Thoughts?

comment:2 Changed at 2011-08-01T02:24:08Z by davidsarah

  • Owner changed from zooko to davidsarah
  • Status changed from new to assigned

Changed at 2011-08-01T03:52:20Z by warner

updated patch: block directories too, name 'access.blacklist'

comment:3 Changed at 2011-08-01T03:55:11Z by warner

The new patch moves the blacklist check a bit deeper (into NodeMaker), so it correctly blocks directories too. I renamed the file to access.blacklist since it's not webapi-specific, and updated the docs to match both (and added tests to make sure that directories and their children can be blocked).

comment:4 follow-up: Changed at 2011-08-04T00:31:29Z by davidsarah

In webapi.rst, 'webapi' -> 'web-API'.

The time to do an os.stat on a file that does not exist and catch the resulting OSError seems to be around 0.005 ms on my machine. Are we sure that it isn't premature optimization to avoid this check on each web-API request when the blacklist file didn't exist at start-up?

(For some reason it is faster to catch the exception from os.stat than to use os.path.exists. A significant part of the difference is just the time to do the .path and .exists accesses, which shows a) how unoptimized CPython is, and b) that we shouldn't worry too much about OS calls being expensive relative to Python code.)
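The comparison above can be reproduced with a small timeit sketch (timings will vary by machine; the point is only the relative cost):

```python
# Rough microbenchmark of the comparison described above: catching the
# OSError from os.stat on a missing file vs. calling os.path.exists.
import os
import timeit

MISSING = "no-such-file-for-benchmark"

def stat_and_catch():
    try:
        os.stat(MISSING)
    except OSError:
        pass

def exists_check():
    os.path.exists(MISSING)

n = 10000
t_stat = timeit.timeit(stat_and_catch, number=n) / n
t_exists = timeit.timeit(exists_check, number=n) / n
print("os.stat + except: %.6f ms" % (t_stat * 1000))
print("os.path.exists:   %.6f ms" % (t_exists * 1000))
```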

If the access.blacklist file is deleted while a node is running, subsequent requests will fail.

It seems more usable to allow the reason to contain spaces. I know this complicates extending the format to have additional fields, but we probably won't need that, and if we did then we could use something like

storage_index;new_field The men in black came round to my house.
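A sketch of how such a format could be parsed: everything after the first run of whitespace is the reason (so reasons may contain spaces), and an optional `;`-separated extra field rides along with the storage index. parse_blacklist_line is a hypothetical helper, not code from the patch:

```python
# Everything after the first whitespace is the reason; an optional
# ";new_field" extension can be attached to the storage index.
def parse_blacklist_line(line):
    first, reason = line.split(None, 1)
    si, _, extra = first.partition(";")
    return si, extra, reason.strip()

si, extra, reason = parse_blacklist_line(
    "storage_index;new_field The men in black came round to my house.")
print(si, extra)
print(reason)
```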

The line wrapping changes to create_node_from_uri are not necessary; I don't object to them but they probably shouldn't be in this patch.

If a request happens at the same time as the access.blacklist file is being rewritten, the request may reread an incomplete copy of the file. (The process that rewrites access.blacklist could avoid this by writing the new file and then moving it into place, but the patch doesn't document that this is necessary, and also it won't work on Windows.)

When an access is blocked, the SI of the file should probably be logged. (I think the failed request will already be logged, but the exception only gives the reason string, which isn't necessarily unique to a file.)

comment:5 Changed at 2011-08-04T00:31:55Z by davidsarah

  • Keywords review-needed removed
  • Owner changed from davidsarah to warner
  • Status changed from assigned to new

comment:6 follow-up: Changed at 2011-08-04T03:30:06Z by zooko

If a request happens at the same time as the access.blacklist file is being rewritten, the request may reread an incomplete copy of the file. (The process that rewrites access.blacklist could avoid this by writing the new file and then moving it into place, but the patch doesn't document that this is necessary, and also it won't work on Windows.)

If you use FilePath.setContent then your code can be simple while having this behavior. In addition, on Windows it will write to a new file, delete the old file, move the new file into place. That should, if I understand correctly, guarantee that an uncoordinated reader will see one of (old file, new file, no file).
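For readers without Twisted at hand, here is a stdlib sketch of the write-then-rename pattern that FilePath.setContent implements (set_content_atomically is a hypothetical name; os.replace is the modern Python 3 spelling of the final step, while the 2011-era Twisted code used its own rename logic for Windows):

```python
# Sketch of atomic replacement: write the new content to a sibling temp
# file, then rename it into place, so an uncoordinated reader sees either
# the complete old file or the complete new one, never a partial copy.
import os
import tempfile

def set_content_atomically(path, data):
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)  # temp file in the same directory
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
    os.replace(tmp, path)  # atomic rename; also overwrites on Windows (3.3+)

set_content_atomically("access.blacklist", b"si123 my puppy told me to\n")
print(open("access.blacklist").read().strip())
```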

comment:7 in reply to: ↑ 6 ; follow-up: Changed at 2011-08-04T03:43:15Z by davidsarah

Replying to zooko:

If a request happens at the same time as the access.blacklist file is being rewritten, the request may reread an incomplete copy of the file. (The process that rewrites access.blacklist could avoid this by writing the new file and then moving it into place, but the patch doesn't document that this is necessary, and also it won't work on Windows.)

If you use FilePath.setContent then your code can be simple while having this behavior. In addition, on Windows it will write to a new file, delete the old file, move the new file into place. That should, if I understand correctly, guarantee that an uncoordinated reader will see one of (old file, new file, no file).

Currently if the node sees no file (after seeing one at startup), the blacklist code will raise an exception. (I think this will just fail the request rather than crashing the node.) I suppose that fails safe, but it's still undesirable.

In any case, someone editing the file probably won't be using FilePath. They might use an editor that does the atomic rename-into-place; it's not uncommon for editors to do that on Unix, but it's not good for usability to expect users to know whether their editor does that.

If we had a tahoe command to add and remove entries from the blacklist, that command could tell the node when to reread it, maybe (or the add/remove operations could be web-API requests, but that would be more complicated).

Alternatively, we could just document that the node always needs to be restarted after editing the blacklist.

Last edited at 2011-08-04T03:44:27Z by davidsarah

comment:8 in reply to: ↑ 7 Changed at 2011-08-04T16:51:11Z by davidsarah

Replying to davidsarah:

Alternatively, we could just document that the node always needs to be restarted after editing the blacklist.

How about doing that for 1.9, and then adding

tahoe blacklist add STORAGE_INDEX_OR_PATH
tahoe blacklist remove STORAGE_INDEX_OR_PATH
tahoe blacklist view

or similar commands (that would work with a running gateway) for 1.10?

comment:9 Changed at 2011-08-11T01:21:01Z by davidsarah

  • Keywords blacklist design-review-needed added

comment:10 Changed at 2011-08-21T00:18:55Z by warner

changes I'm making following the phone call from last weekend:

  • always check the blacklist file, not just if it was present at startup (prevents a crash if the file is deleted, and removes the need to restart after adding the first entry)
  • mention in the docs that you need to replace the file atomically
  • log SI of blocked file
  • consider changing format of blacklist file to allow spaces in reason
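The first bullet might look roughly like this (a sketch, not the patch's actual code): reread the file only when its mtime changes, and treat a missing file as an empty blacklist rather than crashing:

```python
# Illustrative sketch: Blacklist.read_blacklist would be called on every
# request, but only rereads the file when its mtime has changed, and a
# missing file simply clears the blacklist instead of raising.
import os

class Blacklist:
    def __init__(self, filename):
        self.filename = filename
        self.last_mtime = None
        self.entries = {}

    def read_blacklist(self):
        try:
            current_mtime = os.stat(self.filename).st_mtime
        except OSError:
            # missing blacklist file now means "no blacklist", not a crash
            self.entries.clear()
            self.last_mtime = None
            return
        if self.last_mtime is None or current_mtime > self.last_mtime:
            self.entries.clear()
            with open(self.filename) as f:
                for line in f:
                    si, reason = line.strip().split(None, 1)
                    self.entries[si] = reason
            self.last_mtime = current_mtime

bl = Blacklist("access.blacklist")
bl.read_blacklist()          # no file yet: empty blacklist, no error
print(len(bl.entries))
with open("access.blacklist", "w") as f:
    f.write("si123 my puppy told me to\n")
bl.read_blacklist()          # mtime changed: entries are reloaded
print(bl.entries["si123"])
```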

Changed at 2011-08-21T01:07:37Z by warner

updated patch

comment:11 follow-up: Changed at 2011-08-21T01:10:30Z by warner

  • Keywords review-needed added; design-review-needed removed
  • Owner changed from warner to davidsarah

That new patch adds the following:

  • allow comments, spaces, blank lines in access.blacklist
  • always check, not just when access.blacklist was present at boot
  • log prohibited file access (to twistd.log)
  • report blacklist parsing errors
  • mention need for atomic blacklist updates in docs
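The first and fourth bullets could be sketched like this (parse_blacklist is a hypothetical helper; the patch's actual error handling may differ):

```python
# Sketch: comments and blank lines are allowed, and malformed lines are
# reported rather than silently dropped.
def parse_blacklist(text):
    entries = {}
    errors = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments and blank lines are ignored
        parts = line.split(None, 1)
        if len(parts) != 2:
            errors.append("line %d: expected 'STORAGEINDEX reason'" % lineno)
            continue
        si, reason = parts
        entries[si] = reason
    return entries, errors

entries, errors = parse_blacklist(
    "# blocked files\n\nsi123 my puppy told me to\nbad-line-without-reason\n")
print(entries)
print(errors)
```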

While testing, I noticed that doing the check in NodeMaker means that listing a directory will fail if one of the objects inside it is prohibited: the directory-listing code creates a Filenode for each object (to extract the readcaps/writecaps separately, I think), and that step fails. Should we fix that? By doing the check somewhere else (maybe Filenode.read?) we could allow directories to mention the prohibited item, which might be a usability win.

comment:12 in reply to: ↑ 11 ; follow-up: Changed at 2011-08-21T04:42:40Z by davidsarah

Replying to warner:

While testing, I noticed that doing the check in NodeMaker means that listing a directory will fail if one of the objects inside it is prohibited: the directory-listing code creates a Filenode for each object (to extract the readcaps/writecaps separately, I think), and that step fails. Should we fix that? By doing the check somewhere else (maybe Filenode.read?) we could allow directories to mention the prohibited item, which might be a usability win.

Yes, I think we should fix this (and have a test for it).

Would moving the check to Filenode.read be sufficient? I think you need checks in MutableFileVersion as well (for both reading and writing a blacklisted mutable object). It seems more fragile than doing the check in NodeMaker; there are more places to check, and missing one would leave some blacklisted objects accessible.

Maybe it's safer to keep the check in NodeMaker, but have the directory-reading code catch the exception and omit the blacklisted Filenode. Oh, but that would mean that modifying a directory would drop all blacklisted children, which is probably not what we want (the fact that the object is blacklisted for this gateway doesn't mean that it shouldn't be preserved for access via other gateways).

I don't know; I'll leave it up to you.

comment:13 Changed at 2011-08-21T04:45:08Z by davidsarah

In the example in webapi.rst,

-> error, 403 Access Prohibited: my-puppy-told-me-to

should be

-> error, 403 Access Prohibited: my puppy told me to

(Still reviewing.)

comment:14 Changed at 2011-08-21T05:07:53Z by davidsarah

webapi.rst line 1964: "whether it need to be" -> "whether it needs to be"

FileProhibited in src/allmydata/blacklist.py: __init__(self, reason) should either chain to the superclass Exception.__init__(self, reason) or implement __repr__, otherwise the reason won't be included in tracebacks.

Expand blacklist_fn to blacklist_filename (I was momentarily confused because I read fn as function.)
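An illustration of the chaining point above: passing the reason through to Exception.__init__ stores it in args, so it shows up in str(), repr(), and tracebacks (FileProhibited here is a sketch, not the patch's exact code):

```python
# Chaining to Exception.__init__ ensures the reason reaches e.args, so it
# appears when the exception is printed or shown in a traceback.
class FileProhibited(Exception):
    def __init__(self, reason):
        Exception.__init__(self, reason)  # chain so the reason reaches args
        self.reason = reason

e = FileProhibited("my puppy told me to")
print(str(e))
print(repr(e))
```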

comment:15 Changed at 2011-08-21T05:08:01Z by davidsarah

    try: 
        current_mtime = os.stat(self.blacklist_fn).st_mtime
    except EnvironmentError: 
        # unreadable blacklist file means no blacklist 
        self.entries.clear() 
        return 

If the file exists but isn't readable (for instance, if we don't have permission to read it), that should not be a silent error. Make it either:

    except EnvironmentError:
        if os.path.exists(self.blacklist_fn):
            raise
        # nonexistent blacklist file means no blacklist 
        self.entries.clear() 
        return 

or

    except EnvironmentError, e:
        # unreadable blacklist file means no blacklist 
        if os.path.exists(self.blacklist_fn):
            twisted_log.err(e, "unreadable blacklist file")
        self.entries.clear() 
        return 

comment:16 in reply to: ↑ 4 Changed at 2011-08-21T05:32:10Z by davidsarah

  • Keywords reviewed added; review-needed removed

The scope of the try/except at line 26 of blacklist.py can be narrowed a bit; I'd put it around the for loop rather than the if.

This comment still applies:

The line wrapping changes to create_node_from_uri are not necessary; I don't object to them but they probably shouldn't be in this patch.

In allmydata/test/no_network.py, the line c.set_default_mutable_keysize(522) will conflict with [5171/ticket393-MDMF-2].

In test_blacklist:

self.g.clients[0].blacklist.last_mtime -= 2.0

Ugh, but I can't think of a cleaner way to do this, so I'll let you off ;-)

Good tests. Do we need a test that mutable files are correctly blacklisted? I suppose that's not necessary if we are checking in NodeMaker, but it would be if we were checking in immutable/filenode.py and mutable/filenode.py.

Otherwise +1.

comment:17 Changed at 2011-08-21T05:32:25Z by davidsarah

  • Owner changed from davidsarah to warner

comment:18 in reply to: ↑ 12 Changed at 2011-08-21T19:28:49Z by davidsarah

Replying to davidsarah:

Replying to warner:

While testing, I noticed that doing the check in NodeMaker means that listing a directory will fail if one of the objects inside it is prohibited: the directory-listing code creates a Filenode for each object (to extract the readcaps/writecaps separately, I think), and that step fails. Should we fix that? By doing the check somewhere else (maybe Filenode.read?) we could allow directories to mention the prohibited item, which might be a usability win.

Yes, I think we should fix this (and have a test for it).

... but if it would delay 1.9, then let's just do it the simpler way that fails for directory listing, document that, and open a ticket to fix it.

Changed at 2011-08-23T02:43:17Z by davidsarah

Tests, implementation and docs for blacklists, equivalent to blacklist2.diff but rebased for trunk, and with an extra test that we can list a directory containing a blacklisted file. refs #1425

Changed at 2011-08-24T00:09:02Z by warner

updated: handle directories properly, allow listing, incorporate recommendations

comment:19 follow-up: Changed at 2011-08-24T00:14:09Z by warner

hrm, one wrinkle that's somehow bypassing tests meant to catch it: if you blacklist a file, access it (and get the error), then unblacklist it, the next access still throws an error. An obvious downside of monkeypatching to replace Node.read is that the Node might stick around: in this case in the nodemaker's cache (although that's a WeakValueDictionary so there must be something else holding on to it).

The safest approach (at least one that would let you unblacklist things quickly) would be to do the blacklist check on each call to read(), rather than replacing read() with a function that always throws an exception.
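The per-call check can be sketched as follows (Node and the module-level blacklist dict are illustrative stand-ins; the point is that a cached node picks up blacklist changes because nothing about the node itself was patched):

```python
# Checking the blacklist on every read() call means a cached node recovers
# as soon as its entry is removed, unlike a monkeypatched read() method.
blacklist = {"si123": "my puppy told me to"}

class Node:
    def __init__(self, si):
        self.si = si

    def read(self):
        # consult the (possibly updated) blacklist on every call
        if self.si in blacklist:
            raise Exception("prohibited: " + blacklist[self.si])
        return "file contents"

node = Node("si123")    # imagine this is held in the nodemaker's cache
try:
    node.read()
except Exception as e:
    print(e)
del blacklist["si123"]  # unblacklist: the same cached node works again
print(node.read())
```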

Changed at 2011-08-24T05:39:22Z by davidsarah

Implementation, tests and docs for blacklists. This version allows listing directories containing a blacklisted child. fixes #1425

comment:20 Changed at 2011-08-24T05:44:44Z by davidsarah

  • Keywords review-needed added; reviewed removed

Hmm, the changes to no_network.py in blacklist4.darcs.patch aren't actually necessary, because the test no longer needs to restart the node. I'll revert those changes.

comment:21 in reply to: ↑ 19 Changed at 2011-08-24T05:49:57Z by davidsarah

Replying to warner:

hrm, one wrinkle that's somehow bypassing tests meant to catch it: if you blacklist a file, access it (and get the error), then unblacklist it, the next access still throws an error. An obvious downside of monkeypatching to replace Node.read is that the Node might stick around: in this case in the nodemaker's cache (although that's a WeakValueDictionary so there must be something else holding on to it).

The safest approach (at least one that would let you unblacklist things quickly) would be to do the blacklist check on each call to read(), rather than replacing read() with a function that always throws an exception.

blacklist4.darcs.patch solves this problem because the nodemaker will cache the original node object, not the ProhibitedNode wrapper. A node gets wrapped with ProhibitedNode on each request, depending on whether it is blacklisted at the time of that request.

A slightly odd side-effect of this patch is that prohibited directories will be treated a little more like files. This is actually quite useful because it prevents recursive operations from trying to traverse them.
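The wrapper approach can be sketched as follows (FileNode, ProhibitedNode, and get_node are illustrative names standing in for the nodemaker machinery):

```python
# The cache holds the original node; each request wraps it in a
# ProhibitedNode only if the storage index is currently blacklisted.
blacklist = {"si123": "my puppy told me to"}

class FileNode:
    def __init__(self, si):
        self.si = si

    def read(self):
        return "file contents"

class ProhibitedNode:
    def __init__(self, wrapped, reason):
        self.wrapped = wrapped
        self.reason = reason

    def read(self):
        raise Exception("403 Access Prohibited: " + self.reason)

cache = {"si123": FileNode("si123")}  # nodemaker cache holds the real node

def get_node(si):
    node = cache[si]
    reason = blacklist.get(si)
    if reason is not None:
        return ProhibitedNode(node, reason)  # wrapped per-request
    return node

try:
    get_node("si123").read()
except Exception as e:
    print(e)
del blacklist["si123"]
print(get_node("si123").read())  # cache unchanged, access restored
```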

comment:22 follow-up: Changed at 2011-08-24T15:27:45Z by warner

blacklist4.darcs.patch: the blacklist.py file is missing, including the new ProhibitedNode class, so I can't test it right away. I assume the new class passes through a lot of methods but raises an exception during read() and download_version (and download_best_version).

Let's move the no_network.py cleanups to a different patch: they're useful cleanups, but I agree it'd be better to defer them until after the release.

I'm willing to go with this approach, if only for expediency. I've got a few concerns that might warrant more work post-1.9:

  • we talked about this blocking FTP/SFTP too, and blocking read() ought to do that, but it'd be nice to have a test for it.
  • marking the files as prohibited on the directory listing is a nice touch, but I'd be happy to see this feature go in without it. Also, by removing the More Info link, it's harder for users to find the blocked filecap (and in particular the storage-index), so they can edit their access.blacklist file to unblock the file later. Maybe we could put "Access Prohibited" in the info column and make it a link to the same old More Info page as before, then add the reason to the more-info page? Likewise, I don't think we need to change the "type" column to indicate the file has been blacklisted: the strikethrough is plenty big enough.

Changed at 2011-08-24T17:12:56Z by davidsarah

Implementation, tests and docs for blacklists. This version allows listing directories containing a blacklisted child. Inclusion of blacklist.py fixed. fixes #1425

comment:23 Changed at 2011-08-24T17:13:39Z by davidsarah

I removed the no_network.py cleanups and added blacklist.py (sorry about that). The other changes can wait until after the alpha.

comment:24 in reply to: ↑ 22 Changed at 2011-08-24T20:57:42Z by davidsarah

Replying to warner:

  • marking the files as prohibited on the directory listing is a nice touch, but I'd be happy to see this feature go in without it. Also, by removing the More Info link, it's harder for users to find the blocked filecap (and in particular the storage-index), so they can edit their access.blacklist file to unblock the file later.

I added the reason in that column because the More Info page ended up giving an error, and there was no point in having the link in that case. I'll have a look at how to get the info page working and showing the blacklist reason.

Maybe we could put "Access Prohibited" in the info column and make it a link to the same old More Info page as before, then add the reason to the more-info page? Likewise, I don't think we need to change the "type" column to indicate the file has been blacklisted: the strikethrough is plenty big enough.

I agree. The change to the type column was there more because it was slightly easier to implement than showing the original type code (and because I didn't have the "Access Prohibited" text at that point) than because it was necessary.

comment:25 Changed at 2011-08-25T02:26:32Z by david-sarah@…

  • Resolution set to fixed
  • Status changed from new to closed

In 3d7a32647c431385:

(The changeset message doesn't reference this ticket)

comment:26 Changed at 2011-08-25T11:52:05Z by zooko

When releasing 1.9, we should go to extra effort to communicate what this change does and doesn't do. I've been learning that almost all users have very simpleminded models of things, for example I think the existence of the public web gateway made most users think that Tahoe-LAFS was nothing more than some sort of online service. Explaining this feature (in the NEWS/release-notes/etc.) may be a good opportunity to explain the difference between running your own software (Tahoe-LAFS storage client) to upload or download files vs. visiting a web server (Tahoe-LAFS gateway) operated by someone else and asking them to serve files to you.

This feature lets the operator of a Tahoe-LAFS storage client (== Tahoe-LAFS gateway == the web server in question) configure their software so it refuses to serve certain files (to anyone). It does not give them any ability to affect whether other Tahoe-LAFS storage clients/gateways access those files.

How can we make this clear? Maybe the only way to make this clear is to create a variant of http://tahoe-lafs.org/~zooko/network-and-reliance-topology.png which shows multiple gateways in use, and indicate on that diagram that the blacklisting feature affects only the single gateway that chooses to use it.

comment:27 Changed at 2011-08-25T11:56:13Z by zooko

  • Keywords docs usability news-needed added