[tahoe-dev] patch review
Brian Warner
warner-tahoe at allmydata.com
Fri Jan 2 17:38:55 PST 2009
(I'm reviewing patches that landed while I was out on vacation).
[3292]: http://allmydata.org/trac/tahoe/changeset/3292
"refactor some common logging behavior into mixins":
the use of self._parentmsgid is suspicious. It looks like two successive
calls to log(), like:
self.log("foo")
self.log("bar")
will wind up with "bar" being the child of "foo", whereas they ought to both
be siblings of the same original parent. In addition to not correctly
describing the relationship between these events, when these
long-chain-of-descendants messages are viewed in 'flogtool web-viewer',
they'll scroll off the right edge of the screen.
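To make the concern concrete, here is a guessed sketch of the pattern (not
the actual mixin code; log.msg() here is the foolscap-style API that takes a
parent= kwarg and returns a message id):

    from allmydata.util import log

    class ChainingLogMixin:
        _parentmsgid = None
        def log(self, msg, **kwargs):
            msgid = log.msg(msg, parent=self._parentmsgid, **kwargs)
            # Updating the parent on every call is what makes "bar" a child
            # of "foo". Leaving self._parentmsgid fixed (set once, e.g. when
            # the object is created) would keep successive messages siblings.
            self._parentmsgid = msgid
            return msgid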
[3305]: http://allmydata.org/trac/tahoe/changeset/3305
"immutable download: don't catch all exceptions, just DeadReferenceError
and IntegrityCheckReject"
I haven't looked closely enough to be sure, but does this mean that a single
server could cause the whole download to fail, just by throwing some
unexpected exception? That doesn't seem like a good ability to grant to
storage servers.
In general, when we're asking multiple servers for data, and any subset
could do, I've tried to deal with potential exceptions in three classes:
  DeadReferenceError: log it as UNUSUAL, give up on that server. This
      happens, there's no need to get flustered about it, but it shouldn't
      happen every day. So I use:
        log.msg("query failure", failure=f, level=log.UNUSUAL)

  Integrity failures: log it as WEIRD, give up on that server (or on that
      block, but be willing to use others). This indicates a hardware
      problem, or a significant bug, or malice, and somebody should look
      into it, but it doesn't prevent us from downloading the file. I use:
        log.msg("query failure", failure=f, level=log.WEIRD)

  Other exceptions: log it as BAD, give up on that server (or on that
      block), make sure it gets reported to any unit test that's currently
      running, then use other servers to download the file. This indicates
      a significant bug or malice, and somebody should look into it, but it
      doesn't prevent us from downloading the file. We don't expect this to
      happen, and the most likely way for it to occur is when we're shaking
      the bugs out of new code, so we need to make sure the exceptions are
      easy to spot while we're running unit tests (i.e. we really want the
      unit test to fail, and for Trial to print out the exception). The
      allmydata.util.log.err function will report the Failure to both
      foolscap logging and to twisted.python.log.err(), and the latter will
      get noticed by Trial (and will flunk the unit test). So I'd use:
        log.err("query failure", failure=f, level=log.BAD)
[3306]: http://allmydata.org/trac/tahoe/changeset/3306
"util.base32: loosen the precondition forbidding unicode"
As mentioned in the "String encoding in Tahoe" thread, I prefer arguments to
be of one class only, and functions which accept unicode-or-bytestring make
me nervous. Since you reverted the sys.argv conversion in [3311], I'd like
to see [3306], [3307], and possibly [3308] reverted too.
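For illustration only, this is the kind of single-type precondition I have in
mind, using the existing assertutil helper (the message text is just an
example):

    from allmydata.util.assertutil import precondition

    def b2a(os):
        # accept exactly one type: a bytestring, never unicode
        precondition(isinstance(os, str),
                     "b2a() takes a bytestring, got %r" % (type(os),))
        # ... actual base32 encoding would follow ...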
[3323]: http://allmydata.org/trac/tahoe/changeset/3323
"generically wrap any errback from callRemote() in a ServerFailure instance"
I'm worried that this is going to change the behavior of error-handling code
in surprising ways. The core of the change is to add an errback to all
callRemotes with:
def _wrap_server_failure(f): raise ServerFailure(f)
where the ServerFailure exception stashes the original Failure in an
attribute and delegates __repr__ and __str__ through to the original.
This is probably going to lose the stack frames which are normally stored in
a Failure, as well as the exception type (so f.trap or f.check will be
unable to examine the original exception type). This will deny the caller
the ability to make decisions based upon the remote exception type.
You might want the errback to just set an attribute on the original Failure
instance, rather than using a distinct wrapper class as an indicator. Remember
that you could get a KeyError from either side, and the calling code might
want to know about it.
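A sketch of that attribute-based approach (the attribute name is my
invention):

    def _mark_remote_failure(f):
        f.came_from_server = True   # hypothetical attribute name
        return f                    # pass the same Failure along unchanged

    d = rref.callRemote("foo", args)
    d.addErrback(_mark_remote_failure)
    # later: f.check(KeyError) still works as before, and
    # getattr(f, "came_from_server", False) says which side raised it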
Another option would be to create a marker class named "ServerException" or
"RemoteException", then create a "RemoteServerFailure" subclass of Failure
with overridden check()/trap() methods that say "yes" when asked about
ServerException/RemoteException, and have the wrapper replace the original
Failure with the new RemoteServerFailure. If RemoteServerFailure could
delegate everything else to the original, this would give you the desired
distinguish-remote-failures-from-local-ones property in a fully compatible
way. It might also be possible to just monkeypatch the original Failure to
change its check()/trap() methods to behave this way, without creating a
whole new Failure subclass.
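Roughly what I mean, as an untested sketch (the class names and the
delegation details are guesses, not an implementation):

    from twisted.python.failure import Failure

    class RemoteException(Exception):
        """Marker: this Failure crossed the wire from a remote server."""

    class RemoteServerFailure(Failure):
        def __init__(self, original):
            Failure.__init__(self, original.value, original.type, original.tb)
            self.original = original
        def check(self, *errorTypes):
            if RemoteException in errorTypes:
                return RemoteException
            return self.original.check(*errorTypes)
        def trap(self, *errorTypes):
            rv = self.check(*errorTypes)
            if not rv:
                raise self   # same no-match behavior as Failure.trap
            return rv

    def _wrap_remote_failure(f):
        return RemoteServerFailure(f)
    # rref.callRemote("foo", args).addErrback(_wrap_remote_failure)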
Another approach (and the one that I think I usually take) is to use the
placement of addErrback()s to identify remote failures, rather than trying
to use the exception type. The following examples might demonstrate the
distinction:
1:

    d = doSomething()
    def do_remote(res):
        return rref.callRemote("foo", args)
    d.addCallback(do_remote)
    def process(results):
        return 2/results
    d.addCallback(process)
    def interpret_error(f):
        if f.check(DeadReferenceError): print "remote"
        if f.check(ZeroDivisionError): print "local"
    d.addErrback(interpret_error)
    return d
2:

    d = doSomething()
    def do_remote(res):
        d1 = rref.callRemote("foo", args)
        def remote_error(f):
            print "remote"
            return recover_somehow()
        d1.addErrback(remote_error)
        return d1
    d.addCallback(do_remote)
    def process(results):
        return 2/results
    d.addCallback(process)
    def interpret_error(f):
        if f.check(ZeroDivisionError): print "local"
    d.addErrback(interpret_error)
    return d
I tend to use the latter: if remote_error() fires, the problem occurred
during the callRemote. This approach is not as convenient to use if you
really want to bypass the post-callRemote processing in the case of error.
It seems to work fairly well if the pattern is more "get data from server 1,
if that fails try to get it from server 2, if that fails try server 3, etc,
then process whichever data was successfully retrieved".
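For that fallback pattern, the shape is roughly this (get_data, the
"get_data" remote method, and the final exception are stand-ins for whatever
the real code uses):

    from twisted.internet import defer
    from allmydata.util import log

    def get_data(servers):
        if not servers:
            return defer.fail(RuntimeError("ran out of servers"))
        d = servers[0].callRemote("get_data")
        def _failed(f):
            log.msg("query failure", failure=f, level=log.UNUSUAL)
            return get_data(servers[1:])   # fall back to the next server
        d.addErrback(_failed)
        return d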
This affects [3334] too.
[3326]: http://allmydata.org/trac/tahoe/changeset/3326
"storage: accept any size shares"
As I mentioned on the ticket (#346), this needs a change to
StorageServer.VERSION (to increase the advertised share-size limit) before
clients will actually be willing to take advantage of larger shares. I'm not
sure what value to suggest we advertise: we could use os.statvfs() to
estimate how much space is available (and subtract our reserved value) at
startup and advertise that, or we could find a way to ask the system about
the largest filesize we can store, or we could pick some arbitrary large
number (like 2**64).
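If we go the measure-free-space route, something along these lines could
produce the advertised number (a sketch only; wiring the result into
StorageServer.VERSION and the reserved-space configuration are left out):

    import os

    def estimated_max_share_size(storedir, reserved_space):
        # Unix-only: os.statvfs reports filesystem capacity
        s = os.statvfs(storedir)
        available = s.f_frsize * s.f_bavail  # bytes available to unprivileged users
        return max(0, available - reserved_space)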
Also, the line that does "max_size % 4294967295" is probably a bug
(4294967295 is 2**32-1, not 2**32), and should be replaced with
"max_size % (2**32)".
cheers,
-Brian