devchat notes 11-Apr-2017

Tue Apr 11 18:49:17 UTC 2017

Tahoe-LAFS devchat 11-Apr-2017
Attendees: warner, meejah, exarkun, daira, dawuud

* added meejah, exarkun, dawuud as committers!

* warner is replacing test_web with treq to fix deprecation warnings
* treq test strategies
  * build a simplistic implementation of a target object
  * interact with the frontend
  * look at the target and make sure it's state has changed in the
    expected way

* potential cleanup passes
  * switch to new-style classes (inherit from object), use
    super(MyClass,self).__init__(args) instead of
    MyParent.__init__(self,args)
    * need to think about test mixin classes and setUp methods,
      especially ones that return a Deferred
    * maybe make a rule: test-class setUp methods are always
      inlineCallbacks and "yield super()...setUp()"
  * switch to inlineCallbacks, maybe
  * replace ShouldFailMixin with calls to .failureResultOf or
    .assertFailure
  * move OneShotObserverList into Twisted itself
    * provide a way to pass the reactor in, combine with
      flushEventualQueue
    * allow tests to crank the mock reactor forwards until the results
      are ready
    * so tests can avoid really yielding to the real reactor
  * switch to relative imports
  * start on python3 work
    * switch to print_function
    * be more clear about bytes-vs-unicode
      * daira says that in most places where we say "str", we do in fact
        mean bytes, we've been pretty good about that
    * add b"" or u"" tags
    * util/hashutil.py should definitely be b""
    * maybe add from __future__ import unicode_literals to other files
    * foolscap will be a big blocker

* warner saw intermittent test coverage (flagged by codecov on the last
  few PRs that landed)
  * will file a ticket with the lines that are sometimes covered
  * we should figure out what's going on, add tests to cover them
    properly
  * new PRs that don't touch code (e.g. docs) should (obviously) never
    cause coverage regressions
  * it'd be nice if there was a tool to show which test files provided
    which coverage
    * so we can make sure unit-test files get full coverage on
      individual modules
    * then integration-like tests are useful too, but we don't depend
      upon them for coverage
  * "hypothesis" module for adding to test coverage

* meejah working on accounting branch
  * start by asyncifying leasedb API
  * messy/large
  * how to report fatal errors that are async? eg. leasedb-on-S3 fails
    * buildbot would tail the logfile until it saw an "I'm happy"
      message, print any failures it saw
  * he's trying to replace constructors with more async-friendly factory
    functions
    * warner is anxious about that
  * goal is to not reference a LeaseDB object until it knows it can talk
    to an (async/cloud) database
  * but currently everything in the Node/Client constructor is created
    synchronously
* when is the node "ready"?
  * what does that even mean?
  * not ready: I don't have any recent evidence that this might even
    work
    * e.g. haven't heard from the introducer at all
  * we used to have code that would react to an initial Introducer
    connection failure with sys.abort()
  * there are some early indicators of problems that are so severe, we
    want to inform the "tahoe start" user on stderr and exit
    * it's unfortunate to signal these errors in a log file instead of
      stderr
    * the "tahoe start" user will think that the node is viable, when it
      fact it's unusable
* two basic approaches
* approach one:
  * DON'T START UNTIL WE PROVE FEASIBILITY OF THE LAST STEP
  * replace Node/Client.__init__ with factory construction functions
    * change self.create_* to return the thing they create, instead of
      stashing it on self.x=
    * and change them to accept their dependencies as arguments
    * so tub = self.create_tub(config); introducer_client =
      self.create_introducer_client(config, tub)
    * then self.create_tub() no longer references "self", can be
      converted from a method of Node into a standalone function
    * then Node.__init__ is replaced with (standalong function)
      create_node()
  * change create_storage_server() (which does create_leasedb) to return
    a Deferred
    * which doesn't fire until the early can-I-even-DB checks pass
  * rewrite daemonization code to tolerate makeService() returning a
    Deferred
    * not easy: twistd can't handle this
    * may have to do all the daemonization ourselves
    * we need the reactor running to do that leaseDB setup work
    * not a good idea to start the reactor before daemonization: then
      you wind up with two reactors, still sharing FDs
    * one incomplete direction to explore:
      * startstop_node.start could examine nodetype before touching
        StartTahoeNodePlugin
      * do some async stuff early, before touching twistd.runApp
      * however, the reactor is not running at that point, which we'd
        need for leaseDB stuff
* approach two:
  * like what Buildbot used to do a long time ago
  * "tahoe start" forks
  * parent watches the child until it gets past an initial startup phase
  * if child signals an early error, parent prints that error to stderr
    and exits with rc=1
  * if child signals success, parent prints nothing and exits with rc=0
  * parent either follows a child logfile, or connects to a socket
    * logfile makes it easier to deal with the child exiting suddenly
    * socket makes it easier to distinguish between old and new
      messages, auto-flushes
    * maybe protocol should be: parent truncates logfile, parent
      opens+reads, child opens +writes +flushes
    * maybe pass argument into child to tell it where the logfile is
    * maybe pass an FD into child, rather than a logfile on disk
      * if the daemonization code enables that, e.g. O_CLOEXEC
  * may require a lot of custom daemonization code too
    * twistd daemonization code makes the parent exit, so we lose
      control
    * parent can't use reactor to follow logfile
      * could just do blocking reads, with maybe a SIGALRM for timeout
    * how to retain control of the parent while still taking advantage
      of twistd's code?
* the appearance of "twist" (in Twisted-16.4) means that "twistd" may
  eventually get deprecated, so maybe we should stop relying on the
  twistd internals quite so much (note that "twist" is almost entirely
  private; we think Twisted regrets exposing so much of twistd, although
  it's certainly been handy for us)
* another thought: what if we give up "tahoe start"?
  * only provide "tahoe run" (the non-daemonization tool)
  * and basically force operators to learn supervisord or systemd or
    daemontools
  * warner would miss "tahoe start", I think it's super handy
  * we could maybe make "tahoe start" spawn a "tahoe run", but I think
    that'd have all the same problems as above
* the "user story" for the above is, more-or-less: "the user should see
  any important early failure-messages on the console where they do
  'tahoe start'"