#3412 closed defect (fixed)

Many tests are flaky

Reported by: jaraco
Owned by:
Priority: normal
Milestone: Support Python 3
Component: unknown
Version: n/a
Keywords:
Cc:
Launchpad Bug:

Description (last modified by exarkun)

As observed in this run, part of a PR that doesn't change any code, test_POST_upload_replace fails twice: once with a twisted.web.error and once with a KeyError.

In this build, allmydata.test.test_system.SystemTest fails with a FailTest (#3321).

This run reveals that test_filesystem is flaky.

This run also looks flaky in test_status_path_404_error.

In this run, test_POST_rename_file failed with the same errors as test_POST_upload_replace, suggesting all of test_web is flaky.

Change History (22)

comment:2 Changed at 2020-09-12T16:33:58Z by jaraco

  • Description modified (diff)
  • Owner set to jaraco
  • Status changed from new to assigned
  • Summary changed from test_POST_upload_replace is flaky to Many tests are flaky

comment:3 Changed at 2020-09-12T16:47:18Z by jaraco

My plan is to use this issue as an umbrella to capture most of the currently flaky tests and mark each with a 'retry' decorator, bypassing the flakiness until someone has time to address the root cause.
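For illustration, here is a minimal sketch of the kind of 'retry' decorator described above, limited to synchronous test methods; the name, retry count, and exception handling are assumptions rather than the exact decorator used in the PR.

    import functools


    def retry(times=3):
        # Re-run a synchronous test method up to `times` times on failure.
        def decorator(f):
            @functools.wraps(f)
            def wrapper(*args, **kwargs):
                for attempt in range(times):
                    try:
                        return f(*args, **kwargs)
                    except Exception:
                        # Re-raise on the final attempt so the real failure
                        # is still reported; otherwise swallow it and retry.
                        if attempt == times - 1:
                            raise
            return wrapper
        return decorator

It would be applied per-method, e.g. `@retry(times=3)` on an individual flaky test.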

comment:4 Changed at 2020-09-12T16:47:25Z by jaraco

  • Milestone changed from undecided to Support Python 3

comment:5 Changed at 2020-09-12T17:36:26Z by jaraco

  • Description modified (diff)

comment:6 Changed at 2020-09-12T17:43:53Z by jaraco

  • Description modified (diff)

comment:7 Changed at 2020-09-12T17:44:15Z by jaraco

  • Description modified (diff)

comment:8 Changed at 2020-09-12T18:12:34Z by jaraco

  • Description modified (diff)

comment:9 Changed at 2020-09-12T19:41:28Z by jaraco

  • Keywords review-needed added

comment:10 Changed at 2020-09-12T19:43:29Z by jaraco

I've flagged this as review-needed, but I'm not confident the PR covers all of the flaky tests. Even the most recent runs are still failing on deprecation warnings (tracked separately in #3414). Given the amount of toil this work requires, I'd like to get an initial review of the approach and feedback on the issue generally.

comment:11 Changed at 2020-09-13T15:18:45Z by wearpants

The collective consensus has been that we'll live with flaky tests... though personally, I'd like to see some improvement here, as it comes up multiple times per week and CI is slow. As a result, we have eyeballed failing tests and decided to merge anyway (IIRC, this introduced a bug into master at least once, because the test was actually failing, not just flaky).

So I guess if there's a quick and dirty "fix" like retrying, that's worth doing, but we should discuss further before investing significant effort in addressing root causes.

comment:12 Changed at 2020-09-14T12:55:55Z by exarkun

though personally, I'd like to see some improvement here

I think everyone would like to see some improvement. It's just not clear how to make that happen.

comment:13 Changed at 2020-09-14T12:56:32Z by exarkun

This ticket is a partial duplicate of https://tahoe-lafs.org/trac/tahoe-lafs/ticket/3321

comment:14 Changed at 2020-09-14T13:27:03Z by exarkun

  • Keywords review-needed removed

comment:15 Changed at 2020-09-25T14:47:26Z by jaraco

In the PR, exarkun advised:

[These retries] only work for synchronous tests. If a test returns a Deferred and that Deferred fires with a failure, do they do anything at all?

So I reached out for advice.

In the #twisted IRC channel, I got some advice from tos9, Glyph, and others: trial allows test methods to return a Deferred object whose callbacks only execute later, so wrapping those methods in a retry decorator will have little effect.
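As a hedged illustration of that limitation (the test and names here are invented, and `retry` refers to the synchronous sketch in comment:3 above): the decorated method returns a pending Deferred immediately, so a try/except-style retry sees no exception, and the failure only materializes later, when trial waits on the Deferred.

    from twisted.internet import defer, reactor
    from twisted.trial import unittest


    class ExampleTests(unittest.TestCase):
        @retry(times=3)  # the synchronous retry sketch from comment:3
        def test_returns_deferred(self):
            d = defer.Deferred()
            # The errback fires on a later reactor turn, long after the
            # wrapper's try/except has already returned without raising.
            reactor.callLater(0, d.errback, RuntimeError("flaky"))
            return d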

I created trial-retry to work out the kinks, and there I landed on a technique that seems to have the intended effect.

Glyph had this to say about the approach:

Hopefully when we have some type-checking you'll get an error if you try to manipulate .callbacks like that entirely "from the outside", i.e. if you have a Deferred which definitely isn't in the process of firing, you can probably get away with .callbacks=something but in the middle of execution this may have arbitrarily weird effects; running some callbacks twice, failing to run others

It remains to be seen whether the issue is merely theoretical or whether such limitations might be triggered in this codebase.

Acknowledging this potential weakness of the approach, I plan to apply this technique to the work in progress and see if it improves the reliability of the test suite.

Last edited at 2020-09-25T15:47:14Z by jaraco (previous) (diff)

comment:16 Changed at 2020-09-25T15:50:28Z by jaraco

It turns out the problem is even more complicated than I imagined. As illustrated here, a test can create Deferred objects and does not even need to return them for them to be honored in the test results. And because the handling of those results is outside the scope of the method, there's no decorator on the method that can intervene when those deferred tests fail... at least, not without some support from the test runner.
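A hedged sketch of that situation (the helper and test are invented for illustration): the Deferred is never returned, so nothing wrapping the method sees its failure, yet trial still reports it, because the unhandled failure gets logged and trial treats errors logged during a test as test failures.

    from twisted.internet import defer
    from twisted.trial import unittest


    def start_background_work():
        # Stands in for a fire-and-forget call: it kicks off work and hands
        # back a Deferred that the caller may simply drop on the floor.
        d = defer.Deferred()
        d.errback(RuntimeError("flaky backend"))
        return d


    class FireAndForgetTests(unittest.TestCase):
        def test_fire_and_forget(self):
            # The Deferred is discarded; when it is garbage-collected its
            # unhandled failure is logged, trial flags the logged error,
            # and no decorator on this method gets a chance to intervene.
            start_background_work()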

comment:17 Changed at 2020-09-25T16:23:51Z by jaraco

Given that:

  • there's little prior art for dealing with flaky tests in Trial,
  • it's possible for deferred functionality to be present without any signal on the method itself,
  • any retry functionality is going to require special handling in the test runner,
  • the consensus is that flaky tests are acceptable,

I'm going to shelve this effort for now.

comment:18 Changed at 2020-10-09T19:38:00Z by jaraco

Here is some of the conversation about this effort:

[2020-09-25 11:33:39] <jaraco> Is there a way for a decorator on a method to detect all of the deferreds that were created during the method and wait on all of them to succeed and retry if any of them fail?
[2020-09-25 11:37:02] <jaraco> I’ve written https://github.com/jaraco/trial-retry/blob/13499eb3a5d9e8d068c28b3c40e52ca71a56a13a/test_everything.py#L40-L71 to achieve that for a single deferred in the return value.
[2020-09-25 11:38:30] <jaraco> Except, even that approach won’t work because it’s only retrying the callback. It’s not retrying the test.
[2020-09-25 11:58:17] <itamarst> there's gather_results
[2020-09-25 11:58:36] <itamarst> but...
[2020-09-25 11:58:54] <itamarst> I'm not sure why you can't retry whole method
[2020-09-25 11:59:19] <itamarst> oh, you don't want to mess with internal list
[2020-09-25 11:59:52] <itamarst> you want to addErrback a function that, for N times, returns the result of the original decorated function
[2020-09-25 12:00:50] <itamarst> a single retry would look like this:  `return result.addErrback(lambda _: f(*args, **kwargs))`
[2020-09-25 12:01:35] <itamarst> a niave "retry 3 times" is `return result.addErrback(lambda _: f(*args, **kwargs)).addErrback(lambda _: f(*args, **kwargs)).addErrback(lambda _: f(*args, **kwargs))`
[2020-09-25 12:01:53] <itamarst> you don't want to mess with `.callbacks` at all, just call `addErrback`
[2020-09-25 12:10:33] <jaraco> The problem is that the deferreds aren’t always returned by the method.
[2020-09-25 12:16:48] <jaraco> Such as in `test_status_path_404_error` - how does one attach an errback to the “up” call?
[2020-09-25 12:20:40] <jaraco> Also, adding errback isn’t enough to suppress the failure. You want to trap exceptions in the callback and errbacks of the deferreds created.
[2020-09-25 12:25:53] <itamarst> if isinstance(result, Deferred): result = result.addErrback(lambda _: [_.trap(), f(*args, **kwargs)][0])
[2020-09-25 12:25:56] <itamarst> return result
[2020-09-25 12:26:04] <itamarst> should be [1]
[2020-09-25 12:26:11] <itamarst> and really, shouldn't be lambda should be real function
[2020-09-25 12:26:16] <itamarst> but easier to do one liners in IRC :)
[2020-09-25 12:27:39] <itamarst> and it's probably not trap(), it's probably something with self.expectedFailures or something
[2020-09-25 12:27:45] <itamarst> I forget the API name
[2020-09-25 12:27:52] <itamarst> but the basic structure is as above, just addErrback
[2020-09-25 12:28:07] <itamarst> Deferreds get chained automatically
[2020-09-25 12:28:26] <itamarst> so you just need to suppress the "logged exceptions get marked as failures logic"
[2020-09-25 12:28:31] <itamarst> that Twisted's TestCase adds
[2020-09-25 12:28:43] <itamarst> back later if you want to pair
[2020-09-25 13:18:10] <exarkun> jaraco: I'm not sure this is going to be a fruitful avenue
[2020-09-25 13:18:27] <exarkun> jaraco: If just retrying the test method isn't sufficient then the test is probably so broken it needs to be rewritten
[2020-09-25 13:19:09] <exarkun> jaraco: Clobbering `Deferred.callbacks` is pretty much guaranteed to make it more broken rather than less
[2020-09-25 13:29:48] <itamarst> jaraco: if the worry is "what if someone does addCallback after the return", that will break the normal test infrastructure too, so you don't need to handle that case

I've thought about this some more and had some ideas, like: what if the asynchronous test could be synchronized and then retried? That doesn't work because the event loop has already been created for setup/teardown.
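For reference, here is a hedged consolidation of the addErrback structure itamarst sketches in the log above, written as a decorator with a named function instead of chained lambdas. The name and retry count are illustrative, and it only helps when the test actually returns its Deferred; it is not the technique used in the PR.

    import functools

    from twisted.internet.defer import maybeDeferred


    def retry_deferred(times=3):
        # Retry a test whose returned Deferred (or synchronous call) fails.
        def decorator(f):
            @functools.wraps(f)
            def wrapper(*args, **kwargs):
                def run(_failure=None):
                    # maybeDeferred turns synchronous results and exceptions
                    # into a Deferred, so one code path covers both cases.
                    return maybeDeferred(f, *args, **kwargs)

                d = run()
                for _ in range(times - 1):
                    # On failure the errback re-runs the test; on success it
                    # is skipped and the result passes straight through.
                    d.addErrback(run)
                return d
            return wrapper
        return decorator

As noted above, this still does nothing for Deferreds a test creates but never returns, which is what ultimately stalled the effort.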

comment:19 Changed at 2020-10-23T14:53:56Z by exarkun

  • Description modified (diff)

comment:20 Changed at 2020-10-23T14:54:46Z by exarkun

  • Description modified (diff)

comment:21 Changed at 2021-01-30T14:50:29Z by jaraco

  • Owner jaraco deleted
  • Status changed from assigned to new

comment:22 Changed at 2021-07-05T22:17:53Z by itamarst

  • Resolution set to fixed
  • Status changed from new to closed

This is happening... a lot less? Going to close it; we can open a new one if new issues arise.
