#3945 closed task (wontfix)

Retry moody GitHub Actions steps

Reported by: sajith
Owned by: sajith
Priority: normal
Milestone: undecided
Component: dev-infrastructure
Version: n/a
Keywords:
Cc:
Launchpad Bug:

Description

Some workflows fail on GitHub Actions either because the tests are moody or because GitHub Actions itself is moody. Example: https://github.com/tahoe-lafs/tahoe-lafs/actions/runs/3556042011/jobs/5973114477

2022-11-27T01:09:13.3236569Z [FAIL]
2022-11-27T01:09:13.3236873Z Traceback (most recent call last):
2022-11-27T01:09:13.3237795Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\allmydata\util\pollmixin.py", line 47, in _convert_done
2022-11-27T01:09:13.3238340Z     f.trap(PollComplete)
2022-11-27T01:09:13.3239166Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\python\failure.py", line 480, in trap
2022-11-27T01:09:13.3244610Z     self.raiseException()
2022-11-27T01:09:13.3245778Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\python\failure.py", line 504, in raiseException
2022-11-27T01:09:13.3259779Z     raise self.value.with_traceback(self.tb)
2022-11-27T01:09:13.3260719Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\internet\defer.py", line 206, in maybeDeferred
2022-11-27T01:09:13.3261254Z     result = f(*args, **kwargs)
2022-11-27T01:09:13.3261923Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\allmydata\util\pollmixin.py", line 69, in _poll
2022-11-27T01:09:13.3262457Z     self.fail("Errors snooped, terminating early")
2022-11-27T01:09:13.3262935Z twisted.trial.unittest.FailTest: Errors snooped, terminating early
2022-11-27T01:09:13.3263257Z 
2022-11-27T01:09:13.3263547Z allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
2022-11-27T01:09:13.3263989Z ===============================================================================
2022-11-27T01:09:13.3264288Z [ERROR]
2022-11-27T01:09:13.3264609Z Traceback (most recent call last):
2022-11-27T01:09:13.3265386Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\allmydata\util\rrefutil.py", line 26, in _no_get_version
2022-11-27T01:09:13.3268422Z     f.trap(Violation, RemoteException)
2022-11-27T01:09:13.3269217Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\python\failure.py", line 480, in trap
2022-11-27T01:09:13.3269711Z     self.raiseException()
2022-11-27T01:09:13.3270396Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\python\failure.py", line 504, in raiseException
2022-11-27T01:09:13.3270976Z     raise self.value.with_traceback(self.tb)
2022-11-27T01:09:13.3271553Z foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=4vg7) (during method=RIStorageServer.tahoe.allmydata.com:get_version)
2022-11-27T01:09:13.3271977Z 
2022-11-27T01:09:13.3272448Z allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
2022-11-27T01:09:13.3272884Z ===============================================================================
2022-11-27T01:09:13.3273207Z [ERROR]
2022-11-27T01:09:13.3273530Z Traceback (most recent call last):
2022-11-27T01:09:13.3274088Z Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=4vg7) (during method=RIUploadHelper.tahoe.allmydata.com:upload)
2022-11-27T01:09:13.3274512Z 
2022-11-27T01:09:13.3274802Z allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
2022-11-27T01:09:13.3275437Z -------------------------------------------------------------------------------
2022-11-27T01:09:13.3275958Z Ran 1776 tests in 1302.475s
2022-11-27T01:09:13.3276195Z 
2022-11-27T01:09:13.3276435Z FAILED (skips=27, failures=1, errors=2, successes=1748)

That failure has nothing to do with the changes that triggered that workflow; it might be a good idea to retry that step.

Some other workflows take a long time to run. Examples from https://github.com/tahoe-lafs/tahoe-lafs/actions/runs/3556042011/jobs/5973114477: coverage (ubuntu-latest, pypy-37), integration (ubuntu-latest, 3.7), and integration (ubuntu-latest, 3.9). Although in this specific instance the integration tests are failing due to #3943, it might be a good idea to retry them after a reasonable timeout, and to give up altogether after a number of tries instead of spinning for many hours on end.

Perhaps this would be a good use of actions/retry-step?
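For illustration, a minimal sketch of what wrapping a moody test step in such a retry action could look like. nick-fields/retry is just one of several "retry step" actions on the Marketplace, and the step name, values, and tox invocation below are assumptions rather than the actual tahoe-lafs workflow:

    # Hypothetical excerpt from a workflow job; not the actual tahoe-lafs CI config.
    - name: Run unit tests (retry moody failures)
      uses: nick-fields/retry@v2
      with:
        timeout_minutes: 60   # give up on a hung attempt well before GitHub's 6-hour job limit
        max_attempts: 3       # stop retrying after a bounded number of tries
        command: python -m tox -e py310-coverage

Actions like this typically let the per-attempt timeout and the attempt count be tuned per step, so a hung run can be cut short without retrying a genuinely failing test suite forever.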

Change History (6)

comment:1 Changed at 2022-11-30T15:00:06Z by exarkun

I don't think automatically rerunning the whole test suite when a test fails is a good idea.

If there is a real test failure, then the result is that CI takes N times as long to complete. If there is a spurious test failure that we're not aware of, then the result is that we don't become aware of it for much longer. If there is a spurious test failure that we are aware of, then the result is that it is swept under the rug and is much easier to ignore for much longer.

These all seem like downsides to me.

comment:2 Changed at 2022-11-30T15:26:08Z by sajith

Hmm, that is true. Do you think there's value in using a smaller timeout, though? Sometimes test runs seem to get stuck without terminating cleanly, as in this case, for example:

https://github.com/tahoe-lafs/tahoe-lafs/actions/runs/3525447679

Integration tests on Ubuntu ran for six hours, which I guess is GitHub's default timeout. From a developer experience perspective, it would be useful for them to fail sooner than that.
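For reference, GitHub's built-in timeout-minutes key can be set on a job or an individual step to make hung runs fail sooner than the 360-minute default. A rough sketch, with an arbitrary 90-minute guess at what "reasonable" means for the integration tests; the job layout is illustrative, not the actual tahoe-lafs workflow:

    # Hypothetical excerpt; only timeout-minutes is the point here.
    jobs:
      integration:
        runs-on: ubuntu-latest
        timeout-minutes: 90   # fail after 90 minutes instead of GitHub's 360-minute default
        steps:
          - uses: actions/checkout@v3
          - name: Run integration tests
            run: python -m tox -e integration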

comment:3 follow-up: Changed at 2022-11-30T18:06:31Z by meejah

A timeout of less than 6 hours sounds good (!!), but yeah, I mostly agree with what jean-paul is saying.

That said, _is_ there a ticket to explore that particular "known" spurious failure? (It seems somewhat "well known" that test_system sometimes has problems...)

comment:4 in reply to: ↑ 3 Changed at 2022-12-01T20:26:04Z by sajith

Replying to meejah:

That said, _is_ there a ticket to explore that particular "known" spurious failure? (It seems somewhat "well known" that test_system sometimes has problems...)

A quick search for "flaky", "spurious", and "test_upload_and_download_convergent" here in Trac turned up #3413, #3412, #1768, #1084, and this milestone: Integration and Unit Testing.

There might be more tickets. I guess all those tickets ideally should belong to that milestone.

Perhaps it would be worth collecting some data about these failures from runs on the master branch alone, since PR branches are likely to add too much noise. https://github.com/tahoe-lafs/tahoe-lafs/actions?query=branch%3Amaster does not look ideal. However, since GitHub doesn't keep test logs for long enough for organizations on free plans, collecting that data is going to be rather challenging.
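As a rough sketch of what that data collection could look like: a scheduled workflow could record recent master-branch run results with the gh CLI before they age out. Everything below (workflow name, schedule, JSON fields) is hypothetical, not an existing tahoe-lafs workflow:

    # Hypothetical scheduled workflow; names and schedule are illustrative.
    name: collect-ci-stats
    on:
      schedule:
        - cron: "0 6 * * *"   # once a day, before old runs age out
    jobs:
      snapshot:
        runs-on: ubuntu-latest
        steps:
          - name: List recent master-branch runs
            env:
              GH_TOKEN: ${{ github.token }}
            run: |
              gh run list --repo tahoe-lafs/tahoe-lafs --branch master \
                --limit 100 --json name,conclusion,createdAt,databaseId \
                > runs.json
          - name: Keep the snapshot as an artifact
            uses: actions/upload-artifact@v3
            with:
              name: ci-run-stats-${{ github.run_id }}
              path: runs.json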

Maybe fixing flaky tests is not worth the trouble, given the limited resources and the fact that this never has been annoying enough to become a priority. :-)

comment:5 Changed at 2022-12-01T20:34:45Z by exarkun

Replying to sajith:

Maybe fixing flaky tests is not worth the trouble, given the limited resources and the fact that this never has been annoying enough to become a priority. :-)

I wouldn't say this is the case. I spent a large chunk of time last year fixing flaky tests. The test suite is currently much more reliable than it was before that effort.

comment:6 Changed at 2022-12-12T17:45:47Z by exarkun

  • Resolution set to wontfix
  • Status changed from new to closed