Opened at 2024-11-15T15:14:31Z
Last modified at 2025-01-12T06:41:19Z
#4126 assigned defect
Unit test suite inconsistently failing on CircleCI
Reported by: | hacklschorsch | Owned by: | hacklschorsch |
---|---|---|---|
Priority: | normal | Milestone: | undecided |
Component: | dev-infrastructure | Version: | n/a |
Keywords: | ci | Cc: | benoit@… |
Launchpad Bug: |
Description (last modified by btlogy)
- For at least 3 months (likely longer, but older logs are no longer available), test_verify_one_bad_encprivkey has been failing spuriously in the CircleCI logs (except on master, which was broken, see #4098)
- More recently, test_system.HTTPSystemTest is failing more often:
CI reactors under test.test_system on CircleCI fail inconsistently in both the tahoe-lafs and LeastAuthority orgs (which are not on the same plan). This cannot be reproduced locally on NixOS, nor on GitHub CI (inside similar Docker images).
A possible root cause is discussed in https://github.com/tahoe-lafs/tahoe-lafs/pull/1381#issuecomment-2476885548 where meejah writes:
The unclean-reactor errors may be simply a downstream symptom of the real errors that also happen in that run (e.g. several tests time out).
My own tests suggest that, indeed, raising the SystemTests timeout makes a couple of flaky tests much more stable:
Failure count | Test name |
---|---|
1 | allmydata.test.test_system.HTTPSystemTest.test_mutable_mdmf |
3 | allmydata.test.test_system.HTTPSystemTest.test_mutable_sdmf |
30 | allmydata.test.test_system.HTTPSystemTest.test_upload_and_download_convergent |
11 | allmydata.test.test_system.HTTPSystemTest.test_upload_and_download_random_key |
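To see how these failures are distributed, the counts from the table above can be ranked by their share of the total. This is only a small sketch: the helper name `failure_shares` is made up for illustration, and since the ticket does not say how many runs are behind the counts, the percentages are shares of the observed failures, not failure rates.

```python
# Failure counts copied from the table in this ticket
# (test names shortened to the method name).
failures = {
    "test_mutable_mdmf": 1,
    "test_mutable_sdmf": 3,
    "test_upload_and_download_convergent": 30,
    "test_upload_and_download_random_key": 11,
}


def failure_shares(counts):
    """Return (name, count, share_of_total) tuples, most frequent first.

    Hypothetical helper, not part of the Tahoe-LAFS code base.
    """
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, n, n / total) for name, n in ranked]


if __name__ == "__main__":
    for name, n, share in failure_shares(failures):
        print(f"{n:3d}  {share:6.1%}  {name}")
```

With these numbers, the two test_upload_and_download_* tests account for 41 of the 45 recorded failures.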
This ticket is similar but not identical to ticket:4085, ticket:4022 and ticket:2994.
NOTE: there is an ongoing collaborative effort to get to the bottom of this issue using this tmp doc: https://cryptpad.fr/code/#/2/code/view/ApS8GZH4OfKbR71RdkRa1LLClaJk88emHeW0yvwhHkk/
Change History (13)
comment:1 Changed at 2024-11-15T15:22:05Z by hacklschorsch
- Description modified (diff)
- Owner set to hacklschorsch
comment:2 Changed at 2024-11-15T15:22:44Z by hacklschorsch
- Summary changed from Tests time out on CircleCI to Tests time out on CircleCI, subsequent 'unclean reactor' errors
comment:3 Changed at 2024-11-15T15:23:06Z by hacklschorsch
- Status changed from new to assigned
comment:4 Changed at 2024-11-15T15:40:38Z by hacklschorsch
comment:5 Changed at 2024-11-15T15:41:19Z by hacklschorsch
Raising timeout in https://github.com/tahoe-lafs/tahoe-lafs/pull/1387
comment:6 Changed at 2024-12-04T13:45:03Z by btlogy
- Component changed from unknown to dev-infrastructure
- Description modified (diff)
- Keywords ci added
- Summary changed from Tests time out on CircleCI, subsequent 'unclean reactor' errors to CI test_system fails inconsistently
If I'm not mistaken, those reactor
comment:7 Changed at 2024-12-06T00:17:26Z by btlogy
- Description modified (diff)
- Summary changed from CI test_system fails inconsistently to Unit test suite inconsistently failing on CircleCI
comment:8 Changed at 2024-12-06T00:18:31Z by btlogy
- Cc benoit@… added
comment:9 Changed at 2024-12-06T23:48:57Z by btlogy
I'm also observing some similar inconsistent failures on GHA: https://github.com/tahoe-lafs/tahoe-lafs/pull/1404#issuecomment-2524644247
- ubuntu-20.04, py-3.11: FAILED integration/test_web.py::test_upload_download
ValueError: Expected a 2xx code, got 500
- macos-14, py-3.11: ERROR integration/test_tor.py::test_onion_service_storage
twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended with exit code 255
comment:10 Changed at 2024-12-07T02:06:36Z by btlogy
This may also be happening on Windows:
allmydata.test.test_system.HTTPSystemTest: test_mutable_sdmf
allmydata.mutable.common.NotEnoughServersError: ("Publish ran out of good servers, last failure was: [Failure instance: Traceback: <class 'twisted.internet.error.ConnectingCancelledError'>
comment:11 Changed at 2024-12-07T02:08:24Z by btlogy
- Description modified (diff)
comment:12 Changed at 2024-12-07T10:16:33Z by btlogy
- Description modified (diff)
comment:13 Changed at 2025-01-12T06:40:12Z by hacklschorsch
Some useful statistics on which tests fail the most, plus a flaky-test detection that is less useful (since it does not seem to work for us?), can be found here: https://app.circleci.com/insights/github/tahoe-lafs/tahoe-lafs/workflows/ci/tests
- "Flaky tests" is empty (?)
- "Most Failed Tests" contains our flaky tests (which are probably failing because of earlier problems)
- The first four or so pages (of ten tests each) of "Slowest Tests" include basically all the SystemTest and HTTPSystemTest suites that fail randomly and too often, and that probably should be integration tests instead (?)
A better fix to the dirty reactors might be to clean them up - and indeed, SystemTests does some reactor-cleanup dance in tearDown(), which *should* also be called if a test fails - but it seems that does not always happen or always help.
Also see the discussion in https://stackoverflow.com/questions/39883058/teardown-not-called-after-timeout-in-twisted-trial