Opened at 2024-11-15T15:14:31Z
Last modified at 2025-01-12T06:41:19Z
#4126 assigned defect
Unit test suite inconsistently failing on CircleCI
Reported by: | hacklschorsch | Owned by: | hacklschorsch |
---|---|---|---|
Priority: | normal | Milestone: | undecided |
Component: | dev-infrastructure | Version: | n/a |
Keywords: | ci | Cc: | benoit@… |
Launchpad Bug: |
Description (last modified by btlogy)
- For at least 3 months (likely longer, but older logs are no longer visible), test_verify_one_bad_encprivkey has been failing spuriously in the CircleCI logs (except on master, which was broken; see #4098)
- More recently, test_system.HTTPSystemTest is failing more often:
CI reactors under test.test_system on CircleCI fail inconsistently in both the tahoe-lafs and LeastAuthority orgs (which are not on the same plan). This cannot be reproduced locally on NixOS or on GitHub CI (inside similar Docker images).
A possible root cause is discussed in https://github.com/tahoe-lafs/tahoe-lafs/pull/1381#issuecomment-2476885548 where meejah writes:
The unclean-reactor errors may be simply a downstream symptom of the real errors that also happen in that run (e.g. several tests time out).
My own tests suggest that raising the SystemTests timeout does indeed make a couple of flaky tests much more stable:
Failure count | Test name |
---|---|
1 | allmydata.test.test_system.HTTPSystemTest.test_mutable_mdmf |
3 | allmydata.test.test_system.HTTPSystemTest.test_mutable_sdmf |
30 | allmydata.test.test_system.HTTPSystemTest.test_upload_and_download_convergent |
11 | allmydata.test.test_system.HTTPSystemTest.test_upload_and_download_random_key |
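A tally like the one above could be produced with a small script. The following is only a minimal sketch, assuming the failing test names of each CI run have already been extracted into one list per run; `tally_failures`, `format_table` and the sample names are hypothetical and not part of the Tahoe-LAFS tooling. Each test is counted at most once per run, so a test that reports both a failure and a follow-up unclean-reactor error in the same run is not counted twice.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tally_failures(runs: Iterable[Iterable[str]]) -> List[Tuple[int, str]]:
    """Count in how many runs each test name failed.

    `runs` is an iterable of per-run collections of failing test names
    (hypothetical input format; extracting the names from CI logs is
    out of scope here).
    """
    counts: Counter = Counter()
    for failed_names in runs:
        # set() de-duplicates names reported more than once in a single run
        counts.update(set(failed_names))
    # Most frequent first, ties broken alphabetically for stable output
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(count, name) for name, count in ordered]


def format_table(rows: List[Tuple[int, str]]) -> str:
    """Render the tally in the same pipe-separated layout as the ticket."""
    lines = ["Failure count | Test name |"]
    lines.extend(f"{count} | {name} |" for count, name in rows)
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder names only; these are not real results.
    sample_runs = [
        ["pkg.test_a", "pkg.test_b", "pkg.test_a"],
        ["pkg.test_a"],
        [],
    ]
    print(format_table(tally_failures(sample_runs)))
```

Counting per run rather than per log line keeps the numbers comparable between runs with different amounts of follow-up noise.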
This ticket is similar but not identical to ticket:4085, ticket:4022, ticket:2994.
NOTE: there is an ongoing collaborative effort to get to the bottom of this issue using this tmp doc: https://cryptpad.fr/code/#/2/code/view/ApS8GZH4OfKbR71RdkRa1LLClaJk88emHeW0yvwhHkk/
Change History (13)
comment:1 Changed at 2024-11-15T15:22:05Z by hacklschorsch
- Description modified (diff)
- Owner set to hacklschorsch
comment:2 Changed at 2024-11-15T15:22:44Z by hacklschorsch
- Summary changed from Tests time out on CircleCI to Tests time out on CircleCI, subsequent 'unclean reactor' errors
comment:3 Changed at 2024-11-15T15:23:06Z by hacklschorsch
- Status changed from new to assigned
comment:4 Changed at 2024-11-15T15:40:38Z by hacklschorsch
comment:5 Changed at 2024-11-15T15:41:19Z by hacklschorsch
Raising timeout in https://github.com/tahoe-lafs/tahoe-lafs/pull/1387
comment:6 Changed at 2024-12-04T13:45:03Z by btlogy
- Component changed from unknown to dev-infrastructure
- Description modified (diff)
- Keywords ci added
- Summary changed from Tests time out on CircleCI, subsequent 'unclean reactor' errors to CI test_system fails inconsistently
If I'm not mistaken, those reactor
comment:7 Changed at 2024-12-06T00:17:26Z by btlogy
- Description modified (diff)
- Summary changed from CI test_system fails inconsistently to Unit test suite inconsistently failing on CircleCI
comment:8 Changed at 2024-12-06T00:18:31Z by btlogy
- Cc benoit@… added
comment:9 Changed at 2024-12-06T23:48:57Z by btlogy
I'm also observing similar inconsistent failures on GHA: https://github.com/tahoe-lafs/tahoe-lafs/pull/1404#issuecomment-2524644247
- ubuntu-20.04, py-3.11: FAILED integration/test_web.py::test_upload_download
ValueError: Expected a 2xx code, got 500
- macos-14, py-3.11: ERROR integration/test_tor.py::test_onion_service_storage
twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended with exit code 255
comment:10 Changed at 2024-12-07T02:06:36Z by btlogy
This also seems to be happening on Windows:
allmydata.test.test_system.HTTPSystemTest: test_mutable_sdmf
allmydata.mutable.common.NotEnoughServersError: ("Publish ran out of good servers, last failure was: [Failure instance: Traceback: <class 'twisted.internet.error.ConnectingCancelledError'>
comment:11 Changed at 2024-12-07T02:08:24Z by btlogy
- Description modified (diff)
comment:12 Changed at 2024-12-07T10:16:33Z by btlogy
- Description modified (diff)
comment:13 Changed at 2025-01-12T06:40:12Z by hacklschorsch
Some useful statistics on which tests fail the most, along with a flaky-test detection that is less useful (it does not seem to work for us), can be found here: https://app.circleci.com/insights/github/tahoe-lafs/tahoe-lafs/workflows/ci/tests
- "Flaky tests" is empty (?)
- "Most Failed Tests" contains our flaky tests (which are probably failing because of earlier problems)
- The first four or so pages (ten tests each) of "Slowest Tests" include basically all the SystemTest and HTTPSystemTest suites that fail randomly and too often, and that probably should be integration tests instead (?)
A better fix to the dirty reactors might be to clean them up - and indeed, SystemTests does some reactor-cleanup dance in tearDown(), which *should* also be called if a test fails - but it seems that does not always happen or always help.
Also see the discussion in https://stackoverflow.com/questions/39883058/teardown-not-called-after-timeout-in-twisted-trial
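As a point of comparison, the expectation that tearDown() runs after a failing test can be checked against the standard library's unittest. This is only a sketch under stated assumptions: it uses plain `unittest.TestCase` rather than trial's TestCase, it simulates an ordinary assertion failure rather than a timeout, and `run_demo` and the `Demo` class are hypothetical names. It therefore says nothing about how trial behaves when a test times out, which is the case the linked Stack Overflow question is about.

```python
import unittest


def run_demo():
    """Run one failing test and record which hooks were called, in order."""
    events = []

    class Demo(unittest.TestCase):
        def setUp(self):
            # Cleanups registered here run after tearDown()
            self.addCleanup(events.append, "cleanup")

        def tearDown(self):
            events.append("tearDown")

        def test_fails(self):
            events.append("test")
            self.fail("simulated failure")

    result = unittest.TestResult()
    Demo("test_fails").run(result)
    return events, len(result.failures)


if __name__ == "__main__":
    events, failure_count = run_demo()
    print(events, failure_count)
```

In plain unittest, cleanups registered with addCleanup() also run when setUp() itself fails, whereas tearDown() does not, which is one reason to prefer addCleanup() for resources that must always be released.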