#4126 assigned defect

Unit test suite inconsistently failing on CircleCI

Reported by: hacklschorsch Owned by: hacklschorsch
Priority: normal Milestone: undecided
Component: dev-infrastructure Version: n/a
Keywords: ci Cc: benoit@…
Launchpad Bug:

Description (last modified by btlogy)

  1. For at least 3 months (likely more, but older logs are no longer visible), test_verify_one_bad_encprivkey has been failing spuriously in the CircleCI logs (except on master, which was broken, see #4098).
  2. More recently, test_system.HTTPSystemTest is failing more often:

CI reactors under test.test_system on CircleCI fail inconsistently in both the tahoe-lafs and LeastAuthority orgs (which are not on the same plan). This cannot be reproduced locally on NixOS, nor on GitHub CI (inside similar Docker images).

A possible root cause is discussed in https://github.com/tahoe-lafs/tahoe-lafs/pull/1381#issuecomment-2476885548 where meejah writes:

The unclean-reactor errors may be simply a downstream symptom of the real errors that also happen in that run (e.g. several tests time out).

My own tests suggest that, indeed, raising the SystemTests timeout makes a couple of flaky tests much more stable:

Failure count  Test name
            1  allmydata.test.test_system.HTTPSystemTest.test_mutable_mdmf
            3  allmydata.test.test_system.HTTPSystemTest.test_mutable_sdmf
           30  allmydata.test.test_system.HTTPSystemTest.test_upload_and_download_convergent
           11  allmydata.test.test_system.HTTPSystemTest.test_upload_and_download_random_key
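
A per-test failure count like the one above can be produced by tallying failed test IDs across CI runs. The following is only a minimal sketch: the `failed_runs` list and the `tally_failures` helper are hypothetical, the entries are illustrative, and how the IDs are collected from CircleCI is not covered here.

```python
from collections import Counter

# Hypothetical input: one test ID per observed failure, e.g. copied by hand
# from the logs of several CI runs. These entries are illustrative only.
failed_runs = [
    "allmydata.test.test_system.HTTPSystemTest.test_mutable_sdmf",
    "allmydata.test.test_system.HTTPSystemTest.test_upload_and_download_convergent",
    "allmydata.test.test_system.HTTPSystemTest.test_upload_and_download_convergent",
]


def tally_failures(test_ids):
    """Return (test_id, count) pairs, most frequently failing test first."""
    return Counter(test_ids).most_common()


# Print a small "Failure count / Test name" listing.
for test_id, count in tally_failures(failed_runs):
    print(f"{count:>3}  {test_id}")
```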

This ticket is similar but not identical to ticket:4085, ticket:4022, and ticket:2994.

NOTE: there is an ongoing collaborative effort to get to the bottom of this issue using this tmp doc: https://cryptpad.fr/code/#/2/code/view/ApS8GZH4OfKbR71RdkRa1LLClaJk88emHeW0yvwhHkk/

Change History (13)

comment:1 Changed at 2024-11-15T15:22:05Z by hacklschorsch

  • Description modified (diff)
  • Owner set to hacklschorsch

comment:2 Changed at 2024-11-15T15:22:44Z by hacklschorsch

  • Summary changed from Tests time out on CircleCI to Tests time out on CircleCI, subsequent 'unclean reactor' errors

comment:3 Changed at 2024-11-15T15:23:06Z by hacklschorsch

  • Status changed from new to assigned

comment:4 Changed at 2024-11-15T15:40:38Z by hacklschorsch

A better fix to the dirty reactors might be to clean them up - and indeed, SystemTests does some reactor-cleanup dance in tearDown(), which *should* also be called if a test fails - but it seems that does not always happen or always help.

The unittest documentation for tearDown() says: "This is called even if the test method raised an exception [...] This method will only be called if the setUp() succeeds".

Also see the discussion in https://stackoverflow.com/questions/39883058/teardown-not-called-after-timeout-in-twisted-trial

comment:6 Changed at 2024-12-04T13:45:03Z by btlogy

  • Component changed from unknown to dev-infrastructure
  • Description modified (diff)
  • Keywords ci added
  • Summary changed from Tests time out on CircleCI, subsequent 'unclean reactor' errors to CI test_system fails inconsistently

If I'm not mistaken, those reactor

comment:7 Changed at 2024-12-06T00:17:26Z by btlogy

  • Description modified (diff)
  • Summary changed from CI test_system fails inconsistently to Unit test suite inconsistently failing on CircleCI

comment:8 Changed at 2024-12-06T00:18:31Z by btlogy

  • Cc benoit@… added

comment:9 Changed at 2024-12-06T23:48:57Z by btlogy

I'm also observing some similar inconsistent failure on GHA: https://github.com/tahoe-lafs/tahoe-lafs/pull/1404#issuecomment-2524644247

  • ubuntu-20.04, py-3.11: FAILED integration/test_web.py::test_upload_download

ValueError: Expected a 2xx code, got 500

  • macos-14, py-3.11: ERROR integration/test_tor.py::test_onion_service_storage

twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended with exit code 255

comment:10 Changed at 2024-12-07T02:06:36Z by btlogy

This may also be happening on Windows:

https://app.circleci.com/pipelines/github/LeastAuthority/tahoe-lafs/1216/workflows/27697f8b-e55e-4a4d-953a-09c2dfbce129/jobs/14801/tests#failed-test-0

allmydata.test.test_system.HTTPSystemTest: test_mutable_sdmf

allmydata.mutable.common.NotEnoughServersError: ("Publish ran out of good servers, last failure was: [Failure instance: Traceback: <class 'twisted.internet.error.ConnectingCancelledError'>

comment:11 Changed at 2024-12-07T02:08:24Z by btlogy

  • Description modified (diff)

comment:12 Changed at 2024-12-07T10:16:33Z by btlogy

  • Description modified (diff)

comment:13 Changed at 2025-01-12T06:40:12Z by hacklschorsch

Some useful statistics on which tests fail the most, along with a flaky-test detection that is less useful because it does not seem to work for us, can be found here: https://app.circleci.com/insights/github/tahoe-lafs/tahoe-lafs/workflows/ci/tests

  • "Flaky tests" is empty (?)
  • "Most Failed Tests" contains our flaky tests (which are probably failing because of earlier problems)
  • The first four or so pages (ten tests each) of "Slowest Tests" include basically all the SystemTest and HTTPSystemTest suites that fail randomly and too often, and that should probably be integration tests instead (?)
Last edited at 2025-01-12T06:41:19Z by hacklschorsch (previous) (diff)