#4126 assigned defect

Unit test suite inconsistently failing on CircleCI

Reported by: hacklschorsch
Owned by: hacklschorsch
Priority: normal
Milestone: undecided
Component: dev-infrastructure
Version: n/a
Keywords: ci
Cc: benoit@…
Launchpad Bug:

Description (last modified by btlogy)

  1. For at least 3 months (likely longer, but older logs are no longer available), test_verify_one_bad_encprivkey has been failing spuriously in the CircleCI logs (except on master, which was broken: #4098).
  2. More recently, test_system.HTTPSystemTest is failing more often:

The tests under test.test_system fail inconsistently on CircleCI, in both the tahoe-lafs AND LeastAuthority orgs (which are not on the same plan). This cannot be reproduced locally on NixOS, nor on GitHub CI (inside similar Docker images).

A possible root cause is discussed in https://github.com/tahoe-lafs/tahoe-lafs/pull/1381#issuecomment-2476885548 where meejah writes:

The unclean-reactor errors may be simply a downstream symptom of the real errors that also happen in that run (e.g. several tests time out).

My own tests suggest that raising the SystemTests timeout does indeed make a couple of flaky tests much more stable:

Failure count  Test name
1              allmydata.test.test_system.HTTPSystemTest.test_mutable_mdmf
3              allmydata.test.test_system.HTTPSystemTest.test_mutable_sdmf
30             allmydata.test.test_system.HTTPSystemTest.test_upload_and_download_convergent
11             allmydata.test.test_system.HTTPSystemTest.test_upload_and_download_random_key
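
A tally like this one can be produced by counting failed test IDs across repeated CI runs. The following is only a minimal sketch of that bookkeeping: the `tally_failures` helper and the `failed_runs` sample data are hypothetical and are not taken from the actual experiment.

```python
from collections import Counter


def tally_failures(failed_runs):
    """Count in how many runs each test ID failed.

    `failed_runs` is a list with one inner list of failed test IDs per run.
    """
    counts = Counter()
    for failed_ids in failed_runs:
        # Count each test at most once per run, even if it is listed twice.
        counts.update(set(failed_ids))
    return counts


# Hypothetical sample data, for illustration only.
failed_runs = [
    ["allmydata.test.test_system.HTTPSystemTest.test_mutable_sdmf"],
    [
        "allmydata.test.test_system.HTTPSystemTest.test_mutable_sdmf",
        "allmydata.test.test_system.HTTPSystemTest.test_mutable_mdmf",
    ],
    [],  # a run without failures
]

for test_id, count in tally_failures(failed_runs).most_common():
    print(count, test_id)
```

Counting per run rather than per occurrence keeps a single noisy run from dominating the table.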

This ticket is similar, but not identical, to ticket:4085, ticket:4022 and ticket:2994.

NOTE: there is an ongoing collaborative effort to get to the bottom of this issue using this tmp doc: https://cryptpad.fr/code/#/2/code/view/ApS8GZH4OfKbR71RdkRa1LLClaJk88emHeW0yvwhHkk/

Change History (13)

comment:1 Changed at 2024-11-15T15:22:05Z by hacklschorsch

  • Description modified (diff)
  • Owner set to hacklschorsch

comment:2 Changed at 2024-11-15T15:22:44Z by hacklschorsch

  • Summary changed from Tests time out on CircleCI to Tests time out on CircleCI, subsequent 'unclean reactor' errors

comment:3 Changed at 2024-11-15T15:23:06Z by hacklschorsch

  • Status changed from new to assigned

comment:4 Changed at 2024-11-15T15:40:38Z by hacklschorsch

A better fix for the dirty reactors might be to clean them up. Indeed, SystemTests does some reactor-cleanup dance in tearDown(), which *should* also be called if a test fails, but it seems that this does not always happen, or does not always help.

The Python unittest documentation says about tearDown(): "This is called even if the test method raised an exception [...] This method will only be called if the setUp() succeeds"

Also see the discussion in https://stackoverflow.com/questions/39883058/teardown-not-called-after-timeout-in-twisted-trial

comment:6 Changed at 2024-12-04T13:45:03Z by btlogy

  • Component changed from unknown to dev-infrastructure
  • Description modified (diff)
  • Keywords ci added
  • Summary changed from Tests time out on CircleCI, subsequent 'unclean reactor' errors to CI test_system fails inconsistently

If I'm not mistaken, those reactor

comment:7 Changed at 2024-12-06T00:17:26Z by btlogy

  • Description modified (diff)
  • Summary changed from CI test_system fails inconsistently to Unit test suite inconsistently failing on CircleCI

comment:8 Changed at 2024-12-06T00:18:31Z by btlogy

  • Cc benoit@… added

comment:9 Changed at 2024-12-06T23:48:57Z by btlogy

I'm also observing similar inconsistent failures on GHA: https://github.com/tahoe-lafs/tahoe-lafs/pull/1404#issuecomment-2524644247

  • ubuntu-20.04, py-3.11: FAILED integration/test_web.py::test_upload_download

ValueError: Expected a 2xx code, got 500

  • macos-14, py-3.11: ERROR integration/test_tor.py::test_onion_service_storage

twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended with exit code 255

comment:10 Changed at 2024-12-07T02:06:36Z by btlogy

This may also be happening on Windows:

https://app.circleci.com/pipelines/github/LeastAuthority/tahoe-lafs/1216/workflows/27697f8b-e55e-4a4d-953a-09c2dfbce129/jobs/14801/tests#failed-test-0

allmydata.test.test_system.HTTPSystemTest: test_mutable_sdmf

allmydata.mutable.common.NotEnoughServersError: ("Publish ran out of good servers, last failure was: [Failure instance: Traceback: <class 'twisted.internet.error.ConnectingCancelledError'>

comment:11 Changed at 2024-12-07T02:08:24Z by btlogy

  • Description modified (diff)

comment:12 Changed at 2024-12-07T10:16:33Z by btlogy

  • Description modified (diff)

comment:13 Changed at 2025-01-12T06:40:12Z by hacklschorsch

Some useful statistics on which tests fail the most can be found here, along with a flaky-test detection that is less useful since it does not seem to work for us: https://app.circleci.com/insights/github/tahoe-lafs/tahoe-lafs/workflows/ci/tests

  • "Flaky tests" is empty (?)
  • "Most Failed Tests" contains our flaky tests (which are probably failing because of earlier problems)
  • The first four or so pages (ten tests each) of "Slowest Tests" include basically all the SystemTest and HTTPSystemTest suites that fail randomly and too often, and that probably should be integration tests instead (?)

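A similar "slowest tests" ranking could be computed offline, assuming the CI job stores JUnit-style XML test reports. The sketch below is not how CircleCI Insights works internally; the `slowest_tests` helper and the sample report are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical JUnit-style report, for illustration only.
SAMPLE_REPORT = """\
<testsuite name="example">
  <testcase classname="pkg.test_system.SystemTest" name="test_slow" time="95.2"/>
  <testcase classname="pkg.test_system.SystemTest" name="test_flaky" time="120.0">
    <error message="timed out"/>
  </testcase>
  <testcase classname="pkg.test_util.UtilTest" name="test_fast" time="0.01"/>
</testsuite>
"""


def slowest_tests(report_xml, limit=10):
    """Return (seconds, test_id, failed) tuples, slowest first."""
    root = ET.fromstring(report_xml)
    rows = []
    for case in root.iter("testcase"):
        test_id = f"{case.get('classname')}.{case.get('name')}"
        seconds = float(case.get("time", "0"))
        # A <failure> or <error> child marks the test as not passing.
        failed = case.find("failure") is not None or case.find("error") is not None
        rows.append((seconds, test_id, failed))
    rows.sort(reverse=True)
    return rows[:limit]


for seconds, test_id, failed in slowest_tests(SAMPLE_REPORT):
    print(f"{seconds:8.2f}s  {'FAIL' if failed else 'ok  '}  {test_id}")
```

Listing the pass/fail flag next to the duration makes it easier to see whether the slow tests and the failing tests are the same ones.
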
Version 0, edited at 2025-01-12T06:40:12Z by hacklschorsch