#4186 new defect

One server process did not start on testgrid due to PID File collision

Reported by: hacklschorsch
Owned by:
Priority: normal
Milestone: undecided
Component: unknown
Version: n/a
Keywords:
Cc:
Launchpad Bug:

Description

After a hardware failure and the subsequent reboot, one of the storage node servers on the testgrid did not start because the PID recorded in its PID file was already in use.

This issue seems two-fold:

  1. PID files are still used even when they shouldn't be. I thought I had turned them off by setting pidfile= (empty), because systemd does not need them, but that does not seem to work.
  2. The PID file mechanism should compare PID *and* start time (as documented at https://tahoe-lafs.readthedocs.io/en/latest/running.html#multiple-instances), but it seems the recorded start time did not help to tell the process currently running under that PID apart from the node's own process.
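
To illustrate the check expected in point 2, here is a minimal sketch, not the actual Tahoe-LAFS implementation. It assumes the PID file holds the PID and the process start time as two whitespace-separated fields. The function names, the tolerance value, and the injected lookup_start_time callable are hypothetical and exist only for illustration.

```python
from typing import Callable, Optional, Tuple


def parse_pidfile(text: str) -> Tuple[int, float]:
    """Parse hypothetical PID file contents of the form '<pid> <start_time>'."""
    pid_str, start_str = text.strip().split()
    return int(pid_str), float(start_str)


def is_same_process(
    recorded_pid: int,
    recorded_start: float,
    lookup_start_time: Callable[[int], Optional[float]],
    tolerance: float = 1.0,
) -> bool:
    """Return True only if the recorded PID is alive AND its start time matches.

    lookup_start_time is an assumed helper: it returns the start time of the
    process currently using the PID, or None if no such process exists.
    """
    actual_start = lookup_start_time(recorded_pid)
    if actual_start is None:
        # Nobody is using the PID any more: the PID file is stale.
        return False
    # A reused PID belongs to a different process if the start times differ.
    return abs(actual_start - recorded_start) <= tolerance
```

With a check like this, a PID that was reused by an unrelated process after a reboot would be treated as stale as long as the start times differ, so the node could start instead of failing. The lookup could be backed by something like psutil.Process(pid).create_time(), but that is an assumption about how one might wire it up.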

Snippet:

Aug 20 04:52:48 testgrid tahoe[5284]: ERROR: A process is already running as PID 738
Aug 20 04:52:48 testgrid tahoe[5284]: 'tahoe run' in '/var/lib/tahoe-lafs/alpha'

Full systemctl status output:

[root@testgrid:~]# systemctl status tahoe.alpha.service
× tahoe.alpha.service - Tahoe LAFS node alpha
     Loaded: loaded (/etc/systemd/system/tahoe.alpha.service; enabled; preset: ignored)
     Active: failed (Result: exit-code) since Wed 2025-08-20 04:52:48 UTC; 1min 21s ago
   Duration: 1.533s
 Invocation: debbaab77c594b87983e657284f8d1b1
    Process: 5280 ExecStartPre=/nix/store/sgw152fwaab21wr3fch4ig8cqcz3nw3n-unit-script-tahoe.alpha-pre-start/bin/tahoe.alpha-pre-start>
    Process: 5284 ExecStart=/nix/store/v0a0359ssg6avrf34a06kaz02cmz860p-python3-tahoe-lafs/bin/tahoe run --allow-stdin-close $STATE_DI>
   Main PID: 5284 (code=exited, status=1/FAILURE)
         IP: 0B in, 0B out
         IO: 0B read, 0B written
   Mem peak: 61.6M
        CPU: 1.514s

Aug 20 04:52:46 testgrid systemd[1]: Starting Tahoe LAFS node alpha...
Aug 20 04:52:46 testgrid systemd[1]: Started Tahoe LAFS node alpha.
Aug 20 04:52:48 testgrid tahoe[5284]: ERROR: A process is already running as PID 738
Aug 20 04:52:48 testgrid tahoe[5284]: 'tahoe run' in '/var/lib/tahoe-lafs/alpha'
Aug 20 04:52:48 testgrid systemd[1]: tahoe.alpha.service: Main process exited, code=exited, status=1/FAILURE
Aug 20 04:52:48 testgrid systemd[1]: tahoe.alpha.service: Failed with result 'exit-code'.
Aug 20 04:52:48 testgrid systemd[1]: tahoe.alpha.service: Consumed 1.514s CPU time, 61.6M memory peak.

Change History (2)

comment:1 Changed at 2025-08-20T14:22:17Z by hacklschorsch

https://docs.twisted.org/en/stable/core/howto/systemd.html#create-a-systemd-service-file

The --pidfile= flag prevents twistd from writing a pidfile. A pidfile is not necessary when Twisted runs as a foreground process.

comment:2 Changed at 2025-08-29T13:29:24Z by hacklschorsch

This PR might have a fix for this: https://github.com/tahoe-lafs/tahoe-lafs/pull/1412
