fix(forgejo): instrument deploy runner watchdog #278

Merged

pdurlej merged 1 commit from codex/260/runner-watchdog-diagnostics into main

2026-05-14 10:43:39 +02:00

codex commented

2026-05-14 10:29:11 +02:00

Collaborator

Canary status: missing — fire canary 3+3 manually before merge

Summary

Adds pre-restart diagnostics to scripts/forgejo/deploy-runner-watchdog for #260.

The watchdog still uses the same stuck-job predicate and the same WAIT_SECONDS=120 default. When it detects a stuck trusted-main auto-apply job, it now logs evidence before restarting the deploy runner:

stuck platformctl-auto-apply.yml run/job rows with age_seconds and task assignment fields;
candidate deploy-host runner rows, including stale duplicate candidates if present;
recent action_task rows relevant to runner assignment;
systemctl show status for forgejo-deploy-runner.service before restart.

Diagnostic query failures are logged but do not block the existing restart path.
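
The "log but never block" behavior described above can be modeled as follows. This is a minimal Python sketch of the control flow only: the real watchdog is a shell script, and the names `run_diagnostics`, `watchdog_cycle`, and the step callables are hypothetical, not taken from the implementation.

```python
import logging
from typing import Callable, Iterable, Tuple

log = logging.getLogger("watchdog-sketch")

# A diagnostic step is a (name, callable) pair; the callable returns text to log.
Step = Tuple[str, Callable[[], str]]


def run_diagnostics(steps: Iterable[Step]) -> list[str]:
    """Run every diagnostic step, logging failures instead of raising."""
    failed = []
    for name, step in steps:
        try:
            log.info("diagnostic %s: %s", name, step())
        except Exception as exc:  # best effort: a failed snapshot must not propagate
            log.warning("diagnostic %s failed: %s", name, exc)
            failed.append(name)
    return failed


def watchdog_cycle(steps: Iterable[Step], restart: Callable[[], None]) -> list[str]:
    """Collect evidence first, then take the restart path unconditionally."""
    failed = run_diagnostics(steps)
    restart()  # reached even if every diagnostic step failed
    return failed
```

The point of the design is that `restart()` sits outside any error handling for the diagnostics, so a broken query can only reduce the evidence collected and cannot change whether the restart happens.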

Canary Context Pack

Product story

We need the next deploy-runner pickup incident to leave enough evidence to distinguish Forgejo assignment drift, stale runner rows, service state, and task creation timing without guessing or mutating production data.

What changed

The watchdog gained read-only diagnostic snapshots before restart, plus tests that pin the expected evidence fields.
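
A test that pins evidence fields might look roughly like the sketch below. It is illustrative only and is not the content of `tests/test_deploy_runner_watchdog.py`: `age_seconds` comes from this PR, while `task_id`, `runner_id`, the sample log line, and the helper name `missing_evidence_fields` are assumptions.

```python
# Only age_seconds is named in the PR; task_id and runner_id are
# hypothetical stand-ins for the "task assignment fields".
REQUIRED_FIELDS = ("age_seconds", "task_id", "runner_id")


def missing_evidence_fields(log_text: str, required=REQUIRED_FIELDS) -> list[str]:
    """Return the required field names that do not appear in the log text."""
    return [field for field in required if field not in log_text]


def test_diagnostics_pin_expected_fields():
    # Hypothetical sample of what a pre-restart diagnostic line could contain.
    sample = "stuck job: run_id=1 age_seconds=185 task_id=0 runner_id=0"
    assert missing_evidence_fields(sample) == []
```

Pinning field names in a test means a later refactor of the diagnostic queries cannot silently drop a column that incident analysis depends on.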

Why it changed

RCA on #260 showed the runner is alive and polling, while Forgejo only assigns work after a runner declaration/restart. Oracle agreed the next safest step is instrumentation before any cleanup or threshold change.

Files touched

scripts/forgejo/deploy-runner-watchdog
tests/test_deploy_runner_watchdog.py

Relevant context

Issue #260 deploy-host runner pickup failures
Issue #142 cutover thread checkpoint with soak/pickup caveat
prompts/codex-260-runner-pickup-rca-2026-05-14.md

Runtime evidence

No runtime mutation in this PR. Existing #260 evidence showed waiting trusted-main jobs, active runner polling, task assignment only after restart/declaration, and a stale duplicate runner row candidate.

Known constraints

This PR does not change the detection threshold, runner config, Forgejo DB rows, Infisical, or any production application. It only improves the audit trail when the existing watchdog decides to restart the deploy runner.

Explicit out-of-scope

Cleaning stale action_runner rows
Changing WAIT_SECONDS
Changing forgejo-deploy-runner.service
Changing runner labels/config
Touching Infisical or direct PAT fallback state

Requested decision

Review whether the added diagnostics capture the right pre-restart evidence while preserving the existing restart behavior.

Merge blockers

Any diagnostic failure must not prevent the existing restart path.
The watchdog must continue to target only forgejo-deploy-runner.service.
The stuck-job predicate must remain limited to trusted-main platformctl-auto-apply.yml push/workflow_dispatch jobs.
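
For reviewers checking the third blocker, the scope of the stuck-job predicate can be expressed as a small Python model. This is a sketch under stated assumptions, not the watchdog's actual query: the `Job` fields, the `"waiting"` status value, the `refs/heads/main` check standing in for "trusted-main", and the `>=` comparison are all hypothetical. Only the workflow file name, the two event types, and the `WAIT_SECONDS=120` default come from this PR.

```python
from dataclasses import dataclass

WORKFLOW = "platformctl-auto-apply.yml"
ALLOWED_EVENTS = frozenset({"push", "workflow_dispatch"})
WAIT_SECONDS = 120  # default stated in the PR, unchanged by it


@dataclass(frozen=True)
class Job:
    workflow: str
    event: str
    ref: str          # assumption: "trusted-main" is modeled as the main branch ref
    status: str       # assumption: unassigned jobs are reported as "waiting"
    age_seconds: int


def is_stuck(job: Job, wait_seconds: int = WAIT_SECONDS) -> bool:
    """True only for waiting auto-apply jobs on main that exceeded the threshold."""
    return (
        job.workflow == WORKFLOW
        and job.event in ALLOWED_EVENTS
        and job.ref == "refs/heads/main"
        and job.status == "waiting"
        and job.age_seconds >= wait_seconds
    )
```

Any job outside that conjunction, such as a `pull_request` run or a different workflow file, should never trigger a restart under this model.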

Test plan

bash -n scripts/forgejo/deploy-runner-watchdog
git diff --check
uv run --project control-plane --extra dev pytest tests/test_deploy_runner_watchdog.py control-plane/platformctl/tests/test_forgejo_ci_scripts_contract.py control-plane/platformctl/tests/test_forgejo_workflow_lint_contract.py -q — 34 passed

Spec sources read

prompts/codex-260-runner-pickup-rca-2026-05-14.md — dispatch and constraints
scripts/forgejo/deploy-runner-watchdog — implementation under investigation
tests/test_deploy_runner_watchdog.py — existing watchdog coverage
control-plane/pyproject.toml — local test environment/dependencies

Refs #260


codex added 1 commit

2026-05-14 10:29:11 +02:00

fix(forgejo): instrument deploy runner watchdog

canary-required / collect-diff (pull_request) Successful in 4s

python-ci / Python 3.11 (pull_request) Successful in 35s

python-ci / Python 3.12 (pull_request) Successful in 38s

python-ci / Python 3.13 (pull_request) Successful in 38s

canary-required / canary (pull_request) Successful in 13s