fix(forgejo): instrument deploy runner watchdog #278
Canary status: missing — fire canary 3+3 manually before merge
Summary
Adds pre-restart diagnostics to `scripts/forgejo/deploy-runner-watchdog` for #260.

The watchdog still uses the same stuck-job predicate and the same `WAIT_SECONDS=120` default. When it detects a stuck trusted-main auto-apply job, it now logs evidence before restarting the deploy runner:

- `platformctl-auto-apply.yml` run/job rows with `age_seconds` and task assignment fields;
- `deploy-host` runner rows, including stale duplicate candidates if present;
- `action_task` rows relevant to runner assignment;
- `systemctl show` status for `forgejo-deploy-runner.service` before restart.

Diagnostic query failures are logged but do not block the existing restart path.
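The fail-open behavior described above can be illustrated with a minimal Python sketch. It is not the watchdog's code (the real script is a shell script); the names `collect_diagnostics`, `handle_stuck_job`, and the probe callables are hypothetical and exist only to show the control flow: every read-only probe is attempted, failures are logged and recorded, and the restart always runs.

```python
import logging
from typing import Callable, Dict

log = logging.getLogger("deploy-runner-watchdog-sketch")


def collect_diagnostics(probes: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run each read-only probe; record a failure instead of raising."""
    evidence: Dict[str, str] = {}
    for name, probe in probes.items():
        try:
            evidence[name] = probe()
            log.info("diagnostic %s: %s", name, evidence[name])
        except Exception as exc:
            # Fail open: a broken probe must never stop the restart.
            evidence[name] = f"FAILED: {exc}"
            log.warning("diagnostic %s failed: %s", name, exc)
    return evidence


def handle_stuck_job(
    probes: Dict[str, Callable[[], str]],
    restart: Callable[[], None],
) -> Dict[str, str]:
    """Hypothetical handler: gather evidence first, then restart regardless."""
    evidence = collect_diagnostics(probes)
    restart()  # reached even if every probe failed
    return evidence
```

In this sketch a probe that raises (for example, an unreachable database) leaves a `FAILED: ...` entry in the evidence while the restart callable is still invoked exactly once.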
Canary Context Pack
Product story
We need the next deploy-runner pickup incident to leave enough evidence to distinguish Forgejo assignment drift, stale runner rows, service state, and task creation timing without guessing or mutating production data.
What changed
The watchdog gained read-only diagnostic snapshots before restart, plus tests that pin the expected evidence fields.
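One way tests can pin evidence fields is to assert that the script text still mentions each expected token. The sketch below assumes that approach; the helper `missing_evidence` and the `EXPECTED_EVIDENCE` tuple are hypothetical, with tokens taken from the Summary rather than from the actual test file, which may check a different set.

```python
from typing import Iterable, List

# Tokens taken from the PR summary; the real test may pin a different set.
EXPECTED_EVIDENCE = (
    "age_seconds",
    "action_task",
    "deploy-host",
    "systemctl show",
)


def missing_evidence(
    script_text: str,
    expected: Iterable[str] = EXPECTED_EVIDENCE,
) -> List[str]:
    """Return the expected tokens that do not appear in the script text."""
    return [token for token in expected if token not in script_text]
```

A pytest case could read `scripts/forgejo/deploy-runner-watchdog` as text and assert that `missing_evidence(...)` returns an empty list, so removing a diagnostic field later fails the test with the name of the missing token.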
Why it changed
RCA on #260 showed the runner is alive and polling, while Forgejo only assigns work after a runner declaration/restart. Oracle agreed the next safest step is instrumentation before any cleanup or threshold change.
Files touched
- `scripts/forgejo/deploy-runner-watchdog`
- `tests/test_deploy_runner_watchdog.py`
Relevant context
- `prompts/codex-260-runner-pickup-rca-2026-05-14.md`
Runtime evidence
No runtime mutation in this PR. Existing #260 evidence showed waiting trusted-main jobs, active runner polling, task assignment only after restart/declaration, and a stale duplicate runner row candidate.
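To make "stale duplicate runner row candidate" concrete, here is a small read-only sketch that flags such candidates from already-fetched rows. The row shape (`id`, `name`, `last_online`) and the function name are assumptions for illustration, not the actual Forgejo schema or the watchdog's query. It only reports candidates and deletes nothing, in line with this PR's no-mutation scope.

```python
from collections import defaultdict
from typing import Dict, List


def stale_duplicate_candidates(rows: List[dict]) -> Dict[str, List[int]]:
    """For each runner name with more than one row, return the ids of
    every row except the most recently seen one."""
    by_name: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        by_name[row["name"]].append(row)

    candidates: Dict[str, List[int]] = {}
    for name, group in by_name.items():
        if len(group) > 1:
            # Keep the row with the latest last_online; flag the rest.
            newest = max(group, key=lambda r: r["last_online"])
            candidates[name] = sorted(r["id"] for r in group if r is not newest)
    return candidates
```

Names with a single row never appear in the result, so an empty dictionary means no duplicate candidates were found in the snapshot.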
Known constraints
This PR does not change the detection threshold, runner config, Forgejo DB rows, Infisical, or any production application. It only improves the audit trail when the existing watchdog decides to restart the deploy runner.
Explicit out-of-scope
- `action_runner` rows
- `WAIT_SECONDS`
- `forgejo-deploy-runner.service`
Requested decision
Review whether the added diagnostics capture the right pre-restart evidence while preserving the existing restart behavior.
Merge blockers
- `forgejo-deploy-runner.service`.
- `platformctl-auto-apply.yml` push/workflow_dispatch jobs.
Test plan
- `bash -n scripts/forgejo/deploy-runner-watchdog`
- `git diff --check`
- `uv run --project control-plane --extra dev pytest tests/test_deploy_runner_watchdog.py control-plane/platformctl/tests/test_forgejo_ci_scripts_contract.py control-plane/platformctl/tests/test_forgejo_workflow_lint_contract.py -q`: 34 passed
Spec sources read
- `prompts/codex-260-runner-pickup-rca-2026-05-14.md`: dispatch and constraints
- `scripts/forgejo/deploy-runner-watchdog`: implementation under investigation
- `tests/test_deploy_runner_watchdog.py`: existing watchdog coverage
- `control-plane/pyproject.toml`: local test environment/dependencies
Refs #260