ops(forgejo): deploy-host runner intermittently fails to pick up waiting jobs #260
Reference
pdurlej/platform#260
Summary
Deploy-host runner can be active and declared, while manual platformctl-auto-apply.yml workflow_dispatch jobs remain in Forgejo waiting until forgejo-deploy-runner.service is restarted.
Evidence
Observed during dashboard and F1.5 smokes:
The current watchdog runs every minute and reports "no stuck trusted-main auto-apply jobs", but it only queries:
Manual cutover smokes use event = 'workflow_dispatch', so the watchdog ignores exactly the jobs that need the restart during F1.5.
Read-only DB sample after run #635 completed showed manual auto-apply rows as workflow_dispatch:
Suggested fix
Patch scripts/forgejo/deploy-runner-watchdog to include both push and workflow_dispatch events for trusted-main auto-apply jobs. Keep the same refs/heads/main, waiting status, job waiting status, and unassigned task_id = 0 checks.
Acceptance criteria
- workflow_dispatch auto-apply runs.
- push and workflow_dispatch are matched.
- forgejo-deploy-runner.service, never app containers.
Refs #142
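The suggested fix can be pictured as a row filter. The following is a minimal Python sketch, not the watchdog's actual code: the real script and its query are not shown in this thread, so the `JobRow` fields and the function name are hypothetical. The value 5 for "waiting" is taken from the status values quoted in the later evidence comments.

```python
from dataclasses import dataclass

WAITING = 5  # status value reported in this thread for waiting runs and jobs
MATCHED_EVENTS = {"push", "workflow_dispatch"}  # the fix: match both events


@dataclass
class JobRow:
    """Hypothetical flattened view of one run/job pair from the Forgejo DB."""
    event: str
    ref: str
    workflow_id: str
    run_status: int
    job_status: int
    task_id: int


def is_restart_candidate(row: JobRow) -> bool:
    """True for a trusted-main auto-apply job that is waiting and unassigned."""
    return (
        row.event in MATCHED_EVENTS
        and row.ref == "refs/heads/main"
        and row.workflow_id == "platformctl-auto-apply.yml"
        and row.run_status == WAITING
        and row.job_status == WAITING
        and row.task_id == 0  # no task assigned yet
    )
```

With the event set restricted to push only, a manual cutover smoke row would be skipped, which is the gap described in the Evidence section.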
Regression evidence — deploy runner pickup still stalls — 2026-05-14
Role: executor
Status: issue should be considered still active
During Infisical Token Auth soak smokes after #273/#275/#276 merged, both additional workflow_dispatch runs got stuck in waiting until forgejo-deploy-runner.service was restarted manually.
Evidence
Run #736/API #876 (dashboard):
- action_run.status=5, action_run_job.status=5, task_id=0
- forgejo-deploy-runner.service: run picked up and completed success.
- in-sync, apply noop, health OK.
Run #737/API #877 (matrix-well-known):
- action_run.status=5, action_run_job.status=5, task_id=0
- forgejo-deploy-runner.service: run picked up and completed success.
- in-sync, apply noop, health OK.
Notes
- forgejo_token_source=infisical-token-auth twice and no direct-env-fallback marker.
Suggested next fix
Audit scripts/forgejo/deploy-runner-watchdog against the live Forgejo DB status values and timing. It may be missing this state transition, or its WAIT_SECONDS/timer cadence may be too conservative for operator-visible smokes.
Codex #260 root-cause investigation - 2026-05-14 10:15 CEST
Role: investigator
Outcome path: B (diagnosis-only)
Quota remaining: not exposed in this Codex session; no quota stop hit
Evidence gathered
Read-only evidence collected from RS2000 and Forgejo:
- journalctl -u forgejo-deploy-runner.service --since "48 hours ago"
- journalctl -u forgejo-deploy-runner-watchdog.service --since "48 hours ago"
- systemctl cat forgejo-deploy-runner.service forgejo-deploy-runner-watchdog.{service,timer}
- /var/lib/forgejo-deploy-runner/config.yaml
- action_run, action_run_job, action_task, action_runner
- RunnerService/FetchTask, Declare, UpdateTask
The codex PAT receives 403 on GET /api/v1/repos/pdurlej/platform/actions/runners, so runner-state evidence came from the Forgejo DB instead of that API endpoint.
Findings
1. Labels and token are not the root cause.
Target runs #729/API 869, #736/API 876, and #737/API 877 all requested runs_on=["deploy-host"] and ultimately ran on active runner id=5, rs2000-deploy-host, agent_labels=["deploy-host"].
2. The deploy runner is polling while jobs are waiting.
Traefik logs show 159.195.33.107 (host deploy runner path) posting RunnerService/FetchTask roughly once per second during the stuck windows. Forgejo returns 200 with tiny empty responses until the runner service restarts.
3. Task assignment happens immediately after runner re-declare/reconnect.
Example #876:
- 2026-05-14 09:46:10 CEST GET /actions/runs/876 shows waiting
- 2026-05-14 09:47:50 runner service restarts
- RunnerService/Declare, then UpdateTask
- action_task.id=1381, runner_id=5, created at 09:47:50
4. Observed assignment latency is systematic, not one-off.
For platformctl-auto-apply.yml trusted-main dispatches since 2026-05-13, seconds_to_task was consistently delayed: 32s to 177s. Recent examples:
5. The watchdog did not miss #876/#877.
Current watchdog threshold is WAIT_SECONDS=120. #876 was restarted at age ~100s and #877 at ~69s, before the watchdog was supposed to fire. #729 was older than 120s and the watchdog correctly restarted the deploy runner.
6. No evidence found for resource pressure or DB bloat.
No OOM/throttle signal found in the investigated window. action_* tables are small and recently autovacuumed/analyzed. No current stuck trusted-main auto-apply jobs at investigation time.
7. Minor drift: stale duplicate deploy runner row exists.
DB has old action_runner.id=4, name rs2000-deploy-host, repo_id=35, version='', agent_labels=null, last_online=0. Active runner is id=5. Because id=4 has no labels and never receives tasks, this is not proven as root cause, but it is cleanup-worthy under operator/admin control.
Root cause / best hypothesis
Strongest hypothesis: Forgejo-side runner assignment state goes stale until the deploy runner re-declares. The runner process remains alive and polls FetchTask; the server answers successfully but empty. Restarting the service forces Declare, and the queued job is assigned immediately.
This is closer to a Forgejo Actions scheduler/runner-state issue than to Infisical, PAT, SSH, labels, Tailscale, or workflow YAML. The existing watchdog is therefore a valid operational mitigation, but it is not a root fix.
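The finding that the watchdog did not miss #876 comes down to simple arithmetic against WAIT_SECONDS=120. The sketch below reproduces it in Python from the two timestamps quoted for #876. It assumes the first observed "waiting" timestamp approximates the job's age, and that the watchdog fires at or above the threshold; the exact comparison in the real script is not shown in this thread, and the helper names are hypothetical.

```python
from datetime import datetime

WAIT_SECONDS = 120  # watchdog threshold quoted in the findings
FMT = "%Y-%m-%d %H:%M:%S"


def age_seconds(first_seen_waiting: str, restarted_at: str) -> int:
    """Seconds between the first observed waiting state and the manual restart."""
    delta = datetime.strptime(restarted_at, FMT) - datetime.strptime(first_seen_waiting, FMT)
    return int(delta.total_seconds())


def watchdog_due(age: int, wait_seconds: int = WAIT_SECONDS) -> bool:
    """Assumed rule: the watchdog acts once the job age reaches the threshold."""
    return age >= wait_seconds


# Run #876: seen waiting at 09:46:10, runner service restarted at 09:47:50.
age_876 = age_seconds("2026-05-14 09:46:10", "2026-05-14 09:47:50")
print(age_876, watchdog_due(age_876))  # 100 False
```

The manual restart at about 100s happened before the 120s threshold, so the watchdog had no chance to act on that run.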
Action taken
No PR opened. I do not have evidence for a safe config-level root fix. Changing fetch_interval/fetch_timeout would be speculative: the runner is already reaching Forgejo frequently and receiving HTTP 200 responses.
Recommended next steps
- run_id, run.index, age_seconds, job_id, and task_id for every restart candidate. Optional next Lite PR.
- FetchTask returns empty while job waits, Declare immediately unlocks task assignment.
Next session pickup
Pick between:
- scripts/forgejo/deploy-runner-watchdog, or
Codex #260 RCA follow-up - 2026-05-14 10:29 CEST
Role: executor
Status: instrumentation PR opened
What I did
Evidence captured in the PR
- deploy-host runner rows, including stale duplicate candidates.
- action_task rows for assignment timing.
- systemctl show forgejo-deploy-runner.service before restart.
Verification
- bash -n scripts/forgejo/deploy-runner-watchdog
- git diff --check
- uv run --project control-plane --extra dev pytest tests/test_deploy_runner_watchdog.py control-plane/platformctl/tests/test_forgejo_ci_scripts_contract.py control-plane/platformctl/tests/test_forgejo_workflow_lint_contract.py -q → 34 passed
Current hypothesis ranking
Next action after merge
Wait for the next watchdog-triggered pickup failure. Then inspect journalctl -u forgejo-deploy-runner-watchdog* / timer logs and use the new snapshots to decide whether cleanup, config, or an upstream Forgejo issue is warranted.
Codex #260 root-cause investigation - 2026-05-14 10:51 CEST
Role: investigator
Outcome path: B (diagnosis-only + instrumentation activated)
Quota remaining: not available from local CLI; no quota warning surfaced
Evidence gathered
- main fast-forwarded to f592fc5 (#278 merged); git identity is codex <codex@noreply.git.pdurlej.com>; codex PAT length 40.
- ssh rs2000 'echo ok' succeeded.
- /actions/runners returns 403, so runner state must be read from RS2000/Forgejo DB, not that API endpoint.
- ExecStart=/opt/pdurlej-platform/current/scripts/forgejo/deploy-runner-watchdog; timer active, one-shot service running every ~1m.
- forgejo-deploy-runner.service active/running, MainPID=1262289, NRestarts=0, started 2026-05-14 09:50:03 CEST.
- capacity: 1, timeout: 3h, fetch_timeout: 5s, fetch_interval: 2s, report_interval: 1s, label deploy-host:host.
- 0.
- id=4 name=rs2000-deploy-host version=<empty> last_online=0 agent_labels=null
- id=5 name=rs2000-deploy-host version=v7.0.0 last_online=2026-05-14 10:49:36+02 agent_labels=["deploy-host"]
Root cause / best hypothesis
The strongest remaining hypothesis is still Forgejo-side assignment/eligibility state getting stale while the runner process remains healthy enough to poll. The stale duplicate runner row id=4 remains suspicious, but today's evidence still does not prove it is the root cause.
A separate process root cause was confirmed: #278 merged watchdog instrumentation into the repo, but RS2000 was still executing release 79955cf from /opt/pdurlej-platform/current, because platformctl-auto-apply.yml only triggers on modules/**. That meant non-module ops-script changes do not automatically reach the systemd-executed release root.
Action taken
- f592fc56725ff9affabe1db3473f4e6102a36b4d and atomically moving /opt/pdurlej-platform/current to it.
- /opt/pdurlej-platform/current -> releases/f592fc56725ff9affabe1db3473f4e6102a36b4d
- bash -n /opt/pdurlej-platform/current/scripts/forgejo/deploy-runner-watchdog passes
- 755 root:root
- no stuck trusted-main auto-apply jobs
Next session pickup
Wait for the next actual stuck pickup event with #278 instrumentation active. If it happens, inspect journalctl -u forgejo-deploy-runner-watchdog.service for the new pre-restart snapshots:
- stuck trusted-main auto-apply jobs
- candidate deploy-host runner rows
- recent action_task rows for deploy-host investigation
- forgejo-deploy-runner.service status before restart
If those snapshots show id=4 being considered/assigned or label ambiguity, the next safe fix is stale runner cleanup or an upstream Forgejo issue. If snapshots show no runner assignment despite active id=5, investigate Forgejo scheduler/queue behavior next.
#260 observation during F2 smoke - mirotalk-sfu - 2026-05-14 11:14 CEST
Role: investigator
Intent: checkpoint #278 diagnostic capture
Needs owner: no
Run #895 (workflow_dispatch, module=mirotalk-sfu) reproduced the pickup stall:
- 2026-05-14 11:12:01+02
- 2026-05-14 11:14:27+02
- run_status=5, job_status=5, task_id=0, age_seconds=147
- ActiveState=active, SubState=running, MainPID=1262289, NRestarts=0, started 09:50:03 CEST
- id=4, same name, last_online=0, agent_labels=null
- id=5, version=v7.0.0, agent_labels=["deploy-host"], last_online=11:14:26+02
- task 1407
This is now strong evidence that the process is alive and DB-visible as online, but Forgejo does not assign the waiting job until runner re-declaration. Stale duplicate runner row remains suspicious but still not conclusively causal.
Next: continue F2 batch unless restart count reaches the dispatch stop condition (3+ restarts in this batch).
#260 observation during F2 smoke - mirotalk-admin - 2026-05-14 11:18 CEST
Role: investigator
Intent: checkpoint #278 diagnostic capture
Needs owner: no
Run #896 (workflow_dispatch, module=mirotalk-admin) reproduced the same pickup stall pattern:
- 2026-05-14 11:16:05+02
- 2026-05-14 11:18:38+02
- run_status=5, job_status=5, task_id=0, age_seconds=153
- ActiveState=active, SubState=running, MainPID=1586508, NRestarts=0, started 11:14:28 CEST
- id=4, same name, last_online=0, agent_labels=null
- id=5, version=v7.0.0, agent_labels=["deploy-host"], last_online=11:18:38+02
- task 1408
This is the second consecutive workflow_dispatch deploy-host run in the batch that required watchdog re-declaration to get assigned.
Batch restart count: 2/3. Per F2 dispatch, a third watchdog restart in this batch is a hard stop for clean #260 evidence collection.
Codex Phase B — #260 RCA: invalid runner scope found — 2026-05-15 17:02 CEST
Role: investigator
Status: hard-gate reached; root cause identified, DB fix requires operator approval
What I found
The active deploy runner row is in an invalid Forgejo scope state:
Forgejo's own source says runners are one of three shapes only:
- owner_id=0, repo_id=0
- owner_id!=0, repo_id=0
- owner_id=0, repo_id!=0
It explicitly says both non-zero is not acceptable. The registration token currently in DB is repo-level only:
So the active deploy runner should be owner_id=0, repo_id=35, not owner_id=1, repo_id=35.
Why this explains the 150s pickup failure
Forgejo runner pickup is gated by action_tasks_version:
- TasksVersion in FetchTask
- PickTask only when runner version differs from DB version
- (0,0), (owner_id,0), and (0,repo_id)
- (1,35)
Live DB confirms there is no row for (owner_id=1, repo_id=35):
For an invalid (1,35) runner, Forgejo effectively returns local version 1 on every fetch. After the first fetch the runner caches 1; future repo version bumps under (0,35) do not change the invalid combined scope, so the runner does not call PickTask. Restarting the runner resets its local cached version to 0, so 0 != 1 and the next fetch picks the waiting job immediately. That matches runs #931 and #933 exactly.
Evidence from reproductions
Run #931:
Run #933:
Source files read from Forgejo/runner official tags:
- v15.0.1: models/actions/runner.go, models/actions/tasks_version.go, routers/api/actions/runner/runner.go, models/actions/task.go
- v7.0.0: internal/app/poll/poller.go, internal/app/cmd/daemon.go
Proposed runtime fix (hard-gated)
Do not hard-delete stale runner id=4 first. The stronger root cause is active runner id=5 having the wrong scope.
Proposed operator-gated DB patch:
Before any write:
- pg_dump -t action_runner -t action_tasks_version to a timestamped file.
After commit: trigger two consecutive workflow_dispatch smokes on a known-green module and require zero watchdog restarts for pickup.
Gate
I am stopping here for G4/G7. Next step requires explicit operator approval to back up Forgejo tables and update the active runner row.
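The version-gating argument from "Why this explains the 150s pickup failure" can be illustrated with a toy model. This Python sketch is a deliberately simplified reading of the behavior described in this comment, not Forgejo's code: the class and method names are hypothetical, and the rule "a missing version row is reported as 1" is taken from the description above.

```python
class ToyServer:
    """Toy stand-in for the server side of FetchTask as described in this thread."""

    def __init__(self):
        self.versions = {}  # (owner_id, repo_id) -> tasks version
        self.queue = []     # waiting jobs

    def bump(self, owner_id, repo_id):
        # Versions exist only for the three valid scope shapes.
        scopes = {(0, 0)}
        if owner_id > 0:
            scopes.add((owner_id, 0))
        if repo_id > 0:
            scopes.add((0, repo_id))
        for scope in scopes:
            self.versions[scope] = self.versions.get(scope, 0) + 1

    def enqueue(self, owner_id, repo_id, job):
        self.queue.append(job)
        self.bump(owner_id, repo_id)

    def fetch_task(self, runner_scope, runner_version):
        latest = self.versions.get(runner_scope, 0)
        if latest == 0:
            # No row for this exact scope: bump the valid shapes and report 1.
            # For a combined scope such as (1, 35) no row is ever created,
            # so this branch is taken on every fetch.
            self.bump(*runner_scope)
            latest = 1
        task = None
        if runner_version != latest and self.queue:
            task = self.queue.pop(0)  # stands in for PickTask
        return task, latest


class ToyRunner:
    def __init__(self, scope):
        self.scope = scope
        self.cached_version = 0

    def poll(self, server):
        task, version = server.fetch_task(self.scope, self.cached_version)
        self.cached_version = version
        return task

    def restart(self):
        self.cached_version = 0  # a restart forgets the cached version


if __name__ == "__main__":
    server = ToyServer()
    runner = ToyRunner((1, 35))       # invalid combined scope
    runner.poll(server)               # caches version 1
    server.enqueue(1, 35, "job")
    print(runner.poll(server))        # None: cached 1 equals reported 1
    runner.restart()
    print(runner.poll(server))        # job: 0 differs from 1
```

In the same model a runner scoped as (0, 35) sees the version move from 1 to 2 when a job is queued, so the next poll picks the job without a restart.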
Codex Phase B — repo-side scope doctor PR opened — 2026-05-15
Role: executor / investigator
Status: repo-side fix prepared; runtime DB repair remains hard-gated
What I did
- fix(forgejo): add deploy runner scope doctor.
- --apply.
- rs2000-deploy-host must be repo-scoped as owner_id=0, repo_id=35, not combined owner_id=1, repo_id=35.
Evidence
- bash -n scripts/forgejo/deploy-runner-scope-doctor and pytest tests/test_deploy_runner_scope_doctor.py tests/test_deploy_runner_watchdog.py -q → 8 passed.
- owner_id=1, repo_id=35, deleted=NULL
- owner_id=0, repo_id=35
Hard gate still held
I did not modify Forgejo DB. The live repair still requires explicit operator approval at apply time.
Exact gate text I need before runtime apply:
After that approval, I will run backup + scoped transaction, then trigger two consecutive workflow_dispatch smokes and verify zero watchdog restarts.
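The check that a scope doctor has to perform can be sketched in a few lines. This is an illustrative Python version, not the contents of scripts/forgejo/deploy-runner-scope-doctor (that script is not shown in this thread). The function names and the "global"/"owner"/"repo" labels are assumptions; the three valid shapes and the expected repo_id=35 come from the RCA comment above.

```python
def classify_runner_scope(owner_id: int, repo_id: int) -> str:
    """Map an (owner_id, repo_id) pair to one of the three valid shapes, or 'invalid'."""
    if owner_id == 0 and repo_id == 0:
        return "global"
    if owner_id != 0 and repo_id == 0:
        return "owner"
    if owner_id == 0 and repo_id != 0:
        return "repo"
    return "invalid"  # both non-zero is not an acceptable combination


def deploy_runner_scope_ok(owner_id: int, repo_id: int, expected_repo_id: int = 35) -> bool:
    """The deploy runner must be repo-scoped to the expected repository."""
    return classify_runner_scope(owner_id, repo_id) == "repo" and repo_id == expected_repo_id


print(classify_runner_scope(1, 35), deploy_runner_scope_ok(1, 35))  # invalid False
print(classify_runner_scope(0, 35), deploy_runner_scope_ok(0, 35))  # repo True
```

A read-only check of this kind can run safely at any time; only the repair step needs the operator gate described above.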
Codex Phase B — deploy runner scope fixed + clean pickup verified — 2026-05-15 18:43 CEST
Role: executor / applier
Status: complete
What I did
- /opt/pdurlej-platform/backups/forgejo-actions/action-runner-scope-20260515T163922Z.sql
- rs2000-deploy-host id=5 owner_id=1, repo_id=35 -> owner_id=0, repo_id=35
- deploy-runner-scope-doctor --check: PASS.
Evidence
Runtime DB repair output showed:
Two consecutive trusted-main workflow_dispatch smokes picked up cleanly with no watchdog restart:
Watchdog journal after the fix contains only periodic "no stuck trusted-main auto-apply jobs"; no "restarting forgejo-deploy-runner.service" entries.
Artifacts exist:
- platformctl-auto-apply-957, 4174 bytes
- platformctl-auto-apply-958, 3538 bytes
Result
#260 root cause is fixed, not merely worked around. The watchdog remains as diagnostic belt-and-suspenders, but it was not needed for the two post-fix smokes.
Suggested close condition
Operator can close #260 after spot-checking this evidence. If desired, keep it open for 24h soak, but it should no longer be treated as active cutover fire.
Codex #260 observation — clean pickup after scope fix — 2026-05-16 09:24 CEST
Role: executor
Status: current evidence supports #260 fix
Evidence
- e5e8fe02a3fc97dc147b9104670128bbc459a39a via run #1131/API #878.
- platformctl-auto-apply.yml workflow_dispatch smokes (#1132, #1134-#1138) all assigned to runner id=5.
- seconds_to_task=1 for all six smokes.
- no stuck trusted-main auto-apply jobs; no restart fired.
Interpretation
The post-fix runner scope is behaving cleanly in this observation window. This is the signal we wanted before continuing broader release batches.
Codex #260 soak evidence check — 2026-05-17 08:05 CEST
Role: executor
Outcome: clean soak evidence after deploy-runner scope fix
Evidence gathered
- forgejo-deploy-runner.service: active.
- forgejo-deploy-runner-watchdog.timer: active.
- 2026-05-16 00:00:00
- May 16 09:03:54.
- platformctl-auto-apply.yml jobs since 2026-05-16 00:00+02: 59
- seconds_to_task: min 0, max 2, avg 0.95.
- 1469 Traefik F3: seconds_to_task=1.
- 1460 Forgejo F3: seconds_to_task=0.
- 1447 MinIO F3: seconds_to_task=2.
- 5: owner_id=0, repo_id=35, version v7.0.0, labels ["deploy-host"], last_online 2026-05-17 08:01:32+02.
- 4 is tombstoned/deleted and inert.
Readout
This is the behavior we wanted: clean pickup without watchdog intervention. The old 69-177s task assignment delay is not present in the observed soak window.
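The soak readout is a min/max/average summary of seconds_to_task. A small Python sketch of that summary follows, using only the three per-run values listed in the evidence above (1469, 1460, 1447), not the full set of 59 jobs, so its average differs from the reported 0.95. The helper names and the use of the 120s watchdog threshold as the "clean" bound are assumptions.

```python
WAIT_SECONDS = 120  # watchdog threshold quoted earlier in this thread

# seconds_to_task for the three runs named in the evidence list above
samples = {1469: 1, 1460: 0, 1447: 2}


def summarize(values):
    """Return (min, max, average) for a non-empty list of seconds_to_task values."""
    values = list(values)
    return min(values), max(values), round(sum(values) / len(values), 2)


def pickup_is_clean(values, threshold=WAIT_SECONDS):
    """Assumed rule: a window is clean when every pickup stays below the threshold."""
    return all(v < threshold for v in values)


print(summarize(samples.values()))        # (0, 2, 1.0)
print(pickup_is_clean(samples.values()))  # True
```

Fed with the pre-fix delays quoted earlier (up to 177s), the same check would report the window as not clean.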
Recommendation
Keep the watchdog as belt-and-suspenders instrumentation, but treat #260 as operationally resolved after this soak window. Next RS2000 migration work should be cleanup/closeout, not more runner RCA: close or re-scope #260, then address the known separate follow-ups np-meerkat-frontend (#269) and minio-init (#310).
Closing note - deploy-runner pickup resolved operationally - 2026-05-17 08:50 CEST
Role: executor
Status: resolved by scope fix + soak evidence
Evidence
- forgejo-deploy-runner.service: active.
- forgejo-deploy-runner-watchdog.timer: active.
- 2026-05-16 00:00: watchdog positive stuck detections = 0.
- 2026-05-16 00:00: deploy-runner restarts = 0.
- platformctl-auto-apply.yml pickup since 2026-05-16 00:00+02: 59 runs, min 0s, max 2s, average 0.95s.
- v7.0.0, labels ["deploy-host"].
Decision
Closing this as operationally resolved. The watchdog remains as belt-and-suspenders instrumentation, but the false-fire pickup pattern is no longer present in soak evidence.
Reopen condition
Reopen if a trusted-main platformctl-auto-apply.yml job again waits with task_id=0 beyond threshold, or if watchdog restarts return after this closeout.
Reference: PR #318 updates state/STATUS_NOW.md with the post-F3 closeout state.