pdurlej/platform


ops(forgejo): deploy-host runner intermittently fails to pick up waiting jobs #260


Closed

opened 2026-05-13 09:10:06 +02:00 by codex · 12 comments

codex commented

2026-05-13 09:10:06 +02:00

Collaborator

Summary

Deploy-host runner can be active and declared, while manual platformctl-auto-apply.yml workflow_dispatch jobs remain in Forgejo waiting until forgejo-deploy-runner.service is restarted.

Evidence

Observed during dashboard and F1.5 smokes:

Dashboard run #612 / API #748 waited until deploy runner restart.
Dashboard run #621 / API #757 waited until deploy runner restart.
Matrix well-known run #635 / API #771 waited until deploy runner restart.

The current watchdog runs every minute and reports no stuck trusted-main auto-apply jobs, but it only queries:

ar.workflow_id = 'platformctl-auto-apply.yml'
and ar.event = 'push'
and ar.ref = 'refs/heads/main'
and ar.status = 5
and j.status = 5
and j.task_id = 0

Manual cutover smokes use event = 'workflow_dispatch', so the watchdog ignores exactly the jobs that need the restart during F1.5.

Read-only DB sample after run #635 completed showed manual auto-apply rows as workflow_dispatch:

771 | platformctl-auto-apply.yml | workflow_dispatch | refs/heads/main | ... | task_id 1252
757 | platformctl-auto-apply.yml | workflow_dispatch | refs/heads/main | ... | task_id 1235
748 | platformctl-auto-apply.yml | workflow_dispatch | refs/heads/main | ... | task_id 1225

Suggested fix

Patch scripts/forgejo/deploy-runner-watchdog to include both push and workflow_dispatch events for trusted-main auto-apply jobs. Keep the same refs/heads/main, waiting status, job waiting status, and unassigned task_id = 0 checks.
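The widened predicate can be sketched against a throwaway in-memory database. This is only an illustration, not the watchdog's actual query: the two-table schema below is a hypothetical reduction to the columns named in this issue, the `run_id` join key is an assumption, and the sample rows are invented.

```python
import sqlite3

# Same checks as the current watchdog query, but accepting both event types.
# Assumption: jobs reference their run through a run_id column.
STUCK_JOBS_SQL = """
select ar.id
from action_run ar
join action_run_job j on j.run_id = ar.id
where ar.workflow_id = 'platformctl-auto-apply.yml'
  and ar.event in ('push', 'workflow_dispatch')
  and ar.ref = 'refs/heads/main'
  and ar.status = 5
  and j.status = 5
  and j.task_id = 0
order by ar.id
"""


def find_stuck_run_ids(conn):
    """Return ids of trusted-main auto-apply runs that are waiting and unassigned."""
    return [row[0] for row in conn.execute(STUCK_JOBS_SQL)]


def demo_connection():
    """Build a hypothetical, heavily simplified copy of the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table action_run (
            id integer primary key, workflow_id text, event text, ref text, status integer);
        create table action_run_job (
            id integer primary key, run_id integer, status integer, task_id integer);
        """
    )
    wf = "platformctl-auto-apply.yml"
    conn.executemany(
        "insert into action_run values (?, ?, ?, ?, ?)",
        [
            (1, wf, "push", "refs/heads/main", 5),               # matched before and after
            (2, wf, "workflow_dispatch", "refs/heads/main", 5),  # matched only after the fix
            (3, wf, "workflow_dispatch", "refs/heads/feature", 5),  # wrong ref, ignored
            (4, wf, "workflow_dispatch", "refs/heads/main", 5),  # task already assigned
        ],
    )
    conn.executemany(
        "insert into action_run_job values (?, ?, ?, ?)",
        [(10, 1, 5, 0), (20, 2, 5, 0), (30, 3, 5, 0), (40, 4, 5, 1252)],
    )
    return conn


print(find_stuck_run_ids(demo_connection()))
```

A shell test for the real script could assert the same thing: one `push` row and one `workflow_dispatch` row match, while a non-main ref and an already-assigned job do not.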

Acceptance criteria

Watchdog query includes manual workflow_dispatch auto-apply runs.
Unit/shell test proves both push and workflow_dispatch are matched.
No production containers are restarted by the test.
Service still restarts only forgejo-deploy-runner.service, never app containers.

Refs #142


codex added the owner-attention and risk/process labels

2026-05-13 09:10:06 +02:00

codex referenced this issue from a pull request that will close it,

2026-05-13 09:12:23 +02:00

fix(forgejo): watchdog covers manual auto-apply dispatch #261

codex referenced this issue

2026-05-13 09:15:06 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

codex referenced this issue

2026-05-13 09:18:05 +02:00

docs(state): update STATUS_NOW with post-cutover baseline #262

codex referenced this issue

2026-05-13 09:33:01 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

pdurlej closed this issue

2026-05-13 18:35:00 +02:00

codex commented

2026-05-14 09:51:39 +02:00

Author

Collaborator

Regression evidence — deploy runner pickup still stalls — 2026-05-14

Role: executor
Status: issue should be considered still active

During Infisical Token Auth soak smokes after #273/#275/#276 merged, both additional workflow_dispatch runs got stuck in waiting until forgejo-deploy-runner.service was restarted manually.

Evidence

Run #736/API #876 (dashboard):

DB before restart: action_run.status=5, action_run_job.status=5, task_id=0
After manual restart of only forgejo-deploy-runner.service: run picked up and completed success.
Runtime result: plan in-sync, apply noop, health OK.

Run #737/API #877 (matrix-well-known):

DB before restart: action_run.status=5, action_run_job.status=5, task_id=0
After manual restart of only forgejo-deploy-runner.service: run picked up and completed success.
Runtime result: plan in-sync, apply noop, health OK.

Notes

This is not an Infisical failure: both runs logged forgejo_token_source=infisical-token-auth twice and no direct-env-fallback marker.
This is not a production app failure: no unhealthy/restarting/dead containers after the smokes.
Watchdog did not restart the runner before manual intervention, despite the job being in the stuck shape it is meant to detect.

Suggested next fix

Audit scripts/forgejo/deploy-runner-watchdog against the live Forgejo DB status values and timing. It may be missing this state transition or its WAIT_SECONDS/timer cadence may be too conservative for operator-visible smokes.


codex reopened this issue

2026-05-14 09:51:39 +02:00

codex referenced this issue

2026-05-14 09:52:00 +02:00

ops(secrets): migrate deploy-runner from direct PAT to Infisical Token Auth on machine identity #265

codex referenced this issue

2026-05-14 09:52:16 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

claude referenced this issue from a commit

2026-05-14 10:03:27 +02:00

prompts(codex): dispatch for #260 runner-pickup root cause investigation

claude referenced this issue

2026-05-14 10:04:34 +02:00

prompts(codex): dispatch for #260 runner-pickup root cause #277

pdurlej referenced this issue from a commit

2026-05-14 10:06:32 +02:00

Merge pull request 'prompts(codex): dispatch for #260 runner-pickup root cause' (#277) from claude/260-runner-pickup-rca-dispatch into main

codex commented

2026-05-14 10:16:09 +02:00

Author

Collaborator

Codex #260 root-cause investigation — 2026-05-14 10:15 CEST

Role: investigator
Outcome path: B (diagnosis-only)
Quota remaining: not exposed in this Codex session; no quota stop hit

Evidence gathered

Read-only evidence collected from RS2000 and Forgejo:

journalctl -u forgejo-deploy-runner.service --since "48 hours ago"
journalctl -u forgejo-deploy-runner-watchdog.service --since "48 hours ago"
systemctl cat forgejo-deploy-runner.service forgejo-deploy-runner-watchdog.{service,timer}
deploy runner config snapshot from /var/lib/forgejo-deploy-runner/config.yaml
Forgejo DB tables: action_run, action_run_job, action_task, action_runner
Traefik/Forgejo access logs for RunnerService/FetchTask, Declare, UpdateTask
kernel/OOM/resource signals around 2026-05-14 09:30-10:00 CEST

The codex PAT receives 403 on GET /api/v1/repos/pdurlej/platform/actions/runners, so runner-state evidence came from the Forgejo DB instead of that API endpoint.

Findings

1. Labels and token are not the root cause. Target runs #729/API 869, #736/API 876, and #737/API 877 all requested runs_on=["deploy-host"] and ultimately ran on active runner id=5, rs2000-deploy-host, agent_labels=["deploy-host"].
2. The deploy runner is polling while jobs are waiting. Traefik logs show 159.195.33.107 (host deploy runner path) posting RunnerService/FetchTask roughly once per second during the stuck windows. Forgejo returns 200 with tiny empty responses until the runner service restarts.
3. Task assignment happens immediately after runner re-declare/reconnect. Example #876:
- workflow dispatch created: 2026-05-14 09:46:10 CEST
- repeated GET /actions/runs/876 shows waiting
- 2026-05-14 09:47:50 runner service restarts
- same second: Forgejo logs RunnerService/Declare, then UpdateTask
- DB: action_task.id=1381, runner_id=5, created at 09:47:50
4. Observed assignment latency is systematic, not one-off. For platformctl-auto-apply.yml trusted-main dispatches since 2026-05-13, seconds_to_task was consistently delayed, ranging from 32s to 177s. Recent examples:
- #729/API 869: 150s, watchdog fired at 09:35:51
- #736/API 876: 100s, manual restart happened before watchdog threshold
- #737/API 877: 69s, manual restart happened before watchdog threshold
5. The watchdog did not miss #876/#877. Current watchdog threshold is WAIT_SECONDS=120. #876 was restarted at age ~100s and #877 at ~69s, before the watchdog was supposed to fire. #729 was older than 120s and the watchdog correctly restarted the deploy runner.
6. No evidence found for resource pressure or DB bloat. No OOM/throttle signal found in the investigated window. action_* tables are small and recently autovacuumed/analyzed. No current stuck trusted-main auto-apply jobs at investigation time.
7. Minor drift: stale duplicate deploy runner row exists. DB has old action_runner.id=4, name rs2000-deploy-host, repo_id=35, version='', agent_labels=null, last_online=0. Active runner is id=5. Because id=4 has no labels and never receives tasks, this is not proven as root cause, but it is cleanup-worthy under operator/admin control.
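The threshold reasoning in the findings can be checked with a few lines. This is a sketch only: the ages are the `seconds_to_task` values listed above, and the `>=` comparison is an assumption about how the watchdog compares job age with `WAIT_SECONDS`.

```python
WAIT_SECONDS = 120  # watchdog threshold reported in the findings

# API run id -> seconds from dispatch to task assignment, as listed above.
observed = {869: 150, 876: 100, 877: 69}


def past_threshold(age_seconds, wait_seconds=WAIT_SECONDS):
    """True once a waiting job is old enough for the watchdog to act (assumed >=)."""
    return age_seconds >= wait_seconds


eligible = sorted(run for run, age in observed.items() if past_threshold(age))
print(eligible)  # only run 869 was old enough for the watchdog to act
```

Runs 876 and 877 were restarted by hand at roughly 100s and 69s, so they never reached the point where the watchdog would have been expected to fire.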

Root cause / best hypothesis

Strongest hypothesis: Forgejo-side runner assignment state goes stale until the deploy runner re-declares. The runner process remains alive and polls FetchTask; the server answers successfully but empty. Restarting the service forces Declare, and the queued job is assigned immediately.

This is closer to a Forgejo Actions scheduler/runner-state issue than to Infisical, PAT, SSH, labels, Tailscale, or workflow YAML. The existing watchdog is therefore a valid operational mitigation, but it is not a root fix.

Action taken

No PR opened. I do not have evidence for a safe config-level root fix. Changing fetch_interval/fetch_timeout would be speculative: the runner is already reaching Forgejo frequently and receiving HTTP 200 responses.

Recommended next steps

Keep #260 open. Treat current watchdog as mitigation, not closure.
Do not manually restart for jobs younger than 120s during the next observation window unless the operator needs immediate progress. Let the watchdog prove whether it catches naturally.
Add instrumentation before another fix attempt: extend watchdog logging to include run_id, run.index, age_seconds, job_id, and task_id for every restart candidate. Optional next Lite PR.
Operator/admin cleanup: remove stale deploy runner row/id 4 via Forgejo admin UI/API after confirming it is not displayed as the active runner. This is not an agent DB mutation.
If this remains frequent: prepare a small upstream-quality Forgejo issue with evidence: polling FetchTask returns empty while job waits, Declare immediately unlocks task assignment.
Optional mitigation PR: if operator wants lower latency rather than root cause, consider lowering watchdog threshold from 120s to 60s after one clean observation. That is a mitigation tradeoff, not RCA closure.
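To weigh the 120s versus 60s tradeoff, it may help to model how old a stuck job is when the watchdog first sees it. The sketch below rests on simplifying assumptions: the timer fires exactly every 60 seconds, the check is `age >= WAIT_SECONDS`, and restart and pickup time are ignored. The function names are hypothetical.

```python
def detection_age(created, timer_phase, period=60, wait_seconds=120):
    """Job age at the first timer tick where the job is at least wait_seconds old.

    created and timer_phase are seconds on the same clock; ticks happen at
    timer_phase, timer_phase + period, timer_phase + 2 * period, and so on.
    """
    tick = timer_phase
    while tick - created < wait_seconds:
        tick += period
    return tick - created


def detection_range(period=60, wait_seconds=120):
    """Best and worst detection age over every whole-second timer phase."""
    ages = [detection_age(0, phase, period, wait_seconds) for phase in range(period)]
    return min(ages), max(ages)


print(detection_range(wait_seconds=120))  # (120, 179)
print(detection_range(wait_seconds=60))   # (60, 119)
```

Under these assumptions a 120s threshold means a stuck job waits between two and three minutes before the restart, and a 60s threshold would cut that to between one and two minutes.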

Next session pickup

Pick between:

instrumentation Lite PR for scripts/forgejo/deploy-runner-watchdog, or
operator/admin stale-runner cleanup, or
upstream Forgejo bug report packet.


codex referenced this issue

2026-05-14 10:16:19 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

codex referenced this issue

2026-05-14 10:29:11 +02:00

fix(forgejo): instrument deploy runner watchdog #278

codex commented

2026-05-14 10:30:10 +02:00

Author

Collaborator

Codex #260 RCA follow-up — 2026-05-14 10:29 CEST

Role: executor
Status: instrumentation PR opened

What I did

Completed the #260 RCA pass and used a local Oracle second opinion for the next step.
Kept the outcome narrow: no runner cleanup, no threshold change, no Infisical changes, no production app mutation.
Opened PR #278: https://git.pdurlej.com/pdurlej/platform/pulls/278

Evidence captured in the PR

Watchdog will log stuck trusted-main auto-apply run/job rows before restart.
Watchdog will log candidate deploy-host runner rows, including stale duplicate candidates.
Watchdog will log recent action_task rows for assignment timing.
Watchdog will log systemctl show forgejo-deploy-runner.service before restart.
Diagnostic failures are non-blocking; the existing restart path still runs.
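The "diagnostics must never block the restart" property can be illustrated as follows. This is a minimal sketch of the pattern, not the watchdog script itself (which is a shell script); `run_watchdog`, the diagnostic names, and the callables are hypothetical.

```python
def run_watchdog(diagnostics, restart, log=print):
    """Collect each diagnostic snapshot, then restart regardless of failures."""
    for name, collect in diagnostics:
        try:
            log(f"{name}: {collect()}")
        except Exception as exc:
            # A broken diagnostic is logged and skipped, never fatal.
            log(f"diagnostic {name} failed: {exc}")
    restart()


def broken_snapshot():
    # Stand-in for a diagnostic query that cannot reach the database.
    raise RuntimeError("db unreachable")


events = []
run_watchdog(
    [("runner rows", broken_snapshot), ("service status", lambda: "active")],
    restart=lambda: events.append("restart"),
    log=events.append,
)
print(events)
```

The restart is the last step and sits outside the `try`, so a failing snapshot costs only a log line.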

Verification

bash -n scripts/forgejo/deploy-runner-watchdog
git diff --check
uv run --project control-plane --extra dev pytest tests/test_deploy_runner_watchdog.py control-plane/platformctl/tests/test_forgejo_ci_scripts_contract.py control-plane/platformctl/tests/test_forgejo_workflow_lint_contract.py -q → 34 passed

Current hypothesis ranking

Forgejo-side assignment/eligibility state gets stale while runner polling stays healthy.
Stale duplicate runner row is plausible but not proven root cause.
Runner service config timing is less likely; the evidence shows HTTP polling works.

Next action after merge

Wait for the next watchdog-triggered pickup failure. Then inspect journalctl -u forgejo-deploy-runner-watchdog* / timer logs and use the new snapshots to decide whether cleanup, config, or upstream Forgejo issue is warranted.


codex referenced this issue

2026-05-14 10:30:10 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

codex referenced this issue

2026-05-14 10:51:33 +02:00

ops(deploy): non-module ops scripts do not promote to RS2000 release root after merge #279

codex commented

2026-05-14 10:52:12 +02:00

Author

Collaborator

Codex #260 root-cause investigation — 2026-05-14 10:51 CEST

Role: investigator
Outcome path: B (diagnosis-only + instrumentation activated)
Quota remaining: not available from local CLI; no quota warning surfaced

Evidence gathered

Identity/sync: local main fast-forwarded to f592fc5 (#278 merged); git identity is codex <codex@noreply.git.pdurlej.com>; codex PAT length 40.
SSH: ssh rs2000 'echo ok' succeeded.
Forgejo API: codex PAT can read recent action runs, but /actions/runners returns 403, so runner state must be read from RS2000/Forgejo DB, not that API endpoint.
Watchdog service: ExecStart=/opt/pdurlej-platform/current/scripts/forgejo/deploy-runner-watchdog; timer active, one-shot service running every ~1m.
Runner service: forgejo-deploy-runner.service active/running, MainPID=1262289, NRestarts=0, started 2026-05-14 09:50:03 CEST.
Runner config snapshot: capacity: 1, timeout: 3h, fetch_timeout: 5s, fetch_interval: 2s, report_interval: 1s, label deploy-host:host.
Current stuck trusted-main auto-apply jobs: 0.
Runner DB rows still show duplicate candidates:
- id=4 name=rs2000-deploy-host version=<empty> last_online=0 agent_labels=null
- id=5 name=rs2000-deploy-host version=v7.0.0 last_online=2026-05-14 10:49:36+02 agent_labels=["deploy-host"]
Recent deploy-host tasks: last runner_id=5 tasks were workflow_dispatch runs #876/#877 around 09:47-09:50; later tasks are runner_id=2 non-deploy CI jobs.

Root cause / best hypothesis

The strongest remaining hypothesis is still Forgejo-side assignment/eligibility state getting stale while the runner process remains healthy enough to poll. The stale duplicate runner row id=4 remains suspicious, but today's evidence still does not prove it is the root cause.

A separate process root cause was confirmed: #278 merged watchdog instrumentation into the repo, but RS2000 was still executing release 79955cf from /opt/pdurlej-platform/current, because platformctl-auto-apply.yml only triggers on modules/**. That means non-module ops-script changes do not automatically reach the systemd-executed release root.
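The promotion gap follows directly from the path filter. The sketch below approximates a `modules/**` filter with a plain prefix check, which is an assumption about how the filter behaves rather than Forgejo's actual matching code; the module file path in the example is hypothetical.

```python
def triggers_auto_apply(changed_paths, prefix="modules/"):
    """Approximate a modules/** path filter: any changed file under modules/ triggers."""
    return any(path.startswith(prefix) for path in changed_paths)


# The #278 change touched only the watchdog script, so nothing fired.
print(triggers_auto_apply(["scripts/forgejo/deploy-runner-watchdog"]))

# A hypothetical module change, by contrast, would trigger the workflow.
print(triggers_auto_apply(["modules/dashboard/compose.yaml"]))
```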

Action taken

PR #278 was already merged by operator.
Activated #278 on RS2000 by creating trusted release f592fc56725ff9affabe1db3473f4e6102a36b4d and atomically moving /opt/pdurlej-platform/current to it.
Verified:
- /opt/pdurlej-platform/current -> releases/f592fc56725ff9affabe1db3473f4e6102a36b4d
- watchdog instrumentation marker present
- bash -n /opt/pdurlej-platform/current/scripts/forgejo/deploy-runner-watchdog passes
- mode/owner: 755 root:root
- next timer run at 10:50:37 reported no stuck trusted-main auto-apply jobs
Opened follow-up #279 for the release-root promotion gap: https://git.pdurlej.com/pdurlej/platform/issues/279
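One common way to move a `current` symlink atomically is to create a temporary symlink and rename it over the old one. The comment above does not say which mechanism was used on RS2000, so the sketch below is an assumption about the technique; the function name, the `.tmp` suffix, and the release names are hypothetical, and it assumes a POSIX filesystem.

```python
import os
import tempfile


def point_current_at(release_root, release_name, link_name="current"):
    """Repoint <release_root>/current at releases/<release_name> via rename."""
    link_path = os.path.join(release_root, link_name)
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)  # clear a leftover from an interrupted earlier run
    os.symlink(os.path.join("releases", release_name), tmp_path)
    # rename(2) replaces the old symlink in a single step, so readers see
    # either the old target or the new one, never a missing link.
    os.replace(tmp_path, link_path)
    return os.readlink(link_path)


def demo():
    with tempfile.TemporaryDirectory() as root:
        for name in ("old-release", "new-release"):
            os.makedirs(os.path.join(root, "releases", name))
        point_current_at(root, "old-release")
        return point_current_at(root, "new-release")


print(demo())
```

A systemd unit whose `ExecStart` goes through `current` picks up the new release on its next start, which is why the next timer run already executed the instrumented script.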

Next session pickup

Wait for the next actual stuck pickup event with #278 instrumentation active. If it happens, inspect journalctl -u forgejo-deploy-runner-watchdog.service for the new pre-restart snapshots:

stuck trusted-main auto-apply jobs
candidate deploy-host runner rows
recent action_task rows for deploy-host investigation
forgejo-deploy-runner.service status before restart

If those snapshots show id=4 being considered/assigned or label ambiguity, the next safe fix is stale runner cleanup or an upstream Forgejo issue. If snapshots show no runner assignment despite active id=5, investigate Forgejo scheduler/queue behavior next.


codex referenced this issue

2026-05-14 10:52:12 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

glm referenced this issue from a commit

2026-05-14 10:56:58 +02:00

prompts(codex): dispatch for F2 ramp continuation (5 stateless smokes)

claude referenced this issue

2026-05-14 10:57:13 +02:00

prompts(codex): dispatch for F2 ramp continuation (5 stateless smokes) #280

codex commented

2026-05-14 11:15:23 +02:00

Author

Collaborator

#260 observation during F2 smoke — mirotalk-sfu — 2026-05-14 11:14 CEST

Role: investigator
Intent: checkpoint #278 diagnostic capture
Needs owner: no

Run #895 (workflow_dispatch, module=mirotalk-sfu) reproduced the pickup stall:

created: 2026-05-14 11:12:01+02
watchdog fired: 2026-05-14 11:14:27+02
pre-restart state: run_status=5, job_status=5, task_id=0, age_seconds=147
runner service before restart: ActiveState=active, SubState=running, MainPID=1262289, NRestarts=0, started 09:50:03 CEST
runner rows at capture time:
- stale id=4, same name, last_online=0, agent_labels=null
- active id=5, version=v7.0.0, agent_labels=["deploy-host"], last_online=11:14:26+02
after watchdog restart: runner declared successfully and immediately picked up task 1407
final run result: success; apply noop; health OK

This is now strong evidence that the process is alive and DB-visible as online, but Forgejo does not assign the waiting job until runner re-declaration. Stale duplicate runner row remains suspicious but still not conclusively causal.

Next: continue F2 batch unless restart count reaches the dispatch stop condition (3+ restarts in this batch).


codex referenced this issue

2026-05-14 11:15:24 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

codex commented

2026-05-14 11:19:52 +02:00

Author

Collaborator

#260 observation during F2 smoke — mirotalk-admin — 2026-05-14 11:18 CEST

Role: investigator
Intent: checkpoint #278 diagnostic capture
Needs owner: no

Run #896 (workflow_dispatch, module=mirotalk-admin) reproduced the same pickup stall pattern:

created: 2026-05-14 11:16:05+02
watchdog fired: 2026-05-14 11:18:38+02
pre-restart state: run_status=5, job_status=5, task_id=0, age_seconds=153
runner service before restart: ActiveState=active, SubState=running, MainPID=1586508, NRestarts=0, started 11:14:28 CEST
runner rows at capture time:
- stale id=4, same name, last_online=0, agent_labels=null
- active id=5, version=v7.0.0, agent_labels=["deploy-host"], last_online=11:18:38+02
after watchdog restart: runner declared successfully and immediately picked up task 1408
final run result: success; apply noop; health OK

This is the second consecutive workflow_dispatch deploy-host run in the batch that required watchdog re-declaration to get assigned.

Batch restart count: 2/3. Per F2 dispatch, a third watchdog restart in this batch is a hard stop for clean #260 evidence collection.


codex referenced this issue

2026-05-14 11:19:52 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

codex referenced this issue

2026-05-14 11:23:35 +02:00

fix(modules): shorten Umami outcome for strict-v2 #281

codex referenced this issue

2026-05-14 11:25:18 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

glm referenced this issue from a commit

2026-05-14 11:57:45 +02:00

prompts(codex): A+B dispatch for #260 proper fix + umami health probe

claude referenced this issue

2026-05-14 11:58:01 +02:00

prompts(codex): A+B dispatch for #260 proper fix + umami health probe #282

pdurlej referenced this issue from a commit

2026-05-14 11:58:47 +02:00

Merge pull request 'prompts(codex): A+B dispatch for #260 proper fix + umami health probe' (#282) from claude/260-proper-fix-plus-umami-probe-dispatch into main

glm referenced this issue from a commit

2026-05-14 12:02:28 +02:00

decisions(0018): agentic-first systems fix root causes, not workarounds

claude referenced this issue

2026-05-14 12:02:40 +02:00

decisions(0018): agentic-first systems fix root causes, not workarounds #283

codex referenced this issue

2026-05-14 12:08:56 +02:00

fix(smoke): probe Umami health inside container #284

codex referenced this issue

2026-05-14 15:53:51 +02:00

fix(smoke): probe Umami through Tailnet route #285

codex referenced this issue

2026-05-15 16:44:32 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

codex commented

2026-05-15 16:51:22 +02:00

Author

Collaborator

Codex Phase B — #260 RCA: invalid runner scope found — 2026-05-15 17:02 CEST

Role: investigator
Status: hard-gate reached; root cause identified, DB fix requires operator approval

What I found

The active deploy runner row is in an invalid Forgejo scope state:

action_runner id=5 name=rs2000-deploy-host owner_id=1 repo_id=35 labels=["deploy-host"] deleted=NULL

Forgejo's own source says runners are one of three shapes only:

global: owner_id=0, repo_id=0
owner/org: owner_id!=0, repo_id=0
repo-level: owner_id=0, repo_id!=0

The source explicitly states that a row with both owner_id and repo_id non-zero is not acceptable. The registration token currently in the DB is repo-level only:

action_runner_token id=2 owner_id=NULL repo_id=35 is_active=true

So the active deploy runner should be owner_id=0, repo_id=35, not owner_id=1, repo_id=35.
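The three valid shapes above can be written down as a small classifier. This is a minimal sketch, not Forgejo's code: the function name `classify_runner_scope` is hypothetical, and treating a NULL `owner_id` like 0 is an assumption made here so that the token row and the runner row can be compared with the same helper.

```python
from typing import Optional


def classify_runner_scope(owner_id: Optional[int], repo_id: Optional[int]) -> str:
    """Classify an action_runner-style row by its (owner_id, repo_id) pair.

    Hypothetical helper for illustration; NULL is treated like 0 (assumption).
    """
    owner = owner_id or 0
    repo = repo_id or 0
    if owner == 0 and repo == 0:
        return "global"
    if owner != 0 and repo == 0:
        return "owner"
    if owner == 0 and repo != 0:
        return "repo"
    # Both non-zero: the combined shape that the comment above calls invalid.
    return "invalid"


if __name__ == "__main__":
    print(classify_runner_scope(1, 35))     # state of runner id=5 before the fix
    print(classify_runner_scope(0, 35))     # expected repo-level state
    print(classify_runner_scope(None, 35))  # registration token row
```

With the values quoted in this comment, the active runner row classifies as `invalid`, while the registration token classifies as `repo`.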

Why this explains the 150s pickup failure

Forgejo runner pickup is gated by action_tasks_version:

runner sends cached TasksVersion in FetchTask
Forgejo calls PickTask only when runner version differs from DB version
job creation increments task versions for scopes (0,0), (owner_id,0), and (0,repo_id)
there is no maintained combined scope (1,35)

Live DB confirms there is no row for (owner_id=1, repo_id=35):

select * from action_tasks_version where owner_id=1 and repo_id=35;
-- 0 rows

For an invalid (1,35) runner, Forgejo effectively returns local version 1 on every fetch. After the first fetch the runner caches 1; future repo version bumps under (0,35) do not change the invalid combined scope, so the runner does not call PickTask. Restarting the runner resets its local cached version to 0, so 0 != 1 and the next fetch picks the waiting job immediately. That matches runs #931 and #933 exactly.
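The version-gating argument above can be checked with a toy model. This is a sketch under stated assumptions, not Forgejo or forgejo-runner code: `ToyServer`, `ToyRunner`, and `simulate` are hypothetical names, a missing version row is modelled as a constant 1 (as described above), and the model ignores labels and scope filtering inside the pick step.

```python
class ToyServer:
    """Toy stand-in for the server side of the fetch/pick handshake."""

    def __init__(self):
        self.versions = {}  # (owner_id, repo_id) -> version; missing row acts as 1
        self.waiting = []   # names of jobs waiting for a runner

    def create_job(self, owner_id, repo_id, name):
        self.waiting.append(name)
        # Only the three maintained scopes are bumped, never (owner_id, repo_id).
        for scope in {(0, 0), (owner_id, 0), (0, repo_id)}:
            self.versions[scope] = self.versions.get(scope, 1) + 1

    def fetch_task(self, runner_scope, cached_version):
        latest = self.versions.get(runner_scope, 1)
        task = None
        # The pick step only runs when the runner's cached version is out of date.
        if cached_version != latest and self.waiting:
            task = self.waiting.pop(0)
        return task, latest


class ToyRunner:
    def __init__(self, scope):
        self.scope = scope
        self.cached_version = 0

    def poll(self, server):
        task, self.cached_version = server.fetch_task(self.scope, self.cached_version)
        return task

    def restart(self):
        # A restart loses the in-memory cached version.
        self.cached_version = 0


def simulate(scope):
    """Return (task picked before restart, task picked after restart)."""
    server = ToyServer()
    runner = ToyRunner(scope)
    runner.poll(server)  # idle fetch: runner caches the current version
    server.create_job(owner_id=1, repo_id=35, name="auto-apply")
    before = runner.poll(server)
    runner.restart()
    after = runner.poll(server)
    return before, after


if __name__ == "__main__":
    print("invalid (1,35):", simulate((1, 35)))
    print("repo    (0,35):", simulate((0, 35)))
```

In this model the `(1, 35)` runner picks nothing until it is restarted, while the `(0, 35)` runner picks the job on its next poll, which is the same shape as the stall and recovery described for runs #931 and #933.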

Evidence from reproductions

Run #931:

run_id=931 event=push job=trusted main plan apply health task_id=0 age_seconds=155
watchdog restarted forgejo-deploy-runner at 16:37:09
runner declared and immediately got task 1461

Run #933:

run_id=933 event=workflow_dispatch job=trusted main plan apply health task_id=0 age_seconds=166
watchdog restarted forgejo-deploy-runner at 16:42:27
runner declared and immediately got task 1463

Source files read from Forgejo/runner official tags:

Forgejo v15.0.1: models/actions/runner.go, models/actions/tasks_version.go, routers/api/actions/runner/runner.go, models/actions/task.go
forgejo-runner v7.0.0: internal/app/poll/poller.go, internal/app/cmd/daemon.go

Proposed runtime fix (hard-gated)

Do not hard-delete stale runner id=4 first. The stronger root cause is active runner id=5 having the wrong scope.

Proposed operator-gated DB patch:

BEGIN;
SELECT id, name, owner_id, repo_id, deleted FROM action_runner WHERE id=5;
UPDATE action_runner
SET owner_id = 0
WHERE id = 5
  AND name = 'rs2000-deploy-host'
  AND repo_id = 35
  AND owner_id = 1
  AND deleted IS NULL;
SELECT id, name, owner_id, repo_id, deleted FROM action_runner WHERE id=5;
-- COMMIT only after operator approval; otherwise ROLLBACK

Before any write: pg_dump -t action_runner -t action_tasks_version to a timestamped file.

After commit: trigger two consecutive workflow_dispatch smokes on a known-green module and require zero watchdog restarts for pickup.
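The guarded UPDATE above can be wrapped in a check-first flow that commits only when exactly one row matches. The following is a minimal sketch and not the actual repair tooling: it uses Python's built-in `sqlite3` as a stand-in for the Forgejo PostgreSQL database, the table holds only the columns named in the patch, and `repair_scope` and `demo_db` are hypothetical names.

```python
import sqlite3

# Same guards as the proposed patch: only the known-bad row can match.
GUARDED_UPDATE = """
UPDATE action_runner
SET owner_id = 0
WHERE id = 5
  AND name = 'rs2000-deploy-host'
  AND repo_id = 35
  AND owner_id = 1
  AND deleted IS NULL
"""


def repair_scope(conn: sqlite3.Connection, apply: bool) -> str:
    """Run the guarded update; commit only if apply=True and exactly 1 row matched."""
    cur = conn.cursor()
    cur.execute(GUARDED_UPDATE)
    if cur.rowcount != 1:
        conn.rollback()
        return "no-op: row not in the expected state"
    if not apply:
        conn.rollback()
        return "dry-run: 1 row would change"
    conn.commit()
    return "applied: 1 row changed"


def demo_db() -> sqlite3.Connection:
    """In-memory stand-in containing the pre-fix runner row (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE action_runner ("
        "id INTEGER PRIMARY KEY, name TEXT, owner_id INTEGER, "
        "repo_id INTEGER, deleted INTEGER)"
    )
    conn.execute(
        "INSERT INTO action_runner VALUES (5, 'rs2000-deploy-host', 1, 35, NULL)"
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    db = demo_db()
    print(repair_scope(db, apply=False))
    print(repair_scope(db, apply=True))
    print(repair_scope(db, apply=True))  # second run finds nothing to fix
```

Because every guard lives in the WHERE clause, re-running the repair after it has been applied matches zero rows and rolls back, so the operation is safe to repeat.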

Gate

I am stopping here for G4/G7. Next step requires explicit operator approval to back up Forgejo tables and update the active runner row.


codex referenced this issue

2026-05-15 16:58:24 +02:00

fix(forgejo): add deploy runner scope doctor #286

codex commented

2026-05-15 16:58:41 +02:00

Author

Collaborator

Codex Phase B — repo-side scope doctor PR opened — 2026-05-15

Role: executor / investigator
Status: repo-side fix prepared; runtime DB repair remains hard-gated

What I did

Opened PR #286: fix(forgejo): add deploy runner scope doctor.
Added a check-first script that diagnoses the deploy runner scope invariant and only mutates with explicit --apply.
Documented the root invariant: rs2000-deploy-host must be repo-scoped as owner_id=0, repo_id=35, not combined owner_id=1, repo_id=35.
Updated the runbook so watchdog is framed as diagnostics/belt-and-suspenders, not the accepted normal pickup path.

Evidence

PR: #286 (https://git.pdurlej.com/pdurlej/platform/pulls/286)
Local tests: bash -n scripts/forgejo/deploy-runner-scope-doctor and pytest tests/test_deploy_runner_scope_doctor.py tests/test_deploy_runner_watchdog.py -q → 8 passed.
Read-only live check via stdin confirmed the known repairable state and performed no mutation:
- active runner id=5: owner_id=1, repo_id=35, deleted=NULL
- expected: owner_id=0, repo_id=35

Hard gate still held

I did not modify Forgejo DB. The live repair still requires explicit operator approval at apply time.

Exact gate text I need before runtime apply:

GO: backup + fix action_runner id=5 owner_id=0

After that approval, I will run backup + scoped transaction, then trigger two consecutive workflow_dispatch smokes and verify zero watchdog restarts.


codex commented

2026-05-15 18:44:30 +02:00

Author

Collaborator

Codex Phase B — deploy runner scope fixed + clean pickup verified — 2026-05-15 18:43 CEST

Role: executor / applier
Status: complete

What I did

Applied the operator-approved DB repair after PR #286 landed.
Backup written before mutation:
/opt/pdurlej-platform/backups/forgejo-actions/action-runner-scope-20260515T163922Z.sql
Updated exactly one active runner row:
rs2000-deploy-host id=5 owner_id=1, repo_id=35 -> owner_id=0, repo_id=35
Re-ran deploy-runner-scope-doctor --check: PASS.

Evidence

Runtime DB repair output showed:

DO
id=5 rs2000-deploy-host owner_id=0 repo_id=35 deleted=NULL
COMMIT
PASS repair complete

Two consecutive trusted-main workflow_dispatch smokes picked up cleanly with no watchdog restart:

| run | module | run_status | runner_id | start delay | duration | watchdog |
|---:|---|---:|---:|---:|---:|---|
| #957 / index #798 | matrix-well-known | success | 5 | 0s | 10s | no restart |
| #958 / index #799 | umami | success | 5 | 2s | 8s | no restart |

Watchdog journal after the fix contains only periodic no stuck trusted-main auto-apply jobs; no restarting forgejo-deploy-runner.service entries.

Artifacts exist:

platformctl-auto-apply-957, 4174 bytes
platformctl-auto-apply-958, 3538 bytes

Result

#260 root cause is fixed, not merely worked around. The watchdog remains as diagnostic belt-and-suspenders, but it was not needed for the two post-fix smokes.

Suggested close condition

Operator can close #260 after spot-checking this evidence. If desired, keep it open for 24h soak, but it should no longer be treated as active cutover fire.


codex referenced this issue

2026-05-15 18:44:30 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

codex referenced this issue

2026-05-15 18:53:19 +02:00

DRAFT: docs(prompts): stage F2 post-260 soak ramp #287

codex referenced this issue

2026-05-15 18:59:37 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

codex referenced this issue

2026-05-15 19:05:35 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

codex referenced this issue

2026-05-16 08:29:50 +02:00

fix(deploy): promote trusted release root for ops scripts #293

codex referenced this issue

2026-05-16 09:18:35 +02:00

docs(rs2000): record release readiness evidence #294

codex commented

2026-05-16 09:21:57 +02:00

Author

Collaborator

Codex #260 observation — clean pickup after scope fix — 2026-05-16 09:24 CEST

Role: executor
Status: current evidence supports #260 fix

Evidence

Release-root promoted to e5e8fe02a3fc97dc147b9104670128bbc459a39a via run #1131/API #878.
Six subsequent platformctl-auto-apply.yml workflow_dispatch smokes (#1132, #1134-#1138) all assigned to runner id=5.
seconds_to_task=1 for all six smokes.
Watchdog journal during the window reported only no stuck trusted-main auto-apply jobs; no restart fired.

Interpretation

The post-fix runner scope is behaving cleanly in this observation window. This is the signal we wanted before continuing broader release batches.


codex commented

2026-05-17 08:03:07 +02:00

Author

Collaborator

Codex #260 soak evidence check — 2026-05-17 08:05 CEST

Role: executor
Outcome: clean soak evidence after deploy-runner scope fix

Evidence gathered

forgejo-deploy-runner.service: active.
forgejo-deploy-runner-watchdog.timer: active.
Watchdog since 2026-05-16 00:00:
- positive stuck detections: 0
- deploy-runner restart lines: 0
- last deploy-runner service start in that window: May 16 09:03:54.
Recent platformctl-auto-apply.yml jobs since 2026-05-16 00:00+02:
- count: 59
- seconds_to_task: min 0, max 2, avg 0.95.
- Recent final-boss examples:
  - run #1133 / API 1469 Traefik F3: seconds_to_task=1.
  - run #1125 / API 1460 Forgejo F3: seconds_to_task=0.
  - run #1112 / API 1447 MinIO F3: seconds_to_task=2.
Runner DB state:
- active runner id 5: owner_id=0, repo_id=35, version v7.0.0, labels ["deploy-host"], last_online 2026-05-17 08:01:32+02.
- stale runner id 4 is tombstoned/deleted and inert.

Readout

This is the behavior we wanted: clean pickup without watchdog intervention. The old 69-177s task assignment delay is not present in the observed soak window.

Recommendation

Keep the watchdog as belt-and-suspenders instrumentation, but treat #260 as operationally resolved after this soak window. Next RS2000 migration work should be cleanup/closeout, not more runner RCA: close or re-scope #260, then address the known separate follow-ups np-meerkat-frontend (#269) and minio-init (#310).


codex referenced this issue

2026-05-17 08:53:01 +02:00

docs(status): record RS2000 F3 closeout state #318

codex referenced this issue

2026-05-17 08:53:32 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

codex commented

2026-05-17 08:53:32 +02:00

Author

Collaborator

Closing note - deploy-runner pickup resolved operationally - 2026-05-17 08:50 CEST

Role: executor
Status: resolved by scope fix + soak evidence

Evidence

forgejo-deploy-runner.service: active.
forgejo-deploy-runner-watchdog.timer: active.
Since 2026-05-16 00:00: watchdog positive stuck detections = 0.
Since 2026-05-16 00:00: deploy-runner restarts = 0.
Recent trusted-main platformctl-auto-apply.yml pickup since 2026-05-16 00:00+02: 59 runs, min 0s, max 2s, average 0.95s.
Active deploy runner row: id 5, repo_id 35, owner_id 0, version v7.0.0, labels ["deploy-host"].
Stale duplicate runner row is tombstoned/inert.

Decision

Closing this as operationally resolved. The watchdog remains as belt-and-suspenders instrumentation, but the false-fire pickup pattern is no longer present in soak evidence.

Reopen condition

Reopen if a trusted-main platformctl-auto-apply.yml job again waits with task_id=0 beyond threshold, or if watchdog restarts return after this closeout.
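The reopen condition can be expressed as a predicate over a run/job row, mirroring the watchdog criteria discussed in this issue (trusted-main auto-apply, `push` or `workflow_dispatch`, waiting status 5, `task_id=0`). This is a sketch only: `JobRow` and `is_stuck` are hypothetical names, and the threshold is left as a parameter because this thread does not state its value; the 120 seconds used in the example is an illustrative assumption.

```python
from dataclasses import dataclass

WAITING = 5  # status value the watchdog query in this issue treats as "waiting"
TRUSTED_EVENTS = {"push", "workflow_dispatch"}


@dataclass
class JobRow:
    workflow_id: str
    event: str
    ref: str
    run_status: int
    job_status: int
    task_id: int
    age_seconds: int


def is_stuck(row: JobRow, threshold_seconds: int) -> bool:
    """True when a trusted-main auto-apply job is waiting and unassigned too long."""
    return (
        row.workflow_id == "platformctl-auto-apply.yml"
        and row.event in TRUSTED_EVENTS
        and row.ref == "refs/heads/main"
        and row.run_status == WAITING
        and row.job_status == WAITING
        and row.task_id == 0
        and row.age_seconds > threshold_seconds
    )


if __name__ == "__main__":
    # Shape of run #933 before the watchdog restart (values from this issue).
    stalled = JobRow("platformctl-auto-apply.yml", "workflow_dispatch",
                     "refs/heads/main", WAITING, WAITING, 0, 166)
    # Threshold of 120s is a placeholder, not the watchdog's configured value.
    print(is_stuck(stalled, threshold_seconds=120))
```

A row that already has a task assigned (non-zero `task_id`) never matches, regardless of age, which is what keeps healthy long-running jobs from triggering a reopen.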

Reference: PR #318 updates state/STATUS_NOW.md with the post-F3 closeout state.


codex closed this issue

2026-05-17 08:53:33 +02:00

codex referenced this issue

2026-05-17 08:59:40 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

codex referenced this issue

2026-05-18 08:33:18 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142