ops(rs2000): close runtime health cleanup #800

Merged
pdurlej merged 1 commit from codex/orders/rs2000-runtime-health-closeout into main 2026-06-17 21:50:14 +02:00
Collaborator

Canary status: missing — not fired; this PR touches runbook/report state only and does not change canary-scoped code paths.

Canary Context Pack

Product story

RS2000 had a disk/runtime health incident that made Matrix/Element look broken and exposed missing disk-budget policy. The platform now needs a durable operator-facing closeout: what was fixed live, what remains intentionally red, and what the next agents should do without guessing or deleting data.

What changed

  • Extended runbooks/rs2000-disk-hygiene.md with remaining exception classes and size-budget targets.
  • Added state/reports/rs2000-runtime-health-closeout-2026-06-17.md with runtime evidence, root-cause summary, rollback note for the live ICS wrapper change, and follow-up queue.
  • Created follow-up issues #795-#799 and recorded them in the report.

Why it changed

The live cleanup fixed the immediate service symptoms, but leaving the policy in chat would make future agents rediscover the same risk and potentially over-clean persistent data. This PR turns the cleanup into durable repo state.

Files touched

  • runbooks/rs2000-disk-hygiene.md
  • state/reports/rs2000-runtime-health-closeout-2026-06-17.md

Relevant context

  • PR #793: first RS2000 disk hygiene rollout.
  • PR #794: sealed Vault snapshot handling and broader logopts follow-up.
  • Follow-up issues: #795, #796, #797, #798, #799.

Runtime evidence

Final read-only RS2000 snapshot:

/dev/vda3 503G 276G 207G 58% /
element_config 200
matrix_versions 200
forgejo 200
infisical_root 200
n8n 200
ntfy 200
honcho_health 200
safe_session 200
home-platform-iskra-things-sync-1 running healthy 64m/3
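
For future agents re-checking this evidence, the snapshot can be evaluated mechanically rather than by eye. The following is a minimal sketch, not part of this PR. The function names and the 80% disk threshold are assumptions made for illustration; the real size-budget targets live in runbooks/rs2000-disk-hygiene.md and are not reproduced here.

```python
# Snapshot text copied from the runtime evidence above.
SNAPSHOT = """\
/dev/vda3 503G 276G 207G 58% /
element_config 200
matrix_versions 200
forgejo 200
infisical_root 200
n8n 200
ntfy 200
honcho_health 200
safe_session 200
home-platform-iskra-things-sync-1 running healthy 64m/3
"""


def parse_snapshot(text):
    """Return (disk_use_percent, {smoke_name: http_status}) from a snapshot."""
    disk_pct = None
    smokes = {}
    for line in text.splitlines():
        parts = line.split()
        # df-style row: filesystem, size, used, avail, use%, mountpoint
        if len(parts) == 6 and parts[4].endswith("%"):
            disk_pct = int(parts[4].rstrip("%"))
        # HTTP smoke row: name followed by a numeric status code
        elif len(parts) == 2 and parts[1].isdigit():
            smokes[parts[0]] = int(parts[1])
        # Anything else (e.g. the container health row) is ignored here.
    return disk_pct, smokes


def evaluate(text, max_disk_pct=80):
    """Summarize a snapshot; max_disk_pct is a hypothetical threshold."""
    disk_pct, smokes = parse_snapshot(text)
    failing = sorted(name for name, status in smokes.items() if status != 200)
    return {
        "disk_pct": disk_pct,
        "disk_ok": disk_pct is not None and disk_pct <= max_disk_pct,
        "failing_smokes": failing,
    }


if __name__ == "__main__":
    print(evaluate(SNAPSHOT))
```

The sketch only reads text that was already captured, so it stays within the read-only, no-runtime-mutation boundary of this PR.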

Remaining intentionally visible failed unit:

platform-unique-knowledge-backup.service

Known constraints

No secret values were read or printed. Unique-knowledge backup repair and ICS credential repair remain secret/offsite-gated follow-ups. Docker volume/image cleanup remains classification-first, not prune-first.

Explicit out-of-scope

  • Deleting Docker volumes, images, backups, or unique-knowledge archives.
  • Rotating credentials or reading secret values.
  • Merging, automerge, deploy, or runtime mutation from this PR.

Requested decision

Approve the report/runbook closeout as the durable record of the RS2000 runtime-health cleanup.

Merge blockers

  • Any accidental secret-bearing content.
  • Any instruction that would authorize blanket prune/delete or forced legacy recreate.
  • Missing link from report to the follow-up queue.

Spec sources read

  • runbooks/rs2000-disk-hygiene.md — current disk policy runbook.
  • state/reports/rs2000-disk-hygiene-2026-06-17.md — prior rollout evidence.
  • state/reports/rs2000-backup-logopts-followup-2026-06-17.md — prior backup/logopts follow-up evidence.
  • runbooks/unique-knowledge-backup.md — unique-knowledge backup safety lane.
  • scripts/backup/README.md — unique-knowledge wrapper context.
  • Runtime metadata via read-only SSH — systemd, Docker inspect/logopts, HTTP smokes; no secret values.

Verification

  • git diff --check
  • Secret-pattern scan over touched markdown files
  • AntiGravity/Gemini sanitized readability/safety review: no blockers, improvement applied to #795-#799
  • Ollama/Kimi sanitized boundary review: no volume prune, no unique-knowledge deletion, no forced legacy recreates
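
The secret-pattern scan listed above is not described in detail in this PR. As a rough illustration of what such a check might look like, here is a minimal sketch. The pattern set, names, and functions are assumptions chosen for the example, not the scanner that was actually run.

```python
import re
from pathlib import Path

# Hypothetical pattern set; a real scan would likely use a broader list.
SECRET_PATTERNS = {
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "credential_assignment": re.compile(
        r"(?i)\b(?:password|passwd|secret|token|api[_-]?key)\b\s*[:=]\s*\S{8,}"
    ),
}


def scan_text(text):
    """Return a list of (line_number, pattern_name) hits for the given text."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


def scan_files(paths):
    """Scan each file and return {path: hits} for files with at least one hit."""
    report = {}
    for path in map(Path, paths):
        hits = scan_text(path.read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report
```

Under these assumptions, calling `scan_files` with the two touched markdown files and getting an empty dict back would correspond to a clean scan. The report records only line numbers and pattern names, never the matched text, so the scan output itself cannot leak a secret value.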

Related: #795, #796, #797, #798, #799

ops(rs2000): close runtime health cleanup
All checks were successful
canary-required / collect-diff (pull_request) Successful in 5s
base-is-main / guard (pull_request) Successful in 2s
canary-required / canary (pull_request) Has been skipped
patchwarden-pr-sanity / collect-diff (pull_request) Successful in 4s
patchwarden-client-dry-run / collect-diff (pull_request) Successful in 4s
patchwarden-client-dry-run / dry-run (pull_request) Successful in 1m10s
patchwarden-pr-sanity / sanity (pull_request) Successful in 2m13s
e811b902e8
First-time contributor

No Patchwarden findings to render.
First-time contributor

Patchwarden PR sanity

  • Status: advisory_findings
  • PR: 800
  • Commit: e811b902e806437900e6937ec727884c6eb60dfb
  • Security-sensitive label: present
  • Authority: advisory model review plus deterministic blockers only
  • 3+3 canary: still alive; this does not replace it

Deterministic findings

No deterministic findings.

Model reviewers

global-glm / glm-5.1:cloud

  • Status: ok
  • Verdict: OK
  • Findings: none

global-deepseek / deepseek-v4-pro:cloud

  • Status: ok
  • Verdict: OK
  • Findings: none

redteam / kimi-k2.6:cloud

  • Status: ok

  • Verdict: NOT_OK

  • high Durable closeout anchors rollback to ephemeral on-disk file outside version control

    • Evidence: state/reports/rs2000-runtime-health-closeout-2026-06-17.md states rollback for the live ICS wrapper change is to restore /opt/iskra-openclaw/scripts/iskra-publish-ics.sh.pre-stale-fallback-rc-fix-20260617T191016Z; follow-up #797 confirms
    • Next: Version-control both the pre-change script and the current live wrapper in this repo (e.g., under scripts/ or hotfixes/) so the documented rollback path survives host rebuilds or disk cleanup before this closeout is approved as the durable record
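
If the red-team suggestion is taken up, one possible guard is to confirm that a version-controlled copy of the pre-change script is byte-identical to the on-host backup before relying on either for rollback. The sketch below is an illustration under assumptions: the function names are hypothetical, and no repo path for the script exists in this PR.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rollback_copy_matches(host_copy, repo_copy):
    """True only when both files exist and have identical contents.

    host_copy would be the on-host backup named in the finding above;
    repo_copy is a hypothetical version-controlled copy (for example
    somewhere under scripts/ or hotfixes/, as the finding suggests).
    """
    host, repo = Path(host_copy), Path(repo_copy)
    if not host.is_file() or not repo.is_file():
        return False
    return sha256_of(host) == sha256_of(repo)
```

A missing file is treated as a mismatch on purpose, so a host rebuild or disk cleanup that removes the on-host backup surfaces as a failed check rather than passing silently.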

Policy notes

  • GLM 5.1 + DeepSeek V4 Pro are the operator-required model mix for this bot.
  • Optional red-team model is enabled only when PLATFORMCTL_PR_SANITY_REDTEAM_MODEL is configured.
  • Auto-merge is not enabled here.
Reference
pdurlej/platform!800