ops(rs2000): record disk hygiene rollout #793

Merged
pdurlej merged 1 commit from codex/orders/rs2000-disk-hygiene into main 2026-06-17 19:58:57 +02:00
Collaborator

Canary status: missing — fire canary 3+3 manually before merge

Canary Context Pack

Product story

RS2000 ran out of disk during the Matrix/Element incident because rebuildable artifacts and backup outputs were not bounded. This PR records the live remediation that keeps Docker cache/log growth controlled without automating deletion of application data.

What changed

  • Added host-ops backup retention helper and installer used for the keep-2 critical backup policy.
  • Added Docker disk policy helper and installer for RS2000.
  • Added RS2000 disk hygiene runbook.
  • Added rollout evidence report for the 2026-06-17 runtime deployment.
  • Added focused tests for backup retention and Docker disk policy generation.
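As an illustration of what a keep-2 policy means in practice, here is a minimal sketch that only plans retention and deletes nothing. It is not the code in scripts/host-ops/backup_retention.py; the `Archive` type, the `plan_retention` function, and the archive names are hypothetical and chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Archive:
    """Hypothetical record for one backup archive."""
    name: str
    mtime: float  # modification time, seconds since epoch


def plan_retention(archives, keep=2):
    """Split archives into (kept, expired), retaining the `keep` newest.

    This only computes a plan; it never touches the filesystem.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ordered = sorted(archives, key=lambda a: a.mtime, reverse=True)
    return ordered[:keep], ordered[keep:]


# Illustrative archive names and timestamps (assumed, not from the rollout).
archives = [
    Archive("critical-a.tar.zst", 100.0),
    Archive("critical-b.tar.zst", 300.0),
    Archive("critical-c.tar.zst", 200.0),
]
kept, expired = plan_retention(archives)
print("keep:", [a.name for a in kept])
print("expire:", [a.name for a in expired])
```

Separating the plan from the deletion step keeps the destructive part small and easy to review, which fits the safety boundary this PR describes.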

Why it changed

The runtime was fixed first under scoped operator approval. This PR makes that deployed state reproducible and reviewable in the canonical platform repo.

Files touched

  • runbooks/rs2000-disk-hygiene.md
  • scripts/host-ops/backup_retention.py
  • scripts/host-ops/install_keep2_backup_policy.py
  • scripts/host-ops/docker_disk_policy.py
  • scripts/host-ops/install_docker_disk_policy.py
  • state/reports/rs2000-disk-hygiene-2026-06-17.md
  • tests/test_host_ops_backup_retention.py
  • tests/test_host_ops_docker_disk_policy.py

Relevant context

  • AGENTS.md hard-stop model for critical infra and destructive cleanup.
  • docs/forgejo-agent-operations.md for PR identity and Forgejo workflow.
  • runbooks/rs2000-disk-hygiene.md for the resulting operator runbook.

Runtime evidence

Captured in state/reports/rs2000-disk-hygiene-2026-06-17.md.

Key post-deploy facts:

  • / improved from 188G free / 62% used to 206G free / 58% used after BuildKit GC.
  • Docker Build Cache dropped from 88.04G to 70.98G.
  • No Docker volumes, backups, or rollback images were pruned.
  • Element, Matrix, Forgejo, Infisical, n8n, ntfy, Uptime Kuma, and Honcho smokes passed.
  • Docker had no unhealthy/restarting containers after rollout.
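The two disk figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in this PR; the variable names are illustrative, and the explanation for the small remaining gap is an assumption, not something recorded in the report.

```python
# Figures quoted in the runtime evidence, in gigabytes.
root_free_before = 188
root_free_after = 206
cache_before = 88.04
cache_after = 70.98

root_gain = root_free_after - root_free_before       # free space gained on /
cache_freed = round(cache_before - cache_after, 2)   # build cache released
unexplained = round(root_gain - cache_freed, 2)      # gap between the two

print(f"free space gained on /: {root_gain}G")
print(f"build cache released:   {cache_freed}G")
print(f"difference:             {unexplained}G")
```

The difference is under 1G, which is plausibly rounding in whole-gigabyte `df` output or unrelated churn on the host, so the figures are consistent with BuildKit GC being the main source of the reclaimed space.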

Known constraints

  • /etc/docker/daemon.json log rotation applies to newly created containers; existing long-lived containers keep their creation-time logging options until recreated.
  • platform-unique-knowledge-backup.service and hp-backup-critical.service were already in a failed state before this rollout and remain separate follow-ups.
  • BuildKit GC can slow future builds because evicted cache layers have to be rebuilt.
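For readers unfamiliar with the settings involved, the sketch below builds a daemon.json fragment with log rotation and BuildKit garbage collection. The keys (`log-driver`, `log-opts`, `builder.gc`) are standard Docker daemon options, but the function names and every value (10m, 3 files, 20GB) are assumptions for illustration; the values actually deployed on RS2000 are not stated in this PR.

```python
import json


def build_disk_policy(max_size="10m", max_file="3", keep_storage="20GB"):
    """Return a daemon.json fragment; all default values are assumed."""
    return {
        "log-driver": "json-file",
        # Docker expects log-opts values as strings, including max-file.
        "log-opts": {"max-size": max_size, "max-file": max_file},
        "builder": {"gc": {"enabled": True, "defaultKeepStorage": keep_storage}},
    }


def merge_policy(existing, policy):
    """Shallow-merge policy keys over an existing config without mutating it.

    Top-level keys in `policy` replace the same keys in `existing`;
    unrelated keys are preserved.
    """
    merged = dict(existing)
    merged.update(policy)
    return merged


existing = {"data-root": "/var/lib/docker"}  # hypothetical current config
merged = merge_policy(existing, build_disk_policy())
print(json.dumps(merged, indent=2, sort_keys=True))
```

As the first constraint notes, log options like these take effect only for containers created after the daemon has loaded the new configuration.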

Explicit out-of-scope

  • Docker volume pruning.
  • Backup archive deletion beyond the already deployed host-ops keep-2 critical policy.
  • Docker daemon restart.
  • Application deploy or restart.
  • Secret reads, credential changes, exposure changes, or network boundary changes.

Requested decision

Review the repo representation of the live rollout and merge if the implementation matches the safety boundary.

Merge blockers

  • Any path that can delete Docker volumes or application data automatically.
  • Any secret-bearing artifact in the PR.
  • Incorrect runtime evidence or missing rollback path.

Spec sources read

  • AGENTS.md — repo and critical-infra safety contract.
  • docs/forgejo-agent-operations.md — Forgejo identity and PR contract.
  • runbooks/unique-knowledge-backup.md — to verify unique-knowledge archives are outside Docker GC scope.
  • scripts/backup/unique_knowledge_backup.py and tests/test_unique_knowledge_backup.py — to understand existing backup test patterns.

Tests and verification

  • python3 -m py_compile scripts/host-ops/backup_retention.py scripts/host-ops/install_keep2_backup_policy.py scripts/host-ops/docker_disk_policy.py scripts/host-ops/install_docker_disk_policy.py
  • python3 -m pytest tests/test_host_ops_backup_retention.py tests/test_host_ops_docker_disk_policy.py — 8 passed
  • Secret-pattern scan over new scripts/tests/runbook/report returned no matches.
  • Runtime smokes and disk checks are recorded in state/reports/rs2000-disk-hygiene-2026-06-17.md.
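The secret-pattern scan mentioned above is not described further in this PR. A minimal sketch of that kind of check might look like the following; the pattern set, the names, and the sample text are all assumptions, and a real scan would normally use a broader, maintained rule set.

```python
import re

# Hypothetical pattern set; not the patterns used for this PR.
PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "pem_private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "inline_credential": re.compile(
        r"(?i)\b(password|secret|token)\b\s*[:=]\s*\S+"
    ),
}


def scan_text(text):
    """Return (line_number, pattern_name) for every line that matches."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


# Made-up sample; line 2 contains an obviously fake credential.
sample = "max-size: 10m\ntoken = not-a-real-value\n"
for lineno, name in scan_text(sample):
    print(f"line {lineno}: {name}")
```

An empty result from `scan_text` corresponds to the "returned no matches" outcome reported above, within the limits of whatever patterns are configured.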
ops(rs2000): record disk hygiene rollout
All checks were successful
canary-required / collect-diff (pull_request) Successful in 5s
python-ci / Python 3.11 (pull_request) Successful in 44s
python-ci / Python 3.12 (pull_request) Successful in 46s
python-ci / Python 3.13 (pull_request) Successful in 42s
canary-required / canary (pull_request) Successful in 16s
base-is-main / guard (pull_request) Successful in 1s
patchwarden-client-dry-run / collect-diff (pull_request) Successful in 4s
patchwarden-client-dry-run / dry-run (pull_request) Successful in 21s
patchwarden-pr-sanity / sanity (pull_request) Successful in 5m15s
patchwarden-pr-sanity / collect-diff (pull_request) Successful in 4s
df98a1e773
First-time contributor

Patchwarden PR sanity

  • Status: eligible_sanity_clean
  • PR: 793
  • Commit: df98a1e7736f0e183f09cd41d2ca0d1266b3df84
  • Security-sensitive label: present
  • Authority: advisory model review plus deterministic blockers only
  • 3+3 canary: still alive; this does not replace it

Deterministic findings

No deterministic findings.

Model reviewers

global-glm / glm-5.1:cloud

  • Status: ok
  • Verdict: OK
  • Findings: none

global-deepseek / deepseek-v4-pro:cloud

  • Status: ok

  • Verdict: OK

  • medium: Missing canary for critical infra change

    • Evidence: PR description states 'Canary status: missing — fire canary 3+3 manually before merge'. The change touches host-ops scripts that affect disk hygiene on a production host, which likely falls under the AGENTS.md critical-infra canary requirements.
    • Next: Fire the canary (3+3) before merging, as noted in the PR description, to satisfy the repo's safety contract.

redteam / kimi-k2.6:cloud

  • Status: error
  • Verdict: -
  • Note: ReadTimeout: The read operation timed out
  • Findings: none

Policy notes

  • GLM 5.1 + DeepSeek V4 Pro are the operator-required model mix for this bot.
  • Optional red-team model is enabled only when PLATFORMCTL_PR_SANITY_REDTEAM_MODEL is configured.
  • Auto-merge is not enabled here.
pdurlej deleted branch codex/orders/rs2000-disk-hygiene 2026-06-17 19:58:58 +02:00
Reference
pdurlej/platform!793