fix(host-ops): skip sealed vault backup snapshot #794

Merged
pdurlej merged 1 commit from codex/orders/rs2000-backup-logopts-followup into main 2026-06-17 21:02:44 +02:00
Collaborator

Canary status: missing — fire canary 3+3 manually before merge

Canary Context Pack

Product story

RS2000 should stay recoverable and disk-stable without waking Piotr for false-fire backup failures. A sealed sunset Vault should not make the whole critical backup fail when the durable Vault data-volume archive has already been captured.

What changed

  • Added an idempotent host-ops installer that patches backup.sh to skip the optional Vault Raft snapshot when Vault is sealed.
  • Added focused tests for the sealed-Vault patcher.
  • Updated the RS2000 disk hygiene runbook with container recreate/logopts and sealed-Vault backup behavior.
  • Recorded runtime evidence for the backup fix, Docker daemon restart, controlled container recreation, remaining exceptions, and final smokes.
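A minimal sketch of what idempotent, exact-match patching can look like. The actual installer's block contents are not shown on this page, so `OLD_BLOCK`, `NEW_BLOCK`, the `vault_is_sealed` shell helper, and `patch_text` are hypothetical stand-ins, not the code in `install_vault_sealed_backup_skip.py`.

```python
# Hypothetical stand-ins: the real matched/replacement blocks are not shown in this PR.
OLD_BLOCK = 'bash "$SCRIPT_DIR/vault-backup.sh"\n'
NEW_BLOCK = (
    'if vault_is_sealed; then\n'
    '  echo "vault sealed; skipping optional raft snapshot"\n'
    'else\n'
    '  bash "$SCRIPT_DIR/vault-backup.sh"\n'
    'fi\n'
)


def patch_text(text: str) -> tuple[str, bool]:
    """Return (patched_text, changed); re-running on patched text is a no-op."""
    # Already patched: report "unchanged" instead of patching twice.
    if NEW_BLOCK in text:
        return text, False
    # Refuse to guess when the script has drifted from the expected content.
    if text.count(OLD_BLOCK) != 1:
        raise SystemExit("backup.sh does not match the expected block; refusing to patch")
    return text.replace(OLD_BLOCK, NEW_BLOCK, 1), True
```

Checking for the new block first is what makes a second run safe, and failing loudly on an unexpected script avoids a silent partial patch.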

Why it changed

Live hp-backup-critical.service failed because Vault was initialized but sealed. The backup script had already archived vault_data, but still treated vault-backup.sh as mandatory when Vault was running.

Files touched

  • scripts/host-ops/install_vault_sealed_backup_skip.py
  • tests/test_host_ops_vault_sealed_backup_skip.py
  • runbooks/rs2000-disk-hygiene.md
  • state/reports/rs2000-backup-logopts-followup-2026-06-17.md

Relevant context

  • Prior disk hygiene rollout: PR #793
  • Runtime service: hp-backup-critical.service
  • Docker daemon log policy: /etc/docker/daemon.json with max-size=64m, max-file=3

Runtime evidence

  • hp-backup-critical.service rerun succeeded.
  • Latest critical backup: 20260617-200345-critical.
  • Final systemctl is-failed hp-backup-critical.service: inactive.
  • Final endpoint smokes: Element 200, Matrix 200, Forgejo 200, Infisical 200, n8n 200, ntfy 200, Honcho health 200.
  • No containers unhealthy or restarting at final check.

Known constraints

  • No Vault unseal key, root token, PAT, or secret value was read or printed.
  • Some remaining containers still lack daemon logopts because their compose ownership is legacy/orphaned or requires real secret-rendered paths.

Explicit out-of-scope

  • Unique-knowledge backup retention; operator explicitly chose to ignore it here.
  • Docker volume prune or backup deletion beyond the existing keep-2 critical retention helper.
  • Manual recreation of legacy/orphan containers without coherent compose source.
  • Vault bootstrap/unseal/revival.

Requested decision

Review and merge the repo codification of the already-applied runtime fix and rollout evidence.

Merge blockers

  • Installer must remain idempotent.
  • Backup behavior must not hide an unreachable/malformed Vault status as success.
  • No secret values may enter repo artifacts.

Spec sources read

  • scripts/host-ops/install_keep2_backup_policy.py — existing host-ops installer style.
  • tests/test_host_ops_backup_retention.py — test style for host-ops helpers.
  • runbooks/rs2000-disk-hygiene.md — runbook updated by prior rollout.
  • state/reports/rs2000-disk-hygiene-2026-06-17.md — prior rollout evidence and remaining-work baseline.
  • Live /opt/pdurlej-platform/runtime/host-ops/scripts/backup.sh snippets via SSH — runtime source for the hotfix, no secrets.

Verification

python3 -m py_compile scripts/host-ops/install_vault_sealed_backup_skip.py
python3 -m pytest tests/test_host_ops_backup_retention.py tests/test_host_ops_vault_sealed_backup_skip.py tests/test_host_ops_docker_disk_policy.py

Result: 11 passed.

fix(host-ops): skip sealed vault backup snapshot
All checks were successful
canary-required / collect-diff (pull_request) Successful in 5s
python-ci / Python 3.11 (pull_request) Successful in 44s
python-ci / Python 3.12 (pull_request) Successful in 46s
python-ci / Python 3.13 (pull_request) Successful in 42s
base-is-main / guard (pull_request) Successful in 1s
canary-required / canary (pull_request) Successful in 19s
patchwarden-client-dry-run / collect-diff (pull_request) Successful in 4s
patchwarden-client-dry-run / dry-run (pull_request) Successful in 22s
patchwarden-pr-sanity / sanity (pull_request) Successful in 4m9s
patchwarden-pr-sanity / collect-diff (pull_request) Successful in 4s
cad5f63058
First-time contributor

Patchwarden PR sanity

  • Status: advisory_findings
  • PR: 794
  • Commit: cad5f630581f82b55fa410c3c4aefbeaffc9834b
  • Security-sensitive label: present
  • Authority: advisory model review plus deterministic blockers only
  • 3+3 canary: still alive; this does not replace it

Deterministic findings

No deterministic findings.

Model reviewers

global-glm / glm-5.1:cloud

  • Status: ok

  • Verdict: ABSTAIN

  • medium reviewer output unparseable

    • Evidence: no valid JSON object found in response
    • Next: investigate provider response; possibly increase max_tokens or retry

global-deepseek / deepseek-v4-pro:cloud

  • Status: ok

  • Verdict: NOT_OK

  • blocker Canary not fired before merge

    • Evidence: PR description explicitly states 'Canary status: missing — fire canary 3+3 manually before merge'. No canary-related files are changed in the diff.
    • Next: Fire the required canary (3+3) before merging, as per the repository's merge policy.
  • medium Installer tightly coupled to backup.sh content

    • Evidence: install_vault_sealed_backup_skip.py uses exact string matching for COMPOSE_BLOCK and OLD_VAULT_BACKUP_BLOCK. If backup.sh is modified in the future, the installer will raise SystemExit and fail to patch, potentially leaving the backup unpro
    • Next: Add a comment in backup.sh near the matched blocks to keep them in sync with the installer, or refactor the installer to use marker comments for more resilient patching.
  • low Race condition if installer runs during backup execution

    • Evidence: The installer writes directly to the backup.sh file without any locking or check for a running backup process. If hp-backup-critical.service is active, the script could be read while being modified.
    • Next: Document that the installer should only be run when the backup service is stopped, or add a pre-check for active backup processes.

redteam / kimi-k2.6:cloud

  • Status: error
  • Verdict: -
  • Note: ReadTimeout: The read operation timed out
  • Findings: none

Policy notes

  • GLM 5.1 + DeepSeek V4 Pro are the operator-required model mix for this bot.
  • Optional red-team model is enabled only when PLATFORMCTL_PR_SANITY_REDTEAM_MODEL is configured.
  • Auto-merge is not enabled here.
pdurlej deleted branch codex/orders/rs2000-backup-logopts-followup 2026-06-17 21:02:44 +02:00
Reference
pdurlej/platform!794