ops(rs2000): F3 stateful smoke design + backup-before contract #271

Closed
opened 2026-05-13 23:28:45 +02:00 by codex · 1 comment
Collaborator

Context

F3 is the future stateful smoke phase for RS2000. Unlike F1/F1.5/F2 no-op smokes, F3 touches services with local data or persistent dependencies. Every F3 candidate needs a backup-before-apply contract before any workflow_dispatch smoke.

PR #270 adds a draft operator-run helper:

  • scripts/cutover/backup-before-apply.sh
  • scripts/cutover/README.md

No F3 smoke is part of PR #270. This issue is the design gate before any stateful smoke starts.

Stateful candidates by backup class

Class A - Postgres-family

Procedure: pg_dumpall inside the target container, gzip output, write mode 0600 under /opt/pdurlej-platform/backups.

Candidates:

  • postgres
  • honcho-postgres
  • agent-plane-shadow-postgres

Notes: core/high-value data. Do not use as first F3 smoke.
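
A minimal sketch of the Class A procedure, not the PR #270 helper itself. The function name `pg_backup`, the `PG_BACKUP_USER` variable, and its `postgres` default are assumptions for illustration. The dump goes to a temporary file first so that a failed `pg_dumpall` is not hidden behind a successful `gzip`.

```shell
# Hypothetical helper: dump all databases from a Postgres container,
# gzip the result, and leave it with mode 0600.
# Assumption: the superuser is "postgres" unless PG_BACKUP_USER is set.
pg_backup() {
  container="$1"
  out="$2"                 # final path, expected to end in .sql.gz
  tmp="${out%.gz}"         # uncompressed dump is written here first
  (
    umask 077              # anything created below is owner-only
    docker exec "$container" pg_dumpall -U "${PG_BACKUP_USER:-postgres}" > "$tmp" &&
      gzip -f "$tmp"
  ) || { rm -f "$tmp"; return 1; }
  chmod 600 "$out"
}

# Example call (the timestamp is illustrative):
#   pg_backup postgres /opt/pdurlej-platform/backups/postgres-20260513T212845Z.sql.gz
```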

Class B - Redis-family

Procedure: request redis-cli BGSAVE when possible, then archive mounted Redis data. Password-protected Redis may skip BGSAVE until auth behavior is verified.

Candidates:

  • redis
  • honcho-redis
  • infisical-redis

Notes: core or security-sensitive dependencies. Not first F3.
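
A sketch of the Class B procedure under stated assumptions. `BGSAVE` is asynchronous, so this version polls `LASTSAVE` until the timestamp advances before archiving. The function name, the `data_dir` argument (host path of the mounted Redis data), and `REDIS_WAIT_TRIES` are hypothetical. If Redis requires a password, or the previous save happened within the same second, the timestamp never advances, and the sketch warns and archives the existing files, in line with the "may skip BGSAVE" note above.

```shell
# Hypothetical helper: request BGSAVE, wait for it to finish, archive the data dir.
redis_backup() {
  container="$1"
  data_dir="$2"            # assumption: host path of the mounted Redis data
  out="$3"
  max="${REDIS_WAIT_TRIES:-30}"
  before="$(docker exec "$container" redis-cli LASTSAVE 2>/dev/null)"
  docker exec "$container" redis-cli BGSAVE >/dev/null 2>&1
  tries=0
  while [ "$tries" -lt "$max" ]; do
    after="$(docker exec "$container" redis-cli LASTSAVE 2>/dev/null)"
    # LASTSAVE changes once the background save has completed.
    [ -n "$after" ] && [ "$after" != "$before" ] && break
    tries=$((tries + 1))
    sleep 1
  done
  [ "$tries" -lt "$max" ] ||
    echo "warning: BGSAVE not confirmed for $container; archiving existing files" >&2
  ( umask 077 && tar -czf "$out" -C "$data_dir" . ) && chmod 600 "$out"
}
```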

Class C - Vault

Procedure: if VAULT_TOKEN is provided and Vault is unsealed, use vault operator raft snapshot save; otherwise fall back to archiving the container mounts.

Candidates:

  • vault

Notes: sunset/security-sensitive. Not first F3.
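
A sketch of the Class C branch logic. It assumes the `vault` CLI is reachable from where the helper runs (with `VAULT_ADDR` already set) and that Vault uses integrated Raft storage, which the snapshot command requires. The function name and the return code 2 for "use the fallback" are hypothetical conventions.

```shell
# Hypothetical helper: take a Raft snapshot when possible.
# Returns 2 when the caller should fall back to the mount archive.
vault_backup() {
  out="$1"
  # `vault status` exits 0 only when Vault is reachable and unsealed.
  if [ -n "${VAULT_TOKEN:-}" ] && vault status >/dev/null 2>&1; then
    ( umask 077 && vault operator raft snapshot save "$out" ) && chmod 600 "$out"
  else
    return 2
  fi
}
```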

Class D - MinIO/S3

Procedure: prefer mc mirror to the backup path when MINIO_MC_ALIAS is configured; otherwise fall back to archiving the container mounts.

Candidates:

  • minio

Notes: needs operator decision on mc credentials/alias and retention before use.
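
A sketch of the Class D preference, assuming one bucket is mirrored at a time. The function name, the `bucket` argument, and the return code 2 for "use the fallback" are hypothetical. A mirror produces a directory tree rather than a single file, so the sketch relies on `umask 077` to keep the copy owner-only.

```shell
# Hypothetical helper: mirror one bucket to a local directory.
# Returns 2 when MINIO_MC_ALIAS is not configured, so the caller can fall back.
minio_backup() {
  bucket="$1"
  dest="$2"
  [ -n "${MINIO_MC_ALIAS:-}" ] || return 2
  ( umask 077 && mkdir -p "$dest" && mc mirror "$MINIO_MC_ALIAS/$bucket" "$dest" )
}
```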

Class E - Filesystem/app state

Procedure: archive Docker mounts for the resolved live container.

Candidates:

  • audio-mcp-legacy
  • audio-mcp
  • deploy-control
  • forgejo
  • git-mirror
  • gmail-private-mcp
  • gmail-triage-mcp
  • hermes-agency
  • infisical
  • jellyfin
  • kanboard
  • n8n-main
  • np-meerkat-backend
  • np-memos
  • np-openhabittracker
  • np-radicale
  • np-silverbullet
  • np-tududi
  • np
  • ntfy
  • obsidian-headless-sync
  • products-agent-eval-lab
  • safe-session-api
  • shelfmark
  • signal-bridge-legacy
  • signal-bridge-mautrix
  • signal-cli
  • storage-ro-mcp
  • synapse
  • teamspeak3
  • traefik
  • uptime-kuma
  • voice-transcription

Notes: the first F3 smoke should likely come from this class, excluding routing, identity, core-state, and high-value services. Current recommendation: start with uptime-kuma or another low-blast Class E service after a mount audit.
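
A sketch of the Class E procedure, assuming the mount sources reported by `docker inspect` are readable from the host. The function name is hypothetical. GNU tar strips the leading `/` from member names and skips sockets with a warning, which is one reason a mount audit should come first for services such as traefik.

```shell
# Hypothetical helper: archive every mount source of a running container.
mounts_backup() {
  container="$1"
  out="$2"
  list="$(mktemp)" || return 1
  # One mount source path per line; drop the trailing blank line.
  docker inspect --format '{{range .Mounts}}{{println .Source}}{{end}}' "$container" |
    sed '/^$/d' > "$list"
  if [ ! -s "$list" ]; then
    rm -f "$list"
    echo "no mounts found for $container" >&2
    return 1
  fi
  ( umask 077 && tar -czf "$out" -T "$list" )
  status=$?
  rm -f "$list"
  [ "$status" -eq 0 ] && chmod 600 "$out"
}
```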

Class F - Engine-specific state

Procedure: prefer an engine-native dump; fall back to archiving the container mounts until the dump path is proven.

Candidates:

  • karakeep-meilisearch
  • searxng

Notes: searxng may be low-blast, but only after confirming its real mounted state and restore story.

Class G - Agaria separate root

Procedure: separate compose/root backup outside canonical RS2000 compose path.

Candidates:

  • agaria-postgres
  • agaria-redis

Notes: out of canonical compose; requires Agaria-specific plan before F3.

Backup destination and size evidence

Default destination from PR #270:

/opt/pdurlej-platform/backups/<module>-<UTC timestamp>.<suffix>

Backup files must be mode 0600 because they can contain secrets/private data.
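
A sketch of how the destination path and file mode could be produced. The timestamp format, the `BACKUP_ROOT` override, and both function names are assumptions; the helper in PR #270 may do this differently.

```shell
# Assumption: BACKUP_ROOT can be overridden for dry runs.
BACKUP_ROOT="${BACKUP_ROOT:-/opt/pdurlej-platform/backups}"

# Print <root>/<module>-<UTC timestamp>.<suffix>.
# The timestamp format (e.g. 20260513T212845Z) is an assumption.
backup_path() {
  module="$1"
  suffix="$2"
  printf '%s/%s-%s.%s\n' "$BACKUP_ROOT" "$module" "$(date -u +%Y%m%dT%H%M%SZ)" "$suffix"
}

# Create the file owner-only before any data is written to it,
# so secrets are never readable by group or others, even briefly.
create_private_file() {
  ( umask 077 && : > "$1" ) && chmod 600 "$1"
}
```

Creating the file under `umask 077` first avoids the window in which a file written with default permissions would be readable before a later `chmod`.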

Read-only evidence from 2026-05-13:

Docker local volumes: 63 total, 26 active, 11.34GB total, 5.304GB reclaimable

Per-module expected backup sizes are not yet asserted. They need a dedicated read-only volume audit before first F3 smoke.
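
One possible shape for the read-only volume audit: `docker system df -v` lists per-volume sizes, and the sketch below reports the size of each mount source for a single container. The function name is hypothetical, and reading the mount sources usually requires root on the host.

```shell
# Hypothetical helper: print size in KiB and path for each mount of a container.
# Read-only: nothing is written or restarted.
mount_sizes() {
  docker inspect --format '{{range .Mounts}}{{println .Source}}{{end}}' "$1" |
    sed '/^$/d' |
    while IFS= read -r src; do
      du -sk "$src" 2>/dev/null || echo "unreadable: $src" >&2
    done
}
```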

Operator decisions before F3

  • Which class goes first? Recommendation: Class E, low-blast filesystem-state service.
  • Which first module? Recommendation: uptime-kuma or similar after mount audit.
  • Backup retention policy: how many backups to keep, where, and whether encrypted.
  • Restore procedure per class: explicit restore command must exist before first smoke in that class.
  • Operator on-duty requirement: F3 is not autonomous; operator + Codex joint session only.

Restore contract to define before first F3

Minimum restore notes needed:

  • Class A: restore pg_dumpall output into the correct Postgres container/database context.
  • Class B: stop service, restore Redis data files/AOF/RDB, start service, verify health.
  • Class C: Vault snapshot restore command and unseal/operator-token procedure.
  • Class D: MinIO mirror restore or volume restore path.
  • Class E: stop service, restore archived mount contents, start service, verify health.
  • Class F: engine-specific restore command, not just backup.
  • Class G: Agaria-specific restore outside canonical compose.
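
As an illustration of what an explicit restore command could look like, here is a sketch for Class E only. It assumes the archive was created with paths relative to `/` (as GNU tar does when it strips the leading slash) and that "verify health" can start with a running-state check. The function name and `RESTORE_ROOT` are hypothetical. The sketch extracts over existing contents without clearing them first, and leaves the container stopped if extraction fails so an operator can inspect it.

```shell
# Hypothetical helper: stop the service, restore archived mounts, start it, check state.
restore_mounts() {
  container="$1"
  archive="$2"
  root="${RESTORE_ROOT:-/}"   # assumption: archive paths are relative to /
  [ -r "$archive" ] || { echo "archive not readable: $archive" >&2; return 1; }
  docker stop "$container" >/dev/null || return 1
  tar -xzf "$archive" -C "$root" || return 1
  docker start "$container" >/dev/null || return 1
  # Minimal check only; a real health verification is service-specific.
  [ "$(docker inspect --format '{{.State.Running}}' "$container")" = "true" ]
}
```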

Explicit non-goals

  • No F3 smoke in the PR #270 session.
  • No production mutation.
  • No production restart.
  • No automatic Actions integration for the backup script.
  • No deletion of legacy rollback paths.

References

  • Cutover lane: #142
  • Backup helper PR: #270
  • Recovery plan: PR #250
  • np-meerkat compose gap: #269
Author
Collaborator

Codex Wave 1 M01 closeout — F3 design gate superseded by completed F3 migration

Role: executor
Action: closing as resolved/superseded

This issue was the pre-F3 design gate for stateful smoke + backup-before contract. The current roadmap now treats F3/F3 final bosses as complete verification waves, and Milestone 01 only needs stale F3 reconciliation plus legacy cleanup.

Relevant durable state:

  • state/roadmap/current-platform-roadmap.md: F1/F1.5/F2/F3 smokes are complete verification waves, no longer future phases.
  • state/STATUS_NOW.md: F3 live-service migration is complete; final-boss services were migrated with backup-before evidence.
  • M01 cleanup continues in #387 and ADR-0020; destructive legacy cleanup is still gated separately.

No runtime mutation was performed. Future backup/restore work belongs to Milestone 02 (#45/#238), not this stale F3 gate.

codex closed this issue 2026-05-24 08:24:19 +02:00
Reference
pdurlej/platform#271