ops(dr): execute backup/DR restore test — close Phase 8 cutover-gate (35+ days stale) #238

Closed
opened 2026-05-12 07:59:37 +02:00 by pdurlej · 3 comments
Owner

Why this matters (and how it helps Iskra)

If Postgres on RS2000 (Iskra's Honcho memory backend) dies tomorrow, we don't know if the backups actually restore. Last successful restore test ran 2026-04-07 — that's 35+ days stale per DeepSeek 2026-05-11 review.

This is the single largest pointlessly-open risk to Iskra. Code quality is 7/10, security 9/10, but a stale backup makes both irrelevant if recovery fails when needed.

Per OPEN_LOOPS, this is also the formal cutover-gate blocker for declaring Phase 8 / "production status." We can't say Iskra is "production-ready" until this loop is closed.


Context (from state/L3/OPEN_LOOPS.md, unresolved_active cluster "Backup/DR validation gate")

  • Phase 8 monthly restore test required per original Feb 18 plan (af8855ae)
  • hp-restore-smoke.timer last ran 2026-04-07 per audit
  • 35+ days stale — way past the monthly cadence
  • DeepSeek 2026-05-11 review: flagged as 🔴 HIGH severity in risk assessment

Pan Herbatka's prep package (per STATUS_NOW.md) drafted runbooks/dr-restore-test.md for this. If that runbook landed in main (check via git log), then the gap is just: execute it once + document outcome.


Acceptance criteria

  • Check pre-conditions: confirm runbooks/dr-restore-test.md exists and is current. If not, recover/regenerate it first.
  • Execute the test on a non-production target — RS2000 has Hetzner snapshots; restore one to a sandbox VM or ephemeral host.
  • Verify:
    • Postgres restores cleanly
    • Honcho memory backend reads at least one Iskra session row
    • Vault unseals (operator action — see "operator subtasks" below)
    • A smoke conversation runs end to end through the restored stack
  • Document outcome in state/reports/2026-05-12-dr-restore-execution.md — RTO measured, RPO confirmed (last successful backup timestamp), any anomalies.
  • Re-enable hp-restore-smoke.timer if disabled — and add monitoring so it never drifts 35+ days again. ENOTEMPTY-monitor pattern from iskra-openclaw#? could template this.
  • Phase 8 gate flip: update STATUS_NOW.md noting Phase 8 cutover-gate clear.
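
The drift monitoring requested in the timer criterion above could start from a check like the following. This is a minimal sketch, not the repository's monitor: the 31-day threshold is an assumed reading of "monthly cadence", and the function names are hypothetical. The example dates are the last successful run (2026-04-07) and the day this issue was opened (2026-05-12).

```python
from datetime import date, timedelta

# Assumed threshold: the issue only says "monthly cadence"; 31 days is a
# hypothetical choice, not a value taken from the repository.
MAX_AGE = timedelta(days=31)


def restore_test_age(last_success: date, today: date) -> timedelta:
    """Return how long ago the last successful restore test ran."""
    return today - last_success


def is_stale(last_success: date, today: date, max_age: timedelta = MAX_AGE) -> bool:
    """True when the last successful restore test is older than max_age."""
    return restore_test_age(last_success, today) > max_age


if __name__ == "__main__":
    last = date(2026, 4, 7)    # last successful run per the audit
    filed = date(2026, 5, 12)  # date this issue was opened
    print(restore_test_age(last, filed).days, is_stale(last, filed))  # 35 True
```

A monitor built on this would still need a trustworthy source for the last-success date (for example the timer's own state or the newest execution report), which this sketch leaves open.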

Operator subtasks (you decide before Codex runs)

These need Piotr's hands, not Codex's:

  1. Vault unseal during restore — Codex cannot do this.
  2. Sandbox target choice — Hetzner cloud spin-up vs reuse of staging? Operator decides cost vs cadence.
  3. Acceptance of restore outcome — Codex documents, operator declares pass/fail.

Codex Packet

Scope: orchestration + verification only. Codex does NOT perform the Vault unseal step or production data movement.

Likely sequence:

  1. Pre-flight PR: verify runbooks/dr-restore-test.md is current; emit state/reports/2026-05-12-dr-restore-preflight.md listing what's needed (target host, snapshots available, Vault unseal procedure).
  2. Execute (joint): operator + Codex run the test. Codex captures evidence, operator unseals Vault.
  3. Post-mortem PR: state/reports/2026-05-12-dr-restore-execution.md + monitoring hook for future runs.
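
For the execution report, the RTO and RPO figures named in the acceptance criteria can be derived from three timestamps. The sketch below assumes one plausible definition: measured RTO is the time from restore start to verified service, and the RPO window is the gap between the restored backup and the start of the restore. The class name, field names, and example timestamps are all hypothetical; the runbook may define these terms differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RestoreTiming:
    """Timestamps collected during one restore test (hypothetical shape)."""
    last_backup_at: datetime       # when the restored backup was taken
    restore_started_at: datetime   # when the restore began
    service_verified_at: datetime  # when verification checks passed

    @property
    def measured_rto(self) -> timedelta:
        """Elapsed time from starting the restore to a verified service."""
        return self.service_verified_at - self.restore_started_at

    @property
    def rpo_window(self) -> timedelta:
        """Data written after the backup would be lost: start minus backup."""
        return self.restore_started_at - self.last_backup_at


if __name__ == "__main__":
    # Made-up timestamps for illustration only, not measurements.
    timing = RestoreTiming(
        last_backup_at=datetime(2026, 1, 1, 12, 0),
        restore_started_at=datetime(2026, 1, 1, 13, 30),
        service_verified_at=datetime(2026, 1, 1, 13, 45),
    )
    print(timing.measured_rto, timing.rpo_window)
```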

Non-goals

  • "Implement backup encryption upgrade" — out of scope. Verify what we have works first.
  • Move to different backup tool (Restic, Borg, etc.) — separate decision.
  • Restore to production target — sandbox only.

References

  • state/L3/OPEN_LOOPS.md — "Backup/DR validation gate" (HIGH risk per af8855ae)
  • DeepSeek 2026-05-11 review — flagged as 🔴 HIGH
  • Pan Herbatka prep package (per STATUS_NOW.md) — runbooks/dr-restore-test.md
  • Phase 8 plan (Feb 18, commit af8855ae)

Filed by claude (Prof Kong intermezzo, 2026-05-12 morning). This is Pi/DeepSeek's #1 risk recommendation and probably should be next-week priority after operator + Codex finish their current Phase 3 chain.

Collaborator

Codex W3b restore smoke checkpoint — 2026-05-24 16:28 CEST

Role: executor
Status: W3b green, W3 not closed
Evidence PR: #431

What passed:

  • Existing hp-restore-smoke.service restored Forgejo SQL from /opt/vps-home-platform-infra/backups/20260524-120007-critical into a disposable Postgres container.
  • Exit code 0, duration 6s, no unhealthy containers after run, restore container removed.
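
Evidence of this kind (exit code plus duration) can be captured by a small wrapper around whatever command the runbook specifies. This is a sketch under assumptions: `RunEvidence` and `run_and_capture` are hypothetical names, and the demo runs a harmless Python subprocess rather than the real restore service.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvidence:
    """Exit code and wall-clock duration of one command (hypothetical shape)."""
    argv: tuple
    exit_code: int
    duration_s: float


def run_and_capture(argv: list[str]) -> RunEvidence:
    """Run a command to completion and record its exit code and duration."""
    start = time.monotonic()  # monotonic clock: unaffected by wall-clock jumps
    completed = subprocess.run(argv, check=False)  # do not raise on failure
    return RunEvidence(tuple(argv), completed.returncode, time.monotonic() - start)


if __name__ == "__main__":
    # Stand-in command; the real restore command is not assumed here.
    evidence = run_and_capture([sys.executable, "-c", "raise SystemExit(0)"])
    print(evidence.exit_code, round(evidence.duration_s, 2))
```

Passing `check=False` is deliberate: a failed restore should be recorded as evidence, not raised as an exception that hides the duration.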

What remains for this issue:

  • Honcho Postgres partial restore into a disposable target.
  • Metadata-only validation of Honcho schema/table presence and non-zero counts.
  • Full sandbox DR decision for OpenClaw/Iskra/Vault semantics if operator wants to satisfy the full issue surface.

Next required operator phrase for W3c:

w3-honcho-partial-restore-approved
Collaborator

Codex W3c Honcho partial restore — 2026-05-24 16:50 CEST

Role: executor
Status: green, metadata-only evidence PR opened: #432

Evidence:

  • Operator gate received: w3-honcho-partial-restore-approved.
  • Restored /opt/vps-home-platform-infra/backups/20260524-120007-critical/db/honcho.sql into disposable pgvector/pgvector:pg15 container with Docker network none.
  • Restore duration: 29s; exit code 0.
  • Restored extensions: plpgsql,vector.
  • Metadata-only validation: public tables/indexes/columns = 12 / 60 / 91.
  • Metadata-only counts: documents=26141, message_embeddings=13569, sessions=201, messages=13587.
  • Restored embedding dimensions: documents.embedding=1536:26141, message_embeddings.embedding=1536:13569.
  • Unhealthy containers after cleanup: 0.
  • Disposable W3 restore container count after cleanup: 0.

No raw session/message/memory content, prompts, emails, or embedding values were queried or stored.
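
The metadata-only checks above can be expressed as a pure function over the reported numbers, with no access to row content. This is a sketch, not the validation Codex ran. It assumes the `1536:26141` notation means "dimension:number of rows with that dimension", and it assumes every row should carry an embedding, which may not hold for every schema. The function name and the expected dimension argument are hypothetical.

```python
def validate_metadata(
    row_counts: dict[str, int],
    embedding_dims: dict[str, dict[int, int]],
    expected_dim: int,
) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for table, count in row_counts.items():
        if count <= 0:
            problems.append(f"{table}: expected non-zero row count, got {count}")
    for column, dims in embedding_dims.items():
        table = column.split(".", 1)[0]
        if set(dims) != {expected_dim}:
            problems.append(f"{column}: unexpected dimensions {sorted(dims)}")
        # Assumption: every row has an embedding, so the totals must match.
        if sum(dims.values()) != row_counts.get(table):
            problems.append(f"{column}: embedding count differs from row count")
    return problems


if __name__ == "__main__":
    # Values as reported in the evidence list above.
    counts = {"documents": 26141, "message_embeddings": 13569,
              "sessions": 201, "messages": 13587}
    dims = {"documents.embedding": {1536: 26141},
            "message_embeddings.embedding": {1536: 13569}}
    print(validate_metadata(counts, dims, expected_dim=1536))  # []
```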

Interpretation: W3a/W3b/W3c now give current restore-confidence evidence for Forgejo SQL smoke and Honcho pgvector restore. Remaining optional/full scope is W3d: full sandbox OpenClaw/Iskra/Vault semantic restore with a separate target decision.

Collaborator

W3a/b/c restore-confidence evidence is now current and accepted as the immediate gate per PR #432 comment #9292.

Evidence package:

  • #430: read-only W3 preflight.
  • #431: W3b current restore smoke green: hp-restore-smoke.service exit 0/SUCCESS, duration 6s, disposable restore target cleaned up, zero unhealthy containers.
  • #432: W3c Honcho partial restore green: pgvector backup restored into isolated disposable target, extensions/schema/count/vector-dimension metadata verified, zero unhealthy containers.

The original "35+ days stale" restore-test premise is no longer accurate. Full coordinated sandbox DR remains important, but it is now tracked as #433 and gates irreversible Class A/B/D cleanup plus broad W8 module upgrades, not all non-destructive Milestone 01 work.

Closing this issue as superseded by the W3a/b/c evidence package plus #433 follow-up.

codex closed this issue 2026-05-24 18:18:47 +02:00
Reference
pdurlej/platform#238