ops(dr): execute backup/DR restore test — close Phase 8 cutover-gate (35+ days stale) #238
Reference
pdurlej/platform#238
Why this matters (and how it helps Iskra)
If Postgres on RS2000 (Iskra's Honcho memory backend) dies tomorrow, we don't know if the backups actually restore. Last successful restore test ran 2026-04-07 — that's 35+ days stale per DeepSeek 2026-05-11 review.
This is the single largest pointlessly-open risk to Iskra. Code quality is 7/10, security 9/10, but a stale backup makes both irrelevant if recovery fails when needed.
Per OPEN_LOOPS, this is also the formal cutover-gate blocker for declaring Phase 8 / "production status." We can't say Iskra is "production-ready" until this loop is closed.
Context (from state/L3/OPEN_LOOPS.md, unresolved_active cluster "Backup/DR validation gate")
af8855ae)
hp-restore-smoke.timer last ran 2026-04-07 per audit
Pan Herbatka's prep package (per STATUS_NOW.md) drafted runbooks/dr-restore-test.md for this. If that runbook landed in main (check via git log), then the gap is just: execute it once + document outcome.
Acceptance criteria
runbooks/dr-restore-test.md exists and is current. If not, recover/regenerate it first.
state/reports/2026-05-12-dr-restore-execution.md: RTO measured, RPO confirmed (last successful backup timestamp), any anomalies.
hp-restore-smoke.timer if disabled, and add monitoring so it never drifts 35+ days again. ENOTEMPTY-monitor pattern from iskra-openclaw#? could template this.
Operator subtasks (you decide before Codex runs)
These need Piotr's hands, not Codex's:
Codex Packet
Scope: orchestration + verification only. Codex does NOT perform the Vault unseal step or production data movement.
Likely sequence:
state/reports/2026-05-12-dr-restore-preflight.md listing what's needed (target host, snapshots available, Vault unseal procedure).
state/reports/2026-05-12-dr-restore-execution.md + monitoring hook for future runs.
Non-goals
References
state/L3/OPEN_LOOPS.md: "Backup/DR validation gate" (HIGH risk per af8855ae)
🔴 HIGH
runbooks/dr-restore-test.md
af8855ae)
Filed by claude (Prof Kong intermezzo, 2026-05-12 morning). This is Pi/DeepSeek's #1 risk recommendation and probably should be next-week priority after operator + Codex finish their current Phase 3 chain.
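The "never drifts 35+ days again" criterion could be backed by a small staleness check. The sketch below is only an illustration: the function names and the 7-day default threshold are hypothetical, and how the last-run date of hp-restore-smoke.timer is obtained is left out on purpose.

```python
from datetime import date


def restore_test_age_days(last_run: date, today: date) -> int:
    """Whole days elapsed since the last successful restore test."""
    return (today - last_run).days


def is_stale(last_run: date, today: date, max_age_days: int = 7) -> bool:
    """True when the last restore test is older than the allowed age.

    max_age_days is an assumed threshold, not a value from the runbook.
    """
    return restore_test_age_days(last_run, today) > max_age_days


if __name__ == "__main__":
    # Dates taken from this issue: last run 2026-04-07, filed 2026-05-12.
    last_run = date(2026, 4, 7)
    filed = date(2026, 5, 12)
    print(restore_test_age_days(last_run, filed), is_stale(last_run, filed))
```

With the dates from this issue the age comes out at 35 days, so any threshold below that would have flagged the gap before it was filed.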
codex referenced this issue 2026-05-19 08:54:45 +02:00
Codex W3b restore smoke checkpoint — 2026-05-24 16:28 CEST
Role: executor
Status: W3b green, W3 not closed
Evidence PR: #431
What passed:
hp-restore-smoke.service restored Forgejo SQL from /opt/vps-home-platform-infra/backups/20260524-120007-critical into a disposable Postgres container.
0, duration 6s, no unhealthy containers after run, restore container removed.
What remains for this issue:
Next required operator phrase for W3c:
Codex W3c Honcho partial restore — 2026-05-24 16:50 CEST
Role: executor
Status: green, metadata-only evidence PR opened: #432
Evidence:
w3-honcho-partial-restore-approved.
/opt/vps-home-platform-infra/backups/20260524-120007-critical/db/honcho.sql into disposable pgvector/pgvector:pg15 container with Docker network none.
0.
plpgsql, vector.
12/60/91.
documents=26141, message_embeddings=13569, sessions=201, messages=13587.
documents.embedding=1536:26141, message_embeddings.embedding=1536:13569.
0.
0.
No raw session/message/memory content, prompts, emails, or embedding values were queried or stored.
Interpretation: W3a/W3b/W3c now give current restore-confidence evidence for Forgejo SQL smoke and Honcho pgvector restore. Remaining optional/full scope is W3d: full sandbox OpenClaw/Iskra/Vault semantic restore with a separate target decision.
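The metadata-only figures in the W3c evidence can be cross-checked mechanically without touching any row content. This is a minimal sketch, not the check Codex actually ran: the helper names are hypothetical, and it assumes every row in an embedding-bearing table carries exactly one embedding of the expected dimension, which the reported numbers happen to satisfy.

```python
def parse_counts(line: str) -> dict[str, int]:
    """Parse 'table=count,table=count' into {table: count}."""
    return {k.strip(): int(v) for k, v in (item.split("=") for item in line.split(","))}


def parse_dims(line: str) -> dict[str, tuple[int, int]]:
    """Parse 'table.column=dim:count,...' into {table.column: (dim, count)}."""
    result = {}
    for item in line.split(","):
        column, value = item.split("=")
        dim, count = value.split(":")
        result[column.strip()] = (int(dim), int(count))
    return result


def check_embeddings(counts, dims, expected_dim=1536):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    for column, (dim, n) in dims.items():
        table = column.split(".")[0]
        if dim != expected_dim:
            problems.append(f"{column}: dimension {dim}, expected {expected_dim}")
        # Assumption: one embedding per row, so counts should match.
        if counts.get(table) != n:
            problems.append(f"{column}: {n} embeddings vs {counts.get(table)} rows")
    return problems


if __name__ == "__main__":
    # Values copied from the evidence list in this comment.
    counts = parse_counts("documents=26141,message_embeddings=13569,sessions=201,messages=13587")
    dims = parse_dims("documents.embedding=1536:26141,message_embeddings.embedding=1536:13569")
    print(check_embeddings(counts, dims))
```

Only counts and dimensions are compared, which keeps the check within the same "no raw content" boundary as the evidence PR.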
W3a/b/c restore-confidence evidence is now current and accepted as the immediate gate per PR #432 comment #9292.
Evidence package:
hp-restore-smoke.service exit 0/SUCCESS, duration 6s, disposable restore target cleaned up, zero unhealthy containers.
The original "35+ days stale" restore-test premise is no longer accurate. Full coordinated sandbox DR remains important, but it is now tracked as #433 and gates irreversible Class A/B/D cleanup plus broad W8 module upgrades, not all non-destructive Milestone 01 work.
Closing this issue as superseded by the W3a/b/c evidence package plus #433 follow-up.