dr(w3d): full sandbox DR drill — operator-gated #433

Closed
opened 2026-05-24 18:18:23 +02:00 by codex · 6 comments
Collaborator

Context

W3a/b/c are accepted as the immediate restore-confidence gate for the next non-destructive Milestone 01 work, per Pan Herbatka's verdict on PR #432 comment #9292.

This issue tracks the deeper W3d drill that remains required before irreversible cleanup or broad module upgrade work. W3d is deliberately not a blanket blocker for non-destructive M01 planning or migration PRs.

Purpose

Prove a full sandbox restore choreography before either of these downstream actions:

  • irreversible mutation under /opt/vps-home-platform-infra for Class A/B/D cleanup in state/cutover/rs2000-post-soak-legacy-cleanup.md;
  • broad module upgrade waves under Milestone 09 / W8.

Acceptance criteria

  • Restore Honcho + Forgejo + Postgres + Redis + Traefik together on a disposable host with the correct startup/order contract.
  • At least one restored service accepts a real request from a fake gateway or equivalent disposable ingress path.
  • Document the operator-step path for a fresh machine: Bitwarden export plus Infisical Token Auth bootstrap. Dry-run is enough for the first W3d pass.
  • Record first-pass RTO from backup-on-disk to service-accepts-request.
  • Evidence is metadata-only: no raw private messages, prompts, emails, session names, memory contents, secrets, DSNs, API keys, or embedding values.
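
The metadata-only criterion can be spot-checked mechanically before a report is committed. A minimal sketch, assuming reports are plain-text or Markdown files; the function name `evidence_is_metadata_only` and the pattern list are hypothetical, and the patterns are a heuristic for common DSN and credential shapes, not proof that a report is clean:

```bash
# Hypothetical helper, not part of the repo: fail if a report appears to
# contain DSNs, private-key material, or credential assignments.
evidence_is_metadata_only() {
  local report="$1"
  # Assumed patterns; extend them for the secret shapes the platform really uses.
  local pattern='(postgres(ql)?|redis|mysql)://|BEGIN [A-Z ]*PRIVATE KEY|(password|passwd|secret|token|api[_-]?key)[[:space:]]*[=:][[:space:]]*[^[:space:]]'
  if grep -Eiq -- "$pattern" "$report"; then
    echo "possible secret material in $report" >&2
    return 1
  fi
  return 0
}
```

Running it against each file under `state/reports/` before opening the evidence PR would catch the obvious leaks; anything subtler (session names, memory contents) still needs a human read.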

Decision points

  • Disposable target: local Linux VM, disposable VPS, or other staging host.
  • VPS1000 / Iskra / OpenClaw persona-side restore: include here or split to W3e. Current recommendation: split to W3e because semantic continuity differs from RS2000 schema restore.

Gates

  • w3d-target-approved before provisioning or using the disposable target.
  • w3d-full-sandbox-approved before running the full sandbox drill.
  • m01-destructive-cleanup-approved remains a later operator-only gate before Class A/B/D irreversible cleanup.
  • module-upgrade-dr-confirmed remains required before broad W8 upgrade waves.
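
A drill script could enforce these gates itself rather than rely on the executor remembering them. This issue does not say how approvals are recorded, so the sketch below assumes a plain-text file with one approved gate token per line; both the file format and the `require_gate` name are hypothetical:

```bash
# Hypothetical helper: succeed only if the gate token appears as a whole line
# in the approvals file (assumed format: one token per line).
require_gate() {
  local gate="$1" approvals="$2"
  if [ -f "$approvals" ] && grep -Fxq -- "$gate" "$approvals"; then
    return 0
  fi
  echo "gate not approved: $gate" >&2
  return 1
}
```

A drill entry point would then start with something like `require_gate w3d-target-approved "$APPROVALS_FILE" || exit 1`, where `APPROVALS_FILE` is likewise an assumed variable.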

References

  • #45
  • #238
  • #430
  • #431
  • #432
  • PR #432 comment #9292
  • state/reports/w3-dr-restore-preflight-2026-05-24.md
  • state/reports/w3-restore-smoke-2026-05-24.md
  • state/reports/w3-honcho-partial-restore-2026-05-24.md
  • state/cycle/W3-dr-restore-confidence-output.md
  • state/cutover/rs2000-post-soak-legacy-cleanup.md
  • decisions/0020-post-soak-legacy-cleanup-and-platform-modularization.md
Author
Collaborator

W3d read-only preflight refreshed (2026-05-27)

Role: executor / codex

I refreshed the W3d starting point with read-only RS2000 evidence. No restore, restart, apply, release-root promotion, or production runtime mutation was performed.

Evidence recorded in state/reports/w3d-full-sandbox-preflight-2026-05-27.md:

  • latest critical backup root: /opt/vps-home-platform-infra/backups/20260527-060017-critical;
  • backup timers active: hp-backup-critical.timer, hp-backup-noncritical.timer, hp-restore-smoke.timer;
  • latest read-only service status for backup/smoke units: Result=success, ExecMainStatus=0;
  • current backup material includes Honcho SQL, Forgejo data, Honcho PG data, Forgejo SQL, platform PG data, Infisical SQL, Redis, Vault, and Honcho Redis artifacts;
  • unhealthy container count after read-only check: 0;
  • existing restore-test.sh is still Forgejo-only, so this does not satisfy W3d.
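
The unit-status evidence above can be gathered without mutating anything. A minimal sketch, assuming each timer activates a same-named `.service` unit (the service names are an assumption, only the timer names appear in the report); `unit_succeeded` is a hypothetical helper that reads `systemctl show` output on stdin:

```bash
# Hypothetical helper: read KEY=VALUE lines (as printed by `systemctl show`)
# and succeed only when Result=success and ExecMainStatus=0 are both present.
unit_succeeded() {
  awk -F= '
    $1 == "Result"         { result = $2 }
    $1 == "ExecMainStatus" { status = $2 }
    END { exit !(result == "success" && status == "0") }
  '
}

# Read-only usage on the host (service unit name assumed from the timer name):
#   systemctl show -p Result -p ExecMainStatus hp-backup-critical.service | unit_succeeded
# Unhealthy container count, also read-only:
#   docker ps --filter health=unhealthy -q | wc -l
```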

Recommended first W3d target: local-vm / isolated disposable Linux host. Use disposable-vps only if realistic remote-host/RTO friction is needed in the first pass.

Next gate before execution:

w3d-full-sandbox-dr-approved target=<local-vm|disposable-vps>

Until then, #433 remains open and blocks irreversible Class A/B/D cleanup and broad/stateful/Forgejo upgrade waves.

Author
Collaborator

Role: executor
Intent: lane-outcome
Needs owner: no

W3d local Docker-host pass is green and recorded in PR #520: https://git.pdurlej.com/pdurlej/platform/pulls/520

Metadata-only summary:

  • backup root: /opt/vps-home-platform-infra/backups/20260527-060017-critical
  • restored locally: Forgejo SQL/data, Honcho SQL, Postgres, pgvector Postgres, Redis, Honcho Redis, Traefik fake ingress
  • routed smoke: Traefik ping 200, Forgejo /api/healthz via fake ingress 200, Honcho /openapi.json via fake ingress 200
  • RTO from backup-on-disk locally to routed smoke: 101 seconds
  • total elapsed including read-only RS2000 staging: 210 seconds
  • production mutation: none; final local Docker resources cleaned up
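
For readers reproducing the RTO figure, one way to time "backup-on-disk to routed smoke" is to record a start timestamp when the artifacts are on disk and poll the fake ingress until it answers `200`. This is a sketch of that idea, not the harness from PR #520; the helper names and the ingress address in the usage comment are hypothetical:

```bash
# Print the HTTP status code for a URL; curl prints "000" when the connection fails.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1" || true
}

# Poll a URL until it returns 200, then print seconds elapsed since the start epoch.
# $1 = URL, $2 = start epoch (moment the backup is on disk), $3 = timeout in seconds.
measure_rto() {
  local url="$1" start="$2" timeout="$3"
  while :; do
    if [ "$(http_status "$url")" = "200" ]; then
      echo $(( $(date +%s) - start ))
      return 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      echo "no 200 from $url within ${timeout}s" >&2
      return 1
    fi
    sleep 1
  done
}

# Usage (ingress address is an assumption):
#   start=$(date +%s)
#   ... restore and start the stack ...
#   measure_rto http://127.0.0.1:8080/api/healthz "$start" 600
```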

Known residual: this local pass does not prove remote fresh-machine bootstrap friction. If the operator wants that depth before destructive cleanup, the next step is the same drill on a disposable VPS.

Next: merge #520 if the local W3d pass is accepted as evidence, then decide whether #433 needs a disposable VPS pass or can be narrowed to destructive-cleanup gate acceptance.

Author
Collaborator

Role: executor
Intent: lane-outcome
Needs owner: yes

PR #521 is open and mergeable as pass-2 readiness: https://git.pdurlej.com/pdurlej/platform/pulls/521

It does not run the VPS drill yet. It adds the remote wrapper that can run the already-green W3d restore harness on a clean SSH target without giving that target direct RS2000 access. It also refuses live hosts (rs2000, vps1000, and their current direct IPs).
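
The live-host refusal described above amounts to a denylist check on the target alias before any SSH connection is made. A minimal sketch of that shape, not the code in #521; `refuse_live_host` is a hypothetical name, and the direct-IP entries are left out because the addresses are not given in this thread:

```bash
# Hypothetical guard: reject an empty target and the known live host aliases.
# A real guard would also list the hosts' direct IPs (not shown here).
refuse_live_host() {
  case "$1" in
    "")
      echo "target host is empty" >&2
      return 1
      ;;
    rs2000|vps1000)
      echo "refusing live host: $1" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}
```

Matching on the alias alone is easy to bypass with a second alias for the same machine, which is presumably why the wrapper also checks direct IPs.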

After #521 merges, the pass-2 command shape is:

W3D_TARGET_HOST=<clean-disposable-vps-ssh-alias> \
  scripts/dr/w3d-disposable-vps-drill.sh

The one missing input is the disposable target itself. I found SSH aliases for rs2000, vps1000, and mbp-ubuntu-companion; none is a clean disposable VPS. I will not run this on vps1000, because it is the live OpenClaw/Iskra runtime.

Next: merge #521, then provide/provision one clean disposable SSH target alias if you want pass 2 executed.

Author
Collaborator

Role: executor
Intent: lane-outcome
Needs owner: no

W3d pass 2 completed on vps1000 as an isolated Docker sandbox. This used vps1000 as a remote test host, not as a production restore target.

Metadata-only summary:

  • backup root: /opt/vps-home-platform-infra/backups/20260527-120006-critical
  • restored in isolated Compose project: Forgejo SQL/data, Honcho SQL, Postgres, pgvector Postgres, Redis, Honcho Redis, Traefik fake ingress
  • routed smoke: Traefik ping 200, Forgejo /api/healthz via fake ingress 200, Honcho /openapi.json via fake ingress 200
  • RTO from backup-on-disk on vps1000 to routed smoke: 97 seconds
  • target drill elapsed after artifacts were staged: 130 seconds
  • production mutation: no RS2000 restore/restart/apply/promote; no live OpenClaw/Iskra service mutation
  • cleanup: no w3d* containers, volumes, or staging dirs left on vps1000
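
The cleanup claim can be verified by counting Docker resources whose names contain `w3d`. A sketch under the assumption that every drill container and volume carries that substring in its name (consistent with the `w3d*` wording above, but not confirmed beyond it); `leftover_count` is a hypothetical helper:

```bash
# Hypothetical helper: count non-empty lines on stdin (one resource ID or name per line).
leftover_count() {
  awk 'NF { n++ } END { print n + 0 }'
}

# Usage on the drill host; both should print 0 after cleanup:
#   docker ps -a --filter name=w3d -q | leftover_count
#   docker volume ls --filter name=w3d -q | leftover_count
```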

Policy correction recorded in the PR: no third standing VPS is expected for this platform. Use rs2000, vps1000, local Mac/Docker, or temporary/serverless per-minute compute only if a future task genuinely needs it.

Next: a PR will record the report and status updates. Once it merges, W3d restore mechanics are green; the remaining possible follow-up is a separate W3e persona-side OpenClaw/Iskra continuity drill if the operator wants it.

Author
Collaborator

Post-M01 W3d refresh evidence is now available.

Infrastructure-level W3d result: PASS on vps1000 disposable sandbox, using canonical post-M01 backup root /opt/pdurlej-platform/runtime/host-ops/backups/20260529-060017-critical.

What passed:

  • Forgejo SQL/data restore.
  • Honcho SQL restore into pgvector Postgres.
  • Redis + Honcho Redis volume restore.
  • Traefik fake ingress.
  • Routed HTTP smoke: Traefik 200, Forgejo 200, Honcho 200.
  • Service health: restored app/db/cache stack healthy/running.
  • RTO from backup-on-disk to first accepted routed requests: 86s.
  • Cleanup: 0 leftover W3d containers, 0 leftover W3d volumes, stage removed.

Boundary:

  • No RS2000 production restore/restart/apply/promote.
  • No live OpenClaw/Iskra production mutation.
  • Metadata-only evidence; no secrets/private contents printed.

Open operator decision:

  • Either accept this as satisfying #433's infrastructure full-sandbox gate after M01, or
  • keep #433 open/scope-reduced and split semantic OpenClaw/Iskra continuity into W3e.
Author
Collaborator

Operator decision recorded: semantic Iskra/OpenClaw continuity is split out of #433 into #602 (W3e).

#433 is now closed as the infrastructure full-sandbox DR gate:

  • post-M01 host restore smoke passed;
  • post-M01 vps1000 disposable W3d sandbox passed;
  • canonical host-ops backup root used;
  • no production restore/restart/apply/promote.

Remaining persona-continuity concern is intentionally not hidden under infra DR. Track it in #602.

codex closed this issue 2026-05-29 15:22:10 +02:00
Reference
pdurlej/platform#433