feat(platformctl): execute approved compose apply #165

Merged
pdurlej merged 1 commit from codex/issues/142-apply-compose-execution into codex/issues/142-apply-noop-safety 2026-05-12 01:28:55 +02:00
Collaborator

Canary status: missing — fire canary 3+3 manually before merge

Canary Context Pack

Product story

Once approval and plan provenance are locked, Phase 3 needs the smallest real runtime mutation primitive: re-apply one declared docker-compose service through the platform-host-agent lane.

What changed

  • Added compose command generation from module runtime metadata.
  • platformctl apply can run docker compose -f <compose_file> up -d <service> through TailscaleTransport when Phase 05 gate is enabled.
  • Remote stdout/stderr are redacted before returning.
  • Added tests with fake transport; no live production mutation was performed.

Why it changed

PR #160 mixed this runtime capability with approval, provenance, artifacts, and workflow changes. This PR isolates only the compose execution primitive.

Files touched

  • control-plane/platformctl/apply.py
  • control-plane/platformctl/tests/test_apply_phase3.py

Relevant context

Stacked after approval-binding, plan-provenance, and no-op splits. modules/honcho-redis/module.yaml was read to verify current compose metadata shape.

Runtime evidence

No live runtime mutation. Tests use fake transport and assert the exact rs2000 command target.

Known constraints

Requires existing platform-host-agent SSH transport and PLATFORMCTL_PHASE=05 for real apply. Does not change remote compose files.

Explicit out-of-scope

Durable status artifact and Forgejo Actions wiring are later split PRs.

Requested decision

Approve the minimal compose execution primitive.

Merge blockers

Any root SSH path, arbitrary command path, secret leakage in remote output, or unbound plan apply.

Spec sources read

  • AGENTS.md — security-sensitive lane and production mutation gate.
  • control-plane/platformctl/apply.py — runtime execution code.
  • control-plane/platformctl/transport/tailscale.py — transport contract.
  • control-plane/platformctl/safety.py — command/path safety behavior.
  • modules/honcho-redis/module.yaml — compose metadata example.
  • control-plane/platformctl/tests/test_apply_phase3.py — tests.

Refs #142
Supersedes part of #160

Canary status: missing — fire canary 3+3 manually before merge ## Canary Context Pack ### Product story Once approval and plan provenance are locked, Phase 3 needs the smallest real runtime mutation primitive: re-apply one declared docker-compose service through the platform-host-agent lane. ### What changed - Added compose command generation from module runtime metadata. - `platformctl apply` can run `docker compose -f <compose_file> up -d <service>` through `TailscaleTransport` when Phase 05 gate is enabled. - Remote stdout/stderr are redacted before returning. - Added tests with fake transport; no live production mutation was performed. ### Why it changed PR #160 mixed this runtime capability with approval, provenance, artifacts, and workflow changes. This PR isolates only the compose execution primitive. ### Files touched - `control-plane/platformctl/apply.py` - `control-plane/platformctl/tests/test_apply_phase3.py` ### Relevant context Stacked after approval-binding, plan-provenance, and no-op splits. `modules/honcho-redis/module.yaml` was read to verify current compose metadata shape. ### Runtime evidence No live runtime mutation. Tests use fake transport and assert the exact rs2000 command target. ### Known constraints Requires existing `platform-host-agent` SSH transport and `PLATFORMCTL_PHASE=05` for real apply. Does not change remote compose files. ### Explicit out-of-scope Durable status artifact and Forgejo Actions wiring are later split PRs. ### Requested decision Approve the minimal compose execution primitive. ### Merge blockers Any root SSH path, arbitrary command path, secret leakage in remote output, or unbound plan apply. ## Spec sources read - `AGENTS.md` — security-sensitive lane and production mutation gate. - `control-plane/platformctl/apply.py` — runtime execution code. - `control-plane/platformctl/transport/tailscale.py` — transport contract. - `control-plane/platformctl/safety.py` — command/path safety behavior. - `modules/honcho-redis/module.yaml` — compose metadata example. - `control-plane/platformctl/tests/test_apply_phase3.py` — tests. Refs #142 Supersedes part of #160
feat(platformctl): execute approved compose apply
All checks were successful
canary-required / collect-diff (pull_request) Successful in 3s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 17s
python-ci / Python 3.11 (pull_request) Successful in 45s
python-ci / Python 3.12 (pull_request) Successful in 44s
python-ci / Python 3.13 (pull_request) Successful in 44s
canary-required / canary (pull_request) Successful in 12s
ffda4bd732
Author
Collaborator

Ralph review (5-iter chmurowy) — ITERATE_BLOCKER 4/9 ⚠️ PRODUCTION TOUCH

To jest PR który faktycznie odpala docker compose up -d na RS 2000. Najgorsze correctness w całym Phase 3 batch (2/9). Cztery confirmed-exploitable issues.

Per-dim scoring

  • Correctness: 2/9 ⚠️ (wrong exit code, silent multi-module drop, dead path validation)
  • Security: 4/9 (path check no-op, host unvalidated)
  • Observability: 6/9
  • Test coverage: 5/9
  • Scope discipline: 4/9 (dropped state/rollback z deleted TODO)

Evidence: ~/Iskra-i-Piotr/05 System/Swarmheart Backups/ralph-phase3-apply/165/.


BLOCKER 1 — Fix success exit code

WHAT: apply.py linia 444: return result, EXIT_NO_CHANGES after successful docker compose up -d. But result["exitCode"] jest też EXIT_NO_CHANGES (= 0). To znaczy: successful apply emits same exit as no-op.

WHY: Caller (CI, audit, monitoring) can't distinguish:

  • Plan applied successfully (changes deployed) → should be EXIT_APPLIED
  • Plan was no-op (in-sync, nothing done) → should be EXIT_NO_CHANGES

Wrong exit code = wrong audit signal = false sense of consistency.

HOW (drafted patch w plan.py + apply.py):

# plan.py — add new exit constant (after linia 33)
EXIT_APPLIED = 9  # successful production mutation; distinct from EXIT_NO_CHANGES=0
# apply.py — update import (linia 22-26):
from .plan import (
    EXIT_NO_CHANGES, EXIT_TOOL_ERROR, EXIT_POLICY_VIOLATION,
    EXIT_REMOTE_FAILED, EXIT_APPROVAL_MISMATCH,
    EXIT_DESTRUCTIVE_NEEDS_APPROVAL,
    EXIT_APPLIED,
)
# apply.py — update success path (linie 437-444):
result["status"] = "applied"
result["message"] = "docker compose service apply completed"
result["exitCode"] = EXIT_APPLIED  # ralph #165 BLOCKER 1
result["applied"].append({
    "module": manifest.id,
    "host": manifest.host,
    "compose_service": manifest.data.get("spec", {}).get("runtime", {}).get("compose_service") or manifest.id,
})
return result, EXIT_APPLIED

VERIFY (add to tests/test_apply_phase3.py):

def test_apply_success_returns_exit_applied(tmp_path, monkeypatch):
    """Successful apply must return EXIT_APPLIED, not EXIT_NO_CHANGES (ralph BLOCKER 1)."""
    # ... fixtures: valid plan with one change, mocked transport returning rc=0
    result, exit_code = apply_plan(plan_path, "a"*40, approved_pr=1)
    assert exit_code == EXIT_APPLIED
    assert exit_code != EXIT_NO_CHANGES
    assert result["status"] == "applied"

BLOCKER 2 — Reject multi-module plans

WHAT: apply.py linia 400: manifest = _load_manifest_for_plan(plan, modules_dir=modules_dir). _load_manifest_for_plan (linia 243) extracts plan.get("module") — single string. Loads ONE manifest, builds ONE command, runs ONCE. Plan with changes referencing multiple modules silently applies only first.

WHY: Plan może mieć:

{
  "module": "honcho-api",       // canonical module
  "changes": [
    {"target": "compose.service.honcho-api", ...},
    {"target": "compose.service.honcho-deriver", ...}  // DIFFERENT MODULE
  ]
}

Current code: applies honcho-api compose service, silently ignores honcho-deriver change. Failed update goes unnoticed. Critical security risk if multi-module plan contains both update AND security patch.

HOW (drafted patch w apply.py — przed call do _load_manifest_for_plan):

Insert validation przed linia 399:

# SECURITY (ralph #165 BLOCKER 2): reject multi-module plans until safely iterated.
# Each apply must be bound to single module's compose service. Multi-module
# orchestration is Phase 06+ scope.
plan_module = plan.get("module")
referenced_modules = {plan_module}
for ch in spec_changes:
    target = ch.get("target", "")
    # target patterns like "compose.service.<service-name>" — extract service
    if target.startswith("compose.service."):
        svc = target.removeprefix("compose.service.")
        # We don't know exact module ↔ service mapping here without manifest;
        # use heuristic: if service != plan_module's expected service, flag.
        # (Better: enforce plan generation only emits same-module changes.)
        referenced_modules.add(svc)

if len(referenced_modules) > 1:
    result.update({
        "status": "blocked",
        "error": (
            f"multi-module plan rejected: changes reference {sorted(referenced_modules)} "
            f"but apply supports only one module per plan (ralph BLOCKER 2). "
            "Split plan per-module or wait for Phase 06 multi-module apply."
        ),
        "exitCode": EXIT_POLICY_VIOLATION,
    })
    return result, EXIT_POLICY_VIOLATION

Alternatywa (cleaner): enforce w plan generation — make_plan zawsze emituje changes pod tym samym module. Wtedy apply może trust. Ale jako safety net w apply, hard reject jest defense-in-depth.

VERIFY:

def test_apply_rejects_multi_module_plan(tmp_path, monkeypatch):
    """Multi-module plan must reject (ralph BLOCKER 2)."""
    plan = {
        "module": "honcho-api",
        "source_sha": "a"*40, "source_dirty": False, "exitCode": 1,
        "changes": [
            {"target": "compose.service.honcho-api", "action": "update"},
            {"target": "compose.service.honcho-deriver", "action": "update"},
        ],
    }
    # ... assert EXIT_POLICY_VIOLATION + "multi-module" in error

BLOCKER 3 — Align path validation with executed command

WHAT: apply.py linia 265-269:

remote_compose_file = Path(remote_root) / compose_file
check_target(remote_compose_file)  # validates this Path
return (
    f"cd {shlex.quote(remote_root)} && "
    f"docker compose -f {shlex.quote(compose_file)} up -d ..."  # uses compose_file, NOT remote_compose_file
)

Validation runs przeciwko remote_compose_file, ale command sends compose_file (raw string from manifest). check_target validates path the shell never receives.

WHY: Dead variable check. If compose_file is absolute path (e.g., /etc/sudoers.d/x), validation runs on /opt/vps-home-platform-infra/etc/sudoers.d/x (joined) but command runs docker compose -f /etc/sudoers.d/x (absolute escapes cd). Sacred path enforcement bypassed.

HOW (drafted patch w apply.py linie 251-270):

def compose_apply_command(manifest: Manifest, *, remote_root: str = DEFAULT_REMOTE_ROOT) -> str:
    """Build the remote docker compose command for one module service.

    SECURITY (ralph #165 BLOCKER 3): validate the actual path the shell executes.
    """
    runtime = manifest.data.get("spec", {}).get("runtime", {})
    if runtime.get("orchestrator") != "docker-compose":
        raise ValueError(f"{manifest.id} is not a docker-compose module")

    compose_file = runtime.get("compose_file")
    if not isinstance(compose_file, str) or not compose_file:
        raise ValueError(f"{manifest.id} missing spec.runtime.compose_file")

    # SECURITY: reject absolute compose_file paths — they escape remote_root cd.
    if compose_file.startswith("/") or compose_file.startswith("~"):
        raise ValueError(
            f"{manifest.id} compose_file must be relative to remote_root, "
            f"got absolute: {compose_file!r}"
        )

    compose_service = runtime.get("compose_service") or manifest.id
    if not isinstance(compose_service, str) or not compose_service:
        raise ValueError(f"{manifest.id} missing spec.runtime.compose_service")

    # Validate the FULL resolved path that will be executed
    remote_compose_file = Path(remote_root) / compose_file
    check_target(remote_compose_file)
    # Pass the validated absolute path to docker compose explicitly
    return (
        f"cd {shlex.quote(remote_root)} && "
        f"docker compose -f {shlex.quote(str(remote_compose_file))} "
        f"up -d {shlex.quote(compose_service)}"
    )

Two safety improvements:

  1. Reject absolute compose_file (closes cd /opt && docker compose -f /etc/... escape)
  2. Send validated remote_compose_file (not raw compose_file) do docker compose

VERIFY:

def test_compose_apply_command_rejects_absolute_compose_file():
    """Absolute compose_file path must reject (ralph BLOCKER 3)."""
    manifest = make_manifest(runtime={
        "orchestrator": "docker-compose",
        "compose_file": "/etc/sudoers.d/evil",
        "compose_service": "x",
    })
    with pytest.raises(ValueError, match="absolute"):
        compose_apply_command(manifest)


def test_compose_apply_command_uses_validated_path():
    """Command must reference the validated full path (ralph BLOCKER 3)."""
    manifest = make_manifest(runtime={
        "orchestrator": "docker-compose",
        "compose_file": "compose/core/compose.yaml",
        "compose_service": "honcho-api",
    })
    cmd = compose_apply_command(manifest, remote_root="/opt/vps-home-platform-infra")
    assert "/opt/vps-home-platform-infra/compose/core/compose.yaml" in cmd

BLOCKER 4 — State persistence + rollback stubs

WHAT: Commit message + body claim "state update + rollback" jako część scope. Old TODO (line ~340 in #164 source: update .platform/state/modules/<id>.status.json with new last_apply_sha) został usunięty z PR-a #164/#165, ale implementacja nigdy nie pojawiła się.

WHY: Successful apply does NOT persist last_apply_sha. Next apply has no way to know last good state. Failed apply has no rollback path. This violates PLATFORM_CONSTITUTION promise that state is durable.

HOW (drafted patch w apply.py after successful apply, before return result, EXIT_APPLIED):

# Add to top of apply.py:
import logging
_LOG = logging.getLogger(__name__)


def _write_apply_state(
    module_id: str,
    *,
    applied_sha: str,
    pr_number: int | None,
    state_dir: Path = Path(".platform/state/modules"),
) -> Path | None:
    """Persist last_apply_sha for a module. Return path written, or None on error.

    SECURITY (ralph #165 BLOCKER 4): validate module_id against safe pattern;
    log failure but DON'T crash apply (state is enrichment, not gate).
    """
    if not re.fullmatch(r"^[a-zA-Z0-9_-]+$", module_id):
        _LOG.warning("refusing to write apply state for unsafe module_id=%r", module_id)
        return None
    state_dir.mkdir(parents=True, exist_ok=True)
    target = state_dir / f"{module_id}.status.json"
    payload = {
        "module": module_id,
        "last_apply_sha": applied_sha,
        "last_apply_pr": pr_number,
        "applied_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    tmp = target.with_suffix(f".{os.getpid()}.tmp")
    try:
        tmp.write_text(json.dumps(payload, indent=2))
        tmp.replace(target)
        return target
    except OSError as exc:
        _LOG.warning("failed to write apply state %s: %s", target, exc)
        try:
            tmp.unlink(missing_ok=True)
        except OSError:
            pass
        return None


def _rollback_stub(module_id: str, *, host: str, reason: str) -> None:
    """Rollback stub — emits structured log entry; full rollback is Phase 06.

    SECURITY (ralph #165 BLOCKER 4): Phase 03 cannot safely rollback yet
    (no previous-state record). At minimum, log loud so operator notices.
    """
    _LOG.error(
        "ROLLBACK_STUB module=%s host=%s reason=%r — manual recovery required",
        module_id, host, reason,
    )


# In apply_plan success path, BEFORE return result, EXIT_APPLIED:
state_path = _write_apply_state(
    manifest.id,
    applied_sha=approved_sha,
    pr_number=approval.pr_number,
)
if state_path:
    result["state_artifact"] = str(state_path)

# In apply_plan remote-failed path (linia 432), call rollback stub:
_rollback_stub(manifest.id, host=manifest.host, reason="remote apply returncode != 0")

VERIFY:

def test_apply_writes_state_on_success(tmp_path, monkeypatch):
    """Successful apply must persist last_apply_sha (ralph BLOCKER 4)."""
    # ... fixtures
    state_dir = tmp_path / ".platform/state/modules"
    monkeypatch.setattr("platformctl.apply._write_apply_state",
                        lambda mid, **kw: _write_apply_state(mid, state_dir=state_dir, **kw))
    result, exit_code = apply_plan(plan_path, "a"*40, approved_pr=1)
    assert exit_code == EXIT_APPLIED
    state_file = state_dir / f"{manifest_id}.status.json"
    assert state_file.exists()
    payload = json.loads(state_file.read_text())
    assert payload["last_apply_sha"] == "a"*40


def test_apply_calls_rollback_stub_on_remote_failure(monkeypatch, caplog):
    """Remote failure must invoke rollback stub (ralph BLOCKER 4)."""
    # ... mock transport to return rc=1
    with caplog.at_level(logging.ERROR):
        result, exit_code = apply_plan(plan_path, "a"*40, approved_pr=1)
    assert exit_code == EXIT_REMOTE_FAILED
    assert any("ROLLBACK_STUB" in rec.message for rec in caplog.records)

Follow-up issues

HOST-ALLOWLIST-01 — Validate manifest.host against Tailscale node set

apply.py linia 412: transport.run(host=manifest.host, ...). manifest.host is arbitrary string from module.yaml. Compromised/malicious manifest could redirect apply do attacker-controlled host w Tailnet.

Proposed: maintain allowlist ~/Developer/.../config/permitted-hosts.txt lub w platform repo. Reject hosts not on list with EXIT_POLICY_VIOLATION.

REDACT-COMMAND-01 — Redact command field w result dict

Linia 423: "command": command, w remote_result. Currently OK (compose command has no secrets), ale future commands may include env vars / paths. Defense-in-depth: redact command before result emission, same way as stdout/stderr.

REDACT-EXTEND-01 — Extend _redact_remote_output patterns

Current redaction (linia 273-280) covers password/token/secret/authorization + Bearer. Missing: api[_-]?key, passwd (no e), JSON "secret": patterns, URL query strings ?token=....

Proposed: extend regex + add unit tests dla każdy known sensitive pattern.

SMOKE-POST-APPLY-01 — Post-apply container health smoke test

Current success criterion: docker compose up -d returncode == 0. To znaczy "docker daemon accepted the spec", NIE "container is healthy". Container may immediately crashloop.

Proposed: after apply success, run docker inspect <container> checking State.Health.Status == "healthy" (jeśli healthcheck defined) lub State.Running == true przez ≥30s. EXIT_REMOTE_FAILED jeśli unhealthy.

TRANSPORT-TIMEOUT-01 — Transport command timeout

transport.run (linia 412) currently has no timeout enforced w apply call site. Hanging remote command (network partition, hung daemon) blocks control plane indefinitely.

Proposed: pass timeout_seconds parameter (default 120). Caller in apply uses 300s dla compose apply. Transport raises SSHError on timeout.


Action items

  1. Apply 4 BLOCKER patches → apply.py + plan.py + tests
  2. Force-push do codex/issues/142-apply-compose-execution
  3. File 5 follow-up issues
  4. Comment "ready for re-review" tutaj
  5. Operator może odpalić re-ralph (ten PR ⚠️ wymaga special attention bo production touch)

— ralph batch 2026-05-10, claude-opus-4.7 (Pan Herbata) dispatching via codex identity

## Ralph review (5-iter chmurowy) — ITERATE_BLOCKER 4/9 ⚠️ PRODUCTION TOUCH **To jest PR który faktycznie odpala `docker compose up -d` na RS 2000.** Najgorsze correctness w całym Phase 3 batch (2/9). Cztery confirmed-exploitable issues. ### Per-dim scoring - **Correctness: 2/9** ⚠️ (wrong exit code, silent multi-module drop, dead path validation) - Security: 4/9 (path check no-op, host unvalidated) - Observability: 6/9 - Test coverage: 5/9 - Scope discipline: 4/9 (dropped state/rollback z deleted TODO) Evidence: `~/Iskra-i-Piotr/05 System/Swarmheart Backups/ralph-phase3-apply/165/`. --- ## BLOCKER 1 — Fix success exit code **WHAT:** `apply.py` linia 444: `return result, EXIT_NO_CHANGES` after successful `docker compose up -d`. But `result["exitCode"]` jest też `EXIT_NO_CHANGES` (= 0). To znaczy: **successful apply emits same exit as no-op**. **WHY:** Caller (CI, audit, monitoring) can't distinguish: - Plan applied successfully (changes deployed) → should be `EXIT_APPLIED` - Plan was no-op (in-sync, nothing done) → should be `EXIT_NO_CHANGES` Wrong exit code = wrong audit signal = false sense of consistency. **HOW (drafted patch w `plan.py` + `apply.py`):** ```python # plan.py — add new exit constant (after linia 33) EXIT_APPLIED = 9 # successful production mutation; distinct from EXIT_NO_CHANGES=0 ``` ```python # apply.py — update import (linia 22-26): from .plan import ( EXIT_NO_CHANGES, EXIT_TOOL_ERROR, EXIT_POLICY_VIOLATION, EXIT_REMOTE_FAILED, EXIT_APPROVAL_MISMATCH, EXIT_DESTRUCTIVE_NEEDS_APPROVAL, EXIT_APPLIED, ) ``` ```python # apply.py — update success path (linie 437-444): result["status"] = "applied" result["message"] = "docker compose service apply completed" result["exitCode"] = EXIT_APPLIED # ralph #165 BLOCKER 1 result["applied"].append({ "module": manifest.id, "host": manifest.host, "compose_service": manifest.data.get("spec", {}).get("runtime", {}).get("compose_service") or manifest.id, }) return result, EXIT_APPLIED ``` **VERIFY (add to `tests/test_apply_phase3.py`):** ```python def test_apply_success_returns_exit_applied(tmp_path, monkeypatch): """Successful apply must return EXIT_APPLIED, not EXIT_NO_CHANGES (ralph BLOCKER 1).""" # ... fixtures: valid plan with one change, mocked transport returning rc=0 result, exit_code = apply_plan(plan_path, "a"*40, approved_pr=1) assert exit_code == EXIT_APPLIED assert exit_code != EXIT_NO_CHANGES assert result["status"] == "applied" ``` --- ## BLOCKER 2 — Reject multi-module plans **WHAT:** `apply.py` linia 400: `manifest = _load_manifest_for_plan(plan, modules_dir=modules_dir)`. `_load_manifest_for_plan` (linia 243) extracts `plan.get("module")` — single string. Loads ONE manifest, builds ONE command, runs ONCE. **Plan with `changes` referencing multiple modules silently applies only first.** **WHY:** Plan może mieć: ```json { "module": "honcho-api", // canonical module "changes": [ {"target": "compose.service.honcho-api", ...}, {"target": "compose.service.honcho-deriver", ...} // DIFFERENT MODULE ] } ``` Current code: applies honcho-api compose service, silently ignores honcho-deriver change. Failed update goes unnoticed. **Critical security risk if multi-module plan contains both update AND security patch.** **HOW (drafted patch w `apply.py` — przed call do `_load_manifest_for_plan`):** Insert validation przed linia 399: ```python # SECURITY (ralph #165 BLOCKER 2): reject multi-module plans until safely iterated. # Each apply must be bound to single module's compose service. Multi-module # orchestration is Phase 06+ scope. plan_module = plan.get("module") referenced_modules = {plan_module} for ch in spec_changes: target = ch.get("target", "") # target patterns like "compose.service.<service-name>" — extract service if target.startswith("compose.service."): svc = target.removeprefix("compose.service.") # We don't know exact module ↔ service mapping here without manifest; # use heuristic: if service != plan_module's expected service, flag. # (Better: enforce plan generation only emits same-module changes.) referenced_modules.add(svc) if len(referenced_modules) > 1: result.update({ "status": "blocked", "error": ( f"multi-module plan rejected: changes reference {sorted(referenced_modules)} " f"but apply supports only one module per plan (ralph BLOCKER 2). " "Split plan per-module or wait for Phase 06 multi-module apply." ), "exitCode": EXIT_POLICY_VIOLATION, }) return result, EXIT_POLICY_VIOLATION ``` **Alternatywa (cleaner):** enforce w plan generation — `make_plan` zawsze emituje changes pod tym samym `module`. Wtedy apply może trust. Ale jako safety net w apply, hard reject jest defense-in-depth. **VERIFY:** ```python def test_apply_rejects_multi_module_plan(tmp_path, monkeypatch): """Multi-module plan must reject (ralph BLOCKER 2).""" plan = { "module": "honcho-api", "source_sha": "a"*40, "source_dirty": False, "exitCode": 1, "changes": [ {"target": "compose.service.honcho-api", "action": "update"}, {"target": "compose.service.honcho-deriver", "action": "update"}, ], } # ... assert EXIT_POLICY_VIOLATION + "multi-module" in error ``` --- ## BLOCKER 3 — Align path validation with executed command **WHAT:** `apply.py` linia 265-269: ```python remote_compose_file = Path(remote_root) / compose_file check_target(remote_compose_file) # validates this Path return ( f"cd {shlex.quote(remote_root)} && " f"docker compose -f {shlex.quote(compose_file)} up -d ..." # uses compose_file, NOT remote_compose_file ) ``` Validation runs przeciwko `remote_compose_file`, ale command sends `compose_file` (raw string from manifest). **`check_target` validates path the shell never receives.** **WHY:** Dead variable check. If `compose_file` is absolute path (e.g., `/etc/sudoers.d/x`), validation runs on `/opt/vps-home-platform-infra/etc/sudoers.d/x` (joined) but command runs `docker compose -f /etc/sudoers.d/x` (absolute escapes `cd`). Sacred path enforcement bypassed. **HOW (drafted patch w `apply.py` linie 251-270):** ```python def compose_apply_command(manifest: Manifest, *, remote_root: str = DEFAULT_REMOTE_ROOT) -> str: """Build the remote docker compose command for one module service. SECURITY (ralph #165 BLOCKER 3): validate the actual path the shell executes. """ runtime = manifest.data.get("spec", {}).get("runtime", {}) if runtime.get("orchestrator") != "docker-compose": raise ValueError(f"{manifest.id} is not a docker-compose module") compose_file = runtime.get("compose_file") if not isinstance(compose_file, str) or not compose_file: raise ValueError(f"{manifest.id} missing spec.runtime.compose_file") # SECURITY: reject absolute compose_file paths — they escape remote_root cd. if compose_file.startswith("/") or compose_file.startswith("~"): raise ValueError( f"{manifest.id} compose_file must be relative to remote_root, " f"got absolute: {compose_file!r}" ) compose_service = runtime.get("compose_service") or manifest.id if not isinstance(compose_service, str) or not compose_service: raise ValueError(f"{manifest.id} missing spec.runtime.compose_service") # Validate the FULL resolved path that will be executed remote_compose_file = Path(remote_root) / compose_file check_target(remote_compose_file) # Pass the validated absolute path to docker compose explicitly return ( f"cd {shlex.quote(remote_root)} && " f"docker compose -f {shlex.quote(str(remote_compose_file))} " f"up -d {shlex.quote(compose_service)}" ) ``` Two safety improvements: 1. Reject absolute `compose_file` (closes `cd /opt && docker compose -f /etc/...` escape) 2. Send validated `remote_compose_file` (not raw `compose_file`) do docker compose **VERIFY:** ```python def test_compose_apply_command_rejects_absolute_compose_file(): """Absolute compose_file path must reject (ralph BLOCKER 3).""" manifest = make_manifest(runtime={ "orchestrator": "docker-compose", "compose_file": "/etc/sudoers.d/evil", "compose_service": "x", }) with pytest.raises(ValueError, match="absolute"): compose_apply_command(manifest) def test_compose_apply_command_uses_validated_path(): """Command must reference the validated full path (ralph BLOCKER 3).""" manifest = make_manifest(runtime={ "orchestrator": "docker-compose", "compose_file": "compose/core/compose.yaml", "compose_service": "honcho-api", }) cmd = compose_apply_command(manifest, remote_root="/opt/vps-home-platform-infra") assert "/opt/vps-home-platform-infra/compose/core/compose.yaml" in cmd ``` --- ## BLOCKER 4 — State persistence + rollback stubs **WHAT:** Commit message + body claim "state update + rollback" jako część scope. Old TODO (line ~340 in #164 source: `update .platform/state/modules/<id>.status.json with new last_apply_sha`) został usunięty z PR-a #164/#165, **ale implementacja nigdy nie pojawiła się**. **WHY:** Successful apply does NOT persist `last_apply_sha`. Next apply has no way to know last good state. Failed apply has no rollback path. **This violates PLATFORM_CONSTITUTION promise that state is durable.** **HOW (drafted patch w `apply.py` after successful apply, before `return result, EXIT_APPLIED`):** ```python # Add to top of apply.py: import logging _LOG = logging.getLogger(__name__) def _write_apply_state( module_id: str, *, applied_sha: str, pr_number: int | None, state_dir: Path = Path(".platform/state/modules"), ) -> Path | None: """Persist last_apply_sha for a module. Return path written, or None on error. SECURITY (ralph #165 BLOCKER 4): validate module_id against safe pattern; log failure but DON'T crash apply (state is enrichment, not gate). """ if not re.fullmatch(r"^[a-zA-Z0-9_-]+$", module_id): _LOG.warning("refusing to write apply state for unsafe module_id=%r", module_id) return None state_dir.mkdir(parents=True, exist_ok=True) target = state_dir / f"{module_id}.status.json" payload = { "module": module_id, "last_apply_sha": applied_sha, "last_apply_pr": pr_number, "applied_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"), } tmp = target.with_suffix(f".{os.getpid()}.tmp") try: tmp.write_text(json.dumps(payload, indent=2)) tmp.replace(target) return target except OSError as exc: _LOG.warning("failed to write apply state %s: %s", target, exc) try: tmp.unlink(missing_ok=True) except OSError: pass return None def _rollback_stub(module_id: str, *, host: str, reason: str) -> None: """Rollback stub — emits structured log entry; full rollback is Phase 06. SECURITY (ralph #165 BLOCKER 4): Phase 03 cannot safely rollback yet (no previous-state record). At minimum, log loud so operator notices. """ _LOG.error( "ROLLBACK_STUB module=%s host=%s reason=%r — manual recovery required", module_id, host, reason, ) # In apply_plan success path, BEFORE return result, EXIT_APPLIED: state_path = _write_apply_state( manifest.id, applied_sha=approved_sha, pr_number=approval.pr_number, ) if state_path: result["state_artifact"] = str(state_path) # In apply_plan remote-failed path (linia 432), call rollback stub: _rollback_stub(manifest.id, host=manifest.host, reason="remote apply returncode != 0") ``` **VERIFY:** ```python def test_apply_writes_state_on_success(tmp_path, monkeypatch): """Successful apply must persist last_apply_sha (ralph BLOCKER 4).""" # ... fixtures state_dir = tmp_path / ".platform/state/modules" monkeypatch.setattr("platformctl.apply._write_apply_state", lambda mid, **kw: _write_apply_state(mid, state_dir=state_dir, **kw)) result, exit_code = apply_plan(plan_path, "a"*40, approved_pr=1) assert exit_code == EXIT_APPLIED state_file = state_dir / f"{manifest_id}.status.json" assert state_file.exists() payload = json.loads(state_file.read_text()) assert payload["last_apply_sha"] == "a"*40 def test_apply_calls_rollback_stub_on_remote_failure(monkeypatch, caplog): """Remote failure must invoke rollback stub (ralph BLOCKER 4).""" # ... mock transport to return rc=1 with caplog.at_level(logging.ERROR): result, exit_code = apply_plan(plan_path, "a"*40, approved_pr=1) assert exit_code == EXIT_REMOTE_FAILED assert any("ROLLBACK_STUB" in rec.message for rec in caplog.records) ``` --- ## Follow-up issues ### `HOST-ALLOWLIST-01` — Validate manifest.host against Tailscale node set > `apply.py` linia 412: `transport.run(host=manifest.host, ...)`. `manifest.host` is arbitrary string from `module.yaml`. Compromised/malicious manifest could redirect apply do attacker-controlled host w Tailnet. > > Proposed: maintain allowlist `~/Developer/.../config/permitted-hosts.txt` lub w platform repo. Reject hosts not on list with `EXIT_POLICY_VIOLATION`. ### `REDACT-COMMAND-01` — Redact command field w result dict > Linia 423: `"command": command,` w `remote_result`. Currently OK (compose command has no secrets), ale future commands may include env vars / paths. Defense-in-depth: redact command before result emission, same way as stdout/stderr. ### `REDACT-EXTEND-01` — Extend `_redact_remote_output` patterns > Current redaction (linia 273-280) covers password/token/secret/authorization + Bearer. Missing: `api[_-]?key`, `passwd` (no e), JSON `"secret":` patterns, URL query strings `?token=...`. > > Proposed: extend regex + add unit tests dla każdy known sensitive pattern. ### `SMOKE-POST-APPLY-01` — Post-apply container health smoke test > Current success criterion: `docker compose up -d` returncode == 0. To znaczy "docker daemon accepted the spec", NIE "container is healthy". Container may immediately crashloop. > > Proposed: after apply success, run `docker inspect <container>` checking `State.Health.Status == "healthy"` (jeśli healthcheck defined) lub `State.Running == true` przez ≥30s. EXIT_REMOTE_FAILED jeśli unhealthy. ### `TRANSPORT-TIMEOUT-01` — Transport command timeout > `transport.run` (linia 412) currently has no timeout enforced w apply call site. Hanging remote command (network partition, hung daemon) blocks control plane indefinitely. > > Proposed: pass `timeout_seconds` parameter (default 120). Caller in apply uses 300s dla compose apply. Transport raises `SSHError` on timeout. --- ## Action items 1. Apply 4 BLOCKER patches → `apply.py` + `plan.py` + tests 2. Force-push do `codex/issues/142-apply-compose-execution` 3. File 5 follow-up issues 4. Comment "ready for re-review" tutaj 5. Operator może odpalić re-ralph (ten PR ⚠️ wymaga special attention bo production touch) — ralph batch 2026-05-10, claude-opus-4.7 (Pan Herbata) dispatching via codex identity
codex force-pushed codex/issues/142-apply-compose-execution from ffda4bd732
All checks were successful
canary-required / collect-diff (pull_request) Successful in 3s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 17s
python-ci / Python 3.11 (pull_request) Successful in 45s
python-ci / Python 3.12 (pull_request) Successful in 44s
python-ci / Python 3.13 (pull_request) Successful in 44s
canary-required / canary (pull_request) Successful in 12s
to d404d5af73
All checks were successful
canary-required / collect-diff (pull_request) Successful in 3s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 15s
python-ci / Python 3.11 (pull_request) Successful in 42s
python-ci / Python 3.12 (pull_request) Successful in 45s
python-ci / Python 3.13 (pull_request) Successful in 44s
canary-required / canary (pull_request) Successful in 12s
2026-05-12 00:53:10 +02:00
Compare
Author
Collaborator

Codex iteration — ralph BLOCKERs addressed

Updated branch codex/issues/142-apply-compose-execution at d404d5a, rebased on updated #164.

Applied:

  • BLOCKER 1: successful production mutation now returns distinct EXIT_APPLIED = 9, not EXIT_NO_CHANGES.
  • BLOCKER 2: apply rejects plan changes targeting compose services outside the single module/service being applied.
  • BLOCKER 3: compose_file must be relative; command uses the same validated absolute path checked by sacred-path guard.
  • BLOCKER 4: successful apply writes last-apply state; remote failures emit rollback stub log.

Verification:

  • PYTHONPATH=control-plane pytest control-plane/platformctl/tests/test_apply_phase3.py -q → 29 passed.
  • Full ./tests/run-verify.sh still has the same pre-existing origin/main prompt debt noted on #162; not part of this PR.

Filed follow-up issues:

  • #198 HOST-ALLOWLIST-01
  • #199 REDACT-COMMAND-01
  • #200 REDACT-EXTEND-01
  • #201 SMOKE-POST-APPLY-01
  • #202 TRANSPORT-TIMEOUT-01

Ready for re-review / operator review. This remains production-touching and should stay Full tier / operator merge only.

## Codex iteration — ralph BLOCKERs addressed Updated branch `codex/issues/142-apply-compose-execution` at `d404d5a`, rebased on updated #164. Applied: - BLOCKER 1: successful production mutation now returns distinct `EXIT_APPLIED = 9`, not `EXIT_NO_CHANGES`. - BLOCKER 2: apply rejects plan changes targeting compose services outside the single module/service being applied. - BLOCKER 3: `compose_file` must be relative; command uses the same validated absolute path checked by sacred-path guard. - BLOCKER 4: successful apply writes last-apply state; remote failures emit rollback stub log. Verification: - `PYTHONPATH=control-plane pytest control-plane/platformctl/tests/test_apply_phase3.py -q` → 29 passed. - Full `./tests/run-verify.sh` still has the same pre-existing `origin/main` prompt debt noted on #162; not part of this PR. Filed follow-up issues: - #198 HOST-ALLOWLIST-01 - #199 REDACT-COMMAND-01 - #200 REDACT-EXTEND-01 - #201 SMOKE-POST-APPLY-01 - #202 TRANSPORT-TIMEOUT-01 Ready for re-review / operator review. This remains production-touching and should stay Full tier / operator merge only.
pdurlej merged commit c231af7e8c into codex/issues/142-apply-noop-safety 2026-05-12 01:28:55 +02:00
Sign in to join this conversation.
No reviewers
No labels
W6d-automerge-calibration
agent/claude-code
agent/codex
agent/hermes
agent/iskra
agent/ollama
agent/patchwarden
automerge-candidate
class/security-sensitive
cutover-gate
dependency/blocked
dependency/blocks-others
dependency/cross-repo
dependency/needs-confirmation
domain:agents
domain:ci
domain:docs
domain:forgejo
domain:infra
domain:memory
domain:runtime
domain:signal
domain:ux
flow/architecture
flow/blocked
flow/deployed
flow/done
flow/implementation
flow/intake
flow/maintained
flow/observed
flow/ready
flow/refining
flow/retired
flow/review
iterating
judge/codex-candidate
judge/hermes-candidate
judge/low-confidence
judge/needs-refinement
judge/operator-needed
judge/p0
judge/p1
judge/p2
judge/p3
judge/park
judge/patchwarden-candidate
judge/stale-priority
kind/adr
kind/bug
kind/chore
kind/feature
kind/infra
kind/ops
kind/refactor
kind/research
large-impact
merge/auto
merge/manual
merge/manual-dependency-conflict
merge/manual-failing-tests
merge/manual-merge-conflict
merge/manual-missing-review
merge/manual-operator-preference
merge/manual-red-zone
merge/manual-security-sensitive
merge/manual-unclear-scope
merge/manual-unknown
meta
mode:operator-only
mode:patchwarden-iskra-approved
mode:safe-auto
needs-operator-decision
needs-triage
not-ready
observed/erroring
observed/needs-followup
observed/pending
observed/retire-candidate
observed/unused
observed/used
operator-emotional
owner-attention
phase/02
phase/03
priority:p0
priority:p1
priority:p2
priority:p3
proposed
ready-for-agent
ready-for-operator
recovery
review:claude-reviewed
review:codex-reviewed
review:dziadek-reviewed
review:needs-human
risk/exposure
risk/process
risk/product
risk/runtime
safety:external-write
safety:no-prod-mutation
safety:prod-impact
safety:secret-touch
size/large
size/medium
size/small
size/tiny
size/unknown
source/adr
source/agent-generated
source/manual
source/operator-chat
source/voice-note
status:blocked
status:codex-ready
status:merged:pending-evidence
status:needs-evidence
status:operator-needed
status:parked
tier/full
tier/lite
tier/stacked
tier:0-platform-substrate
tier:1-iskra-value-layer
tier:2-tools-products-modules
type:bug
type:chore
type:docs
type:feat
type:policy
type:research
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
pdurlej/platform!165
No description provided.