feat(platformctl): add health rollup #161

Merged
pdurlej merged 2 commits from codex/issues/142-phase3-health into main 2026-05-12 01:29:43 +02:00
Collaborator

Canary status: iterating — canary rerun requested after Codex fixes for PR #161 blocker findings

Canary Context Pack

Product story

Phase 3 needs platformctl health to become the read-only confidence surface before later phases deploy or cut over RS2000 services. The operator should be able to ask whether a module is healthy without spelunking manually through manifests, smoke scripts, and Docker status.

What changed

  • Added control-plane/platformctl/health.py.
  • Replaced the CLI health skeleton with real rollup behavior.
  • Added platformctl health for all Phase 02 v2-cataloged modules.
  • Added platformctl health --module <id> and retained platformctl health <id>.
  • Health combines strict-v2 manifest validation, tests/smoke.sh --json, and read-only Docker container status through TailscaleTransport.
  • Added focused tests with fake transport and fake smoke outputs.

Why it changed

This is Packet 3.4 from the cutover flight. It unblocks Phase 3's operational control-plane surface, but does not perform production mutation.

Files touched

  • control-plane/platformctl/health.py
  • control-plane/platformctl/cli.py
  • control-plane/platformctl/tests/test_health_phase3.py

Relevant context

  • prompts/codex-cutover-flight/dispatch.md
  • prompts/codex-cutover-flight/phase-3-operational.md Packet 3.4
  • tests/smoke.sh
  • control-plane/platformctl/transport/tailscale.py
  • control-plane/platformctl/manifest.py
  • control-plane/platformctl/plan.py

Runtime evidence

Read-only manual probe was run:

PYTHONPATH=control-plane python3 -m platformctl.cli health --module honcho-redis --json

Result: command executes and returns structured health JSON, but exits 5 because TailscaleTransport cannot currently SSH as platform-host-agent to RS2000:

  • manifest strict-v2: OK
  • smoke.sh: OK, 4 passed / 0 failed / 3 skipped
  • container check via platformctl transport: FAIL, ssh: connect to host 100.110.188.20 port 22: Connection refused

Additional read-only check: ssh rs2000 resolves to operator SSH config as root/public host, while platform-host-agent@rs2000 returns Permission denied (publickey). I did not bypass this with root. This is a runtime access gate, not a code-test failure.

Known constraints

  • Health is read-only.
  • Tests use only fake transport and fake smoke runner for SSH/container status.
  • The command intentionally uses platform-host-agent through TailscaleTransport, not the operator root SSH alias.
  • The runtime SSH access gap means Phase 3 manual smoke is not fully satisfied yet.

Explicit out-of-scope

  • No production mutation.
  • No SSH key provisioning.
  • No Tailscale ACL mutation.
  • No root SSH fallback.
  • No changes to tests/smoke.sh semantics.

Requested decision

Review the code path for merge readiness as Packet 3.4. Separately, treat the platform-host-agent SSH failure as a Phase 3 runtime gate before declaring Phase 3 complete.

Merge blockers

  • Any code path that mutates runtime during health.
  • Any secret or SSH credential leak.
  • Any hidden root fallback.
  • Any reviewer finding that the runtime SSH access gap must be solved in this PR rather than tracked as the Phase 3 manual-smoke gate.

Tests

  • PYTHONPATH=control-plane python3 -m pytest control-plane/platformctl/tests/test_health_phase3.py -q — 10 passed
  • python3 -m py_compile control-plane/platformctl/health.py control-plane/platformctl/cli.py — passed
  • PYTHONPATH=control-plane python3 -m pytest control-plane/platformctl/tests control-plane/platformctl/transport/tests -q — 349 passed
  • PYTHONPATH=control-plane python3 -m pytest tests -q — 318 passed / 15 skipped
  • git diff --check — clean
  • PYTHONPATH=control-plane python3 -m platformctl.cli health --module honcho-redis --json — command works but exits 5 due platform-host-agent SSH runtime gate described above

Spec sources read

  • prompts/codex-cutover-flight/dispatch.md — phase sequencing and stop rules
  • prompts/codex-cutover-flight/phase-3-operational.md — Packet 3.4 acceptance criteria
  • control-plane/platformctl/cli.py — existing health skeleton and CLI wiring
  • control-plane/platformctl/manifest.py — strict-v2 load behavior
  • control-plane/platformctl/plan.py — container naming and exit constants
  • control-plane/platformctl/transport/tailscale.py — read-only SSH transport contract
  • tests/smoke.sh — smoke JSON output contract
  • control-plane/platformctl/tests/test_plan_phase3.py and test_apply_phase3.py — local Phase 3 test patterns

Refs #142

Canary status: iterating — canary rerun requested after Codex fixes for PR #161 blocker findings ## Canary Context Pack ### Product story Phase 3 needs `platformctl health` to become the read-only confidence surface before later phases deploy or cut over RS2000 services. The operator should be able to ask whether a module is healthy without spelunking manually through manifests, smoke scripts, and Docker status. ### What changed - Added `control-plane/platformctl/health.py`. - Replaced the CLI health skeleton with real rollup behavior. - Added `platformctl health` for all Phase 02 v2-cataloged modules. - Added `platformctl health --module <id>` and retained `platformctl health <id>`. - Health combines strict-v2 manifest validation, `tests/smoke.sh --json`, and read-only Docker container status through `TailscaleTransport`. - Added focused tests with fake transport and fake smoke outputs. ### Why it changed This is Packet 3.4 from the cutover flight. It unblocks Phase 3's operational control-plane surface, but does not perform production mutation. ### Files touched - `control-plane/platformctl/health.py` - `control-plane/platformctl/cli.py` - `control-plane/platformctl/tests/test_health_phase3.py` ### Relevant context - `prompts/codex-cutover-flight/dispatch.md` - `prompts/codex-cutover-flight/phase-3-operational.md` Packet 3.4 - `tests/smoke.sh` - `control-plane/platformctl/transport/tailscale.py` - `control-plane/platformctl/manifest.py` - `control-plane/platformctl/plan.py` ### Runtime evidence Read-only manual probe was run: `PYTHONPATH=control-plane python3 -m platformctl.cli health --module honcho-redis --json` Result: command executes and returns structured health JSON, but exits `5` because `TailscaleTransport` cannot currently SSH as `platform-host-agent` to RS2000: - manifest strict-v2: OK - smoke.sh: OK, 4 passed / 0 failed / 3 skipped - container check via platformctl transport: FAIL, `ssh: connect to host 100.110.188.20 port 22: Connection refused` Additional read-only check: `ssh rs2000` resolves to operator SSH config as root/public host, while `platform-host-agent@rs2000` returns `Permission denied (publickey)`. I did not bypass this with root. This is a runtime access gate, not a code-test failure. ### Known constraints - Health is read-only. - Tests use only fake transport and fake smoke runner for SSH/container status. - The command intentionally uses `platform-host-agent` through `TailscaleTransport`, not the operator root SSH alias. - The runtime SSH access gap means Phase 3 manual smoke is not fully satisfied yet. ### Explicit out-of-scope - No production mutation. - No SSH key provisioning. - No Tailscale ACL mutation. - No root SSH fallback. - No changes to `tests/smoke.sh` semantics. ### Requested decision Review the code path for merge readiness as Packet 3.4. Separately, treat the `platform-host-agent` SSH failure as a Phase 3 runtime gate before declaring Phase 3 complete. ### Merge blockers - Any code path that mutates runtime during health. - Any secret or SSH credential leak. - Any hidden root fallback. - Any reviewer finding that the runtime SSH access gap must be solved in this PR rather than tracked as the Phase 3 manual-smoke gate. ## Tests - `PYTHONPATH=control-plane python3 -m pytest control-plane/platformctl/tests/test_health_phase3.py -q` — 10 passed - `python3 -m py_compile control-plane/platformctl/health.py control-plane/platformctl/cli.py` — passed - `PYTHONPATH=control-plane python3 -m pytest control-plane/platformctl/tests control-plane/platformctl/transport/tests -q` — 349 passed - `PYTHONPATH=control-plane python3 -m pytest tests -q` — 318 passed / 15 skipped - `git diff --check` — clean - `PYTHONPATH=control-plane python3 -m platformctl.cli health --module honcho-redis --json` — command works but exits 5 due platform-host-agent SSH runtime gate described above ## Spec sources read - `prompts/codex-cutover-flight/dispatch.md` — phase sequencing and stop rules - `prompts/codex-cutover-flight/phase-3-operational.md` — Packet 3.4 acceptance criteria - `control-plane/platformctl/cli.py` — existing health skeleton and CLI wiring - `control-plane/platformctl/manifest.py` — strict-v2 load behavior - `control-plane/platformctl/plan.py` — container naming and exit constants - `control-plane/platformctl/transport/tailscale.py` — read-only SSH transport contract - `tests/smoke.sh` — smoke JSON output contract - `control-plane/platformctl/tests/test_plan_phase3.py` and `test_apply_phase3.py` — local Phase 3 test patterns Refs #142
feat(platformctl): add health rollup
All checks were successful
canary-required / collect-diff (pull_request) Successful in 4s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 15s
python-ci / Python 3.11 (pull_request) Successful in 41s
python-ci / Python 3.12 (pull_request) Successful in 43s
python-ci / Python 3.13 (pull_request) Successful in 43s
canary-required / canary (pull_request) Successful in 12s
5a1f3107bb
Author
Collaborator

Packet 3.4 external review checkpoint

Ollama Cloud expensive-review gate: 3/3 approve, no blockers.

Reviewers:

  • deepseek-v4-pro:cloud: APPROVE; blockers none.
  • kimi-k2.6:cloud: APPROVE; blockers none.
  • minimax-m2.7:cloud: APPROVE; blockers none.

Shared reviewer conclusion:

  • No production mutation path introduced.
  • Health output does not expose credentials, env vars, SSH args, or secrets beyond expected module/container/smoke status.
  • OK / DEGRADED / FAIL status model with mapped exit codes is coherent enough for Phase 3.
  • The platform-host-agent SSH failure is a runtime access gate before Phase 3 completion, not a code blocker for this PR.
  • Tests are production-safe via FakeTransport, injected smoke runners, and temp files.

Reviewer non-blockers captured:

  • EXIT_NO_CHANGES for healthy status is semantically odd but harmless because it maps to success/0.
  • Last-JSON-line smoke parsing could be brittle if tests/smoke.sh ever emits multiple JSON objects.
  • Any in the transport injection type weakens static type clarity but keeps tests simple.
  • Runtime prerequisite should be documented/resolved: platform-host-agent SSH access to RS2000.

Official platform canary remains missing and should still be fired before merge per repo policy.

## Packet 3.4 external review checkpoint Ollama Cloud expensive-review gate: **3/3 approve, no blockers**. Reviewers: - `deepseek-v4-pro:cloud`: APPROVE; blockers none. - `kimi-k2.6:cloud`: APPROVE; blockers none. - `minimax-m2.7:cloud`: APPROVE; blockers none. Shared reviewer conclusion: - No production mutation path introduced. - Health output does not expose credentials, env vars, SSH args, or secrets beyond expected module/container/smoke status. - OK / DEGRADED / FAIL status model with mapped exit codes is coherent enough for Phase 3. - The `platform-host-agent` SSH failure is a runtime access gate before Phase 3 completion, not a code blocker for this PR. - Tests are production-safe via `FakeTransport`, injected smoke runners, and temp files. Reviewer non-blockers captured: - `EXIT_NO_CHANGES` for healthy status is semantically odd but harmless because it maps to success/0. - Last-JSON-line smoke parsing could be brittle if `tests/smoke.sh` ever emits multiple JSON objects. - `Any` in the transport injection type weakens static type clarity but keeps tests simple. - Runtime prerequisite should be documented/resolved: `platform-host-agent` SSH access to RS2000. Official platform canary remains **missing** and should still be fired before merge per repo policy.
Author
Collaborator

Runtime evidence refresh after B-safe platform-host-agent bootstrap on RS2000:

  • restricted platform-host-agent SSH now works to 100.110.188.20
  • platformctl plan honcho-redis --json -> status: in-sync, exitCode: 0
  • platformctl health --module honcho-redis --json -> status: OK, manifest OK, container running, smoke 4 passed / 0 failed / 3 skipped, exitCode: 0
  • fake SHA apply still refuses before runtime (exitCode: 6)

Residual outside this PR: RS2000 Tailscale tags are still empty, so CI/Codex-tagged apply needs the separate tag/ACL gate before production apply is ready. Official canary is still missing.

Runtime evidence refresh after B-safe `platform-host-agent` bootstrap on RS2000: - restricted `platform-host-agent` SSH now works to `100.110.188.20` - `platformctl plan honcho-redis --json` -> `status: in-sync`, `exitCode: 0` - `platformctl health --module honcho-redis --json` -> `status: OK`, manifest OK, container running, smoke 4 passed / 0 failed / 3 skipped, `exitCode: 0` - fake SHA apply still refuses before runtime (`exitCode: 6`) Residual outside this PR: RS2000 Tailscale tags are still empty, so CI/Codex-tagged apply needs the separate tag/ACL gate before production apply is ready. Official canary is still missing.
Collaborator

3+3 ensemble review by claude — tech + product hats

One of three independent AI reviewers (claude / codex / glm). See state/reviews/PR-161/decision_packet.md for consolidated risk across all 6 outputs.

Tech hat: OK (confidence 0.72)

Risks

  • medium — SSHError path in container_status has zero test coverage
    • Evidence: control-plane/platformctl/health.py:177-180 catches SSHError and returns FAIL with the raw exception string; the runtime evidence in the PR description shows this is the actual production failure path right now (Connection refused, Permission denied). All container tests use a FakeTransport that
    • Recommendation: Add a test where FakeTransport.run raises SSHError, asserting the report has status=FAIL, state='unknown', and the error string is preserved. This is the only path the operator hit for real, so it should not be the untested one.
  • low — JSON output shape differs between health_module and health_all
    • Evidence: health.py:health_module returns {module, generated_at, checks, host, lifecycle, criticality, status, exitCode}; health_all returns a wrapper {command:'health', scope, generated_at, status, summary, results, exitCode}. A consumer that switches on commandorscope will break on per-module output.
    • Recommendation: Wrap single-module reports in the same envelope (e.g. {command:'health', scope:, results:[], summary, status, exitCode}) or document the two shapes explicitly in health.py docstring so consumers can branch deliberately.

Opportunities

  • Surface broken manifests in health all instead of silently dropping them — _is_v2_cataloged (health.py:55-62) catches every Exception from yaml.safe_load and returns False. A module with a syntactically broken module.yaml is silently excluded from health all, so the rollup reports OK while a manifest is corrupt. Consider returning the module with a DEGRADED/FAIL manifest entry so the operator sees the breakage on the same surface.
  • Empty v2 module set returns STATUS_OK — _summary_status (health.py:233-238) returns OK when summary['total']==0. If something accidentally hides v2 cataloging (e.g. modules dir misresolved, all manifests fail _is_v2_cataloged), health all reports OK with zero modules. A 'no v2 modules found' notice or DEGRADED status would prevent a green-but-empty rollup from masking visibility loss.

Product hat: OK (confidence 0.80)

Risks

  • medium — Health command will exit 5 / report FAIL on every module until platform-host-agent SSH gate is fixed
    • Evidence: PR description: platformctl health --module honcho-redis --json exits 5 because TailscaleTransport cannot SSH as platform-host-agent. health.py:165-180 returns STATUS_FAIL on SSHError, which rolls up to FAIL via _rollup_status.
    • Recommendation: Merge is fine — the code is correct and the runtime gap is honestly tracked as a separate Phase 3 gate. But: don't run platformctl health expecting useful output until the SSH gate is closed, and don't interpret the resulting FAIL wall as 'platform is broken.' Consider a follow-up to add a --skip-container flag so the command becomes useful for manifest+smoke validation in the meantime.

Opportunities

  • Surface non-v2-cataloged modules instead of silently excluding them — v2_module_ids() (health.py:60-67) and _is_v2_cataloged() (health.py:46-58) silently drop modules whose manifest fails YAML parse OR lacks v2 audit fields. After merge, platformctl health rollup may quietly omit modules the operator expects to see. A one-line note in the human output ('N modules excluded: not v2-cataloged') would prevent 'where did module X go?' confusion without changing scope.
<!-- platform-review:claude:pdurlej/platform:PR-161:5a1f3107 --> # 3+3 ensemble review by `claude` — tech + product hats > One of three independent AI reviewers (claude / codex / glm). See `state/reviews/PR-161/decision_packet.md` for consolidated risk across all 6 outputs. ## Tech hat: ✅ **OK** (confidence 0.72) ### Risks - **`medium`** — SSHError path in container_status has zero test coverage - Evidence: `control-plane/platformctl/health.py:177-180 catches SSHError and returns FAIL with the raw exception string; the runtime evidence in the PR description shows this is the actual production failure path right now (`Connection refused`, `Permission denied`). All container tests use a FakeTransport that` - Recommendation: Add a test where FakeTransport.run raises SSHError, asserting the report has status=FAIL, state='unknown', and the error string is preserved. This is the only path the operator hit for real, so it should not be the untested one. - **`low`** — JSON output shape differs between health_module and health_all - Evidence: `health.py:health_module returns {module, generated_at, checks, host, lifecycle, criticality, status, exitCode}; health_all returns a wrapper {command:'health', scope, generated_at, status, summary, results, exitCode}. A consumer that switches on `command` or `scope` will break on per-module output.` - Recommendation: Wrap single-module reports in the same envelope (e.g. {command:'health', scope:<id>, results:[<report>], summary, status, exitCode}) or document the two shapes explicitly in health.py docstring so consumers can branch deliberately. ### Opportunities - **Surface broken manifests in `health all` instead of silently dropping them** — _is_v2_cataloged (health.py:55-62) catches every Exception from yaml.safe_load and returns False. A module with a syntactically broken module.yaml is silently excluded from `health all`, so the rollup reports OK while a manifest is corrupt. Consider returning the module with a DEGRADED/FAIL manifest entry so the operator sees the breakage on the same surface. - **Empty v2 module set returns STATUS_OK** — _summary_status (health.py:233-238) returns OK when summary['total']==0. If something accidentally hides v2 cataloging (e.g. modules dir misresolved, all manifests fail _is_v2_cataloged), `health all` reports OK with zero modules. A 'no v2 modules found' notice or DEGRADED status would prevent a green-but-empty rollup from masking visibility loss. ## Product hat: ✅ **OK** (confidence 0.80) ### Risks - **`medium`** — Health command will exit 5 / report FAIL on every module until platform-host-agent SSH gate is fixed - Evidence: `PR description: `platformctl health --module honcho-redis --json` exits 5 because TailscaleTransport cannot SSH as platform-host-agent. health.py:165-180 returns STATUS_FAIL on SSHError, which rolls up to FAIL via _rollup_status.` - Recommendation: Merge is fine — the code is correct and the runtime gap is honestly tracked as a separate Phase 3 gate. But: don't run `platformctl health` expecting useful output until the SSH gate is closed, and don't interpret the resulting FAIL wall as 'platform is broken.' Consider a follow-up to add a `--skip-container` flag so the command becomes useful for manifest+smoke validation in the meantime. ### Opportunities - **Surface non-v2-cataloged modules instead of silently excluding them** — v2_module_ids() (health.py:60-67) and _is_v2_cataloged() (health.py:46-58) silently drop modules whose manifest fails YAML parse OR lacks v2 audit fields. After merge, `platformctl health` rollup may quietly omit modules the operator expects to see. A one-line note in the human output ('N modules excluded: not v2-cataloged') would prevent 'where did module X go?' confusion without changing scope.
Author
Collaborator

3+3 ensemble review by codex — tech + product hats

One of three independent AI reviewers (claude / codex / glm). See state/reviews/PR-161/decision_packet.md for consolidated risk across all 6 outputs.

Tech hat: NOT_OK (confidence 0.93)

Risks

  • blocker — Health smoke path bypasses platform-host-agent identity
    • Evidence: control-plane/platformctl/health.py:104-106 invokes tests/smoke.sh, and tests/smoke.sh:105-108 runs plain ssh "$host". That uses the operator SSH config/user, while TailscaleTransport correctly forces platform-host-agent. The PR's own evidence shows smoke passed while the platform-host-agent contain
    • Recommendation: Do not merge until the smoke leg uses the same platform-host-agent transport/identity as the container leg, or until platformctl health excludes smoke.sh checks that perform SSH through the operator alias. Add a regression test that fails if smoke execution can succeed via plain host SSH while TailscaleTransport access fails.
  • blocker — Container check assumes wrong names for cataloged modules
    • Evidence: control-plane/platformctl/health.py:165 calls plan.container_name(); control-plane/platformctl/plan.py:72-75 defaults to home-platform-<compose_service>-1. But v2-cataloged modules include non-default containers, e.g. modules/agaria-nginx/module.yaml:30-32 has compose_service agaria-nginx while modu
    • Recommendation: Resolve container identity from an explicit manifest field or the same runbook/runtime source smoke.sh already uses before rolling up container status. Add a test fixture for a v2 module with compose_service agaria-nginx and container agaria-nginx so this cannot regress.

Product hat: NOT_OK (confidence 0.86)

Risks

  • high — Default health command can become an attention trap
    • Evidence: control-plane/platformctl/health.py:205
    • Recommendation: Do not merge the default all-module rollup until it has a bounded operator-facing mode, such as a timeout/concurrency limit or a narrower default with explicit all-scope opt-in.
  • medium — PR exceeds the operator's review-size norm
    • Evidence: Diff stats: +624/-19 across 3 files
    • Recommendation: Split implementation and tests/CLI wiring into smaller reviewable PRs, or get explicit operator acceptance that this Phase 3 packet is intentionally above the <=300 LOC norm.

Opportunities

  • Make the runtime gate impossible to miss — The PR description says platform-host-agent SSH failure is a Phase 3 gate, but the diff does not create a visible tracking artifact. Link or create the follow-up issue before merge so the operator does not have to remember it.
<!-- platform-review:codex:pdurlej/platform:PR-161:5a1f3107 --> # 3+3 ensemble review by `codex` — tech + product hats > One of three independent AI reviewers (claude / codex / glm). See `state/reviews/PR-161/decision_packet.md` for consolidated risk across all 6 outputs. ## Tech hat: ❌ **NOT_OK** (confidence 0.93) ### Risks - **`blocker`** — Health smoke path bypasses platform-host-agent identity - Evidence: `control-plane/platformctl/health.py:104-106 invokes tests/smoke.sh, and tests/smoke.sh:105-108 runs plain ssh "$host". That uses the operator SSH config/user, while TailscaleTransport correctly forces platform-host-agent. The PR's own evidence shows smoke passed while the platform-host-agent contain` - Recommendation: Do not merge until the smoke leg uses the same platform-host-agent transport/identity as the container leg, or until platformctl health excludes smoke.sh checks that perform SSH through the operator alias. Add a regression test that fails if smoke execution can succeed via plain host SSH while TailscaleTransport access fails. - **`blocker`** — Container check assumes wrong names for cataloged modules - Evidence: `control-plane/platformctl/health.py:165 calls plan.container_name(); control-plane/platformctl/plan.py:72-75 defaults to home-platform-<compose_service>-1. But v2-cataloged modules include non-default containers, e.g. modules/agaria-nginx/module.yaml:30-32 has compose_service agaria-nginx while modu` - Recommendation: Resolve container identity from an explicit manifest field or the same runbook/runtime source smoke.sh already uses before rolling up container status. Add a test fixture for a v2 module with compose_service agaria-nginx and container agaria-nginx so this cannot regress. ## Product hat: ❌ **NOT_OK** (confidence 0.86) ### Risks - **`high`** — Default health command can become an attention trap - Evidence: `control-plane/platformctl/health.py:205` - Recommendation: Do not merge the default all-module rollup until it has a bounded operator-facing mode, such as a timeout/concurrency limit or a narrower default with explicit all-scope opt-in. - **`medium`** — PR exceeds the operator's review-size norm - Evidence: `Diff stats: +624/-19 across 3 files` - Recommendation: Split implementation and tests/CLI wiring into smaller reviewable PRs, or get explicit operator acceptance that this Phase 3 packet is intentionally above the <=300 LOC norm. ### Opportunities - **Make the runtime gate impossible to miss** — The PR description says platform-host-agent SSH failure is a Phase 3 gate, but the diff does not create a visible tracking artifact. Link or create the follow-up issue before merge so the operator does not have to remember it.
First-time contributor

3+3 ensemble review by glm — tech + product hats

One of three independent AI reviewers (claude / codex / glm). See state/reviews/PR-161/decision_packet.md for consolidated risk across all 6 outputs.

Tech hat: OK (confidence 0.95)

Risks

  • low — SSH transport timeout not explicitly passed to run()
    • Evidence: control-plane/platformctl/health.py:145
    • Recommendation: Consider passing timeout to transport.run() explicitly if TailscaleTransport.run() doesn't honor the constructor timeout. Current behavior likely safe but unclear from contract.
  • low — Exit code semantics may confuse operators
    • Evidence: control-plane/platformctl/health.py:236-241
    • Recommendation: Document that EXIT_NO_CHANGES (0) means healthy. This is platform contract-consistent but non-obvious for health commands.

Opportunities

  • Dependency injection design is well-executed — health_module() and health_all() accept transport and smoke_runner overrides, enabling comprehensive unit testing without mocking subprocess.run or SSH directly. test_health_phase3.py demonstrates this cleanly with FakeTransport and lambda smoke runners.
  • Robust JSON parsing handles smoke.sh output variations — _parse_smoke_json() iterates reversed lines and catches JSONDecodeError, making it resilient to non-JSON stdout noise that shell scripts sometimes emit before final JSON payload.
  • Security-conscious shell quoting — container_status() uses shlex.quote() around container name before embedding in docker inspect command, preventing shell injection if manifest.id contains special characters.

Product hat: OK (confidence 0.95)

Risks

  • medium — Runtime SSH access gap blocked by design
    • Evidence: control-plane/platformctl/health.py:212-218
    • Recommendation: Document in PLATFORM_CHARTER.md that Phase 3 is merge-ready code-first; runtime SSH key provision for platform-host-agent is tracked as a separate Phase 3 manual gate, not a blocker for this PR.
  • low — Exit code inconsistency between failures
    • Evidence: control-plane/platformctl/health.py:183 (EXIT_TOOL_ERROR) vs line 249 (EXIT_REMOTE_FAILED)
    • Recommendation: Accept as-is for Phase 3. Normalize exit semantics in Phase 4 when health surfaces are matured. The current distinction (manifest parse error vs runtime failure) is actually useful for debugging.

Opportunities

  • Add quick-fix hint for degraded smoke — When smoke returns DEGRADED (script missing), the output shows 'DEGRADED: smoke script not found'. Could add a parenthetical hint: '(run ./tests/smoke.sh honcho-redis)' to reduce cognitive load for the operator.
<!-- platform-review:glm:pdurlej/platform:PR-161:5a1f3107 --> # 3+3 ensemble review by `glm` — tech + product hats > One of three independent AI reviewers (claude / codex / glm). See `state/reviews/PR-161/decision_packet.md` for consolidated risk across all 6 outputs. ## Tech hat: ✅ **OK** (confidence 0.95) ### Risks - **`low`** — SSH transport timeout not explicitly passed to run() - Evidence: `control-plane/platformctl/health.py:145` - Recommendation: Consider passing timeout to transport.run() explicitly if TailscaleTransport.run() doesn't honor the constructor timeout. Current behavior likely safe but unclear from contract. - **`low`** — Exit code semantics may confuse operators - Evidence: `control-plane/platformctl/health.py:236-241` - Recommendation: Document that EXIT_NO_CHANGES (0) means healthy. This is platform contract-consistent but non-obvious for health commands. ### Opportunities - **Dependency injection design is well-executed** — health_module() and health_all() accept transport and smoke_runner overrides, enabling comprehensive unit testing without mocking subprocess.run or SSH directly. test_health_phase3.py demonstrates this cleanly with FakeTransport and lambda smoke runners. - **Robust JSON parsing handles smoke.sh output variations** — _parse_smoke_json() iterates reversed lines and catches JSONDecodeError, making it resilient to non-JSON stdout noise that shell scripts sometimes emit before final JSON payload. - **Security-conscious shell quoting** — container_status() uses shlex.quote() around container name before embedding in docker inspect command, preventing shell injection if manifest.id contains special characters. ## Product hat: ✅ **OK** (confidence 0.95) ### Risks - **`medium`** — Runtime SSH access gap blocked by design - Evidence: `control-plane/platformctl/health.py:212-218` - Recommendation: Document in PLATFORM_CHARTER.md that Phase 3 is merge-ready code-first; runtime SSH key provision for platform-host-agent is tracked as a separate Phase 3 manual gate, not a blocker for this PR. - **`low`** — Exit code inconsistency between failures - Evidence: `control-plane/platformctl/health.py:183 (EXIT_TOOL_ERROR) vs line 249 (EXIT_REMOTE_FAILED)` - Recommendation: Accept as-is for Phase 3. Normalize exit semantics in Phase 4 when health surfaces are matured. The current distinction (manifest parse error vs runtime failure) is actually useful for debugging. ### Opportunities - **Add quick-fix hint for degraded smoke** — When smoke returns DEGRADED (script missing), the output shows 'DEGRADED: smoke script not found'. Could add a parenthetical hint: '(run ./tests/smoke.sh honcho-redis)' to reduce cognitive load for the operator.
Collaborator

Review decision

Status: BLOCKER — recommended action: defer

Single-reviewer blockers

  • Health smoke path bypasses platform-host-agent identity
    • Seen by: tech-gpt
    • Decision: Do not merge until the smoke leg uses the same platform-host-agent transport/identity as the container leg, or until platformctl health excludes smoke.sh checks that perform SSH through the operator alias. Add a regression test that fails if smoke execution can succeed via plain host SSH while Tails
  • Container check assumes wrong names for cataloged modules
    • Seen by: tech-gpt
    • Decision: Resolve container identity from an explicit manifest field or the same runbook/runtime source smoke.sh already uses before rolling up container status. Add a test fixture for a v2 module with compose_service agaria-nginx and container agaria-nginx so this cannot regress.

Single-reviewer high-risk findings

  • Default health command can become an attention trap
    • Seen by: product-gpt
    • Decision: Do not merge the default all-module rollup until it has a bounded operator-facing mode, such as a timeout/concurrency limit or a narrower default with explicit all-scope opt-in.

Reviewer dissents

  • product-gpt voted NOT_OK (confidence 0.86)
  • tech-gpt voted NOT_OK (confidence 0.93)

Operator decisions (yes/no)

  1. Risk 'Health smoke path bypasses platform-host-agent identity' raised by 1/6: do you accept this risk, or should this PR be deferred until it's mitigated?
  2. Risk 'Container check assumes wrong names for cataloged modules' raised by 1/6: do you accept this risk, or should this PR be deferred until it's mitigated?
  3. Risk 'Default health command can become an attention trap' raised by 1/6: do you accept this risk, or should this PR be deferred until it's mitigated?

Per-actor evidence: see comments by claude, codex, glm above. Tech: 2/3 OK · Product: 2/3 OK.

<!-- platform-review-decision:pdurlej/platform:PR-161:5a1f3107 --> # Review decision **Status:** BLOCKER — recommended action: `defer` ### Single-reviewer blockers - **Health smoke path bypasses platform-host-agent identity** - Seen by: tech-gpt - Decision: Do not merge until the smoke leg uses the same platform-host-agent transport/identity as the container leg, or until platformctl health excludes smoke.sh checks that perform SSH through the operator alias. Add a regression test that fails if smoke execution can succeed via plain host SSH while Tails - **Container check assumes wrong names for cataloged modules** - Seen by: tech-gpt - Decision: Resolve container identity from an explicit manifest field or the same runbook/runtime source smoke.sh already uses before rolling up container status. Add a test fixture for a v2 module with compose_service agaria-nginx and container agaria-nginx so this cannot regress. ### Single-reviewer high-risk findings - **Default health command can become an attention trap** - Seen by: product-gpt - Decision: Do not merge the default all-module rollup until it has a bounded operator-facing mode, such as a timeout/concurrency limit or a narrower default with explicit all-scope opt-in. ### Reviewer dissents - `product-gpt` voted **NOT_OK** (confidence 0.86) - `tech-gpt` voted **NOT_OK** (confidence 0.93) ### Operator decisions (yes/no) 1. Risk 'Health smoke path bypasses platform-host-agent identity' raised by 1/6: do you accept this risk, or should this PR be deferred until it's mitigated? 2. Risk 'Container check assumes wrong names for cataloged modules' raised by 1/6: do you accept this risk, or should this PR be deferred until it's mitigated? 3. Risk 'Default health command can become an attention trap' raised by 1/6: do you accept this risk, or should this PR be deferred until it's mitigated? --- _Per-actor evidence: see comments by `claude`, `codex`, `glm` above. Tech: 2/3 OK · Product: 2/3 OK._
codex force-pushed codex/issues/142-phase3-health from 5a1f3107bb
All checks were successful
canary-required / collect-diff (pull_request) Successful in 4s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 15s
python-ci / Python 3.11 (pull_request) Successful in 41s
python-ci / Python 3.12 (pull_request) Successful in 43s
python-ci / Python 3.13 (pull_request) Successful in 43s
canary-required / canary (pull_request) Successful in 12s
to 4261563b06
All checks were successful
canary-required / collect-diff (pull_request) Successful in 3s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 17s
python-ci / Python 3.11 (pull_request) Successful in 42s
python-ci / Python 3.12 (pull_request) Successful in 44s
python-ci / Python 3.13 (pull_request) Successful in 43s
canary-required / canary (pull_request) Successful in 11s
2026-05-10 08:47:14 +02:00
Compare
codex force-pushed codex/issues/142-phase3-health from 4261563b06
All checks were successful
canary-required / collect-diff (pull_request) Successful in 3s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 17s
python-ci / Python 3.11 (pull_request) Successful in 42s
python-ci / Python 3.12 (pull_request) Successful in 44s
python-ci / Python 3.13 (pull_request) Successful in 43s
canary-required / canary (pull_request) Successful in 11s
to 6e45755ca0
All checks were successful
canary-required / collect-diff (pull_request) Successful in 3s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 15s
python-ci / Python 3.11 (pull_request) Successful in 42s
python-ci / Python 3.12 (pull_request) Successful in 44s
python-ci / Python 3.13 (pull_request) Successful in 44s
canary-required / canary (pull_request) Successful in 11s
2026-05-10 08:57:00 +02:00
Compare
codex force-pushed codex/issues/142-phase3-health from 6e45755ca0
All checks were successful
canary-required / collect-diff (pull_request) Successful in 3s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 15s
python-ci / Python 3.11 (pull_request) Successful in 42s
python-ci / Python 3.12 (pull_request) Successful in 44s
python-ci / Python 3.13 (pull_request) Successful in 44s
canary-required / canary (pull_request) Successful in 11s
to 8a1e193c0e
All checks were successful
canary-required / collect-diff (pull_request) Successful in 3s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 16s
python-ci / Python 3.11 (pull_request) Successful in 42s
python-ci / Python 3.12 (pull_request) Successful in 44s
python-ci / Python 3.13 (pull_request) Successful in 44s
canary-required / canary (pull_request) Successful in 13s
2026-05-10 09:06:29 +02:00
Compare
Author
Collaborator

Codex status: #161 was iterated and rebased on the latest #160, but I am not rerunning/advancing its canary while #160 remains BLOCKER/defer.

Latest local verification on 8a1e193:

  • control-plane/transport tests: 359 passed
  • root tests: 318 passed / 15 skipped
  • runtime smoke: platformctl health --module honcho-redis --json -> status: OK, container running via platform-host-agent, smoke 2 passed / 0 failed / 5 skipped

Blocked by base PR #160 terminal recommendation: split/rewrite before merge.

Codex status: #161 was iterated and rebased on the latest #160, but I am not rerunning/advancing its canary while #160 remains `BLOCKER/defer`. Latest local verification on `8a1e193`: - control-plane/transport tests: `359 passed` - root tests: `318 passed / 15 skipped` - runtime smoke: `platformctl health --module honcho-redis --json` -> `status: OK`, container running via `platform-host-agent`, smoke 2 passed / 0 failed / 5 skipped Blocked by base PR #160 terminal recommendation: split/rewrite before merge.
Author
Collaborator

Ralph review (5-iter chmurowy) — ITERATE_BLOCKER 4/9

Niezależny 5-iter ralph chain. Verdict + drafted patches poniżej.

Per-dim scoring

  • Correctness: 5/9
  • Security: 2/9 ⚠️ (path traversal + crash-on-bad-module)
  • Observability: 8/9 (rich rollup)
  • Test coverage: 7/9
  • Scope discipline: 9/9 (excellent — confined to health command)

Evidence: ~/Iskra-i-Piotr/05 System/Swarmheart Backups/ralph-phase3-apply/161/.


BLOCKER 1 — Strict module_id validation (path traversal)

WHAT: health_module linia 248: manifest_path = md / module_id / "module.yaml". No validation. CLI accepts arbitrary string for module_id.

WHY: Path traversal:

platformctl health "../../etc/sudoers.d/iskra-watch"
# md / "../../etc/sudoers.d/iskra-watch" / "module.yaml"
# Resolves outside modules_dir to filesystem location attacker chose
# Error message may leak file existence + path structure

Plus combined z line 258 load_manifest(manifest_path, strict_v2=True) — czyta YAML z arbitrary location, parse errors leak path.

HOW (drafted patch w health.py + cli.py):

# health.py — add near top (after imports)
_MODULE_ID_RE = re.compile(r"^[a-z0-9][a-z0-9-]*$")


def _validate_module_id(module_id: str) -> str | None:
    """Return error message if module_id is unsafe, None if OK.
    
    SECURITY (ralph #161 BLOCKER 1): prevent path traversal via crafted module_id.
    """
    if not isinstance(module_id, str) or not module_id:
        return "module_id must be a non-empty string"
    if not _MODULE_ID_RE.fullmatch(module_id):
        return (
            f"module_id {module_id!r} contains unsafe characters; "
            f"must match {_MODULE_ID_RE.pattern}"
        )
    return None


# health_module entry — first check (linia 247 area):
def health_module(
    module_id: str,
    *,
    modules_dir: Path | None = None,
    transport: TailscaleTransport | Any | None = None,
    smoke_runner: SmokeRunner | None = None,
) -> tuple[dict[str, Any], int]:
    """Return a health report for one module."""
    # SECURITY (ralph #161 BLOCKER 1): validate before Path construction.
    err = _validate_module_id(module_id)
    if err is not None:
        return {
            "module": module_id,
            "generated_at": _now(),
            "status": STATUS_FAIL,
            "error": err,
            "exitCode": EXIT_TOOL_ERROR,
        }, EXIT_TOOL_ERROR

    md = modules_dir or find_modules_dir()
    manifest_path = md / module_id / "module.yaml"
    # ...rest unchanged

CLI also needs same check (cli.py health command):

# cli.py - in the health command handler, before calling health_module:
@click.command("health")
@click.argument("module_id", required=False)
@click.option("--all", "all_modules", is_flag=True, help="Roll up all v2-cataloged modules")
def health_cmd(module_id: str | None, all_modules: bool):
    if all_modules:
        result, exit_code = health_all()
    else:
        if not module_id:
            click.echo("Error: provide module_id or --all", err=True)
        sys.exit(EXIT_TOOL_ERROR)
    err = _validate_module_id(module_id)  # ralph #161 BLOCKER 1
    if err:
        click.echo(f"Error: {err}", err=True)
        sys.exit(EXIT_TOOL_ERROR)
    result, exit_code = health_module(module_id)
    # ... output
    sys.exit(exit_code)

VERIFY (add to tests/test_health_phase3.py):

@pytest.mark.parametrize("bad_id", [
    "../../etc/passwd",
    "../escape",
    "foo/bar",
    "with space",
    "with.dot",
    "",
    "UPPER_case",  # regex rejects uppercase by design
    "starting-dash",  # must start z [a-z0-9]
])
def test_health_module_rejects_unsafe_module_id(bad_id):
    """module_id with path-unsafe chars must reject (ralph BLOCKER 1)."""
    result, exit_code = health_module(bad_id)
    assert exit_code == EXIT_TOOL_ERROR
    assert "unsafe characters" in result.get("error", "") or "non-empty" in result.get("error", "")

BLOCKER 2 — Exception isolation in health_all

WHAT: health_all linia 298-306 — list comprehension:

results = [
    health_module(module_id, ...)[0]
    for module_id in v2_module_ids(md)
]

health_module catches Exception w manifest load (linia 259), ale w container_status (linia 186-236) i runner() (linia 282-283), unexpected errors propagate. Same w _smoke_env, _rollup_status, _exit_code_for_status. Single bad module → entire health_all raises.

WHY: Operator runs platformctl health --all żeby zobaczyć fleet state. Single corrupted manifest / broken container / transport bug → no report at all. Catastrophic operational blind-spot exactly when needed most.

HOW (drafted patch w health.py linie 290-306):

def health_all(
    *,
    modules_dir: Path | None = None,
    transport: TailscaleTransport | Any | None = None,
    smoke_runner: SmokeRunner | None = None,
) -> tuple[dict[str, Any], int]:
    """Return a health rollup for all Phase 02 v2-cataloged modules.
    
    SECURITY (ralph #161 BLOCKER 2): isolate each module's check from
    siblings. Single broken module must NOT discard entire fleet report.
    """
    md = modules_dir or find_modules_dir()
    results: list[dict[str, Any]] = []
    for module_id in v2_module_ids(md):
        try:
            module_result, _exit = health_module(
                module_id,
                modules_dir=md,
                transport=transport,
                smoke_runner=smoke_runner,
            )
        except Exception as exc:  # noqa: BLE001 — explicit fleet-isolation
            module_result = {
                "module": module_id,
                "generated_at": _now(),
                "status": STATUS_FAIL,
                "error": f"unhandled error in health_module: {type(exc).__name__}: {exc}",
                "exitCode": EXIT_TOOL_ERROR,
            }
        results.append(module_result)
    summary = _summary(results)
    status = _summary_status(summary)
    report = {
        "command": "health",
        "scope": "all",
        "generated_at": _now(),
        "status": status,
        "summary": summary,
        "results": results,
        "exitCode": _exit_code_for_status(status),
    }
    return report, int(report["exitCode"])

VERIFY:

def test_health_all_continues_when_one_module_crashes(monkeypatch):
    """One bad module must NOT discard fleet report (ralph BLOCKER 2)."""
    # Mock v2_module_ids to return ["good-a", "bad-b", "good-c"]
    monkeypatch.setattr("platformctl.health.v2_module_ids", 
                       lambda md: ["good-a", "bad-b", "good-c"])
    
    def mock_health_module(mid, **kw):
        if mid == "bad-b":
            raise RuntimeError("transport.run hung on missing key")
        return ({"module": mid, "status": STATUS_OK, "exitCode": 0}, 0)
    
    monkeypatch.setattr("platformctl.health.health_module", mock_health_module)
    
    report, exit_code = health_all()
    
    assert len(report["results"]) == 3  # all three present
    statuses = [r["status"] for r in report["results"]]
    assert statuses.count(STATUS_OK) == 2
    assert statuses.count(STATUS_FAIL) == 1
    bad = next(r for r in report["results"] if r["module"] == "bad-b")
    assert "unhandled error" in bad["error"]
    assert "RuntimeError" in bad["error"]

BLOCKER 3 — Externalize hardcoded infrastructure constants

WHAT: health.py linia 32-36:

DEFAULT_SMOKE_SSH_USER = "platform-host-agent"
HOST_ENV = {
    "rs2000": ("PLATFORMCTL_RS2000_SSH_HOST", "100.110.188.20"),
    "vps1000": ("PLATFORMCTL_VPS1000_SSH_HOST", "100.79.239.52"),
}

Tailscale IPs hardcoded as fallback. SSH user hardcoded.

WHY:

  • Info disclosure: internal Tailscale IPs w source code (jeśli repo ever public)
  • Deployment fragility: moving servers / changing IPs requires code edit
  • Trust-based fallback: env var unset → silent use of hardcoded values, no warning operator

HOW (drafted patch w health.py linie 32-36):

# health.py — replace hardcoded fallbacks (linie 32-36):

# SECURITY (ralph #161 BLOCKER 3): infrastructure constants must come from env.
# No hardcoded IPs or usernames in source. Operator/CI configures explicitly.
HOSTS_REQUIRING_SSH_CONFIG = ("rs2000", "vps1000")


class HealthConfigError(RuntimeError):
    """Raised when required PLATFORMCTL_* env vars are missing."""


def _smoke_env() -> dict[str, str]:
    env = os.environ.copy()
    
    # SSH user — required, no fallback
    if "PLATFORMCTL_SMOKE_SSH_USER" not in env:
        raise HealthConfigError(
            "PLATFORMCTL_SMOKE_SSH_USER env var is required "
            "(no hardcoded fallback; ralph #161 BLOCKER 3)"
        )
    env.setdefault("PLATFORMCTL_SMOKE_REMOTE_MODE", "skip")
    
    # Tailscale host IPs — required per host
    missing = []
    for host in HOSTS_REQUIRING_SSH_CONFIG:
        var = f"PLATFORMCTL_{host.upper()}_SSH_HOST"
        if var not in env:
            missing.append(var)
    if missing:
        raise HealthConfigError(
            f"Missing required SSH host env vars: {missing} "
            "(no hardcoded fallback; ralph #161 BLOCKER 3)"
        )
    return env


# In CLI handler — catch HealthConfigError and exit cleanly:
# (in cli.py health command)
try:
    result, exit_code = health_all() if all_modules else health_module(module_id)
except HealthConfigError as exc:
    click.echo(f"Configuration error: {exc}", err=True)
    sys.exit(EXIT_TOOL_ERROR)

Plus document required env vars w docs/forgejo-agent-operations.md lub similar.

VERIFY:

def test_smoke_env_raises_when_ssh_user_missing(monkeypatch):
    """Missing PLATFORMCTL_SMOKE_SSH_USER must raise (ralph BLOCKER 3)."""
    monkeypatch.delenv("PLATFORMCTL_SMOKE_SSH_USER", raising=False)
    monkeypatch.setenv("PLATFORMCTL_RS2000_SSH_HOST", "x")
    monkeypatch.setenv("PLATFORMCTL_VPS1000_SSH_HOST", "y")
    with pytest.raises(HealthConfigError, match="SMOKE_SSH_USER"):
        _smoke_env()


def test_smoke_env_raises_when_host_ips_missing(monkeypatch):
    """Missing host IP env vars must raise (ralph BLOCKER 3)."""
    monkeypatch.setenv("PLATFORMCTL_SMOKE_SSH_USER", "test-user")
    monkeypatch.delenv("PLATFORMCTL_RS2000_SSH_HOST", raising=False)
    with pytest.raises(HealthConfigError, match="RS2000_SSH_HOST"):
        _smoke_env()

Follow-up issues

SMOKE-ENV-WHITELIST-01 — Sanitize smoke subprocess environment

Linia 104: env = os.environ.copy() — full env inheritance dla smoke.sh. If smoke.sh compromised, exfiltrates AWS_*, GITHUB_TOKEN, anything in env.

Proposed: explicit whitelist:

SMOKE_ENV_WHITELIST = {"PATH", "HOME", "USER", "PLATFORMCTL_SMOKE_SSH_USER", ...}
env = {k: v for k, v in os.environ.items() if k in SMOKE_ENV_WHITELIST or k.startswith("PLATFORMCTL_")}

RUNBOOK-CONTAINER-NAME-01 — Validate runbook-extracted container names

Linia 171: re.search(r"Container:\s*\([^`]+)`", runbook.read_text())— accepts ANY non-backtick content jako container name. Then passed dodocker inspect(linia 202). Malicious runbook may inject shell metachars even pastshlex.quote` (depends on shell version).

Proposed: regex-validate runbook output: ^[a-zA-Z0-9_.-]+$. Reject unsafe.

ADVERSARIAL-INPUT-TESTS-01 — Test suite for hostile inputs

Add test cases dla: malformed module.yaml, corrupted runbook.md, very long module_id (DoS), Unicode confusables, null bytes w paths.

OPS-PREREQUISITES-DOC-01 — Document operational prerequisites

health command silently depends on: Tailscale auth, smoke.sh trust, --all fleet size (>50 modules may timeout). Document docs/health-prerequisites.md listing all preconditions + failure modes.


Action items

  1. Apply 3 BLOCKER patches → health.py + cli.py + tests
  2. Force-push do codex/issues/142-phase3-health
  3. File 4 follow-up issues
  4. Comment "ready for re-review" tutaj
  5. Operator może odpalić re-ralph

Note: Po merge wszystkich Phase 3 PRów (#162-#167) + #161, verify_approved_sha w #162 i health w #161 będą używać tej samej infrastruktury (Forgejo token, sacred paths). Consider single follow-up OPS-PREREQUISITES-DOC-01 covering all platformctl env vars.

— ralph batch 2026-05-10, claude-opus-4.7 (Pan Herbata) dispatching via codex identity

## Ralph review (5-iter chmurowy) — ITERATE_BLOCKER 4/9 Niezależny 5-iter ralph chain. Verdict + drafted patches poniżej. ### Per-dim scoring - Correctness: 5/9 - **Security: 2/9** ⚠️ (path traversal + crash-on-bad-module) - Observability: 8/9 (rich rollup) - Test coverage: 7/9 - Scope discipline: 9/9 (excellent — confined to health command) Evidence: `~/Iskra-i-Piotr/05 System/Swarmheart Backups/ralph-phase3-apply/161/`. --- ## BLOCKER 1 — Strict `module_id` validation (path traversal) **WHAT:** `health_module` linia 248: `manifest_path = md / module_id / "module.yaml"`. No validation. CLI accepts arbitrary string for `module_id`. **WHY:** Path traversal: ```bash platformctl health "../../etc/sudoers.d/iskra-watch" # md / "../../etc/sudoers.d/iskra-watch" / "module.yaml" # Resolves outside modules_dir to filesystem location attacker chose # Error message may leak file existence + path structure ``` Plus combined z line 258 `load_manifest(manifest_path, strict_v2=True)` — czyta YAML z arbitrary location, parse errors leak path. **HOW (drafted patch w `health.py` + `cli.py`):** ```python # health.py — add near top (after imports) _MODULE_ID_RE = re.compile(r"^[a-z0-9][a-z0-9-]*$") def _validate_module_id(module_id: str) -> str | None: """Return error message if module_id is unsafe, None if OK. SECURITY (ralph #161 BLOCKER 1): prevent path traversal via crafted module_id. """ if not isinstance(module_id, str) or not module_id: return "module_id must be a non-empty string" if not _MODULE_ID_RE.fullmatch(module_id): return ( f"module_id {module_id!r} contains unsafe characters; " f"must match {_MODULE_ID_RE.pattern}" ) return None # health_module entry — first check (linia 247 area): def health_module( module_id: str, *, modules_dir: Path | None = None, transport: TailscaleTransport | Any | None = None, smoke_runner: SmokeRunner | None = None, ) -> tuple[dict[str, Any], int]: """Return a health report for one module.""" # SECURITY (ralph #161 BLOCKER 1): validate before Path construction. err = _validate_module_id(module_id) if err is not None: return { "module": module_id, "generated_at": _now(), "status": STATUS_FAIL, "error": err, "exitCode": EXIT_TOOL_ERROR, }, EXIT_TOOL_ERROR md = modules_dir or find_modules_dir() manifest_path = md / module_id / "module.yaml" # ...rest unchanged ``` CLI also needs same check (`cli.py` health command): ```python # cli.py - in the health command handler, before calling health_module: @click.command("health") @click.argument("module_id", required=False) @click.option("--all", "all_modules", is_flag=True, help="Roll up all v2-cataloged modules") def health_cmd(module_id: str | None, all_modules: bool): if all_modules: result, exit_code = health_all() else: if not module_id: click.echo("Error: provide module_id or --all", err=True) sys.exit(EXIT_TOOL_ERROR) err = _validate_module_id(module_id) # ralph #161 BLOCKER 1 if err: click.echo(f"Error: {err}", err=True) sys.exit(EXIT_TOOL_ERROR) result, exit_code = health_module(module_id) # ... output sys.exit(exit_code) ``` **VERIFY (add to `tests/test_health_phase3.py`):** ```python @pytest.mark.parametrize("bad_id", [ "../../etc/passwd", "../escape", "foo/bar", "with space", "with.dot", "", "UPPER_case", # regex rejects uppercase by design "starting-dash", # must start z [a-z0-9] ]) def test_health_module_rejects_unsafe_module_id(bad_id): """module_id with path-unsafe chars must reject (ralph BLOCKER 1).""" result, exit_code = health_module(bad_id) assert exit_code == EXIT_TOOL_ERROR assert "unsafe characters" in result.get("error", "") or "non-empty" in result.get("error", "") ``` --- ## BLOCKER 2 — Exception isolation in `health_all` **WHAT:** `health_all` linia 298-306 — list comprehension: ```python results = [ health_module(module_id, ...)[0] for module_id in v2_module_ids(md) ] ``` `health_module` catches `Exception` w manifest load (linia 259), ale w `container_status` (linia 186-236) i `runner()` (linia 282-283), unexpected errors **propagate**. Same w `_smoke_env`, `_rollup_status`, `_exit_code_for_status`. Single bad module → entire `health_all` raises. **WHY:** Operator runs `platformctl health --all` żeby zobaczyć fleet state. Single corrupted manifest / broken container / transport bug → no report at all. Catastrophic operational blind-spot exactly when needed most. **HOW (drafted patch w `health.py` linie 290-306):** ```python def health_all( *, modules_dir: Path | None = None, transport: TailscaleTransport | Any | None = None, smoke_runner: SmokeRunner | None = None, ) -> tuple[dict[str, Any], int]: """Return a health rollup for all Phase 02 v2-cataloged modules. SECURITY (ralph #161 BLOCKER 2): isolate each module's check from siblings. Single broken module must NOT discard entire fleet report. """ md = modules_dir or find_modules_dir() results: list[dict[str, Any]] = [] for module_id in v2_module_ids(md): try: module_result, _exit = health_module( module_id, modules_dir=md, transport=transport, smoke_runner=smoke_runner, ) except Exception as exc: # noqa: BLE001 — explicit fleet-isolation module_result = { "module": module_id, "generated_at": _now(), "status": STATUS_FAIL, "error": f"unhandled error in health_module: {type(exc).__name__}: {exc}", "exitCode": EXIT_TOOL_ERROR, } results.append(module_result) summary = _summary(results) status = _summary_status(summary) report = { "command": "health", "scope": "all", "generated_at": _now(), "status": status, "summary": summary, "results": results, "exitCode": _exit_code_for_status(status), } return report, int(report["exitCode"]) ``` **VERIFY:** ```python def test_health_all_continues_when_one_module_crashes(monkeypatch): """One bad module must NOT discard fleet report (ralph BLOCKER 2).""" # Mock v2_module_ids to return ["good-a", "bad-b", "good-c"] monkeypatch.setattr("platformctl.health.v2_module_ids", lambda md: ["good-a", "bad-b", "good-c"]) def mock_health_module(mid, **kw): if mid == "bad-b": raise RuntimeError("transport.run hung on missing key") return ({"module": mid, "status": STATUS_OK, "exitCode": 0}, 0) monkeypatch.setattr("platformctl.health.health_module", mock_health_module) report, exit_code = health_all() assert len(report["results"]) == 3 # all three present statuses = [r["status"] for r in report["results"]] assert statuses.count(STATUS_OK) == 2 assert statuses.count(STATUS_FAIL) == 1 bad = next(r for r in report["results"] if r["module"] == "bad-b") assert "unhandled error" in bad["error"] assert "RuntimeError" in bad["error"] ``` --- ## BLOCKER 3 — Externalize hardcoded infrastructure constants **WHAT:** `health.py` linia 32-36: ```python DEFAULT_SMOKE_SSH_USER = "platform-host-agent" HOST_ENV = { "rs2000": ("PLATFORMCTL_RS2000_SSH_HOST", "100.110.188.20"), "vps1000": ("PLATFORMCTL_VPS1000_SSH_HOST", "100.79.239.52"), } ``` Tailscale IPs hardcoded as fallback. SSH user hardcoded. **WHY:** - **Info disclosure:** internal Tailscale IPs w source code (jeśli repo ever public) - **Deployment fragility:** moving servers / changing IPs requires code edit - **Trust-based fallback:** env var unset → silent use of hardcoded values, no warning operator **HOW (drafted patch w `health.py` linie 32-36):** ```python # health.py — replace hardcoded fallbacks (linie 32-36): # SECURITY (ralph #161 BLOCKER 3): infrastructure constants must come from env. # No hardcoded IPs or usernames in source. Operator/CI configures explicitly. HOSTS_REQUIRING_SSH_CONFIG = ("rs2000", "vps1000") class HealthConfigError(RuntimeError): """Raised when required PLATFORMCTL_* env vars are missing.""" def _smoke_env() -> dict[str, str]: env = os.environ.copy() # SSH user — required, no fallback if "PLATFORMCTL_SMOKE_SSH_USER" not in env: raise HealthConfigError( "PLATFORMCTL_SMOKE_SSH_USER env var is required " "(no hardcoded fallback; ralph #161 BLOCKER 3)" ) env.setdefault("PLATFORMCTL_SMOKE_REMOTE_MODE", "skip") # Tailscale host IPs — required per host missing = [] for host in HOSTS_REQUIRING_SSH_CONFIG: var = f"PLATFORMCTL_{host.upper()}_SSH_HOST" if var not in env: missing.append(var) if missing: raise HealthConfigError( f"Missing required SSH host env vars: {missing} " "(no hardcoded fallback; ralph #161 BLOCKER 3)" ) return env # In CLI handler — catch HealthConfigError and exit cleanly: # (in cli.py health command) try: result, exit_code = health_all() if all_modules else health_module(module_id) except HealthConfigError as exc: click.echo(f"Configuration error: {exc}", err=True) sys.exit(EXIT_TOOL_ERROR) ``` Plus document required env vars w `docs/forgejo-agent-operations.md` lub similar. **VERIFY:** ```python def test_smoke_env_raises_when_ssh_user_missing(monkeypatch): """Missing PLATFORMCTL_SMOKE_SSH_USER must raise (ralph BLOCKER 3).""" monkeypatch.delenv("PLATFORMCTL_SMOKE_SSH_USER", raising=False) monkeypatch.setenv("PLATFORMCTL_RS2000_SSH_HOST", "x") monkeypatch.setenv("PLATFORMCTL_VPS1000_SSH_HOST", "y") with pytest.raises(HealthConfigError, match="SMOKE_SSH_USER"): _smoke_env() def test_smoke_env_raises_when_host_ips_missing(monkeypatch): """Missing host IP env vars must raise (ralph BLOCKER 3).""" monkeypatch.setenv("PLATFORMCTL_SMOKE_SSH_USER", "test-user") monkeypatch.delenv("PLATFORMCTL_RS2000_SSH_HOST", raising=False) with pytest.raises(HealthConfigError, match="RS2000_SSH_HOST"): _smoke_env() ``` --- ## Follow-up issues ### `SMOKE-ENV-WHITELIST-01` — Sanitize smoke subprocess environment > Linia 104: `env = os.environ.copy()` — full env inheritance dla `smoke.sh`. If smoke.sh compromised, exfiltrates AWS_*, GITHUB_TOKEN, anything in env. > > Proposed: explicit whitelist: > ```python > SMOKE_ENV_WHITELIST = {"PATH", "HOME", "USER", "PLATFORMCTL_SMOKE_SSH_USER", ...} > env = {k: v for k, v in os.environ.items() if k in SMOKE_ENV_WHITELIST or k.startswith("PLATFORMCTL_")} > ``` ### `RUNBOOK-CONTAINER-NAME-01` — Validate runbook-extracted container names > Linia 171: `re.search(r"Container:\s*\`([^\`]+)\`", runbook.read_text())` — accepts ANY non-backtick content jako container name. Then passed do `docker inspect` (linia 202). Malicious runbook may inject shell metachars even past `shlex.quote` (depends on shell version). > > Proposed: regex-validate runbook output: `^[a-zA-Z0-9_.-]+$`. Reject unsafe. ### `ADVERSARIAL-INPUT-TESTS-01` — Test suite for hostile inputs > Add test cases dla: malformed module.yaml, corrupted runbook.md, very long module_id (DoS), Unicode confusables, null bytes w paths. ### `OPS-PREREQUISITES-DOC-01` — Document operational prerequisites > health command silently depends on: Tailscale auth, smoke.sh trust, --all fleet size (>50 modules may timeout). Document `docs/health-prerequisites.md` listing all preconditions + failure modes. --- ## Action items 1. Apply 3 BLOCKER patches → `health.py` + `cli.py` + tests 2. Force-push do `codex/issues/142-phase3-health` 3. File 4 follow-up issues 4. Comment "ready for re-review" tutaj 5. Operator może odpalić re-ralph **Note:** Po merge wszystkich Phase 3 PRów (#162-#167) + #161, `verify_approved_sha` w #162 i `health` w #161 będą używać tej samej infrastruktury (Forgejo token, sacred paths). Consider single follow-up `OPS-PREREQUISITES-DOC-01` covering all platformctl env vars. — ralph batch 2026-05-10, claude-opus-4.7 (Pan Herbata) dispatching via codex identity
codex changed target branch from codex/issues/142-phase3-apply to main 2026-05-12 01:03:28 +02:00
codex force-pushed codex/issues/142-phase3-health from 8a1e193c0e
All checks were successful
canary-required / collect-diff (pull_request) Successful in 3s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 16s
python-ci / Python 3.11 (pull_request) Successful in 42s
python-ci / Python 3.12 (pull_request) Successful in 44s
python-ci / Python 3.13 (pull_request) Successful in 44s
canary-required / canary (pull_request) Successful in 13s
to 97bf232a1f
All checks were successful
canary-required / collect-diff (pull_request) Successful in 3s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 16s
python-ci / Python 3.11 (pull_request) Successful in 44s
python-ci / Python 3.12 (pull_request) Successful in 45s
python-ci / Python 3.13 (pull_request) Successful in 45s
canary-required / canary (pull_request) Successful in 12s
2026-05-12 01:08:04 +02:00
Compare
Author
Collaborator

Ready for re-review — ralph blockers addressed

Updated #161 on commit 97bf232a1f7cfcc43cfe0a1d73d632de8c8575e0 and reset the PR base to main.

Branch hygiene

  • Rebuilt the branch from origin/main and cherry-picked only the two health-rollup commits.
  • Removed the stale apply-stack commits from this PR diff; #161 is now health-only again.

Addressed

  • BLOCKER 1: health_module validates module_id before path construction; unsafe IDs fail closed without reading outside modules/.
  • BLOCKER 2: health_all isolates per-module exceptions so one broken module yields a FAIL row instead of discarding the fleet report.
  • BLOCKER 3: smoke SSH user and RS2000/VPS1000 host values now require explicit PLATFORMCTL_* env vars; no hardcoded Tailnet IP fallback remains in health.py.

Verification

  • PYTHONPATH=control-plane pytest control-plane/platformctl/tests/test_health_phase3.py control-plane/platformctl/tests/test_smoke.py -q → 24 passed
  • Full tests/run-verify.sh still blocked by pre-existing main prompt debt: prompts/codex-rs2000-close-2026-05-12.md token budget + missing P2 image-prune prompt reference.

Follow-ups filed

  • #210 SMOKE-ENV-WHITELIST-01
  • #211 RUNBOOK-CONTAINER-NAME-01
  • #212 ADVERSARIAL-INPUT-TESTS-01
  • #213 OPS-PREREQUISITES-DOC-01
## Ready for re-review — ralph blockers addressed Updated #161 on commit `97bf232a1f7cfcc43cfe0a1d73d632de8c8575e0` and reset the PR base to `main`. ### Branch hygiene - Rebuilt the branch from `origin/main` and cherry-picked only the two health-rollup commits. - Removed the stale apply-stack commits from this PR diff; #161 is now health-only again. ### Addressed - BLOCKER 1: `health_module` validates `module_id` before path construction; unsafe IDs fail closed without reading outside `modules/`. - BLOCKER 2: `health_all` isolates per-module exceptions so one broken module yields a FAIL row instead of discarding the fleet report. - BLOCKER 3: smoke SSH user and RS2000/VPS1000 host values now require explicit `PLATFORMCTL_*` env vars; no hardcoded Tailnet IP fallback remains in `health.py`. ### Verification - `PYTHONPATH=control-plane pytest control-plane/platformctl/tests/test_health_phase3.py control-plane/platformctl/tests/test_smoke.py -q` → 24 passed - Full `tests/run-verify.sh` still blocked by pre-existing main prompt debt: `prompts/codex-rs2000-close-2026-05-12.md` token budget + missing P2 image-prune prompt reference. ### Follow-ups filed - #210 SMOKE-ENV-WHITELIST-01 - #211 RUNBOOK-CONTAINER-NAME-01 - #212 ADVERSARIAL-INPUT-TESTS-01 - #213 OPS-PREREQUISITES-DOC-01
Sign in to join this conversation.
No reviewers
No labels
W6d-automerge-calibration
agent/claude-code
agent/codex
agent/hermes
agent/iskra
agent/ollama
agent/patchwarden
automerge-candidate
class/security-sensitive
cutover-gate
dependency/blocked
dependency/blocks-others
dependency/cross-repo
dependency/needs-confirmation
domain:agents
domain:ci
domain:docs
domain:forgejo
domain:infra
domain:memory
domain:runtime
domain:signal
domain:ux
flow/architecture
flow/blocked
flow/deployed
flow/done
flow/implementation
flow/intake
flow/maintained
flow/observed
flow/ready
flow/refining
flow/retired
flow/review
iterating
judge/codex-candidate
judge/hermes-candidate
judge/low-confidence
judge/needs-refinement
judge/operator-needed
judge/p0
judge/p1
judge/p2
judge/p3
judge/park
judge/patchwarden-candidate
judge/stale-priority
kind/adr
kind/bug
kind/chore
kind/feature
kind/infra
kind/ops
kind/refactor
kind/research
large-impact
merge/auto
merge/manual
merge/manual-dependency-conflict
merge/manual-failing-tests
merge/manual-merge-conflict
merge/manual-missing-review
merge/manual-operator-preference
merge/manual-red-zone
merge/manual-security-sensitive
merge/manual-unclear-scope
merge/manual-unknown
meta
mode:operator-only
mode:patchwarden-iskra-approved
mode:safe-auto
needs-operator-decision
needs-triage
not-ready
observed/erroring
observed/needs-followup
observed/pending
observed/retire-candidate
observed/unused
observed/used
operator-emotional
owner-attention
phase/02
phase/03
priority:p0
priority:p1
priority:p2
priority:p3
proposed
ready-for-agent
ready-for-operator
recovery
review:claude-reviewed
review:codex-reviewed
review:dziadek-reviewed
review:needs-human
risk/exposure
risk/process
risk/product
risk/runtime
safety:external-write
safety:no-prod-mutation
safety:prod-impact
safety:secret-touch
size/large
size/medium
size/small
size/tiny
size/unknown
source/adr
source/agent-generated
source/manual
source/operator-chat
source/voice-note
status:blocked
status:codex-ready
status:merged:pending-evidence
status:needs-evidence
status:operator-needed
status:parked
tier/full
tier/lite
tier/stacked
tier:0-platform-substrate
tier:1-iskra-value-layer
tier:2-tools-products-modules
type:bug
type:chore
type:docs
type:feat
type:policy
type:research
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
pdurlej/platform!161
No description provided.