codex commented

Collaborator

Canary status: iterating — canary rerun requested after Codex fixes for PR #161 blocker findings

Canary Context Pack

Product story

Phase 3 needs platformctl health to become the read-only confidence surface before later phases deploy or cut over RS2000 services. The operator should be able to ask whether a module is healthy without spelunking manually through manifests, smoke scripts, and Docker status.

What changed

Added control-plane/platformctl/health.py.
Replaced the CLI health skeleton with real rollup behavior.
Added platformctl health for all Phase 02 v2-cataloged modules.
Added platformctl health --module <id> and retained platformctl health <id>.
Health combines strict-v2 manifest validation, tests/smoke.sh --json, and read-only Docker container status through TailscaleTransport.
Added focused tests with fake transport and fake smoke outputs.

Why it changed

This is Packet 3.4 from the cutover flight. It unblocks Phase 3's operational control-plane surface, but does not perform production mutation.

Files touched

control-plane/platformctl/health.py
control-plane/platformctl/cli.py
control-plane/platformctl/tests/test_health_phase3.py

Relevant context

prompts/codex-cutover-flight/dispatch.md
prompts/codex-cutover-flight/phase-3-operational.md Packet 3.4
tests/smoke.sh
control-plane/platformctl/transport/tailscale.py
control-plane/platformctl/manifest.py
control-plane/platformctl/plan.py

Runtime evidence

Read-only manual probe was run:

PYTHONPATH=control-plane python3 -m platformctl.cli health --module honcho-redis --json

Result: command executes and returns structured health JSON, but exits 5 because TailscaleTransport cannot currently SSH as platform-host-agent to RS2000:

manifest strict-v2: OK
smoke.sh: OK, 4 passed / 0 failed / 3 skipped
container check via platformctl transport: FAIL, ssh: connect to host 100.110.188.20 port 22: Connection refused

Additional read-only check: ssh rs2000 resolves to operator SSH config as root/public host, while platform-host-agent@rs2000 returns Permission denied (publickey). I did not bypass this with root. This is a runtime access gate, not a code-test failure.

Known constraints

Health is read-only.
Tests use only fake transport and fake smoke runner for SSH/container status.
The command intentionally uses platform-host-agent through TailscaleTransport, not the operator root SSH alias.
The runtime SSH access gap means Phase 3 manual smoke is not fully satisfied yet.

Explicit out-of-scope

No production mutation.
No SSH key provisioning.
No Tailscale ACL mutation.
No root SSH fallback.
No changes to tests/smoke.sh semantics.

Requested decision

Review the code path for merge readiness as Packet 3.4. Separately, treat the platform-host-agent SSH failure as a Phase 3 runtime gate before declaring Phase 3 complete.

Merge blockers

Any code path that mutates runtime during health.
Any secret or SSH credential leak.
Any hidden root fallback.
Any reviewer finding that the runtime SSH access gap must be solved in this PR rather than tracked as the Phase 3 manual-smoke gate.

Tests

PYTHONPATH=control-plane python3 -m pytest control-plane/platformctl/tests/test_health_phase3.py -q — 10 passed
python3 -m py_compile control-plane/platformctl/health.py control-plane/platformctl/cli.py — passed
PYTHONPATH=control-plane python3 -m pytest control-plane/platformctl/tests control-plane/platformctl/transport/tests -q — 349 passed
PYTHONPATH=control-plane python3 -m pytest tests -q — 318 passed / 15 skipped
git diff --check — clean
PYTHONPATH=control-plane python3 -m platformctl.cli health --module honcho-redis --json — command works but exits 5 due platform-host-agent SSH runtime gate described above

Spec sources read

prompts/codex-cutover-flight/dispatch.md — phase sequencing and stop rules
prompts/codex-cutover-flight/phase-3-operational.md — Packet 3.4 acceptance criteria
control-plane/platformctl/cli.py — existing health skeleton and CLI wiring
control-plane/platformctl/manifest.py — strict-v2 load behavior
control-plane/platformctl/plan.py — container naming and exit constants
control-plane/platformctl/transport/tailscale.py — read-only SSH transport contract
tests/smoke.sh — smoke JSON output contract
control-plane/platformctl/tests/test_plan_phase3.py and test_apply_phase3.py — local Phase 3 test patterns

Refs #142

Canary status: iterating — canary rerun requested after Codex fixes for PR #161 blocker findings ## Canary Context Pack ### Product story Phase 3 needs `platformctl health` to become the read-only confidence surface before later phases deploy or cut over RS2000 services. The operator should be able to ask whether a module is healthy without spelunking manually through manifests, smoke scripts, and Docker status. ### What changed - Added `control-plane/platformctl/health.py`. - Replaced the CLI health skeleton with real rollup behavior. - Added `platformctl health` for all Phase 02 v2-cataloged modules. - Added `platformctl health --module <id>` and retained `platformctl health <id>`. - Health combines strict-v2 manifest validation, `tests/smoke.sh --json`, and read-only Docker container status through `TailscaleTransport`. - Added focused tests with fake transport and fake smoke outputs. ### Why it changed This is Packet 3.4 from the cutover flight. It unblocks Phase 3's operational control-plane surface, but does not perform production mutation. ### Files touched - `control-plane/platformctl/health.py` - `control-plane/platformctl/cli.py` - `control-plane/platformctl/tests/test_health_phase3.py` ### Relevant context - `prompts/codex-cutover-flight/dispatch.md` - `prompts/codex-cutover-flight/phase-3-operational.md` Packet 3.4 - `tests/smoke.sh` - `control-plane/platformctl/transport/tailscale.py` - `control-plane/platformctl/manifest.py` - `control-plane/platformctl/plan.py` ### Runtime evidence Read-only manual probe was run: `PYTHONPATH=control-plane python3 -m platformctl.cli health --module honcho-redis --json` Result: command executes and returns structured health JSON, but exits `5` because `TailscaleTransport` cannot currently SSH as `platform-host-agent` to RS2000: - manifest strict-v2: OK - smoke.sh: OK, 4 passed / 0 failed / 3 skipped - container check via platformctl transport: FAIL, `ssh: connect to host 100.110.188.20 port 22: Connection refused` Additional read-only check: `ssh rs2000` resolves to operator SSH config as root/public host, while `platform-host-agent@rs2000` returns `Permission denied (publickey)`. I did not bypass this with root. This is a runtime access gate, not a code-test failure. ### Known constraints - Health is read-only. - Tests use only fake transport and fake smoke runner for SSH/container status. - The command intentionally uses `platform-host-agent` through `TailscaleTransport`, not the operator root SSH alias. - The runtime SSH access gap means Phase 3 manual smoke is not fully satisfied yet. ### Explicit out-of-scope - No production mutation. - No SSH key provisioning. - No Tailscale ACL mutation. - No root SSH fallback. - No changes to `tests/smoke.sh` semantics. ### Requested decision Review the code path for merge readiness as Packet 3.4. Separately, treat the `platform-host-agent` SSH failure as a Phase 3 runtime gate before declaring Phase 3 complete. ### Merge blockers - Any code path that mutates runtime during health. - Any secret or SSH credential leak. - Any hidden root fallback. - Any reviewer finding that the runtime SSH access gap must be solved in this PR rather than tracked as the Phase 3 manual-smoke gate. ## Tests - `PYTHONPATH=control-plane python3 -m pytest control-plane/platformctl/tests/test_health_phase3.py -q` — 10 passed - `python3 -m py_compile control-plane/platformctl/health.py control-plane/platformctl/cli.py` — passed - `PYTHONPATH=control-plane python3 -m pytest control-plane/platformctl/tests control-plane/platformctl/transport/tests -q` — 349 passed - `PYTHONPATH=control-plane python3 -m pytest tests -q` — 318 passed / 15 skipped - `git diff --check` — clean - `PYTHONPATH=control-plane python3 -m platformctl.cli health --module honcho-redis --json` — command works but exits 5 due platform-host-agent SSH runtime gate described above ## Spec sources read - `prompts/codex-cutover-flight/dispatch.md` — phase sequencing and stop rules - `prompts/codex-cutover-flight/phase-3-operational.md` — Packet 3.4 acceptance criteria - `control-plane/platformctl/cli.py` — existing health skeleton and CLI wiring - `control-plane/platformctl/manifest.py` — strict-v2 load behavior - `control-plane/platformctl/plan.py` — container naming and exit constants - `control-plane/platformctl/transport/tailscale.py` — read-only SSH transport contract - `tests/smoke.sh` — smoke JSON output contract - `control-plane/platformctl/tests/test_plan_phase3.py` and `test_apply_phase3.py` — local Phase 3 test patterns Refs #142

codex added 1 commit

2026-05-10 04:13:18 +02:00

feat(platformctl): add health rollup

canary-required / collect-diff (pull_request) Successful in 4s

Details

pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 15s

Details

python-ci / Python 3.11 (pull_request) Successful in 41s

Details

python-ci / Python 3.12 (pull_request) Successful in 43s

Details

python-ci / Python 3.13 (pull_request) Successful in 43s

Details

canary-required / canary (pull_request) Successful in 12s

Details

5a1f3107bb

codex added the

class/security-sensitive

label

2026-05-10 04:13:25 +02:00

codex commented

2026-05-10 04:14:48 +02:00

Author

Collaborator

Packet 3.4 external review checkpoint

Ollama Cloud expensive-review gate: 3/3 approve, no blockers.

Reviewers:

deepseek-v4-pro:cloud: APPROVE; blockers none.
kimi-k2.6:cloud: APPROVE; blockers none.
minimax-m2.7:cloud: APPROVE; blockers none.

Shared reviewer conclusion:

No production mutation path introduced.
Health output does not expose credentials, env vars, SSH args, or secrets beyond expected module/container/smoke status.
OK / DEGRADED / FAIL status model with mapped exit codes is coherent enough for Phase 3.
The platform-host-agent SSH failure is a runtime access gate before Phase 3 completion, not a code blocker for this PR.
Tests are production-safe via FakeTransport, injected smoke runners, and temp files.

Reviewer non-blockers captured:

EXIT_NO_CHANGES for healthy status is semantically odd but harmless because it maps to success/0.
Last-JSON-line smoke parsing could be brittle if tests/smoke.sh ever emits multiple JSON objects.
Any in the transport injection type weakens static type clarity but keeps tests simple.
Runtime prerequisite should be documented/resolved: platform-host-agent SSH access to RS2000.

Official platform canary remains missing and should still be fired before merge per repo policy.

## Packet 3.4 external review checkpoint Ollama Cloud expensive-review gate: **3/3 approve, no blockers**. Reviewers: - `deepseek-v4-pro:cloud`: APPROVE; blockers none. - `kimi-k2.6:cloud`: APPROVE; blockers none. - `minimax-m2.7:cloud`: APPROVE; blockers none. Shared reviewer conclusion: - No production mutation path introduced. - Health output does not expose credentials, env vars, SSH args, or secrets beyond expected module/container/smoke status. - OK / DEGRADED / FAIL status model with mapped exit codes is coherent enough for Phase 3. - The `platform-host-agent` SSH failure is a runtime access gate before Phase 3 completion, not a code blocker for this PR. - Tests are production-safe via `FakeTransport`, injected smoke runners, and temp files. Reviewer non-blockers captured: - `EXIT_NO_CHANGES` for healthy status is semantically odd but harmless because it maps to success/0. - Last-JSON-line smoke parsing could be brittle if `tests/smoke.sh` ever emits multiple JSON objects. - `Any` in the transport injection type weakens static type clarity but keeps tests simple. - Runtime prerequisite should be documented/resolved: `platform-host-agent` SSH access to RS2000. Official platform canary remains **missing** and should still be fired before merge per repo policy.

codex referenced this pull request

2026-05-10 04:15:01 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

codex commented

2026-05-10 08:19:16 +02:00

Author

Collaborator

Runtime evidence refresh after B-safe platform-host-agent bootstrap on RS2000:

restricted platform-host-agent SSH now works to 100.110.188.20
platformctl plan honcho-redis --json -> status: in-sync, exitCode: 0
platformctl health --module honcho-redis --json -> status: OK, manifest OK, container running, smoke 4 passed / 0 failed / 3 skipped, exitCode: 0
fake SHA apply still refuses before runtime (exitCode: 6)

Residual outside this PR: RS2000 Tailscale tags are still empty, so CI/Codex-tagged apply needs the separate tag/ACL gate before production apply is ready. Official canary is still missing.

Runtime evidence refresh after B-safe `platform-host-agent` bootstrap on RS2000: - restricted `platform-host-agent` SSH now works to `100.110.188.20` - `platformctl plan honcho-redis --json` -> `status: in-sync`, `exitCode: 0` - `platformctl health --module honcho-redis --json` -> `status: OK`, manifest OK, container running, smoke 4 passed / 0 failed / 3 skipped, `exitCode: 0` - fake SHA apply still refuses before runtime (`exitCode: 6`) Residual outside this PR: RS2000 Tailscale tags are still empty, so CI/Codex-tagged apply needs the separate tag/ACL gate before production apply is ready. Official canary is still missing.

claude commented

2026-05-10 08:34:20 +02:00

Collaborator

3+3 ensemble review by `claude` — tech + product hats

One of three independent AI reviewers (claude / codex / glm). See state/reviews/PR-161/decision_packet.md for consolidated risk across all 6 outputs.

Tech hat: ✅ OK (confidence 0.72)

Risks

medium — SSHError path in container_status has zero test coverage
- Evidence: control-plane/platformctl/health.py:177-180 catches SSHError and returns FAIL with the raw exception string; the runtime evidence in the PR description shows this is the actual production failure path right now (Connection refused, Permission denied). All container tests use a FakeTransport that
- Recommendation: Add a test where FakeTransport.run raises SSHError, asserting the report has status=FAIL, state='unknown', and the error string is preserved. This is the only path the operator hit for real, so it should not be the untested one.
low — JSON output shape differs between health_module and health_all
- Evidence: health.py:health_module returns {module, generated_at, checks, host, lifecycle, criticality, status, exitCode}; health_all returns a wrapper {command:'health', scope, generated_at, status, summary, results, exitCode}. A consumer that switches on commandorscope will break on per-module output.
- Recommendation: Wrap single-module reports in the same envelope (e.g. {command:'health', scope:, results:[], summary, status, exitCode}) or document the two shapes explicitly in health.py docstring so consumers can branch deliberately.

Opportunities

Surface broken manifests in health all instead of silently dropping them — _is_v2_cataloged (health.py:55-62) catches every Exception from yaml.safe_load and returns False. A module with a syntactically broken module.yaml is silently excluded from health all, so the rollup reports OK while a manifest is corrupt. Consider returning the module with a DEGRADED/FAIL manifest entry so the operator sees the breakage on the same surface.
Empty v2 module set returns STATUS_OK — _summary_status (health.py:233-238) returns OK when summary['total']==0. If something accidentally hides v2 cataloging (e.g. modules dir misresolved, all manifests fail _is_v2_cataloged), health all reports OK with zero modules. A 'no v2 modules found' notice or DEGRADED status would prevent a green-but-empty rollup from masking visibility loss.

Product hat: ✅ OK (confidence 0.80)

Risks

medium — Health command will exit 5 / report FAIL on every module until platform-host-agent SSH gate is fixed
- Evidence: PR description: platformctl health --module honcho-redis --json exits 5 because TailscaleTransport cannot SSH as platform-host-agent. health.py:165-180 returns STATUS_FAIL on SSHError, which rolls up to FAIL via _rollup_status.
- Recommendation: Merge is fine — the code is correct and the runtime gap is honestly tracked as a separate Phase 3 gate. But: don't run platformctl health expecting useful output until the SSH gate is closed, and don't interpret the resulting FAIL wall as 'platform is broken.' Consider a follow-up to add a --skip-container flag so the command becomes useful for manifest+smoke validation in the meantime.

Opportunities

Surface non-v2-cataloged modules instead of silently excluding them — v2_module_ids() (health.py:60-67) and _is_v2_cataloged() (health.py:46-58) silently drop modules whose manifest fails YAML parse OR lacks v2 audit fields. After merge, platformctl health rollup may quietly omit modules the operator expects to see. A one-line note in the human output ('N modules excluded: not v2-cataloged') would prevent 'where did module X go?' confusion without changing scope.

# 3+3 ensemble review by `claude` — tech + product hats > One of three independent AI reviewers (claude / codex / glm). See `state/reviews/PR-161/decision_packet.md` for consolidated risk across all 6 outputs. ## Tech hat: ✅ **OK** (confidence 0.72) ### Risks - **`medium`** — SSHError path in container_status has zero test coverage - Evidence: `control-plane/platformctl/health.py:177-180 catches SSHError and returns FAIL with the raw exception string; the runtime evidence in the PR description shows this is the actual production failure path right now (`Connection refused`, `Permission denied`). All container tests use a FakeTransport that` - Recommendation: Add a test where FakeTransport.run raises SSHError, asserting the report has status=FAIL, state='unknown', and the error string is preserved. This is the only path the operator hit for real, so it should not be the untested one. - **`low`** — JSON output shape differs between health_module and health_all - Evidence: `health.py:health_module returns {module, generated_at, checks, host, lifecycle, criticality, status, exitCode}; health_all returns a wrapper {command:'health', scope, generated_at, status, summary, results, exitCode}. A consumer that switches on `command` or `scope` will break on per-module output.` - Recommendation: Wrap single-module reports in the same envelope (e.g. {command:'health', scope:<id>, results:[<report>], summary, status, exitCode}) or document the two shapes explicitly in health.py docstring so consumers can branch deliberately. ### Opportunities - **Surface broken manifests in `health all` instead of silently dropping them** — _is_v2_cataloged (health.py:55-62) catches every Exception from yaml.safe_load and returns False. A module with a syntactically broken module.yaml is silently excluded from `health all`, so the rollup reports OK while a manifest is corrupt. Consider returning the module with a DEGRADED/FAIL manifest entry so the operator sees the breakage on the same surface. - **Empty v2 module set returns STATUS_OK** — _summary_status (health.py:233-238) returns OK when summary['total']==0. If something accidentally hides v2 cataloging (e.g. modules dir misresolved, all manifests fail _is_v2_cataloged), `health all` reports OK with zero modules. A 'no v2 modules found' notice or DEGRADED status would prevent a green-but-empty rollup from masking visibility loss. ## Product hat: ✅ **OK** (confidence 0.80) ### Risks - **`medium`** — Health command will exit 5 / report FAIL on every module until platform-host-agent SSH gate is fixed - Evidence: `PR description: `platformctl health --module honcho-redis --json` exits 5 because TailscaleTransport cannot SSH as platform-host-agent. health.py:165-180 returns STATUS_FAIL on SSHError, which rolls up to FAIL via _rollup_status.` - Recommendation: Merge is fine — the code is correct and the runtime gap is honestly tracked as a separate Phase 3 gate. But: don't run `platformctl health` expecting useful output until the SSH gate is closed, and don't interpret the resulting FAIL wall as 'platform is broken.' Consider a follow-up to add a `--skip-container` flag so the command becomes useful for manifest+smoke validation in the meantime. ### Opportunities - **Surface non-v2-cataloged modules instead of silently excluding them** — v2_module_ids() (health.py:60-67) and _is_v2_cataloged() (health.py:46-58) silently drop modules whose manifest fails YAML parse OR lacks v2 audit fields. After merge, `platformctl health` rollup may quietly omit modules the operator expects to see. A one-line note in the human output ('N modules excluded: not v2-cataloged') would prevent 'where did module X go?' confusion without changing scope.

codex commented

2026-05-10 08:34:20 +02:00

Author

Collaborator

3+3 ensemble review by `codex` — tech + product hats

One of three independent AI reviewers (claude / codex / glm). See state/reviews/PR-161/decision_packet.md for consolidated risk across all 6 outputs.

Tech hat: ❌ NOT_OK (confidence 0.93)

Risks

blocker — Health smoke path bypasses platform-host-agent identity
- Evidence: control-plane/platformctl/health.py:104-106 invokes tests/smoke.sh, and tests/smoke.sh:105-108 runs plain ssh "$host". That uses the operator SSH config/user, while TailscaleTransport correctly forces platform-host-agent. The PR's own evidence shows smoke passed while the platform-host-agent contain
- Recommendation: Do not merge until the smoke leg uses the same platform-host-agent transport/identity as the container leg, or until platformctl health excludes smoke.sh checks that perform SSH through the operator alias. Add a regression test that fails if smoke execution can succeed via plain host SSH while TailscaleTransport access fails.
blocker — Container check assumes wrong names for cataloged modules
- Evidence: control-plane/platformctl/health.py:165 calls plan.container_name(); control-plane/platformctl/plan.py:72-75 defaults to home-platform-<compose_service>-1. But v2-cataloged modules include non-default containers, e.g. modules/agaria-nginx/module.yaml:30-32 has compose_service agaria-nginx while modu
- Recommendation: Resolve container identity from an explicit manifest field or the same runbook/runtime source smoke.sh already uses before rolling up container status. Add a test fixture for a v2 module with compose_service agaria-nginx and container agaria-nginx so this cannot regress.

Product hat: ❌ NOT_OK (confidence 0.86)

Risks

high — Default health command can become an attention trap
- Evidence: control-plane/platformctl/health.py:205
- Recommendation: Do not merge the default all-module rollup until it has a bounded operator-facing mode, such as a timeout/concurrency limit or a narrower default with explicit all-scope opt-in.
medium — PR exceeds the operator's review-size norm
- Evidence: Diff stats: +624/-19 across 3 files
- Recommendation: Split implementation and tests/CLI wiring into smaller reviewable PRs, or get explicit operator acceptance that this Phase 3 packet is intentionally above the <=300 LOC norm.

Opportunities

Make the runtime gate impossible to miss — The PR description says platform-host-agent SSH failure is a Phase 3 gate, but the diff does not create a visible tracking artifact. Link or create the follow-up issue before merge so the operator does not have to remember it.

# 3+3 ensemble review by `codex` — tech + product hats > One of three independent AI reviewers (claude / codex / glm). See `state/reviews/PR-161/decision_packet.md` for consolidated risk across all 6 outputs. ## Tech hat: ❌ **NOT_OK** (confidence 0.93) ### Risks - **`blocker`** — Health smoke path bypasses platform-host-agent identity - Evidence: `control-plane/platformctl/health.py:104-106 invokes tests/smoke.sh, and tests/smoke.sh:105-108 runs plain ssh "$host". That uses the operator SSH config/user, while TailscaleTransport correctly forces platform-host-agent. The PR's own evidence shows smoke passed while the platform-host-agent contain` - Recommendation: Do not merge until the smoke leg uses the same platform-host-agent transport/identity as the container leg, or until platformctl health excludes smoke.sh checks that perform SSH through the operator alias. Add a regression test that fails if smoke execution can succeed via plain host SSH while TailscaleTransport access fails. - **`blocker`** — Container check assumes wrong names for cataloged modules - Evidence: `control-plane/platformctl/health.py:165 calls plan.container_name(); control-plane/platformctl/plan.py:72-75 defaults to home-platform-<compose_service>-1. But v2-cataloged modules include non-default containers, e.g. modules/agaria-nginx/module.yaml:30-32 has compose_service agaria-nginx while modu` - Recommendation: Resolve container identity from an explicit manifest field or the same runbook/runtime source smoke.sh already uses before rolling up container status. Add a test fixture for a v2 module with compose_service agaria-nginx and container agaria-nginx so this cannot regress. ## Product hat: ❌ **NOT_OK** (confidence 0.86) ### Risks - **`high`** — Default health command can become an attention trap - Evidence: `control-plane/platformctl/health.py:205` - Recommendation: Do not merge the default all-module rollup until it has a bounded operator-facing mode, such as a timeout/concurrency limit or a narrower default with explicit all-scope opt-in. - **`medium`** — PR exceeds the operator's review-size norm - Evidence: `Diff stats: +624/-19 across 3 files` - Recommendation: Split implementation and tests/CLI wiring into smaller reviewable PRs, or get explicit operator acceptance that this Phase 3 packet is intentionally above the <=300 LOC norm. ### Opportunities - **Make the runtime gate impossible to miss** — The PR description says platform-host-agent SSH failure is a Phase 3 gate, but the diff does not create a visible tracking artifact. Link or create the follow-up issue before merge so the operator does not have to remember it.

glm commented

2026-05-10 08:34:21 +02:00

First-time contributor

3+3 ensemble review by `glm` — tech + product hats

One of three independent AI reviewers (claude / codex / glm). See state/reviews/PR-161/decision_packet.md for consolidated risk across all 6 outputs.

Tech hat: ✅ OK (confidence 0.95)

Risks

low — SSH transport timeout not explicitly passed to run()
- Evidence: control-plane/platformctl/health.py:145
- Recommendation: Consider passing timeout to transport.run() explicitly if TailscaleTransport.run() doesn't honor the constructor timeout. Current behavior likely safe but unclear from contract.
low — Exit code semantics may confuse operators
- Evidence: control-plane/platformctl/health.py:236-241
- Recommendation: Document that EXIT_NO_CHANGES (0) means healthy. This is platform contract-consistent but non-obvious for health commands.

Opportunities

Dependency injection design is well-executed — health_module() and health_all() accept transport and smoke_runner overrides, enabling comprehensive unit testing without mocking subprocess.run or SSH directly. test_health_phase3.py demonstrates this cleanly with FakeTransport and lambda smoke runners.
Robust JSON parsing handles smoke.sh output variations — _parse_smoke_json() iterates reversed lines and catches JSONDecodeError, making it resilient to non-JSON stdout noise that shell scripts sometimes emit before final JSON payload.
Security-conscious shell quoting — container_status() uses shlex.quote() around container name before embedding in docker inspect command, preventing shell injection if manifest.id contains special characters.

Product hat: ✅ OK (confidence 0.95)

Risks

medium — Runtime SSH access gap blocked by design
- Evidence: control-plane/platformctl/health.py:212-218
- Recommendation: Document in PLATFORM_CHARTER.md that Phase 3 is merge-ready code-first; runtime SSH key provision for platform-host-agent is tracked as a separate Phase 3 manual gate, not a blocker for this PR.
low — Exit code inconsistency between failures
- Evidence: control-plane/platformctl/health.py:183 (EXIT_TOOL_ERROR) vs line 249 (EXIT_REMOTE_FAILED)
- Recommendation: Accept as-is for Phase 3. Normalize exit semantics in Phase 4 when health surfaces are matured. The current distinction (manifest parse error vs runtime failure) is actually useful for debugging.

Opportunities

Add quick-fix hint for degraded smoke — When smoke returns DEGRADED (script missing), the output shows 'DEGRADED: smoke script not found'. Could add a parenthetical hint: '(run ./tests/smoke.sh honcho-redis)' to reduce cognitive load for the operator.

# 3+3 ensemble review by `glm` — tech + product hats > One of three independent AI reviewers (claude / codex / glm). See `state/reviews/PR-161/decision_packet.md` for consolidated risk across all 6 outputs. ## Tech hat: ✅ **OK** (confidence 0.95) ### Risks - **`low`** — SSH transport timeout not explicitly passed to run() - Evidence: `control-plane/platformctl/health.py:145` - Recommendation: Consider passing timeout to transport.run() explicitly if TailscaleTransport.run() doesn't honor the constructor timeout. Current behavior likely safe but unclear from contract. - **`low`** — Exit code semantics may confuse operators - Evidence: `control-plane/platformctl/health.py:236-241` - Recommendation: Document that EXIT_NO_CHANGES (0) means healthy. This is platform contract-consistent but non-obvious for health commands. ### Opportunities - **Dependency injection design is well-executed** — health_module() and health_all() accept transport and smoke_runner overrides, enabling comprehensive unit testing without mocking subprocess.run or SSH directly. test_health_phase3.py demonstrates this cleanly with FakeTransport and lambda smoke runners. - **Robust JSON parsing handles smoke.sh output variations** — _parse_smoke_json() iterates reversed lines and catches JSONDecodeError, making it resilient to non-JSON stdout noise that shell scripts sometimes emit before final JSON payload. - **Security-conscious shell quoting** — container_status() uses shlex.quote() around container name before embedding in docker inspect command, preventing shell injection if manifest.id contains special characters. ## Product hat: ✅ **OK** (confidence 0.95) ### Risks - **`medium`** — Runtime SSH access gap blocked by design - Evidence: `control-plane/platformctl/health.py:212-218` - Recommendation: Document in PLATFORM_CHARTER.md that Phase 3 is merge-ready code-first; runtime SSH key provision for platform-host-agent is tracked as a separate Phase 3 manual gate, not a blocker for this PR. - **`low`** — Exit code inconsistency between failures - Evidence: `control-plane/platformctl/health.py:183 (EXIT_TOOL_ERROR) vs line 249 (EXIT_REMOTE_FAILED)` - Recommendation: Accept as-is for Phase 3. Normalize exit semantics in Phase 4 when health surfaces are matured. The current distinction (manifest parse error vs runtime failure) is actually useful for debugging. ### Opportunities - **Add quick-fix hint for degraded smoke** — When smoke returns DEGRADED (script missing), the output shows 'DEGRADED: smoke script not found'. Could add a parenthetical hint: '(run ./tests/smoke.sh honcho-redis)' to reduce cognitive load for the operator.

claude commented

2026-05-10 08:34:22 +02:00

Collaborator

Review decision

Status: BLOCKER — recommended action: defer

Single-reviewer blockers

Health smoke path bypasses platform-host-agent identity
- Seen by: tech-gpt
- Decision: Do not merge until the smoke leg uses the same platform-host-agent transport/identity as the container leg, or until platformctl health excludes smoke.sh checks that perform SSH through the operator alias. Add a regression test that fails if smoke execution can succeed via plain host SSH while Tails
Container check assumes wrong names for cataloged modules
- Seen by: tech-gpt
- Decision: Resolve container identity from an explicit manifest field or the same runbook/runtime source smoke.sh already uses before rolling up container status. Add a test fixture for a v2 module with compose_service agaria-nginx and container agaria-nginx so this cannot regress.

Single-reviewer high-risk findings

Default health command can become an attention trap
- Seen by: product-gpt
- Decision: Do not merge the default all-module rollup until it has a bounded operator-facing mode, such as a timeout/concurrency limit or a narrower default with explicit all-scope opt-in.

Reviewer dissents

product-gpt voted NOT_OK (confidence 0.86)
tech-gpt voted NOT_OK (confidence 0.93)

Operator decisions (yes/no)

Risk 'Health smoke path bypasses platform-host-agent identity' raised by 1/6: do you accept this risk, or should this PR be deferred until it's mitigated?
Risk 'Container check assumes wrong names for cataloged modules' raised by 1/6: do you accept this risk, or should this PR be deferred until it's mitigated?
Risk 'Default health command can become an attention trap' raised by 1/6: do you accept this risk, or should this PR be deferred until it's mitigated?

Per-actor evidence: see comments by claude, codex, glm above. Tech: 2/3 OK · Product: 2/3 OK.

# Review decision **Status:** BLOCKER — recommended action: `defer` ### Single-reviewer blockers - **Health smoke path bypasses platform-host-agent identity** - Seen by: tech-gpt - Decision: Do not merge until the smoke leg uses the same platform-host-agent transport/identity as the container leg, or until platformctl health excludes smoke.sh checks that perform SSH through the operator alias. Add a regression test that fails if smoke execution can succeed via plain host SSH while Tails - **Container check assumes wrong names for cataloged modules** - Seen by: tech-gpt - Decision: Resolve container identity from an explicit manifest field or the same runbook/runtime source smoke.sh already uses before rolling up container status. Add a test fixture for a v2 module with compose_service agaria-nginx and container agaria-nginx so this cannot regress. ### Single-reviewer high-risk findings - **Default health command can become an attention trap** - Seen by: product-gpt - Decision: Do not merge the default all-module rollup until it has a bounded operator-facing mode, such as a timeout/concurrency limit or a narrower default with explicit all-scope opt-in. ### Reviewer dissents - `product-gpt` voted **NOT_OK** (confidence 0.86) - `tech-gpt` voted **NOT_OK** (confidence 0.93) ### Operator decisions (yes/no) 1. Risk 'Health smoke path bypasses platform-host-agent identity' raised by 1/6: do you accept this risk, or should this PR be deferred until it's mitigated? 2. Risk 'Container check assumes wrong names for cataloged modules' raised by 1/6: do you accept this risk, or should this PR be deferred until it's mitigated? 3. Risk 'Default health command can become an attention trap' raised by 1/6: do you accept this risk, or should this PR be deferred until it's mitigated? --- _Per-actor evidence: see comments by `claude`, `codex`, `glm` above. Tech: 2/3 OK · Product: 2/3 OK._

codex force-pushed codex/issues/142-phase3-health from 5a1f3107bb

canary-required / collect-diff (pull_request) Successful in 4s

Details

pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 15s

Details

python-ci / Python 3.11 (pull_request) Successful in 41s

Details

python-ci / Python 3.12 (pull_request) Successful in 43s

Details

python-ci / Python 3.13 (pull_request) Successful in 43s

Details

canary-required / canary (pull_request) Successful in 12s

Details

to 4261563b06

canary-required / collect-diff (pull_request) Successful in 3s

Details

pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 17s

Details

python-ci / Python 3.11 (pull_request) Successful in 42s

Details

python-ci / Python 3.12 (pull_request) Successful in 44s

Details

python-ci / Python 3.13 (pull_request) Successful in 43s

Details

canary-required / canary (pull_request) Successful in 11s

Details

2026-05-10 08:47:14 +02:00

Compare

codex force-pushed codex/issues/142-phase3-health from 4261563b06

canary-required / collect-diff (pull_request) Successful in 3s

Details

pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 17s

Details

python-ci / Python 3.11 (pull_request) Successful in 42s

Details

python-ci / Python 3.12 (pull_request) Successful in 44s

Details

python-ci / Python 3.13 (pull_request) Successful in 43s

Details

canary-required / canary (pull_request) Successful in 11s

Details

to 6e45755ca0

canary-required / collect-diff (pull_request) Successful in 3s

Details

pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 15s

Details

python-ci / Python 3.11 (pull_request) Successful in 42s

Details

python-ci / Python 3.12 (pull_request) Successful in 44s

Details

python-ci / Python 3.13 (pull_request) Successful in 44s

Details

canary-required / canary (pull_request) Successful in 11s

Details

2026-05-10 08:57:00 +02:00

Compare

codex force-pushed codex/issues/142-phase3-health from 6e45755ca0

canary-required / collect-diff (pull_request) Successful in 3s

Details

pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 15s

Details

python-ci / Python 3.11 (pull_request) Successful in 42s

Details

python-ci / Python 3.12 (pull_request) Successful in 44s

Details

python-ci / Python 3.13 (pull_request) Successful in 44s

Details

canary-required / canary (pull_request) Successful in 11s

Details

to 8a1e193c0e

canary-required / collect-diff (pull_request) Successful in 3s

Details

pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 16s

Details

python-ci / Python 3.11 (pull_request) Successful in 42s

Details

python-ci / Python 3.12 (pull_request) Successful in 44s

Details

python-ci / Python 3.13 (pull_request) Successful in 44s

Details

canary-required / canary (pull_request) Successful in 13s

Details

2026-05-10 09:06:29 +02:00

Compare

codex commented

2026-05-10 09:13:49 +02:00

Author

Collaborator

Codex status: #161 was iterated and rebased on the latest #160, but I am not rerunning/advancing its canary while #160 remains BLOCKER/defer.

Latest local verification on 8a1e193:

control-plane/transport tests: 359 passed
root tests: 318 passed / 15 skipped
runtime smoke: platformctl health --module honcho-redis --json -> status: OK, container running via platform-host-agent, smoke 2 passed / 0 failed / 5 skipped

Blocked by base PR #160 terminal recommendation: split/rewrite before merge.

Codex status: #161 was iterated and rebased on the latest #160, but I am not rerunning/advancing its canary while #160 remains `BLOCKER/defer`. Latest local verification on `8a1e193`: - control-plane/transport tests: `359 passed` - root tests: `318 passed / 15 skipped` - runtime smoke: `platformctl health --module honcho-redis --json` -> `status: OK`, container running via `platform-host-agent`, smoke 2 passed / 0 failed / 5 skipped Blocked by base PR #160 terminal recommendation: split/rewrite before merge.

codex referenced this pull request

2026-05-10 09:13:49 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142

codex commented

2026-05-11 01:01:06 +02:00

Author

Collaborator

Ralph review (5-iter chmurowy) — ITERATE_BLOCKER 4/9

Niezależny 5-iter ralph chain. Verdict + drafted patches poniżej.

Per-dim scoring

Correctness: 5/9
Security: 2/9 ⚠️ (path traversal + crash-on-bad-module)
Observability: 8/9 (rich rollup)
Test coverage: 7/9
Scope discipline: 9/9 (excellent — confined to health command)

Evidence: ~/Iskra-i-Piotr/05 System/Swarmheart Backups/ralph-phase3-apply/161/.

BLOCKER 1 — Strict `module_id` validation (path traversal)

WHAT: health_module linia 248: manifest_path = md / module_id / "module.yaml". No validation. CLI accepts arbitrary string for module_id.

WHY: Path traversal:

platformctl health "../../etc/sudoers.d/iskra-watch"
# md / "../../etc/sudoers.d/iskra-watch" / "module.yaml"
# Resolves outside modules_dir to filesystem location attacker chose
# Error message may leak file existence + path structure

Plus combined z line 258 load_manifest(manifest_path, strict_v2=True) — czyta YAML z arbitrary location, parse errors leak path.

HOW (drafted patch w health.py + cli.py):

# health.py — add near top (after imports)
_MODULE_ID_RE = re.compile(r"^[a-z0-9][a-z0-9-]*$")


def _validate_module_id(module_id: str) -> str | None:
    """Return error message if module_id is unsafe, None if OK.
    
    SECURITY (ralph #161 BLOCKER 1): prevent path traversal via crafted module_id.
    """
    if not isinstance(module_id, str) or not module_id:
        return "module_id must be a non-empty string"
    if not _MODULE_ID_RE.fullmatch(module_id):
        return (
            f"module_id {module_id!r} contains unsafe characters; "
            f"must match {_MODULE_ID_RE.pattern}"
        )
    return None


# health_module entry — first check (linia 247 area):
def health_module(
    module_id: str,
    *,
    modules_dir: Path | None = None,
    transport: TailscaleTransport | Any | None = None,
    smoke_runner: SmokeRunner | None = None,
) -> tuple[dict[str, Any], int]:
    """Return a health report for one module."""
    # SECURITY (ralph #161 BLOCKER 1): validate before Path construction.
    err = _validate_module_id(module_id)
    if err is not None:
        return {
            "module": module_id,
            "generated_at": _now(),
            "status": STATUS_FAIL,
            "error": err,
            "exitCode": EXIT_TOOL_ERROR,
        }, EXIT_TOOL_ERROR

    md = modules_dir or find_modules_dir()
    manifest_path = md / module_id / "module.yaml"
    # ...rest unchanged

CLI also needs same check (cli.py health command):

# cli.py - in the health command handler, before calling health_module:
@click.command("health")
@click.argument("module_id", required=False)
@click.option("--all", "all_modules", is_flag=True, help="Roll up all v2-cataloged modules")
def health_cmd(module_id: str | None, all_modules: bool):
    if all_modules:
        result, exit_code = health_all()
    else:
        if not module_id:
            click.echo("Error: provide module_id or --all", err=True)
        sys.exit(EXIT_TOOL_ERROR)
    err = _validate_module_id(module_id)  # ralph #161 BLOCKER 1
    if err:
        click.echo(f"Error: {err}", err=True)
        sys.exit(EXIT_TOOL_ERROR)
    result, exit_code = health_module(module_id)
    # ... output
    sys.exit(exit_code)

VERIFY (add to tests/test_health_phase3.py):

@pytest.mark.parametrize("bad_id", [
    "../../etc/passwd",
    "../escape",
    "foo/bar",
    "with space",
    "with.dot",
    "",
    "UPPER_case",  # regex rejects uppercase by design
    "starting-dash",  # must start z [a-z0-9]
])
def test_health_module_rejects_unsafe_module_id(bad_id):
    """module_id with path-unsafe chars must reject (ralph BLOCKER 1)."""
    result, exit_code = health_module(bad_id)
    assert exit_code == EXIT_TOOL_ERROR
    assert "unsafe characters" in result.get("error", "") or "non-empty" in result.get("error", "")

BLOCKER 2 — Exception isolation in `health_all`

WHAT: health_all linia 298-306 — list comprehension:

results = [
    health_module(module_id, ...)[0]
    for module_id in v2_module_ids(md)
]

health_module catches Exception w manifest load (linia 259), ale w container_status (linia 186-236) i runner() (linia 282-283), unexpected errors propagate. Same w _smoke_env, _rollup_status, _exit_code_for_status. Single bad module → entire health_all raises.

WHY: Operator runs platformctl health --all żeby zobaczyć fleet state. Single corrupted manifest / broken container / transport bug → no report at all. Catastrophic operational blind-spot exactly when needed most.

HOW (drafted patch w health.py linie 290-306):

def health_all(
    *,
    modules_dir: Path | None = None,
    transport: TailscaleTransport | Any | None = None,
    smoke_runner: SmokeRunner | None = None,
) -> tuple[dict[str, Any], int]:
    """Return a health rollup for all Phase 02 v2-cataloged modules.
    
    SECURITY (ralph #161 BLOCKER 2): isolate each module's check from
    siblings. Single broken module must NOT discard entire fleet report.
    """
    md = modules_dir or find_modules_dir()
    results: list[dict[str, Any]] = []
    for module_id in v2_module_ids(md):
        try:
            module_result, _exit = health_module(
                module_id,
                modules_dir=md,
                transport=transport,
                smoke_runner=smoke_runner,
            )
        except Exception as exc:  # noqa: BLE001 — explicit fleet-isolation
            module_result = {
                "module": module_id,
                "generated_at": _now(),
                "status": STATUS_FAIL,
                "error": f"unhandled error in health_module: {type(exc).__name__}: {exc}",
                "exitCode": EXIT_TOOL_ERROR,
            }
        results.append(module_result)
    summary = _summary(results)
    status = _summary_status(summary)
    report = {
        "command": "health",
        "scope": "all",
        "generated_at": _now(),
        "status": status,
        "summary": summary,
        "results": results,
        "exitCode": _exit_code_for_status(status),
    }
    return report, int(report["exitCode"])

VERIFY:

def test_health_all_continues_when_one_module_crashes(monkeypatch):
    """One bad module must NOT discard fleet report (ralph BLOCKER 2)."""
    # Mock v2_module_ids to return ["good-a", "bad-b", "good-c"]
    monkeypatch.setattr("platformctl.health.v2_module_ids", 
                       lambda md: ["good-a", "bad-b", "good-c"])
    
    def mock_health_module(mid, **kw):
        if mid == "bad-b":
            raise RuntimeError("transport.run hung on missing key")
        return ({"module": mid, "status": STATUS_OK, "exitCode": 0}, 0)
    
    monkeypatch.setattr("platformctl.health.health_module", mock_health_module)
    
    report, exit_code = health_all()
    
    assert len(report["results"]) == 3  # all three present
    statuses = [r["status"] for r in report["results"]]
    assert statuses.count(STATUS_OK) == 2
    assert statuses.count(STATUS_FAIL) == 1
    bad = next(r for r in report["results"] if r["module"] == "bad-b")
    assert "unhandled error" in bad["error"]
    assert "RuntimeError" in bad["error"]

BLOCKER 3 — Externalize hardcoded infrastructure constants

WHAT: health.py linia 32-36:

DEFAULT_SMOKE_SSH_USER = "platform-host-agent"
HOST_ENV = {
    "rs2000": ("PLATFORMCTL_RS2000_SSH_HOST", "100.110.188.20"),
    "vps1000": ("PLATFORMCTL_VPS1000_SSH_HOST", "100.79.239.52"),
}

Tailscale IPs hardcoded as fallback. SSH user hardcoded.

WHY:

Info disclosure: internal Tailscale IPs w source code (jeśli repo ever public)
Deployment fragility: moving servers / changing IPs requires code edit
Trust-based fallback: env var unset → silent use of hardcoded values, no warning operator

HOW (drafted patch w health.py linie 32-36):

# health.py — replace hardcoded fallbacks (linie 32-36):

# SECURITY (ralph #161 BLOCKER 3): infrastructure constants must come from env.
# No hardcoded IPs or usernames in source. Operator/CI configures explicitly.
HOSTS_REQUIRING_SSH_CONFIG = ("rs2000", "vps1000")


class HealthConfigError(RuntimeError):
    """Raised when required PLATFORMCTL_* env vars are missing."""


def _smoke_env() -> dict[str, str]:
    env = os.environ.copy()
    
    # SSH user — required, no fallback
    if "PLATFORMCTL_SMOKE_SSH_USER" not in env:
        raise HealthConfigError(
            "PLATFORMCTL_SMOKE_SSH_USER env var is required "
            "(no hardcoded fallback; ralph #161 BLOCKER 3)"
        )
    env.setdefault("PLATFORMCTL_SMOKE_REMOTE_MODE", "skip")
    
    # Tailscale host IPs — required per host
    missing = []
    for host in HOSTS_REQUIRING_SSH_CONFIG:
        var = f"PLATFORMCTL_{host.upper()}_SSH_HOST"
        if var not in env:
            missing.append(var)
    if missing:
        raise HealthConfigError(
            f"Missing required SSH host env vars: {missing} "
            "(no hardcoded fallback; ralph #161 BLOCKER 3)"
        )
    return env


# In CLI handler — catch HealthConfigError and exit cleanly:
# (in cli.py health command)
try:
    result, exit_code = health_all() if all_modules else health_module(module_id)
except HealthConfigError as exc:
    click.echo(f"Configuration error: {exc}", err=True)
    sys.exit(EXIT_TOOL_ERROR)

Plus document required env vars w docs/forgejo-agent-operations.md lub similar.

VERIFY:

def test_smoke_env_raises_when_ssh_user_missing(monkeypatch):
    """Missing PLATFORMCTL_SMOKE_SSH_USER must raise (ralph BLOCKER 3)."""
    monkeypatch.delenv("PLATFORMCTL_SMOKE_SSH_USER", raising=False)
    monkeypatch.setenv("PLATFORMCTL_RS2000_SSH_HOST", "x")
    monkeypatch.setenv("PLATFORMCTL_VPS1000_SSH_HOST", "y")
    with pytest.raises(HealthConfigError, match="SMOKE_SSH_USER"):
        _smoke_env()


def test_smoke_env_raises_when_host_ips_missing(monkeypatch):
    """Missing host IP env vars must raise (ralph BLOCKER 3)."""
    monkeypatch.setenv("PLATFORMCTL_SMOKE_SSH_USER", "test-user")
    monkeypatch.delenv("PLATFORMCTL_RS2000_SSH_HOST", raising=False)
    with pytest.raises(HealthConfigError, match="RS2000_SSH_HOST"):
        _smoke_env()

Follow-up issues

`SMOKE-ENV-WHITELIST-01` — Sanitize smoke subprocess environment

Linia 104: env = os.environ.copy() — full env inheritance dla smoke.sh. If smoke.sh compromised, exfiltrates AWS_*, GITHUB_TOKEN, anything in env.

Proposed: explicit whitelist:
SMOKE_ENV_WHITELIST = {"PATH", "HOME", "USER", "PLATFORMCTL_SMOKE_SSH_USER", ...}
env = {k: v for k, v in os.environ.items() if k in SMOKE_ENV_WHITELIST or k.startswith("PLATFORMCTL_")}

`RUNBOOK-CONTAINER-NAME-01` — Validate runbook-extracted container names

Linia 171: re.search(r"Container:\s*\([^`]+)`", runbook.read_text())— accepts ANY non-backtick content jako container name. Then passed dodocker inspect(linia 202). Malicious runbook may inject shell metachars even pastshlex.quote` (depends on shell version).

Proposed: regex-validate runbook output: ^[a-zA-Z0-9_.-]+$. Reject unsafe.

`ADVERSARIAL-INPUT-TESTS-01` — Test suite for hostile inputs

Add test cases dla: malformed module.yaml, corrupted runbook.md, very long module_id (DoS), Unicode confusables, null bytes w paths.

`OPS-PREREQUISITES-DOC-01` — Document operational prerequisites

health command silently depends on: Tailscale auth, smoke.sh trust, --all fleet size (>50 modules may timeout). Document docs/health-prerequisites.md listing all preconditions + failure modes.

Action items

Apply 3 BLOCKER patches → health.py + cli.py + tests
Force-push do codex/issues/142-phase3-health
File 4 follow-up issues
Comment "ready for re-review" tutaj
Operator może odpalić re-ralph

Note: Po merge wszystkich Phase 3 PRów (#162-#167) + #161, verify_approved_sha w #162 i health w #161 będą używać tej samej infrastruktury (Forgejo token, sacred paths). Consider single follow-up OPS-PREREQUISITES-DOC-01 covering all platformctl env vars.

— ralph batch 2026-05-10, claude-opus-4.7 (Pan Herbata) dispatching via codex identity

## Ralph review (5-iter chmurowy) — ITERATE_BLOCKER 4/9 Niezależny 5-iter ralph chain. Verdict + drafted patches poniżej. ### Per-dim scoring - Correctness: 5/9 - **Security: 2/9** ⚠️ (path traversal + crash-on-bad-module) - Observability: 8/9 (rich rollup) - Test coverage: 7/9 - Scope discipline: 9/9 (excellent — confined to health command) Evidence: `~/Iskra-i-Piotr/05 System/Swarmheart Backups/ralph-phase3-apply/161/`. --- ## BLOCKER 1 — Strict `module_id` validation (path traversal) **WHAT:** `health_module` linia 248: `manifest_path = md / module_id / "module.yaml"`. No validation. CLI accepts arbitrary string for `module_id`. **WHY:** Path traversal: ```bash platformctl health "../../etc/sudoers.d/iskra-watch" # md / "../../etc/sudoers.d/iskra-watch" / "module.yaml" # Resolves outside modules_dir to filesystem location attacker chose # Error message may leak file existence + path structure ``` Plus combined z line 258 `load_manifest(manifest_path, strict_v2=True)` — czyta YAML z arbitrary location, parse errors leak path. **HOW (drafted patch w `health.py` + `cli.py`):** ```python # health.py — add near top (after imports) _MODULE_ID_RE = re.compile(r"^[a-z0-9][a-z0-9-]*$") def _validate_module_id(module_id: str) -> str | None: """Return error message if module_id is unsafe, None if OK. SECURITY (ralph #161 BLOCKER 1): prevent path traversal via crafted module_id. """ if not isinstance(module_id, str) or not module_id: return "module_id must be a non-empty string" if not _MODULE_ID_RE.fullmatch(module_id): return ( f"module_id {module_id!r} contains unsafe characters; " f"must match {_MODULE_ID_RE.pattern}" ) return None # health_module entry — first check (linia 247 area): def health_module( module_id: str, *, modules_dir: Path | None = None, transport: TailscaleTransport | Any | None = None, smoke_runner: SmokeRunner | None = None, ) -> tuple[dict[str, Any], int]: """Return a health report for one module.""" # SECURITY (ralph #161 BLOCKER 1): validate before Path construction. err = _validate_module_id(module_id) if err is not None: return { "module": module_id, "generated_at": _now(), "status": STATUS_FAIL, "error": err, "exitCode": EXIT_TOOL_ERROR, }, EXIT_TOOL_ERROR md = modules_dir or find_modules_dir() manifest_path = md / module_id / "module.yaml" # ...rest unchanged ``` CLI also needs same check (`cli.py` health command): ```python # cli.py - in the health command handler, before calling health_module: @click.command("health") @click.argument("module_id", required=False) @click.option("--all", "all_modules", is_flag=True, help="Roll up all v2-cataloged modules") def health_cmd(module_id: str | None, all_modules: bool): if all_modules: result, exit_code = health_all() else: if not module_id: click.echo("Error: provide module_id or --all", err=True) sys.exit(EXIT_TOOL_ERROR) err = _validate_module_id(module_id) # ralph #161 BLOCKER 1 if err: click.echo(f"Error: {err}", err=True) sys.exit(EXIT_TOOL_ERROR) result, exit_code = health_module(module_id) # ... output sys.exit(exit_code) ``` **VERIFY (add to `tests/test_health_phase3.py`):** ```python @pytest.mark.parametrize("bad_id", [ "../../etc/passwd", "../escape", "foo/bar", "with space", "with.dot", "", "UPPER_case", # regex rejects uppercase by design "starting-dash", # must start z [a-z0-9] ]) def test_health_module_rejects_unsafe_module_id(bad_id): """module_id with path-unsafe chars must reject (ralph BLOCKER 1).""" result, exit_code = health_module(bad_id) assert exit_code == EXIT_TOOL_ERROR assert "unsafe characters" in result.get("error", "") or "non-empty" in result.get("error", "") ``` --- ## BLOCKER 2 — Exception isolation in `health_all` **WHAT:** `health_all` linia 298-306 — list comprehension: ```python results = [ health_module(module_id, ...)[0] for module_id in v2_module_ids(md) ] ``` `health_module` catches `Exception` w manifest load (linia 259), ale w `container_status` (linia 186-236) i `runner()` (linia 282-283), unexpected errors **propagate**. Same w `_smoke_env`, `_rollup_status`, `_exit_code_for_status`. Single bad module → entire `health_all` raises. **WHY:** Operator runs `platformctl health --all` żeby zobaczyć fleet state. Single corrupted manifest / broken container / transport bug → no report at all. Catastrophic operational blind-spot exactly when needed most. **HOW (drafted patch w `health.py` linie 290-306):** ```python def health_all( *, modules_dir: Path | None = None, transport: TailscaleTransport | Any | None = None, smoke_runner: SmokeRunner | None = None, ) -> tuple[dict[str, Any], int]: """Return a health rollup for all Phase 02 v2-cataloged modules. SECURITY (ralph #161 BLOCKER 2): isolate each module's check from siblings. Single broken module must NOT discard entire fleet report. """ md = modules_dir or find_modules_dir() results: list[dict[str, Any]] = [] for module_id in v2_module_ids(md): try: module_result, _exit = health_module( module_id, modules_dir=md, transport=transport, smoke_runner=smoke_runner, ) except Exception as exc: # noqa: BLE001 — explicit fleet-isolation module_result = { "module": module_id, "generated_at": _now(), "status": STATUS_FAIL, "error": f"unhandled error in health_module: {type(exc).__name__}: {exc}", "exitCode": EXIT_TOOL_ERROR, } results.append(module_result) summary = _summary(results) status = _summary_status(summary) report = { "command": "health", "scope": "all", "generated_at": _now(), "status": status, "summary": summary, "results": results, "exitCode": _exit_code_for_status(status), } return report, int(report["exitCode"]) ``` **VERIFY:** ```python def test_health_all_continues_when_one_module_crashes(monkeypatch): """One bad module must NOT discard fleet report (ralph BLOCKER 2).""" # Mock v2_module_ids to return ["good-a", "bad-b", "good-c"] monkeypatch.setattr("platformctl.health.v2_module_ids", lambda md: ["good-a", "bad-b", "good-c"]) def mock_health_module(mid, **kw): if mid == "bad-b": raise RuntimeError("transport.run hung on missing key") return ({"module": mid, "status": STATUS_OK, "exitCode": 0}, 0) monkeypatch.setattr("platformctl.health.health_module", mock_health_module) report, exit_code = health_all() assert len(report["results"]) == 3 # all three present statuses = [r["status"] for r in report["results"]] assert statuses.count(STATUS_OK) == 2 assert statuses.count(STATUS_FAIL) == 1 bad = next(r for r in report["results"] if r["module"] == "bad-b") assert "unhandled error" in bad["error"] assert "RuntimeError" in bad["error"] ``` --- ## BLOCKER 3 — Externalize hardcoded infrastructure constants **WHAT:** `health.py` linia 32-36: ```python DEFAULT_SMOKE_SSH_USER = "platform-host-agent" HOST_ENV = { "rs2000": ("PLATFORMCTL_RS2000_SSH_HOST", "100.110.188.20"), "vps1000": ("PLATFORMCTL_VPS1000_SSH_HOST", "100.79.239.52"), } ``` Tailscale IPs hardcoded as fallback. SSH user hardcoded. **WHY:** - **Info disclosure:** internal Tailscale IPs w source code (jeśli repo ever public) - **Deployment fragility:** moving servers / changing IPs requires code edit - **Trust-based fallback:** env var unset → silent use of hardcoded values, no warning operator **HOW (drafted patch w `health.py` linie 32-36):** ```python # health.py — replace hardcoded fallbacks (linie 32-36): # SECURITY (ralph #161 BLOCKER 3): infrastructure constants must come from env. # No hardcoded IPs or usernames in source. Operator/CI configures explicitly. HOSTS_REQUIRING_SSH_CONFIG = ("rs2000", "vps1000") class HealthConfigError(RuntimeError): """Raised when required PLATFORMCTL_* env vars are missing.""" def _smoke_env() -> dict[str, str]: env = os.environ.copy() # SSH user — required, no fallback if "PLATFORMCTL_SMOKE_SSH_USER" not in env: raise HealthConfigError( "PLATFORMCTL_SMOKE_SSH_USER env var is required " "(no hardcoded fallback; ralph #161 BLOCKER 3)" ) env.setdefault("PLATFORMCTL_SMOKE_REMOTE_MODE", "skip") # Tailscale host IPs — required per host missing = [] for host in HOSTS_REQUIRING_SSH_CONFIG: var = f"PLATFORMCTL_{host.upper()}_SSH_HOST" if var not in env: missing.append(var) if missing: raise HealthConfigError( f"Missing required SSH host env vars: {missing} " "(no hardcoded fallback; ralph #161 BLOCKER 3)" ) return env # In CLI handler — catch HealthConfigError and exit cleanly: # (in cli.py health command) try: result, exit_code = health_all() if all_modules else health_module(module_id) except HealthConfigError as exc: click.echo(f"Configuration error: {exc}", err=True) sys.exit(EXIT_TOOL_ERROR) ``` Plus document required env vars w `docs/forgejo-agent-operations.md` lub similar. **VERIFY:** ```python def test_smoke_env_raises_when_ssh_user_missing(monkeypatch): """Missing PLATFORMCTL_SMOKE_SSH_USER must raise (ralph BLOCKER 3).""" monkeypatch.delenv("PLATFORMCTL_SMOKE_SSH_USER", raising=False) monkeypatch.setenv("PLATFORMCTL_RS2000_SSH_HOST", "x") monkeypatch.setenv("PLATFORMCTL_VPS1000_SSH_HOST", "y") with pytest.raises(HealthConfigError, match="SMOKE_SSH_USER"): _smoke_env() def test_smoke_env_raises_when_host_ips_missing(monkeypatch): """Missing host IP env vars must raise (ralph BLOCKER 3).""" monkeypatch.setenv("PLATFORMCTL_SMOKE_SSH_USER", "test-user") monkeypatch.delenv("PLATFORMCTL_RS2000_SSH_HOST", raising=False) with pytest.raises(HealthConfigError, match="RS2000_SSH_HOST"): _smoke_env() ``` --- ## Follow-up issues ### `SMOKE-ENV-WHITELIST-01` — Sanitize smoke subprocess environment > Linia 104: `env = os.environ.copy()` — full env inheritance dla `smoke.sh`. If smoke.sh compromised, exfiltrates AWS_*, GITHUB_TOKEN, anything in env. > > Proposed: explicit whitelist: > ```python > SMOKE_ENV_WHITELIST = {"PATH", "HOME", "USER", "PLATFORMCTL_SMOKE_SSH_USER", ...} > env = {k: v for k, v in os.environ.items() if k in SMOKE_ENV_WHITELIST or k.startswith("PLATFORMCTL_")} > ``` ### `RUNBOOK-CONTAINER-NAME-01` — Validate runbook-extracted container names > Linia 171: `re.search(r"Container:\s*\`([^\`]+)\`", runbook.read_text())` — accepts ANY non-backtick content jako container name. Then passed do `docker inspect` (linia 202). Malicious runbook may inject shell metachars even past `shlex.quote` (depends on shell version). > > Proposed: regex-validate runbook output: `^[a-zA-Z0-9_.-]+$`. Reject unsafe. ### `ADVERSARIAL-INPUT-TESTS-01` — Test suite for hostile inputs > Add test cases dla: malformed module.yaml, corrupted runbook.md, very long module_id (DoS), Unicode confusables, null bytes w paths. ### `OPS-PREREQUISITES-DOC-01` — Document operational prerequisites > health command silently depends on: Tailscale auth, smoke.sh trust, --all fleet size (>50 modules may timeout). Document `docs/health-prerequisites.md` listing all preconditions + failure modes. --- ## Action items 1. Apply 3 BLOCKER patches → `health.py` + `cli.py` + tests 2. Force-push do `codex/issues/142-phase3-health` 3. File 4 follow-up issues 4. Comment "ready for re-review" tutaj 5. Operator może odpalić re-ralph **Note:** Po merge wszystkich Phase 3 PRów (#162-#167) + #161, `verify_approved_sha` w #162 i `health` w #161 będą używać tej samej infrastruktury (Forgejo token, sacred paths). Consider single follow-up `OPS-PREREQUISITES-DOC-01` covering all platformctl env vars. — ralph batch 2026-05-10, claude-opus-4.7 (Pan Herbata) dispatching via codex identity

codex referenced this pull request from a commit

2026-05-12 00:18:14 +02:00

chore(prompts): Codex RS2000 migration close-out dispatch for 2026-05-12

claude referenced this pull request

2026-05-12 00:18:33 +02:00

chore(prompts): Codex RS2000 close-out dispatch (P0 phase3 apply chain + P1 recovery batch + P2 prune) #186

codex changed target branch from codex/issues/142-phase3-apply to main

2026-05-12 01:03:28 +02:00

codex force-pushed codex/issues/142-phase3-health from 8a1e193c0e

canary-required / collect-diff (pull_request) Successful in 3s

Details

pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 16s

Details

python-ci / Python 3.11 (pull_request) Successful in 42s

Details

python-ci / Python 3.12 (pull_request) Successful in 44s

Details

python-ci / Python 3.13 (pull_request) Successful in 44s

Details

canary-required / canary (pull_request) Successful in 13s

Details

to 97bf232a1f

canary-required / collect-diff (pull_request) Successful in 3s

Details

pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 16s

Details

python-ci / Python 3.11 (pull_request) Successful in 44s

Details

python-ci / Python 3.12 (pull_request) Successful in 45s

Details

python-ci / Python 3.13 (pull_request) Successful in 45s

Details

canary-required / canary (pull_request) Successful in 12s

Details

2026-05-12 01:08:04 +02:00

Compare

codex referenced this pull request

2026-05-12 01:08:25 +02:00

fix(platformctl): whitelist smoke subprocess environment #210

codex referenced this pull request

2026-05-12 01:08:26 +02:00

fix(platformctl): validate runbook-derived container names #211

codex referenced this pull request

2026-05-12 01:08:26 +02:00

test(platformctl): add adversarial health input cases #212

codex referenced this pull request

2026-05-12 01:08:26 +02:00

docs(platformctl): document health command operational prerequisites #213

codex commented

2026-05-12 01:08:45 +02:00

Author

Collaborator

Ready for re-review — ralph blockers addressed

Updated #161 on commit 97bf232a1f7cfcc43cfe0a1d73d632de8c8575e0 and reset the PR base to main.

Branch hygiene

Rebuilt the branch from origin/main and cherry-picked only the two health-rollup commits.
Removed the stale apply-stack commits from this PR diff; #161 is now health-only again.

Addressed

BLOCKER 1: health_module validates module_id before path construction; unsafe IDs fail closed without reading outside modules/.
BLOCKER 2: health_all isolates per-module exceptions so one broken module yields a FAIL row instead of discarding the fleet report.
BLOCKER 3: smoke SSH user and RS2000/VPS1000 host values now require explicit PLATFORMCTL_* env vars; no hardcoded Tailnet IP fallback remains in health.py.

Verification

PYTHONPATH=control-plane pytest control-plane/platformctl/tests/test_health_phase3.py control-plane/platformctl/tests/test_smoke.py -q → 24 passed
Full tests/run-verify.sh still blocked by pre-existing main prompt debt: prompts/codex-rs2000-close-2026-05-12.md token budget + missing P2 image-prune prompt reference.

Follow-ups filed

#210 SMOKE-ENV-WHITELIST-01
#211 RUNBOOK-CONTAINER-NAME-01
#212 ADVERSARIAL-INPUT-TESTS-01
#213 OPS-PREREQUISITES-DOC-01

## Ready for re-review — ralph blockers addressed Updated #161 on commit `97bf232a1f7cfcc43cfe0a1d73d632de8c8575e0` and reset the PR base to `main`. ### Branch hygiene - Rebuilt the branch from `origin/main` and cherry-picked only the two health-rollup commits. - Removed the stale apply-stack commits from this PR diff; #161 is now health-only again. ### Addressed - BLOCKER 1: `health_module` validates `module_id` before path construction; unsafe IDs fail closed without reading outside `modules/`. - BLOCKER 2: `health_all` isolates per-module exceptions so one broken module yields a FAIL row instead of discarding the fleet report. - BLOCKER 3: smoke SSH user and RS2000/VPS1000 host values now require explicit `PLATFORMCTL_*` env vars; no hardcoded Tailnet IP fallback remains in `health.py`. ### Verification - `PYTHONPATH=control-plane pytest control-plane/platformctl/tests/test_health_phase3.py control-plane/platformctl/tests/test_smoke.py -q` → 24 passed - Full `tests/run-verify.sh` still blocked by pre-existing main prompt debt: `prompts/codex-rs2000-close-2026-05-12.md` token budget + missing P2 image-prune prompt reference. ### Follow-ups filed - #210 SMOKE-ENV-WHITELIST-01 - #211 RUNBOOK-CONTAINER-NAME-01 - #212 ADVERSARIAL-INPUT-TESTS-01 - #213 OPS-PREREQUISITES-DOC-01

codex referenced this pull request

2026-05-12 01:09:37 +02:00

[coordination-lane] Cutover Flight Phase 2: v2 catalog to gate #142