feat(platformctl): add health rollup #161
Reference
pdurlej/platform!161
Canary status: iterating — canary rerun requested after Codex fixes for PR #161 blocker findings
Canary Context Pack

Product story
Phase 3 needs `platformctl health` to become the read-only confidence surface before later phases deploy or cut over RS2000 services. The operator should be able to ask whether a module is healthy without spelunking manually through manifests, smoke scripts, and Docker status.

What changed
- `control-plane/platformctl/health.py`.
- `platformctl health` for all Phase 02 v2-cataloged modules.
- `platformctl health --module <id>` and retained `platformctl health <id>`.
- […] `tests/smoke.sh --json`, and read-only Docker container status through `TailscaleTransport`.

Why it changed
This is Packet 3.4 from the cutover flight. It unblocks Phase 3's operational control-plane surface, but does not perform production mutation.

Files touched
- `control-plane/platformctl/health.py`
- `control-plane/platformctl/cli.py`
- `control-plane/platformctl/tests/test_health_phase3.py`

Relevant context
- `prompts/codex-cutover-flight/dispatch.md`
- `prompts/codex-cutover-flight/phase-3-operational.md` Packet 3.4
- `tests/smoke.sh`
- `control-plane/platformctl/transport/tailscale.py`
- `control-plane/platformctl/manifest.py`
- `control-plane/platformctl/plan.py`

Runtime evidence
Read-only manual probe was run:
`PYTHONPATH=control-plane python3 -m platformctl.cli health --module honcho-redis --json`
Result: command executes and returns structured health JSON, but exits `5` because `TailscaleTransport` cannot currently SSH as `platform-host-agent` to RS2000:
`ssh: connect to host 100.110.188.20 port 22: Connection refused`
Additional read-only check: `ssh rs2000` resolves to operator SSH config as root/public host, while `platform-host-agent@rs2000` returns `Permission denied (publickey)`. I did not bypass this with root. This is a runtime access gate, not a code-test failure.

Known constraints
- […] `platform-host-agent` through `TailscaleTransport`, not the operator root SSH alias.

Explicit out-of-scope
- […] `tests/smoke.sh` semantics.

Requested decision
Review the code path for merge readiness as Packet 3.4. Separately, treat the `platform-host-agent` SSH failure as a Phase 3 runtime gate before declaring Phase 3 complete.

Merge blockers
Tests
- `PYTHONPATH=control-plane python3 -m pytest control-plane/platformctl/tests/test_health_phase3.py -q`: 10 passed
- `python3 -m py_compile control-plane/platformctl/health.py control-plane/platformctl/cli.py`: passed
- `PYTHONPATH=control-plane python3 -m pytest control-plane/platformctl/tests control-plane/platformctl/transport/tests -q`: 349 passed
- `PYTHONPATH=control-plane python3 -m pytest tests -q`: 318 passed / 15 skipped
- `git diff --check`: clean
- `PYTHONPATH=control-plane python3 -m platformctl.cli health --module honcho-redis --json`: command works but exits 5 due to the `platform-host-agent` SSH runtime gate described above

Spec sources read
- `prompts/codex-cutover-flight/dispatch.md`: phase sequencing and stop rules
- `prompts/codex-cutover-flight/phase-3-operational.md`: Packet 3.4 acceptance criteria
- `control-plane/platformctl/cli.py`: existing health skeleton and CLI wiring
- `control-plane/platformctl/manifest.py`: strict-v2 load behavior
- `control-plane/platformctl/plan.py`: container naming and exit constants
- `control-plane/platformctl/transport/tailscale.py`: read-only SSH transport contract
- `tests/smoke.sh`: smoke JSON output contract
- `control-plane/platformctl/tests/test_plan_phase3.py` and `test_apply_phase3.py`: local Phase 3 test patterns

Refs #142
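
For readers who have not opened `health.py`: the rollup described in this PR combines a manifest check, a smoke check, and a container check into one status per module. The sketch below shows one plausible way to do that with a worst-status-wins rule. It is not the PR's code. The `Check` type, the `SEVERITY` ordering, and the treatment of an empty check list are assumptions made for illustration; only the status names OK, DEGRADED, and FAIL come from this thread.

```python
from dataclasses import dataclass

# Assumed severity order for illustration: a higher number is worse.
SEVERITY = {"OK": 0, "DEGRADED": 1, "FAIL": 2}


@dataclass(frozen=True)
class Check:
    name: str    # e.g. "manifest", "smoke", "container"
    status: str  # one of the SEVERITY keys


def rollup_status(checks):
    """Return the worst status across checks.

    An empty list yields DEGRADED rather than OK (an assumption), so a
    module with nothing checked never reads as healthy.
    """
    if not checks:
        return "DEGRADED"
    return max((c.status for c in checks), key=SEVERITY.__getitem__)


checks = [Check("manifest", "OK"), Check("smoke", "OK"), Check("container", "FAIL")]
print(rollup_status(checks))  # FAIL
```

With this rule a single failing check, such as an unreachable container host, is enough to mark the module FAIL even when the manifest and smoke checks pass.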
Packet 3.4 external review checkpoint
Ollama Cloud expensive-review gate: 3/3 approve, no blockers.

Reviewers:
- `deepseek-v4-pro:cloud`: APPROVE; blockers none.
- `kimi-k2.6:cloud`: APPROVE; blockers none.
- `minimax-m2.7:cloud`: APPROVE; blockers none.

Shared reviewer conclusion:
- `platform-host-agent` SSH failure is a runtime access gate before Phase 3 completion, not a code blocker for this PR.
- […] `FakeTransport`, injected smoke runners, and temp files.

Reviewer non-blockers captured:
- `EXIT_NO_CHANGES` for healthy status is semantically odd but harmless because it maps to success/0.
- […] `tests/smoke.sh` ever emits multiple JSON objects.
- `Any` in the transport injection type weakens static type clarity but keeps tests simple.
- […] `platform-host-agent` SSH access to RS2000.

Official platform canary remains missing and should still be fired before merge per repo policy.
Runtime evidence refresh after B-safe `platform-host-agent` bootstrap on RS2000:
- `platform-host-agent` SSH now works to `100.110.188.20`
- `platformctl plan honcho-redis --json` -> `status: in-sync`, `exitCode: 0`
- `platformctl health --module honcho-redis --json` -> `status: OK`, manifest OK, container running, smoke 4 passed / 0 failed / 3 skipped, `exitCode: 0`
- […] `exitCode: 6`)

Residual outside this PR: RS2000 Tailscale tags are still empty, so CI/Codex-tagged apply needs the separate tag/ACL gate before production apply is ready. Official canary is still missing.
3+3 ensemble review by `claude`: tech + product hats

Tech hat: ✅ OK (confidence 0.72)
Risks
- medium: SSHError path in container_status has zero test coverage
  control-plane/platformctl/health.py:177-180 catches SSHError and returns FAIL with the raw exception string; the runtime evidence in the PR description shows this is the actual production failure path right now (`Connection refused`, `Permission denied`). All container tests use a FakeTransport that […]
- low: JSON output shape differs between health_module and health_all
  health.py:health_module returns {module, generated_at, checks, host, lifecycle, criticality, status, exitCode}; health_all returns a wrapper {command:'health', scope, generated_at, status, summary, results, exitCode}. A consumer that switches on `command` or `scope` will break on per-module output.
Opportunities
- […] `health all` instead of silently dropping them. _is_v2_cataloged (health.py:55-62) catches every Exception from yaml.safe_load and returns False. A module with a syntactically broken module.yaml is silently excluded from `health all`, so the rollup reports OK while a manifest is corrupt. Consider returning the module with a DEGRADED/FAIL manifest entry so the operator sees the breakage on the same surface.
- […] `health all` reports OK with zero modules. A 'no v2 modules found' notice or DEGRADED status would prevent a green-but-empty rollup from masking visibility loss.

Product hat: ✅ OK (confidence 0.80)
Risks
- medium: Health command will exit 5 / report FAIL on every module until platform-host-agent SSH gate is fixed
  PR description: `platformctl health --module honcho-redis --json` exits 5 because TailscaleTransport cannot SSH as platform-host-agent. health.py:165-180 returns STATUS_FAIL on SSHError, which rolls up to FAIL via _rollup_status.
  […] `platformctl health` expecting useful output until the SSH gate is closed, and don't interpret the resulting FAIL wall as 'platform is broken.' Consider a follow-up to add a `--skip-container` flag so the command becomes useful for manifest+smoke validation in the meantime.
Opportunities
- […] `platformctl health` rollup may quietly omit modules the operator expects to see. A one-line note in the human output ('N modules excluded: not v2-cataloged') would prevent 'where did module X go?' confusion without changing scope.

3+3 ensemble review by `codex`: tech + product hats

Tech hat: ❌ NOT_OK (confidence 0.93)
Risks
- blocker: Health smoke path bypasses platform-host-agent identity
  control-plane/platformctl/health.py:104-106 invokes tests/smoke.sh, and tests/smoke.sh:105-108 runs plain ssh "$host". That uses the operator SSH config/user, while TailscaleTransport correctly forces platform-host-agent. The PR's own evidence shows smoke passed while the platform-host-agent contain[…]
- blocker: Container check assumes wrong names for cataloged modules
  control-plane/platformctl/health.py:165 calls plan.container_name(); control-plane/platformctl/plan.py:72-75 defaults to home-platform-<compose_service>-1. But v2-cataloged modules include non-default containers, e.g. modules/agaria-nginx/module.yaml:30-32 has compose_service agaria-nginx while modu[…]

Product hat: ❌ NOT_OK (confidence 0.86)
Risks
- high: Default health command can become an attention trap
  control-plane/platformctl/health.py:205
- medium: PR exceeds the operator's review-size norm
  Diff stats: +624/-19 across 3 files
Opportunities

3+3 ensemble review by `glm`: tech + product hats

Tech hat: ✅ OK (confidence 0.95)
Risks
- low: SSH transport timeout not explicitly passed to run()
  control-plane/platformctl/health.py:145
- low: Exit code semantics may confuse operators
  control-plane/platformctl/health.py:236-241
Opportunities

Product hat: ✅ OK (confidence 0.95)
Risks
- medium: Runtime SSH access gap blocked by design
  control-plane/platformctl/health.py:212-218
- low: Exit code inconsistency between failures
  control-plane/platformctl/health.py:183 (EXIT_TOOL_ERROR) vs line 249 (EXIT_REMOTE_FAILED)
Opportunities
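
The `claude` tech-hat finding above says the SSHError path in the container check has no test coverage. A test for that path might look roughly like the sketch below. Everything in it is a stand-in: the local `SSHError` class, `RaisingTransport`, the simplified `container_status`, its `docker inspect` command, and the container name are assumptions for illustration, not the code under review. Only the error string is taken from the PR's runtime evidence.

```python
class SSHError(Exception):
    """Stand-in for the transport's SSH error type (hypothetical here)."""


class RaisingTransport:
    """Fake transport whose run() always fails, like a refused connection."""

    def run(self, host, command):
        raise SSHError(
            "ssh: connect to host 100.110.188.20 port 22: Connection refused"
        )


def container_status(transport, host, container):
    """Simplified stand-in for the function under test."""
    command = ["docker", "inspect", "--format", "{{.State.Status}}", container]
    try:
        output = transport.run(host, command)
    except SSHError as exc:
        # The path the review flags: the SSH failure becomes a FAIL check.
        return {"check": "container", "status": "FAIL", "detail": str(exc)}
    state = output.strip()
    status = "OK" if state == "running" else "FAIL"
    return {"check": "container", "status": status, "detail": state}


def test_container_status_reports_fail_on_ssh_error():
    result = container_status(
        RaisingTransport(), "rs2000", "home-platform-honcho-redis-1"
    )
    assert result["status"] == "FAIL"
    assert "Connection refused" in result["detail"]
```

A fake that raises, rather than one that only returns canned output, is what exercises the `except` branch; that is the gap the finding describes.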
Review decision
Status: BLOCKER; recommended action: `defer`

Single-reviewer blockers
Single-reviewer high-risk findings
Reviewer dissents
- `product-gpt` voted NOT_OK (confidence 0.86)
- `tech-gpt` voted NOT_OK (confidence 0.93)

Operator decisions (yes/no)

Per-actor evidence: see comments by `claude`, `codex`, `glm` above. Tech: 2/3 OK · Product: 2/3 OK.

5a1f3107bb 4261563b06 4261563b06 6e45755ca0 6e45755ca0 8a1e193c0e

Codex status: #161 was iterated and rebased on the latest #160, but I am not rerunning/advancing its canary while #160 remains `BLOCKER/defer`.

Latest local verification on `8a1e193`:
- 359 passed
- 318 passed / 15 skipped
- `platformctl health --module honcho-redis --json` -> `status: OK`, container running via `platform-host-agent`, smoke 2 passed / 0 failed / 5 skipped

Blocked by base PR #160 terminal recommendation: split/rewrite before merge.
Ralph review (5-iteration, cloud): ITERATE_BLOCKER 4/9
Independent 5-iteration ralph chain. Verdict and drafted patches below.

Per-dimension scoring

Evidence: `~/Iskra-i-Piotr/05 System/Swarmheart Backups/ralph-phase3-apply/161/`.

BLOCKER 1: Strict `module_id` validation (path traversal)
WHAT: `health_module`, line 248: `manifest_path = md / module_id / "module.yaml"`. No validation. The CLI accepts an arbitrary string for `module_id`.
WHY: Path traversal:
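
As an illustration only (this is not the drafted patch), the sketch below shows how an unvalidated join of that shape escapes the modules directory, and one possible fail-closed check. The slug pattern, both helper names, and the `/srv/platform/modules` directory are assumptions. `PurePosixPath` is used so the behaviour shown is the POSIX one regardless of where the snippet runs.

```python
import posixpath
import re
from pathlib import PurePosixPath

# Hypothetical allow-list: lowercase letters and digits, with inner hyphens.
MODULE_ID_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def unsafe_manifest_path(modules_dir, module_id):
    """Same shape as the unvalidated join quoted above."""
    return modules_dir / module_id / "module.yaml"


def safe_manifest_path(modules_dir, module_id):
    """Fail closed: reject any ID that is not a plain slug."""
    if not MODULE_ID_RE.fullmatch(module_id):
        raise ValueError(f"invalid module id: {module_id!r}")
    return modules_dir / module_id / "module.yaml"


modules = PurePosixPath("/srv/platform/modules")  # hypothetical location

# Dot-dot segments climb out of modules/ once the path is normalized.
print(posixpath.normpath(str(unsafe_manifest_path(modules, "../../etc"))))
# /srv/etc/module.yaml

# An absolute ID replaces the base directory entirely.
print(unsafe_manifest_path(modules, "/etc"))
# /etc/module.yaml

# A plain slug passes the check unchanged.
print(safe_manifest_path(modules, "honcho-redis"))
# /srv/platform/modules/honcho-redis/module.yaml
```

Validating the ID before any path is built means the hostile cases never reach the filesystem or the manifest loader.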
Plus, combined with line 258, `load_manifest(manifest_path, strict_v2=True)` reads YAML from an arbitrary location, and parse errors leak the path.
HOW (drafted patch in `health.py` + `cli.py`):
CLI also needs the same check (`cli.py` health command):
VERIFY (add to `tests/test_health_phase3.py`):

BLOCKER 2: Exception isolation in `health_all`
WHAT: `health_all`, lines 298-306, list comprehension:
`health_module` catches `Exception` in manifest load (line 259), but in `container_status` (lines 186-236) and `runner()` (lines 282-283) unexpected errors propagate. Same in `_smoke_env`, `_rollup_status`, `_exit_code_for_status`. Single bad module → entire `health_all` raises.
WHY: Operator runs `platformctl health --all` to see fleet state. Single corrupted manifest / broken container / transport bug → no report at all. Catastrophic operational blind spot exactly when needed most.
HOW (drafted patch in `health.py`, lines 290-306):
VERIFY:
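
The isolation asked for in BLOCKER 2 can be sketched as a loop that converts any per-module exception into a FAIL row. This is not the drafted patch. The `health_module` stub, the `error` field, and the trimmed result shape are assumptions; the `command`, `status`, and `results` keys follow the wrapper shape quoted in the `claude` review.

```python
def health_module(module_id):
    """Stand-in for the real per-module check; one ID fails on purpose."""
    if module_id == "broken":
        raise RuntimeError("transport exploded")
    return {"module": module_id, "status": "OK"}


def health_all(module_ids, check=health_module):
    """Run every module check; a crash in one becomes a FAIL row."""
    results = []
    for module_id in module_ids:
        try:
            results.append(check(module_id))
        except Exception as exc:
            # Deliberately broad: one bad module must not sink the report.
            results.append({
                "module": module_id,
                "status": "FAIL",
                "error": f"{type(exc).__name__}: {exc}",
            })
    overall = "FAIL" if any(r["status"] == "FAIL" for r in results) else "OK"
    return {"command": "health", "status": overall, "results": results}


report = health_all(["honcho-redis", "broken", "agaria-nginx"])
print([r["status"] for r in report["results"]])  # ['OK', 'FAIL', 'OK']
```

An explicit loop replaces the list comprehension because a comprehension has no place to catch an exception per element.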
BLOCKER 3: Externalize hardcoded infrastructure constants
WHAT: `health.py`, lines 32-36:
Tailscale IPs hardcoded as fallback. SSH user hardcoded.
WHY:
HOW (drafted patch in `health.py`, lines 32-36):
Plus document required env vars in `docs/forgejo-agent-operations.md` or similar.
VERIFY:
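
One way to externalize such constants without leaving a silent fallback is to require each value from the environment and fail loudly when it is missing. In the sketch below the variable names `PLATFORMCTL_RS2000_HOST` and `PLATFORMCTL_SSH_USER` are hypothetical (the thread only mentions a `PLATFORMCTL_*` prefix), and `ConfigError`, `require_env`, and `load_target` are illustrative names, not the drafted patch.

```python
import os


class ConfigError(RuntimeError):
    """Raised when a required setting is absent (hypothetical type)."""


def require_env(name, environ=os.environ):
    """Return a non-empty environment value or raise; never fall back."""
    value = environ.get(name, "").strip()
    if not value:
        raise ConfigError(f"{name} is required and has no built-in fallback")
    return value


def load_target(environ=os.environ):
    """Collect the SSH target from the environment (hypothetical names)."""
    return {
        "host": require_env("PLATFORMCTL_RS2000_HOST", environ),
        "user": require_env("PLATFORMCTL_SSH_USER", environ),
    }


example_env = {
    "PLATFORMCTL_RS2000_HOST": "100.110.188.20",
    "PLATFORMCTL_SSH_USER": "platform-host-agent",
}
print(load_target(example_env)["user"])  # platform-host-agent
```

Passing `environ` as a parameter keeps the lookup testable with a plain dict instead of patching the process environment.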
Follow-up issues
- `SMOKE-ENV-WHITELIST-01`: Sanitize smoke subprocess environment
- `RUNBOOK-CONTAINER-NAME-01`: Validate runbook-extracted container names
- `ADVERSARIAL-INPUT-TESTS-01`: Test suite for hostile inputs
- `OPS-PREREQUISITES-DOC-01`: Document operational prerequisites

Action items
- […] `health.py` + `cli.py` + tests
- […] `codex/issues/142-phase3-health`

Note: After all Phase 3 PRs (#162-#167) plus #161 are merged, `verify_approved_sha` in #162 and `health` in #161 will use the same infrastructure (Forgejo token, sacred paths). Consider a single follow-up, `OPS-PREREQUISITES-DOC-01`, covering all platformctl env vars.

ralph batch 2026-05-10, claude-opus-4.7 (Pan Herbata) dispatching via codex identity
8a1e193c0e 97bf232a1f

Ready for re-review: ralph blockers addressed
Updated #161 on commit `97bf232a1f7cfcc43cfe0a1d73d632de8c8575e0` and reset the PR base to `main`.

Branch hygiene
- […] `origin/main` and cherry-picked only the two health-rollup commits.

Addressed
- `health_module` validates `module_id` before path construction; unsafe IDs fail closed without reading outside `modules/`.
- `health_all` isolates per-module exceptions so one broken module yields a FAIL row instead of discarding the fleet report.
- […] `PLATFORMCTL_*` env vars; no hardcoded Tailnet IP fallback remains in `health.py`.

Verification
- `PYTHONPATH=control-plane pytest control-plane/platformctl/tests/test_health_phase3.py control-plane/platformctl/tests/test_smoke.py -q` → 24 passed
- `tests/run-verify.sh` still blocked by pre-existing main prompt debt: `prompts/codex-rs2000-close-2026-05-12.md` token budget + missing P2 image-prune prompt reference.

Follow-ups filed