fix(agent-access): harden ssh-agent startup cleanup #84

Merged
pdurlej merged 5 commits from codex/issues/79-startup-cleanup into main 2026-05-05 23:39:38 +02:00
Collaborator

Canary status: missing — rerunning after signal-mask child fix on 339ff91cefecfa8f2b2451abd1dda9c1e0f826a9 with PR body supplied.

Refresh after #89 / #82 merge

  • Merged latest origin/main into codex/issues/79-startup-cleanup without force-push.
  • Resolved the SSH-agent TTL test conflict by keeping the mainline stable ttl=5s / sleep=6s timing.
  • Local verification: python3 -m pytest -q control-plane/platformctl/tests/test_agent_access_ssh_agent.py -> 8 passed.
  • Local verification: python3 -m pytest -q control-plane/platformctl/tests -> 164 passed.
  • Local verification: pyfallow analyze --root control-plane --format agent-fix-plan --fail-on error --min-confidence medium -> blocking_count=0.
  • Forgejo status on 8624f23: 6/6 checks success (canary-required, pyfallow, Python 3.11/3.12/3.13).

Canary Context Pack

Product story

The Codex -> OpenClaw SSH wrapper should fail closed. If the wrapper starts an ssh-agent and then fails before producing usable session state, it must clean up that agent instead of leaving an access-bearing process behind.

What changed

  • Runtime directory creation now uses a single mkdir path and rejects symlink/non-directory collisions.
  • If ssh-agent starts but its output cannot be parsed, the parsed PID is terminated best-effort.
  • SIGINT/SIGTERM enter the same cleanup path as normal wrapper failures.
  • Agent cleanup logic is centralized so post-start failure paths clear the private-key variable and call ssh-agent -k when socket/PID are known.
  • Test fake OpenSSH now records ssh-agent -k calls.
  • Added regression test for post-start failure after key load / before fingerprint success.
  • Real OpenSSH TTL test uses 3s instead of 1s to avoid test flake where the key expires before the first assertion.

Why it changed

This is the first split from superseded PR #80. It keeps the security-sensitive surface narrow: startup failure cleanup only, no --list, no --prune, no session status taxonomy.

Files touched

  • scripts/agent-access/codex-openclaw-ssh-agent
  • control-plane/platformctl/tests/test_agent_access_ssh_agent.py

Relevant context

  • Parent hardening issue #79.
  • Child issue #83.
  • Superseded combined PR #80.
  • Operator ruling: security-sensitive work should use smaller PRs and full canary; more friction is a feature.

Runtime evidence

  • PYTHONPATH=control-plane pytest -q control-plane/platformctl/tests/test_agent_access_ssh_agent.py -> 8 passed
  • PYTHONPATH=control-plane pytest -q control-plane/platformctl/tests -> 164 passed
  • git diff --check origin/main...HEAD -> pass
  • Secret-pattern diff scan -> no matches

Known constraints

Signal handling still performs cleanup from Python exception flow; this is good enough for this wrapper but not a general async-signal-safe framework.

Explicit out-of-scope

  • --list
  • --prune
  • Session lifecycle inventory/status classification
  • Runtime layout docs for new commands
  • Infisical scope changes
  • OpenClaw runtime changes

Requested decision

Approve if this narrow startup cleanup reduces residual access risk without widening the wrapper behavior.

Merge blockers

  • Any private-key leak to stdout/stderr, child env, argv, repo, or runtime files.
  • Any cleanup path that can leave a known started test agent alive after tested failure.
  • Scope creep into list/prune/session inventory.

Spec sources read

  • Issue #83 body: direct child scope.
  • Issue #79 context from prior canary findings.
  • scripts/agent-access/codex-openclaw-ssh-agent: implementation target.
  • control-plane/platformctl/tests/test_agent_access_ssh_agent.py: regression coverage target.
  • PR #80 canary findings: reason for split/rewrite.

Closes #83
Refs #79

Canary status: missing — rerunning after signal-mask child fix on `339ff91cefecfa8f2b2451abd1dda9c1e0f826a9` with PR body supplied. ## Refresh after #89 / #82 merge - Merged latest `origin/main` into `codex/issues/79-startup-cleanup` without force-push. - Resolved the SSH-agent TTL test conflict by keeping the mainline stable `ttl=5s` / `sleep=6s` timing. - Local verification: `python3 -m pytest -q control-plane/platformctl/tests/test_agent_access_ssh_agent.py` -> `8 passed`. - Local verification: `python3 -m pytest -q control-plane/platformctl/tests` -> `164 passed`. - Local verification: `pyfallow analyze --root control-plane --format agent-fix-plan --fail-on error --min-confidence medium` -> `blocking_count=0`. - Forgejo status on `8624f23`: 6/6 checks success (`canary-required`, `pyfallow`, Python 3.11/3.12/3.13). ## Canary Context Pack ### Product story The Codex -> OpenClaw SSH wrapper should fail closed. If the wrapper starts an `ssh-agent` and then fails before producing usable session state, it must clean up that agent instead of leaving an access-bearing process behind. ### What changed - Runtime directory creation now uses a single mkdir path and rejects symlink/non-directory collisions. - If `ssh-agent` starts but its output cannot be parsed, the parsed PID is terminated best-effort. - SIGINT/SIGTERM enter the same cleanup path as normal wrapper failures. - Agent cleanup logic is centralized so post-start failure paths clear the private-key variable and call `ssh-agent -k` when socket/PID are known. - Test fake OpenSSH now records `ssh-agent -k` calls. - Added regression test for post-start failure after key load / before fingerprint success. - Real OpenSSH TTL test uses 3s instead of 1s to avoid test flake where the key expires before the first assertion. ### Why it changed This is the first split from superseded PR #80. It keeps the security-sensitive surface narrow: startup failure cleanup only, no `--list`, no `--prune`, no session status taxonomy. ### Files touched - `scripts/agent-access/codex-openclaw-ssh-agent` - `control-plane/platformctl/tests/test_agent_access_ssh_agent.py` ### Relevant context - Parent hardening issue #79. - Child issue #83. - Superseded combined PR #80. - Operator ruling: security-sensitive work should use smaller PRs and full canary; more friction is a feature. ### Runtime evidence - `PYTHONPATH=control-plane pytest -q control-plane/platformctl/tests/test_agent_access_ssh_agent.py` -> `8 passed` - `PYTHONPATH=control-plane pytest -q control-plane/platformctl/tests` -> `164 passed` - `git diff --check origin/main...HEAD` -> pass - Secret-pattern diff scan -> no matches ### Known constraints Signal handling still performs cleanup from Python exception flow; this is good enough for this wrapper but not a general async-signal-safe framework. ### Explicit out-of-scope - `--list` - `--prune` - Session lifecycle inventory/status classification - Runtime layout docs for new commands - Infisical scope changes - OpenClaw runtime changes ### Requested decision Approve if this narrow startup cleanup reduces residual access risk without widening the wrapper behavior. ### Merge blockers - Any private-key leak to stdout/stderr, child env, argv, repo, or runtime files. - Any cleanup path that can leave a known started test agent alive after tested failure. - Scope creep into list/prune/session inventory. ## Spec sources read - Issue #83 body: direct child scope. - Issue #79 context from prior canary findings. - `scripts/agent-access/codex-openclaw-ssh-agent`: implementation target. - `control-plane/platformctl/tests/test_agent_access_ssh_agent.py`: regression coverage target. - PR #80 canary findings: reason for split/rewrite. Closes #83 Refs #79
fix(agent-access): harden ssh-agent startup cleanup
Some checks failed
canary-required / collect-diff (pull_request) Successful in 4s
infra-docs-drift / docs-drift (pull_request) Successful in 4s
python-ci / Python 3.11 (pull_request) Successful in 25s
python-ci / Python 3.12 (pull_request) Successful in 26s
python-ci / Python 3.13 (pull_request) Successful in 25s
workflow-lint / lint (pull_request) Successful in 4s
canary-required / canary (pull_request) Failing after 2s
c130b7f83a
codex force-pushed codex/issues/79-startup-cleanup from c130b7f83a
Some checks failed
canary-required / collect-diff (pull_request) Successful in 4s
infra-docs-drift / docs-drift (pull_request) Successful in 4s
python-ci / Python 3.11 (pull_request) Successful in 25s
python-ci / Python 3.12 (pull_request) Successful in 26s
python-ci / Python 3.13 (pull_request) Successful in 25s
workflow-lint / lint (pull_request) Successful in 4s
canary-required / canary (pull_request) Failing after 2s
to 818876c68c
Some checks failed
workflow-lint / lint (pull_request) Waiting to run
canary-required / collect-diff (pull_request) Successful in 4s
canary-required / canary (pull_request) Waiting to run
infra-docs-drift / docs-drift (pull_request) Successful in 4s
python-ci / Python 3.11 (pull_request) Successful in 25s
python-ci / Python 3.12 (pull_request) Has been cancelled
python-ci / Python 3.13 (pull_request) Has been cancelled
2026-05-05 08:55:02 +02:00
Compare
codex force-pushed codex/issues/79-startup-cleanup from 818876c68c
Some checks failed
workflow-lint / lint (pull_request) Waiting to run
canary-required / collect-diff (pull_request) Successful in 4s
canary-required / canary (pull_request) Waiting to run
infra-docs-drift / docs-drift (pull_request) Successful in 4s
python-ci / Python 3.11 (pull_request) Successful in 25s
python-ci / Python 3.12 (pull_request) Has been cancelled
python-ci / Python 3.13 (pull_request) Has been cancelled
to a39c3afc45
Some checks failed
canary-required / collect-diff (pull_request) Successful in 3s
infra-docs-drift / docs-drift (pull_request) Successful in 4s
python-ci / Python 3.11 (pull_request) Successful in 25s
python-ci / Python 3.12 (pull_request) Successful in 25s
python-ci / Python 3.13 (pull_request) Successful in 26s
workflow-lint / lint (pull_request) Successful in 4s
canary-required / canary (pull_request) Failing after 1s
2026-05-05 09:01:15 +02:00
Compare
Author
Collaborator

Canary terminal update for PR #84

  • Iter 1: defer; mitigated signal/PID capture weak signal.
  • Iter 2: defer; mitigated SIGINT/SIGTERM during ssh-agent startup with signal masking.
  • Iter 3: defer; tech reviewers are 3/3 OK. Remaining high dissent is PR description: none, but the PR body exists via Forgejo API.

Terminal action: defer_to_issue.

Tracked follow-up: #85 (fix(canary): include PR body in reviewer context). I am not running a 4th canary because the cap is reached.

Canary terminal update for PR #84 - Iter 1: `defer`; mitigated signal/PID capture weak signal. - Iter 2: `defer`; mitigated SIGINT/SIGTERM during `ssh-agent` startup with signal masking. - Iter 3: `defer`; tech reviewers are 3/3 OK. Remaining high dissent is `PR description: none`, but the PR body exists via Forgejo API. Terminal action: `defer_to_issue`. Tracked follow-up: #85 (`fix(canary): include PR body in reviewer context`). I am not running a 4th canary because the cap is reached.
Merge remote-tracking branch 'origin/main' into codex/issues/79-startup-cleanup
Some checks failed
canary-required / collect-diff (pull_request) Successful in 3s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 18s
python-ci / Python 3.11 (pull_request) Successful in 29s
python-ci / Python 3.12 (pull_request) Successful in 29s
python-ci / Python 3.13 (pull_request) Successful in 28s
canary-required / canary (pull_request) Failing after 2s
2dfdbd661e
Merge remote-tracking branch 'origin/main' into codex/issues/79-startup-cleanup
All checks were successful
canary-required / collect-diff (pull_request) Successful in 3s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 15s
python-ci / Python 3.11 (pull_request) Successful in 29s
python-ci / Python 3.12 (pull_request) Successful in 31s
python-ci / Python 3.13 (pull_request) Successful in 30s
canary-required / canary (pull_request) Successful in 13s
8624f23eef
# Conflicts:
#	control-plane/platformctl/tests/test_agent_access_ssh_agent.py
Merge remote-tracking branch 'origin/codex/issues/79-startup-cleanup' into codex/issues/79-startup-cleanup
All checks were successful
canary-required / collect-diff (pull_request) Successful in 4s
pyfallow / Pyfallow gate (control-plane) (pull_request) Successful in 16s
python-ci / Python 3.11 (pull_request) Successful in 28s
python-ci / Python 3.12 (pull_request) Successful in 29s
python-ci / Python 3.13 (pull_request) Successful in 29s
canary-required / canary (pull_request) Successful in 13s
339ff91cef
Sign in to join this conversation.
No reviewers
No labels
W6d-automerge-calibration
agent/claude-code
agent/codex
agent/hermes
agent/iskra
agent/ollama
agent/patchwarden
automerge-candidate
class/security-sensitive
cutover-gate
dependency/blocked
dependency/blocks-others
dependency/cross-repo
dependency/needs-confirmation
domain:agents
domain:ci
domain:docs
domain:forgejo
domain:infra
domain:memory
domain:runtime
domain:signal
domain:ux
flow/architecture
flow/blocked
flow/deployed
flow/done
flow/implementation
flow/intake
flow/maintained
flow/observed
flow/ready
flow/refining
flow/retired
flow/review
iterating
judge/codex-candidate
judge/hermes-candidate
judge/low-confidence
judge/needs-refinement
judge/operator-needed
judge/p0
judge/p1
judge/p2
judge/p3
judge/park
judge/patchwarden-candidate
judge/stale-priority
kind/adr
kind/bug
kind/chore
kind/feature
kind/infra
kind/ops
kind/refactor
kind/research
large-impact
merge/auto
merge/manual
merge/manual-dependency-conflict
merge/manual-failing-tests
merge/manual-merge-conflict
merge/manual-missing-review
merge/manual-operator-preference
merge/manual-red-zone
merge/manual-security-sensitive
merge/manual-unclear-scope
merge/manual-unknown
meta
mode:operator-only
mode:patchwarden-iskra-approved
mode:safe-auto
needs-operator-decision
needs-triage
not-ready
observed/erroring
observed/needs-followup
observed/pending
observed/retire-candidate
observed/unused
observed/used
operator-emotional
owner-attention
phase/02
phase/03
priority:p0
priority:p1
priority:p2
priority:p3
proposed
ready-for-agent
ready-for-operator
recovery
review:claude-reviewed
review:codex-reviewed
review:dziadek-reviewed
review:needs-human
risk/exposure
risk/process
risk/product
risk/runtime
safety:external-write
safety:no-prod-mutation
safety:prod-impact
safety:secret-touch
size/large
size/medium
size/small
size/tiny
size/unknown
source/adr
source/agent-generated
source/manual
source/operator-chat
source/voice-note
status:blocked
status:codex-ready
status:merged:pending-evidence
status:needs-evidence
status:operator-needed
status:parked
tier/full
tier/lite
tier/stacked
tier:0-platform-substrate
tier:1-iskra-value-layer
tier:2-tools-products-modules
type:bug
type:chore
type:docs
type:feat
type:policy
type:research
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
pdurlej/platform!84
No description provided.