fix(agent-access): harden ssh-agent session lifecycle #80

Closed
codex wants to merge 1 commit from codex/issues/79-agent-access-lifecycle into main
Collaborator

Canary status: approve_with_evidence_gap — 3-iteration cap reached; final SHA 852e04aa55740789033b0a1d52f955fc204bb86b mitigates the last code finding locally, with no 4th canary rerun

Canary Context Pack

Product story

The first Agent Access Plane SSH slice is merged, but the local session lifecycle still needs boring operator-safe cleanup. Codex should be able to inspect and clean non-secret ssh-agent session state without needing the OpenClaw private key again, and without accidentally deleting handles to a live or unknown session.

What changed

  • Added --list for non-secret session inventory with lifecycle status: active, stopped, expired, unknown.
  • Added --prune for stopped and expired session directories only.
  • --prune skips active and unknown/malformed sessions by default.
  • Before pruning stopped/expired state, the wrapper attempts to stop the recorded ssh-agent; if shutdown cannot be confirmed, the session directory is kept.
  • Added signal/error cleanup paths so a started ssh-agent is killed when the wrapper is interrupted or fails after start.
  • Tightened runtime/session directory checks, session id validation, timestamp parsing, and malformed metadata handling.
  • Metadata is trusted only when metadata.session_id exactly matches the session directory name; mismatches become unknown and are not pruned.
  • Updated docs and runtime layout for agent.sock, stopped_at, list/prune usage, and socket access-bearing semantics.
  • Expanded wrapper tests for list/prune, post-start cleanup, malformed sessions, non-object metadata, mismatched metadata/session ids, and TTL flake resistance.
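
The four lifecycle statuses above can be read as a small decision function. The following is a minimal sketch, not the wrapper's actual code: `SessionRecord`, `classify_session`, the `expires_at` field name, and the precedence of the checks are all assumptions made for illustration. Only the status names, `session_id`, and `stopped_at` come from this PR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class SessionRecord:
    """Already-validated metadata for one session (hypothetical shape)."""
    session_id: str
    expires_at: datetime            # timezone-aware; field name is an assumption
    stopped_at: Optional[datetime]  # set once the wrapper has stopped the agent


def classify_session(record: Optional[SessionRecord],
                     agent_alive: bool,
                     now: datetime) -> str:
    """Return one of: active, stopped, expired, unknown."""
    if record is None:
        # Missing, malformed, or untrusted metadata is never guessed at.
        return "unknown"
    if record.stopped_at is not None:
        return "stopped"
    if now >= record.expires_at:
        return "expired"
    if agent_alive:
        return "active"
    # Assumption: not stopped, not expired, but no live agent -> stay conservative.
    return "unknown"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    record = SessionRecord("example", now + timedelta(hours=1), None)
    print(classify_session(record, agent_alive=True, now=now))
```

Falling back to `unknown` whenever the inputs disagree keeps the function in line with the safe default described above: an ambiguous session is reported, not cleaned up.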

Why it changed

PR #77 intentionally shipped the narrow SSH capability first and left lifecycle hardening as follow-up #79. This PR closes that gap before more Agent Access Plane capabilities reuse the same runtime shape.

Files touched

  • scripts/agent-access/codex-openclaw-ssh-agent
  • control-plane/platformctl/tests/test_agent_access_ssh_agent.py
  • docs/agent-access/codex-openclaw-ssh.md
  • state/runtime-layout.md

Relevant context

  • Issue #79: session lifecycle hardening after SSH-agent TTL slice.
  • Issue #73 / PR #77: original Codex -> OpenClaw ssh-agent TTL wrapper.
  • PR #78 / ADR 0004: Agent Access Plane boundaries.
  • state/runtime-layout.md: canonical non-secret runtime state structure.

Runtime evidence

  • PYTHONPATH=control-plane pytest -q control-plane/platformctl/tests/test_agent_access_ssh_agent.py -> 10 passed
  • PYTHONPATH=control-plane pytest -q control-plane/platformctl/tests -> 157 passed
  • git diff --check -> pass
  • Secret-pattern diff scan -> no matches
  • Runtime smoke before safe-default change: --list listed two stopped prior sessions, --prune pruned both, next --list returned NO_SESSIONS
  • Runtime smoke after safe-default/final hardening: --list -> NO_SESSIONS; --prune -> PRUNE_DONE count=0

Canary iteration notes

  • Iter 1 result: defer; mitigated unknown auto-prune, prune-before-agent-stop, and naive timestamp handling.
  • Iter 2 result: defer; mitigated non-object metadata.json, prune help text mismatch, and keeping state when recorded agent shutdown cannot be confirmed.
  • Iter 3 result: defer; mitigated the remaining real code finding locally by requiring metadata.session_id == session_dir.name before trusting metadata.
  • Terminal status: approve_with_evidence_gap, not approve_merge, because canary cap prevents a 4th rerun after the final local patch.
  • Residual evidence gaps for operator: (1) no canary rerun on the final SHA after the last small code fix; (2) product-gpt repeatedly reported an empty PR description even though this body is present via the Forgejo API; (3) the PR exceeds the 300 LOC norm, mostly because regression tests cover the lifecycle safety surface.
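
The three iterations converge on one rule: metadata is either fully trusted or ignored. A rough sketch of such a loader follows, assuming the metadata lives in `metadata.json` inside the session directory and carries an ISO 8601 `expires_at` string. The function names and the `expires_at` field are hypothetical; the non-object check, the `session_id == session_dir.name` check, and the rejection of naive timestamps mirror the findings listed above.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def parse_aware_timestamp(value: object) -> Optional[datetime]:
    """Parse an ISO 8601 string; reject non-strings and naive timestamps."""
    if not isinstance(value, str):
        return None
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return None
    if parsed.tzinfo is None or parsed.utcoffset() is None:
        return None  # naive timestamp: cannot be compared safely
    return parsed


def load_trusted_metadata(session_dir: Path) -> Optional[dict]:
    """Return metadata only if it passes every trust check, else None."""
    try:
        raw = json.loads((session_dir / "metadata.json").read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None  # unreadable or not valid JSON
    if not isinstance(raw, dict):
        return None  # e.g. a JSON list or string
    if raw.get("session_id") != session_dir.name:
        return None  # metadata copied or moved from another session
    if parse_aware_timestamp(raw.get("expires_at")) is None:
        return None
    return raw
```

A `None` result would map to the `unknown` status, so the session is listed but never pruned automatically.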

Known constraints

  • agent.sock remains a bearer capability while the ssh-agent is alive: access to the socket is access to the loaded key. This PR documents that and leaves active/unknown sessions untouched by prune.
  • Unknown/malformed sessions require manual inspection rather than automatic deletion.
  • This PR does not add generic secret command execution.
  • This PR does not change Infisical scope, Forgejo identity, CI, or OpenClaw runtime deployment.
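
The prune constraints above reduce to two gates: the status must be prunable, and agent shutdown must be confirmed before anything is deleted. The sketch below illustrates that ordering only. `prune_sessions`, `stop_agent`, and `remove_dir` are hypothetical names, and the callbacks stand in for whatever the wrapper actually does to stop an agent and remove a directory.

```python
from typing import Callable, Dict, List

PRUNABLE = frozenset({"stopped", "expired"})


def prune_sessions(statuses: Dict[str, str],
                   stop_agent: Callable[[str], bool],
                   remove_dir: Callable[[str], None]) -> List[str]:
    """Prune stopped/expired sessions; return the ids that were removed.

    statuses maps session id -> lifecycle status.
    stop_agent must return True only when shutdown is confirmed.
    """
    pruned: List[str] = []
    for session_id, status in sorted(statuses.items()):
        if status not in PRUNABLE:
            continue  # active and unknown sessions are never touched
        if not stop_agent(session_id):
            continue  # shutdown unconfirmed: keep the session directory
        remove_dir(session_id)
        pruned.append(session_id)
    return pruned
```

Calling `stop_agent` before `remove_dir` matters because the session directory holds the only handle to the agent; deleting it first would strand a possibly live agent.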

Explicit out-of-scope

  • Max TTL policy changes beyond the existing 4h cap.
  • Generic Agent Access Plane catalog/resolver commands.
  • Forgejo/MCP identity split from #56.
  • Canary CI Infisical integration from #72.

Requested decision

Operator may merge if accepting the evidence gap above. Otherwise split/rewrite is the clean terminal alternative under the hard cap.

Merge blockers

  • Any path that can leak the private key into logs, argv, repo files, dotenv files, or child env.
  • --prune deleting active or unknown/malformed sessions.
  • --prune removing stopped/expired session handles when recorded agent shutdown cannot be confirmed.
  • Failure cleanup leaving a started ssh-agent behind in tested error paths.
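
The last blocker concerns failures after the agent has started. One way to get that guarantee is sketched below under stated assumptions: `started_agent` is a hypothetical helper, and the child here is any foreground process standing in for the agent. How the real wrapper tracks and stops `ssh-agent` is not shown in this PR description, so this is only an illustration of the cleanup shape.

```python
import signal
import subprocess
import sys
import threading
from contextlib import contextmanager


def _raise_exit(signum, frame):
    # Turn the signal into an exception so cleanup code runs.
    raise SystemExit(128 + signum)


@contextmanager
def started_agent(argv):
    """Start a child process; kill it if the body fails or is interrupted.

    On success the child is left running, as a TTL-bound agent would be.
    """
    proc = subprocess.Popen(argv)
    previous = {}
    # Signal handlers can only be installed from the main thread.
    if threading.current_thread() is threading.main_thread():
        for sig in (signal.SIGINT, signal.SIGTERM):
            previous[sig] = signal.signal(sig, _raise_exit)
    try:
        yield proc
    except BaseException:
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
        raise
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler)


if __name__ == "__main__":
    stand_in = [sys.executable, "-c", "import time; time.sleep(60)"]
    try:
        with started_agent(stand_in) as child:
            raise RuntimeError("simulated failure after start")
    except RuntimeError:
        print("child exited:", child.poll() is not None)
```

Catching `BaseException` is deliberate in this sketch: `KeyboardInterrupt` and the `SystemExit` raised by the signal handler should trigger the same cleanup as an ordinary error.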

Spec sources read

  • Issue #79 context from orchestrator thread: direct scope for lifecycle hardening.
  • scripts/agent-access/codex-openclaw-ssh-agent: implementation target from PR #77.
  • control-plane/platformctl/tests/test_agent_access_ssh_agent.py: existing acceptance tests and new regression coverage.
  • docs/agent-access/codex-openclaw-ssh.md: operator runbook for the wrapper.
  • state/runtime-layout.md: runtime state layout contract.

Closes #79

fix(agent-access): harden ssh-agent session lifecycle
Some checks failed
canary-required / collect-diff (pull_request) Successful in 4s
python-ci / Python 3.11 (pull_request) Successful in 26s
python-ci / Python 3.12 (pull_request) Successful in 27s
python-ci / Python 3.13 (pull_request) Successful in 25s
canary-required / canary (pull_request) Failing after 1s
1b3bbf004e
codex force-pushed codex/issues/79-agent-access-lifecycle from 1b3bbf004e
to de0188be35
Some checks failed
canary-required / collect-diff (pull_request) Successful in 3s
python-ci / Python 3.11 (pull_request) Successful in 26s
python-ci / Python 3.12 (pull_request) Successful in 27s
python-ci / Python 3.13 (pull_request) Successful in 26s
canary-required / canary (pull_request) Failing after 2s
2026-05-05 07:44:10 +02:00
codex force-pushed codex/issues/79-agent-access-lifecycle from de0188be35
to 94a6cddec6
Some checks failed
canary-required / collect-diff (pull_request) Successful in 4s
python-ci / Python 3.11 (pull_request) Successful in 25s
python-ci / Python 3.12 (pull_request) Successful in 26s
python-ci / Python 3.13 (pull_request) Successful in 27s
canary-required / canary (pull_request) Failing after 2s
2026-05-05 07:53:07 +02:00
codex force-pushed codex/issues/79-agent-access-lifecycle from 94a6cddec6
to 852e04aa55
Some checks failed
canary-required / collect-diff (pull_request) Successful in 4s
python-ci / Python 3.11 (pull_request) Successful in 27s
python-ci / Python 3.12 (pull_request) Successful in 28s
python-ci / Python 3.13 (pull_request) Successful in 28s
canary-required / canary (pull_request) Failing after 1s
2026-05-05 08:00:23 +02:00
Author
Collaborator

Canary terminal update for PR #80

  • Iter 1: defer; mitigated unknown auto-prune and prune-before-agent-stop risks.
  • Iter 2: defer; mitigated non-object metadata, prune help text mismatch, and unconfirmed agent-shutdown cleanup.
  • Iter 3: defer; final real code finding was metadata/session-id mismatch. Final SHA 852e04aa55740789033b0a1d52f955fc204bb86b now treats mismatched metadata as unknown and skips prune; local tests pass.

Terminal action after hard cap: approve_with_evidence_gap.

Residual gap: no 4th canary rerun on final SHA; product-gpt kept reporting an empty PR description even though the PR body is present via Forgejo API; size concern remains visible. Operator can merge accepting this, or choose split/rewrite.

Author
Collaborator

Superseded by split/rewrite path.

Operator decision: this access-plane hardening is security-sensitive, so more review friction is intentional. I am not asking to merge this combined PR. I will split it into smaller PRs with narrower canary surfaces.

codex closed this pull request 2026-05-05 08:39:08 +02:00
Some checks are pending
canary-required / collect-diff (pull_request) Successful in 4s
python-ci / Python 3.11 (pull_request) Successful in 27s
python-ci / Python 3.12 (pull_request) Successful in 28s
python-ci / Python 3.13 (pull_request) Successful in 28s
canary-required / canary (pull_request) Failing after 1s
base-is-main / guard (pull_request)
Required
patchwarden-pr-sanity / sanity (pull_request)
Required

Pull request closed

No reviewers
No labels
No milestone
No project
No assignees
1 participant

Dependencies

No dependencies set.

Reference
pdurlej/platform!80