pdurlej/platform

fix(agent-access): harden ssh-agent session lifecycle #80

Closed

codex wants to merge 1 commit from codex/issues/79-agent-access-lifecycle into main

codex commented

2026-05-05 07:37:42 +02:00

Collaborator

Canary status: approve_with_evidence_gap — 3-iteration cap reached; final SHA 852e04aa55740789033b0a1d52f955fc204bb86b mitigates the last code finding locally, with no 4th canary rerun

Canary Context Pack

Product story

The first Agent Access Plane SSH slice is merged, but the local session lifecycle still needs boring operator-safe cleanup. Codex should be able to inspect and clean non-secret ssh-agent session state without needing the OpenClaw private key again, and without accidentally deleting handles to a live or unknown session.

What changed

Added --list for non-secret session inventory with lifecycle status: active, stopped, expired, unknown.
Added --prune for stopped and expired session directories only.
--prune skips active and unknown/malformed sessions by default.
Before pruning stopped/expired state, the wrapper attempts to stop the recorded ssh-agent; if shutdown cannot be confirmed, the session directory is kept.
Added signal/error cleanup paths so a started ssh-agent is killed when the wrapper is interrupted or fails after start.
Tightened runtime/session directory checks, session id validation, timestamp parsing, and malformed metadata handling.
Metadata is trusted only when metadata.session_id exactly matches the session directory name; mismatches become unknown and are not pruned.
Updated docs and runtime layout for agent.sock, stopped_at, list/prune usage, and socket access-bearing semantics.
Expanded wrapper tests for list/prune, post-start cleanup, malformed sessions, non-object metadata, mismatched metadata/session ids, and TTL flake resistance.
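The status rules listed above can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: the function names, the `expires_at` field, and the precedence of `stopped_at` over expiry are assumptions. Only `metadata.json`, `session_id`, `stopped_at`, and the four status names come from this PR description.

```python
import json
from datetime import datetime
from pathlib import Path


def parse_aware(value):
    """Return a timezone-aware datetime, or None for anything else."""
    if not isinstance(value, str):
        return None
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return None
    # Naive timestamps are rejected rather than guessed at.
    return parsed if parsed.tzinfo is not None else None


def classify_session(session_dir: Path, now: datetime) -> str:
    """Classify one session directory; `now` must be timezone-aware."""
    try:
        metadata = json.loads((session_dir / "metadata.json").read_text())
    except (OSError, ValueError):
        return "unknown"  # missing, unreadable, or malformed metadata
    if not isinstance(metadata, dict):
        return "unknown"  # non-object metadata.json
    if metadata.get("session_id") != session_dir.name:
        return "unknown"  # metadata does not belong to this directory
    if "stopped_at" in metadata:
        # Assumed precedence: a recorded stop wins over expiry.
        return "stopped" if parse_aware(metadata["stopped_at"]) else "unknown"
    expires_at = parse_aware(metadata.get("expires_at"))  # hypothetical field
    if expires_at is None:
        return "unknown"
    return "expired" if now >= expires_at else "active"
```

Every failure path falls through to `unknown`, so a session is only reported as `stopped` or `expired` when its metadata is fully trusted.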

Why it changed

PR #77 intentionally shipped the narrow SSH capability first and left lifecycle hardening as follow-up #79. This PR closes that gap before more Agent Access Plane capabilities reuse the same runtime shape.

Files touched

scripts/agent-access/codex-openclaw-ssh-agent
control-plane/platformctl/tests/test_agent_access_ssh_agent.py
docs/agent-access/codex-openclaw-ssh.md
state/runtime-layout.md

Relevant context

Issue #79: session lifecycle hardening after SSH-agent TTL slice.
Issue #73 / PR #77: original Codex -> OpenClaw ssh-agent TTL wrapper.
PR #78 / ADR 0004: Agent Access Plane boundaries.
state/runtime-layout.md: canonical non-secret runtime state structure.

Runtime evidence

PYTHONPATH=control-plane pytest -q control-plane/platformctl/tests/test_agent_access_ssh_agent.py -> 10 passed
PYTHONPATH=control-plane pytest -q control-plane/platformctl/tests -> 157 passed
git diff --check -> pass
Secret-pattern diff scan -> no matches
Runtime smoke before safe-default change: --list listed two stopped prior sessions, --prune pruned both, next --list returned NO_SESSIONS
Runtime smoke after safe-default/final hardening: --list -> NO_SESSIONS; --prune -> PRUNE_DONE count=0

Canary iteration notes

Iter 1 result: defer; mitigated unknown auto-prune, prune-before-agent-stop, and naive timestamp handling.
Iter 2 result: defer; mitigated non-object metadata.json, prune help text mismatch, and keeping state when recorded agent shutdown cannot be confirmed.
Iter 3 result: defer; mitigated the remaining real code finding locally by requiring metadata.session_id == session_dir.name before trusting metadata.
Terminal status: approve_with_evidence_gap, not approve_merge, because canary cap prevents a 4th rerun after the final local patch.
Residual evidence gap for operator: (1) no canary rerun on the final SHA after the last small code fix; (2) product-gpt repeatedly reported an empty PR description even though this body is present via the Forgejo API; (3) the PR exceeds the 300 LOC norm, mostly because regression tests cover the lifecycle safety surface.
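The prune behaviour the iterations converged on (skip active/unknown, stop the agent first, keep state when shutdown is unconfirmed) can be sketched as below. This is an illustration under assumptions: `prune_sessions` and the `confirm_agent_stopped` callback are hypothetical names, and the real wrapper may be structured differently.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, List

# Only these statuses are ever eligible for deletion.
PRUNABLE = {"stopped", "expired"}


def prune_sessions(
    statuses: Dict[Path, str],
    confirm_agent_stopped: Callable[[Path], bool],
) -> List[Path]:
    """Remove prunable session directories and return the ones removed.

    `confirm_agent_stopped` is a hypothetical callback that tries to stop
    the recorded agent and returns True only when shutdown is confirmed.
    """
    pruned = []
    for session_dir, status in sorted(statuses.items()):
        if status not in PRUNABLE:
            continue  # active and unknown sessions are left untouched
        if not confirm_agent_stopped(session_dir):
            continue  # keep the handle: the agent may still be alive
        shutil.rmtree(session_dir)
        pruned.append(session_dir)
    return pruned
```

The agent-stop check runs before any deletion, so a directory is never removed while its agent might still be reachable through the socket.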

Known constraints

agent.sock remains an access-bearing handle (effectively a bearer capability) while the ssh-agent is alive; this PR documents that, and --prune leaves active/unknown sessions untouched.
Unknown/malformed sessions require manual inspection rather than automatic deletion.
This PR does not add generic secret command execution.
This PR does not change Infisical scope, Forgejo identity, CI, or OpenClaw runtime deployment.

Explicit out-of-scope

Max TTL policy changes beyond the existing 4h cap.
Generic Agent Access Plane catalog/resolver commands.
Forgejo/MCP identity split from #56.
Canary CI Infisical integration from #72.

Requested decision

Operator may merge if accepting the evidence gap above. Otherwise split/rewrite is the clean terminal alternative under the hard cap.

Merge blockers

Any path that can leak the private key into logs, argv, repo files, dotenv files, or child env.
--prune deleting active or unknown/malformed sessions.
--prune removing stopped/expired session handles when recorded agent shutdown cannot be confirmed.
Failure cleanup leaving a started ssh-agent behind in tested error paths.
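The last blocker, failure cleanup after start, can be illustrated with a small context manager. This is a sketch under assumptions, not the wrapper's implementation: `owned_process` and `Interrupted` are hypothetical names, and it assumes the child stays in the foreground (for a real `ssh-agent`, that would mean something like its `-D` foreground mode, since the agent forks by default).

```python
import signal
import subprocess
from contextlib import contextmanager


class Interrupted(Exception):
    """Raised by the signal handler so cleanup runs via normal unwinding."""


def _raise_interrupted(signum, frame):
    raise Interrupted(signum)


@contextmanager
def owned_process(argv):
    """Start a child; stop it if the body fails or the wrapper is signalled.

    On success the child is left running, matching a session that outlives
    the wrapper invocation. Must be used from the main thread, because
    signal handlers can only be installed there.
    """
    previous = {
        sig: signal.signal(sig, _raise_interrupted)
        for sig in (signal.SIGINT, signal.SIGTERM)
    }
    proc = None
    try:
        proc = subprocess.Popen(argv)
        yield proc
    except BaseException:
        if proc is not None:
            proc.terminate()
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                proc.kill()  # escalate if the child ignores SIGTERM
                proc.wait()
        raise
    finally:
        # Restore whatever handlers were installed before.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

Turning signals into an exception keeps a single cleanup path for both interruption and ordinary post-start errors, which makes the "no started agent left behind" property straightforward to test.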

Spec sources read

Issue #79 context from orchestrator thread: direct scope for lifecycle hardening.
scripts/agent-access/codex-openclaw-ssh-agent: implementation target from PR #77.
control-plane/platformctl/tests/test_agent_access_ssh_agent.py: existing acceptance tests and new regression coverage.
docs/agent-access/codex-openclaw-ssh.md: operator runbook for the wrapper.
state/runtime-layout.md: runtime state layout contract.

Closes #79

codex added 1 commit

2026-05-05 07:37:42 +02:00

fix(agent-access): harden ssh-agent session lifecycle

canary-required / collect-diff (pull_request) Successful in 4s

python-ci / Python 3.11 (pull_request) Successful in 26s

python-ci / Python 3.12 (pull_request) Successful in 27s

python-ci / Python 3.13 (pull_request) Successful in 25s

canary-required / canary (pull_request) Failing after 1s

1b3bbf004e

codex force-pushed codex/issues/79-agent-access-lifecycle from 1b3bbf004e to de0188be35

2026-05-05 07:44:10 +02:00

canary-required / collect-diff (pull_request) Successful in 3s

python-ci / Python 3.11 (pull_request) Successful in 26s

python-ci / Python 3.12 (pull_request) Successful in 27s

python-ci / Python 3.13 (pull_request) Successful in 26s

canary-required / canary (pull_request) Failing after 2s

codex force-pushed codex/issues/79-agent-access-lifecycle from de0188be35 to 94a6cddec6

2026-05-05 07:53:07 +02:00

canary-required / collect-diff (pull_request) Successful in 4s

python-ci / Python 3.11 (pull_request) Successful in 25s

python-ci / Python 3.12 (pull_request) Successful in 26s

python-ci / Python 3.13 (pull_request) Successful in 27s

canary-required / canary (pull_request) Failing after 2s

codex force-pushed codex/issues/79-agent-access-lifecycle from 94a6cddec6 to 852e04aa55

2026-05-05 08:00:23 +02:00

canary-required / collect-diff (pull_request) Successful in 4s

python-ci / Python 3.11 (pull_request) Successful in 27s

python-ci / Python 3.12 (pull_request) Successful in 28s

python-ci / Python 3.13 (pull_request) Successful in 28s

canary-required / canary (pull_request) Failing after 1s

codex commented

2026-05-05 08:01:07 +02:00

Author

Collaborator

Canary terminal update for PR #80

Iter 1: defer; mitigated unknown auto-prune and prune-before-agent-stop risks.
Iter 2: defer; mitigated non-object metadata, prune help text mismatch, and unconfirmed agent-shutdown cleanup.
Iter 3: defer; final real code finding was metadata/session-id mismatch. Final SHA 852e04aa55740789033b0a1d52f955fc204bb86b now treats mismatched metadata as unknown and skips prune; local tests pass.

Terminal action after hard cap: approve_with_evidence_gap.

Residual gap: no 4th canary rerun on final SHA; product-gpt kept reporting an empty PR description even though the PR body is present via Forgejo API; size concern remains visible. Operator can merge accepting this, or choose split/rewrite.

codex commented

2026-05-05 08:39:08 +02:00

Author

Collaborator

Superseded by split/rewrite path.

Operator decision: this access-plane hardening is security-sensitive, so more review friction is intentional. I am not asking to merge this combined PR. I will split it into smaller PRs with narrower canary surfaces.

codex closed this pull request

2026-05-05 08:39:08 +02:00

codex referenced this pull request

2026-05-05 08:42:44 +02:00

docs(process): add security-sensitive class of service #81

codex referenced this pull request

2026-05-05 08:43:23 +02:00

docs(process): add security-sensitive work lane #82

codex referenced this pull request

2026-05-05 08:46:05 +02:00

fix(agent-access): harden ssh-agent startup cleanup #83

codex referenced this pull request