pdurlej/platform


explore(autonomy): tiered agent-execution gate — sandbox + soft-classifier tiers (Cursor Auto-review-inspired) #673


Closed

opened 2026-06-01 17:10:11 +02:00 by claude · 3 comments

claude commented

2026-06-01 17:10:11 +02:00

Collaborator

Why now

Cursor shipped Auto-review (2026-05-29): a tiered gate on agent tool-calls (Shell/MCP/Fetch):

1. allowlist → run immediately,
2. sandbox → run contained,
3. classifier subagent → decide allow / retry / ask (steerable with custom instructions).

It maps almost exactly onto our autonomy model — and shows us which rungs we're missing. This is a design-stage exploration of bringing the two missing rungs (sandbox + soft-classifier) to the platform.

The principle (the spine)

Deterministic on hard, learned on soft.

- Hard boundaries (data-loss, secrets, destructive ops, public exposure, runtime-apply) stay deterministic + auditable: rules, not a learned classifier. A classifier is non-deterministic and prompt-injectable; we don't want a guessing layer on the irreversible stuff.
- Soft / medium decisions (where rules are clumsy and the single operator pays an attention-tax on every prompt) get a sandbox or a steerable classifier, to reduce operator interruptions without weakening the hard rails.

Single-operator nuance (important): for a solo operator, every hard-gate is their own interruption. So the goal of the middle tiers is less friction on reversible work, not more rigor everywhere. Rigor where it's irreversible; autonomy where it's reversible. Otherwise "more gates" means more asking, more friction, more errors, and more theater: the opposite of the point.

Where we are: strong 0 + 3, thin 1, missing 2

| Tier | Auto-review | Us | State |
|---|---|---|---|
| 0: deterministic allowlist | allowlist | capability catalog (#566) | ✅ strong |
| 1: sandbox (run-contained) | sandbox | roughly `platformctl --dry-run` / `plan`; no true exec-sandbox | ⚠️ thin |
| 2: soft classifier (steerable) | classifier subagent | none | ❌ missing |
| 3: operator-gate / hard-stops | "ask" | apply-gates + hard-stops | ✅ strong |

How I see the two missing rungs

Tier 1 — Sandbox (high-value, low-risk — do this first).
Run a risky-but-reversible action in isolation (container / namespace / extended dry-run-then-commit), observe, then commit-or-discard. Key property: agents can run their own sandbox autonomously — no operator needed for the reversible case. Best first target: #634 (autonomous deploy / nightly / repair) — instead of rule-gating it, run the deploy in a sandbox first. Safer than rules, less friction than "ask."
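To make the commit-or-discard shape concrete, here is a minimal sketch. It is an assumption-heavy illustration, not the platform's mechanism: it only isolates a directory tree (no process or network isolation), and `run_in_sandbox`, `action`, and `check` are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_in_sandbox(
    target: Path,
    action: Callable[[Path], None],
    check: Callable[[Path], bool],
) -> bool:
    """Run `action` on a throwaway copy of `target`; commit only if `check` passes.

    Returns True when the change was committed, False when it was discarded.
    """
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp) / "scratch"
        shutil.copytree(target, scratch)  # work on a copy, never on the live tree
        try:
            action(scratch)               # the risky-but-reversible step
            ok = check(scratch)           # observe the result
        except Exception:
            ok = False                    # any failure means discard
        if ok:
            # Commit: swap the observed-good copy into place.
            # Not atomic; a real implementation would need a safer swap.
            shutil.rmtree(target)
            shutil.copytree(scratch, target)
        return ok
    # Leaving the `with` block deletes the scratch copy either way.
```

The point of the sketch is the control flow: the agent can run the whole loop itself, and the live target is only touched after the observation step passes.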

Independent confirmation (GPT-5.5 Pro red-team): ranked agent/MCP sandboxing HIGH — "filesystem/network isolation is the control that prevents tool compromise from becoming host compromise." So this isn't only an ergonomics borrow; it's a security control. It dovetails with an MCP capability inventory (Tier-0 capability tags: exec_command, secret_resolve, deploy, write_file, memory_write = high-risk → must be sandboxed).
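A rough sketch of how the capability tags could drive the "must be sandboxed" decision. The five tags are the ones listed above; the tool names in the usage note and the function names are hypothetical, and the real catalog (#566) may look nothing like this.

```python
from typing import Iterable

# High-risk capability tags as listed in this issue.
HIGH_RISK_TAGS = frozenset(
    {"exec_command", "secret_resolve", "deploy", "write_file", "memory_write"}
)


def requires_sandbox(tags: Iterable[str]) -> bool:
    """True if a tool carries at least one high-risk capability tag."""
    return not HIGH_RISK_TAGS.isdisjoint(tags)


def sandbox_plan(inventory: dict[str, set[str]]) -> dict[str, bool]:
    """Map each tool in a (hypothetical) MCP inventory to its sandbox requirement."""
    return {tool: requires_sandbox(tags) for tool, tags in sorted(inventory.items())}
```

For example, `sandbox_plan({"shell": {"exec_command"}, "docs_search": {"read_docs"}})` would mark only `shell` as needing a sandbox. The check is a plain set intersection, so it stays deterministic and auditable, in line with Tier 0.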

Tier 2 — Soft classifier (the killer idea — but build last, carefully).
A classifier-subagent that decides allow / retry / ask for medium actions that rules can't cleanly classify. Platform-fit detail: our policy is already text (AGENTS.md, capability catalog), so a steerable classifier means our existing policy text becomes the gate logic, without hand-coding every rule. It is a bridge between declarative policy and an adaptive gate.
But: (a) classifier ONLY on the soft tier — never on a hard boundary; (b) don't build it as a project — prefer depending on an existing classifier over rolling our own, and only invest if a real approval-fatigue point actually emerges. Sandbox-first; classifier-if-needed.
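One way "policy text becomes the gate logic" could look, as a sketch only. The prompt wording, the function names, and the `model` callable are all assumptions; no real classifier API is implied. Anything that is not a clean allow / retry / ask answer falls back to ask.

```python
from typing import Callable

VERDICTS = ("allow", "retry", "ask")


def build_prompt(policy_text: str, action: str) -> str:
    """Embed the policy text verbatim so editing the policy re-steers the gate."""
    return (
        "You gate agent tool-calls on reversible actions only.\n"
        f"Policy:\n{policy_text.strip()}\n\n"
        f"Proposed action:\n{action.strip()}\n\n"
        "Answer with exactly one word: allow, retry, or ask."
    )


def parse_verdict(raw: str) -> str:
    """Accept only an exact verdict; anything else becomes 'ask'."""
    word = raw.strip().lower()
    return word if word in VERDICTS else "ask"


def classify(policy_text: str, action: str, model: Callable[[str], str]) -> str:
    """`model` is a hypothetical callable: prompt in, raw text out."""
    return parse_verdict(model(build_prompt(policy_text, action)))
```

Because the policy is passed in as a string, steering the gate means editing the policy file, not this code.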

Scope (design-stage — no runtime change)

Exploration / ADR-precursor, not impl:

1. Map our concrete agent actions onto the 4 tiers (which are sandbox-eligible vs hard-gated).
2. Pick a sandbox mechanism (container? user-namespace? extend `platformctl plan` into a true throwaway-apply?).
3. Build-vs-depend call on the classifier (lean: depend, don't build).
4. Define how a steerable classifier ingests our policy-as-text.

Hard rule carried in: never classifier-gate a hard boundary; those stay deterministic + auditable.

Ties

- #76 (Agent Access Plane): this is its autonomy/execution-gate layer.
- #634 (autonomous deploy/repair): first sandbox-tier customer.
- #566 (capability catalog): Tier 0; capability tags feed which actions need sandboxing.
- apply-gates / hard-stops: Tier 3, unchanged.

Acceptance (design)

- [ ] A short design note (ADR-precursor): the 4-tier map + sandbox-mechanism recommendation + build-vs-depend classifier call.
- [ ] No runtime mutation; operator-gated before any impl.

Authored by claude (design/architecture lane), Cursor-Auto-review-inspired, cross-checked against a GPT-5.5 Pro red-team. The HOW stays Codex's lane.


claude added this to the 10 - Improvements milestone

2026-06-01 17:10:11 +02:00

claude added the

agent/claude-code

priority:p2

labels

2026-06-01 17:10:11 +02:00

claude referenced this issue

2026-06-01 17:19:15 +02:00

docs(strategy): platform security & maturity roadmap v1 (audit-grounded) #674

claude referenced this issue

2026-06-01 17:22:48 +02:00

ops(ci-security): harden forgejo-runner — Docker socket + Infisical token + persistent = host-compromise surface #675

claude referenced this issue

2026-06-02 11:32:07 +02:00

design(autonomy): tiered agent-execution gate — 4-tier cascade (#673) #686

claude referenced this issue from a commit

2026-06-02 11:32:07 +02:00

design(autonomy): tiered agent-execution gate — 4-tier cascade (#673)

claude commented

2026-06-02 11:32:07 +02:00

Author

Collaborator

Design complete (claude) → PR #686.

Delivered the 4-tier cascade model. The structural result worth flagging: because hard-stops are matched first + deterministically, the soft classifier is only ever reached after the hard-stop filter — so it cannot gate a hard boundary by construction (the hard rule is guaranteed by cascade order, not by trust).

- Sandbox (Tier 1): reuse git+CI and container-isolation; build ONE new thing: `platformctl apply --sandbox` (disposable-target apply, reusing the just-hardened pipeline; first customer #634).
- Classifier (Tier 2): depend, don't build. A cheap-model subagent steered by our policy-as-text (edit AGENTS.md → re-steer the gate, no code).
- Hard rails: unchanged + deterministic; soft layer only touches reversible ground; fail-closed to `ask`.

Becomes a codex-ready impl spec once you accept the 3 open-question calls (sandbox target / classifier model / fail-closed default). The impl is sequenced + each step is safe-or-gated.
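The cascade-order guarantee described above can be sketched in a few lines. This is illustrative only and not the router from PR #686: the tag sets, the allowlist entries, and the relative order of the allowlist and sandbox checks are assumptions. The one property taken from the design is that the hard-stop match runs first, so the classifier is never called for a hard boundary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    name: str
    tags: frozenset[str]


# Hypothetical tag names, loosely following the hard boundaries named in the issue.
HARD_STOP_TAGS = frozenset(
    {"data_loss", "secrets", "destructive", "public_exposure", "runtime_apply"}
)
ALLOWLIST = frozenset({"read_file", "list_dir"})  # hypothetical Tier-0 entries
SANDBOXABLE_TAGS = frozenset({"deploy"})          # hypothetical Tier-1 trigger


def route(action: Action, classifier: Callable[[Action], str]) -> str:
    # Tier 3 is matched first and deterministically: nothing below can override it.
    if action.tags & HARD_STOP_TAGS:
        return "operator_gate"
    # Tier 0: deterministic allowlist.
    if action.name in ALLOWLIST:
        return "allow"
    # Tier 1: reversible but risky, so run contained.
    if action.tags & SANDBOXABLE_TAGS:
        return "sandbox"
    # Tier 2: the soft classifier only ever sees what survived the filters above.
    try:
        verdict = classifier(action)
    except Exception:
        return "ask"  # fail-closed
    return verdict if verdict in {"allow", "retry", "ask"} else "ask"
```

Since the hard-stop branch returns before `classifier` is referenced, the invariant holds by control flow, not by trusting the classifier's output.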


claude referenced this issue

2026-06-02 11:33:44 +02:00

docs(strategy): platform security & maturity roadmap v1 (audit-grounded) #674

claude referenced this issue

2026-06-02 11:40:43 +02:00

impl(autonomy): tiered execution gate — cascade router, sandbox, classifier (per #673 / PR #686) #687

claude commented

2026-06-02 11:40:43 +02:00

Author

Collaborator

Open questions resolved (operator, 2026-06-02) → codex-ready impl-spec is #687.

- Sandbox target: reuse hermes-style preview slots.
- Classifier model: Ollama-cloud, mistral-small-4-class default (RS2000-local too tight for a frequent gate); pluggable, so benchmark & swap freely.
- Tier-2 fail-closed: ask.

Design (this issue / PR #686) is complete; implementation continues in #687 (agent/codex, status:codex-ready). Codex starts with the deterministic cascade router (incl. the 'classifier-never-reaches-a-hard-stop' invariant), then apply --sandbox, classifier subagent, and receipts.
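A sketch of what "pluggable, benchmark & swap freely" plus "fail-closed: ask" might mean at the interface level. Everything here is hypothetical (the `Classifier` protocol, `KeywordStub`, `safe_decide`, `agreement`); it deliberately contains no real model or Ollama calls, and #687 defines the actual interface.

```python
from typing import Protocol


class Classifier(Protocol):
    """Any backend that can turn an action description into a verdict."""

    def decide(self, action: str) -> str: ...


class KeywordStub:
    """Toy backend standing in for a real model-backed classifier."""

    def decide(self, action: str) -> str:
        return "allow" if action.startswith("read") else "ask"


def safe_decide(clf: Classifier, action: str) -> str:
    """Fail-closed wrapper: errors and unknown verdicts both become 'ask'."""
    try:
        verdict = clf.decide(action)
    except Exception:
        return "ask"
    return verdict if verdict in {"allow", "retry", "ask"} else "ask"


def agreement(clf: Classifier, labeled: list[tuple[str, str]]) -> float:
    """Fraction of labeled (action, expected_verdict) cases the backend matches."""
    if not labeled:
        return 0.0
    hits = sum(safe_decide(clf, action) == expected for action, expected in labeled)
    return hits / len(labeled)
```

Swapping backends then means passing a different object to `agreement` over the same labeled cases and comparing the scores.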


pdurlej referenced this issue from a commit

2026-06-02 12:02:25 +02:00

Merge pull request 'design(autonomy): tiered agent-execution gate — 4-tier cascade (#673)' (#686) from claude/673-autonomy-design into main

codex commented

2026-06-02 12:07:51 +02:00

Collaborator

Closing as design-landed.

PR #686 merged the tiered agent-execution gate design note. Implementation moved to #687 as the codex-ready execution issue, split into the cascade router, sandbox apply, classifier interface, and decision receipt slices.

No runtime mutation happened in the design issue.


codex closed this issue

2026-06-02 12:07:54 +02:00

codex referenced this issue

2026-06-02 12:12:53 +02:00

feat(autonomy): add deterministic cascade router #689

codex referenced this issue

2026-06-02 14:01:25 +02:00

feat(apply): add sandbox apply receipt mode #694