explore(autonomy): tiered agent-execution gate — sandbox + soft-classifier tiers (Cursor Auto-review-inspired) #673

Closed
opened 2026-06-01 17:10:11 +02:00 by claude · 3 comments
Collaborator

Why now

Cursor shipped Auto-review (2026-05-29): a tiered gate on agent tool-calls (Shell/MCP/Fetch) —

  1. allowlist → run immediately,
  2. sandbox → run contained,
  3. classifier subagent → decide allow / retry / ask (steerable with custom instructions).

It maps almost exactly onto our autonomy model — and shows us which rungs we're missing. This is a design-stage exploration of bringing the two missing rungs (sandbox + soft-classifier) to the platform.

The principle (the spine)

Deterministic on hard, learned on soft.

  • Hard boundaries (data-loss, secrets, destructive ops, public exposure, runtime-apply) stay deterministic + auditable — rules, not a learned classifier. A classifier is non-deterministic + prompt-injectable; we don't want a guessing layer on the irreversible stuff.
  • Soft / medium decisions (where rules are clumsy and the single operator pays an attention-tax on every prompt) get a sandbox or a steerable classifier — to reduce operator interruptions without weakening the hard rails.

Single-operator nuance (important): for a solo operator, every hard-gate is his own interruption. So the goal of the middle tiers is less friction on reversible work, not more rigor everywhere. Rigor where it's irreversible; autonomy where it's reversible. Otherwise "more gates" = more asking = more friction = more errors = more theater — the opposite of the point.

Where we are: strong 0 + 3, thin 1 + 2

| Tier | Auto-review | Us | State |
|---|---|---|---|
| 0 (deterministic allowlist) | allowlist | capability catalog (#566) | ✅ strong |
| 1 (sandbox, run-contained) | sandbox | ~platformctl --dry-run/plan, no true exec-sandbox | ⚠️ thin |
| 2 (soft classifier, steerable) | classifier subagent | (none) | ❌ missing |
| 3 (operator-gate / hard-stops) | "ask" | apply-gates + hard-stops | ✅ strong |

How I see the two missing rungs

Tier 1 — Sandbox (high-value, low-risk — do this first).
Run a risky-but-reversible action in isolation (container / namespace / extended dry-run-then-commit), observe, then commit-or-discard. Key property: agents can run their own sandbox autonomously — no operator needed for the reversible case. Best first target: #634 (autonomous deploy / nightly / repair) — instead of rule-gating it, run the deploy in a sandbox first. Safer than rules, less friction than "ask."
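
A minimal sketch of the commit-or-discard shape, assuming the "target" is just a directory on disk. `run_in_sandbox`, `action`, and `check` are hypothetical names, not platformctl APIs, and a directory copy is far weaker than the container/namespace isolation discussed here; it only illustrates the control flow.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_in_sandbox(
    target: Path,
    action: Callable[[Path], None],
    check: Callable[[Path], bool],
) -> bool:
    """Apply `action` to a throwaway copy of `target`; commit only if `check` passes."""
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp) / "scratch"
        shutil.copytree(target, scratch)  # isolate: work on a copy, not the real target
        try:
            action(scratch)               # run the risky-but-reversible step
            ok = check(scratch)           # observe the result
        except Exception:
            ok = False                    # any failure means discard
        if ok:
            # Commit: swap the scratch contents in. Not atomic; a real
            # implementation would need a safer promotion step.
            shutil.rmtree(target)
            shutil.copytree(scratch, target)
        return ok                         # scratch is deleted on exit either way
```

The point of the shape: the agent can call this on its own, and the operator is only involved if the check fails and the action still matters.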

Independent confirmation (GPT-5.5 Pro red-team): ranked agent/MCP sandboxing HIGH: "filesystem/network isolation is the control that prevents tool compromise from becoming host compromise." So this isn't only an ergonomics borrow; it's a security control. It dovetails with an MCP capability inventory (Tier-0 capability tags: exec_command, secret_resolve, deploy, write_file, memory_write = high-risk → must be sandboxed).
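
One way the tag-to-sandbox rule could look, as a sketch. The five high-risk tags are the ones listed above; the tool names, the `read_only` tag, and the inventory layout are invented for illustration and do not reflect the real capability catalog.

```python
# Tags named above as high-risk.
HIGH_RISK_TAGS = frozenset(
    {"exec_command", "secret_resolve", "deploy", "write_file", "memory_write"}
)

# Hypothetical MCP tool inventory: tool name -> capability tags.
INVENTORY = {
    "shell.run": {"exec_command"},
    "docs.search": {"read_only"},
    "repo.patch": {"write_file", "read_only"},
}


def must_sandbox(tags) -> bool:
    """A tool needs the sandbox if it carries at least one high-risk tag."""
    return bool(HIGH_RISK_TAGS & set(tags))


def sandbox_required(inventory) -> list:
    """Names of all tools in the inventory that must run sandboxed, sorted."""
    return sorted(name for name, tags in inventory.items() if must_sandbox(tags))
```

Because the rule is a plain set intersection, it stays deterministic and auditable: changing which tools get sandboxed means editing tags, not logic.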

Tier 2 — Soft classifier (the killer idea — but build last, carefully).
A classifier-subagent that decides allow / retry / ask for medium actions that rules can't cleanly classify. Platform-fit detail: we're policy-as-text (AGENTS.md, capability catalog) → a steerable classifier means our existing policy text becomes the gate logic, without hand-coding every rule. A bridge between declarative policy and an adaptive gate.
But: (a) classifier ONLY on the soft tier — never on a hard boundary; (b) don't build it as a project — prefer depending on an existing classifier over rolling our own, and only invest if a real approval-fatigue point actually emerges. Sandbox-first; classifier-if-needed.
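
A rough sketch of what "policy text becomes the gate logic" could mean in practice: the policy is pasted into the classifier prompt, and the reply is parsed strictly into one of the three verdicts. The prompt wording, `build_prompt`, and `parse_verdict` are assumptions made for this example; no model call is shown, and treating unparseable output as `ask` is a conservative choice for the sketch.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    RETRY = "retry"
    ASK = "ask"


def build_prompt(policy_text: str, action: str) -> str:
    """Embed the policy text verbatim so editing the policy re-steers the gate."""
    return (
        "You gate agent actions. Policy:\n"
        f"{policy_text}\n\n"
        f"Proposed action:\n{action}\n\n"
        "Answer with exactly one word: allow, retry, or ask."
    )


def parse_verdict(reply: str) -> Verdict:
    """Accept only an exact verdict word; anything else escalates to the operator."""
    word = reply.strip().lower()
    try:
        return Verdict(word)
    except ValueError:
        return Verdict.ASK
```

Strict parsing matters more than prompt quality here: a chatty or injected reply never turns into an `allow` by accident.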

Scope (design-stage — no runtime change)

Exploration / ADR-precursor, not impl:

  1. Map our concrete agent actions onto the 4 tiers (which are sandbox-eligible vs hard-gated).
  2. Pick a sandbox mechanism (container? user-namespace? extend platformctl plan into a true throwaway-apply?).
  3. Build-vs-depend call on the classifier (lean: depend, don't build).
  4. Define how a steerable classifier ingests our policy-as-text.

Hard rule carried in: never classifier-gate a hard boundary; those stay deterministic + auditable.

Ties

  • #76 (Agent Access Plane) — this is its autonomy/execution-gate layer.
  • #634 (autonomous deploy/repair) — first sandbox-tier customer.
  • #566 (capability catalog) — Tier 0; capability tags feed which actions need sandboxing.
  • apply-gates / hard-stops — Tier 3, unchanged.

Acceptance (design)

  • A short design note (ADR-precursor): the 4-tier map + sandbox-mechanism recommendation + build-vs-depend classifier call.
  • No runtime mutation; operator-gated before any impl.

Authored by claude (design/architecture lane), Cursor-Auto-review-inspired, cross-checked against a GPT-5.5 Pro red-team. The HOW stays Codex's lane.

Author
Collaborator

Design complete (claude) → PR #686.

Delivered the 4-tier cascade model. The structural result worth flagging: because hard-stops are matched first + deterministically, the soft classifier is only ever reached after the hard-stop filter — so it cannot gate a hard boundary by construction (the hard rule is guaranteed by cascade order, not by trust).

  • Sandbox (Tier 1): reuse git+CI and container-isolation; build ONE new thing — platformctl apply --sandbox (disposable-target apply, reusing the just-hardened pipeline; first customer #634).
  • Classifier (Tier 2): depend, don't build — a cheap-model subagent steered by our policy-as-text (edit AGENTS.md → re-steer the gate, no code).
  • Hard rails: unchanged + deterministic; soft layer only touches reversible ground; fail-closed to ask.

Becomes a codex-ready impl spec once you accept the 3 open-question calls (sandbox target / classifier model / fail-closed default). The impl is sequenced + each step is safe-or-gated.
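
The "guaranteed by cascade order" claim can be sketched as a router in which the classifier call sits below the hard-stop check. This is an illustration, not the design in PR #686: the tag and action names are hypothetical, and the relative order of the allowlist and sandbox steps is an assumption (the thread only fixes that hard-stops come first and the classifier comes after them).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Decision(Enum):
    ALLOW = "allow"
    SANDBOX = "sandbox"
    RETRY = "retry"
    ASK = "ask"


@dataclass(frozen=True)
class Action:
    name: str
    tags: frozenset


# Hypothetical rule sets, for illustration only.
HARD_STOP_TAGS = frozenset({"secret_resolve", "destructive", "public_exposure"})
ALLOWLIST = frozenset({"read_file", "list_dir"})
SANDBOXABLE_TAGS = frozenset({"deploy", "write_file"})


def route(action: Action, classify: Callable[[Action], Decision]) -> Decision:
    # Hard-stops are matched first and deterministically: straight to the operator.
    if action.tags & HARD_STOP_TAGS:
        return Decision.ASK
    # Tier 0: allowlisted actions run immediately.
    if action.name in ALLOWLIST:
        return Decision.ALLOW
    # Tier 1: reversible-but-risky actions run contained.
    if action.tags & SANDBOXABLE_TAGS:
        return Decision.SANDBOX
    # Tier 2: only actions that survived every filter above reach the classifier.
    try:
        verdict = classify(action)
    except Exception:
        return Decision.ASK  # fail closed
    allowed = (Decision.ALLOW, Decision.RETRY, Decision.ASK)
    return verdict if verdict in allowed else Decision.ASK
```

A test that passes a spy as `classify` and asserts it is never called for a hard-stop action is one way to pin the invariant down in CI.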

Author
Collaborator

Open questions resolved (operator, 2026-06-02) → codex-ready impl-spec is #687.

  • Sandbox target: reuse hermes-style preview slots.
  • Classifier model: Ollama-cloud, mistral-small-4-class default (RS2000-local too tight for a frequent gate), pluggable — benchmark & swap freely.
  • Tier-2 fail-closed: ask.

Design (this issue / PR #686) is complete; implementation continues in #687 (agent/codex, status:codex-ready). Codex starts with the deterministic cascade router (incl. the 'classifier-never-reaches-a-hard-stop' invariant), then apply --sandbox, classifier subagent, and receipts.
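
For the "pluggable, benchmark & swap freely" call, a small harness along these lines might be enough to compare backends. Everything here is hypothetical: the `Classifier` signature, the labelled cases, and the scoring by simple agreement are assumptions for the sketch, and real backends would wrap a model call rather than a lambda.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: action description -> "allow" | "retry" | "ask".
Classifier = Callable[[str], str]


def agreement(classifier: Classifier, cases: List[Tuple[str, str]]) -> float:
    """Fraction of labelled cases where the classifier matches the expected verdict."""
    if not cases:
        return 0.0
    hits = 0
    for action, expected in cases:
        try:
            got = classifier(action)
        except Exception:
            got = "ask"  # a crashing backend counts as fail-closed, like the gate itself
        hits += got == expected
    return hits / len(cases)


def pick_best(backends: Dict[str, Classifier], cases: List[Tuple[str, str]]) -> str:
    """Name of the backend with the highest agreement on the labelled cases."""
    return max(backends, key=lambda name: agreement(backends[name], cases))
```

Usage would be a dict of candidate backends plus a short list of operator-labelled actions; swapping the default then becomes a config change backed by a number.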

Collaborator

Closing as design-landed.

PR #686 merged the tiered agent-execution gate design note. Implementation moved to #687 as the codex-ready execution issue, split into the cascade router, sandbox apply, classifier interface, and decision receipt slices.

No runtime mutation happened in the design issue.

codex closed this issue 2026-06-02 12:07:54 +02:00
Reference
pdurlej/platform#673