ops(ci-security): harden forgejo-runner — Docker socket + Infisical token + persistent = host-compromise surface #675


Closed

opened 2026-06-01 17:22:48 +02:00 by claude · 0 comments

claude commented

2026-06-01 17:22:48 +02:00

Collaborator

Finding (Phase-0 security audit, 2026-06-01)

Live `infra/forgejo-runner/docker-compose.yml` on rs2000:

- `/var/run/docker.sock:/var/run/docker.sock` is mounted into the runner, so the runner (and any job it runs) effectively has root on the host. Since agents (claude/codex/…) author both the code and the workflows that run on this runner, a prompt-injected agent or a malicious dependency in a job inherits host-root.
- The runner holds an Infisical token (`INFISICAL_TOKEN_AUTH_FILE=/data/infisical-token-auth-token`) on a host bind-mount (`./data:/data`), so CI can resolve secrets and a compromised job can read and exfiltrate the token.
- The runner is persistent (not ephemeral), so contamination carries over between jobs; this caps build integrity at ~SLSA L2.
- Jobs run with the `docker:host` label (host-mode execution: the job runs in the runner's own environment, not in a separate per-job container).
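
The first two findings can be checked mechanically. The sketch below is illustrative only: it assumes the runner's `volumes:` entries use compose short syntax (`SOURCE:TARGET[:MODE]`) and have already been read into a list of strings. The names `Finding`, `parse_short_volume`, and `audit_volumes` are hypothetical, not part of any existing tooling in this repo.

```python
from __future__ import annotations

from dataclasses import dataclass

# Common host locations of the Docker daemon socket.
DOCKER_SOCKET_PATHS = {"/var/run/docker.sock", "/run/docker.sock"}


@dataclass(frozen=True)
class Finding:
    volume: str
    reason: str


def parse_short_volume(spec: str) -> tuple[str, str]:
    """Split a short-syntax volume 'SOURCE:TARGET[:MODE]' into (source, target)."""
    parts = spec.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"unsupported volume spec: {spec!r}")
    return parts[0], parts[1]


def audit_volumes(volumes: list[str], token_file: str | None = None) -> list[Finding]:
    """Flag raw Docker socket mounts and mounts that contain the token file.

    token_file is the in-container path of the secret token, e.g. the value
    of INFISICAL_TOKEN_AUTH_FILE.
    """
    findings: list[Finding] = []
    for spec in volumes:
        source, target = parse_short_volume(spec)
        if source in DOCKER_SOCKET_PATHS:
            # A read-only flag does not help here: API calls still go through.
            findings.append(Finding(spec, "raw Docker socket mounted (host-root equivalent)"))
        target_dir = target.rstrip("/") + "/"
        if token_file and (token_file == target or token_file.startswith(target_dir)):
            findings.append(Finding(spec, "secret token file lives on this mount"))
    return findings


if __name__ == "__main__":
    runner_volumes = ["/var/run/docker.sock:/var/run/docker.sock", "./data:/data"]
    for finding in audit_volumes(runner_volumes, "/data/infisical-token-auth-token"):
        print(f"{finding.volume}: {finding.reason}")
```

A check like this could run against the compose file before and after the hardening, to confirm that the socket finding disappears.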

Why this is the standout finding: the build pipeline is part of the supply chain here, and agents are in that pipeline. Host-root plus secret access from job code is the platform's single highest-value privilege-escalation surface. Tailnet gating and a single operator lower the likelihood (there are no untrusted external job authors), but prompt injection and malicious dependencies keep the risk live.

Proposed hardening (Codex lane — runtime/critical-infra change = operator-gated)

1. Remove or proxy the Docker socket. Prefer a socket-proxy (e.g. `tecnativa/docker-socket-proxy`) that exposes only the minimal API the jobs actually need, or a rootless / sysbox runner. If the raw socket is truly required for image builds, isolate it to a build-only runner.
2. Scope the Infisical token tightly: least-privilege access to only the paths CI needs, a short TTL, and ideally a per-job token rather than a long-lived token on a bind-mount.
3. Separate build and apply/deploy runners: the deploy runner (which can mutate prod) must not share a trust boundary with the build runner (which runs agent-authored job code).
4. Workspace hygiene: wipe the job workspace between runs, and allow no host mounts beyond what is needed.
5. Workflow-change review: changes to `.forgejo/workflows/*` require operator review (a CI change is a trust-boundary change).
6. Ephemeral runners only if hygiene can't be made reliable (hygiene-first, ephemeral-later).
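
The workflow-change review item above could be backed by a small path check in CI or in a pre-merge hook. This is a minimal sketch, not an existing gate. It assumes the list of changed files is already available as repo-relative POSIX paths. The function name `needs_operator_review` is hypothetical, and gating `infra/forgejo-runner/` as well is an assumption drawn from the "runtime change = operator-gated" rule.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Directories whose changes cross a trust boundary.
GATED_PREFIXES = (
    PurePosixPath(".forgejo/workflows"),
    PurePosixPath("infra/forgejo-runner"),  # assumption: runner config is gated too
)


def needs_operator_review(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that sit under a gated directory."""
    gated = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # path.parents lists every ancestor directory of the file.
        if any(prefix in path.parents for prefix in GATED_PREFIXES):
            gated.append(raw)
    return gated


if __name__ == "__main__":
    changed = [
        ".forgejo/workflows/build.yml",
        "src/app.py",
        "infra/forgejo-runner/docker-compose.yml",
    ]
    for path in needs_operator_review(changed):
        print(f"operator review required: {path}")
```

Matching on ancestor directories rather than string prefixes avoids false positives on sibling names such as `.forgejo/workflows-archive/`.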

Acceptance

- [ ] Runner no longer exposes the raw Docker socket to job code (removed, proxied, or isolated to a build-only runner).
- [ ] Infisical token scoped least-privilege + short-TTL.
- [ ] Build vs apply/deploy runner separation evaluated + documented.
- [ ] Workflow-change-review expectation documented.
- [ ] No runtime change applied without the operator gate.

Ties

- Maturity roadmap Phase 5 (runner hygiene → practical SLSA-L2): PR #674.
- #673 (sandbox tier complements this: contain job execution).
- Found in the Phase-0 audit; first item of the post-audit round of fixes.

Authored by claude (audit/design lane). The HOW + the gated apply are Codex's lane.


claude added this to the 10 - Improvements milestone

2026-06-01 17:22:48 +02:00

claude added the agent/codex and priority:p1 labels

2026-06-01 17:22:48 +02:00