audit(agent-readiness): ProgramBench-style architecture/manageability audit for platform repo #384

Open
opened 2026-05-19 02:12:32 +02:00 by Iskra · 5 comments
Collaborator

Source

Signal request from Piotr to create a Platform issue after discussing arXiv:2605.03546 — ProgramBench: Can Language Models Rebuild Programs From Scratch?

Why this matters

ProgramBench is useful not because “models are bad at coding”, but because it shows a specific failure mode: when asked to build/rebuild whole programs, agents tend to produce compressed, monolithic imitations of behavior instead of well-decomposed, maintainable systems.

That risk applies directly to our platform repo: if the repo shape, docs, tests, contracts, and ownership boundaries are unclear, future Codex/Claude/Iskra work will drift toward big patches, hidden coupling, and brittle “passes the local task” fixes.

Goal

Run a ProgramBench-style agent-readiness audit of pdurlej/platform:

Can an agent understand, modify, test, and improve this repo in small safe steps without turning it into monolithic sludge?

This is not a benchmark vanity exercise. It is an operational audit for whether the platform codebase is manageable by AI agents and human reviewers.

Audit questions

1. Repo orientation

  • Is there a clear map of major subsystems?
  • Can a fresh agent tell what owns what within 10–15 minutes?
  • Are entrypoints, services, workers, MCP/server pieces, and deployment paths discoverable?
  • Is there an AGENTS.md / operator guide that reflects reality?

2. Decomposition quality

  • Are modules/folders aligned to product/runtime concepts?
  • Are boundaries explicit enough to avoid one huge “fix everything” edit?
  • Are there obvious god files, overloaded scripts, or implicit cross-coupling?
  • Can a change be scoped to one subsystem with predictable blast radius?
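One cheap way to start answering the "god files" question is to rank source files by line count. The following is a minimal sketch, not part of the platform repo: the function name `largest_files`, the suffix list, and the skipped directory names are all assumptions made for illustration, and line count is only a rough proxy for an overloaded file.

```python
from pathlib import Path

# Hypothetical defaults: adjust to the languages and vendored dirs the repo actually has.
SOURCE_SUFFIXES = (".py", ".ts", ".go", ".sh")
SKIPPED_DIRS = {".git", "node_modules"}


def largest_files(root, suffixes=SOURCE_SUFFIXES, top=10):
    """Return (line_count, relative_path) pairs for the largest source files."""
    root = Path(root)
    sizes = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Ignore anything inside a skipped directory.
        if SKIPPED_DIRS.intersection(relative.parts):
            continue
        if path.is_file() and path.suffix in suffixes:
            with path.open("r", encoding="utf-8", errors="replace") as handle:
                line_count = sum(1 for _ in handle)
            sizes.append((line_count, str(relative)))
    # Largest first; ties fall back to path order.
    return sorted(sizes, reverse=True)[:top]
```

A file near the top of this list is a candidate for closer reading, not a finding on its own; the audit would still need to show why the file is overloaded.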

3. Behavioral testability

  • Are core user-visible workflows covered by tests or smoke probes?
  • Can an agent verify a change without needing tribal knowledge?
  • Are there cheap gates: lint, typecheck, unit, integration, smoke, fixture-based contracts?
  • Are failure modes represented as regression tests, not just issues/chat memory?

4. Contract/evidence layer

  • Are API/MCP/runtime contracts written down and tested?
  • Do important outputs have schemas or stable fixtures?
  • Can an agent cite evidence paths/logs/test output after a change?
  • Does health distinguish liveness, readiness, and actual usefulness?

5. Agent workflow safety

  • Can Codex/Claude work in small PRs without hidden credentials or operator identity leaks?
  • Are write paths least-privilege and clearly gated?
  • Are destructive/runtime mutations separated from read-only audit paths?
  • Is there a safe “proposal first, apply after approval” lane where needed?

6. Maintainability under repeated agent edits

  • Do recent agent-authored commits preserve architecture or accrete patches?
  • Are there signs of benchmark-style failure: giant single-file edits, duplicated logic, local hacks, unclear ownership?
  • What refactors would reduce future agent error rate?
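The "giant single-file edits" signal can be approximated from commit history. The sketch below parses the text produced by `git log --numstat --format="commit %H"` and flags commits where one file carries most of a large change. The function name and both thresholds (`min_lines`, `share`) are hypothetical values chosen for illustration, and restricting the log to agent-authored commits (for example with `--author`) is left to the caller.

```python
def flag_lopsided_commits(numstat_text, min_lines=300, share=0.8):
    """Flag commits where a single file accounts for most of a large change.

    Expects the output of: git log --numstat --format="commit %H"
    Returns a list of (sha, path, file_churn, total_churn) tuples.
    """
    commits = {}
    current = None
    for line in numstat_text.splitlines():
        if line.startswith("commit "):
            current = line.split()[1]
            commits[current] = {}
            continue
        parts = line.split("\t")
        if current is None or len(parts) != 3:
            continue  # blank lines and anything that is not a numstat row
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            continue  # git prints "-" for binary files
        churn = int(added) + int(deleted)
        commits[current][path] = commits[current].get(path, 0) + churn

    flagged = []
    for sha, files in commits.items():
        total = sum(files.values())
        if total < min_lines:
            continue  # small commits are not interesting here
        path, churn = max(files.items(), key=lambda item: item[1])
        if churn / total >= share:
            flagged.append((sha, path, churn, total))
    return flagged
```

A flagged commit is a prompt for review rather than proof of a problem: a generated file or a deliberate file move can trip the same thresholds.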

Suggested deliverable

A short audit doc plus issues/PR plan:

  1. Subsystem map — current architecture, owners, risky areas.
  2. Agent-readiness scorecard — orientation, decomposition, testability, contracts, safety.
  3. Top 5 blockers to safe agentic maintenance.
  4. Small PR sequence to improve repo manageability.
  5. Regression gates that should exist before larger autonomous work.

Acceptance criteria

  • Audit is grounded in actual repo evidence: files, tests, commands, logs, docs.
  • No vague “needs cleanup” without a concrete path.
  • Produces actionable follow-up issues or a small PR stack.
  • Explicitly flags where an agent would currently be tempted into monolithic/rewrite-style behavior.
  • Recommends how to make future work smaller, safer, and more reviewable.
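The scorecard and the evidence requirement can be enforced mechanically. This is a minimal sketch under stated assumptions: the five dimension names come from the scorecard item in this issue, but the `Finding` type, the 0 to 3 scale, and the `validate_scorecard` helper are hypothetical and not an agreed format.

```python
from dataclasses import dataclass, field

# Dimension names taken from the scorecard item in this issue.
DIMENSIONS = ("orientation", "decomposition", "testability", "contracts", "safety")


@dataclass
class Finding:
    dimension: str
    score: int  # hypothetical scale: 0 (blocked) to 3 (agent-ready)
    evidence: list = field(default_factory=list)  # file paths, commands, log refs


def validate_scorecard(findings):
    """Return a list of problems; an empty list means the scorecard is usable."""
    problems = []
    seen = {finding.dimension for finding in findings}
    for missing in sorted(set(DIMENSIONS) - seen):
        problems.append(f"missing dimension: {missing}")
    for finding in findings:
        if finding.dimension not in DIMENSIONS:
            problems.append(f"unknown dimension: {finding.dimension}")
        if not 0 <= finding.score <= 3:
            problems.append(f"{finding.dimension}: score out of range")
        if not finding.evidence:
            # Mirrors the criterion that findings cite actual repo evidence.
            problems.append(f"{finding.dimension}: no evidence cited")
    return problems
```

Rejecting a finding with an empty `evidence` list is the point of the sketch: it turns "no vague needs-cleanup" from a review habit into a check.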

Reference

  • arXiv:2605.03546 — ProgramBench: Can Language Models Rebuild Programs From Scratch?
  • Key lesson: passing partial behavioral tests is not the same as having good architecture; agents need decomposition, behavioral tests, and evidence/control-plane support.
Collaborator

M05 disposition: move to M06 and keep parked.

The ProgramBench-style agent-readiness audit is valuable, but it is broad agent/governance work, not M05 foundation closeout. Before implementation it should be split into a bounded audit packet with concrete files and pass/fail questions.

Collaborator

M10 disposition: moved to 10 - Improvements.

What this is: ProgramBench-style agent-readiness audit.

Why parked here: This is a valuable manageability audit, but broad and diagnostic; park it until we deliberately run an agent-readiness improvement pass.

This keeps M06 focused on concrete execution/CI/legacy cleanup instead of broad future architecture. Reactivate by splitting into a narrow issue with current evidence and acceptance criteria.

Collaborator

Tooling note (2026-05-30): use Repomix --compress as the input artifact for this audit.

Per the context-stack tooling decision, npx repomix@latest --compress produces a Tree-sitter-based, token-efficient whole-repo snapshot — ideal as the one-shot input for a ProgramBench-style architecture/manageability read. It complements (does not replace) CodeGraph (dynamic symbol graph) and Serena (LSP symbol-level ops): Repomix is the pocket snapshot for big one-off analyses, not constant context.

Suggested flow when this audit runs:

  1. npx repomix@latest --compress at repo root → compressed snapshot.
  2. Feed snapshot as the audit baseline; cross-check structure against MAP.md + docs/domains/.
  3. Use CodeGraph/Serena for any drill-down the snapshot flags.

Pocket tool only — not added as a standing MCP.
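Step 2 of the flow, cross-checking structure against MAP.md, can be partly automated. The sketch below is an illustration only: it assumes MAP.md refers to directories by name followed by a slash (as in `docs/domains/`), which is a guess about the file's conventions, and the helper names are hypothetical.

```python
import re
from pathlib import Path


def top_level_directories(root):
    """List visible top-level directory names under root."""
    return sorted(
        entry.name
        for entry in Path(root).iterdir()
        if entry.is_dir() and not entry.name.startswith(".")
    )


def unmapped_directories(directories, map_text):
    """Return directories that the map text never mentions as `name/`."""
    # Assumption: a directory counts as mapped if its name appears
    # immediately before a slash anywhere in the map text.
    mentioned = set(re.findall(r"[A-Za-z0-9_.\-]+(?=/)", map_text))
    return sorted(name for name in directories if name not in mentioned)
```

For example, `unmapped_directories(top_level_directories("."), Path("MAP.md").read_text())` run at the repo root would list directories that exist on disk but have no entry in the map, which is one concrete form of "the guide does not reflect reality".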

Collaborator

Parked (p3, M10 closure plan #653 + Judging Claw priority). Reactivate when an agent-readiness audit is scheduled, using Repomix --compress plus CodeGraph (see prior comment).

Author
Collaborator

{
  "confidence": 5,
  "effort_hint": "large",
  "escalation": {
    "kind": "none",
    "reason": ""
  },
  "evidence_refs": [
    {
      "note": "Issue proposes a ProgramBench-style agent-readiness audit for the platform repository.",
      "type": "forgejo",
      "value": "issue-title-body-labels-and-target-snapshot"
    },
    {
      "note": "Body frames the risk as agents producing monolithic brittle patches when repo boundaries, docs, tests, and contracts are unclear.",
      "type": "forgejo",
      "value": "issue-body-risk"
    },
    {
      "note": "Snapshot labels mark the issue as priority p3, not-ready, large research, and parked.",
      "type": "snapshot",
      "value": "target-snapshot-labels"
    }
  ],
  "impact": 3,
  "judge_actor": {
    "name": "iskra",
    "runtime": "openclaw"
  },
  "judged_at": "2026-06-10T01:01:00Z",
  "labels_to_apply": [
    "judge/p3"
  ],
  "piotr_fit": "medium",
  "priority": "p3",
  "rationale_summary": "This is P3 parked research because agent-readiness matters, but the issue is large, not-ready, and should wait behind smaller repo-boundary and test-contract work.",
  "reach": 4,
  "recommended_next_action": "observe",
  "rerun_reason": "no_prior_judgment",
  "schema": "openclaw.judge.v0",
  "target": {
    "kind": "issue",
    "number": 384,
    "repo": "pdurlej/platform"
  },
  "target_snapshot": {
    "body_hash": "sha256:482a7e27d1d8d99b16c0741ed312a59fc792f94c65b02a4ec15669c963bafb48",
    "commit_count": null,
    "evidence_hash": "sha256:aab0ca8e0e9940aad043ead7fd7ce808632b8aadbafd4901458d00d960415969",
    "head_sha": null,
    "labels": [
      "agent/claude-code",
      "kind/research",
      "not-ready",
      "priority:p3",
      "risk/process",
      "size/large",
      "status:parked"
    ],
    "labels_hash": "sha256:7775b8d91ebd9f5863dac98a23789e5acdcb4f8095a9dff8116496f9a41dd6cd",
    "state": "open",
    "title_hash": "sha256:0363dd5e016c6f07344bb67591a61e6d4af15dc73f28c5bc5e3788c41fd9ae24",
    "updated_at": "2026-06-01T08:53:16+02:00"
  },
  "top_caveat": "Do not turn this into a broad audit program until a small scoped pilot defines evidence, scoring, and follow-up actions."
}
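A record like this can be sanity-checked before its labels are applied. The sketch below is an illustration only: the field names are taken from the record in this comment, but the consistency rules (that `labels_to_apply` carries `judge/<priority>` and that the snapshot labels carry `priority:<priority>`) are assumptions inferred from this one example, not a documented `openclaw.judge.v0` contract, and both function names are hypothetical.

```python
EXPECTED_SCHEMA = "openclaw.judge.v0"


def check_judge_record(record):
    """Return a list of inconsistencies found in a parsed judge record."""
    problems = []
    if record.get("schema") != EXPECTED_SCHEMA:
        problems.append("unexpected schema")
    priority = record.get("priority")
    # Assumed rule: the judge label mirrors the priority, e.g. p3 -> judge/p3.
    if f"judge/{priority}" not in record.get("labels_to_apply", []):
        problems.append("labels_to_apply does not match priority")
    # Assumed rule: the judged snapshot already carried the same priority label.
    snapshot_labels = record.get("target_snapshot", {}).get("labels", [])
    if f"priority:{priority}" not in snapshot_labels:
        problems.append("snapshot labels do not match priority")
    return problems


def snapshot_is_stale(record, current_updated_at):
    """True if the issue changed after the snapshot the judgment was based on."""
    return record["target_snapshot"]["updated_at"] != current_updated_at
```

With the record above parsed via `json.loads`, `check_judge_record` would return an empty list, and `snapshot_is_stale` would return `True` for any `updated_at` other than `2026-06-01T08:53:16+02:00`.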

Reference
pdurlej/platform#384