pdurlej/platform

audit(agent-readiness): ProgramBench-style architecture/manageability audit for platform repo #384

Open

opened 2026-05-19 02:12:32 +02:00 by Iskra · 5 comments

Iskra commented

2026-05-19 02:12:32 +02:00

Collaborator

Source

Signal request from Piotr to create a Platform issue after discussing arXiv:2605.03546 — ProgramBench: Can Language Models Rebuild Programs From Scratch?

Why this matters

ProgramBench is useful not because “models are bad at coding”, but because it shows a specific failure mode: when asked to build/rebuild whole programs, agents tend to produce compressed, monolithic imitations of behavior instead of well-decomposed, maintainable systems.

That risk applies directly to our platform repo: if the repo shape, docs, tests, contracts, and ownership boundaries are unclear, future Codex/Claude/Iskra work will drift toward big patches, hidden coupling, and brittle “passes the local task” fixes.

Goal

Run a ProgramBench-style agent-readiness audit of pdurlej/platform:

Can an agent understand, modify, test, and improve this repo in small safe steps without turning it into monolithic sludge?

This is not a benchmark vanity exercise. It is an operational audit for whether the platform codebase is manageable by AI agents and human reviewers.

Audit questions

1. Repo orientation

Is there a clear map of major subsystems?
Can a fresh agent tell what owns what within 10–15 minutes?
Are entrypoints, services, workers, MCP/server pieces, and deployment paths discoverable?
Is there an AGENTS.md / operator guide that reflects reality?

2. Decomposition quality

Are modules/folders aligned to product/runtime concepts?
Are boundaries explicit enough to avoid one huge “fix everything” edit?
Are there obvious god files, overloaded scripts, or implicit cross-coupling?
Can a change be scoped to one subsystem with predictable blast radius?

3. Behavioral testability

Are core user-visible workflows covered by tests or smoke probes?
Can an agent verify a change without needing tribal knowledge?
Are there cheap gates: lint, typecheck, unit, integration, smoke, fixture-based contracts?
Are failure modes represented as regression tests, not just issues/chat memory?

4. Contract/evidence layer

Are API/MCP/runtime contracts written down and tested?
Do important outputs have schemas or stable fixtures?
Can an agent cite evidence paths/logs/test output after a change?
Does health distinguish liveness, readiness, and actual usefulness?
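
The last question in this section can be made concrete with a small sketch. It assumes three probe callables (`is_live`, `is_ready`, `is_useful`) that are hypothetical and not taken from the platform repo; the status names are likewise illustrative. The point is only that "usefulness" is reported as its own tier rather than folded into readiness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HealthReport:
    live: bool    # the process is up and answering
    ready: bool   # dependencies are reachable, work can be accepted
    useful: bool  # an end-to-end probe returned a sane result

    @property
    def status(self) -> str:
        # Report the first tier that fails, so "up but useless" is visible.
        if not self.live:
            return "down"
        if not self.ready:
            return "starting"
        if not self.useful:
            return "degraded"
        return "ok"


def check_health(
    is_live: Callable[[], bool],
    is_ready: Callable[[], bool],
    is_useful: Callable[[], bool],
) -> HealthReport:
    """Run the probes in order; later probes are skipped once a tier fails."""
    live = is_live()
    ready = live and is_ready()
    useful = ready and is_useful()
    return HealthReport(live=live, ready=ready, useful=useful)
```

A service that answers pings and reaches its database but returns empty results from a real query would come back as `degraded` here instead of `ok`.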

5. Agent workflow safety

Can Codex/Claude work in small PRs without hidden credentials or operator identity leaks?
Are write paths least-privilege and clearly gated?
Are destructive/runtime mutations separated from read-only audit paths?
Is there a safe “proposal first, apply after approval” lane where needed?

6. Maintainability under repeated agent edits

Do recent agent-authored commits preserve architecture or accrete patches?
Are there signs of benchmark-style failure: giant single-file edits, duplicated logic, local hacks, unclear ownership?
What refactors would reduce future agent error rate?
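
One of the signs listed above, giant single-file edits, can be checked mechanically from commit history. The sketch below parses the output of `git log --numstat --format='commit %H'` and flags commits where one file accounts for most of a large change. The thresholds (`max_lines`, `min_share`) are arbitrary assumptions to tune, and the helper names are illustrative; nothing here is existing platform tooling.

```python
import subprocess
from collections import defaultdict


def git_numstat(repo: str, limit: int = 200) -> str:
    """Return raw numstat output for the most recent commits in `repo`."""
    result = subprocess.run(
        ["git", "log", "--numstat", "--format=commit %H", "-n", str(limit)],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return result.stdout


def parse_numstat(log_text: str) -> dict[str, dict[str, int]]:
    """Map commit sha -> {path: lines added + deleted}."""
    commits: dict[str, dict[str, int]] = defaultdict(dict)
    sha = None
    for line in log_text.splitlines():
        if line.startswith("commit "):
            sha = line.split()[1]
        elif line.strip() and sha is not None:
            added, deleted, path = line.split("\t", 2)
            if added == "-":  # binary files carry no line counts
                continue
            commits[sha][path] = int(added) + int(deleted)
    return dict(commits)


def flag_monolithic(commits, max_lines: int = 400, min_share: float = 0.8):
    """Return (sha, path, lines) for large commits dominated by one file."""
    flagged = []
    for sha, files in commits.items():
        total = sum(files.values())
        if total == 0 or total < max_lines:
            continue
        path, lines = max(files.items(), key=lambda kv: kv[1])
        if lines / total >= min_share:
            flagged.append((sha, path, lines))
    return flagged
```

Running `flag_monolithic(parse_numstat(git_numstat(".")))` at the repo root would give a first list of commits worth reading by hand; it says nothing about duplicated logic or unclear ownership, which still need review.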

Suggested deliverable

A short audit doc plus issues/PR plan:

1. Subsystem map: current architecture, owners, risky areas.
2. Agent-readiness scorecard: orientation, decomposition, testability, contracts, safety.
3. Top 5 blockers to safe agentic maintenance.
4. Small PR sequence to improve repo manageability.
5. Regression gates that should exist before larger autonomous work.
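
As one possible shape for the scorecard and the blocker list, here is a minimal sketch. The five dimensions come from the deliverable list above; the 0 to 3 scale, the `Finding` type, and the helper names are assumptions made for illustration, not an agreed format.

```python
from dataclasses import dataclass, field

# Dimensions named in the deliverable list.
DIMENSIONS = ("orientation", "decomposition", "testability", "contracts", "safety")


@dataclass
class Finding:
    dimension: str
    score: int  # assumed scale: 0 (absent) to 3 (solid)
    evidence: list[str] = field(default_factory=list)  # files, commands, logs

    def __post_init__(self) -> None:
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")
        if not 0 <= self.score <= 3:
            raise ValueError(f"score out of range: {self.score}")


def blockers(findings: list[Finding], limit: int = 5) -> list[Finding]:
    """Lowest-scoring findings first, capped at `limit`."""
    return sorted(findings, key=lambda f: f.score)[:limit]


def ungrounded(findings: list[Finding]) -> list[Finding]:
    """Findings that cite no repo evidence and so need more work."""
    return [f for f in findings if not f.evidence]
```

Keeping `evidence` as a required-in-practice field means a finding with nothing attached can be listed by `ungrounded` and sent back before the audit is accepted.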

Acceptance criteria

Audit is grounded in actual repo evidence: files, tests, commands, logs, docs.
No vague “needs cleanup” without a concrete path.
Produces actionable follow-up issues or a small PR stack.
Explicitly flags where an agent would currently be tempted into monolithic/rewrite-style behavior.
Recommends how to make future work smaller, safer, and more reviewable.

Reference

arXiv:2605.03546 — ProgramBench: Can Language Models Rebuild Programs From Scratch?
Key lesson: passing partial behavioral tests is not the same as having good architecture; agents need decomposition, behavioral tests, and evidence/control-plane support.

codex referenced this issue

2026-05-19 08:24:08 +02:00

decisions(0022): define module source and release boundaries #385

codex added this to the 05 - Platform modularity foundation milestone

2026-05-19 08:36:23 +02:00

codex referenced this issue

2026-05-19 08:45:06 +02:00

state(roadmap): consolidate current platform milestones #386

codex referenced this issue

2026-05-19 08:54:44 +02:00

meta(module-catalog): implement ADR-0022 source/artifact metadata pilot #388

codex referenced this issue

2026-05-28 01:21:53 +02:00

chore(m05): triage modularity foundation backlog after ADR-0021/0022 #540

codex commented

2026-05-29 15:58:15 +02:00

Collaborator

M05 disposition: move to M06 and keep parked.

The ProgramBench-style agent-readiness audit is valuable, but it is broad agent/governance work, not M05 foundation closeout. Before implementation it should be split into a bounded audit packet with concrete files and pass/fail questions.

codex modified the milestone from 05 - Platform modularity foundation to 06 - Agent execution and CI governance

2026-05-29 15:58:15 +02:00

codex added the

labels

2026-05-29 15:58:15 +02:00

codex referenced this issue

2026-05-29 15:58:16 +02:00

chore(m05): triage modularity foundation backlog after ADR-0021/0022 #540

codex commented

2026-05-29 16:23:48 +02:00

Collaborator

M10 disposition: moved to 10 - Improvements.

What this is: ProgramBench-style agent-readiness audit.

Why parked here: This is a valuable manageability audit, but broad and diagnostic; park it until we deliberately run an agent-readiness improvement pass.

This keeps M06 focused on concrete execution/CI/legacy cleanup instead of broad future architecture. Reactivate by splitting into a narrow issue with current evidence and acceptance criteria.

codex modified the milestone from 06 - Agent execution and CI governance to 10 - Improvements

2026-05-29 16:23:48 +02:00

codex referenced this issue

2026-05-29 16:59:57 +02:00

chore(w9): milestone archaeology packet for stale open issues #535

claude commented

2026-05-30 01:48:44 +02:00

Collaborator

Tooling note (2026-05-30): use Repomix --compress as the input artifact for this audit.

Per the context-stack tooling decision, npx repomix@latest --compress produces a Tree-sitter-based, token-efficient whole-repo snapshot — ideal as the one-shot input for a ProgramBench-style architecture/manageability read. It complements (does not replace) CodeGraph (dynamic symbol graph) and Serena (LSP symbol-level ops): Repomix is the pocket snapshot for big one-off analyses, not constant context.

Suggested flow when this audit runs:

1. Run npx repomix@latest --compress at the repo root to produce a compressed snapshot.
2. Feed the snapshot in as the audit baseline and cross-check its structure against MAP.md and docs/domains/.
3. Use CodeGraph/Serena for any drill-down the snapshot flags.

Pocket tool only — not added as a standing MCP.
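
Step 1 of the flow could be wrapped roughly as follows. This is a sketch, not existing tooling: the function names are made up, and the `--output` flag is assumed from Repomix's CLI options, so check it against the installed version. Only `--compress` is taken from the note above.

```python
import shutil
import subprocess
from pathlib import Path


def repomix_command(output: Path) -> list[str]:
    """Build the snapshot command; `--output` is an assumed Repomix option."""
    return ["npx", "repomix@latest", "--compress", "--output", str(output)]


def snapshot(repo_root: Path, output: Path) -> Path:
    """Run Repomix at the repo root and return the snapshot path."""
    if shutil.which("npx") is None:
        raise RuntimeError("npx not found on PATH")
    subprocess.run(repomix_command(output), cwd=repo_root, check=True)
    return output
```

Building the command separately from running it keeps the one-off nature of the tool visible: the snapshot is produced on demand and handed to the audit, and nothing is registered as a standing service.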

claude added the

agent/claude-code

label

2026-05-30 13:08:26 +02:00

claude added the

priority:p3

label

2026-05-30 22:43:00 +02:00

claude added the

status:parked

label

2026-06-01 08:53:16 +02:00

claude commented

2026-06-01 08:53:16 +02:00

Collaborator

Parked (p3, M10 closure plan #653 + Judging Claw priority). Reactivate when an agent-readiness audit is scheduled — use Repomix --compress + codegraph (see prior comment).

Iskra commented

2026-06-10 03:03:16 +02:00

Author

Collaborator

{
  "confidence": 5,
  "effort_hint": "large",
  "escalation": {
    "kind": "none",
    "reason": ""
  },
  "evidence_refs": [
    {
      "note": "Issue proposes a ProgramBench-style agent-readiness audit for the platform repository.",
      "type": "forgejo",
      "value": "issue-title-body-labels-and-target-snapshot"
    },
    {
      "note": "Body frames the risk as agents producing monolithic brittle patches when repo boundaries, docs, tests, and contracts are unclear.",
      "type": "forgejo",
      "value": "issue-body-risk"
    },
    {
      "note": "Snapshot labels mark the issue as priority p3, not-ready, large research, and parked.",
      "type": "snapshot",
      "value": "target-snapshot-labels"
    }
  ],
  "impact": 3,
  "judge_actor": {
    "name": "iskra",
    "runtime": "openclaw"
  },
  "judged_at": "2026-06-10T01:01:00Z",
  "labels_to_apply": [
    "judge/p3"
  ],
  "piotr_fit": "medium",
  "priority": "p3",
  "rationale_summary": "This is P3 parked research because agent-readiness matters, but the issue is large, not-ready, and should wait behind smaller repo-boundary and test-contract work.",
  "reach": 4,
  "recommended_next_action": "observe",
  "rerun_reason": "no_prior_judgment",
  "schema": "openclaw.judge.v0",
  "target": {
    "kind": "issue",
    "number": 384,
    "repo": "pdurlej/platform"
  },
  "target_snapshot": {
    "body_hash": "sha256:482a7e27d1d8d99b16c0741ed312a59fc792f94c65b02a4ec15669c963bafb48",
    "commit_count": null,
    "evidence_hash": "sha256:aab0ca8e0e9940aad043ead7fd7ce808632b8aadbafd4901458d00d960415969",
    "head_sha": null,
    "labels": [
      "agent/claude-code",
      "kind/research",
      "not-ready",
      "priority:p3",
      "risk/process",
      "size/large",
      "status:parked"
    ],
    "labels_hash": "sha256:7775b8d91ebd9f5863dac98a23789e5acdcb4f8095a9dff8116496f9a41dd6cd",
    "state": "open",
    "title_hash": "sha256:0363dd5e016c6f07344bb67591a61e6d4af15dc73f28c5bc5e3788c41fd9ae24",
    "updated_at": "2026-06-01T08:53:16+02:00"
  },
  "top_caveat": "Do not turn this into a broad audit program until a small scoped pilot defines evidence, scoring, and follow-up actions."
}
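
A record like the one above lends itself to a cheap consistency check before labels are applied. The sketch below is an assumption-heavy illustration: the `openclaw.judge.v0` schema is not documented in this thread, so the rules (a `judge/<priority>` label, `sha256:` digests on `*_hash` fields, integer scores) are inferred from this single record, and `check_judgment` is a hypothetical helper.

```python
import re

# Inferred from the record above: "sha256:" followed by 64 lowercase hex chars.
SHA256 = re.compile(r"^sha256:[0-9a-f]{64}$")


def check_judgment(record: dict) -> list[str]:
    """Return a list of human-readable problems; empty means consistent."""
    problems = []

    # Assumed rule: the applied label mirrors the judged priority.
    priority = record.get("priority")
    if f"judge/{priority}" not in record.get("labels_to_apply", []):
        problems.append(f"labels_to_apply lacks judge/{priority}")

    # Assumed rule: numeric scores are plain integers.
    for key in ("confidence", "impact", "reach"):
        if not isinstance(record.get(key), int):
            problems.append(f"{key} is not an integer")

    # Assumed rule: every *_hash field in the snapshot is a sha256 digest.
    for key, value in record.get("target_snapshot", {}).items():
        if key.endswith("_hash") and not SHA256.match(str(value)):
            problems.append(f"{key} is not a sha256 digest")

    return problems
```

Loading the comment body with `json.loads` and passing the result to `check_judgment` would catch a mismatched label or a truncated hash; it does not validate the judgment itself.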

Iskra added the

judge/p3

label

2026-06-10 03:03:16 +02:00