audit(agent-readiness): ProgramBench-style architecture/manageability audit for platform repo #384
Reference
pdurlej/platform#384
Source
Signal request from Piotr to create a Platform issue after discussing arXiv:2605.03546 — ProgramBench: Can Language Models Rebuild Programs From Scratch?
Why this matters
ProgramBench is useful not because “models are bad at coding”, but because it shows a specific failure mode: when asked to build/rebuild whole programs, agents tend to produce compressed, monolithic imitations of behavior instead of well-decomposed, maintainable systems.
That risk applies directly to our platform repo: if the repo shape, docs, tests, contracts, and ownership boundaries are unclear, future Codex/Claude/Iskra work will drift toward big patches, hidden coupling, and brittle “passes the local task” fixes.
Goal
Run a ProgramBench-style agent-readiness audit of pdurlej/platform.
This is not a benchmark vanity exercise. It is an operational audit of whether the platform codebase is manageable by AI agents and human reviewers.
Audit questions
1. Repo orientation
AGENTS.md / operator guide that reflects reality?
2. Decomposition quality
3. Behavioral testability
4. Contract/evidence layer
5. Agent workflow safety
6. Maintainability under repeated agent edits
Suggested deliverable
A short audit doc plus issues/PR plan:
Acceptance criteria
Reference
M05 disposition: move to M06 and keep parked.
The ProgramBench-style agent-readiness audit is valuable, but it is broad agent/governance work, not M05 foundation closeout. Before implementation it should be split into a bounded audit packet with concrete files and pass/fail questions.
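One way to picture a "bounded audit packet with concrete files and pass/fail questions" is as a small data structure. The sketch below is only an illustration: the class names, the `is_bounded` rule, and the sample question are hypothetical and not part of any existing platform tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AuditQuestion:
    """One pass/fail question tied to concrete files (hypothetical shape)."""
    text: str
    files: List[str]
    passed: Optional[bool] = None  # None means not yet evaluated


@dataclass
class AuditPacket:
    """A bounded unit of audit work: a named scope plus its questions."""
    scope: str
    questions: List[AuditQuestion] = field(default_factory=list)

    def is_bounded(self) -> bool:
        # Assumption: "bounded" means at least one question exists
        # and every question names at least one concrete file.
        return bool(self.questions) and all(q.files for q in self.questions)

    def summary(self) -> Dict[str, int]:
        # Count questions by outcome: pass, fail, or still open.
        counts = {"pass": 0, "fail": 0, "open": 0}
        for q in self.questions:
            if q.passed is None:
                counts["open"] += 1
            elif q.passed:
                counts["pass"] += 1
            else:
                counts["fail"] += 1
        return counts


# Illustrative packet; the question wording is an assumption.
packet = AuditPacket(
    scope="repo-orientation",
    questions=[
        AuditQuestion(
            text="Does AGENTS.md reflect the current repo layout?",
            files=["AGENTS.md"],
        ),
    ],
)
```

A packet that fails `is_bounded()` would be a signal to keep refining the issue before any implementation starts.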
M10 disposition: moved to 10 - Improvements.
What this is: ProgramBench-style agent-readiness audit.
Why parked here: This is a valuable manageability audit, but broad and diagnostic; park it until we deliberately run an agent-readiness improvement pass.
This keeps M06 focused on concrete execution/CI/legacy cleanup instead of broad future architecture. Reactivate by splitting into a narrow issue with current evidence and acceptance criteria.
Tooling note (2026-05-30): use Repomix --compress as the input artifact for this audit.
Per the context-stack tooling decision, npx repomix@latest --compress produces a Tree-sitter-based, token-efficient whole-repo snapshot, ideal as the one-shot input for a ProgramBench-style architecture/manageability read. It complements (does not replace) CodeGraph (dynamic symbol graph) and Serena (LSP symbol-level ops): Repomix is the pocket snapshot for big one-off analyses, not constant context.
Suggested flow when this audit runs:
npx repomix@latest --compress at repo root → compressed snapshot.
MAP.md + docs/domains/.
Pocket tool only, not added as a standing MCP.
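The first step of the flow above could be wrapped in a small helper so the audit always uses the same invocation. This is a minimal sketch: the function names are hypothetical, the only Repomix flag used is the `--compress` flag named in this note, and where the snapshot file lands is left to Repomix's own defaults or config.

```python
import subprocess
from pathlib import Path
from typing import List


def repomix_command(compress: bool = True) -> List[str]:
    """Build the argv for the snapshot step (hypothetical helper)."""
    cmd = ["npx", "repomix@latest"]
    if compress:
        cmd.append("--compress")
    return cmd


def snapshot_repo(repo_root: str) -> subprocess.CompletedProcess:
    """Run the snapshot command at the repo root.

    Requires Node.js/npx on PATH; raises CalledProcessError on a non-zero exit.
    """
    root = Path(repo_root)
    if not root.is_dir():
        raise NotADirectoryError(repo_root)
    return subprocess.run(repomix_command(), cwd=root, check=True)
```

Keeping the command construction separate from the `subprocess.run` call makes the invocation easy to check without actually running npx.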
Parked (p3, M10 closure plan #653 + Judging Claw priority). Reactivate when an agent-readiness audit is scheduled — use Repomix --compress + codegraph (see prior comment).
{
"confidence": 5,
"effort_hint": "large",
"escalation": {
"kind": "none",
"reason": ""
},
"evidence_refs": [
{
"note": "Issue proposes a ProgramBench-style agent-readiness audit for the platform repository.",
"type": "forgejo",
"value": "issue-title-body-labels-and-target-snapshot"
},
{
"note": "Body frames the risk as agents producing monolithic brittle patches when repo boundaries, docs, tests, and contracts are unclear.",
"type": "forgejo",
"value": "issue-body-risk"
},
{
"note": "Snapshot labels mark the issue as priority p3, not-ready, large research, and parked.",
"type": "snapshot",
"value": "target-snapshot-labels"
}
],
"impact": 3,
"judge_actor": {
"name": "iskra",
"runtime": "openclaw"
},
"judged_at": "2026-06-10T01:01:00Z",
"labels_to_apply": [
"judge/p3"
],
"piotr_fit": "medium",
"priority": "p3",
"rationale_summary": "This is P3 parked research because agent-readiness matters, but the issue is large, not-ready, and should wait behind smaller repo-boundary and test-contract work.",
"reach": 4,
"recommended_next_action": "observe",
"rerun_reason": "no_prior_judgment",
"schema": "openclaw.judge.v0",
"target": {
"kind": "issue",
"number": 384,
"repo": "pdurlej/platform"
},
"target_snapshot": {
"body_hash": "sha256:482a7e27d1d8d99b16c0741ed312a59fc792f94c65b02a4ec15669c963bafb48",
"commit_count": null,
"evidence_hash": "sha256:aab0ca8e0e9940aad043ead7fd7ce808632b8aadbafd4901458d00d960415969",
"head_sha": null,
"labels": [
"agent/claude-code",
"kind/research",
"not-ready",
"priority:p3",
"risk/process",
"size/large",
"status:parked"
],
"labels_hash": "sha256:7775b8d91ebd9f5863dac98a23789e5acdcb4f8095a9dff8116496f9a41dd6cd",
"state": "open",
"title_hash": "sha256:0363dd5e016c6f07344bb67591a61e6d4af15dc73f28c5bc5e3788c41fd9ae24",
"updated_at": "2026-06-01T08:53:16+02:00"
},
"top_caveat": "Do not turn this into a broad audit program until a small scoped pilot defines evidence, scoring, and follow-up actions."
}
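A record like the one above can be sanity-checked before its labels are applied. The sketch below assumes, purely for illustration, that the `judge/<priority>` label and the `priority:<priority>` snapshot label should both agree with the `priority` field. That rule is inferred from this single record and is not a documented part of the `openclaw.judge.v0` schema. The function name and the list of required keys are hypothetical.

```python
import json
from typing import List

# Assumption: these keys are treated as required for the check below.
REQUIRED_KEYS = {"schema", "target", "priority", "labels_to_apply", "target_snapshot"}


def check_judge_record(raw: str) -> List[str]:
    """Return a list of problems; an empty list means no inconsistency was found."""
    record = json.loads(raw)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    if record["schema"] != "openclaw.judge.v0":
        problems.append(f"unexpected schema: {record['schema']}")

    priority = record["priority"]
    if f"judge/{priority}" not in record["labels_to_apply"]:
        problems.append(f"labels_to_apply lacks judge/{priority}")
    if f"priority:{priority}" not in record["target_snapshot"].get("labels", []):
        problems.append(f"snapshot labels lack priority:{priority}")
    return problems


# Trimmed copy of the record above, keeping only the fields the check reads.
sample = json.dumps({
    "schema": "openclaw.judge.v0",
    "priority": "p3",
    "labels_to_apply": ["judge/p3"],
    "target": {"kind": "issue", "number": 384, "repo": "pdurlej/platform"},
    "target_snapshot": {"labels": ["priority:p3", "status:parked"]},
})
problems = check_judge_record(sample)
```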