Explore long-running agent harness patterns for BMADX + Heartswarm #2

Open
opened 2026-05-22 01:20:25 +02:00 by Iskra · 1 comment
Collaborator

Context

Piotr flagged this talk as worth turning into exploration work for BMADX + Heartswarm:

  • Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic
  • Source: https://youtu.be/mR-WAvEPRwE
  • Summary captured in Matrix project-summaries on 2026-05-22.

Core thesis from the talk: long-running agents do not become reliable through a stronger prompt alone. They need a harness/scaffold around the model: explicit planning, durable state, separate evaluators, verification loops, checkpointing, and trace reading.

Exploration goal

Investigate which of these long-run agent patterns should be adapted into BMADX and/or Heartswarm so multi-hour runs do not lose the plot, rubber-stamp half-done work, or drift after context compaction.

Patterns to evaluate

  1. Planner / generator / evaluator split

    • Planner defines scope and success criteria.
    • Generator executes.
    • Evaluator is a separate critical reviewer, not the same agent self-grading.
    • The generator (builder) and evaluator should agree on a definition of done before implementation starts.
  2. Durable run state outside model context

    • Store current objective, subgoals, decisions, blockers, test results, and next checkpoint in files/JSON.
    • Treat context compaction as lossy; do not rely on summaries as source of truth.
  3. Evaluator as QA harness

    • For code/UI work, the evaluator should use real tools: tests, Playwright/browser checks, screenshots, console/network errors, smoke flows.
    • CI alone is not enough; check the product like a user where relevant.
  4. Checkpoint + continuation contract

    • Long runs should emit compact, machine-readable checkpoints.
    • On resume, the next agent should know what is proven, what is assumed, what is still unverified, and what must not be repeated.
  5. Trace-reading feedback loop

    • Keep inspectable traces from failed runs.
    • Use them to update prompt templates, harness rules, skills, or runbooks.
    • Calibration should come from real failures, not abstract rules.
  6. Scaffold retirement for stronger models

    • Some guardrails are patches for old model weaknesses.
    • Periodically test whether old scaffolding now causes friction, double-work, or stale constraints.
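
To make pattern 1 concrete, here is a minimal sketch of the planner / generator / evaluator split. Every name in it (`Criterion`, `Verdict`, `evaluate`, `run_split`) is hypothetical and not part of BMADX or Heartswarm; the planner and generator are stand-in callables where real agents would go. It only illustrates that the definition of done is fixed before generation and that the evaluator, not the generator, decides whether the work passes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One check in the definition of done, agreed before implementation."""
    name: str
    check: Callable[[str], bool]  # receives the generator's artifact


@dataclass
class Verdict:
    passed: bool
    failures: list[str]


def evaluate(artifact: str, criteria: list[Criterion]) -> Verdict:
    """Evaluator role: judge the artifact only against the agreed criteria."""
    failures = [c.name for c in criteria if not c.check(artifact)]
    return Verdict(passed=not failures, failures=failures)


def run_split(
    plan: Callable[[], list[Criterion]],
    generate: Callable[[list[str]], str],
    max_rounds: int = 3,
) -> tuple[str, Verdict]:
    """Planner fixes the criteria once; generator and evaluator then iterate."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    criteria = plan()  # definition of done, set before any generation
    feedback: list[str] = []
    for _ in range(max_rounds):
        artifact = generate(feedback)  # generator sees only evaluator feedback
        verdict = evaluate(artifact, criteria)
        if verdict.passed:
            break
        feedback = verdict.failures
    return artifact, verdict
```

In a real setup the `check` callables would wrap tests or browser checks rather than string predicates, and the evaluator would run as a separate agent or process so it cannot share the generator's context.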

Questions for BMADX / Heartswarm

  • Where should evaluator agents live: inside BMADX, Heartswarm, or as shared harness primitives?
  • What is the minimal shared run_state.json schema?
  • What should count as done for different work types: code, research, planning, UX, ops?
  • Can Heartswarm require a verifier lane before marking swarm work complete?
  • Can BMADX expose reusable rubrics for subjective review: design, originality, usefulness, correctness, maintainability?
  • How do we prevent evaluator/generator collusion or mutual rubber-stamping?
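
As a starting point for the `run_state.json` question, the sketch below shows one possible shape for the file, combining the fields listed under patterns 2 and 4. The field names, the `RunState` class, and the `save_state` / `load_state` helpers are assumptions made for illustration, not an agreed schema.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field


@dataclass
class RunState:
    """Hypothetical run state kept on disk, outside the model context."""
    objective: str
    subgoals: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    blockers: list[str] = field(default_factory=list)
    test_results: dict[str, str] = field(default_factory=dict)
    next_checkpoint: str = ""
    # Continuation contract: what a resuming agent may and may not rely on.
    proven: list[str] = field(default_factory=list)
    assumed: list[str] = field(default_factory=list)
    unverified: list[str] = field(default_factory=list)
    do_not_repeat: list[str] = field(default_factory=list)


def save_state(state: RunState, path: str) -> None:
    """Write to a temp file, then rename, so a crash cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(asdict(state), fh, indent=2)
    os.replace(tmp_path, path)


def load_state(path: str) -> RunState:
    """Read the state back; the file, not a context summary, is the source of truth."""
    with open(path, encoding="utf-8") as fh:
        return RunState(**json.load(fh))
```

Keeping `proven`, `assumed`, and `unverified` as separate lists is a design choice in this sketch: it forces the writing agent to say which claims have evidence behind them before the next agent resumes.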

Acceptance criteria for this exploration

  • Produce a short design note comparing current BMADX/Heartswarm behavior against the patterns above.
  • Propose a minimal v0 harness interface for long-running work.
  • Identify one small pilot flow where a separate evaluator would materially improve reliability.
  • Decide whether this becomes implementation work, a BMADX runbook, a Heartswarm primitive, or a shared convention.

Operator note

This is exploratory, not an immediate implementation request. The valuable thing is to extract operational principles from the Anthropic talk and adapt them to our actual system, not cargo-cult their architecture.


Addendum: Tejas Kumar / IBM — Harnesses in AI

Piotr flagged the second harness talk as also valuable for BMADX + Heartswarm:

  • Harnesses in AI: A Deep Dive — Tejas Kumar, IBM
  • Source: https://youtu.be/C_GG5g38vLU
  • Captured in Matrix project-summaries on 2026-05-22.

This talk is useful because it explains harnessing from first principles, not just for long-running agents. Core framing:

Harness = everything around the model that grounds it in reality and makes outcomes reliable.

Additional patterns to include in the exploration

  1. Stop prompt-hardening when the environment is the problem

    • In the demo, the prompt stayed unchanged.
    • Reliability improved by changing the harness: max iterations, trace, verify step, retry loop, deterministic login handler.
    • BMADX/Heartswarm should distinguish: prompt issue vs harness issue vs capability issue.
  2. Deterministic handlers for stable subproblems

    • If a step should be reliable and mechanical, do not ask the LLM to improvise it every time.
    • Example from talk: login flow handled outside the model.
    • Candidate Heartswarm pattern: deterministic handlers for auth/bootstrap/setup/status checks before model work begins.
  3. Verify step as a first-class harness primitive

    • The harness must be able to reject a model’s claimed success.
    • This maps directly to BMADX/Heartswarm completion gates: a task is not done because the model says so; it is done when verifier evidence exists.
  4. Agent loop is not the harness

    • The harness is the runtime around the agent loop: tools, context management, guardrails, retries, traces, capability boundaries, verification.
    • Useful distinction for BMADX architecture docs: avoid calling every loop a harness.
  5. Use weaker/cheaper models when the harness is strong

    • A good harness can make smaller models useful for bounded work.
    • Potential pilot: compare frontier model vs cheaper/local model on a constrained BMADX/Heartswarm task with the same harness.
  6. Dynamic/on-the-fly harnesses as future direction

    • Tejas speculates that the step after agents and harnesses is agent-generated harnesses: before doing the task, the agent designs the safety/verification/runtime scaffold for that task.
    • Candidate research question: can BMADX produce a task-specific mini-harness spec before Heartswarm execution?
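
Patterns 1 to 3 above can be sketched as a single loop. This is an illustrative outline only, not the code from the talk: `run_with_harness`, `HarnessResult`, and the three callables are hypothetical names. It shows a deterministic setup handler that runs outside the model, a bounded retry loop, a verify step that can reject the model's claimed success, and a trace of every attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class HarnessResult:
    done: bool
    output: Optional[str]
    trace: list[str] = field(default_factory=list)


def run_with_harness(
    setup: Callable[[], None],
    agent_step: Callable[[int], str],
    verify: Callable[[str], bool],
    max_iterations: int = 3,
) -> HarnessResult:
    """Run a model step under a harness that owns setup, retries, and verification."""
    trace: list[str] = []
    setup()  # deterministic handler (e.g. login/bootstrap), not improvised by the model
    trace.append("setup: ok")
    for attempt in range(1, max_iterations + 1):
        output = agent_step(attempt)  # the model's claimed result
        verified = verify(output)     # the harness, not the model, decides "done"
        trace.append(f"attempt {attempt}: output={output!r} verified={verified}")
        if verified:
            return HarnessResult(done=True, output=output, trace=trace)
    # Iteration budget exhausted: report failure instead of accepting the claim.
    return HarnessResult(done=False, output=None, trace=trace)
```

The prompt does not appear anywhere in this loop, which matches the point of pattern 1: reliability changes come from what surrounds the model call.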

Suggested extra acceptance criterion

  • Classify one BMADX/Heartswarm failure mode as a prompt, harness, capability, memory, or verification failure, and propose the smallest harness change that would prevent recurrence.
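
One lightweight way to record such a classification is a small typed record. The five class values come from the criterion above; the `FailureClass` and `FailureReport` names and the remaining fields are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class FailureClass(Enum):
    PROMPT = "prompt"
    HARNESS = "harness"
    CAPABILITY = "capability"
    MEMORY = "memory"
    VERIFICATION = "verification"


@dataclass(frozen=True)
class FailureReport:
    """Hypothetical record linking one observed failure to one proposed fix."""
    summary: str
    failure_class: FailureClass
    trace_ref: str        # pointer to the inspectable trace of the failed run
    smallest_change: str  # the proposed minimal harness change
```

Using an enum rather than free text means a typo such as `FailureClass("harnes")` raises `ValueError` instead of silently creating a sixth category.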
Reference
pdurlej/bdmax#2