test(v0): add boring PR lifecycle fixture (closes #33) #52

Merged
pdurlej merged 1 commit from claude/patchwarden-boring-pr-lifecycle-fixture into main 2026-05-27 12:37:58 +02:00
Collaborator

What

One new test file: tests/test_boring_pr_lifecycle.py (+151 lines). Pure test, zero src changes. Proves the deterministic PR gate — fed by the real policies/platform.v0.toml bundle — returns eligible_clean for safe docs/status PRs and needs_human for workflow / runtime / control-plane paths.

No Forgejo network calls. File-based policy load (load_bundle) + in-memory evaluate_pull_request.
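
The gate's behaviour can be pictured with a toy model: classify each changed path by prefix, then hold the PR for a human unless every path is safe. This is only a sketch under stated assumptions. `toy_classify`, `toy_evaluate`, `ToyResult`, the prefix tables, and the `exit_code == 0` for the clean case are hypothetical stand-ins, not patchwarden's `evaluate_pull_request` or the real contents of `policies/platform.v0.toml`.

```python
from dataclasses import dataclass, field

# Hypothetical prefix tables; the real rules live in policies/platform.v0.toml.
SAFE_PREFIXES = ("docs/", "state/")
BLOCKED_PREFIXES = {
    ".forgejo/workflows/": "workflow",
    "compose/": "runtime",
    "src/patchwarden/": "policy_governance",
}


@dataclass
class ToyResult:
    verdict: str
    exit_code: int
    held_by: list[str] = field(default_factory=list)


def toy_classify(path: str) -> str:
    """Map one changed path to a classification label by prefix."""
    for prefix, label in BLOCKED_PREFIXES.items():
        if path.startswith(prefix):
            return label
    if path.startswith(SAFE_PREFIXES):
        return "safe_docs_status"
    return "unknown"


def toy_evaluate(changed_files: list[str]) -> ToyResult:
    """Return eligible_clean only when every changed path is classified safe."""
    labels = {toy_classify(p) for p in changed_files}
    if labels == {"safe_docs_status"}:
        return ToyResult("eligible_clean", 0)  # exit code 0 is an assumption
    return ToyResult("needs_human", 1, sorted(labels - {"safe_docs_status"}))
```

In this toy model a single blocked or unrecognised path is enough to hold the whole PR, so a mixed docs-plus-runtime change still lands on `needs_human`.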

What's covered (5 tests)

Positive (safe_docs_status → eligible_clean)

| Test | Changed files | Why |
|---|---|---|
| `test_docs_prefix_pr_is_eligible_clean` | `docs/roadmap.md`, `docs/operations/notes.md` | Primary dogfood path on `pdurlej/platform` |
| `test_status_marker_pr_is_eligible_clean` | `state/STATUS_NOW.md`, `state/cycle/W6d-2026-05-25.md` | W6d status-marker lane (second proven path per `docs/operations/platform-dogfood.md` from #51) |

Negative (blocked classification → needs_human)

| Test | Changed files | Classification |
|---|---|---|
| `test_workflow_change_holds_for_human` | `.forgejo/workflows/patchwarden-client-dry-run.yml` | `workflow` |
| `test_runtime_change_holds_for_human` | `compose/docker-compose.yml` | `runtime` |
| `test_policy_governance_change_holds_for_human` | `src/patchwarden/pr_check.py` | `policy_governance` |

Each negative asserts: exit_code == 1, verdict == "needs_human", would_auto_merge_later == False. The workflow case also asserts manual_pr_class in blockers.
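
The shared assertions for the negative cases could be factored into one helper, roughly as below. The field names (`exit_code`, `verdict`, `would_auto_merge_later`, `blockers`) come from the description above; `StubResult` and `assert_held_for_human` are hypothetical names used only for this sketch, and the real result type returned by `evaluate_pull_request` may be shaped differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StubResult:
    """Hypothetical stand-in for the gate's result object."""
    exit_code: int
    verdict: str
    would_auto_merge_later: bool
    blockers: tuple[str, ...] = ()


def assert_held_for_human(result, expected_blocker=None):
    """Check the three invariants every negative case shares."""
    assert result.exit_code == 1
    assert result.verdict == "needs_human"
    # `is False` is slightly stricter than `== False`: it rejects 0 and None.
    assert result.would_auto_merge_later is False
    if expected_blocker is not None:
        assert expected_blocker in result.blockers


# Example: the workflow case additionally expects the manual_pr_class blocker.
workflow_result = StubResult(1, "needs_human", False, ("manual_pr_class",))
assert_held_for_human(workflow_result, expected_blocker="manual_pr_class")
```

Keeping the invariants in one place means a verdict flip fails every negative test in the same way, which makes the regression easier to spot.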

Why this matters

This is the calibration baseline for the W6d-automerge-calibration lane. If any of these five tests fails, the dogfood loop documented in docs/operations/platform-dogfood.md (PR #51) cannot be trusted. Without these end-to-end fixtures, a regression in policies/platform.v0.toml, policy_bundle.py, or pr_check.py that silently flips a verdict would go undetected.

Existing tests/test_pr_check.py uses DEFAULT_BUNDLE (in-memory, hardcoded). This new file uses load_bundle(Path("policies/platform.v0.toml")) — the actual TOML the dry-run workflow loads in production. Different failure mode, different coverage. Complementary, not redundant.

Spec sources (per codex issue #33)

  • tests/test_pipeline.py — pipeline shape (consulted, not edited)
  • tests/test_pr_check.py — pull_request() helper pattern inspired the _boring_pr() helper here
  • policies/platform.v0.toml — actual policy bundle (used as input)

Atomic per ADR-0017

  • 1 file, +151 lines, 0 src changes, 0 deletions.
  • Re-uses existing evaluate_pull_request, PullRequestInput, CheckStatus, load_bundle — no new public surface.
  • base=main, no stacking on #51 (already merged).

Anti-scope (per #33 no-go)

  • No model fixtures
  • No giant copied real PR payloads (each fixture is <10 lines)
  • No snapshot tests with unstable timestamps (deterministic SHAs only)
  • No Forgejo network (file-based bundle + dataclass input only)
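
One way to get deterministic SHAs in fixtures, shown purely as an illustration, is to derive them from a fixed label instead of from the clock or a random source. `fixture_sha` is a hypothetical helper; this description does not say how the actual test file produces its SHAs, and they may simply be hardcoded constants.

```python
import hashlib


def fixture_sha(label: str) -> str:
    """Derive a stable 40-hex-character pseudo-SHA from a fixture label."""
    return hashlib.sha1(label.encode("utf-8")).hexdigest()


head_sha = fixture_sha("boring-pr-head")
base_sha = fixture_sha("boring-pr-base")
```

Because the value depends only on the label, repeated runs produce identical inputs and the tests stay reproducible.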

Test count

133 → 138 green (+5 new, all passing on first run).

Token-accounting

~2-3% weekly Opus. Could have gone to a Sonnet sub-agent, but the brief overhead (read 3 existing test files + policy TOML + understand classify_files semantics) would have matched the writing cost. Net wash.

Closes #33.

New test file proving the deterministic PR gate, fed by the real
`policies/platform.v0.toml` bundle, returns `eligible_clean` for safe
docs/status PRs and `needs_human` for workflow / runtime / control-plane
paths. Pure file-based policy load + in-memory evaluation — zero
Forgejo network calls.

## What's covered (5 tests)

**Positive (safe_docs_status → eligible_clean):**
- `docs/` prefix — primary dogfood path on `pdurlej/platform`
- `state/` prefix — W6d status-marker lane (second proven path per
  `docs/operations/platform-dogfood.md`)

**Negative (blocked classification → needs_human):**
- `.forgejo/workflows/` → classification: `workflow`
- `compose/` → classification: `runtime`
- `src/patchwarden/` → classification: `policy_governance`

Each negative asserts:
- exit_code == 1
- verdict == "needs_human"
- would_auto_merge_later == False
- (for workflow) blockers include `manual_pr_class`

## Why this matters

This is the calibration baseline for the W6d-automerge-calibration
lane. If any of these five tests fails, the dogfood loop documented in
`docs/operations/platform-dogfood.md` cannot be trusted. Without these
end-to-end fixtures, a regression in `policies/platform.v0.toml`,
`policy_bundle.py`, or `pr_check.py` that silently flips a verdict
would go undetected.

The existing `tests/test_pr_check.py` uses `DEFAULT_BUNDLE` (in-memory).
This new file uses `load_bundle(Path("policies/platform.v0.toml"))` —
the actual TOML the dry-run workflow loads in production. Different
failure mode, different coverage.

## Atomic per ADR-0017

- 1 file, +131 lines, 0 src changes, 0 deletions.
- Re-uses existing `evaluate_pull_request`, `PullRequestInput`,
  `CheckStatus`, `load_bundle` — no new public surface.
- `base=main`, no stacking on #51 (already merged).

## Test impact

133 → 138 green (+5 new, all passing on first run).

## Spec sources per codex issue #33

- `tests/test_pipeline.py` — pipeline shape (consulted, not edited)
- `tests/test_pr_check.py` — PR check pattern (inspired the helper shape)
- `policies/platform.v0.toml` — actual policy bundle (used as input)

## Anti-scope

- No model fixtures (per #33 no-go).
- No giant copied real PR payloads (each fixture: <10 lines).
- No snapshot tests with unstable timestamps (deterministic SHAs only).
- No Forgejo network (file-based bundle load + dataclass input only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reference
pdurlej/patchwarden!52