[B-bis][enforcement] Blocking-precision contract test — make ADR 0013 non-theater #123

Open
opened 2026-06-17 23:20:13 +02:00 by claude · 0 comments
Collaborator

Architect direction (claude, Opus 4.8). This is "B-bis" — the real enforcement of ADR 0013, the thing that keeps the principle from being theater.

ADR 0013 (decisions/0013-blocking-requires-precision-evidence.md) decided: a rule may carry blocking severity only if its measured precision is ≥ 90% over ≥ 15 adjudicated findings. A principle in a doc enforces nothing. This issue builds the CI gate that makes severity a mechanical function of measured evidence.

Depends on: #119 (precision harness — produces benchmarks/precision/precision-snapshot.json). The golden ground truth is already committed: benchmarks/precision/golden-adjudication-2026-06-15.json.

Part 1 — The blocking-precision contract test

Add tests/test_blocking_precision_contract.py:

  1. Determine the set of rules whose default severity can be blocking — read from the analyzer's rule definitions (fallow_py.models.RULES or wherever severity defaults live), not a hand-maintained list (the test must track reality).
  2. Read benchmarks/precision/precision-snapshot.json (produced by the harness, #119).
  3. Assert: for every rule that can emit blocking, the snapshot shows precision ≥ 0.90 over ≥ 15 adjudicated findings. Otherwise fail with a message naming the rule, its measured precision, and the bar — e.g. missing-runtime-dependency is configured blocking but measured 3% precision (29 FP / 30) — downgrade or fix to ≥90% before shipping blocking (ADR 0013).
  4. The threshold constants (0.90, 15) live in one place, referencing ADR 0013.

This test makes it impossible to ship — or regress into — a blocking rule that has not earned it.

Part 2 — The immediate enforced consequence

Apply ADR 0013 §3: change missing-runtime-dependency's default severity/bucket from blocking to decision_needed.

  • It still surfaces (as decision_needed); only its CI-failing severity is withdrawn until evidence justifies it.
  • Update any tests/fixtures asserting it is blocking.
  • Leave a code comment + ADR reference at the rule definition: re-promotion is gated on the snapshot reaching ≥90% after #115/#117.
  • After this change, the Part 1 contract test must pass on main.
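
As an illustration of the comment and ADR reference asked for above, the rule definition might end up looking something like this. The Rule dataclass and the dict layout are hypothetical stand-ins; the analyzer's actual rule type and registry may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:  # hypothetical stand-in for the analyzer's real rule type
    rule_id: str
    default_severity: str


RULES = {
    # ADR 0013 section 3: demoted from "blocking" to "decision_needed".
    # Re-promotion is gated on precision-snapshot.json showing >= 90%
    # precision over >= 15 adjudicated findings, expected after #115/#117.
    "missing-runtime-dependency": Rule(
        rule_id="missing-runtime-dependency",
        default_severity="decision_needed",
    ),
}
```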

Acceptance criteria

  • tests/test_blocking_precision_contract.py exists, runs under pytest -q and in CI, and reads the live rule set (not a frozen copy).
  • With missing-runtime-dependency downgraded, the contract test is green on main.
  • A deliberate experiment — re-promoting missing-runtime-dependency to blocking without fixing precision — makes the test red (prove the gate bites; can be shown in the PR description, not committed).
  • No rule is blocking in the shipped config unless the snapshot backs it.
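
The "prove the gate bites" experiment can be reasoned about with a toy model before running it against the real repository. The sketch below is only that: gate_passes and the (true positives, false positives) tuples are hypothetical, and the counts reuse the 29 FP / 30 figure quoted earlier in this issue.

```python
MIN_PRECISION = 0.90   # ADR 0013
MIN_ADJUDICATED = 15   # ADR 0013


def gate_passes(severities, evidence):
    """True when every blocking rule meets both ADR 0013 thresholds."""
    for rule_id, severity in severities.items():
        if severity != "blocking":
            continue
        tp, fp = evidence.get(rule_id, (0, 0))
        total = tp + fp
        # The sample-size check runs first, so total == 0 never divides.
        if total < MIN_ADJUDICATED or tp / total < MIN_PRECISION:
            return False
    return True


# 1 true positive and 29 false positives out of 30 adjudicated findings.
evidence = {"missing-runtime-dependency": (1, 29)}

shipped = {"missing-runtime-dependency": "decision_needed"}
experiment = {**shipped, "missing-runtime-dependency": "blocking"}

print(gate_passes(shipped, evidence))     # downgraded rule: gate is green
print(gate_passes(experiment, evidence))  # re-promoted rule: gate is red
```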

Sequencing note

If #119 is not yet merged when this is picked up, Part 2 (the downgrade) can land first — it is a correct, ADR-mandated change on its own merits and needs no harness. Part 1 (the contract test) lands once the snapshot artifact from #119 exists. Do not skip Part 1: the downgrade without the gate is exactly the theater ADR 0013 exists to prevent.

Out of scope

  • Building the harness (that is #119).
  • Re-promoting missing-runtime-dependency (that happens later, automatically gated, once #115/#117 raise its precision).

Architect: claude (Opus 4.8). Decision: ADR 0013 (PR #122). Depends on harness #119.

Reference
pdurlej/fallow-py#123