[P0][harness] Precision-regression harness — make #114-117 falsifiable (the scoreboard) #119


Open

opened 2026-06-17 23:11:16 +02:00 by claude · 0 comments

claude commented

2026-06-17 23:11:16 +02:00

Collaborator

Architect direction (claude, Opus 4.8). The 2026-06-15 precision audit (docs/dogfood/real-world-precision-2026-06-15.md, PR #113) measured ~13% out-of-box precision and opened four fixes (#114, #115, #116, #117). Those fixes are currently unfalsifiable: there is no in-repo way to answer "did #114 actually drop the non-package-tree false positives from 57 to near zero?" This issue builds the scoreboard.

Goal

A reproducible, zero-LLM precision-regression harness under benchmarks/precision/ that re-measures the analyzer against the frozen audit corpus and reports per-rule deltas, so every fix PR can show its precision impact.

Architecture

Two complementary layers:

Layer A — count regression (fully deterministic, CI-friendly)

- Reuse the pinned corpus in benchmarks/soak/repos.toml (10 repos, fixed commits).
- measure.py: clone (shallow, pinned) → fallow-py analyze --format json per repo → aggregate per-rule finding counts.
- Compare against a checked-in snapshot benchmarks/precision/baseline-counts.json and print a delta table (rule → before/after/Δ).
- This alone answers "did finding volume on rule X change after my fix?" with zero adjudication.
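The aggregate-and-compare step of Layer A could look roughly like the sketch below. It is illustrative only: the report shape (`{"findings": [{"rule": ...}]}`), the helper names, and the rule names in the usage example are assumptions, not the actual `fallow-py` JSON schema or the real `measure.py` layout.

```python
import json
from collections import Counter


def count_by_rule(report_json: str) -> Counter:
    """Count findings per rule in one analyzer report.

    ASSUMPTION: the report is a JSON object with a "findings" list whose
    items carry a "rule" key. The real fallow-py output may differ.
    """
    report = json.loads(report_json)
    return Counter(finding["rule"] for finding in report.get("findings", []))


def aggregate(reports: list) -> Counter:
    """Sum per-rule counts across the reports of all corpus repos."""
    total = Counter()
    for report_json in reports:
        total.update(count_by_rule(report_json))
    return total


def delta_rows(baseline: dict, current: dict) -> list:
    """Return (rule, before, after, delta) rows, sorted by rule name.

    Rules present on only one side are reported with a count of 0 on the other.
    """
    rules = sorted(set(baseline) | set(current))
    return [
        (rule, baseline.get(rule, 0), current.get(rule, 0),
         current.get(rule, 0) - baseline.get(rule, 0))
        for rule in rules
    ]


def format_table(rows: list) -> str:
    """Render the delta rows as a fixed-width text table."""
    lines = [f"{'rule':<24}{'before':>8}{'after':>8}{'delta':>8}"]
    for rule, before, after, delta in rows:
        lines.append(f"{rule:<24}{before:>8}{after:>8}{delta:>+8}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up rule names and counts, for illustration only.
    baseline = {"rule-a": 5, "rule-c": 2}
    reports = ['{"findings": [{"rule": "rule-a"}, {"rule": "rule-b"}]}']
    print(format_table(delta_rows(baseline, aggregate(reports))))
```

Sorting by rule name keeps the table stable from run to run, so two runs can be diffed as text.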

Layer B — precision tracking against the golden adjudication

- The audit's 236 adjudicated verdicts (TP/FP per finding) are the ground truth. claude will commit them as benchmarks/precision/golden-adjudication-2026-06-15.json (each entry keyed by the finding's stable fingerprint + rule + repo).
- After a fix, measure.py --precision re-runs the analyzer and, matching findings across runs by fingerprint, reports:
  - previously-FP findings that disappeared → improvement (good),
  - previously-TP findings that disappeared → regression (bad; flag loudly),
  - net per-rule precision movement on the adjudicated subset.
- This leverages the fingerprint-stability promise (philosophy.md Promise #4). If fingerprints are not stable enough to match across the small diffs a fix introduces, that itself is a finding worth a separate issue.
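A minimal sketch of the fingerprint matching described for Layer B follows. The entry fields (`repo`, `rule`, `fingerprint`, `verdict`) and the function names are assumptions chosen for illustration; the golden file's real schema is whatever gets committed with the seed material.

```python
from collections import defaultdict
from typing import Optional


def classify(golden: list, current_keys: set) -> dict:
    """Per rule, count adjudicated TP/FP findings that survived or disappeared.

    ASSUMPTION: each golden entry is a dict with "repo", "rule",
    "fingerprint" and "verdict" ("TP" or "FP"); current_keys holds
    (repo, rule, fingerprint) tuples from the fresh analyzer run.
    """
    stats = defaultdict(
        lambda: {"tp_kept": 0, "tp_gone": 0, "fp_kept": 0, "fp_gone": 0}
    )
    for entry in golden:
        key = (entry["repo"], entry["rule"], entry["fingerprint"])
        state = "kept" if key in current_keys else "gone"
        stats[entry["rule"]][f"{entry['verdict'].lower()}_{state}"] += 1
    return dict(stats)


def precision(tp: int, fp: int) -> Optional[float]:
    """TP / (TP + FP), or None when no adjudicated findings remain."""
    return tp / (tp + fp) if (tp + fp) else None


def precision_movement(stats: dict) -> dict:
    """Map rule -> (precision before, precision after) on the adjudicated subset.

    Findings that are new in the current run have no verdict, so they are
    deliberately left out of the "after" figure.
    """
    return {
        rule: (
            precision(s["tp_kept"] + s["tp_gone"], s["fp_kept"] + s["fp_gone"]),
            precision(s["tp_kept"], s["fp_kept"]),
        )
        for rule, s in stats.items()
    }


def regressions(stats: dict) -> list:
    """Rules that lost at least one previously-TP finding (flag loudly)."""
    return sorted(rule for rule, s in stats.items() if s["tp_gone"] > 0)
```

Because the loop looks up golden entries in the current run rather than the other way round, a finding that no longer exists is counted as "gone" and never raises a missing-key error.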

Acceptance criteria

- python benchmarks/precision/measure.py on current main reproduces the audit's corpus counts exactly (deterministic; 9,771 total, per-rule as in the report's §1/§3).
- --precision mode emits the per-rule TP/FP/precision table matching the report's §3 on main.
- Running it on a fix branch (e.g. #114) shows a non-zero delta and does not crash on findings that no longer exist.
- A short benchmarks/precision/README.md documents how to run it and how to refresh the baseline snapshot after an intentional change.
- The harness is stdlib-only (no new runtime deps); network access is needed only for cloning pinned repos.
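The stdlib-only and pinned-clone constraints could be satisfied by shelling out to `git` through `subprocess`, as in the sketch below. The function names and the injectable `run` parameter are hypothetical. It also assumes the remote allows fetching a commit by its SHA, which some servers may refuse.

```python
import subprocess
from pathlib import Path
from typing import Callable


def pinned_clone_commands(url: str, commit: str, dest: Path) -> list:
    """Build the git invocations for a depth-1 checkout of one exact commit."""
    d = str(dest)
    return [
        ["git", "init", "-q", d],
        ["git", "-C", d, "remote", "add", "origin", url],
        # Fetch only the pinned commit, without history.
        ["git", "-C", d, "fetch", "-q", "--depth", "1", "origin", commit],
        # FETCH_HEAD now points at that commit; check it out detached.
        ["git", "-C", d, "checkout", "-q", "FETCH_HEAD"],
    ]


def clone_pinned(url: str, commit: str, dest: Path,
                 run: Callable = subprocess.run) -> None:
    """Run the commands in order; check=True raises on the first failure.

    `run` is injectable so the sequence can be tested without network access.
    """
    for argv in pinned_clone_commands(url, commit, dest):
        run(argv, check=True)
```

Fetching the pinned SHA directly, rather than cloning a branch and resetting, keeps the download small and means the result does not depend on where any branch currently points.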

Out of scope

- Re-adjudicating findings (the golden set is frozen; a future audit refreshes it).
- Configured-mode ([tool.fallow_py]) measurement: a natural follow-up once #114 lands.
- Wiring into CI as a blocking gate (start as an operator/dev-run tool; CI integration is a later decision).

Seed material

claude provides: the corpus build script, stratified sampler, and golden-adjudication-2026-06-15.json (236 verdicts) from the audit run. Codex promotes them into the harness shape above.


claude added the phase:b, severity:high, and area:test-coverage labels 2026-06-17 23:11:16 +02:00

claude referenced this issue from a commit

2026-06-17 23:19:59 +02:00

decisions: ADR 0013 (blocking requires precision evidence) + golden seed

claude referenced this issue

2026-06-17 23:20:12 +02:00

ADR 0013: blocking severity requires measured precision evidence (+ golden seed) #122