[P0][harness] Precision-regression harness — make #114-117 falsifiable (the scoreboard) #119

Open
opened 2026-06-17 23:11:16 +02:00 by claude · 0 comments
Collaborator

Architect direction (claude, Opus 4.8). The 2026-06-15 precision audit (docs/dogfood/real-world-precision-2026-06-15.md, PR #113) measured ~13% out-of-box precision and opened four fixes (#114, #115, #116, #117). Those fixes are currently unfalsifiable: there is no in-repo way to answer "did #114 actually drop the non-package-tree false positives from 57 to near zero?" This issue builds the scoreboard.

Goal

A reproducible, zero-LLM precision-regression harness under benchmarks/precision/ that re-measures the analyzer against the frozen audit corpus and reports per-rule deltas, so every fix PR can show its precision impact.

Architecture

Two complementary layers:

Layer A — count regression (fully deterministic, CI-friendly)

  • Reuse the pinned corpus in benchmarks/soak/repos.toml (10 repos, fixed commits).
  • measure.py: clone (shallow, pinned) → fallow-py analyze --format json per repo → aggregate per-rule finding counts.
  • Compare against a checked-in snapshot benchmarks/precision/baseline-counts.json and print a delta table (rule → before/after/Δ).
  • This alone answers "did finding volume on rule X change after my fix?" with zero adjudication.
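
A minimal sketch of the Layer A aggregation and delta table, for illustration only. The report shape (`{"findings": [{"rule": ...}]}`) and the rule names `rule-a`, `rule-b`, `rule-c` are assumptions; the real `fallow-py analyze --format json` output may differ, and the function names are hypothetical.

```python
from collections import Counter


def count_by_rule(report: dict) -> Counter:
    """Count findings per rule in one analyzer report.

    Assumed (hypothetical) shape: {"findings": [{"rule": "..."}, ...]}.
    """
    return Counter(f["rule"] for f in report.get("findings", []))


def aggregate(reports: list) -> Counter:
    """Sum per-rule counts over the reports of all corpus repos."""
    total = Counter()
    for report in reports:
        total.update(count_by_rule(report))
    return total


def delta_table(baseline: dict, current: dict) -> list:
    """Return (rule, before, after, delta) rows, sorted by rule name."""
    rows = []
    for rule in sorted(set(baseline) | set(current)):
        before = baseline.get(rule, 0)
        after = current.get(rule, 0)
        rows.append((rule, before, after, after - before))
    return rows


def format_table(rows: list) -> str:
    """Render the rows as a fixed-width text table with a signed delta."""
    lines = [f"{'rule':<24}{'before':>8}{'after':>8}{'delta':>8}"]
    for rule, before, after, delta in rows:
        lines.append(f"{rule:<24}{before:>8}{after:>8}{delta:>+8}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up reports standing in for two corpus repos.
    reports = [
        {"findings": [{"rule": "rule-a"}, {"rule": "rule-b"}]},
        {"findings": [{"rule": "rule-a"}]},
    ]
    baseline = {"rule-a": 5, "rule-c": 1}  # stands in for baseline-counts.json
    print(format_table(delta_table(baseline, aggregate(reports))))
```

Taking the union of rule names keeps rules that vanished entirely (count drops to 0) and rules that newly appear both visible in the table.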

Layer B — precision tracking against the golden adjudication

  • The audit's 236 adjudicated verdicts (TP/FP per finding) are the ground truth. claude will commit them as benchmarks/precision/golden-adjudication-2026-06-15.json (each entry keyed by the finding's stable fingerprint + rule + repo).
  • After a fix, measure.py --precision re-runs the analyzer and, matching findings across runs by fingerprint, reports:
    • previously-FP findings that disappeared → improvement (good),
    • previously-TP findings that disappeared → regression (bad — flag loudly),
    • net per-rule precision movement on the adjudicated subset.
  • This leverages the fingerprint-stability promise (philosophy.md Promise #4). If fingerprints are not stable enough to match across the small diffs a fix introduces, that itself is a finding worth a separate issue.
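
The fingerprint matching above could be sketched as follows. This is an illustration under stated assumptions, not the harness itself: the golden entry fields (`fingerprint`, `rule`, `repo`, `verdict`) and all function names are hypothetical, and findings that are new since the audit are ignored because they have no verdict.

```python
from collections import defaultdict


def precision(tp: int, fp: int):
    """TP / (TP + FP), or None when there are no adjudicated findings."""
    total = tp + fp
    return tp / total if total else None


def precision_movement(golden: list, current_keys: set) -> dict:
    """Compare adjudicated verdicts with the findings still reported.

    golden: entries with assumed keys "repo", "rule", "fingerprint" and
            "verdict" ("TP" or "FP").
    current_keys: set of (repo, rule, fingerprint) tuples from the new run.
    """
    stats = defaultdict(
        lambda: {"tp_before": 0, "fp_before": 0, "tp_lost": 0, "fp_fixed": 0}
    )
    for entry in golden:
        s = stats[entry["rule"]]
        is_tp = entry["verdict"] == "TP"
        s["tp_before" if is_tp else "fp_before"] += 1
        key = (entry["repo"], entry["rule"], entry["fingerprint"])
        if key not in current_keys:
            # Disappeared FP is an improvement; disappeared TP is a regression.
            s["tp_lost" if is_tp else "fp_fixed"] += 1

    result = {}
    for rule, s in stats.items():
        tp_after = s["tp_before"] - s["tp_lost"]
        fp_after = s["fp_before"] - s["fp_fixed"]
        result[rule] = {
            **s,
            "precision_before": precision(s["tp_before"], s["fp_before"]),
            "precision_after": precision(tp_after, fp_after),
        }
    return result


def regressions(movement: dict) -> list:
    """Rules that lost at least one previously-TP finding (flag loudly)."""
    return sorted(rule for rule, s in movement.items() if s["tp_lost"] > 0)
```

If a fix shifts fingerprints, previously adjudicated findings would show up here as "disappeared", so a sudden large `tp_lost` is worth checking against fingerprint stability before treating it as a real regression.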

Acceptance criteria

  • python benchmarks/precision/measure.py on current main reproduces the audit's corpus counts exactly (deterministic; 9,771 total, per-rule as in the report's §1/§3).
  • --precision mode emits the per-rule TP/FP/precision table matching the report's §3 on main.
  • Running it on a fix branch (e.g. #114) shows a non-zero delta and does not crash when adjudicated findings from the golden set no longer appear in the analyzer output.
  • A short benchmarks/precision/README.md documents how to run it and how to refresh the baseline snapshot after an intentional change.
  • Harness is stdlib-only (no new runtime deps); network access only for cloning pinned repos.
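
One way the stdlib-only, clone-only network constraint might look in practice is a shallow fetch of a single pinned commit through `subprocess`. The sketch below is an assumption about how `measure.py` could do it; the function names are hypothetical, and fetching a bare commit hash only works when the remote allows it.

```python
import subprocess
from pathlib import Path


def pinned_clone_commands(url: str, commit: str, dest: Path) -> list:
    """Build git commands that fetch exactly one commit, with no history."""
    d = str(dest)
    return [
        ["git", "init", "--quiet", d],
        ["git", "-C", d, "remote", "add", "origin", url],
        # --depth 1 keeps the fetch shallow; the commit hash pins the content.
        ["git", "-C", d, "fetch", "--quiet", "--depth", "1", "origin", commit],
        ["git", "-C", d, "checkout", "--quiet", "--detach", "FETCH_HEAD"],
    ]


def clone_pinned(url: str, commit: str, dest: Path) -> None:
    """Run the commands; raises CalledProcessError if any git step fails."""
    for cmd in pinned_clone_commands(url, commit, dest):
        subprocess.run(cmd, check=True)
```

On Python 3.11 or newer, the standard-library `tomllib` module can read `benchmarks/soak/repos.toml` without adding a dependency; the field names to pull the URL and commit from depend on that file's actual schema.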

Out of scope

  • Re-adjudicating findings (the golden set is frozen; a future audit refreshes it).
  • Configured-mode ([tool.fallow_py]) measurement — natural follow-up once #114 lands.
  • Wiring into CI as a blocking gate (start as an operator/dev-run tool; CI integration is a later decision).

Seed material

claude provides: the corpus build script, stratified sampler, and golden-adjudication-2026-06-15.json (236 verdicts) from the audit run. Codex promotes them into the harness shape above.

Reference
pdurlej/fallow-py#119