[P0][harness] Precision-regression harness — make #114-117 falsifiable (the scoreboard) #119
Reference
pdurlej/fallow-py#119
Architect direction (claude, Opus 4.8). The 2026-06-15 precision audit (docs/dogfood/real-world-precision-2026-06-15.md, PR #113) measured ~13% out-of-box precision and opened four fixes (#114, #115, #116, #117). Those fixes are currently unfalsifiable: there is no in-repo way to answer "did #114 actually drop the non-package-tree false positives from 57 to near zero?" This issue builds the scoreboard.

Goal
A reproducible, zero-LLM precision-regression harness under benchmarks/precision/ that re-measures the analyzer against the frozen audit corpus and reports per-rule deltas, so every fix PR can show its precision impact.

Architecture
Two complementary layers:
Layer A: count regression (fully deterministic, CI-friendly)
- benchmarks/soak/repos.toml (10 repos, fixed commits).
- measure.py: clone (shallow, pinned) → fallow-py analyze --format json per repo → aggregate per-rule finding counts.
- benchmarks/precision/baseline-counts.json and print a delta table (rule → before/after/Δ).

Layer B: precision tracking against the golden adjudication
- benchmarks/precision/golden-adjudication-2026-06-15.json (each entry keyed by the finding's stable fingerprint + rule + repo).
- measure.py --precision re-runs the analyzer and, matching findings across runs by fingerprint, reports:

Acceptance criteria
- python benchmarks/precision/measure.py on current main reproduces the audit's corpus counts exactly (deterministic; 9,771 total, per-rule as in the report's §1/§3).
- --precision mode emits the per-rule TP/FP/precision table matching the report's §3 on main.
- benchmarks/precision/README.md documents how to run it and how to refresh the baseline snapshot after an intentional change.

Out of scope
- [tool.fallow_py]) measurement: natural follow-up once #114 lands.

Seed material
claude provides: the corpus build script, stratified sampler, and golden-adjudication-2026-06-15.json (236 verdicts) from the audit run. Codex promotes them into the harness shape above.
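
Illustrative sketch of the two layers (not part of the seed material). The core arithmetic of Layer A (per-rule count deltas) and Layer B (per-rule precision against golden verdicts) might look roughly like the Python below. This is a minimal sketch under stated assumptions: the issue does not specify the analyzer's JSON schema or the golden file's layout, so the finding keys (repo, rule, fingerprint), the "TP"/"FP" verdict strings, and every function name here are hypothetical.

```python
from collections import Counter


def count_by_rule(findings):
    """Layer A: aggregate a list of finding dicts into per-rule counts.

    Assumes (hypothetically) that each finding carries a "rule" key.
    """
    return Counter(f["rule"] for f in findings)


def delta_table(baseline, current):
    """Layer A: build (rule, before, after, delta) rows, sorted by rule.

    Rules present in only one snapshot are reported with a count of 0
    on the other side, so added and removed rules both show up.
    """
    rows = []
    for rule in sorted(set(baseline) | set(current)):
        before = baseline.get(rule, 0)
        after = current.get(rule, 0)
        rows.append((rule, before, after, after - before))
    return rows


def precision_by_rule(findings, golden):
    """Layer B: per-rule TP/FP/precision over adjudicated findings.

    `golden` is assumed to map (repo, rule, fingerprint) -> "TP" or "FP".
    Findings with no golden verdict are tallied as "unadjudicated" and
    excluded from the precision ratio.
    """
    stats = {}
    for f in findings:
        key = (f["repo"], f["rule"], f["fingerprint"])
        s = stats.setdefault(f["rule"], {"tp": 0, "fp": 0, "unadjudicated": 0})
        verdict = golden.get(key)
        if verdict == "TP":
            s["tp"] += 1
        elif verdict == "FP":
            s["fp"] += 1
        else:
            s["unadjudicated"] += 1
    for s in stats.values():
        judged = s["tp"] + s["fp"]
        # Precision is undefined when nothing was adjudicated for the rule.
        s["precision"] = s["tp"] / judged if judged else None
    return stats
```

In this sketch, findings without a golden verdict are counted separately instead of being folded into precision, so findings that are new since the audit stay visible without skewing the ratio. Whether the real harness should treat them that way is an open design choice.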