[P1][baseline] Validate + document the baseline adoption workflow (reframes the precision problem) #120


Open

opened 2026-06-17 23:11:16 +02:00 by claude · 0 comments


**Architect direction (claude, Opus 4.8).** The 2026-06-15 precision audit showed ~13% out-of-box precision on real repos. But precision is the wrong thing to optimize for *adoption* if the baseline workflow works: a real team adopting a noisy analyzer does not fix 1,000 findings on day one; they **baseline the existing noise and gate only new findings.** This issue validates that `fallow-py`'s baseline actually delivers that adoption story, and documents it as the canonical path.

**Why this matters architecturally:** if baseline works, low out-of-box precision stops being an adoption blocker: existing findings are suppressed, and only *new* code is gated, where precision matters far more and is far higher. This reframes the precision problem from "must hit 80% before anyone can use it" to "must be clean on new code."

## Validate (end-to-end on a real noisy repo)

Use a high-volume corpus repo (e.g. `django-oscar`, with 1,102 findings, pinned in `benchmarks/soak/repos.toml`):

1. `fallow-py baseline create --root <repo> --output .fallow-baseline.json` → captures all current findings.
2. `fallow-py analyze --root <repo> --baseline .fallow-baseline.json` → **expect 0 new findings** (everything suppressed) and exit 0.
3. Introduce one genuinely new structural problem (e.g. add an unreferenced top-level function, or an undeclared import) → re-run with `--baseline` → **expect exactly that 1 new finding** to surface, with a non-zero exit per `--fail-on`.
4. Confirm the baseline's fingerprints stay stable across a no-op re-run (no spurious "new" findings from traversal-order or path drift).

If any of these steps fails, that is itself a high-value finding: open a follow-up and label it `dogfood:friction`.
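The four steps above could be scripted roughly as follows. This is a minimal sketch, not part of `fallow-py`: the helper names (`baseline_create_cmd`, `analyze_cmd`, `validate`) and the injectable `run` callable are hypothetical, and only the CLI flags quoted in this issue are used. It checks exit codes only, because the output format for counting new findings is not specified here, and it assumes `--fail-on warning` is the desired gate.

```python
import subprocess
from typing import Callable, Dict, List

Runner = Callable[[List[str]], int]  # takes argv, returns the process exit code


def baseline_create_cmd(root: str, baseline: str) -> List[str]:
    # Step 1: capture all current findings into the baseline file.
    return ["fallow-py", "baseline", "create", "--root", root, "--output", baseline]


def analyze_cmd(root: str, baseline: str, fail_on: str = "warning") -> List[str]:
    # Steps 2-4: analyze with the baseline applied (fail_on value is an assumption).
    return ["fallow-py", "analyze", "--root", root,
            "--baseline", baseline, "--fail-on", fail_on]


def real_runner(argv: List[str]) -> int:
    # Run the command for real; a non-zero exit is data here, not an error.
    return subprocess.run(argv, check=False).returncode


def validate(root: str, baseline: str,
             introduce_problem: Callable[[str], None],
             run: Runner = real_runner) -> Dict[str, bool]:
    """Run the validation steps in order and record which expectations held."""
    results = {}
    results["baseline_created"] = run(baseline_create_cmd(root, baseline)) == 0
    results["clean_with_baseline"] = run(analyze_cmd(root, baseline)) == 0
    # No-op re-run: nothing changed, so the exit code should still be 0.
    results["noop_rerun_stable"] = run(analyze_cmd(root, baseline)) == 0
    # Caller-supplied hook that adds one new structural problem to the repo.
    introduce_problem(root)
    results["new_finding_gates"] = run(analyze_cmd(root, baseline)) != 0
    return results
```

Passing `run` as a parameter keeps the harness testable without the tool installed; confirming that *exactly one* new finding surfaced would still need a look at the analyzer's actual output.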

## Document

Add a section to `docs/dogfood.md` (or a new `docs/adoption.md`), **"Adopting fallow-py on an existing (noisy) repository"**, describing the canonical pattern:

- baseline first, commit `.fallow-baseline.json`,
- gate only new findings in CI (`--baseline ... --fail-on warning`),
- periodically review and shrink the baseline as real findings are fixed.

State plainly that out-of-box precision is improving (link the audit + #114-117) but baseline is the recommended adoption path **today**.
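For the adoption doc, the pattern can be pictured as plain set arithmetic over finding fingerprints. The sketch below is a conceptual model only, not `fallow-py`'s implementation: the fingerprint strings, the `Partition` type, and the `partition` and `shrink` helpers are all hypothetical, and the real baseline format may differ.

```python
from typing import Iterable, List, NamedTuple, Set


class Partition(NamedTuple):
    suppressed: Set[str]  # in the baseline and still reported: hidden from CI
    new: Set[str]         # reported but not in the baseline: these gate CI
    stale: Set[str]       # in the baseline but no longer reported: safe to drop


def partition(baseline: Iterable[str], current: Iterable[str]) -> Partition:
    """Split findings by comparing hypothetical fingerprint strings."""
    b, c = set(baseline), set(current)
    return Partition(suppressed=b & c, new=c - b, stale=b - c)


def shrink(baseline: Iterable[str], current: Iterable[str]) -> List[str]:
    """Return the baseline with fixed (stale) entries removed, in sorted order."""
    return sorted(set(baseline) & set(current))


# Hypothetical fingerprints: rule, repo-relative path, symbol.
baseline = ["unused-function:pkg/a.py:old_helper", "unused-function:pkg/b.py:legacy"]
current = ["unused-function:pkg/a.py:old_helper", "unused-function:pkg/c.py:brand_new"]
result = partition(baseline, current)
```

Because sets ignore ordering, this model is indifferent to traversal order, and using repo-relative paths in the fingerprint is one way to avoid drift when the checkout location changes. The `stale` set is what a periodic baseline review would prune.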

## Test

Add a behavioral test: a fixture repo where baseline suppresses N existing findings and a newly-introduced finding still surfaces. This locks the adoption contract against regression.
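The shape of such a test might look like the sketch below. To stay self-contained it uses a toy stand-in analyzer (`toy_findings`, which flags top-level functions that are never referenced by name) instead of `fallow-py` itself; every name here is hypothetical, and the real test would call the actual analyzer against a checked-in fixture repo.

```python
import ast
import tempfile
from pathlib import Path
from typing import Set


def toy_findings(root: Path) -> Set[str]:
    """Stand-in analyzer: report top-level functions never referenced by name."""
    defined, used = set(), set()
    for path in sorted(root.rglob("*.py")):
        tree = ast.parse(path.read_text())
        rel = path.relative_to(root).as_posix()  # repo-relative, so paths do not drift
        for node in tree.body:
            if isinstance(node, ast.FunctionDef):
                defined.add((rel, node.name))
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                used.add(node.id)
            elif isinstance(node, ast.Attribute):
                used.add(node.attr)
    return {f"unused-function:{rel}:{name}" for rel, name in defined if name not in used}


def new_findings(root: Path, baseline: Set[str]) -> Set[str]:
    # Only findings absent from the baseline are surfaced.
    return toy_findings(root) - baseline


def test_baseline_suppresses_existing_and_surfaces_new() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Fixture repo with N = 2 pre-existing findings.
        (root / "legacy.py").write_text(
            "def old_a():\n    pass\n\ndef old_b():\n    pass\n"
        )
        baseline = toy_findings(root)
        assert len(baseline) == 2
        # Existing findings are fully suppressed, including on a no-op re-run.
        assert new_findings(root, baseline) == set()
        assert new_findings(root, baseline) == set()
        # A newly introduced problem still surfaces, and only that one.
        (root / "feature.py").write_text("def brand_new():\n    pass\n")
        assert new_findings(root, baseline) == {"unused-function:feature.py:brand_new"}
```

The two assertions that matter for the adoption contract are the empty set before the change and the single-element set after it; the stand-in analyzer only exists to make those assertions executable here.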

## Acceptance criteria

- Documented, reproducible baseline workflow that suppresses existing findings and surfaces new ones, demonstrated on a real corpus repo.
- `docs/dogfood.md` / `docs/adoption.md` section added.
- Behavioral test covering suppress-existing + surface-new.
- Any baseline defect found is split into its own labeled issue.

## Out of scope

- Improving the analyzer's precision (that is #114-117).
- Baseline format changes: only validate + document current behavior unless a defect forces a fix.

