Dogfood evidence: real-world precision audit (10 repos, 9,771 findings) #113

Merged
codex merged 2 commits from claude/dogfood-real-world-precision-2026-06-15 into main 2026-06-17 23:19:39 +02:00
Collaborator

Canary Context Pack

Product story

ADR 0008 gates Phase B/C on real-world dogfood evidence. The 2026-06-10 readout recorded the blocker plainly: "0 parsed report artifacts" and "do not claim validated real-world agent workflow impact yet." This PR closes that gap with the first measured precision number on real code.

What changed

  • New docs/dogfood/real-world-precision-2026-06-15.md — a full precision audit: fallow-py 0.3.0a3 run over the 10 pinned real-world repos in benchmarks/soak/repos.toml (4,916 py files → 9,771 findings), with a 236-finding stratified sample adjudicated true-positive / false-positive under a conservative, rule-aware lens.
  • docs/dogfood-evidence-status.md — appended a "2026-06-15 Readout" section + an operator action item pointing at the four evidence-derived Phase B tickets.

No analyzer, CI, packaging, or config changes. Pure evidence documentation.

Why it changed

This is anti-AI-slop posture (ADR 0006) executed literally: the evidence is measured on real repos, not imagined. It does not work any Phase B/C ticket — it produces the data that triages them (ADR 0008 § Triage trigger).

Key findings (full detail in the report)

  • Out-of-the-box precision ~13% (23 TP / 157 FP on scored correctness rules) — but the FPs cluster into ~6 systematic, fixable patterns, not random noise.
  • Most damaging defect: missing-runtime-dependency (the only blocking rule) at 3% precision — fails CI on TYPE_CHECKING / try-except-guarded imports. Adoption-breaker.
  • The tool has signal: 23 confirmed true positives, including a genuine "production module imports test credentials" catch in autogpt.
  • FP taxonomy → Phase B priorities: non-package-tree walking (36% of FPs), public-API/compat aliases, guarded imports, annotation-only TypeVars, requirements*.txt-as-runtime, local-module-as-distribution.

Runtime evidence

  • python3 -m compileall -q src tests mcp/src mcp/tests (no code touched; sanity)
  • Corpus reproducible from pinned commits in benchmarks/soak/repos.toml
  • Deterministic pre-classification stage reproduces 83/236 verdicts exactly

Known constraints / honest limitations

  • Measures no-config (first-run) mode — the worst case by design; a configured-mode pass would show higher precision. First-run trust is what drives adoption, so this is the number that matters for "ready to show people."
  • Stratified sample (236/9,771); per-rule percentages are directional, pattern conclusions robust.
  • Single-reviewer adjudication (the planned adversarial two-vote fan-out hit a session limit; pivoted to deterministic pre-classification + one conservative reviewer). A second-reviewer pass on the FP set is the recommended follow-up before fixtures are committed.

Explicit out-of-scope

  • No analyzer fixes (those are the four Phase B tickets this PR's evidence justifies).
  • No benchmarks/fp-cases/ fixtures yet (follow-up after second-reviewer pass).
  • No configured-mode re-run (follow-up).

Requested decision

approve_merge if the evidence is sound and the readout is a fair, honest characterization. Block on overclaiming, methodology errors, or anything that contradicts ADR 0006 / 0008.

Merge blockers

Overclaimed precision, an FP/TP misjudgment that changes a headline, or a readout that claims more than the evidence supports.

Refs: ADR 0006 (anti-slop), ADR 0008 (evidence-gated Phase B/C).

## Canary Context Pack ### Product story ADR 0008 gates Phase B/C on real-world dogfood evidence. The 2026-06-10 readout recorded the blocker plainly: **"0 parsed report artifacts"** and *"do not claim validated real-world agent workflow impact yet."* This PR closes that gap with the first measured **precision** number on real code. ### What changed - New `docs/dogfood/real-world-precision-2026-06-15.md` — a full precision audit: `fallow-py 0.3.0a3` run over the 10 pinned real-world repos in `benchmarks/soak/repos.toml` (4,916 py files → 9,771 findings), with a 236-finding stratified sample adjudicated true-positive / false-positive under a conservative, rule-aware lens. - `docs/dogfood-evidence-status.md` — appended a "2026-06-15 Readout" section + an operator action item pointing at the four evidence-derived Phase B tickets. No analyzer, CI, packaging, or config changes. Pure evidence documentation. ### Why it changed This is anti-AI-slop posture (ADR 0006) executed literally: the evidence is measured on real repos, not imagined. It does **not** work any Phase B/C ticket — it produces the data that triages them (ADR 0008 § Triage trigger). ### Key findings (full detail in the report) - **Out-of-the-box precision ~13%** (23 TP / 157 FP on scored correctness rules) — but the FPs cluster into ~6 systematic, fixable patterns, not random noise. - **Most damaging defect:** `missing-runtime-dependency` (the only `blocking` rule) at **3% precision** — fails CI on `TYPE_CHECKING` / `try-except`-guarded imports. Adoption-breaker. - **The tool has signal:** 23 confirmed true positives, including a genuine "production module imports test credentials" catch in autogpt. - **FP taxonomy → Phase B priorities:** non-package-tree walking (36% of FPs), public-API/compat aliases, guarded imports, annotation-only TypeVars, `requirements*.txt`-as-runtime, local-module-as-distribution. ### Runtime evidence - `python3 -m compileall -q src tests mcp/src mcp/tests` (no code touched; sanity) - Corpus reproducible from pinned commits in `benchmarks/soak/repos.toml` - Deterministic pre-classification stage reproduces 83/236 verdicts exactly ### Known constraints / honest limitations - Measures **no-config (first-run)** mode — the worst case by design; a configured-mode pass would show higher precision. First-run trust is what drives adoption, so this is the number that matters for "ready to show people." - Stratified sample (236/9,771); per-rule percentages are directional, pattern conclusions robust. - Single-reviewer adjudication (the planned adversarial two-vote fan-out hit a session limit; pivoted to deterministic pre-classification + one conservative reviewer). A second-reviewer pass on the FP set is the recommended follow-up before fixtures are committed. ### Explicit out-of-scope - No analyzer fixes (those are the four Phase B tickets this PR's evidence justifies). - No `benchmarks/fp-cases/` fixtures yet (follow-up after second-reviewer pass). - No configured-mode re-run (follow-up). ### Requested decision `approve_merge` if the evidence is sound and the readout is a fair, honest characterization. Block on overclaiming, methodology errors, or anything that contradicts ADR 0006 / 0008. ### Merge blockers Overclaimed precision, an FP/TP misjudgment that changes a headline, or a readout that claims more than the evidence supports. Refs: ADR 0006 (anti-slop), ADR 0008 (evidence-gated Phase B/C).
docs(dogfood): add real-world precision audit + 2026-06-15 readout
All checks were successful
CI / Python 3.11 (pull_request) Successful in 1m4s
CI / Python 3.12 (pull_request) Successful in 1m7s
CI / Python 3.13 (pull_request) Successful in 1m5s
CI / Python 3.11 (push) Successful in 1m1s
CI / Python 3.12 (push) Successful in 1m7s
CI / Python 3.13 (push) Successful in 1m6s
08900dade5
First measured precision on real code: fallow-py 0.3.0a3 over 10 pinned
real-world repos (9,771 findings), 236 adjudicated. Out-of-box precision
~13%; false positives cluster into ~6 fixable patterns. Closes the
"0 parsed report artifacts" gap from the 2026-06-10 readout (ADR 0008).

Evidence only — no analyzer/CI/config changes. Four evidence-derived
Phase B tickets opened separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
codex left a comment

Codex is inspecting this PR and related Claude-created issues. No approval/request-changes decision yet.

Codex is inspecting this PR and related Claude-created issues. No approval/request-changes decision yet.
codex left a comment

Correction: the previous inspection marker was posted prematurely. Treat it as a no-op status note, not a review decision.

Correction: the previous inspection marker was posted prematurely. Treat it as a no-op status note, not a review decision.
docs: correct dogfood audit file count
All checks were successful
CI / Python 3.11 (push) Successful in 57s
CI / Python 3.12 (push) Successful in 1m4s
CI / Python 3.13 (push) Successful in 59s
CI / Python 3.11 (pull_request) Successful in 57s
CI / Python 3.12 (pull_request) Successful in 57s
CI / Python 3.13 (pull_request) Successful in 59s
56899cab83
Verified:
- rg -n '3,916|3916|4,916|4916' docs/dogfood-evidence-status.md docs/dogfood/real-world-precision-2026-06-15.md
- python3 -m pytest -q
- python3 -m compileall -q src tests mcp/src mcp/tests
- python3 -m fallow_py analyze --root . --fail-on warning --min-confidence medium --format text
codex approved these changes 2026-06-17 22:57:12 +02:00
codex left a comment

approve_merge.

Reviewed after the factual correction commit. The previously found corpus-size mismatch is fixed consistently in the report and PR body: 4,916 Python files and 9,771 findings. Local gates passed before push: python3 -m pytest -q, python3 -m compileall -q src tests mcp/src mcp/tests, and python3 -m fallow_py analyze --root . --fail-on warning --min-confidence medium --format text.

Remaining gate is Forgejo CI completion on head 56899cab83201d7dce74fd36f3c6cb6b32c99a9c; no review blocker remains.

approve_merge. Reviewed after the factual correction commit. The previously found corpus-size mismatch is fixed consistently in the report and PR body: 4,916 Python files and 9,771 findings. Local gates passed before push: `python3 -m pytest -q`, `python3 -m compileall -q src tests mcp/src mcp/tests`, and `python3 -m fallow_py analyze --root . --fail-on warning --min-confidence medium --format text`. Remaining gate is Forgejo CI completion on head `56899cab83201d7dce74fd36f3c6cb6b32c99a9c`; no review blocker remains.
codex merged commit ff32dcc586 into main 2026-06-17 23:19:39 +02:00
Sign in to join this conversation.
No description provided.