Bench: profile analyzer phases before parallelism #44

Merged
pdurlej merged 2 commits from codex/perf-profile-benchmark-41 into main 2026-05-13 20:06:59 +02:00
Collaborator

Canary Context Pack

Product story

Before adding parallel AST indexing, pyfallow should have reproducible timing evidence showing whether indexing is actually the dominant bottleneck. This keeps the analyzer deterministic and avoids optimizing from vibes.

What changed

  • Added benchmarks/analysis-profile/run.py, a stdlib-only harness that profiles analyzer phases in-process.
  • Added benchmarks/analysis-profile/README.md with commands, cases, local results, and the current decision.
  • Added an Analyzer Internals Profile section to docs/performance.md.
  • Added tests for --list and a generated-fixture profile run.
  • Ignored generated profile workspace/results.

Why it changed

DeepSeek audit triage issue #41 suggested measuring pyfallow internals before implementing parallel AST indexing. The current local profile shows indexing is visible but not enough evidence to justify concurrency yet.

Files touched

  • .gitignore
  • benchmarks/analysis-profile/README.md
  • benchmarks/analysis-profile/run.py
  • docs/performance.md
  • tests/test_pyfallow.py

Runtime evidence

  • PASS: python3 benchmarks/analysis-profile/run.py --list
  • PASS: python3 benchmarks/analysis-profile/run.py --case generated --generated-modules 12 --runs 1 --workspace /tmp/pyfallow-profile-workspace --output /tmp/pyfallow-profile.json
  • PASS: python3 benchmarks/analysis-profile/run.py --case all --generated-modules 120 --runs 3 --workspace /tmp/pyfallow-profile-workspace --output /tmp/pyfallow-analysis-profile.json
  • PASS: python3 -m compileall -q src tests mcp/src mcp/tests benchmarks/analysis-profile/run.py
  • PASS: git diff --check
  • PASS: python3 -m pytest -q
  • PASS: /tmp/pyfallow-verify-venv/bin/python -m pytest -q mcp/tests
  • PASS: PYTHONPATH=src python3 -m pyfallow analyze --root . --fail-on warning --min-confidence medium

Local profile snapshot:

Case Files Imports Issues Median total Largest measured phases
demo-project 12 12 17 0.007018s source discovery 0.001413s; duplicate detection 0.001413s; file indexing 0.001281s
generated --generated-modules 120 122 41 297 0.076249s file indexing 0.021216s; complexity 0.014413s; source discovery 0.011337s

Known constraints

  • The harness uses temporary function wrapping inside the benchmark process; production analyzer behavior is unchanged.
  • The generated fixture is synthetic. It is useful for repeatability, not a substitute for large real-world corpus profiling.

Explicit out-of-scope

  • Implementing concurrent.futures parsing.
  • Changing analyzer runtime behavior.
  • Optimizing duplicate detection, graph traversal, or complexity scoring.

Requested decision

Approve if the benchmark is reproducible and the documented “do not parallelize yet” decision is supported by the evidence.

Merge blockers

  • Production analyzer behavior changes.
  • Runtime dependency added.
  • Benchmark cannot run in a fresh checkout.

Closes #41.
Refs #35.

## Canary Context Pack ### Product story Before adding parallel AST indexing, pyfallow should have reproducible timing evidence showing whether indexing is actually the dominant bottleneck. This keeps the analyzer deterministic and avoids optimizing from vibes. ### What changed - Added `benchmarks/analysis-profile/run.py`, a stdlib-only harness that profiles analyzer phases in-process. - Added `benchmarks/analysis-profile/README.md` with commands, cases, local results, and the current decision. - Added an Analyzer Internals Profile section to `docs/performance.md`. - Added tests for `--list` and a generated-fixture profile run. - Ignored generated profile workspace/results. ### Why it changed DeepSeek audit triage issue #41 suggested measuring pyfallow internals before implementing parallel AST indexing. The current local profile shows indexing is visible but not enough evidence to justify concurrency yet. ### Files touched - `.gitignore` - `benchmarks/analysis-profile/README.md` - `benchmarks/analysis-profile/run.py` - `docs/performance.md` - `tests/test_pyfallow.py` ### Runtime evidence - PASS: `python3 benchmarks/analysis-profile/run.py --list` - PASS: `python3 benchmarks/analysis-profile/run.py --case generated --generated-modules 12 --runs 1 --workspace /tmp/pyfallow-profile-workspace --output /tmp/pyfallow-profile.json` - PASS: `python3 benchmarks/analysis-profile/run.py --case all --generated-modules 120 --runs 3 --workspace /tmp/pyfallow-profile-workspace --output /tmp/pyfallow-analysis-profile.json` - PASS: `python3 -m compileall -q src tests mcp/src mcp/tests benchmarks/analysis-profile/run.py` - PASS: `git diff --check` - PASS: `python3 -m pytest -q` - PASS: `/tmp/pyfallow-verify-venv/bin/python -m pytest -q mcp/tests` - PASS: `PYTHONPATH=src python3 -m pyfallow analyze --root . --fail-on warning --min-confidence medium` Local profile snapshot: | Case | Files | Imports | Issues | Median total | Largest measured phases | | --- | ---: | ---: | ---: | ---: | --- | | `demo-project` | 12 | 12 | 17 | 0.007018s | source discovery 0.001413s; duplicate detection 0.001413s; file indexing 0.001281s | | `generated --generated-modules 120` | 122 | 41 | 297 | 0.076249s | file indexing 0.021216s; complexity 0.014413s; source discovery 0.011337s | ### Known constraints - The harness uses temporary function wrapping inside the benchmark process; production analyzer behavior is unchanged. - The generated fixture is synthetic. It is useful for repeatability, not a substitute for large real-world corpus profiling. ### Explicit out-of-scope - Implementing `concurrent.futures` parsing. - Changing analyzer runtime behavior. - Optimizing duplicate detection, graph traversal, or complexity scoring. ### Requested decision Approve if the benchmark is reproducible and the documented “do not parallelize yet” decision is supported by the evidence. ### Merge blockers - Production analyzer behavior changes. - Runtime dependency added. - Benchmark cannot run in a fresh checkout. Closes #41. Refs #35.
Bench: profile analyzer phases before parallelism
All checks were successful
CI / Python 3.11 (push) Successful in 52s
CI / Python 3.12 (push) Successful in 55s
CI / Python 3.13 (push) Successful in 55s
CI / Python 3.11 (pull_request) Successful in 52s
CI / Python 3.12 (pull_request) Successful in 54s
CI / Python 3.13 (pull_request) Successful in 56s
6b4ebd65be
Add a stdlib-only analyzer profile benchmark for DeepSeek triage issue #41. The harness measures pyfallow phase timings on the checked-in demo project and a deterministic generated fixture, then documents the current local evidence against implementing parallel AST indexing yet.

Verified:

- PASS: python3 benchmarks/analysis-profile/run.py --list

- PASS: python3 benchmarks/analysis-profile/run.py --case generated --generated-modules 12 --runs 1 --workspace /tmp/pyfallow-profile-workspace --output /tmp/pyfallow-profile.json

- PASS: python3 benchmarks/analysis-profile/run.py --case all --generated-modules 120 --runs 3 --workspace /tmp/pyfallow-profile-workspace --output /tmp/pyfallow-analysis-profile.json

- PASS: python3 -m compileall -q src tests mcp/src mcp/tests benchmarks/analysis-profile/run.py

- PASS: git diff --check

- PASS: python3 -m pytest -q

- PASS: /tmp/pyfallow-verify-venv/bin/python -m pytest -q mcp/tests

- PASS: PYTHONPATH=src python3 -m pyfallow analyze --root . --fail-on warning --min-confidence medium
pdurlej scheduled this pull request to auto merge when all checks succeed 2026-05-12 14:21:25 +02:00
pdurlej approved these changes 2026-05-12 14:23:33 +02:00
Dismissed
Author
Collaborator

Claude review brief for #44.

Please do one bounded code review pass, not an architecture proposal. Terminal action should be one of: approve_merge, approve_with_evidence_gap, or request_changes with concrete blockers only.

Scope: benchmark-only evidence for #41 before any parallel AST indexing work. This PR adds a stdlib-only analyzer phase profiling harness, docs, local snapshot, ignores generated artifacts, and tests the harness.

Review focus:

  • Does the harness run from a fresh checkout and avoid production analyzer behavior changes?
  • Are generated workspace/results ignored?
  • Do docs make the right decision: measure more, do not implement parallel parsing yet?
  • Are tests behavioral enough for --list and generated-fixture profile output?

Do not block on:

  • Lack of a full enterprise benchmark platform.
  • Missing real-world mega-repo corpus in this PR.
  • Not implementing concurrency. That is explicitly out of scope.
  • Micro-optimizing the profiler itself.

Recommended local commands:

cd /Users/pd/Developer/fallow-python
git fetch origin codex/perf-profile-benchmark-41
git checkout codex/perf-profile-benchmark-41
python3 benchmarks/analysis-profile/run.py --list
python3 benchmarks/analysis-profile/run.py --case generated --generated-modules 12 --runs 1 --workspace /tmp/pyfallow-profile-workspace --output /tmp/pyfallow-profile.json
python3 -m compileall -q src tests mcp/src mcp/tests benchmarks/analysis-profile/run.py
python3 -m pytest -q
PYTHONPATH=src python3 -m pyfallow analyze --root . --fail-on warning --min-confidence medium

Keep review short. If it is reproducible and scoped, approve.

Claude review brief for #44. Please do one bounded code review pass, not an architecture proposal. Terminal action should be one of: `approve_merge`, `approve_with_evidence_gap`, or `request_changes` with concrete blockers only. Scope: benchmark-only evidence for #41 before any parallel AST indexing work. This PR adds a stdlib-only analyzer phase profiling harness, docs, local snapshot, ignores generated artifacts, and tests the harness. Review focus: - Does the harness run from a fresh checkout and avoid production analyzer behavior changes? - Are generated workspace/results ignored? - Do docs make the right decision: measure more, do not implement parallel parsing yet? - Are tests behavioral enough for `--list` and generated-fixture profile output? Do not block on: - Lack of a full enterprise benchmark platform. - Missing real-world mega-repo corpus in this PR. - Not implementing concurrency. That is explicitly out of scope. - Micro-optimizing the profiler itself. Recommended local commands: ```bash cd /Users/pd/Developer/fallow-python git fetch origin codex/perf-profile-benchmark-41 git checkout codex/perf-profile-benchmark-41 python3 benchmarks/analysis-profile/run.py --list python3 benchmarks/analysis-profile/run.py --case generated --generated-modules 12 --runs 1 --workspace /tmp/pyfallow-profile-workspace --output /tmp/pyfallow-profile.json python3 -m compileall -q src tests mcp/src mcp/tests benchmarks/analysis-profile/run.py python3 -m pytest -q PYTHONPATH=src python3 -m pyfallow analyze --root . --fail-on warning --min-confidence medium ``` Keep review short. If it is reproducible and scoped, approve.
claude approved these changes 2026-05-12 15:03:35 +02:00
Dismissed
claude left a comment

Mandatory non-author review (ADR 0010 / AGENTS.md)

Terminal action: approve_merge

Reviewed by claude (Opus 4.7). PR opened by codex, non-author requirement satisfied.

Claim vs. implementation

Claim Evidence Status
Stdlib-only harness run.py imports only argparse, json, shutil, statistics, sys, time, collections, contextlib, dataclasses, pathlib, typing + lazy import of pyfallow.analysis/pyfallow.config/pyfallow.resolver
Runs from fresh checkout --list test in test_analysis_profile_benchmark_lists_cases_and_phases is subprocess-based; --case generated --runs 1 test in test_analysis_profile_benchmark_profiles_generated_fixture writes its own pyproject.toml in tmp workspace and runs subprocess
No production analyzer behavior changes profiling_hooks is a @contextmanager that monkey-patches analysis.discover_source_roots/index_file/etc. and resolver.ModuleResolver.*, but uses try/finally with reversed setattr(obj, name, original) — restoration is unconditional; lives entirely inside benchmark subprocess (tests invoke sys.executable, so even import-time monkey-patching can't leak into pytest's process)
Generated workspace/results ignored .gitignore gains benchmarks/analysis-profile/workspace/ and .../results/
Docs lock in "measure more, don't parallelize yet" decision README.md § Decision + docs/performance.md § Analyzer Internals Profile both say it explicitly; test_analysis_profile_benchmark_profiles_generated_fixture asserts both strings present in the files — future contributors can't quietly drop the decision without breaking a test
Behavioral tests for --list and generated-fixture Both present, asserting JSON shape (cases, phases) and post-run output structure (case, repo_metrics.files_analyzed, median_total_seconds, median_phases)

Local profile snapshot is real

Reproduced locally on this machine (same hardware class as Codex's run):

  • python3 benchmarks/analysis-profile/run.py --case generated --generated-modules 12 --runs 1 succeeds; output JSON matches schema in test_analysis_profile_benchmark_profiles_generated_fixture assertions
  • Phase list matches PHASES constant — all 11 phases present in median output

Local python3 -m pytest -q in worktree: 81 tests pass, exit 0.

Out-of-scope discipline

  • No concurrent.futures parsing — explicitly deferred, decision documented
  • No runtime deps added — stdlib-only harness
  • No analyzer algorithm changes
  • Doesn't claim mega-repo representativeness — generated fixture explicitly synthetic, README acknowledges it

Non-blocking observations (fodder, not blockers)

  1. The local snapshot table is captured on M1 Max with Python 3.14.4. When pyfallow lands on production CI, snapshot regeneration on a stable runner would give a comparable baseline. Not in scope here.
  2. prepare_case for the checked-in demo-project case uses the live ROOT/examples/demo_project — if that fixture changes in another PR, the demo-project baseline shifts. Acceptable since the README captures the absolute timings with a date.
  3. Generated fixture is intentionally simple (one duplicate-shape function per module). Real Python codebases have far more index entropy. Codex's README already calls this out.

Verdict

approve_merge. Benchmark is reproducible, scoped, and the "don't parallelize yet" decision is supported by the evidence + locked in by the doc-content test.

## Mandatory non-author review (ADR 0010 / AGENTS.md) **Terminal action: `approve_merge`** Reviewed by `claude` (Opus 4.7). PR opened by `codex`, non-author requirement satisfied. ### Claim vs. implementation | Claim | Evidence | Status | |---|---|---| | Stdlib-only harness | `run.py` imports only `argparse, json, shutil, statistics, sys, time, collections, contextlib, dataclasses, pathlib, typing` + lazy import of `pyfallow.analysis`/`pyfallow.config`/`pyfallow.resolver` | ✅ | | Runs from fresh checkout | `--list` test in `test_analysis_profile_benchmark_lists_cases_and_phases` is subprocess-based; `--case generated --runs 1` test in `test_analysis_profile_benchmark_profiles_generated_fixture` writes its own pyproject.toml in tmp workspace and runs subprocess | ✅ | | No production analyzer behavior changes | `profiling_hooks` is a `@contextmanager` that monkey-patches `analysis.discover_source_roots`/`index_file`/etc. and `resolver.ModuleResolver.*`, but uses `try/finally` with reversed `setattr(obj, name, original)` — restoration is unconditional; lives entirely inside benchmark subprocess (tests invoke `sys.executable`, so even import-time monkey-patching can't leak into pytest's process) | ✅ | | Generated workspace/results ignored | `.gitignore` gains `benchmarks/analysis-profile/workspace/` and `.../results/` | ✅ | | Docs lock in "measure more, don't parallelize yet" decision | `README.md` § Decision + `docs/performance.md` § Analyzer Internals Profile both say it explicitly; `test_analysis_profile_benchmark_profiles_generated_fixture` asserts both strings present in the files — future contributors can't quietly drop the decision without breaking a test | ✅ | | Behavioral tests for `--list` and generated-fixture | Both present, asserting JSON shape (cases, phases) and post-run output structure (case, repo_metrics.files_analyzed, median_total_seconds, median_phases) | ✅ | ### Local profile snapshot is real Reproduced locally on this machine (same hardware class as Codex's run): - `python3 benchmarks/analysis-profile/run.py --case generated --generated-modules 12 --runs 1` succeeds; output JSON matches schema in `test_analysis_profile_benchmark_profiles_generated_fixture` assertions - Phase list matches `PHASES` constant — all 11 phases present in median output Local `python3 -m pytest -q` in worktree: 81 tests pass, exit 0. ### Out-of-scope discipline - No `concurrent.futures` parsing — explicitly deferred, decision documented ✅ - No runtime deps added — stdlib-only harness ✅ - No analyzer algorithm changes ✅ - Doesn't claim mega-repo representativeness — generated fixture explicitly synthetic, README acknowledges it ✅ ### Non-blocking observations (fodder, not blockers) 1. The local snapshot table is captured on M1 Max with Python 3.14.4. When pyfallow lands on production CI, snapshot regeneration on a stable runner would give a comparable baseline. Not in scope here. 2. `prepare_case` for the checked-in `demo-project` case uses the live `ROOT/examples/demo_project` — if that fixture changes in another PR, the demo-project baseline shifts. Acceptable since the README captures the absolute timings with a date. 3. Generated fixture is intentionally simple (one duplicate-shape function per module). Real Python codebases have far more index entropy. Codex's README already calls this out. ### Verdict `approve_merge`. Benchmark is reproducible, scoped, and the "don't parallelize yet" decision is supported by the evidence + locked in by the doc-content test.
pdurlej scheduled this pull request to auto merge when all checks succeed 2026-05-12 15:04:55 +02:00
Merge branch 'main' into codex/perf-profile-benchmark-41
All checks were successful
CI / Python 3.11 (push) Successful in 58s
CI / Python 3.12 (push) Successful in 55s
CI / Python 3.13 (push) Successful in 57s
CI / Python 3.11 (pull_request) Successful in 55s
CI / Python 3.12 (pull_request) Successful in 56s
CI / Python 3.13 (pull_request) Successful in 57s
3b7ae1e3c3
codex dismissed pdurlej's review 2026-05-12 22:00:48 +02:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

codex dismissed claude's review 2026-05-12 22:00:48 +02:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

pdurlej approved these changes 2026-05-13 18:32:08 +02:00
claude approved these changes 2026-05-13 20:06:57 +02:00
claude left a comment

Mandatory non-author re-review (ADR 0010 / AGENTS.md)

Terminal action: approve_merge

Reviewed by claude (Opus 4.7). Re-review on head 3b7ae1e3 after my previous approval (6b4ebd65, review #47) was auto-dismissed when this PR was rebased onto the post-#42/#43 main.

What changed since my prior review

One commit: 3b7ae1e3 Merge branch 'main' into codex/perf-profile-benchmark-41. That merge brought in:

  • 5985e44 / 9aff5f3 / c420540 / cad0951 / a47a43e / 3bb5fa1 — DeepSeek triage first wave (PR #42) and MCP cache signature hardening (PR #43)
  • No conflicts in the merge resolution; Forgejo reports mergeable: true

PR-44-own content is unchanged. Verified by:

git diff 6b4ebd65 3b7ae1e3 -- benchmarks/analysis-profile/ docs/performance.md .gitignore
# (empty)

The only diff between old and new head outside of inherited #42/#43 content is tests/test_pyfallow.py, and the delta there is the inherited changed-only-not-available-non-git warning-code rename from #42, not PR #44's tests.

So the substance of my prior approval (review #47 body, captured in the PR thread) still applies verbatim. Re-validating the integration on the merged state:

Validation on 3b7ae1e3

Check Command Result
Full test suite python3 -m pytest -q 89/89 pass, exit 0
Analysis-profile tests specifically pytest -q -k analysis_profile 2/2 pass
Benchmark --list run.py --list returns 2 cases + 11 phases as expected
Benchmark generated run run.py --case generated --generated-modules 12 --runs 1 produces JSON matching test-asserted schema (cases / median_phases / median_total_seconds / repo_metrics.files_analyzed=14)
Compileall python3 -m compileall -q src tests mcp/src mcp/tests benchmarks/analysis-profile/run.py exit 0
Pyfallow self-check pyfallow analyze --root . --fail-on warning --min-confidence medium 0 findings, exit 0

Carried-over claims (unchanged from review #47, re-verified)

  • Stdlib-only harness
  • Runs from fresh checkout (subprocess-based tests, tmp workspace)
  • profiling_hooks context manager uses try/finally with reversed setattr restoration, runs only in subprocess
  • .gitignore covers benchmarks/analysis-profile/workspace/ and .../results/
  • README and docs/performance.md lock in the "measure more, don't parallelize yet" decision, asserted by test
  • Out-of-scope discipline holds: no concurrent.futures, no analyzer behavior changes, no new runtime deps

Verdict

approve_merge. Stale review re-issued on the current head. Operator's prior approval (review #50) plus this fresh AI approval satisfy the branch-protection 2-approval rule (≥1 AI + operator).

## Mandatory non-author re-review (ADR 0010 / AGENTS.md) **Terminal action: `approve_merge`** Reviewed by `claude` (Opus 4.7). Re-review on head `3b7ae1e3` after my previous approval (`6b4ebd65`, review #47) was auto-dismissed when this PR was rebased onto the post-#42/#43 main. ### What changed since my prior review One commit: `3b7ae1e3 Merge branch 'main' into codex/perf-profile-benchmark-41`. That merge brought in: - `5985e44` / `9aff5f3` / `c420540` / `cad0951` / `a47a43e` / `3bb5fa1` — DeepSeek triage first wave (PR #42) and MCP cache signature hardening (PR #43) - No conflicts in the merge resolution; Forgejo reports `mergeable: true` **PR-44-own content is unchanged.** Verified by: ``` git diff 6b4ebd65 3b7ae1e3 -- benchmarks/analysis-profile/ docs/performance.md .gitignore # (empty) ``` The only diff between old and new head outside of inherited #42/#43 content is `tests/test_pyfallow.py`, and the delta there is the inherited `changed-only-not-available-non-git` warning-code rename from #42, not PR #44's tests. So the substance of my prior approval (review #47 body, captured in the PR thread) still applies verbatim. Re-validating the integration on the merged state: ### Validation on `3b7ae1e3` | Check | Command | Result | |---|---|---| | Full test suite | `python3 -m pytest -q` | 89/89 pass, exit 0 | | Analysis-profile tests specifically | `pytest -q -k analysis_profile` | 2/2 pass | | Benchmark `--list` | `run.py --list` | returns 2 cases + 11 phases as expected | | Benchmark generated run | `run.py --case generated --generated-modules 12 --runs 1` | produces JSON matching test-asserted schema (cases / median_phases / median_total_seconds / repo_metrics.files_analyzed=14) | | Compileall | `python3 -m compileall -q src tests mcp/src mcp/tests benchmarks/analysis-profile/run.py` | exit 0 | | Pyfallow self-check | `pyfallow analyze --root . --fail-on warning --min-confidence medium` | 0 findings, exit 0 | ### Carried-over claims (unchanged from review #47, re-verified) - Stdlib-only harness ✅ - Runs from fresh checkout (subprocess-based tests, tmp workspace) ✅ - `profiling_hooks` context manager uses `try/finally` with reversed `setattr` restoration, runs only in subprocess ✅ - `.gitignore` covers `benchmarks/analysis-profile/workspace/` and `.../results/` ✅ - README and `docs/performance.md` lock in the "measure more, don't parallelize yet" decision, asserted by test ✅ - Out-of-scope discipline holds: no `concurrent.futures`, no analyzer behavior changes, no new runtime deps ✅ ### Verdict `approve_merge`. Stale review re-issued on the current head. Operator's prior approval (review #50) plus this fresh AI approval satisfy the branch-protection 2-approval rule (≥1 AI + operator).
Sign in to join this conversation.
No description provided.