[P1][determinism] Regression test enforcing analyzer determinism + fingerprint stability #121

Open
opened 2026-06-17 23:11:16 +02:00 by claude · 0 comments
Collaborator

Architect direction (claude, Opus 4.8). Determinism is fallow-py's foundational promise (docs/philosophy.md Promise #1; ADR 0001; ADR 0007 "pure function (repo state, config) → findings"). The 2026-06-15 audit relied on it (re-runs were identical) but nothing in the test suite enforces it. A future change to dict/set iteration, path handling, or traversal order could silently break replayability and no test would catch it.

Goal

A regression test that fails loudly if analyzer output stops being deterministic.

Design

  1. Same-input determinism. Run analyze twice on the same fixture (and once on a small real fixture repo) and assert the --format json and --format agent-fix-plan outputs are byte-identical after normalizing the legitimately-variable fields (generated_at timestamp, absolute root path). If those fields are the only differences, output is deterministic.
  2. Fingerprint stability. Assert the same finding produces the same fingerprint across the two runs — fingerprints must be a function of (rule + symbol + canonical evidence), not traversal artifacts (philosophy.md Promise #4). This is the property the precision harness (#harness) depends on for cross-run matching.
  3. Ordering stability. Assert the issues list ordering is stable across runs (sort key is deterministic, not hash-seeded).

Acceptance criteria

  • A test in tests/ (e.g. test_determinism.py) that runs the analyzer twice and asserts byte-identical output modulo generated_at / absolute root.
  • Fingerprint-stability assertion across runs on a fixture with at least one finding per major rule family (dead-code, dependencies, architecture).
  • Test runs under the existing python -m pytest -q gate and in CI.
  • If any non-determinism is found while writing the test, fix the root cause (or split it into a severity:high issue if non-trivial) — a green test that hides a real non-determinism is worse than no test.

Out of scope

  • Cross-platform / cross-Python-version reproducibility (a stronger guarantee; separate issue if wanted).
  • Performance.

Note

Cheap, foundational, and a prerequisite for trusting the precision harness (#harness), which matches findings across runs by fingerprint. Worth doing alongside or before that harness.

**Architect direction (claude, Opus 4.8).** Determinism is fallow-py's foundational promise (`docs/philosophy.md` Promise #1; ADR 0001; ADR 0007 "pure function `(repo state, config) → findings`"). The 2026-06-15 audit *relied* on it (re-runs were identical) but **nothing in the test suite enforces it.** A future change to dict/set iteration, path handling, or traversal order could silently break replayability and no test would catch it. ## Goal A regression test that fails loudly if analyzer output stops being deterministic. ## Design 1. **Same-input determinism.** Run `analyze` twice on the same fixture (and once on a small real fixture repo) and assert the `--format json` and `--format agent-fix-plan` outputs are **byte-identical** after normalizing the legitimately-variable fields (`generated_at` timestamp, absolute `root` path). If those fields are the *only* differences, output is deterministic. 2. **Fingerprint stability.** Assert the same finding produces the same `fingerprint` across the two runs — fingerprints must be a function of (rule + symbol + canonical evidence), not traversal artifacts (philosophy.md Promise #4). This is the property the precision harness (#harness) depends on for cross-run matching. 3. **Ordering stability.** Assert the `issues` list ordering is stable across runs (sort key is deterministic, not hash-seeded). ## Acceptance criteria - A test in `tests/` (e.g. `test_determinism.py`) that runs the analyzer twice and asserts byte-identical output modulo `generated_at` / absolute `root`. - Fingerprint-stability assertion across runs on a fixture with at least one finding per major rule family (dead-code, dependencies, architecture). - Test runs under the existing `python -m pytest -q` gate and in CI. - If any non-determinism is found while writing the test, fix the root cause (or split it into a `severity:high` issue if non-trivial) — a green test that hides a real non-determinism is worse than no test. ## Out of scope - Cross-platform / cross-Python-version reproducibility (a stronger guarantee; separate issue if wanted). - Performance. ## Note Cheap, foundational, and a prerequisite for trusting the precision harness (#harness), which matches findings across runs by fingerprint. Worth doing alongside or before that harness.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
pdurlej/fallow-py#121
No description provided.