[P1][determinism] Regression test enforcing analyzer determinism + fingerprint stability #121

New issue

Open

opened 2026-06-17 23:11:16 +02:00 by claude · 0 comments

claude commented

2026-06-17 23:11:16 +02:00

Collaborator

Architect direction (claude, Opus 4.8). Determinism is fallow-py's foundational promise (docs/philosophy.md Promise #1; ADR 0001; ADR 0007 "pure function (repo state, config) → findings"). The 2026-06-15 audit relied on it (re-runs were identical) but nothing in the test suite enforces it. A future change to dict/set iteration, path handling, or traversal order could silently break replayability and no test would catch it.

Goal

A regression test that fails loudly if analyzer output stops being deterministic.

Design

Same-input determinism. Run analyze twice on the same fixture (and once on a small real fixture repo) and assert the --format json and --format agent-fix-plan outputs are byte-identical after normalizing the legitimately-variable fields (generated_at timestamp, absolute root path). If those fields are the only differences, output is deterministic.
Fingerprint stability. Assert the same finding produces the same fingerprint across the two runs — fingerprints must be a function of (rule + symbol + canonical evidence), not traversal artifacts (philosophy.md Promise #4). This is the property the precision harness (#harness) depends on for cross-run matching.
Ordering stability. Assert the issues list ordering is stable across runs (sort key is deterministic, not hash-seeded).

Acceptance criteria

A test in tests/ (e.g. test_determinism.py) that runs the analyzer twice and asserts byte-identical output modulo generated_at / absolute root.
Fingerprint-stability assertion across runs on a fixture with at least one finding per major rule family (dead-code, dependencies, architecture).
Test runs under the existing python -m pytest -q gate and in CI.
If any non-determinism is found while writing the test, fix the root cause (or split it into a severity:high issue if non-trivial) — a green test that hides a real non-determinism is worse than no test.

Out of scope

Cross-platform / cross-Python-version reproducibility (a stronger guarantee; separate issue if wanted).
Performance.

Note

Cheap, foundational, and a prerequisite for trusting the precision harness (#harness), which matches findings across runs by fingerprint. Worth doing alongside or before that harness.

**Architect direction (claude, Opus 4.8).** Determinism is fallow-py's foundational promise (`docs/philosophy.md` Promise #1; ADR 0001; ADR 0007 "pure function `(repo state, config) → findings`"). The 2026-06-15 audit *relied* on it (re-runs were identical) but **nothing in the test suite enforces it.** A future change to dict/set iteration, path handling, or traversal order could silently break replayability and no test would catch it. ## Goal A regression test that fails loudly if analyzer output stops being deterministic. ## Design 1. **Same-input determinism.** Run `analyze` twice on the same fixture (and once on a small real fixture repo) and assert the `--format json` and `--format agent-fix-plan` outputs are **byte-identical** after normalizing the legitimately-variable fields (`generated_at` timestamp, absolute `root` path). If those fields are the *only* differences, output is deterministic. 2. **Fingerprint stability.** Assert the same finding produces the same `fingerprint` across the two runs — fingerprints must be a function of (rule + symbol + canonical evidence), not traversal artifacts (philosophy.md Promise #4). This is the property the precision harness (#harness) depends on for cross-run matching. 3. **Ordering stability.** Assert the `issues` list ordering is stable across runs (sort key is deterministic, not hash-seeded). ## Acceptance criteria - A test in `tests/` (e.g. `test_determinism.py`) that runs the analyzer twice and asserts byte-identical output modulo `generated_at` / absolute `root`. - Fingerprint-stability assertion across runs on a fixture with at least one finding per major rule family (dead-code, dependencies, architecture). - Test runs under the existing `python -m pytest -q` gate and in CI. - If any non-determinism is found while writing the test, fix the root cause (or split it into a `severity:high` issue if non-trivial) — a green test that hides a real non-determinism is worse than no test. ## Out of scope - Cross-platform / cross-Python-version reproducibility (a stronger guarantee; separate issue if wanted). - Performance. ## Note Cheap, foundational, and a prerequisite for trusting the precision harness (#harness), which matches findings across runs by fingerprint. Worth doing alongside or before that harness.

claude added the

phase:b

severity:medium

area:test-coverage

labels

2026-06-17 23:11:16 +02:00