feat(v0): resolve_findings detects review_run soft-fail (Swiss Cheese Layer 8) #54
No reviewers
Labels
No labels
agent/claude-code
agent/codex
agent/gemini
agent/hermes
agent/iskra
agent/ollama
agent/patchwarden
area:business-model
area:competitive
area:discovery
area:forgejo
area:metrics
area:product-strategy
area:v0-core
cagan-grade-approved
client:platform
dependency/blocked
dependency/blocks-others
dependency/cross-repo
dependency/needs-confirmation
domain:agents
domain:ci
domain:docs
domain:forgejo
domain:infra
domain:memory
domain:runtime
domain:signal
domain:ux
flow/architecture
flow/blocked
flow/deployed
flow/done
flow/implementation
flow/intake
flow/maintained
flow/observed
flow/ready
flow/refining
flow/retired
flow/review
judge/codex-candidate
judge/hermes-candidate
judge/low-confidence
judge/needs-refinement
judge/operator-needed
judge/p0
judge/p1
judge/p2
judge/p3
judge/park
judge/patchwarden-candidate
judge/stale-priority
kind/adr
kind/bug
kind/chore
kind/feature
kind/infra
kind/ops
kind/refactor
kind/research
kind:artifact
kind:decision
kind:dogfood
kind:epic
kind:implementation
kind:research
merge/auto
merge/manual
merge/manual-dependency-conflict
merge/manual-failing-tests
merge/manual-merge-conflict
merge/manual-missing-review
merge/manual-operator-preference
merge/manual-red-zone
merge/manual-security-sensitive
merge/manual-unclear-scope
merge/manual-unknown
mode:operator-only
mode:patchwarden-iskra-approved
mode:safe-auto
observed/erroring
observed/needs-followup
observed/pending
observed/retire-candidate
observed/unused
observed/used
priority:p0
priority:p1
priority:p2
priority:p3
ready-for-agent
review:claude-reviewed
review:codex-reviewed
review:dziadek-reviewed
review:needs-human
safety:external-write
safety:no-prod-mutation
safety:prod-impact
safety:secret-touch
size/large
size/medium
size/small
size/tiny
size/unknown
source/adr
source/agent-generated
source/manual
source/operator-chat
source/voice-note
status:blocked
status:blocked-on-discovery
status:cagan-grade-review-pending
status:codex-ready
status:merged:pending-evidence
status:needs-evidence
status:needs-operator-decision
status:operator-needed
status:parked
tier:0-anchor
tier:0-platform-substrate
tier:1-core
tier:1-iskra-value-layer
tier:2-supporting
tier:2-tools-products-modules
type:bug
type:chore
type:docs
type:feat
type:policy
type:research
wave:1-foundation
wave:2-positioning
wave:3-validation
wave:4-economics
wave:5-operating
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
pdurlej/patchwarden!54
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "claude/patchwarden-resolver-soft-fail-detection"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What
Closes Swiss Cheese Layer 8 hand-off gap (Scenariusz B z
docs/operations/code-vs-vision-snapshot-2026-05-27.md).When a reviewer lane runs with
fail_on_missing=Falseorfail_on_unparseable=False,review_runrecords the failure inruntime_metadata.error_kind("transport"or"unparseable") and emits empty findings. Before this PR,resolve_findingsignored that field — so a soft-failed Ollama call looked indistinguishable from a clean review. Operator could merge a "green" PR that was never actually reviewed.This is PR A of 3 defense-hardening PRs spun off the 2026-05-27 crosscheck.
Code changes
src/patchwarden/resolve_findings.py(+22 lines):SOFT_FAIL_KINDS = {"transport", "unparseable"}+SOFT_FAIL_BLOCKER_CODE = "soft_fail_review_unreliable"runtime_metadata.error_kind. If matches → append artifact-bound blocker:finding_id=None:(item.get("finding_id") or "", item["code"])Test changes — 138 → 144 green (+6)
tests/test_resolve_findings.py:soft_fail_artifact(error_kind, ...)mirrors whatreview_runactually emits when lane hasfail_on_*=False.SoftFailDetectionTestsclass:test_transport_error_kind_holdstest_unparseable_error_kind_holdstest_error_detail_is_truncated_to_200_charstest_findings_and_soft_fail_produce_both_blockerstest_clean_artifact_with_runtime_metadata_does_not_add_soft_fail_blockertest_unknown_error_kind_value_does_not_trigger_blockerAll 4 existing
ResolveFindingsTestsstill pass (no regression on non-soft-fail paths).Why this matters (D20)
D20 says LLM reviewers suspect, Patchwarden decides. This PR makes the "Patchwarden decides" half honest: when the LLM couldn't suspect anything (because Ollama was down), Patchwarden must decide HOLD, not PASS by default. Otherwise the policy beats LLM vibes principle (P2) silently degrades into "policy trusts empty answers" — opposite of what fail-closed defaults mean.
Schema impact
patchwarden.finding_resolution.v1blockerslist now contains two shapes:Backwards-compatible for consumers that read unknown fields tolerantly. No
schema_versionbump.Swiss Cheese context
Atomic per ADR-0017
resolve_findings.py(+22),test_resolve_findings.py(+125)base=main, no stackingNOT breaking M2 gate
This is defense hardening of existing v0, not a new feature. Per
docs/decisions.mdD16 + M2 milestone notes, hardening existing capability is in-scope; adding new capability is parked. PR A closes a real gap in the existing review_run ↔ resolve_findings contract.Token-accounting
~3-4% weekly Opus (sacred path edit + 6 new tests + recon of resolver helpers).
When a reviewer lane runs with fail_on_missing=False or fail_on_unparseable=False, review_run records the failure in runtime_metadata.error_kind ("transport" or "unparseable") and emits empty findings. Before this PR, resolve_findings ignored that field — so a soft-failed Ollama call looked indistinguishable from a clean review, and a downstream operator could merge a "green" PR that was never actually reviewed. This closes Swiss Cheese Layer 8 hand-off gap (Scenariusz B in the 2026-05-27 vision-vs-code crosscheck). ## What changes `src/patchwarden/resolve_findings.py`: - New constants `SOFT_FAIL_KINDS = {"transport", "unparseable"}` and `SOFT_FAIL_BLOCKER_CODE = "soft_fail_review_unreliable"`. - After processing each artifact's findings, check `artifact["runtime_metadata"]["error_kind"]`. If it matches SOFT_FAIL_KINDS, append a non-finding-bound blocker: ```python { "code": "soft_fail_review_unreliable", "finding_id": None, "finding_type_id": None, "severity": "blocker", "state": "manual_only", "reviewer_id": "<artifact's reviewer.agent_name>", "error_kind": "transport" | "unparseable", "error_detail": "<truncated to 200 chars>", } ``` - Sort key for blockers updated to handle `finding_id=None`: `(item.get("finding_id") or "", item["code"])`. `tests/test_resolve_findings.py`: - New helper `soft_fail_artifact(error_kind, ...)` mirrors what `review_run` actually emits. - New `SoftFailDetectionTests` class (6 tests): - `test_transport_error_kind_holds` — HOLD verdict + blocker recorded - `test_unparseable_error_kind_holds` — same for unparseable - `test_error_detail_is_truncated_to_200_chars` — bound on noisy errors - `test_findings_and_soft_fail_produce_both_blockers` — combo: real finding blocker + soft-fail blocker on the same PR - `test_clean_artifact_with_runtime_metadata_does_not_add_soft_fail_blocker` — regression: runtime_metadata present without error_kind is still PASS - `test_unknown_error_kind_value_does_not_trigger_blocker` — only transport/unparseable trigger; future kinds need explicit opt-in All 4 existing ResolveFindingsTests still pass (no regression on non-soft-fail paths). ## Why this matters D20 says LLM reviewers suspect, Patchwarden decides. This PR makes the "Patchwarden decides" half honest: when the LLM **couldn't** suspect anything (because Ollama was down), Patchwarden must decide HOLD, not PASS by default. Otherwise the "policy beats LLM vibes" principle (P2) silently degrades into "policy trusts empty answers" — the opposite of what fail-closed defaults mean. ## Schema impact `patchwarden.finding_resolution.v1` blockers list now contains two shapes: - finding-bound (existing): code + finding_id + finding_type_id + severity + state - artifact-bound (new): code + finding_id=null + reviewer_id + error_kind + error_detail (severity always "blocker", state always "manual_only") Backwards-compatible for consumers that read unknown fields tolerantly. No schema_version bump. ## Tests 138 → 144 green (+6 new, all pass on first run). ## Atomic per ADR-0017 - 2 files: resolve_findings.py (+22 lines), test_resolve_findings.py (+125 lines). - No new module, no new dependency, stdlib-only. - base=main, no stacking on prior PRs. ## Swiss Cheese context This is one of three defense-hardening PRs spun off the 2026-05-27 crosscheck (PR A of A/B/C). The crosscheck identified Layer 8 as the highest-priority gap — most likely to silently produce a green merge on a broken Patchwarden run. PR B (D20 architectural lint test) and PR C (secret-by-filename limitation doc) follow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>