feat(v0): exponential backoff for transient Forgejo HTTP failures (urllib only) #63

Closed
opened 2026-05-28 13:38:08 +02:00 by claude · 1 comment
Collaborator

Goal

Make forgejo_client.py and ollama_client.py recover from transient network failures without losing the stdlib-only discipline. Today a single 503 or socket timeout fails the whole CLI invocation — fine for dev, fragile for the dogfood loop.

Why this matters

  • Identified in 2026-05-28 Cloud Opus 4.6 repo review (§II.3 "Synchroniczne I/O i brak retry/backoff", i.e. "Synchronous I/O and no retry/backoff").
  • pdurlej/platform#527 evidence run took 1m6s on first hit, 44s and 41s on follow-ups — well within Forgejo's typical jitter. A single hiccup on a real review would surface as soft_fail_review_unreliable even when the actual issue is transient.
  • Defense hardening per D21 — no new core capability, just makes the existing wiring more honest about transient vs. permanent failure.

Scope

Where the change lands

  • src/patchwarden/forgejo_client.py: _urllib_request() (or equivalent transport helper). All Forgejo calls go through one chokepoint; add the retry there.
  • src/patchwarden/ollama_client.py: _urllib_transport(). Same treatment, but applied at the transport callable level so test stubs continue to work.

Retry policy (small, predictable)

| Aspect | Value |
|---|---|
| Max attempts | 3 (1 original + 2 retries) |
| Backoff | exponential, base 1s → [1s, 2s, 4s], slept BEFORE the next attempt; with the default 3 attempts only the 1s and 2s waits occur (4s applies only if max attempts is raised) |
| Jitter | none for v1 (deterministic for tests; revisit later if real traffic shows herd issues) |
| Retry-eligible errors | HTTP 429, HTTP 503, socket.timeout, TimeoutError, urllib.error.URLError whose reason is a connection error (ConnectionResetError, ConnectionRefusedError, BrokenPipeError) |
| Hard-fail errors (no retry) | HTTP 4xx other than 429 (401/403/404 are bugs, retry would hide them), OllamaUnparseableError (D20: never auto-recover from unparseable), any error after attempt 3 |
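
The retry-eligible and hard-fail rows above could be expressed as one predicate. This is a minimal sketch, not the project's code: the name `is_retryable` and the module-level constants are hypothetical, and it follows the table literally, so anything the table does not list (including other 5xx codes) is treated as a hard failure.

```python
import socket
import urllib.error

# Connection-level failures the policy lists as transient (hypothetical constant name).
_TRANSIENT_REASONS = (ConnectionResetError, ConnectionRefusedError, BrokenPipeError)
# HTTP statuses the policy lists as retry-eligible (hypothetical constant name).
_RETRYABLE_STATUS = frozenset({429, 503})


def is_retryable(exc: BaseException) -> bool:
    """Return True only for errors the retry policy names as transient."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code in _RETRYABLE_STATUS
    if isinstance(exc, urllib.error.URLError):
        # urlopen wraps socket-level errors; the original exception is in .reason.
        return isinstance(exc.reason, _TRANSIENT_REASONS)
    # socket.timeout is an alias of TimeoutError on Python 3.10+;
    # listing both keeps the check correct on older interpreters.
    return isinstance(exc, (socket.timeout, TimeoutError))
```

An `OllamaUnparseableError` falls through to the last line and returns False, which matches the D20 rule without the predicate needing to import it.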

Retry-After header on 429: if present and parseable as integer seconds, honor it instead of exponential schedule for that one wait. If not parseable, fall back to exponential.
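
The wait computation described above (exponential schedule, overridden by an integer Retry-After for one wait) might look like the following sketch. The function name `next_delay` and its parameters are assumptions made for illustration; only the behavior comes from the policy.

```python
from typing import Optional


def next_delay(retry_index: int, base_delay: float = 1.0,
               retry_after: Optional[str] = None) -> float:
    """Seconds to sleep before retry number retry_index (0-based)."""
    if retry_after is not None:
        value = retry_after.strip()
        # Only plain non-negative integer seconds are honored. HTTP-date
        # values, negatives, and garbage fall back to the exponential schedule.
        if value.isascii() and value.isdigit():
            return float(value)
    # Exponential schedule: base, 2*base, 4*base, ...
    return base_delay * (2 ** retry_index)
```

With the defaults, `next_delay(0)` and `next_delay(1)` give the 1s and 2s waits, and `next_delay(0, retry_after="2")` gives 2s, which is the case the acceptance criteria test.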

Implementation discipline

  • stdlib only: time.sleep, urllib.error.HTTPError/URLError, socket.timeout. No tenacity, no requests, no backoff library.
  • Use a small private helper, e.g. def _with_retry(callable, *, max_attempts=3, base_delay=1.0): ... in each module rather than one shared utility (the two modules have different exception translations — keep the coupling local).
  • Log each retry as ::notice:: so the workflow log shows what happened — never silent recovery.
  • Configurable via env vars PATCHWARDEN_HTTP_MAX_ATTEMPTS and PATCHWARDEN_HTTP_BASE_DELAY_SECONDS for the platform workflow to tune later. Defaults stay as above.

Acceptance criteria

  • PYTHONPATH=src python3 -m unittest discover tests → all green.
  • New tests tests/test_forgejo_client_retry.py and tests/test_ollama_client_retry.py cover:
    • 503 then 200 → succeeds after 1 retry, total wait ≥ 1s
    • 503, 503, 200 → succeeds after 2 retries, total wait ≥ 3s
    • 503, 503, 503 → final failure raises after 3 attempts
    • 404 → fails immediately (no retry, single attempt)
    • 429 with Retry-After: 2 → honors 2s instead of 1s
    • socket.timeout then success → retried
    • OllamaUnparseableError → never retried (D20)
  • Tests use time.monotonic mocking or an injectable sleep_fn (preferred — keep tests fast).
  • grep -r "import requests\|import httpx\|import tenacity\|import backoff" src/patchwarden/ returns zero hits.
  • tests/test_d20_architectural_boundary.py still passes (no new forbidden imports/endpoints).
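
A few of the cases listed above can be tested without any real sleeping by recording the requested waits. The sketch below is illustrative only: `retry_call` is a simplified stand-in for the module-private helper (it handles HTTP status codes only), and `scripted` is a hypothetical fixture that replays a list of statuses.

```python
import unittest
import urllib.error


def retry_call(func, *, max_attempts=3, base_delay=1.0, sleep_fn=None):
    """Simplified stand-in for the retry helper under test (hypothetical)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except urllib.error.HTTPError as exc:
            if attempt == max_attempts or exc.code not in (429, 503):
                raise
            sleep_fn(base_delay * 2 ** (attempt - 1))


def scripted(statuses):
    """Return (func, calls): func replays statuses, calls records each attempt."""
    remaining = list(statuses)
    calls = []

    def func():
        status = remaining.pop(0)
        calls.append(status)
        if status != 200:
            raise urllib.error.HTTPError(
                "http://forgejo.invalid", status, "scripted", None, None)
        return status

    return func, calls


class RetryPolicyTest(unittest.TestCase):
    def test_503_then_200_waits_one_second(self):
        sleeps = []
        func, calls = scripted([503, 200])
        self.assertEqual(retry_call(func, sleep_fn=sleeps.append), 200)
        self.assertEqual(sleeps, [1.0])

    def test_two_503_then_200_waits_three_seconds(self):
        sleeps = []
        func, calls = scripted([503, 503, 200])
        self.assertEqual(retry_call(func, sleep_fn=sleeps.append), 200)
        self.assertEqual(sum(sleeps), 3.0)

    def test_three_503_raises_after_three_attempts(self):
        sleeps = []
        func, calls = scripted([503, 503, 503])
        with self.assertRaises(urllib.error.HTTPError):
            retry_call(func, sleep_fn=sleeps.append)
        self.assertEqual(len(calls), 3)
        self.assertEqual(sleeps, [1.0, 2.0])

    def test_404_fails_on_first_attempt(self):
        sleeps = []
        func, calls = scripted([404])
        with self.assertRaises(urllib.error.HTTPError):
            retry_call(func, sleep_fn=sleeps.append)
        self.assertEqual(len(calls), 1)
        self.assertEqual(sleeps, [])
```

Passing `sleeps.append` as `sleep_fn` turns the "total wait" criteria into exact assertions on the recorded list, so the suite stays fast and deterministic.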

No-go

  • Do NOT add a runtime dep (tenacity, backoff, requests) — stdlib-only per cousin discipline.
  • Do NOT retry on OllamaUnparseableError or HTTP 4xx (other than 429) — per D20, unparseable output must surface explicitly, not be hidden by retry.
  • Do NOT make retry async — keep the codebase synchronous. The G1 (webhook mode) issue is a separate larger change parked per D21.
  • Do NOT modify OllamaRequest / response dataclasses — the change is internal to the transport layer.

Spec sources

  • src/patchwarden/forgejo_client.py — current transport (no retry today)
  • src/patchwarden/ollama_client.py: _urllib_transport() + _attempt()
  • tests/test_d20_architectural_boundary.py — must stay green
  • 2026-05-28 Cloud Opus 4.6 repo analysis, §II.3 "Synchroniczne I/O i brak retry/backoff" ("Synchronous I/O and no retry/backoff")

Status flow

ready-for-agent → agent claims → PR → operator review → merge


Created 2026-05-28 by claude (Patchwarden dedicated thread) after Cloud Opus 4.6 review. Atomic for any cousin (e.g. Cloud Opus 4.6 via Antigravity) to pick up.

Collaborator

{
  "confidence": 5,
  "effort_hint": "medium",
  "escalation": {
    "kind": "patchwarden_review",
    "reason": "Retry behavior should preserve Patchwarden reliability contracts and stdlib-only discipline."
  },
  "evidence_refs": [
    {
      "note": "Issue adds exponential backoff for transient Forgejo and Ollama network failures.",
      "type": "forgejo",
      "value": "issue-title-body-labels-and-target-snapshot"
    },
    {
      "note": "Body states a single 503 or timeout can currently fail the whole CLI invocation.",
      "type": "forgejo",
      "value": "issue-body-problem"
    },
    {
      "note": "Scope targets transport chokepoints while preserving stdlib-only runtime code.",
      "type": "forgejo",
      "value": "issue-body-scope"
    }
  ],
  "impact": 4,
  "judge_actor": {
    "name": "iskra",
    "runtime": "openclaw"
  },
  "judged_at": "2026-06-08T01:04:00Z",
  "labels_to_apply": [
    "judge/p2",
    "judge/patchwarden-candidate"
  ],
  "piotr_fit": "high",
  "priority": "p2",
  "rationale_summary": "This is P2 Patchwarden hardening because bounded retries make transient failures less likely to masquerade as unreliable reviews.",
  "reach": 4,
  "recommended_next_action": "patchwarden_candidate",
  "rerun_reason": "no_prior_judgment",
  "schema": "openclaw.judge.v0",
  "target": {
    "kind": "issue",
    "number": 63,
    "repo": "pdurlej/patchwarden"
  },
  "target_snapshot": {
    "body_hash": "sha256:4db3ac009f1b3e7b374f6c32de35acdd6f39840d85bc100bbedfabee91298a59",
    "commit_count": null,
    "evidence_hash": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "head_sha": null,
    "labels": [
      "agent/gemini",
      "area:v0-core",
      "flow/ready",
      "kind/feature",
      "ready-for-agent",
      "size/medium"
    ],
    "labels_hash": "sha256:0d320e131156dd17518177afa26919a654867d8c60f519f558eb2cf2e542ca97",
    "state": "open",
    "title_hash": "sha256:b6938992fd1a47c23f326a38f2b5077604f9e5a0d950072cb533ec0bbda99532",
    "updated_at": "2026-05-28T14:04:53+02:00"
  },
  "top_caveat": "Keep retries bounded and transparent so permanent failures still fail clearly."
}
