fix(providers): replace direct-API with local-CLI invocations #5

Merged
pdurlej merged 2 commits from codex/review-pipeline/local-cli-providers into main 2026-05-01 12:14:05 +02:00
Owner

Summary

Architectural correction post PR #4. Operator clarified (2026-05-01, after PR #4 merged): claude + codex reviewers use local CLI subscriptions, not direct API. Only z.ai goes via HTTPS (key from Infisical).

PR #4 shipped anthropic.py and openai.py as direct-API HTTP adapters. Wrong design — would have required managing 3 API keys, paying per-call instead of using existing Max/Pro subscriptions, and missing that Codex CLI has codex exec --sandbox read-only purpose-built for review.

This PR fixes that.

What changes

New providers:

  • providers/claude_cli.pyclaude -p --model opus --effort max --system-prompt … --no-session-persistence. Uses subscription auth via Keychain. Fresh Claude session per reviewer call → independent reasoning.
  • providers/codex_cli.pycodex exec --sandbox read-only --ignore-user-config -m gpt-5.5 -c model_reasoning_effort=high. Uses subscription auth via codex's own login. Fresh Codex session per reviewer call.

Registry update (run_review.py REVIEWER_REGISTRY):

Slot Was (PR #4) Now (PR #5)
tech-claude anthropic (HTTPS) claude_cli (local)
product-claude anthropic claude_cli
tech-gpt openai (HTTPS) codex_cli (local)
product-gpt openai codex_cli
tech-glm / product-glm zai zai (unchanged)

Fallbacks preserved: anthropic.py + openai.py stay in providers/ for edge cases (CI runners without CLIs, or operator preference for billed API on a specific reviewer). Override via --provider-override tech-claude=anthropic.

z.ai docs: Infisical path documented inline:

/home-platform/providers/zai/ZAI_API_KEY (prod env)

With infisical secrets get … snippet for export.

Spec rewrite: prompts/01.5c §2 — auth model section + invocation tables. Cost stance updated: claude/gpt have $0 per-PR cost (subscriptions); only z.ai trickles cents.

Operator inbox convention: state/OPERATOR_INBOX.md — new file for operator → claude inbound messages while operator is mobile. ScheduleWakeup loop set up (10-min interval) to check.

Tests

57/57 green:

  • 34 in test_run_review.py (was 29; +5 wire-up tests):
    • test_get_provider_claude_cli / test_get_provider_codex_cli
    • test_default_registry_uses_local_cli_for_claude_and_gpt
    • test_get_provider_anthropic_fallback_still_available
    • test_get_provider_openai_fallback_still_available
  • 14 consolidator (PR #3)
  • 4 negative-control (PR #1)
  • 5 platformctl smoke
pytest platformctl/tests/ → 57 passed in 3.84s

NOT in this PR (intentional)

  • Real CLI invocation untested. claude_cli.py + codex_cli.py ship without exercising live CLI. Subprocess machinery + flag combos are based on claude --help / codex exec --help. First canary 3+3 run on a small Phase 02 PR will validate; flags can be tuned via env vars without code change.
  • No Forgejo Actions wiring. Still follow-up.
  • Anthropic + OpenAI HTTP providers not deleted. Kept as fallbacks; could be removed in a future PR once operator confirms CLI path is reliable for canary + 5+ subsequent runs.

Test plan

  • pytest platformctl/tests/ → 57 green
  • Manual: real canary 3+3 run on a small Phase 02 PR (post-merge); verify all 6 reviewers respond
  • Manual: induce claude CLI failure (e.g., wrong model alias), verify ProviderError surfaces correctly + DECISION_REQUIRED.md fires

🤖 Generated with Claude Code

## Summary Architectural correction post PR #4. Operator clarified (2026-05-01, after PR #4 merged): claude + codex reviewers use **local CLI subscriptions**, not direct API. Only z.ai goes via HTTPS (key from Infisical). PR #4 shipped `anthropic.py` and `openai.py` as direct-API HTTP adapters. Wrong design — would have required managing 3 API keys, paying per-call instead of using existing Max/Pro subscriptions, and missing that Codex CLI has `codex exec --sandbox read-only` purpose-built for review. This PR fixes that. ## What changes **New providers:** - `providers/claude_cli.py` — `claude -p --model opus --effort max --system-prompt … --no-session-persistence`. Uses subscription auth via Keychain. Fresh Claude session per reviewer call → independent reasoning. - `providers/codex_cli.py` — `codex exec --sandbox read-only --ignore-user-config -m gpt-5.5 -c model_reasoning_effort=high`. Uses subscription auth via codex's own login. Fresh Codex session per reviewer call. **Registry update (`run_review.py` REVIEWER_REGISTRY):** | Slot | Was (PR #4) | Now (PR #5) | |---|---|---| | `tech-claude` | `anthropic` (HTTPS) | `claude_cli` (local) | | `product-claude` | `anthropic` | `claude_cli` | | `tech-gpt` | `openai` (HTTPS) | `codex_cli` (local) | | `product-gpt` | `openai` | `codex_cli` | | `tech-glm` / `product-glm` | `zai` | `zai` (unchanged) | **Fallbacks preserved:** `anthropic.py` + `openai.py` stay in `providers/` for edge cases (CI runners without CLIs, or operator preference for billed API on a specific reviewer). Override via `--provider-override tech-claude=anthropic`. **z.ai docs:** Infisical path documented inline: ``` /home-platform/providers/zai/ZAI_API_KEY (prod env) ``` With `infisical secrets get …` snippet for export. **Spec rewrite:** `prompts/01.5c §2` — auth model section + invocation tables. Cost stance updated: claude/gpt have $0 per-PR cost (subscriptions); only z.ai trickles cents. **Operator inbox convention:** `state/OPERATOR_INBOX.md` — new file for operator → claude inbound messages while operator is mobile. ScheduleWakeup loop set up (10-min interval) to check. ## Tests **57/57 green:** - 34 in `test_run_review.py` (was 29; +5 wire-up tests): - `test_get_provider_claude_cli` / `test_get_provider_codex_cli` - `test_default_registry_uses_local_cli_for_claude_and_gpt` - `test_get_provider_anthropic_fallback_still_available` - `test_get_provider_openai_fallback_still_available` - 14 consolidator (PR #3) - 4 negative-control (PR #1) - 5 platformctl smoke ``` pytest platformctl/tests/ → 57 passed in 3.84s ``` ## NOT in this PR (intentional) - **Real CLI invocation untested.** `claude_cli.py` + `codex_cli.py` ship without exercising live CLI. Subprocess machinery + flag combos are based on `claude --help` / `codex exec --help`. First canary 3+3 run on a small Phase 02 PR will validate; flags can be tuned via env vars without code change. - **No Forgejo Actions wiring.** Still follow-up. - **Anthropic + OpenAI HTTP providers not deleted.** Kept as fallbacks; could be removed in a future PR once operator confirms CLI path is reliable for canary + 5+ subsequent runs. ## Test plan - [x] `pytest platformctl/tests/` → 57 green - [ ] Manual: real canary 3+3 run on a small Phase 02 PR (post-merge); verify all 6 reviewers respond - [ ] Manual: induce `claude` CLI failure (e.g., wrong model alias), verify ProviderError surfaces correctly + DECISION_REQUIRED.md fires 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Architectural correction per operator clarification 2026-05-01 (post PR #4
merge): claude + codex reviewers use LOCAL CLI subscriptions, NOT direct
Anthropic/OpenAI APIs. Only z.ai goes via HTTPS (key from Infisical).

Why this matters:
- All 3 providers under operator's existing subscriptions (Anthropic Max,
  OpenAI Pro, z.ai paid). Local CLI = no per-PR API costs for claude/gpt.
- One key to manage (z.ai), not three. Removes API-key-management surface
  the previous design implied.
- Codex CLI has `codex exec --sandbox read-only` purpose-built for review;
  Claude CLI has `claude -p --effort max` for max thinking. Both ship.

Changes:
- NEW `providers/claude_cli.py` — `claude -p --model opus --effort max
  --system-prompt … --no-session-persistence --output-format text`
- NEW `providers/codex_cli.py` — `codex exec --sandbox read-only
  --ignore-user-config -m gpt-5.5 -c model_reasoning_effort=high`
- `providers/__init__.py` — registry adds claude_cli + codex_cli;
  anthropic + openai remain as fallbacks for CI/edge cases
- `providers/zai.py` — docs updated with Infisical path:
  `/home-platform/providers/zai/ZAI_API_KEY` (prod env)
- `run_review.py` REVIEWER_REGISTRY swapped:
  - tech-claude / product-claude → claude_cli
  - tech-gpt / product-gpt → codex_cli
  - tech-glm / product-glm → zai (unchanged)
- `prompts/01.5c §2` — auth model section + invocation tables rewritten
- `state/OPERATOR_INBOX.md` — NEW convention file for operator → claude
  inbound messages (operator pushed: "sam sobie dodaj recuring loop")
- `state/AUDIT_LOG.jsonl` — appended OQ correction + PR5 coded entries
- `state/STATUS_NOW.md` — refreshed live status

Tests: 57/57 green
- 34 in test_run_review.py (was 29; +5 wire-up tests for new providers)
- 14 consolidator (PR #3)
- 4 negative-control (PR #1)
- 5 platformctl smoke

Recurring inbox-check loop scheduled via ScheduleWakeup (600s interval)
so claude session can react to operator messages while operator is mobile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-world output from `codex exec` (verified 2026-05-01 smoke test) emits:
- Preamble (version, workdir, model, sandbox info)
- "user" header + prompt echo (which itself contains an example {...} JSON
  if the prompt asks for JSON output)
- "codex" header + actual model JSON answer
- Mid-response error log (e.g., `codex_core::session` rollout warnings)
- Token usage line
- Duplicate JSON answer

The previous parser used "first { to last }" slicing which would capture
EVERYTHING between the prompt-echo's example brace and the duplicate-emit's
closing brace — unparseable, fell back to ABSTAIN.

This commit:
- Adds `_extract_balanced_json_blocks` — depth-counted, string-aware scan
  for top-level balanced `{...}` blocks
- Two-pass parse: prefer candidates with `verdict` key (the real answer);
  fall back to first parseable dict (lenient)
- New test fixtures:
  - `test_parse_handles_codex_cli_real_output` — verbatim captured output
  - `test_parse_skips_prompt_echo_without_verdict_key` — synthetic case

Tests: 59/59 (was 57; +2 here, +5 in PR #5 commit, all green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign in to join this conversation.
No reviewers
No labels
W6d-automerge-calibration
agent/claude-code
agent/codex
agent/hermes
agent/iskra
agent/ollama
agent/patchwarden
automerge-candidate
class/security-sensitive
cutover-gate
dependency/blocked
dependency/blocks-others
dependency/cross-repo
dependency/needs-confirmation
domain:agents
domain:ci
domain:docs
domain:forgejo
domain:infra
domain:memory
domain:runtime
domain:signal
domain:ux
flow/architecture
flow/blocked
flow/deployed
flow/done
flow/implementation
flow/intake
flow/maintained
flow/observed
flow/ready
flow/refining
flow/retired
flow/review
iterating
judge/codex-candidate
judge/hermes-candidate
judge/low-confidence
judge/needs-refinement
judge/operator-needed
judge/p0
judge/p1
judge/p2
judge/p3
judge/park
judge/patchwarden-candidate
judge/stale-priority
kind/adr
kind/bug
kind/chore
kind/feature
kind/infra
kind/ops
kind/refactor
kind/research
large-impact
merge/auto
merge/manual
merge/manual-dependency-conflict
merge/manual-failing-tests
merge/manual-merge-conflict
merge/manual-missing-review
merge/manual-operator-preference
merge/manual-red-zone
merge/manual-security-sensitive
merge/manual-unclear-scope
merge/manual-unknown
meta
mode:operator-only
mode:patchwarden-iskra-approved
mode:safe-auto
needs-operator-decision
needs-triage
not-ready
observed/erroring
observed/needs-followup
observed/pending
observed/retire-candidate
observed/unused
observed/used
operator-emotional
owner-attention
phase/02
phase/03
priority:p0
priority:p1
priority:p2
priority:p3
proposed
ready-for-agent
ready-for-operator
recovery
review:claude-reviewed
review:codex-reviewed
review:dziadek-reviewed
review:needs-human
risk/exposure
risk/process
risk/product
risk/runtime
safety:external-write
safety:no-prod-mutation
safety:prod-impact
safety:secret-touch
size/large
size/medium
size/small
size/tiny
size/unknown
source/adr
source/agent-generated
source/manual
source/operator-chat
source/voice-note
status:blocked
status:codex-ready
status:merged:pending-evidence
status:needs-evidence
status:operator-needed
status:parked
tier/full
tier/lite
tier/stacked
tier:0-platform-substrate
tier:1-iskra-value-layer
tier:2-tools-products-modules
type:bug
type:chore
type:docs
type:feat
type:policy
type:research
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
pdurlej/platform!5
No description provided.