pdurlej/platform

Fork 0

fix(canary-findings): parser bug + claude_cli argv leak + zai default URL #6

Closed

pdurlej wants to merge 3 commits from codex/review-pipeline/canary-findings-pr6 into main

pdurlej commented

2026-05-01 16:21:44 +02:00

Owner

Summary

First canary 3+3 review fired (2026-05-01, retrospective on PR #5) surfaced BLOCKER + 3 HIGH real bugs in the very PR #5 code. 3/3 tech reviewers convergent on the parser bug — exactly what the cross-provider redundancy was designed to catch.

This PR fixes all 4 critical findings. NO auto-merge — operator green light was scoped to a small "default URL fix"; canary surfaced bigger issues. Operator reviews when back.

Canary results (full artifacts in `state/reviews/PR-5/`)

Reviewer	Verdict
tech-glm	NOT_OK (BLOCKER)
tech-claude	NOT_OK (HIGH × 6)
tech-gpt	NOT_OK (BLOCKER + HIGH × 3)
product-glm	OK
product-claude	OK
product-gpt	NOT_OK (HIGH × 2)

Tech consensus: 3/3 dissent. Product: 1/3 dissent. Risk = BLOCKER, action = defer.

What this PR fixes

1. BLOCKER — Parser returns prompt-echo's JSON instead of model answer

_base.py:parse_reviewer_response iterates candidates in source order and picks the FIRST verdict-containing block. But codex_cli echoes the user_message (which embeds the diff verbatim) BEFORE the model's answer. If the diff contains {"verdict":"OK"} literals (e.g., test_run_review.py has 24+ such occurrences), parser silently returns the diff-embedded verdict instead of the model's actual answer.

Fix: iterate in REVERSE order. Model answer always comes last in CLI output.

Regression test added with prompt-echo OK + model answer NOT_OK — would FAIL before fix, PASSES after.

2. HIGH — Claude provider puts full PR payload in argv

claude_cli.py passed user_message (full diff) as -p argv. Issues:

ps aux shows the diff to anyone with shell access
ARG_MAX overflow on large PRs (silent failure)

Fix: pass via stdin. Verified claude -p reads stdin when no positional. Also added --bare flag for faster, isolated reviewer sessions (skips hooks/LSP/plugin-sync/auto-memory/keychain/CLAUDE.md). Override via PLATFORMCTL_CLAUDE_NO_BARE=1.

3. HIGH — z.ai key export docs leak secret to /tmp

CANARY_PLAN.md and zai.py docstring had infisical secrets get ... | tee /tmp/zai-key.txt && export ... && rm /tmp/zai-key.txt pattern. The /tmp file briefly exposes the key on disk.

Fix: removed temp-file pattern. Documented direct export $(infisical secrets get ... --plain) and scoped infisical run -- python ... alternatives.

4. HIGH — Tests added in PR #5 don't exercise bug class

test_parse_handles_codex_cli_real_output used IDENTICAL OK JSON in both prompt-echo and model-answer positions, masking the bug.

Fix: new test test_parse_picks_LAST_verdict_block_when_diff_contains_verdict_literal uses different verdicts in each position. Asserts NOT_OK is returned.

Bonus changes

codex_cli.py --ephemeral (tech-gpt opportunity): no session persistence, matches reviewer intent, removes the rollout-recording noise.
zai.py default URL: PaaS (/api/paas/v4) → Coding (/api/coding/paas/v4). Operator's subscription is Coding Plan; PaaS requires separate balance/recharge — caused the original "find the working endpoint" investigation. Now the default works without env override.

Tests

60/60 green:

37 in test_run_review.py (was 36; +1 regression test for parser bug class)
14 consolidator (PR #3)
4 negative-control (PR #1)
5 platformctl smoke

pytest platformctl/tests/ → 60 passed in 3.33s

NOT in this PR (deferred reviewer findings, low-medium severity)

Architectural / opportunity-grade items worth a separate PR:

Use claude --json-schema for structured output (eliminates parser drift entirely; tech-claude × 2)
Use claude --output-format json envelope (cleaner than text parsing)
Promote smoke-test-providers.sh to platformctl provider-doctor subcommand (product-gpt)
Named provider profiles: local-subscriptions vs api-fallback (product-gpt)
Open loop: review default model strings quarterly (product-claude)
Document ScheduleWakeup vs CronCreate trade-off in OPERATOR_INBOX (product-claude)

These are queued; see state/AUDIT_LOG.jsonl for the canary findings index.

Test plan

pytest platformctl/tests/ → 60 green
First canary fired on PR #5 → 6/6 reviewers responded → BLOCKER findings (this PR fixes them)
Second canary on this PR (post-merge) to validate the fix doesn't introduce new regressions

🤖 Generated with Claude Code

## Summary First canary 3+3 review fired (2026-05-01, retrospective on PR #5) surfaced **BLOCKER + 3 HIGH** real bugs in the very PR #5 code. **3/3 tech reviewers convergent** on the parser bug — exactly what the cross-provider redundancy was designed to catch. This PR fixes all 4 critical findings. **NO auto-merge** — operator green light was scoped to a small "default URL fix"; canary surfaced bigger issues. Operator reviews when back. ## Canary results (full artifacts in `state/reviews/PR-5/`) | Reviewer | Verdict | |---|---| | tech-glm | NOT_OK (BLOCKER) | | tech-claude | NOT_OK (HIGH × 6) | | tech-gpt | NOT_OK (BLOCKER + HIGH × 3) | | product-glm | OK | | product-claude | OK | | product-gpt | NOT_OK (HIGH × 2) | **Tech consensus: 3/3 dissent. Product: 1/3 dissent.** Risk = BLOCKER, action = defer. ## What this PR fixes **1. BLOCKER — Parser returns prompt-echo's JSON instead of model answer** `_base.py:parse_reviewer_response` iterates candidates in source order and picks the FIRST verdict-containing block. But codex_cli echoes the user_message (which embeds the diff verbatim) BEFORE the model's answer. If the diff contains `{"verdict":"OK"}` literals (e.g., `test_run_review.py` has 24+ such occurrences), parser silently returns the diff-embedded verdict instead of the model's actual answer. **Fix:** iterate in REVERSE order. Model answer always comes last in CLI output. **Regression test added** with prompt-echo `OK` + model answer `NOT_OK` — would FAIL before fix, PASSES after. **2. HIGH — Claude provider puts full PR payload in argv** `claude_cli.py` passed `user_message` (full diff) as `-p` argv. Issues: - `ps aux` shows the diff to anyone with shell access - ARG_MAX overflow on large PRs (silent failure) **Fix:** pass via stdin. Verified `claude -p` reads stdin when no positional. Also added `--bare` flag for faster, isolated reviewer sessions (skips hooks/LSP/plugin-sync/auto-memory/keychain/CLAUDE.md). Override via `PLATFORMCTL_CLAUDE_NO_BARE=1`. **3. HIGH — z.ai key export docs leak secret to /tmp** CANARY_PLAN.md and zai.py docstring had `infisical secrets get ... | tee /tmp/zai-key.txt && export ... && rm /tmp/zai-key.txt` pattern. The /tmp file briefly exposes the key on disk. **Fix:** removed temp-file pattern. Documented direct `export $(infisical secrets get ... --plain)` and scoped `infisical run -- python ...` alternatives. **4. HIGH — Tests added in PR #5 don't exercise bug class** `test_parse_handles_codex_cli_real_output` used IDENTICAL `OK` JSON in both prompt-echo and model-answer positions, masking the bug. **Fix:** new test `test_parse_picks_LAST_verdict_block_when_diff_contains_verdict_literal` uses different verdicts in each position. Asserts NOT_OK is returned. ## Bonus changes - **`codex_cli.py` `--ephemeral`** (tech-gpt opportunity): no session persistence, matches reviewer intent, removes the rollout-recording noise. - **`zai.py` default URL**: PaaS (`/api/paas/v4`) → Coding (`/api/coding/paas/v4`). Operator's subscription is Coding Plan; PaaS requires separate balance/recharge — caused the original "find the working endpoint" investigation. Now the default works without env override. ## Tests **60/60 green:** - 37 in `test_run_review.py` (was 36; +1 regression test for parser bug class) - 14 consolidator (PR #3) - 4 negative-control (PR #1) - 5 platformctl smoke ``` pytest platformctl/tests/ → 60 passed in 3.33s ``` ## NOT in this PR (deferred reviewer findings, low-medium severity) Architectural / opportunity-grade items worth a separate PR: - Use `claude --json-schema` for structured output (eliminates parser drift entirely; tech-claude × 2) - Use `claude --output-format json` envelope (cleaner than text parsing) - Promote `smoke-test-providers.sh` to `platformctl provider-doctor` subcommand (product-gpt) - Named provider profiles: `local-subscriptions` vs `api-fallback` (product-gpt) - Open loop: review default model strings quarterly (product-claude) - Document ScheduleWakeup vs CronCreate trade-off in OPERATOR_INBOX (product-claude) These are queued; see `state/AUDIT_LOG.jsonl` for the canary findings index. ## Test plan - [x] `pytest platformctl/tests/` → 60 green - [x] First canary fired on PR #5 → 6/6 reviewers responded → BLOCKER findings (this PR fixes them) - [ ] **Second canary on this PR (post-merge)** to validate the fix doesn't introduce new regressions 🤖 Generated with [Claude Code](https://claude.com/claude-code)

pdurlej added 1 commit

2026-05-01 16:21:44 +02:00

fix(canary-findings): parser bug + claude_cli argv leak + zai default URL 1694faee4f

First canary 3+3 review (PR #5 retrospective, 2026-05-01) surfaced:
- BLOCKER (3/3 tech reviewers convergent): parser returns prompt-echo's
  JSON instead of model answer when diff contains verdict-keyed JSON
- HIGH (tech-gpt): claude_cli passes full PR payload via argv (ARG_MAX +
  process listing leakage of diff content)
- HIGH (tech-gpt): z.ai key export docs write secret to /tmp
- HIGH (tech-claude): tests added in PR #5 do not exercise the bug class

This PR fixes all four. No auto-merge — scope creep beyond operator's
green light for the original "small URL fix" PR #6 plan; operator reviews
results when back. Per OQ3 wait-mode + reviewer dissent → defer.

Changes:

**Parser (BLOCKER fix):**
- `_base.py` `parse_reviewer_response` iterates candidates in REVERSE
  order, picking LAST verdict-containing JSON block. CLI providers always
  emit model answer LAST (after prompt-echo + preamble). Reverse iteration
  picks model answer over diff-embedded fixtures.
- Bug class: codex_cli echoes user_message (incl. diff). When diff has
  `{"verdict":"OK", ...}` literals (test_run_review.py has 24+ such
  literals), parser would return them silently overriding model verdict.
- New regression test: prompt-echo OK + model NOT_OK → must return NOT_OK.
  This test would FAIL before the fix.

**claude_cli (HIGH fix):**
- Pass `user_message` via stdin instead of `-p <full-diff>` argv.
  Avoids ARG_MAX overflow on large PRs and keeps diff content out of
  process listing (`ps aux` would otherwise show full payload).
- Add `--bare` flag (skips hooks/LSP/plugin-sync/auto-memory/keychain
  reads/CLAUDE.md auto-discovery — fastest deterministic reviewer).
  Override via `PLATFORMCTL_CLAUDE_NO_BARE=1` for debugging.
- Truncate stderr in error messages (it may echo prompt content).

**codex_cli (small win):**
- Add `--ephemeral` flag (no session persistence; matches review intent;
  removes the rollout-recording warning seen in PR #5 parser test).

**zai.py (HIGH fix + URL default):**
- Default `PLATFORMCTL_ZAI_BASE_URL` changed from PaaS endpoint
  (`/api/paas/v4`) to Coding Plan (`/api/coding/paas/v4`). Operator's
  subscription is Coding Plan; PaaS requires separate balance/recharge.
- Removed `tee /tmp/zai-key.txt` pattern from docstring. Replaced with
  direct `export ... = $(infisical secrets get ... --plain)` and the
  scoped `infisical run -- python ...` alternative.

**CANARY_PLAN.md (HIGH fix):**
- Removed /tmp secret-leak pattern. Documented direct export + scoped
  invocation patterns.

State files:
- `state/AUDIT_LOG.jsonl` — canary fire entries + findings
- `state/OPERATOR_INBOX.md` — ACTed all 3 prior dispatch entries
- `state/DISPATCH_BRIDGE.json` — session_id for direct-channel
- `state/reviews/PR-5/` — full canary artifacts (decision packet + 6
  reviewer JSONs); first real audit trail of the 3+3 pipeline

Tests: 60/60 green
- 37 in test_run_review.py (was 36 in PR #5 head; +1 regression test)
- 14 consolidator (PR #3)
- 4 negative-control (PR #1)
- 5 platformctl smoke

NOT in this PR (deferred follow-ups, low-medium severity reviewer findings):
- claude `--json-schema` for structured output (architectural; bigger PR)
- Promote smoke-test to platformctl subcommand (product-gpt opportunity)
- Named provider profiles `local-subscriptions` vs `api-fallback`
- Open_loop: review default model strings quarterly
- ScheduleWakeup vs CronCreate trade-off doc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pdurlej referenced this pull request from a commit

2026-05-01 16:25:17 +02:00

fix(claude_cli): make --bare opt-in (was breaking OAuth Max auth)

pdurlej added 1 commit

2026-05-01 16:25:17 +02:00

fix(claude_cli): make --bare opt-in (was breaking OAuth Max auth) 5959ce5ecd

Canary #2 (PR #6 self-test, 2026-05-01) failed: 2/6 reviewers errored
with `claude CLI exit 1: stderr=''`. Root cause: per `claude --help`,
`--bare` mode requires `ANTHROPIC_API_KEY` env or `apiKeyHelper` via
`--settings` — OAuth and Keychain reads are explicitly disabled.

Operator's setup is OAuth Max subscription via macOS Keychain.
Adding `--bare` (PR #6 commit 1694faee, line "if not no_bare") killed
the auth path silently.

Fix: invert the env-var convention.
- Was: `--bare` ON by default, `PLATFORMCTL_CLAUDE_NO_BARE=1` to disable
- Now: `--bare` OFF by default, `PLATFORMCTL_CLAUDE_BARE=1` to enable
- Docstring spells out the auth dependency

The DECISION_REQUIRED.md from canary #2 is preserved (wait-mode policy
worked correctly — refused to consolidate with 4/6 votes). Successful
reviewer outputs in `state/reviews/PR-6/` are also preserved as audit
trail showing what the partial pipeline looked like.

Lesson: even a "low-severity opportunity" from a reviewer (tech-claude
in canary #1 suggested --bare for speed) needs evaluation against the
auth model, not just the speed model. Cross-reviewer redundancy still
won the BLOCKER, but no reviewer caught the auth implication of --bare
because they didn't have visibility into operator's Keychain setup.

Tests: 60/60 still green (no test changes — this is a runtime config
change). Will re-fire canary #3 to validate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pdurlej referenced this pull request from a commit

2026-05-01 16:31:52 +02:00

docs(ledger): canary #2 + #3 narrative + reviewer findings recap

pdurlej added 1 commit

2026-05-01 16:31:52 +02:00

docs(ledger): canary #2 + #3 narrative + reviewer findings recap 66e2c3008e

This commit updates ONLY the session ledger (AUDIT_LOG, OPERATOR_INBOX,
STATUS_NOW). Canary #3 review artifacts are deliberately NOT committed
— per convergent reviewer finding (4/6 reviewers in canary #3): committed
machine-local state + review artifacts cause "self-referential confusion"
in subsequent reviews. Artifacts persist locally on disk but stay out of
git history per their wisdom.

Operator must decide:
- Whether to merge PR #6 as-is (4 real code fixes, scope creep)
- Whether to split PR #6 (3 reviewers convergent recommend split)
- Whether to add charter rule on committed-vs-runtime state
  (DISPATCH_BRIDGE.json as teaching example, flagged by 4 reviewers)

Full canary chronology + findings recap in OPERATOR_INBOX.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pdurlej commented

2026-05-01 17:22:11 +02:00

Author

Owner

Rejected per operator architectural decision (2026-05-01)

Successor: PR #7 (codex/review-pipeline/improved-from-pr6) — narrow code-only scope, links back here.

Why rejected (convergent reviewer findings, canary #3):

Scope creep — 25 files, +1297/-27, mixes 4 code fixes + DISPATCH_BRIDGE + canary docs + ledger + 10 review artifacts (3 reviewers convergent: split before merge)
Machine-local state committed (state/DISPATCH_BRIDGE.json — session_id, absolute paths) (4 reviewers convergent)
Credential extraction path elevated to canonical doc (state/canary/CANARY_PLAN.md Path C) (4 reviewers convergent)
Reverse-order parser still has heuristic vulnerability post-noise (1 reviewer, but valid)

Pattern enshrined (per Oracle consult 2026-05-01, will land in PLATFORM_CHARTER via PR #7):

When a PR receives convergent BLOCKER findings or is materially contaminated by scope creep, local state, or credential exposure, close it as rejected and open a narrower successor PR linked as improved-from-#N. Rejection is a learning artifact, not a cleanup shortcut.

What stays in record (canary findings):

state/reviews/PR-5/ (canary #1 — caught real parser BLOCKER) — committed in PR #6, will stay on this branch's history
Canary #2 + #3 artifacts deliberately uncommitted (per reviewer convergent feedback on committed-state hygiene)

This rejection is the FIRST exercise of the "PRs as learning log" pattern.

## Rejected per operator architectural decision (2026-05-01) **Successor:** PR #7 (`codex/review-pipeline/improved-from-pr6`) — narrow code-only scope, links back here. **Why rejected (convergent reviewer findings, canary #3):** - Scope creep — 25 files, +1297/-27, mixes 4 code fixes + DISPATCH_BRIDGE + canary docs + ledger + 10 review artifacts (3 reviewers convergent: split before merge) - Machine-local state committed (`state/DISPATCH_BRIDGE.json` — session_id, absolute paths) (4 reviewers convergent) - Credential extraction path elevated to canonical doc (`state/canary/CANARY_PLAN.md` Path C) (4 reviewers convergent) - Reverse-order parser still has heuristic vulnerability post-noise (1 reviewer, but valid) **Pattern enshrined (per Oracle consult 2026-05-01, will land in PLATFORM_CHARTER via PR #7):** > When a PR receives convergent BLOCKER findings or is materially contaminated by scope creep, local state, or credential exposure, close it as `rejected` and open a narrower successor PR linked as `improved-from-#N`. Rejection is a learning artifact, not a cleanup shortcut. **What stays in record (canary findings):** - `state/reviews/PR-5/` (canary #1 — caught real parser BLOCKER) — committed in PR #6, will stay on this branch's history - Canary #2 + #3 artifacts deliberately uncommitted (per reviewer convergent feedback on committed-state hygiene) This rejection is the FIRST exercise of the "PRs as learning log" pattern.

pdurlej closed this pull request

2026-05-01 17:22:12 +02:00

pdurlej referenced this pull request from a commit

2026-05-01 17:23:38 +02:00

fix(review-pipeline): improved-from-#6 — narrow code-only successor

pdurlej referenced this pull request

2026-05-01 17:25:25 +02:00

fix(review-pipeline): improved-from-#6 — narrow code-only successor #7

pdurlej referenced this pull request

2026-05-01 17:29:25 +02:00

fix(review-pipeline): improved-from-#6 — narrow code-only successor #7