pdurlej/kan-ductor

bug(ops): /health endpoint reports ok without verifying upstream usability #59

Open

opened 2026-05-10 14:20:19 +02:00 by pdurlej · 2 comments

pdurlej commented

2026-05-10 14:20:19 +02:00

Owner

Source

3+3 review on PR #41 (feat: add MCP ops health and smoke modes), merged 2026-05-10 via chain drain #52.

"/health is positioned as ops health but can report ok without verifying upstream usability. Either make readiness honest or label endpoint as config/liveness only."

Problem

The /health endpoint currently returns ok based on local config/liveness checks (process running, env loaded, DB pool initialized). It does NOT verify:

- Database actually reachable (only that pool exists)
- MCP server can dispatch a real call
- Critical upstream services (Forgejo, Iskra runtime) reachable

So /health: ok may be returned when the system cannot actually serve real traffic. This makes the endpoint unreliable for ops alerting.

Scope

Two paths:

(a) Honest readiness — verify usability:

- Check DB with cheap query (e.g. SELECT 1)
- Check MCP-side connectivity if applicable
- Add /ready for full readiness vs /live for liveness
- Document checks in docs/ops/health.md
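To make path (a) concrete, here is a minimal sketch of how readiness probes could be aggregated. All names (`Probe`, `runProbe`, `checkReadiness`) and the 2000 ms default timeout are hypothetical and not taken from the kan-ductor codebase; the point is only that a probe which fails or hangs must turn the overall result into "not ready".

```typescript
// Hypothetical names throughout: Probe, runProbe and checkReadiness are
// illustrative and do not come from the kan-ductor codebase.
type ProbeResult = { name: string; ok: boolean; detail?: string };
type Probe = { name: string; check: () => Promise<void> };

// Run one probe, treating a rejection or a timeout as "not ok".
async function runProbe(probe: Probe, timeoutMs: number): Promise<ProbeResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
  });
  try {
    await Promise.race([probe.check(), timeout]);
    return { name: probe.name, ok: true };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { name: probe.name, ok: false, detail };
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Ready only if every probe passes; per-probe results are kept for diagnostics.
async function checkReadiness(
  probes: Probe[],
  timeoutMs = 2000, // assumed default, tune per deployment
): Promise<{ ready: boolean; results: ProbeResult[] }> {
  const results = await Promise.all(probes.map((p) => runProbe(p, timeoutMs)));
  return { ready: results.every((r) => r.ok), results };
}
```

A DB probe would then be a `check` that runs the cheap query (e.g. SELECT 1) against the pool and throws on failure; an MCP probe would do the same for a trivial dispatch. Running probes in parallel with a shared timeout keeps the endpoint cheap enough for frequent polling.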

(b) Honest labeling — keep current behavior, fix naming:

- Rename /health → /live (liveness only)
- Add separate /ready for full readiness
- Update ops docs to clarify
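The intended difference between the two endpoints can be sketched as a pure function. This is an illustration only: the paths come from this issue, while the `respond` function, the response shape, and the use of HTTP 503 for "not ready" are assumptions.

```typescript
// Hypothetical sketch: only the /live and /ready paths come from the issue.
type HealthResponse = {
  status: number;
  body: { status: "ok" | "unavailable"; kind: "liveness" | "readiness" };
};

function respond(path: string, upstreamReady: boolean): HealthResponse | undefined {
  switch (path) {
    case "/live":
      // Liveness: the process is up and answering. Says nothing about dependencies.
      return { status: 200, body: { status: "ok", kind: "liveness" } };
    case "/ready":
      // Readiness: only ok when the dependencies were actually verified.
      return upstreamReady
        ? { status: 200, body: { status: "ok", kind: "readiness" } }
        : { status: 503, body: { status: "unavailable", kind: "readiness" } };
    default:
      return undefined; // not a health path
  }
}
```

With this split, restart decisions would key off /live and ops alerting off /ready, so a degraded dependency surfaces as a non-2xx readiness response instead of a false ok.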

Acceptance criteria

- Either path is implemented
- Ops alerting on the right endpoint reflects actual user impact
- No false ok when DB/MCP are degraded
- Existing health probes continue to work or get updated config

Refs

- PR #41 (chain drain via #52)
- 3+3 review comment id 3189

pdurlej added the priority:p1 and 3plus3-followup labels

2026-05-10 14:20:19 +02:00

Iskra referenced this issue

2026-05-19 02:11:37 +02:00

ux(Iskra): Kan cockpit should distinguish liveness, readiness, and user-visible usefulness #85

codex referenced this issue

2026-05-28 01:20:41 +02:00

gemini(w3): audit /live /ready and Docker healthcheck semantics #117

codex commented

2026-06-03 10:26:37 +02:00

Collaborator

Codex verification on current origin/main / current working tree after the stabilization train: #59 appears resolved.

Evidence:

- packages/mcp/src/index.ts separates /live and /mcp/live from /ready and /mcp/ready.
- /live is liveness only and reports upstream Kan API as unknown/not probed.
- /ready calls the Kan API health path through the MCP client and returns non-OK when the upstream is unavailable.
- /health and /mcp/health are compatibility readiness endpoints; they also probe the Kan API instead of reporting false OK.
- packages/mcp/src/health.ts sanitizes configured URLs and avoids leaking credentials.
- docs/openclaw-kan-mcp.md plus docs/ops/merge-train-smoke.md describe the /live, /ready, /health split; PRs #151/#152 further clarify diagnostics and smoke modes.
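For readers unfamiliar with the URL sanitization point, a minimal sketch of the general technique follows. It is not the code in packages/mcp/src/health.ts, which may differ; `sanitizeUrl` and the example host are hypothetical. It uses only the standard `URL` class.

```typescript
// Hypothetical helper illustrating credential stripping before a configured
// URL is echoed in a health payload. Not the actual health.ts implementation.
function sanitizeUrl(raw: string): string {
  try {
    const url = new URL(raw);
    url.username = ""; // drop userinfo such as user:secret@
    url.password = "";
    url.search = "";   // query strings may carry tokens
    url.hash = "";
    return url.toString();
  } catch {
    // Never echo an unparseable value back; it could still contain a secret.
    return "<invalid url>";
  }
}

// sanitizeUrl("https://user:secret@kan.example.com/api/v1?token=abc")
//   returns "https://kan.example.com/api/v1"
```

Dropping the query string is the conservative choice here; an implementation that needs to keep non-secret parameters would have to allowlist them explicitly.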

Verification run:

pnpm --filter @kan/mcp test

Passed. Recommendation: close #59 as satisfied by current main.

codex referenced this issue

2026-06-03 10:37:46 +02:00

ux(Iskra): Kan cockpit should distinguish liveness, readiness, and user-visible usefulness #85

Iskra commented

2026-06-09 03:10:12 +02:00

Collaborator

{
  "confidence": 5,
  "effort_hint": "medium",
  "escalation": {
    "kind": "none",
    "reason": ""
  },
  "evidence_refs": [
    {
      "note": "Issue reports the ops health endpoint can return ok without verifying upstream usability.",
      "type": "forgejo",
      "value": "issue-title-body-labels-and-target-snapshot"
    },
    {
      "note": "Body states current health checks may only prove local liveness, not database reachability, MCP dispatch, or critical upstream access.",
      "type": "forgejo",
      "value": "issue-body-problem"
    },
    {
      "note": "Scope proposes honest readiness checks or relabeling endpoints as liveness/config-only with documentation.",
      "type": "forgejo",
      "value": "issue-body-scope"
    }
  ],
  "impact": 4,
  "judge_actor": {
    "name": "iskra",
    "runtime": "openclaw"
  },
  "judged_at": "2026-06-09T01:09:00Z",
  "labels_to_apply": [
    "judge/p1",
    "judge/codex-candidate"
  ],
  "piotr_fit": "high",
  "priority": "p1",
  "rationale_summary": "This is P1 Codex-ready ops reliability work because misleading health checks can hide real service inability and break alerting trust.",
  "reach": 4,
  "recommended_next_action": "codex_candidate",
  "rerun_reason": "no_prior_judgment",
  "schema": "openclaw.judge.v0",
  "target": {
    "kind": "issue",
    "number": 59,
    "repo": "pdurlej/kan-ductor"
  },
  "target_snapshot": {
    "body_hash": "sha256:1f098bc63f182fe79362a41df5b42252c24660a2af3acb07634b271e59482463",
    "commit_count": null,
    "evidence_hash": "sha256:f2372711456965cfaf6eac2f26806aef2f599f59ed8f2c3657abeba31b3db67e",
    "head_sha": null,
    "labels": [
      "3plus3-followup",
      "priority:p1"
    ],
    "labels_hash": "sha256:eae246ad0747d73dd2fb96aea169d7d74574e5b2de03312edce7ec9f6d87a8f0",
    "state": "open",
    "title_hash": "sha256:d545bf15dbc576ecdfd7b3c4b7d6c13934ddb7344734b88de550845ef619cf93",
    "updated_at": "2026-06-03T10:37:46+02:00"
  },
  "top_caveat": "Separate liveness from readiness clearly so cheap health checks do not overclaim real traffic readiness."
}

Iskra added the judge/codex-candidate and judge/p1 labels

2026-06-09 03:10:12 +02:00

Iskra referenced this issue from pdurlej/judging-claw

2026-06-09 03:10:12 +02:00

[Judging Claw] codex_candidate for pdurlej/kan-ductor#59 (p1) #78