ops(openclaw): scheduler observability, cron store reconciliation, and upgrade go/no-go gate #135
Reference
pdurlej/platform#135
OpenClaw scheduler observability, legacy/active cron split, and upgrade safety gate
Status
Proposed / needs triage
Context
A Signal notification surfaced a real scheduling anomaly while discussing Agent Wake Bus / Forji work:
- forgejo-self-audit-20260504
- working
- nextCheckpointAt: 2026-05-04T08:00:00+02:00
- checkpointState: prepared
- scheduled: false
- scheduledJobId: null
- signal: +48508463453

This means the promise/open-loop ledger believed a checkpoint existed, but there was no active scheduled job backing it.
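The anomaly described here can be expressed as a read-only consistency check. The sketch below is a minimal illustration and does not reproduce what open_loop_registry.py actually does: it assumes each ledger entry is a dict carrying the fields quoted in the notification, plus a hypothetical `id` key for the loop name. The second sample entry is invented for contrast.

```python
from typing import Any


def find_unbacked_checkpoints(loops: list[dict[str, Any]]) -> list[str]:
    """Return ids of loops that promise a checkpoint but have no scheduled job."""
    unbacked = []
    for loop in loops:
        has_checkpoint = bool(loop.get("nextCheckpointAt"))
        # A checkpoint counts as backed only if it is flagged scheduled
        # and also points at a concrete job id.
        has_job = bool(loop.get("scheduled")) and loop.get("scheduledJobId") is not None
        if has_checkpoint and not has_job:
            # "id" is a hypothetical key; the real ledger may name it differently.
            unbacked.append(loop.get("id", "<unknown>"))
    return unbacked


sample = [
    {
        "id": "forgejo-self-audit-20260504",
        "nextCheckpointAt": "2026-05-04T08:00:00+02:00",
        "checkpointState": "prepared",
        "scheduled": False,
        "scheduledJobId": None,
    },
    {
        # Hypothetical healthy entry, not taken from the real ledger.
        "id": "hypothetical-healthy-loop",
        "nextCheckpointAt": "2026-05-10T08:00:00+02:00",
        "scheduled": True,
        "scheduledJobId": "job-1",
    },
]

print(find_unbacked_checkpoints(sample))
```

Only the first entry is reported, because it has a checkpoint time but neither a scheduled flag nor a job id.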
The local inventory then showed a broader split:
- /run/user/1000/openclaw/cron/jobs.json: missing / empty
- /home/openclaw/.openclaw/cron/jobs.json: contains old jobs
- iskra-daily-checkin.timer
- openclaw cron list failed with a filesystem/runtime path error in plugin-runtime deps

This is exactly why Piotr does not want to blindly upgrade OpenClaw right now: if scheduler/cron/promise-delivery state is already split, an upgrade could silently break reminders, promise checkpoints, delivery receipts, or recovery paths.
Decision / goal
Before upgrading OpenClaw, create a local safety procedure:
Evidence from 2026-05-09
Commands run manually from Iskra session:
Key observed facts:
- warn
- openclaw_cron_list_failed
- forgejo-self-audit-20260504
- nextCheckpointAt

Scope
In scope
Out of scope
Required checks
1. Inventory owners
Run and persist output:
Expected output:
2. Promise ledger consistency
Run:
Expected:
- no open loop combines nextCheckpointAt with scheduledJobId: null, unless explicitly marked blocked/needs_user and surfaced;

3. Cron store split
Clarify source of truth:
- Is /run/user/1000/openclaw/cron/jobs.json supposed to be the active store?
- Is /home/openclaw/.openclaw/cron/jobs.json stale legacy evidence, or is it still used by anything?
- Why does openclaw cron list fail with a plugin-runtime deps path error?

Produce a migration/cleanup recommendation, but do not apply destructive changes without approval.
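A read-only comparison of the two stores could feed that recommendation. The sketch below is illustrative only: the jobs.json schema is not documented in this issue, so it assumes the file is either a bare list of job objects or an object with a `jobs` list, and that each job has a `name` key. Both are assumptions to verify against the real files.

```python
import json
from pathlib import Path
from typing import Any


def load_jobs(path: Path) -> list[dict[str, Any]]:
    """Read a cron store without modifying it; a missing or empty file yields []."""
    if not path.exists() or path.stat().st_size == 0:
        return []
    data = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: the store is a bare list or an object with a "jobs" list.
    if isinstance(data, dict):
        data = data.get("jobs", [])
    return [job for job in data if isinstance(job, dict)]


def compare_stores(
    runtime_jobs: list[dict[str, Any]], legacy_jobs: list[dict[str, Any]]
) -> dict[str, list[str]]:
    """Group job names by which store they appear in."""
    # "name" is an assumed key; adjust once the real schema is confirmed.
    runtime_names = {j["name"] for j in runtime_jobs if j.get("name")}
    legacy_names = {j["name"] for j in legacy_jobs if j.get("name")}
    return {
        "only_runtime": sorted(runtime_names - legacy_names),
        "only_legacy": sorted(legacy_names - runtime_names),
        "in_both": sorted(runtime_names & legacy_names),
    }


# In-memory demo; "hypothetical-old-job" is invented for illustration.
report = compare_stores(
    [{"name": "promise-delivery-reconcile"}],
    [{"name": "promise-delivery-reconcile"}, {"name": "hypothetical-old-job"}],
)
print(json.dumps(report, indent=2))
```

On the host, `load_jobs(Path("/run/user/1000/openclaw/cron/jobs.json"))` and `load_jobs(Path("/home/openclaw/.openclaw/cron/jobs.json"))` would supply the two inputs. Nothing is written back, which keeps the check within the "no destructive changes without approval" constraint.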
4. Snapshot before upgrade
Before any OpenClaw upgrade:
- /home/openclaw/.openclaw
- /home/openclaw/.openclaw/workspace
- /home/openclaw/vaults/Iskra-i-Piotr or confirm Obsidian Sync state

Suggested artifact path:
5. Upgrade go/no-go
Go only if:
No-go if:
- openclaw cron list still crashes and no workaround/source-of-truth is documented;

Acceptance criteria
This issue can close when:
- open_loop_registry.py scan-overdue, audit-checkpoint-jobs, and audit-deliveries have clear pass/fail outputs.

Operator note
Piotr’s current stance is correct: do not upgrade OpenClaw blindly while scheduler/cron state is inconsistent. First repair local observability, then snapshot, then upgrade audit, then go/no-go.
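The ordering in this note (observability, snapshot, upgrade audit, go/no-go) can be sketched as a simple gate. This is a hypothetical illustration, not an existing OpenClaw tool: the `PreflightEvidence` fields are names invented here, and only the cron-list condition is taken directly from the no-go criteria above.

```python
from dataclasses import dataclass


@dataclass
class PreflightEvidence:
    # All field names are hypothetical; they mirror the steps in the operator note.
    observability_repaired: bool
    cron_list_works: bool
    cron_workaround_documented: bool
    snapshot_taken: bool
    upgrade_audit_done: bool


def upgrade_decision(evidence: PreflightEvidence) -> tuple[str, list[str]]:
    """Return ("go" | "no-go", blockers) from collected preflight evidence."""
    blockers: list[str] = []
    if not evidence.observability_repaired:
        blockers.append("local observability not repaired")
    # A crashing `openclaw cron list` blocks only if no workaround is documented.
    if not evidence.cron_list_works and not evidence.cron_workaround_documented:
        blockers.append(
            "openclaw cron list still crashes and no workaround/source-of-truth is documented"
        )
    if not evidence.snapshot_taken:
        blockers.append("no snapshot before upgrade")
    if not evidence.upgrade_audit_done:
        blockers.append("upgrade audit not done")
    return ("go" if not blockers else "no-go", blockers)


# State as described on 2026-05-09: nothing repaired, nothing snapshotted yet.
current = PreflightEvidence(False, False, False, False, False)
decision, blockers = upgrade_decision(current)
print(decision)
for blocker in blockers:
    print("-", blocker)
```

Returning the list of blockers alongside the verdict keeps the decision auditable, which fits the requirement that checks have clear pass/fail outputs.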
Related
- platform#134 (ADR proposal: Comment-driven agent orchestration / Agent Wake Bus)
- memory/open-loops.json: promise debt source of truth
- /home/openclaw/.openclaw/workspace/HEARTBEAT.md: heartbeat contract
- /home/openclaw/.openclaw/workspace/scripts/open_loop_registry.py
- /home/openclaw/.openclaw/workspace/scripts/iskra-scheduled-sends-inventory.py

@codex please run an OpenClaw scheduler upgrade preflight.
Goal: determine whether it is safe to upgrade OpenClaw.
Check:
- openclaw cron list crashes

Do not modify state except writing a report.
Output:
Current operator stance: FIX-FIRST. No OpenClaw upgrade before preflight evidence + snapshot/rollback path.
Live mitigation applied from Signal thread on 2026-05-09:
promise-delivery-reconcile was found in both cron stores:
- /run/user/1000/openclaw/cron/jobs.json, id ffbaf0a6-812d-41cc-a90a-fa1da8dad034
- /home/openclaw/.openclaw/cron/jobs.json, id ee29a6b6-5630-43e1-b085-a403eadefe7f

Both had:
…but the job completion still leaked into the live Signal thread as an async completion / user-facing message:
Mitigation: disabled only this reconciliation cron in both stores, with backups:
- /run/user/1000/openclaw/cron/jobs.json.bak-disable-reconcile-20260509T132723Z
- /home/openclaw/.openclaw/cron/jobs.json.bak-disable-reconcile-20260509T132723Z

Reason added in job metadata:
Daily check-in / systemd timers were not changed.
This strengthens the issue: we need to verify not only scheduler ownership, but also whether delivery.mode=none actually suppresses user-facing async completion routing for cron jobs.

Additional live symptom from Signal on 2026-05-09:
Anti-pattern: No-Op Notification Bus
A heartbeat/no-op maintenance result was delivered to the user as a normal Signal message:
This is logically correct but product-wise wrong: it surfaces the fact that there is nothing to surface.
Expected behavior:
- HEARTBEAT_OK internally / no Signal notification externally;

This appears related to the earlier promise-delivery-reconcile leak: internal cron/heartbeat/maintenance completions are crossing into the user-facing Signal delivery path despite being no-op / non-actionable.

Suggested investigation item for this issue:
- delivery.mode = none,

Working name: No-Op Notification Bus / “Powiadamiam, że nie ma powiadomień.” (“I am notifying you that there are no notifications.”)
Additional live symptom from Signal on 2026-05-09, seen in screenshots from the operator:
Promise-ledger anomaly alerts are also leaking as user-facing chat
After a normal Signal exchange, the user received internal/operator-style heartbeat/promise audit messages in the Signal DM, e.g.:
Then another delivered message expanded the internal audit payload:
And later a no-op/no-new-alert audit result was also delivered:
This confirms the same product/runtime class as the earlier examples:
- shouldAlert: false / no-op summaries can surface to the user;
- messages expose internal details (promise ledger, session=agent:main..., messageId, matchedPatterns, observedRuns) that should not appear in the conversational surface.

Expected behavior:
- shouldAlert: true may become a short, human-readable operator alert only if it genuinely requires the operator's attention;
- shouldAlert: false / no-new-alert results must be silent externally;

This should stay under #135 rather than a new issue because it strengthens the suspected root cause: async/maintenance completion routing is not respecting delivery/no-op/user-facing contracts consistently.
Root-cause hypothesis from live inspection on 2026-05-09:
This does not look like only one bad cron/job prompt.
promise-delivery-reconcile was already disabled, and heartbeat-state.json showed a silent/no-op outcome:

Yet the operator still received Signal messages wrapping internal results, e.g.:
Hypothesis:
The async/heartbeat completion layer is treating diagnostic/tool output as user-facing assistant output
Likely failure boundary:
instead of:
In other words, the agent/heartbeat logic can correctly decide “do not send”, but a downstream completion/announcement layer still sends a summary of that decision.
This matches multiple symptoms under this issue:
- delivery.mode=none job leaks;
- shouldAlert: false still becomes Signal chat;
- internal fields (sessionKey, messageId, matchedPatterns, observedRuns) appear in the conversational surface.

Suggested investigation:
- delivery.mode=none → never send;
- HEARTBEAT_OK → never send externally;
- silent-ok / shouldAlert=false / no-open-loop-send → never send a wrapper;

Working phrase: agent says “do not send”; runtime sends a report about not sending.
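The suppression rules listed in this comment could be enforced by a single guard placed in front of the external send path. The sketch below is a hypothetical illustration, not OpenClaw's actual completion API: the completion dict layout (`delivery`, `result`, `outcome`, `shouldAlert`) is assumed here, and only the literal values come from this thread.

```python
from typing import Any

# Outcome markers quoted in this thread that must never produce a wrapper message.
SILENT_OUTCOMES = {"silent-ok", "no-open-loop-send"}


def should_send_externally(completion: dict[str, Any]) -> bool:
    """Decide whether a cron/heartbeat completion may reach the Signal surface.

    The key names used here are assumptions about the completion payload.
    """
    delivery = completion.get("delivery") or {}
    if delivery.get("mode") == "none":
        return False  # delivery.mode=none -> never send
    if completion.get("result") == "HEARTBEAT_OK":
        return False  # HEARTBEAT_OK -> never send externally
    if completion.get("outcome") in SILENT_OUTCOMES:
        return False  # silent outcomes -> no wrapper message
    if completion.get("shouldAlert") is False:
        return False  # explicit "do not alert" -> stay silent
    return True


examples = [
    {"delivery": {"mode": "none"}, "shouldAlert": True},
    {"result": "HEARTBEAT_OK"},
    {"outcome": "silent-ok"},
    {"shouldAlert": False},
    {"shouldAlert": True, "delivery": {"mode": "signal"}},  # "signal" is a hypothetical mode
]
for completion in examples:
    print(should_send_externally(completion), completion)
```

Only the last example is allowed through. The point of placing every rule in one function is that a downstream announcement layer has a single decision to respect, instead of summarizing a "do not send" result as a new message.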
M08 supplemental triage result: keep in M08. Iskra recommends keeping scheduler observability as current OpenClaw readiness work; #568 remains the WIP/spec artifact connected to this lane.
Post-close archival update: #631 merged the clean replacement Spec Kit for this topic from current main, and #568 was closed as superseded because its source branch carried a stale broad diff.

Current state: the OpenClaw scheduler observability material is preserved as inert planning/spec context; no runtime scheduler implementation was activated here.