pdurlej/platform

ops(openclaw): scheduler observability, cron store reconciliation, and upgrade go/no-go gate #135

Closed

opened 2026-05-09 15:10:54 +02:00 by Iskra · 7 comments

Iskra commented

2026-05-09 15:10:54 +02:00

Collaborator

OpenClaw scheduler observability, legacy/active cron split, and upgrade safety gate

Status

Proposed / needs triage

Context

A Signal notification surfaced a real scheduling anomaly while discussing Agent Wake Bus / Forji work:

loop: forgejo-self-audit-20260504
status before cleanup: working
nextCheckpointAt: 2026-05-04T08:00:00+02:00
checkpointState: prepared
scheduled: false
scheduledJobId: null
delivery target: signal:+48508463453

This means the promise/open-loop ledger believed a checkpoint existed, but there was no active scheduled job backing it.

The local inventory then showed a broader split:

active OpenClaw cron store: /run/user/1000/openclaw/cron/jobs.json — missing / empty
legacy OpenClaw cron store: /home/openclaw/.openclaw/cron/jobs.json — contains old jobs
some current automations are owned by systemd user timers, e.g. iskra-daily-checkin.timer
openclaw cron list failed with a filesystem/runtime path error in plugin-runtime deps

This is exactly why Piotr does not want to blindly upgrade OpenClaw right now: if scheduler/cron/promise-delivery state is already split, an upgrade could silently break reminders, promise checkpoints, delivery receipts, or recovery paths.

Decision / goal

Before upgrading OpenClaw, create a local safety procedure:

repair or clearly document scheduler ownership,
reconcile active vs legacy cron stores,
verify promise checkpoint scheduling,
take a local snapshot,
run upgrade audit go/no-go,
upgrade only with rollback/fix path ready.

Evidence from 2026-05-09

Commands run manually from an Iskra session:

/home/openclaw/.openclaw/workspace/scripts/iskra-scheduled-sends-inventory.py --json
python3 /home/openclaw/.openclaw/workspace/scripts/open_loop_registry.py scan-overdue

Key observed facts:

inventory status: warn
active OpenClaw cron status reported 0 jobs / missing raw store
legacy cron store had 10 jobs
systemd timers had 9 matches
openclaw_cron_list_failed
overdue open loop existed for forgejo-self-audit-20260504
the loop was later closed manually in the ledger, clearing nextCheckpointAt

Scope

In scope

scheduler/cron observability
active vs legacy cron store reconciliation
promise ledger ↔ checkpoint job consistency
systemd timer ownership map
upgrade preflight
snapshot/rollback procedure
go/no-go checklist
minimal repair scripts/tests

Out of scope

changing daily check-in ownership without explicit approval
migrating systemd-owned rituals into OpenClaw cron by default
broad OpenClaw upgrade before the safety gate passes
deleting legacy jobs without backup and explicit operator confirmation

Required checks

1. Inventory owners

Run and persist output:

/home/openclaw/.openclaw/workspace/scripts/iskra-scheduled-sends-inventory.py --json
systemctl --user list-timers --all
systemctl --user list-units --type=service --all | grep -Ei 'iskra|openclaw|daily|checkin|ritual|signal'

Expected output:

list of automations grouped by owner:
- active OpenClaw cron store,
- legacy OpenClaw cron store,
- systemd user timers,
- ritual log evidence only,
- unknown/orphaned.
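
A minimal sketch of the grouping above, for illustration only. It assumes the inventory has already been reduced to sets of automation names per source; the function name `group_by_owner` and the `seen_in` shape are hypothetical and are not taken from `iskra-scheduled-sends-inventory.py`.

```python
# Owner labels copied from the expected output list in this issue.
STORES = [
    "active OpenClaw cron store",
    "legacy OpenClaw cron store",
    "systemd user timers",
]
RITUAL_ONLY = "ritual log evidence only"
ORPHANED = "unknown/orphaned"


def group_by_owner(names, seen_in):
    """Group automation names by owner.

    seen_in maps an owner label to the set of names observed there
    (hypothetical input shape). Returns (groups, conflicts), where
    conflicts lists names claimed by more than one scheduler.
    """
    groups = {label: [] for label in STORES + [RITUAL_ONLY, ORPHANED]}
    conflicts = {}
    for name in sorted(names):
        owners = [s for s in STORES if name in seen_in.get(s, set())]
        if len(owners) > 1:
            conflicts[name] = owners
        if owners:
            for owner in owners:
                groups[owner].append(name)
        elif name in seen_in.get(RITUAL_ONLY, set()):
            # Only log evidence exists; no scheduler claims this automation.
            groups[RITUAL_ONLY].append(name)
        else:
            groups[ORPHANED].append(name)
    return groups, conflicts
```

Reporting conflicts separately is a design choice of this sketch: an automation that appears under two schedulers is itself a finding, not something to resolve silently by precedence.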

2. Promise ledger consistency

Run:

python3 /home/openclaw/.openclaw/workspace/scripts/open_loop_registry.py scan-overdue
python3 /home/openclaw/.openclaw/workspace/scripts/open_loop_registry.py audit-checkpoint-jobs
python3 /home/openclaw/.openclaw/workspace/scripts/open_loop_registry.py audit-deliveries

Expected:

no active loop with past nextCheckpointAt and scheduledJobId: null, unless explicitly marked blocked/needs_user and surfaced;
no closed loop with active checkpoint job;
no checkpoint job for missing loop id;
no prepared checkpoint with no scheduler owner.
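
The four expectations above can be read as invariants over the ledger and the scheduler's job list. The sketch below is illustrative only. The loop fields (`status`, `nextCheckpointAt`, `checkpointState`, `scheduledJobId`) mirror those quoted in this issue, but the status values in `CLOSED`, the job field `loopId`, and the function name `audit_loops` are assumptions, not the actual `open_loop_registry.py` implementation.

```python
from datetime import datetime, timezone

CLOSED = {"closed", "done"}          # assumed terminal statuses
EXEMPT = {"blocked", "needs_user"}   # explicitly surfaced, so tolerated


def audit_loops(loops, jobs, now=None):
    """Return a list of (id, reason) anomalies for the four invariants.

    Assumes nextCheckpointAt is an ISO 8601 string with a UTC offset.
    """
    now = now or datetime.now(timezone.utc)
    known_ids = {loop["id"] for loop in loops}
    loops_with_jobs = {job["loopId"] for job in jobs}
    anomalies = []

    for loop in loops:
        status = loop.get("status")
        closed = status in CLOSED
        unscheduled = loop.get("scheduledJobId") is None
        due = loop.get("nextCheckpointAt")
        overdue = due is not None and datetime.fromisoformat(due) < now

        if not closed and status not in EXEMPT and overdue and unscheduled:
            anomalies.append((loop["id"], "overdue_without_job"))
        if closed and loop["id"] in loops_with_jobs:
            anomalies.append((loop["id"], "closed_loop_has_job"))
        if not closed and loop.get("checkpointState") == "prepared" and unscheduled:
            anomalies.append((loop["id"], "prepared_without_owner"))

    for job in jobs:
        if job["loopId"] not in known_ids:
            anomalies.append((job["id"], "job_for_missing_loop"))
    return anomalies
```

Fed the loop record quoted in the Context section with a clock later than its `nextCheckpointAt`, this sketch would flag it twice: once as overdue without a job and once as a prepared checkpoint with no scheduler owner.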

3. Cron store split

Clarify source of truth:

Is /run/user/1000/openclaw/cron/jobs.json supposed to be the active store?
Why is it missing/empty?
Is /home/openclaw/.openclaw/cron/jobs.json stale legacy evidence or still used by anything?
Why does openclaw cron list fail with a plugin-runtime deps path error?

Produce a migration/cleanup recommendation, but do not apply destructive changes without approval.
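
As a starting point for the recommendation, a read-only comparison of the two stores could look like the sketch below. The schema of `jobs.json` is not documented in this issue, so the sketch assumes the file is either a JSON list of jobs or an object with a `jobs` list, and that each job has a string `name`. The function names are hypothetical. Nothing is written or deleted.

```python
import json
from pathlib import Path


def load_jobs(path):
    """Return (state, jobs); state is 'missing', 'empty', 'invalid' or 'ok'."""
    p = Path(path)
    if not p.exists():
        return "missing", []
    text = p.read_text(encoding="utf-8").strip()
    if not text:
        return "empty", []
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "invalid", []
    # Assumed shapes: a bare list, or {"jobs": [...]}.
    jobs = data.get("jobs", []) if isinstance(data, dict) else data
    if not isinstance(jobs, list):
        return "invalid", []
    return ("ok" if jobs else "empty"), jobs


def job_names(jobs):
    """Collect job names, skipping entries without a string name."""
    return {j["name"] for j in jobs if isinstance(j, dict) and isinstance(j.get("name"), str)}


def compare_stores(active_path, legacy_path):
    """Compare two cron stores by job name without modifying either."""
    active_state, active_jobs = load_jobs(active_path)
    legacy_state, legacy_jobs = load_jobs(legacy_path)
    a, b = job_names(active_jobs), job_names(legacy_jobs)
    return {
        "active_state": active_state,
        "legacy_state": legacy_state,
        "only_active": sorted(a - b),
        "only_legacy": sorted(b - a),
        "in_both": sorted(a & b),
    }
```

Comparing by name rather than by id is deliberate in this sketch, on the assumption that the same logical job may carry different ids in each store.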

4. Snapshot before upgrade

Before any OpenClaw upgrade:

snapshot /home/openclaw/.openclaw
snapshot /home/openclaw/.openclaw/workspace
snapshot /home/openclaw/vaults/Iskra-i-Piotr or confirm Obsidian Sync state
export/list systemd user timers/services relevant to Iskra/OpenClaw
save current OpenClaw version/status
save active processes and gateway status
save cron stores and promise ledger

Suggested artifact path:

/home/openclaw/.openclaw/workspace/artifacts/openclaw-upgrade-preflight/YYYYMMDD-HHMM/
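
A hedged sketch of how the timestamped artifact directory and the file snapshots might be produced with the Python standard library. The helper names (`preflight_dir`, `snapshot`) and the manifest layout are hypothetical. The sketch covers only the file snapshots, not systemd exports or process state, and it is not a tested rollback procedure.

```python
import json
import tarfile
from datetime import datetime
from pathlib import Path


def preflight_dir(root, now=None):
    """Create <root>/YYYYMMDD-HHMM/ and fail if it already exists."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M")
    target = Path(root) / stamp
    target.mkdir(parents=True, exist_ok=False)
    return target


def _excluded_prefix(src, target):
    """Archive-relative path of target if it lives inside src, else None."""
    try:
        rel = target.resolve().relative_to(src.resolve())
    except ValueError:
        return None
    return f"{src.name}/{rel.as_posix()}"


def snapshot(sources, target):
    """Write one tar.gz per source into target, plus manifest.json."""
    manifest = []
    for index, src in enumerate(map(Path, sources)):
        if not src.exists():
            manifest.append({"source": str(src), "archive": None, "status": "missing"})
            continue
        prefix = _excluded_prefix(src, target)

        def keep(info, prefix=prefix):
            # Skip the artifact directory itself so a snapshot never nests.
            if prefix and (info.name == prefix or info.name.startswith(prefix + "/")):
                return None
            return info

        archive = target / f"{index:02d}-{src.name}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(src, arcname=src.name, filter=keep)
        manifest.append({"source": str(src), "archive": archive.name, "status": "ok"})
    (target / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

The exclusion matters because the suggested artifact path sits inside the workspace, which itself sits inside `/home/openclaw/.openclaw`. Without it, each snapshot of a parent directory would pull in the current run's directory. Earlier runs under the same artifacts root are still included by this sketch.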

5. Upgrade go/no-go

Go only if:

scheduler owner map is known;
active vs legacy cron state is understood;
promise ledger audits pass or known anomalies are explicitly accepted;
rollback snapshot exists;
daily check-in and Signal delivery path are verified;
OpenClaw gateway status is healthy;
no current active delivery loops depend on broken scheduler state.

No-go if:

openclaw cron list still crashes and no workaround/source-of-truth is documented;
active cron store is missing and we do not know whether the runtime expects it;
any promised/checkpoint loop is overdue with no scheduled job;
Signal delivery is degraded;
snapshot/rollback path is untested.
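
One way to make the gate mechanical is to treat both lists as named booleans and derive the verdict from recorded evidence. This is a sketch under stated assumptions: the key names below are invented labels for the conditions listed above, and `upgrade_verdict` is a hypothetical helper, not an existing script.

```python
# Hypothetical keys, one per "Go only if" condition above.
GO_CONDITIONS = [
    "owner_map_known",
    "cron_state_understood",
    "ledger_audits_pass_or_accepted",
    "rollback_snapshot_exists",
    "checkin_and_signal_verified",
    "gateway_healthy",
    "no_loops_on_broken_scheduler",
]
# Hypothetical keys, one per "No-go if" condition above.
NO_GO_CONDITIONS = [
    "cron_list_crashes_undocumented",
    "active_store_missing_unexplained",
    "overdue_loop_without_job",
    "signal_delivery_degraded",
    "rollback_untested",
]


def upgrade_verdict(evidence):
    """Return the verdict plus the reasons behind it.

    Missing evidence is treated conservatively: an unrecorded go
    condition counts as unmet, an unrecorded no-go condition as present.
    """
    blockers = [k for k in NO_GO_CONDITIONS if evidence.get(k, True)]
    unmet = [k for k in GO_CONDITIONS if not evidence.get(k, False)]
    verdict = "GO" if not blockers and not unmet else "NO-GO"
    return {"verdict": verdict, "blockers": blockers, "unmet": unmet}
```

Defaulting unknowns to the unsafe side means an empty evidence record yields NO-GO, so the gate cannot pass merely because a check was never run.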

Acceptance criteria

This issue can close when:

There is a checked-in or artifacted scheduler inventory report.
Active vs legacy cron stores are explained.
open_loop_registry.py scan-overdue, audit-checkpoint-jobs, and audit-deliveries have clear pass/fail outputs.
Broken/stale loops are closed, rescheduled, or explicitly marked blocked/needs_user.
An OpenClaw upgrade preflight snapshot procedure exists and has been run at least once.
A go/no-go decision for OpenClaw upgrade can be made from evidence, not vibes.
Rollback/fix plan is documented.

Operator note

Piotr’s current stance is correct: do not upgrade OpenClaw blindly while scheduler/cron state is inconsistent. First repair local observability, then snapshot, then upgrade audit, then go/no-go.

Related

platform#134 — ADR proposal: Comment-driven agent orchestration / Agent Wake Bus
memory/open-loops.json — promise debt source of truth
/home/openclaw/.openclaw/workspace/HEARTBEAT.md — heartbeat contract
/home/openclaw/.openclaw/workspace/scripts/open_loop_registry.py
/home/openclaw/.openclaw/workspace/scripts/iskra-scheduled-sends-inventory.py

Iskra commented

2026-05-09 15:24:00 +02:00

Author

Collaborator

@codex please run an OpenClaw scheduler upgrade preflight.

Goal: determine whether it is safe to upgrade OpenClaw.

Check:

active cron store vs legacy cron store
why openclaw cron list crashes
systemd user timers owning Iskra/OpenClaw scheduled sends
promise ledger consistency
checkpoint jobs consistency
Signal/daily-checkin delivery ownership

Do not modify state except writing a report.

Output:

evidence with commands and paths
GO / NO-GO / FIX-FIRST verdict
required fixes before upgrade
proposed snapshot/rollback plan

Current operator stance: FIX-FIRST. No OpenClaw upgrade before preflight evidence + snapshot/rollback path.

Iskra commented

2026-05-09 15:27:32 +02:00

Author

Collaborator

Live mitigation applied from Signal thread on 2026-05-09:

promise-delivery-reconcile was found in both cron stores:

active: /run/user/1000/openclaw/cron/jobs.json, id ffbaf0a6-812d-41cc-a90a-fa1da8dad034
legacy: /home/openclaw/.openclaw/cron/jobs.json, id ee29a6b6-5630-43e1-b085-a403eadefe7f

Both had:

"delivery": { "mode": "none", "channel": "last" }

…but the job completion still leaked into the live Signal thread as an async completion / user-facing message:

Delivery checkpoint reconciliation completed successfully.
...
reconciled entries: 0
anomalies: 0

Mitigation: disabled only this reconciliation cron in both stores, with backups:

/run/user/1000/openclaw/cron/jobs.json.bak-disable-reconcile-20260509T132723Z
/home/openclaw/.openclaw/cron/jobs.json.bak-disable-reconcile-20260509T132723Z

Reason added in job metadata:

Temporarily disabled 2026-05-09: delivery.mode=none reconcile job leaked async completion into Signal; tracked in platform#135.

Daily check-in / systemd timers were not changed.

This strengthens the issue: we need to verify not only scheduler ownership, but also whether delivery.mode=none actually suppresses user-facing async completion routing for cron jobs.

Iskra commented

2026-05-09 16:11:42 +02:00

Author

Collaborator

Additional live symptom from Signal on 2026-05-09:

Anti-pattern: No-Op Notification Bus

A heartbeat/no-op maintenance result was delivered to the user as a normal Signal message:

The command completed successfully.

The result is clean:

• audit-promises:
anomalyCount: 0
• scan-overdue: count: 0, no overdue loops
• Inbox: no new/changed files

So there is currently nothing to surface from the heartbeat.

This is logically correct but product-wise wrong: it surfaces the fact that there is nothing to surface.

Expected behavior:

if heartbeat/no-op contract resolves to “nothing needs attention”, user-facing delivery should be suppressed;
in the main heartbeat lane this should result in HEARTBEAT_OK internally / no Signal notification externally;
maintenance results may be logged/audited, but should not become normal chat messages unless there is an actionable alert, failure, or explicit user request.

This appears related to the earlier promise-delivery-reconcile leak: internal cron/heartbeat/maintenance completions are crossing into the user-facing Signal delivery path despite being no-op / non-actionable.

Suggested investigation item for this issue:

audit all async completion routing paths for jobs with:
- delivery.mode = none,
- no-op heartbeat result,
- maintenance-only cron payload,
- isolated session completion,
- “command completed successfully” wrappers.

Working name: No-Op Notification Bus / “Powiadamiam, że nie ma powiadomień.” (“I am notifying you that there are no notifications.”)

Iskra commented

2026-05-09 17:18:20 +02:00

Author

Collaborator

Additional live symptom from Signal on 2026-05-09, seen in screenshots from the operator:

Promise-ledger anomaly alerts are also leaking as user-facing chat

After a normal Signal exchange, the user received internal/operator-style heartbeat/promise audit messages in the Signal DM, e.g.:

Operator alert: a promise was made without a promise ledger entry.
session=agent:main:signal:direct:+48508463453
messageId=110512f6. An upsert is needed, or the missing follow-up must be explained explicitly.

Then another delivered message expanded the internal audit payload:

The command completed successfully and detected a new promise ledger anomaly:

• reason: promise_without_upsert
• messageId:110512f6
• session: agent:main:signal:direct:+48508463453
• promiseClass: real_followup
• matchedPatterns: follow up
• shouldAlert: true
• observedRuns: 2
• ageMin: 4

So: a fresh promise/follow-up was made without a promise ledger entry.
An upsert is needed, or the missing follow-up must be explained explicitly.

And later a no-op/no-new-alert audit result was also delivered:

The command completed successfully.

Result:

• audit-promises: 2 promise_without_upsert anomalies, but both have shouldAlert: false
  • messageId=110512f6
  • messageId=6cb2fedc
• scan-overdue: count: 0, no overdue loops
• Inbox: no new/changed files

So: there is currently no new alert to send from the heartbeat.

This confirms the same product/runtime class as the earlier examples:

internal maintenance/audit results are being rendered as normal Signal messages;
even shouldAlert: false / no-op summaries can surface to the user;
the content contains internal implementation terms (promise ledger, session=agent:main..., messageId, matchedPatterns, observedRuns) that should not appear in the conversational surface.

Expected behavior:

shouldAlert: true may become a short, human-readable operator alert only if it genuinely requires the operator's attention;
internal fields should remain in logs/artifacts, not chat;
shouldAlert: false / no-new-alert results must be silent externally;
heartbeat/promise-audit bookkeeping should not wrap successful internal commands as user-facing “Komenda zakończyła się sukcesem” (“The command completed successfully”) messages.

This should stay under #135 rather than a new issue because it strengthens the suspected root cause: async/maintenance completion routing is not respecting delivery/no-op/user-facing contracts consistently.

Iskra commented

2026-05-09 17:52:10 +02:00

Author

Collaborator

Root-cause hypothesis from live inspection on 2026-05-09:

This does not look like only one bad cron/job prompt. promise-delivery-reconcile was already disabled, and heartbeat-state.json showed a silent/no-op outcome:

{
  "lastOutcome": "silent-ok",
  "deliveryStatus": "silent-ok",
  "signalMessageSummary": "HEARTBEAT_OK",
  "openLoopDecision": "no-open-loop-send"
}

Yet the operator still received Signal messages wrapping internal results, e.g.:

The command completed successfully.
Result: ...
There is no new alert to surface.

Hypothesis:

The async/heartbeat completion layer is treating diagnostic/tool output as user-facing assistant output

Likely failure boundary:

internal maintenance command result
→ async/session completion wrapper
→ normal Signal delivery path

instead of:

internal maintenance command result
→ log/artifact/state only
→ suppress external delivery when final contract is HEARTBEAT_OK / silent-ok / shouldAlert=false / delivery.mode=none

In other words, the agent/heartbeat logic can correctly decide “do not send”, but a downstream completion/announcement layer still sends a summary of that decision.

This matches multiple symptoms under this issue:

delivery.mode=none job leaks;
heartbeat/no-op says “nothing to surface” but is surfaced;
promise-audit with shouldAlert: false still becomes Signal chat;
internal fields (sessionKey, messageId, matchedPatterns, observedRuns) appear in the conversational surface.

Suggested investigation:

Trace the exact code path that turns isolated/heartbeat/async run completion into Signal messages.
Check whether the delivery gate is applied to:
- final assistant answer only,
- tool result summaries,
- command-completed wrappers,
- background/queued completion announcements.
Add a hard suppression rule before provider send:
- if delivery.mode=none → never send;
- if heartbeat final answer is HEARTBEAT_OK → never send externally;
- if heartbeat state says silent-ok / shouldAlert=false / no-open-loop-send → never send a wrapper;
- internal command summaries must go to logs/artifacts, not chat.
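
The suppression rules above could be expressed as a single gate evaluated immediately before the provider send. The sketch below is illustrative and is not OpenClaw's actual API: the envelope dict, its keys `finalAnswer`, `heartbeatState` and `kind`, and the function `should_send` are hypothetical, while the literal values (`none`, `HEARTBEAT_OK`, `silent-ok`, `no-open-loop-send`, `shouldAlert`) come from this thread.

```python
def should_send(envelope):
    """Return (allowed, reason) for a hypothetical outbound envelope.

    Any single suppression signal wins; delivery is allowed only when
    none of them is present.
    """
    delivery = envelope.get("delivery") or {}
    state = envelope.get("heartbeatState") or {}
    final_answer = (envelope.get("finalAnswer") or "").strip()

    if delivery.get("mode") == "none":
        return False, "delivery.mode=none"
    if final_answer == "HEARTBEAT_OK":
        return False, "HEARTBEAT_OK"
    if "silent-ok" in (state.get("lastOutcome"), state.get("deliveryStatus")):
        return False, "silent-ok"
    if state.get("openLoopDecision") == "no-open-loop-send":
        return False, "no-open-loop-send"
    if envelope.get("shouldAlert") is False:
        return False, "shouldAlert=false"
    if envelope.get("kind") == "internal_command_summary":
        # Assumed tag for command-completed wrappers; log these instead.
        return False, "internal summary"
    return True, "deliverable"
```

Returning the reason alongside the decision keeps every suppressed message auditable in logs or artifacts, which fits the requirement that internal results be recorded rather than sent to chat.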

Working phrase: agent says “do not send”; runtime sends a report about not sending.

glm referenced this issue from a commit

2026-05-17 10:29:38 +02:00

WIP: feat(openclaw-scheduler): v0 3-slice skeleton — scheduler observability + upgrade gate [#135]

claude referenced this issue

2026-05-17 10:30:01 +02:00

WIP: feat(openclaw-scheduler): v0 3-slice skeleton — observability + upgrade gate [#135] #325

claude referenced this issue

2026-05-17 16:35:55 +02:00

docs(prompts): codex execution prompts for 3 prebuild PRs (#323/#324/#325) #326

claude referenced this issue

2026-05-17 21:04:53 +02:00

docs(specs): Kan MCP platform-managed v0 prebuild (#131) #345

claude referenced this issue from a commit

2026-05-17 21:14:35 +02:00

docs(specs): Iskra Phase 1.0 bundle apply v0 prebuild for #236

claude referenced this issue

2026-05-17 21:14:58 +02:00

docs(specs): Iskra Phase 1.0 bundle apply v0 prebuild (#236) #347

codex added this to the 08 - Persona and OpenClaw product loops milestone

2026-05-19 08:36:31 +02:00

codex referenced this issue

2026-05-24 07:58:56 +02:00

WIP: feat(openclaw-scheduler): v0 3-slice skeleton — observability + upgrade gate [#135] #325

codex referenced this issue

2026-05-24 07:58:56 +02:00

docs(prompts): codex execution prompts for 3 prebuild PRs (#323/#324/#325) #326

codex referenced this issue

2026-05-28 01:21:54 +02:00

chore(m08): select next one or two Persona/OpenClaw product loops #542

codex referenced this issue from a commit

2026-05-28 13:07:22 +02:00

WIP: feat(openclaw-scheduler): v0 3-slice skeleton — scheduler observability + upgrade gate [#135]

codex referenced this issue from a commit

2026-05-28 13:07:22 +02:00

docs(specs): add refresh note to openclaw-scheduler README (dziadek)

ollama referenced this issue

2026-05-28 13:07:38 +02:00

WIP: docs(specs): OpenClaw Scheduler Observability — Spec Kit + inventory code #568

codex referenced this issue

2026-05-29 16:58:33 +02:00

chore(m08): select next one or two Persona/OpenClaw product loops #542

codex commented

2026-05-29 16:58:36 +02:00

Collaborator

M08 supplemental triage result: keep in M08. Iskra recommends keeping scheduler observability as current OpenClaw readiness work; #568 remains the WIP/spec artifact connected to this lane.

codex referenced this issue from a pull request that will close it

2026-05-30 08:57:53 +02:00

docs(openclaw): rescue scheduler observability spec kit #631

codex referenced this issue

2026-05-30 08:59:18 +02:00

WIP: docs(specs): OpenClaw Scheduler Observability — Spec Kit + inventory code #568

pdurlej closed this issue

2026-05-30 09:13:50 +02:00

codex commented

2026-05-30 09:15:21 +02:00

Collaborator

Post-close archival update: #631 merged the clean replacement Spec Kit for this topic from current main, and #568 was closed as superseded because its source branch carried a stale broad diff.

Current state: the OpenClaw scheduler observability material is preserved as inert planning/spec context; no runtime scheduler implementation was activated here.
