ops(openclaw): scheduler observability, cron store reconciliation, and upgrade go/no-go gate #135

Closed
opened 2026-05-09 15:10:54 +02:00 by Iskra · 7 comments
Collaborator

OpenClaw scheduler observability, legacy/active cron split, and upgrade safety gate

Status

Proposed / needs triage

Context

A Signal notification surfaced a real scheduling anomaly while discussing Agent Wake Bus / Forji work:

  • loop: forgejo-self-audit-20260504
  • status before cleanup: working
  • nextCheckpointAt: 2026-05-04T08:00:00+02:00
  • checkpointState: prepared
  • scheduled: false
  • scheduledJobId: null
  • delivery target: signal:+48508463453

This means the promise/open-loop ledger believed a checkpoint existed, but there was no active scheduled job backing it.

The local inventory then showed a broader split:

  • active OpenClaw cron store: /run/user/1000/openclaw/cron/jobs.json — missing / empty
  • legacy OpenClaw cron store: /home/openclaw/.openclaw/cron/jobs.json — contains old jobs
  • some current automations are owned by systemd user timers, e.g. iskra-daily-checkin.timer
  • openclaw cron list failed with a filesystem/runtime path error in plugin-runtime deps

This is exactly why Piotr does not want to blindly upgrade OpenClaw right now: if scheduler/cron/promise-delivery state is already split, an upgrade could silently break reminders, promise checkpoints, delivery receipts, or recovery paths.

Decision / goal

Before upgrading OpenClaw, create a local safety procedure:

  1. repair or clearly document scheduler ownership,
  2. reconcile active vs legacy cron stores,
  3. verify promise checkpoint scheduling,
  4. take a local snapshot,
  5. run upgrade audit go/no-go,
  6. upgrade only with rollback/fix path ready.

Evidence from 2026-05-09

Commands run manually from Iskra session:

/home/openclaw/.openclaw/workspace/scripts/iskra-scheduled-sends-inventory.py --json
python3 /home/openclaw/.openclaw/workspace/scripts/open_loop_registry.py scan-overdue

Key observed facts:

  • inventory status: warn
  • active OpenClaw cron status reported 0 jobs / missing raw store
  • legacy cron store had 10 jobs
  • systemd timers had 9 matches
  • openclaw_cron_list_failed
  • overdue open loop existed for forgejo-self-audit-20260504
  • the loop was later closed manually in the ledger, clearing nextCheckpointAt

Scope

In scope

  • scheduler/cron observability
  • active vs legacy cron store reconciliation
  • promise ledger ↔ checkpoint job consistency
  • systemd timer ownership map
  • upgrade preflight
  • snapshot/rollback procedure
  • go/no-go checklist
  • minimal repair scripts/tests

Out of scope

  • changing daily check-in ownership without explicit approval
  • migrating systemd-owned rituals into OpenClaw cron by default
  • broad OpenClaw upgrade before the safety gate passes
  • deleting legacy jobs without backup and explicit operator confirmation

Required checks

1. Inventory owners

Run and persist output:

/home/openclaw/.openclaw/workspace/scripts/iskra-scheduled-sends-inventory.py --json
systemctl --user list-timers --all
systemctl --user list-units --type=service --all | grep -Ei 'iskra|openclaw|daily|checkin|ritual|signal'

Expected output:

  • list of automations grouped by owner:
    • active OpenClaw cron store,
    • legacy OpenClaw cron store,
    • systemd user timers,
    • ritual log evidence only,
    • unknown/orphaned.

2. Promise ledger consistency

Run:

python3 /home/openclaw/.openclaw/workspace/scripts/open_loop_registry.py scan-overdue
python3 /home/openclaw/.openclaw/workspace/scripts/open_loop_registry.py audit-checkpoint-jobs
python3 /home/openclaw/.openclaw/workspace/scripts/open_loop_registry.py audit-deliveries

Expected:

  • no active loop with past nextCheckpointAt and scheduledJobId: null, unless explicitly marked blocked/needs_user and surfaced;
  • no closed loop with active checkpoint job;
  • no checkpoint job for missing loop id;
  • no prepared checkpoint with no scheduler owner.
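The four expectations above can be sketched as a read-only check over ledger and job records. This is only an illustration under stated assumptions, not the logic of open_loop_registry.py: the `id`, `loopId`, and `enabled` keys and the flat list shapes are hypothetical, while the remaining field names are taken from the anomaly in the Context section.

```python
from datetime import datetime, timezone

# Statuses the issue allows to stay overdue, provided they are surfaced.
EXEMPT_STATUSES = {"blocked", "needs_user"}


def audit_loops(loops, jobs, now):
    """Return (record_id, reason) pairs for loops and jobs that break an invariant.

    Assumed shapes (hypothetical): each loop is a dict with "id", "status",
    "nextCheckpointAt", "checkpointState", "scheduled", "scheduledJobId";
    each job is a dict with "id", "loopId", and optionally "enabled".
    """
    anomalies = []
    jobs_by_id = {job["id"]: job for job in jobs}
    loop_ids = {loop["id"] for loop in loops}

    for loop in loops:
        status = loop.get("status")
        job = jobs_by_id.get(loop.get("scheduledJobId"))

        if status == "closed":
            # A closed loop must not keep an active checkpoint job.
            if job is not None and job.get("enabled", True):
                anomalies.append((loop["id"], "closed_loop_with_active_job"))
            continue
        if status in EXEMPT_STATUSES:
            continue

        checkpoint = loop.get("nextCheckpointAt")
        if checkpoint and loop.get("scheduledJobId") is None:
            # Timestamps carry a UTC offset, so compare timezone-aware values.
            if datetime.fromisoformat(checkpoint) < now:
                anomalies.append((loop["id"], "overdue_checkpoint_without_job"))
        if loop.get("checkpointState") == "prepared" and not loop.get("scheduled"):
            anomalies.append((loop["id"], "prepared_checkpoint_without_owner"))

    # A checkpoint job must always point at a loop that exists in the ledger.
    for job in jobs:
        if job.get("loopId") not in loop_ids:
            anomalies.append((job["id"], "job_for_missing_loop"))
    return anomalies
```

Run against the forgejo-self-audit-20260504 record as it looked before cleanup, a check like this would flag the loop twice: once as overdue without a job and once as prepared without a scheduler owner.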

3. Cron store split

Clarify source of truth:

  • Is /run/user/1000/openclaw/cron/jobs.json supposed to be the active store?
  • Why is it missing/empty?
  • Is /home/openclaw/.openclaw/cron/jobs.json stale legacy evidence, or is it still used by anything?
  • Why does openclaw cron list fail with a plugin-runtime deps path error?

Produce a migration/cleanup recommendation, but do not apply destructive changes without approval.
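A non-destructive comparison of the two stores could look like the sketch below. The on-disk schema of jobs.json is not documented in this issue, so the sketch assumes either a top-level list of jobs or an object with a `jobs` list, and a `name` key per job. It only reads files and never writes.

```python
import json
from pathlib import Path


def load_jobs(path):
    """Return the job list from a cron store, or [] if the file is missing or empty."""
    path = Path(path)
    if not path.exists() or path.stat().st_size == 0:
        return []
    data = json.loads(path.read_text(encoding="utf-8"))
    # Assumed schema: either {"jobs": [...]} or a bare list of job objects.
    return data.get("jobs", []) if isinstance(data, dict) else data


def compare_stores(active_jobs, legacy_jobs):
    """Group job names by which store(s) they appear in."""
    active = {job["name"] for job in active_jobs if job.get("name")}
    legacy = {job["name"] for job in legacy_jobs if job.get("name")}
    return {
        "both": sorted(active & legacy),
        "active_only": sorted(active - legacy),
        "legacy_only": sorted(legacy - active),
    }


if __name__ == "__main__":
    report = compare_stores(
        load_jobs("/run/user/1000/openclaw/cron/jobs.json"),
        load_jobs("/home/openclaw/.openclaw/cron/jobs.json"),
    )
    print(json.dumps(report, indent=2))
```

Names listed under `both` are the ones to look at first, since a job defined in two stores may run from either or from both.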

4. Snapshot before upgrade

Before any OpenClaw upgrade:

  • snapshot /home/openclaw/.openclaw
  • snapshot /home/openclaw/.openclaw/workspace
  • snapshot /home/openclaw/vaults/Iskra-i-Piotr or confirm Obsidian Sync state
  • export/list systemd user timers/services relevant to Iskra/OpenClaw
  • save current OpenClaw version/status
  • save active processes and gateway status
  • save cron stores and promise ledger

Suggested artifact path:

/home/openclaw/.openclaw/workspace/artifacts/openclaw-upgrade-preflight/YYYYMMDD-HHMM/
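One way to produce the timestamped directory and the archives is sketched below, using only the standard library. The helper names `preflight_dir` and `snapshot` are hypothetical. Because the suggested artifact path lives inside /home/openclaw/.openclaw, the sketch excludes the artifact directory from the archive so a snapshot does not try to contain itself.

```python
import tarfile
from datetime import datetime
from pathlib import Path


def preflight_dir(base, now):
    """Return base/YYYYMMDD-HHMM for the given timestamp."""
    return Path(base) / now.strftime("%Y%m%d-%H%M")


def snapshot(source, dest_dir):
    """Write a gzip tarball of source into dest_dir and return the archive path."""
    source = Path(source).resolve()
    dest_dir = Path(dest_dir).resolve()
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"{source.name.lstrip('.')}.tar.gz"

    # If dest_dir sits inside source, skip it so the archive never includes itself.
    try:
        skip = (Path(source.name) / dest_dir.relative_to(source)).as_posix()
    except ValueError:
        skip = None

    def exclude(info):
        if skip is not None and (info.name == skip or info.name.startswith(skip + "/")):
            return None
        return info

    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=source.name, filter=exclude)
    return archive


if __name__ == "__main__":
    base = "/home/openclaw/.openclaw/workspace/artifacts/openclaw-upgrade-preflight"
    target = preflight_dir(base, datetime.now())
    print(snapshot("/home/openclaw/.openclaw", target))
```

A tarball is only a file-level snapshot; the rollback path still counts as untested until an archive has been restored somewhere and checked.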

5. Upgrade go/no-go

Go only if:

  • scheduler owner map is known;
  • active vs legacy cron state is understood;
  • promise ledger audits pass or known anomalies are explicitly accepted;
  • rollback snapshot exists;
  • daily check-in and Signal delivery path are verified;
  • OpenClaw gateway status is healthy;
  • no current active delivery loops depend on broken scheduler state.

No-go if:

  • openclaw cron list still crashes and no workaround/source-of-truth is documented;
  • active cron store is missing and we do not know if runtime expects it;
  • any promised/checkpoint loop is overdue with no scheduled job;
  • Signal delivery is degraded;
  • snapshot/rollback path is untested.
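The two lists above can be folded into a small fail-closed gate. This is a sketch under assumptions: the check names are made-up labels for the bullets in this section, and the evidence dictionary is expected to be filled in by hand or by the audit scripts. Anything not reported counts against the upgrade.

```python
# Hypothetical labels for the "Go only if" bullets: each must be True.
GO_CHECKS = (
    "scheduler_owner_map_known",
    "cron_state_understood",
    "ledger_audits_pass_or_accepted",
    "rollback_snapshot_exists",
    "checkin_and_signal_verified",
    "gateway_healthy",
    "no_loops_on_broken_scheduler",
)

# Hypothetical labels for the "No-go if" bullets: each must be False.
NO_GO_CHECKS = (
    "cron_list_crashes_undocumented",
    "active_store_missing_unexplained",
    "overdue_loop_without_job",
    "signal_degraded",
    "rollback_untested",
)


def upgrade_verdict(evidence):
    """Return ("GO", []) or ("NO-GO", reasons) from a dict of boolean findings.

    Missing evidence is treated as a blocker: an unreported no-go condition is
    assumed present, and an unreported go condition is assumed unmet.
    """
    blockers = [name for name in NO_GO_CHECKS if evidence.get(name, True)]
    unmet = [name for name in GO_CHECKS if not evidence.get(name, False)]
    if blockers or unmet:
        return "NO-GO", blockers + unmet
    return "GO", []


if __name__ == "__main__":
    # With no evidence collected yet, every check is reported as a reason.
    verdict, reasons = upgrade_verdict({})
    print(verdict)
    for reason in reasons:
        print(" -", reason)
```

Failing closed matches the operator stance in this issue: an upgrade should need positive evidence for every item, not merely the absence of a known failure.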

Acceptance criteria

This issue can close when:

  1. There is a checked-in or artifacted scheduler inventory report.
  2. Active vs legacy cron stores are explained.
  3. open_loop_registry.py scan-overdue, audit-checkpoint-jobs, and audit-deliveries have clear pass/fail outputs.
  4. Broken/stale loops are closed, rescheduled, or explicitly marked blocked/needs_user.
  5. An OpenClaw upgrade preflight snapshot procedure exists and has been run at least once.
  6. A go/no-go decision for OpenClaw upgrade can be made from evidence, not vibes.
  7. Rollback/fix plan is documented.

Operator note

Piotr’s current stance is correct: do not upgrade OpenClaw blindly while scheduler/cron state is inconsistent. First repair local observability, then snapshot, then upgrade audit, then go/no-go.

Related

  • platform#134 — ADR proposal: Comment-driven agent orchestration / Agent Wake Bus
  • memory/open-loops.json — promise debt source of truth
  • /home/openclaw/.openclaw/workspace/HEARTBEAT.md — heartbeat contract
  • /home/openclaw/.openclaw/workspace/scripts/open_loop_registry.py
  • /home/openclaw/.openclaw/workspace/scripts/iskra-scheduled-sends-inventory.py
Author
Collaborator

@codex please run an OpenClaw scheduler upgrade preflight.

Goal: determine whether it is safe to upgrade OpenClaw.

Check:

  • active cron store vs legacy cron store
  • why openclaw cron list crashes
  • systemd user timers owning Iskra/OpenClaw scheduled sends
  • promise ledger consistency
  • checkpoint jobs consistency
  • Signal/daily-checkin delivery ownership

Do not modify state except writing a report.

Output:

  • evidence with commands and paths
  • GO / NO-GO / FIX-FIRST verdict
  • required fixes before upgrade
  • proposed snapshot/rollback plan

Current operator stance: FIX-FIRST. No OpenClaw upgrade before preflight evidence + snapshot/rollback path.

Author
Collaborator

Live mitigation applied from Signal thread on 2026-05-09:

promise-delivery-reconcile was found in both cron stores:

  • active: /run/user/1000/openclaw/cron/jobs.json, id ffbaf0a6-812d-41cc-a90a-fa1da8dad034
  • legacy: /home/openclaw/.openclaw/cron/jobs.json, id ee29a6b6-5630-43e1-b085-a403eadefe7f

Both had:

"delivery": { "mode": "none", "channel": "last" }

…but the job completion still leaked into the live Signal thread as an async completion / user-facing message:

Delivery checkpoint reconciliation completed successfully.
...
reconciled entries: 0
anomalies: 0

Mitigation: disabled only this reconciliation cron in both stores, with backups:

  • /run/user/1000/openclaw/cron/jobs.json.bak-disable-reconcile-20260509T132723Z
  • /home/openclaw/.openclaw/cron/jobs.json.bak-disable-reconcile-20260509T132723Z

Reason added in job metadata:

Temporarily disabled 2026-05-09: delivery.mode=none reconcile job leaked async completion into Signal; tracked in platform#135.

Daily check-in / systemd timers were not changed.

This strengthens the issue: we need to verify not only scheduler ownership, but also whether delivery.mode=none actually suppresses user-facing async completion routing for cron jobs.

Author
Collaborator

Additional live symptom from Signal on 2026-05-09:

Anti-pattern: No-Op Notification Bus

A heartbeat/no-op maintenance result was delivered to the user as a normal Signal message:

The command completed successfully.

The result is clean:

• audit-promises: anomalyCount: 0
• scan-overdue: count: 0, no overdue loops
• Inbox: no new/changed files

So there is currently nothing to surface from the heartbeat.

This is logically correct but product-wise wrong: it surfaces the fact that there is nothing to surface.

Expected behavior:

  • if heartbeat/no-op contract resolves to “nothing needs attention”, user-facing delivery should be suppressed;
  • in the main heartbeat lane this should result in HEARTBEAT_OK internally / no Signal notification externally;
  • maintenance results may be logged/audited, but should not become normal chat messages unless there is an actionable alert, failure, or explicit user request.

This appears related to the earlier promise-delivery-reconcile leak: internal cron/heartbeat/maintenance completions are crossing into the user-facing Signal delivery path despite being no-op / non-actionable.

Suggested investigation item for this issue:

  • audit all async completion routing paths for jobs with:
    • delivery.mode = none,
    • no-op heartbeat result,
    • maintenance-only cron payload,
    • isolated session completion,
    • “command completed successfully” wrappers.

Working name: No-Op Notification Bus / “I am notifying you that there are no notifications.”

Author
Collaborator

Additional live symptom from Signal on 2026-05-09, seen in screenshots from the operator:

Promise-ledger anomaly alerts are also leaking as user-facing chat

After a normal Signal exchange, the user received internal/operator-style heartbeat/promise audit messages in the Signal DM, e.g.:

Operator alert: a promise was made without a promise ledger entry.
session=agent:main:signal:direct:+48508463453
messageId=110512f6. An upsert is needed, or the missing follow-up must be explained explicitly.

Then another delivered message expanded the internal audit payload:

The command completed successfully and detected a new promise ledger anomaly:

• reason: promise_without_upsert
• messageId:110512f6
• session: agent:main:signal:direct:+48508463453
• promiseClass: real_followup
• matchedPatterns: follow up
• shouldAlert: true
• observedRuns: 2
• ageMin: 4

So: a fresh promise/follow-up was made without a promise ledger entry.
An upsert is needed, or the missing follow-up must be explained explicitly.

And later a no-op/no-new-alert audit result was also delivered:

The command completed successfully.

Result:

• audit-promises: 2 promise_without_upsert anomalies, but both have shouldAlert: false
  • messageId=110512f6
  • messageId=6cb2fedc
• scan-overdue: count: 0, no overdue loops
• Inbox: no new/changed files

So: there is currently no new alert to send from the heartbeat.

This confirms the same product/runtime class as the earlier examples:

  • internal maintenance/audit results are being rendered as normal Signal messages;
  • even shouldAlert: false / no-op summaries can surface to the user;
  • the content contains internal implementation terms (promise ledger, session=agent:main..., messageId, matchedPatterns, observedRuns) that should not appear in the conversational surface.

Expected behavior:

  • shouldAlert: true may become a short, human-readable operator alert only if it genuinely requires the operator's attention;
  • internal fields should remain in logs/artifacts, not chat;
  • shouldAlert: false / no-new-alert results must be silent externally;
  • heartbeat/promise-audit bookkeeping should not wrap successful internal commands as user-facing “The command completed successfully” messages.

This should stay under #135 rather than a new issue because it strengthens the suspected root cause: async/maintenance completion routing is not respecting delivery/no-op/user-facing contracts consistently.

Author
Collaborator

Root-cause hypothesis from live inspection on 2026-05-09:

This does not look like only one bad cron/job prompt. promise-delivery-reconcile was already disabled, and heartbeat-state.json showed a silent/no-op outcome:

{
  "lastOutcome": "silent-ok",
  "deliveryStatus": "silent-ok",
  "signalMessageSummary": "HEARTBEAT_OK",
  "openLoopDecision": "no-open-loop-send"
}

Yet the operator still received Signal messages wrapping internal results, e.g.:

The command completed successfully.
Result: ...
There is no new alert to surface.

Hypothesis:

The async/heartbeat completion layer is treating diagnostic/tool output as user-facing assistant output

Likely failure boundary:

internal maintenance command result
→ async/session completion wrapper
→ normal Signal delivery path

instead of:

internal maintenance command result
→ log/artifact/state only
→ suppress external delivery when final contract is HEARTBEAT_OK / silent-ok / shouldAlert=false / delivery.mode=none

In other words, the agent/heartbeat logic can correctly decide “do not send”, but a downstream completion/announcement layer still sends a summary of that decision.

This matches multiple symptoms under this issue:

  • delivery.mode=none job leaks;
  • heartbeat/no-op says “nothing to surface” but is surfaced;
  • promise-audit with shouldAlert: false still becomes Signal chat;
  • internal fields (sessionKey, messageId, matchedPatterns, observedRuns) appear in the conversational surface.

Suggested investigation:

  1. Trace the exact code path that turns isolated/heartbeat/async run completion into Signal messages.
  2. Check whether the delivery gate is applied to:
    • final assistant answer only,
    • tool result summaries,
    • command-completed wrappers,
    • background/queued completion announcements.
  3. Add a hard suppression rule before provider send:
    • if delivery.mode=none → never send;
    • if heartbeat final answer is HEARTBEAT_OK → never send externally;
    • if heartbeat state says silent-ok / shouldAlert=false / no-open-loop-send → never send a wrapper;
    • internal command summaries must go to logs/artifacts, not chat.
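The hard suppression rule in item 3 could take roughly the following shape. This is a sketch, not OpenClaw code: `should_suppress` is a hypothetical helper, and where such a gate would hook into the runtime is exactly what step 1 of the investigation still has to establish. The field names and values come from the job config and the heartbeat-state.json excerpt quoted in this thread.

```python
# Heartbeat state values that mean "nothing should reach the user".
SILENT_VALUES = {"silent-ok", "no-open-loop-send"}


def should_suppress(job, heartbeat_state, final_answer, should_alert=None):
    """Return True when an outgoing message must not reach the chat provider.

    job:             cron job config, may contain {"delivery": {"mode": ...}}
    heartbeat_state: parsed heartbeat-state.json, may be empty
    final_answer:    text the completion layer is about to send
    should_alert:    audit verdict if one exists, else None
    """
    # delivery.mode=none means never send, whatever the payload says.
    if (job.get("delivery") or {}).get("mode") == "none":
        return True
    # HEARTBEAT_OK is an internal acknowledgement, not a chat message.
    if final_answer.strip() == "HEARTBEAT_OK":
        return True
    # An audit that explicitly decided not to alert stays silent.
    if should_alert is False:
        return True
    # A silent heartbeat outcome also silences any wrapper around it.
    state_values = (
        heartbeat_state.get("lastOutcome"),
        heartbeat_state.get("deliveryStatus"),
        heartbeat_state.get("openLoopDecision"),
    )
    return any(value in SILENT_VALUES for value in state_values)


if __name__ == "__main__":
    state = {
        "lastOutcome": "silent-ok",
        "deliveryStatus": "silent-ok",
        "signalMessageSummary": "HEARTBEAT_OK",
        "openLoopDecision": "no-open-loop-send",
    }
    print(should_suppress({}, state, "The command completed successfully."))
```

The idea is to evaluate the gate as the last step before the provider send, so that tool result summaries, command-completed wrappers, and queued completion announcements all pass through the same check instead of only the final assistant answer.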

Working phrase: agent says “do not send”; runtime sends a report about not sending.

Collaborator

M08 supplemental triage result: keep in M08. Iskra recommends keeping scheduler observability as current OpenClaw readiness work; #568 remains the WIP/spec artifact connected to this lane.

Collaborator

Post-close archival update: #631 merged the clean replacement Spec Kit for this topic from current main, and #568 was closed as superseded because its source branch carried a stale broad diff.

Current state: the OpenClaw scheduler observability material is preserved as inert planning/spec context; no runtime scheduler implementation was activated here.

Reference
pdurlej/platform#135