Event ledger · wiki guide

What this is

One append-only stream, many consumers

The event ledger is the platform's CDC (change data capture) substrate. Every meaningful state change writes one row to the events table, which is append-only, ULID-keyed, and carries a structured JSON payload. Producers don't care who's reading; consumers don't care who's writing. That decoupling is what lets us add a new behavior (customer health watcher, anomaly detector, timeline indexer) without changing the workflows that fire the underlying events.

The ledger landed in R553 / Phase 52, per Mike's high-leverage moves list. Before it, workflows called downstream code synchronously, which tightly coupled everything. Now they fan out events; downstream subscribers register against event_type globs and the workflow runner drains the queue on a cron.

Schema lives in migrations/schema/123_events_ledger.sql. Three tables: events (the stream), event_subscriptions (who listens to what), event_drain_state (per-consumer cursor).

Producers

Who writes events

sync — sync.tier_completed fires after each hot/warm/cold tier run.
HITL approval flow — hitl.approved, hitl.rejected, hitl.deferred per decision.
Pricing — price.changed, price.bid_line_changed, price.quote_line_changed.
Email — email.parsed, email.bounced, email.replied from the inbound triage pipeline.
Workflow runner — workflow.run_started, workflow.run_completed, workflow.run_failed.
AR/Customer — ar.bucket_moved, ar.collection_sent, customer.updated.
Vendor/Item — vendor.cost_changed, item.created, spec.updated.

Consumers

Who reads events

workflow_runner — drains pending events on cron, matches them against event_subscriptions globs, starts subscribed workflows.
reflexion — reads hitl.rejected to learn what proposals to avoid; reads workflow.run_failed to flag patterns.
customer health watcher — subscribes to ar.*, email.*, order.* to recompute health signals in near real-time.
timeline UI — per-entity timeline reads events filtered by entity_type/entity_id.
anomaly detector — windowed reads to flag unusual sequences.
replay tooling — can re-fire a window of events for debugging.

Anatomy of an event

Event shape

{
  "event_id": "01J0XYZ...",              // ULID: sorts lexicographically by time
  "event_type": "price.changed",
  "entity_type": "item",
  "entity_id": "10472",                  // NS item id
  "payload_json": {
    "old_price": 1.42,
    "new_price": 1.48,
    "bid_id": "B5875",
    "customer_id": "2147",
    "approved_by": "mike",
    "proposed_action_id": 8421
  },
  "caused_by": "workflow:bid_price_update",
  "workflow_run_id": "run_01J0...",
  "source_system": "platform",
  "occurred_at": "2026-05-25T14:23:11Z",
  "sequence_no": 142                     // optional per-entity counter
}
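The shape above can be mirrored as a TypeScript type so producers and consumers agree on field names. This is an illustrative sketch only: LedgerEvent, EmitContext, and buildEvent are hypothetical names, not the real emitEvent helper, and the ID generator and clock are injected so the example stays deterministic.

```typescript
// Hypothetical type mirroring the event shape shown above.
interface LedgerEvent {
  event_id: string; // ULID
  event_type: string; // dotted namespace, e.g. "price.changed"
  entity_type: string;
  entity_id: string;
  payload_json: Record<string, unknown>;
  caused_by: string;
  workflow_run_id: string | null;
  source_system: string;
  occurred_at: string; // ISO-8601 UTC timestamp
  sequence_no?: number; // optional per-entity counter
}

// Assumption: the calling context supplies provenance, an ID source, and a clock.
interface EmitContext {
  causedBy: string;
  workflowRunId: string | null;
  newId: () => string;
  now: () => Date;
}

/** Builds the row a producer would append; it does not write anywhere. */
function buildEvent(
  ctx: EmitContext,
  eventType: string,
  entity: { type: string; id: string },
  payload: Record<string, unknown>,
): LedgerEvent {
  return {
    event_id: ctx.newId(),
    event_type: eventType,
    entity_type: entity.type,
    entity_id: entity.id,
    payload_json: payload,
    caused_by: ctx.causedBy,
    workflow_run_id: ctx.workflowRunId,
    source_system: "platform",
    occurred_at: ctx.now().toISOString(),
  };
}
```

Injecting newId and now keeps the builder pure, which makes producer tests straightforward.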

Idempotency

The idempotency_key contract

Some events represent operations that must be exactly-once on consumers. For those, the producer sets an idempotency_key in payload_json. Consumers store the seen keys in their own per-consumer dedupe table and skip events they've already processed.

The canonical example: hitl.approved with key proposed_action_id. If the consumer (e.g. NS push handler) crashes mid-process and the workflow runner re-fires the event, the dedupe check makes the second fire a no-op. Combined with the R560 atomic claim at the approval boundary, we get end-to-end exactly-once semantics.
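The dedupe contract can be sketched as follows. This is a minimal in-memory illustration, assuming the key is recorded only after the side effect succeeds; DedupeStore, ApprovedEvent, and handleApproved are hypothetical names, and a real consumer would keep the keys in its own dedupe table rather than a Set.

```typescript
// In-memory stand-in for a per-consumer dedupe table (assumption).
class DedupeStore {
  private readonly seen = new Set<string>();

  has(key: string): boolean {
    return this.seen.has(key);
  }

  markDone(key: string): void {
    this.seen.add(key);
  }
}

interface ApprovedEvent {
  event_id: string;
  event_type: "hitl.approved";
  payload_json: { proposed_action_id: number };
}

/** Runs `push` at most once per proposed_action_id, even when the event is re-fired. */
function handleApproved(
  event: ApprovedEvent,
  store: DedupeStore,
  push: (proposedActionId: number) => void,
): "processed" | "skipped" {
  const actionId = event.payload_json.proposed_action_id;
  const key = `hitl.approved:${actionId}`;
  if (store.has(key)) return "skipped"; // second fire is a no-op
  push(actionId); // may throw; the key stays unrecorded so a re-fire retries
  store.markDone(key);
  return "processed";
}
```

Recording the key after the push means a crash before markDone leads to a retry rather than a lost action. Closing the remaining gap (a crash between the push and markDone) is what the atomic claim at the approval boundary is for.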

Why ULID and not autoincrement?

ULIDs are time-sortable, client-generatable, and globally unique without a coordinator. Producers can emit events at the edge without round-tripping to D1 for an ID. The occurred_at ordering survives backfills because recorded_at tracks insert time separately.
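To see why ULID keys sort by time, here is a minimal sketch of the encoding described in the public ULID spec: a 48-bit millisecond timestamp in 10 Crockford base32 characters, followed by 16 random characters. The function names are illustrative, and Math.random stands in for a proper random source; the platform's real generator isn't shown in this guide.

```typescript
// Crockford base32 alphabet (no I, L, O, U). Its ASCII order matches its value order,
// which is what makes plain string comparison agree with numeric comparison.
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

/** Encodes a millisecond timestamp as 10 base32 characters, most significant first. */
function encodeTime(ms: number): string {
  let out = "";
  let rest = ms;
  for (let i = 0; i < 10; i++) {
    out = CROCKFORD.charAt(rest % 32) + out;
    rest = Math.floor(rest / 32);
  }
  return out;
}

/** Sketch of a ULID: time prefix plus 16 random characters (not cryptographically strong here). */
function sketchUlid(nowMs: number = Date.now(), rand: () => number = Math.random): string {
  let randomPart = "";
  for (let i = 0; i < 16; i++) {
    randomPart += CROCKFORD.charAt(Math.floor(rand() * 32));
  }
  return encodeTime(nowMs) + randomPart;
}
```

Because the timestamp is the prefix, an ID minted in a later millisecond always compares greater as a string. Two IDs minted in the same millisecond are not ordered by this sketch.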

Subscriptions

How a workflow subscribes

The event_subscriptions table maps event_type_glob patterns to workflow_type. The drain loop on cron walks new events, matches against enabled subscriptions, applies the optional filter_expr, transforms the payload via input_mapper, and starts the workflow.

-- Example: vendor cost changes trigger margin re-check workflow
INSERT INTO event_subscriptions (event_type_glob, workflow_type, input_mapper)
VALUES ('vendor.cost_changed', 'vendor_cost_review', '{"vendor_id":"$.entity_id"}');

-- Wildcard: any AR bucket move pings customer health
INSERT INTO event_subscriptions (event_type_glob, workflow_type)
VALUES ('ar.*', 'recompute_customer_health');
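How a drainer might evaluate these rows can be sketched as below. Two assumptions are baked in and not confirmed by this guide: '*' matches any run of characters (dots included), and input_mapper values starting with "$." are dotted paths into the event. globMatches, resolvePath, and applyInputMapper are hypothetical names.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

/** Assumed glob semantics: '*' matches any run of characters; everything else is literal. */
function globMatches(glob: string, eventType: string): boolean {
  const pattern = glob
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*");
  return new RegExp("^" + pattern + "$").test(eventType);
}

/** Follows a "$.a.b" style path; returns undefined when any segment is missing. */
function resolvePath(root: Json, path: string): Json | undefined {
  let current: Json | undefined = root;
  for (const segment of path.slice(2).split(".")) {
    if (current === null || typeof current !== "object" || Array.isArray(current)) {
      return undefined;
    }
    current = current[segment];
  }
  return current;
}

/** Builds workflow input from an input_mapper JSON string and an event object. */
function applyInputMapper(mapperJson: string, event: Json): { [key: string]: Json } {
  const mapper = JSON.parse(mapperJson) as { [key: string]: Json };
  const input: { [key: string]: Json } = {};
  for (const [key, value] of Object.entries(mapper)) {
    if (typeof value === "string" && value.startsWith("$.")) {
      input[key] = resolvePath(event, value) ?? null; // missing paths become null
    } else {
      input[key] = value; // literals pass through unchanged
    }
  }
  return input;
}
```

With the first subscription above, an event whose entity_id is "V77" would yield the workflow input {"vendor_id": "V77"}.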

Drain

Cron schedule + cursors

The drain loop runs on cron schedule */2 * * * * (every 2 minutes) for the workflow_runner drainer. Each drainer tracks its position via event_drain_state.last_event_id. The next run reads events where event_id > last_event_id ORDER BY event_id ASC LIMIT 500.

Other drainers run on their own schedules: timeline indexer hourly, anomaly detector every 15 min, reflexion analyzer every 30 min, customer health watcher every 5 min. Each has its own cursor row, so they don't interfere.

workflow_runner — every 2 min — starts subscribed workflows.
timeline_indexer — hourly — materializes per-entity event lists for UI.
anomaly_detector — every 15 min — windowed pattern matching.
reflexion_analyzer — every 30 min — feeds reflexion rules.
customer_health_watcher — every 5 min — incremental score recompute.
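The cursor mechanics can be sketched with an in-memory stand-in for event_drain_state. This is an illustration under assumptions, not the real drainEvents: StoredEvent, DrainCursors, and drainOnce are hypothetical names, and the cursor is advanced after each handled event.

```typescript
interface StoredEvent {
  event_id: string; // ULID, so string order is time order
  event_type: string;
}

// In-memory stand-in for event_drain_state: drainer name -> last_event_id.
type DrainCursors = Map<string, string>;

/** Processes one batch for one drainer and returns how many events it handled. */
function drainOnce(
  drainer: string,
  events: readonly StoredEvent[],
  cursors: DrainCursors,
  handle: (event: StoredEvent) => void,
  limit: number = 500,
): number {
  const lastEventId = cursors.get(drainer) ?? "";
  // Equivalent of: WHERE event_id > last_event_id ORDER BY event_id ASC LIMIT 500
  const batch = events
    .filter((e) => e.event_id > lastEventId)
    .sort((a, b) => (a.event_id < b.event_id ? -1 : a.event_id > b.event_id ? 1 : 0))
    .slice(0, limit);
  for (const event of batch) {
    handle(event); // if this throws, the cursor stays before the failing event
    cursors.set(drainer, event.event_id);
  }
  return batch.length;
}
```

Because each drainer reads and writes only its own map entry, a slow or stalled drainer never moves another drainer's position.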

Step-by-step: from emit to action

How an event flows

01

Producer emits

The originating workflow calls emitEvent({type, entity, payload}). The helper generates a ULID, fills caused_by from the calling context, and inserts. ~5ms.

Writes events
Time ~5ms
02

Drainer wakes on cron

The 2-minute cron fires the workflow_runner drainer. It reads its cursor from event_drain_state and selects up to 500 newer events.

Reads event_drain_state, events
Time ~100ms
03

Match against subscriptions

For each event, the drainer fans out against the event_subscriptions rows whose glob matches the event_type. Each match generates a workflow start request.

Reads event_subscriptions
04

Idempotency check + start

If the subscription's workflow has an idempotency_key declared, the runner checks per-consumer dedupe. If unseen, it starts the workflow run via the same path as a manual trigger. Cursor advances on commit.

Writes workflow_run_log, event_drain_state.last_event_id

Outcomes

What the substrate enables

Outcome	Value	Note
Decoupling	Total	producers don't know consumers
Replay	Possible	cursor rewind
Drain lag	≤ 2 min	workflow runner cadence
Idempotency	Per consumer	exactly-once where needed

New consumers can be added without touching producers — just register a subscription.
Per-entity timelines come for free — filter the events table by entity.
Replay tooling can re-fire historical events into a fresh consumer to backfill state.
Anomaly detection has a single uniform stream to watch.
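Timelines and replay are both just filtered reads over the same stream, which a short sketch makes concrete. The names below (TimelineEvent, entityTimeline, replayWindow) are hypothetical, and the events live in an array here rather than in D1.

```typescript
interface TimelineEvent {
  event_id: string; // ULID, so string order is time order
  event_type: string;
  entity_type: string;
  entity_id: string;
}

/** Orders events by event_id without mutating the input array. */
function sortedById(events: readonly TimelineEvent[]): TimelineEvent[] {
  return [...events].sort((a, b) =>
    a.event_id < b.event_id ? -1 : a.event_id > b.event_id ? 1 : 0,
  );
}

/** Per-entity timeline: filter by entity, oldest first. */
function entityTimeline(
  events: readonly TimelineEvent[],
  entityType: string,
  entityId: string,
): TimelineEvent[] {
  return sortedById(
    events.filter((e) => e.entity_type === entityType && e.entity_id === entityId),
  );
}

/** Replay: feed every event in (afterId, upToId] to a consumer, oldest first. */
function replayWindow(
  events: readonly TimelineEvent[],
  afterId: string,
  upToId: string,
  consumer: (event: TimelineEvent) => void,
): number {
  const inWindow = sortedById(
    events.filter((e) => e.event_id > afterId && e.event_id <= upToId),
  );
  inWindow.forEach((e) => consumer(e));
  return inWindow.length;
}
```

A fresh consumer can be backfilled by replaying from an empty afterId up to the newest event_id, assuming its handler is idempotent.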

Failure modes

What can go wrong

Drainer stalls

If the cron run fails, events accumulate but aren't lost. Next run picks up at the cursor. Long stalls (> 1 hour) trigger alerting via the platform health endpoint.

Schema drift in payload

Payload shape is defined per event_type but is not enforced at insert, so consumers should read fields defensively. Schema docs live next to each event_type producer in code comments.
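A defensive read might look like the sketch below, using the price.changed payload shown earlier in this guide. Which fields are treated as required is an assumption made for illustration, and readPriceChanged and PriceChange are hypothetical names.

```typescript
interface PriceChange {
  oldPrice: number;
  newPrice: number;
  bidId: string | null; // treated as optional in this sketch
}

/** Returns a typed view of a price.changed payload, or null if it does not fit. */
function readPriceChanged(payload: unknown): PriceChange | null {
  if (typeof payload !== "object" || payload === null || Array.isArray(payload)) {
    return null;
  }
  const fields = payload as Record<string, unknown>;
  const oldPrice = fields["old_price"];
  const newPrice = fields["new_price"];
  // Reject rather than coerce: a string like "1.42" signals producer drift.
  if (typeof oldPrice !== "number" || typeof newPrice !== "number") {
    return null;
  }
  const bidId = fields["bid_id"];
  return { oldPrice, newPrice, bidId: typeof bidId === "string" ? bidId : null };
}
```

Returning null instead of throwing lets a consumer log and skip a malformed event without stalling its drain cursor.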

Wildcard storms

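A subscription on '*' matches every event and would re-fire everything. As a safety net, the drainer caps fanout at 8 subscribers per event. Alerting on cap hits is still a stub (see the Backlog section), so a cap hit is currently silent.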
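The cap itself is simple to picture. In the sketch below, the constant name MAX_FANOUT_PER_EVENT comes from the runbook in this guide; everything else is hypothetical, including the choice to keep the first 8 matches and drop the rest.

```typescript
const MAX_FANOUT_PER_EVENT = 8;

interface FanoutDecision {
  started: string[]; // workflow types that will be started
  capped: boolean; // true when matches were dropped by the cap
}

/** Applies the per-event fanout cap to the workflow types matched for one event. */
function applyFanoutCap(
  matchedWorkflowTypes: readonly string[],
  cap: number = MAX_FANOUT_PER_EVENT,
): FanoutDecision {
  return {
    started: matchedWorkflowTypes.slice(0, cap),
    capped: matchedWorkflowTypes.length > cap,
  };
}
```

The capped flag is the natural hook for an alert, since a dropped subscriber is otherwise invisible to whoever registered it.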

Adjacent substrate

For developers

Code paths + invariants

Concern	Where
Schema	migrations/schema/123_events_ledger.sql
Emitter	src/index.ts emitEvent helper
Drainer	src/lib/workflow_runner.ts drainEvents
Cron	wrangler.toml: */2 * * * * for workflow_runner
ULID	client-generated for time-sortable ordering
Idempotency	payload_json.idempotency_key + per-consumer dedupe
Subscriptions	event_subscriptions table — glob matching
Per-event fanout cap	8 subscribers max — anti-storm safety

Changelog

Dated trail · spot stale claims

Dated trail of when this doc was last touched, what changed, and what to look at if it feels stale.

Date	Round	Change	Touched by
`2026-05-26`	`R586`	Added CHANGELOG · SCHEMA · RUNBOOK · BACKLOG sections — wiki became best-in-class operating documentation.	Mike + Claude
`2026-05-25`	`R584/R585`	Wiki originally shipped — 8-section structure (hero / what / when / steps / outcomes / failure-modes / related / for-developers).	Mike + Claude

If today is more than 60 days past the latest changelog row, treat live system behavior as the source of truth. The doc may have drifted — verify against the workflow contract in workflow_definitions WHERE workflow_type='events_substrate' before acting on these claims.

Schema · data contract

The machine-readable spec

Canonical fields, table names, endpoint signatures. What code should match, what tests should assert. workflow_type: events_substrate · risk_level: N/A (substrate).

Inputs (required + optional)

Field	Type	Description
`event_type`	`string`	Dotted namespace, e.g. 'price.changed'. Required.
`entity_ref`	`string`	Subject — 'customer:2147', 'bid:B5875'. Required.
`payload_json`	`json`	Event-specific data. Required.
`idempotency_key`	`string`	Producer-generated; prevents duplicate consumption.
`event_id`	`string`	ULID; time-sortable identifier. Auto-generated by the emit helper, not supplied by the caller.

D1 tables written

Table	Operation	Trigger
`events`	INSERT (one per emit)	Append-only — never updated, never deleted
`event_subscriptions`	READ	Drain logic resolves which workflows to trigger
`event_drain_state`	UPDATE	Per-consumer cursor; last_event_id tracks the last processed ULID

Endpoints called

Method	Path	Purpose
`Helper`	`src/index.ts::emitEvent(env, event_type, entity_ref, payload, idem_key)`	All producers go through this
`GET`	`/api/events?type=&entity=`	Read recent events for an entity
`POST`	`/admin/events/drain`	Manual drain trigger

Events fired

event_type	When	Subscribers
`(N/A — events is the substrate, not a producer)`	—	—

Runbook · when it breaks

It broke at 2am — what now

Different from "how do I use this." This is the page Mike pulls up when something is wrong: logs to check, recovery steps, who to escalate to.

Scenario · Drain cursor stuck on a poison event

An event payload caused the drainer to throw and the cursor never advanced.

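Identify: SELECT * FROM events WHERE event_id > (SELECT last_event_id FROM event_drain_state) ORDER BY event_id ASC LIMIT 5, restricting the subquery to the stuck drainer's cursor row.
Skip: UPDATE event_drain_state SET last_event_id = <poison_id> for the stuck drainer's row only, to advance past it.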
Quarantine: Copy the row into events_quarantine for forensic review (events itself is append-only, so the original row stays).
Patch: Find what consumer threw; fix the defensive read.

Scenario · Subscription fired the wrong workflow_type

event_subscriptions pattern matched too broadly (e.g. '*' or 'price.*' instead of 'price.changed').

Inspect: SELECT * FROM event_subscriptions WHERE event_type_glob LIKE '%' || ? || '%'
Narrow: Update pattern to exact match or tighter glob.
Audit: Scan all subscriptions for '*' wildcards — they're a smell.

Scenario · idempotency_key collision — two producers same key

Should produce one event, not two. If you see duplicates, the dedupe at emit failed.

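Find dupes: SELECT json_extract(payload_json, '$.idempotency_key') AS idem_key, COUNT(*) FROM events WHERE json_extract(payload_json, '$.idempotency_key') IS NOT NULL GROUP BY json_extract(payload_json, '$.idempotency_key') HAVING COUNT(*) > 1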
Investigate: Both producers thought they were authoritative. Decide which one wins and retire the other with a tombstone insert; events are append-only, so never DELETE.
Prevent: Tighten the unique key generation rule per event_type.

Scenario · Wildcard storm — one event fired 50 subscribers

The anti-storm cap is 8 subscribers per event. If a single event reached more than that, the cap was bypassed.

Check cap: Constant MAX_FANOUT_PER_EVENT in src/lib/workflow_runner.ts
Find offender: SELECT event_type_glob, COUNT(*) FROM event_subscriptions GROUP BY event_type_glob HAVING COUNT(*) > 5
Cull: Delete or narrow patterns matching everything.

Logs to check

workflow_run_log · top-level run audit
workflow_step_log · per-step trace
workflow_verify_results · post-window verify outcomes
cron_locks · stuck cron lock detection
events · workflow.run_completed / workflow.run_failed event trail
reflexion_log · per-run narrative (if reflexion_enabled)
npx wrangler tail · live Worker logs

Kill switch · emergency stop

If this workflow is misbehaving in a high-impact way (creating bad proposed_actions in volume, pushing wrong things to NS), flip a kill switch:

kill:ns_writes · stops every NS push platform-wide
kill:proposed_apply · stops HITL approvals from executing fan-out
kill:high_risk_ops · stops risk_level >= 4 fan-out

See kill-switches-state-machine.html for the full state machine + recovery procedure.

Escalation

Primary: Mike Levine (single-admin) · mikelevine@globalfoodsolutions.co. For prolonged outage during business hours, notify warehouse lead + accounting lead so they can defer dependent work.

Backlog · open questions

What's not done · what's uncertain

What's not done, what's uncertain, what we punted. Captured so it survives context switches and doesn't die in someone's head.

OPEN
Schema enforcement per event_type
Today payload_json is free-form. Should be typed (Zod schema) per event_type. Backlog: events_schema table + emit-time validation.
DEFER
Event replay from cursor
If a consumer was broken for a week, we can't easily re-run it over the missed window. Need a per-consumer replay cursor.
STUB
Wildcard storm alerting
We cap at 8 but don't notify Mike when the cap hits — silent.
DECISION
Retention policy
Events grow forever. After 2 years, do we tombstone or summarize? Storage isn't free.
OPEN
Cross-region replication
Today single region. If we ever go multi-region, event ordering across regions is hard.