Workflow runner · wiki guide

What this is

The single execution surface for every workflow contract

The workflow runner is executeWorkflowContract in src/lib/workflow_runner.ts. It's the one function every workflow on the platform passes through — bid_price_update, vendor_cost_update, ar_aging_action_plan, all 22 of them. Reads a row from workflow_definitions, walks 7 stages, writes audit rows to workflow_step_log, workflow_run_log, and (conditionally) reflexion_log, proposed_actions, and workflow_verify_results.

It shipped in R552 as the Tier 1 "every role becomes a workflow-driven agent" move. R560 was the hardening pass — closing five bug classes the Codex audit found, including the silent-swallow on stage_proposed_action INSERT failures (the Cal-Maine-class bug: workflow reports "ok" with zero NS writes). R564 documented the workflow.completed event intent (not yet wired). R571 added Play-animation runtime hooks. R576 tuned the active-state dwell timer for the UI.

Diagram: substrate-workflow-runner.html. The runner is 615 LOC; it's small because the heavy lifting is in the contracts themselves.

When it fires

Four entrypoints

HTTP API — POST /api/workflow/execute with ?preview=true|confirm=true. Direct invocation. Preview runs all stages but skips writes in executeFanOut. Used by ops console + manual triggers.
Event-driven — drainEvents() reads new events from the ledger, matches against event_subscriptions, transforms payload via input_mapper JSONPath, invokes the runner with invoked_by: "event:<type>:<id>". KV concurrency lock prevents double-fire.
Chat-tool invocation — execute_workflow chat tool (admin-gated, currently STUB). Most chat surfaces stage a workflow_* proposed_action first and run on approval (HITL invariant per ADR-031).
Cron-driven — scheduled workflows (e.g. monthly_margin_review, annual_price_roll) fire from the cron handler. The verify-check scheduler at 45 5 * * * reads pending workflow_verify_results and runs verify SQL.

The seven stages

What every workflow goes through

01

loadContract(workflowType)

Single SELECT against workflow_definitions WHERE workflow_type=?1 AND enabled=1. Returns the contract row: precondition rules, fan-out targets, verify checks, risk level, reflexion flag. Null result → status='failed' with workflow_type_not_found.

Reads workflow_definitions
02

loadContext() — N parallel SELECTs

For each query in context_to_load_json: resolve binds (R560 fix: ordered positional bind to ? placeholders, not string-substitution), optionally skip if q.if = "X present" evaluates false. Errors no longer silently swallowed — they push to contextErrors[] which surfaces in result.errors.

Reads N parallel SELECTs
Writes none
03

checkPreconditions()

Walks preconditions_json and evaluates each rule. Grammar: "X present", "X is null", "X >= N" (also < <= = != >). severity:'block' + (fail OR unevaluable) → blocked=true, returns status='aborted'. severity:'warn' → push to warnings, continue.

Writes none (early return possible)
04

stageHitlProposal() — if risk_level ≥ 3

If risk_level ≥ 3 AND !opts.hitl_approved: INSERT proposed_actions with action_type='workflow_<type>', entity_type='workflow_run', entity_ref=run_id, status='pending', proposed_by='workflow_runner'. Return status='pending_hitl'. Stages 5-7 wait for approval. The decide endpoint re-invokes the runner with opts.hitl_approved=true.

Writes proposed_actions
Gate risk_level ≥ 3
05

executeFanOut() — per target

For each target in fan_out_targets_json: evaluate t.if → skip / continue. Dispatch by kind. REAL kinds: kv_invalidate, stage_proposed_action (R560: throws on INSERT failure). STUB kinds (R560 marks status='stub', NOT counted as executed): d1_write, ns_push, http_call, chat_tool, hitl_email_draft, flag, workflow_class_invoke, loop_*, dispatch_workflow. Per-step rows INSERTed into workflow_step_log on entry & exit. on_failure='abort' → break.

Writes workflow_step_log (2/step), conditional
R560 stubs ↔ executed=0
06

scheduleVerifyChecks()

Per check in verify_checks_json: INSERT workflow_verify_results with status='pending', expected_json=<window+expected+sql_check>, notes='scheduled by runner'. The verify scheduler at cron 45 5 * * * picks these up after the configured window and runs the SQL.

Writes workflow_verify_results
07

executePostActions()

Always: INSERT workflow_run_log (run_id, status, duration_ms, errors_count, output_json). If reflexion_enabled=1: INSERT reflexion_log with entity_type='workflow_run', tags='workflow:<type>,status:<status>'. Return status: completed | partial | failed | aborted | pending_hitl. Note: workflow.completed event documented (R564) but NOT yet wired.

Writes workflow_run_log, optionally reflexion_log

Worked example

Mike approves Driscoll's $0.06 bump on B5875

Mike taps Approve on a bid_price_update proposed_action for SKU 10472, $1.42 → $1.48 on bid B5875. The runner fires.

Stage 1 loads the bid_price_update contract from workflow_definitions (~2ms). Stage 2 loads context: the bid row, the customer row, the item row, the current price (~80ms, 4 parallel SELECTs). Stage 3 checks preconditions: "bid_id present" — pass; "item_id present" — pass; "new_price > 0" — pass. Stage 4 sees risk_level=3 but Mike's approval came in as opts.hitl_approved=true — skip. Stage 5 fans out 7 targets: kv_invalidate (real, deletes HUB_CACHE keys) + 6 STUB pushes (ns_push, d1_write, etc. — marked status='stub', NOT counted). 14 rows in workflow_step_log. Stage 6 schedules 2 verify checks ("NS price matches after 5 min", "hub page reflects new price after 60s"). Stage 7 writes workflow_run_log with status='completed', executed=1, stubbed=6 and a reflexion_log entry.

Mike sees: ok with executed=1. He knows the 6 stubs mean the actual NS push runs through the legacy NS_PUSH_QUEUE path, not the runner's stub dispatcher. That's the migration runway — gradually wire each stub to its real implementation without changing contract definitions.

Outcomes

What the substrate enables

Contracts

on the runner today

Fan-out targets

20 real, 74 stub

HITL-gated

risk ≥ 3

Code size

615 LOC

single function

New workflow = new row in workflow_definitions, no new code path.
Every run leaves a per-step audit trail in workflow_step_log.
R560 makes stubs visible — can't ship "fake green" runs anymore.
Verify checks decouple "did it happen" from "did it happen and the downstream system shows it".

R560 hardening

The five bug classes closed

Bug class (pre-R560)	Fix
loadContext silently swallowed query errors	errors push to `contextErrors[]` → `result.errors`
checkPreconditions treated unevaluable as 'pass'	block-severity unevaluable now blocks; warn downgrades with marker
stage_proposed_action INSERT used `.catch(()=>null)` — HITL bypass	try/catch → explicit throw; outer catch marks step 'failed'
Stub kinds returned ok & incremented `executed`	marked status='stub', NOT counted — surfaces wiring gaps
Concurrent drainer firings could double-invoke runner	KV concurrency lock per drainer (TTL 300s)

Failure modes

What can go wrong

Contract row missing

loadContract returns null → status='failed' with workflow_type_not_found. Detection: ops console "failed" filter. Recovery: migration to add the contract row; re-trigger.

Stub fanout reports completed with executed=0

This is by design (R560). It tells Mike a contract is wired in shape but not in implementation. Detection: workflow_run_log.steps_executed=0 AND steps_total>0. Recovery: wire the stub kinds for that contract's fan-out targets.

workflow.completed event not fired (documented intent only)

R564 documented the intent; not yet wired in stage 7. Downstream consumers waiting for workflow.completed currently poll workflow_run_log instead. Tracked in punch list.

HITL pending past expectation window

Stage 4 returns pending_hitl; stages 5-7 never fire until decide. Detection: workflow_run_log rows in pending_hitl > N hours. Recovery: ping Mike via recap.

Adjacent substrate

For developers

Code paths + invariants

Concern	Where
Runner	src/lib/workflow_runner.ts (615 LOC)
Schema	migrations/schema/117_workflow_definitions_v2.sql, 118_workflow_contracts_v2.sql
run_log	migrations/schema/121_workflow_run_log.sql
reflexion_log	migrations/schema/122_reflexion_log.sql
HTTP entry	src/index.ts:14304 — POST /api/workflow/execute
Cron entry	src/index.ts:33255 — 45 5 * * * verify scheduler
Event drainer	src/lib/events.ts drainEvents()
R552 commit	substrate ship: workflow runner + 7-stage execution
R560 hardening	5 bug classes closed, stub-vs-real surface
R564 event intent	workflow.completed (documented, not wired)
R571 Play animation	runtime hooks for diagram play-through
R576 dwell timer	active-state dwell tuning in UI

// 7-stage outline (executeWorkflowContract) const contract = await loadContract(workflowType); // 1 const context = await loadContext(contract, input); // 2 const pre = checkPreconditions(contract, context); // 3 if (pre.blocked) return { status: 'aborted', ... }; if (contract.risk_level >= 3 && !opts.hitl_approved) { // 4 await stageHitlProposal(contract, run_id); return { status: 'pending_hitl' }; } const fan = await executeFanOut(contract, context); // 5 await scheduleVerifyChecks(contract, run_id); // 6 await executePostActions(contract, run_id, fan); // 7

Changelog

Dated trail · spot stale claims

Dated trail of when this doc was last touched, what changed, and what to look at if it feels stale.

Date	Round	Change	Touched by
`2026-05-26`	`R586`	Added CHANGELOG · SCHEMA · RUNBOOK · BACKLOG sections — wiki became best-in-class operating documentation.	Mike + Claude
`2026-05-25`	`R584/R585`	Wiki originally shipped — 8-section structure (hero / what / when / steps / outcomes / failure-modes / related / for-developers).	Mike + Claude

If today is more than 60 days past the latest changelog row, treat live system behavior as the source of truth. The doc may have drifted — verify against the workflow contract in workflow_definitions WHERE workflow_type='workflow_runner_substrate' before acting on these claims.

Schema · data contract

The machine-readable spec

Canonical fields, table names, endpoint signatures. What code should match, what tests should assert. workflow_type · workflow_runner_substrate · risk_level · N/A (substrate).

Inputs (required + optional)

Field	Type	Description
`workflow_type`	`string`	Required — must match workflow_definitions row.
`inputs_json`	`json`	Workflow-specific inputs.
`idempotency_key`	`string`	Prevents duplicate execution.

D1 tables written

Table	Operation	Trigger
`workflow_run_log`	INSERT	Top-level run audit
`workflow_step_log`	INSERT (one per step)	Per-step trace
`workflow_verify_results`	INSERT	Post-window verify outcomes
`reflexion_log`	INSERT	Per-run narrative (reflexion_enabled)
`cron_locks`	INSERT / DELETE	Stuck-cron detection

Endpoints called

Method	Path	Purpose
`POST`	`/api/workflow/execute`	Runner entrypoint
`GET`	`/api/workflow/runs/:run_id`	Run status + step trace
`POST`	`/api/workflow/runs/:run_id/retry`	Manual retry from last failed step

Events fired

event_type	When	Subscribers
`workflow.started`	On invocation	audit
`workflow.completed`	Final step settled	downstream subscribers
`workflow.failed`	Halt or timeout	alerting

Runbook · when it breaks

It broke at 2am — what now

Different from "how do I use this." This is the page Mike pulls up when something is wrong: logs to check, recovery steps, who to escalate to.

Scenario · Workflow stuck in pending_hitl with no proposed_action row

stage_proposed_action step ran but the INSERT failed silently.

Check the runner log: SELECT * FROM workflow_run_log WHERE id='<run_id>'
Check HITL stage: grep [runner.stage_proposed_action] in wrangler tail.
Recovery: Manually INSERT a proposed_actions row matching the workflow's expected shape, OR force run status='failed' and re-invoke.
Escalate: Mike (single-admin).

Scenario · Fan-out target failed and run halted

A step's status='failed' halts subsequent steps.

Inspect step: SELECT step_id, status, output_json, error FROM workflow_step_log WHERE workflow_run_id='<run_id>' ORDER BY started_at
Identify the failed kind: First status='failed' row.
Recovery: Depends on kind. ns_push stub = no recovery (kind unimplemented). d1_write = manually INSERT/UPDATE.

Scenario · Run is 'stub' for every step

All fan-out kinds for this workflow_type are unimplemented in the dispatch table.

Inspect dispatch: src/lib/workflow_runner.ts around line 340 (placeholder branch).
Add kind handler: Implement the missing kind.
Until then: Workflow is documentation-only.

Scenario · Cron lock stuck — workflow won't start again

Prior run crashed before releasing lock.

Inspect: SELECT * FROM cron_locks WHERE cron_name='<workflow_type>'
Release: DELETE FROM cron_locks WHERE cron_name=? AND acquired_at < datetime('now','-30 minutes')
Investigate: Why did the lock not release? Check the prior run's last step.

Logs to check

workflow_run_log · top-level run audit
workflow_step_log · per-step trace
workflow_verify_results · post-window verify outcomes
cron_locks · stuck cron lock detection
events · workflow.completed / workflow.failed event trail
reflexion_log · per-run narrative (if reflexion_enabled)
npx wrangler tail · live Worker logs

Kill switch · emergency stop

If this workflow is misbehaving in a high-impact way (creating bad proposed_actions in volume, pushing wrong things to NS), flip a kill switch:

kill:ns_writes · stops every NS push platform-wide
kill:proposed_apply · stops HITL approvals from executing fan-out
kill:high_risk_ops · stops risk_level >= 4 fan-out

See kill-switches-state-machine.html for the full state machine + recovery procedure.

Escalation

Primary: Mike Levine (single-admin) · mikelevine@globalfoodsolutions.co. For prolonged outage during business hours, notify warehouse lead + accounting lead so they can defer dependent work.

Backlog · open questions

What's not done · what's uncertain

What's not done, what's uncertain, what we punted. Captured so it survives context switches and doesn't die in someone's head.

STUB
Fan-out kinds: ns_push, http_call, hitl_email_draft, flag, loop_over_*
Most kinds emit status='stub' today. Implement the dispatch table in src/lib/workflow_runner.ts.
OPEN
Retry policy per step (vs per workflow)
Today retry is at workflow level. Per-step retry (with different policies per kind) would be more granular.
DEFER
Saga pattern for cross-step compensation
If step 5 fails, steps 1-4 may need compensation. Today no compensation.
DECISION
Should reflexion_enabled be per-workflow or per-run?
Today per-workflow (in workflow_definitions). Per-run would let Mike enable on a one-off.