HITL Lifecycle — proposed_actions state machine

Substrate diagram · R537 risk tiers, R357 fail-closed, R377 atomic, R560 race fix · src/index.ts:25005-25109

Every AI-proposed mutation lands in proposed_actions for admin review (ADR-031). Approval enqueues an ns_pending_pushes row that drains to NetSuite. The lifecycle is small but the consistency story is large — this doc lays it out, including the R560 race fix that closed a duplicate-dispatch window flagged by the codex audit.

HITL invariant: ADR-031 · 5 risk tiers · R560 race fix

Schema recap — proposed_actions

| column | type | notes |
|---|---|---|
| action_id | INTEGER PK | autoincrement |
| action_type | TEXT | e.g. price_change, bulk_cost_basis, workflow_<type>, propose_email_to_customer |
| entity_type | TEXT | customer, item, workflow_run, etc. |
| entity_ref | TEXT | NS id or run_id; identifies the target of the change |
| current_state_json | TEXT | pre-image (what's there now) |
| proposed_change_json | TEXT | post-image (the diff the agent wants applied) |
| rationale | TEXT | the why-now, surfaced in the queue UI |
| status | TEXT | pending → approved → pushing → applied or failed; or pending → rejected |
| risk_level | INTEGER | 1-5, see tier table below (migration 113) |
| decided_at / decided_by | TEXT | set atomically in the claim UPDATE |
| proposed_by | TEXT | e.g. workflow_runner, r290:executor, chat:role=admin |
| proposed_at | TEXT | creation timestamp |

Risk tiers — migration 113 (R537)

| tier | name | example action_types | HITL gate |
|---|---|---|---|
| L1 | note / tag | note, tag, classification | auto-approve in workflow_runner (no proposal staged) |
| L2 | safe NS field | ns_field_update, spec_update, create_customer_program, other | auto-approve in workflow_runner |
| L3 | medium write | price_change, bid_status_update, quote_draft, vendor_failover, bulk_cost_basis, collection_action | HITL required (risk_level ≥ 3 gates in runner) |
| L4 | creates new entity | propose_create_customer, propose_create_vendor, propose_create_item, soft_delete | HITL required |
| L5 | destructive bulk | bulk_delete, destructive_bulk, mass_price_change | HITL required + cumulative-ceiling guardrails (CostCapDO) |

The runner's HITL gate at stage 4 checks risk_level ≥ 3 AND !opts.hitl_approved. L1-L2 contracts skip the gate entirely. The decide endpoint enforces a per-step approval regardless — risk_level is advisory there, not authoritative.
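The stage-4 gate condition can be written as a small predicate. This is a minimal sketch, not the runner's actual code: the `RunnerOpts` shape, the `HITL_THRESHOLD` constant, and the function name `mustStageForReview` are hypothetical; only `risk_level`, the threshold of 3, and `opts.hitl_approved` come from the text above.

```typescript
// Hypothetical option bag; only hitl_approved is named in this doc.
interface RunnerOpts {
  hitl_approved?: boolean;
}

// Risk tiers 1-5 per migration 113.
type RiskLevel = 1 | 2 | 3 | 4 | 5;

// L3 and above require a human decision.
const HITL_THRESHOLD = 3;

// True when the runner must stop and stage a proposal for review.
// L1-L2 never trip the gate; L3-L5 trip it unless already approved.
function mustStageForReview(riskLevel: RiskLevel, opts: RunnerOpts): boolean {
  return riskLevel >= HITL_THRESHOLD && !opts.hitl_approved;
}
```

Because the decide endpoint enforces approval per step on its own, a predicate like this only decides whether the runner pauses; it is not the authorization check.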

State machine

Sources (proposed_by): workflow_runner, chat tool, cron / drift. Each performs INSERT status='pending'.

Transitions:

  1. pending → rejected (terminal): decide:reject. UPDATE status='rejected', then recordEvent hitl.rejected.
  2. pending → approved: decide:approve (atomic claim). UPDATE...RETURNING + INSERT ns_pending_pushes.
  3. approved → pushing: drainer picks the row; NS_PUSH_QUEUE carries it in flight to NS.
  4. pushing → applied (terminal): NS ok. Adds a mirror row + decision_corpus.
  5. pushing → failed (terminal): retries exhausted. last_error stored; DLQ if configured.

409 already_decided (race losers): UPDATE...WHERE status='pending' returns no row → current status surfaced in response.

HITL invariant (ADR-031 + CLAUDE.md): every write to NS / business state passes through pending. Mike is the only one who can transition pending → approved. X-Edit-Token + admin-role both required (X-Edit-Token required · api:bulk-decide | api:decide).
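One way to make the lifecycle checkable is to encode it as a transition table. The sketch below is illustrative only: the names `ProposedStatus`, `TRANSITIONS`, `canTransition`, and `isTerminal` are hypothetical. The approved → pending edge is the claim revert described in the race-fix section, and pushing is kept as a state even though the drainer section notes it is implicit via the push row's status.

```typescript
type ProposedStatus =
  | "pending" | "approved" | "rejected" | "pushing" | "applied" | "failed";

// Allowed next states for each status of a proposed_actions row.
const TRANSITIONS: Record<ProposedStatus, readonly ProposedStatus[]> = {
  pending: ["approved", "rejected"],
  // "pending" here is the revert path when the enqueue batch fails after a claim.
  approved: ["pushing", "pending"],
  pushing: ["applied", "failed"],
  rejected: [],
  applied: [],
  failed: [],
};

function canTransition(from: ProposedStatus, to: ProposedStatus): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Terminal states have no outgoing edges.
function isTerminal(status: ProposedStatus): boolean {
  return TRANSITIONS[status].length === 0;
}
```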

The race fix — R560 / codex audit CRITICAL #1

Two simultaneous approvers (two browser tabs, two admins, or one admin + a programmatic retry) could both claim the same action. The fix moves to an atomic UPDATE...RETURNING that's idempotent under concurrency — only one approver receives the action_id, only one enqueues.

OLD — broken (pre-R560)

// 3-statement D1 batch:
// 1. INSERT ns_pending_pushes  ← both win
// 2. UPDATE proposed_actions
//    SET status='approved' WHERE action_id=?
// 3. INSERT decision_corpus

// Race: two approvers
// each INSERT push row first.
// Then both UPDATE — the loser's
// cleanup DELETE could remove
// the winner's queue row,
// OR both rows could dispatch.
// No SELECT...FOR UPDATE in D1.

NEW — R560 atomic claim

// 1. Atomic claim — only ONE row wins:
const claim = await env.DB.prepare(
  `UPDATE proposed_actions
   SET status='approved',
       decided_at=datetime('now'),
       decided_by='admin:api'
   WHERE action_id=?2 AND status='pending'
   RETURNING action_id`
).bind(notes, actionId).first();

if (!claim?.action_id) {
  // Loser path: 409 already_decided
  return json({ ok:false,
    error:'already_decided',
    current_status: ... }, 409);
}

// 2. Only the winner enqueues:
await env.DB.batch([
  INSERT ns_pending_pushes,
  INSERT decision_corpus,
]);

// 3. If enqueue fails AFTER claim:
//    revert claim (status ← 'pending')
//    so operator can retry.
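The claim is effectively a compare-and-set on status. The following sketch simulates that behaviour with in-memory stand-ins so the winner, loser, and revert paths can be seen side by side. It assumes nothing about the real handler: `actions`, `pushQueue`, `claim`, `approve`, and the `enqueue_failed` error string are hypothetical, and a plain Map replaces D1.

```typescript
type Status = "pending" | "approved" | "rejected";

interface ActionRow { action_id: number; status: Status; decided_by?: string; }
interface Result { ok: boolean; httpStatus: number; error?: string; current_status?: Status; }

// In-memory stand-ins for proposed_actions and ns_pending_pushes.
const actions = new Map<number, ActionRow>([[7, { action_id: 7, status: "pending" }]]);
const pushQueue: number[] = [];

// Mimics UPDATE ... WHERE action_id=? AND status='pending' RETURNING action_id:
// only the caller that actually flips the row gets the id back.
function claim(actionId: number, decidedBy: string): number | null {
  const row = actions.get(actionId);
  if (!row || row.status !== "pending") return null;
  row.status = "approved";
  row.decided_by = decidedBy;
  return row.action_id;
}

function approve(actionId: number, decidedBy: string, enqueue: (id: number) => void): Result {
  const claimed = claim(actionId, decidedBy);
  if (claimed === null) {
    // Loser path: surface the current status with a 409.
    return { ok: false, httpStatus: 409, error: "already_decided",
             current_status: actions.get(actionId)?.status };
  }
  try {
    enqueue(claimed); // only the winner reaches this line
  } catch {
    // Enqueue failed after the claim: put the row back so a retry can succeed.
    actions.get(claimed)!.status = "pending";
    return { ok: false, httpStatus: 500, error: "enqueue_failed" };
  }
  return { ok: true, httpStatus: 200 };
}

// Two approvers target the same action; exactly one push row results.
const first = approve(7, "admin:tab-1", (id) => { pushQueue.push(id); });
const second = approve(7, "admin:tab-2", (id) => { pushQueue.push(id); });
```

The point of the design is that the enqueue sits behind the claim, so a second approver never reaches it no matter how the two requests interleave.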

The decide endpoint — POST /api/proposed-actions/:id/decide

Located at src/index.ts:25005-25109. Body: { decision: 'approved' | 'rejected', notes?: string }. Requires X-Edit-Token (R356).

  1. Gate: checkEditToken(request, env). Read-only API key cannot mutate HITL state.
  2. Lookup: SELECT * FROM proposed_actions WHERE action_id=?1. 404 if not found; 400 if status !== 'pending'.
  3. Decision rejected: single UPDATE...RETURNING, then recordEvent('hitl.rejected'). Done.
  4. Decision approved: atomic claim (UPDATE...RETURNING). Loser gets 409. Winner proceeds.
  5. Winner, enqueue batch: D1.batch([INSERT ns_pending_pushes, INSERT decision_corpus]).
  6. If batch fails: revert claim (UPDATE status='pending') so operator can retry; return 500 with "claim reverted; safe to retry".
  7. Success: recordEvent('hitl.approved', payload={...risk_level, queued_to:'ns_pending_pushes'}); return 200.
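The request contract and the status codes in the steps above can be summarised in a few lines. This is a sketch under stated assumptions: `parseDecideBody`, `DecideOutcome`, and `HTTP_STATUS` are hypothetical names, and the doc does not say how a malformed body is answered, so the parser simply returns null and leaves that to the caller.

```typescript
type Decision = "approved" | "rejected";

interface DecideBody {
  decision: Decision;
  notes?: string;
}

// Validates an already-JSON-parsed request body against the documented shape.
function parseDecideBody(raw: unknown): DecideBody | null {
  if (typeof raw !== "object" || raw === null) return null;
  const { decision, notes } = raw as Record<string, unknown>;
  if (decision !== "approved" && decision !== "rejected") return null;
  if (notes !== undefined && typeof notes !== "string") return null;
  return notes === undefined ? { decision } : { decision, notes };
}

// Hypothetical outcome labels mapped to the status codes listed in the steps.
type DecideOutcome =
  | "not_found" | "not_pending" | "lost_race" | "enqueue_failed" | "decided";

const HTTP_STATUS: Record<DecideOutcome, number> = {
  not_found: 404,      // step 2: no such action_id
  not_pending: 400,    // step 2: already decided at lookup time
  lost_race: 409,      // step 4: another approver claimed it first
  enqueue_failed: 500, // step 6: claim reverted; safe to retry
  decided: 200,        // steps 3 and 7
};
```

Note the two distinct "too late" answers: 400 when the lookup already sees a non-pending row, 409 when the row changes between the lookup and the claim.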

Drainer — ns_pending_pushes → NetSuite

| stage | state transition | side effects |
|---|---|---|
| 1. Approval enqueues | proposed_actions: pending → approved · ns_pending_pushes INSERT (status='queued') | decision_corpus row written |
| 2. Drainer picks up | ns_pending_pushes: queued → picking (picked_at set) | proposed_actions still 'approved'; transition to 'pushing' is implicit via push status |
| 3. NS push (via NS_PUSH_QUEUE / CF Queue consumer) | ns_pending_pushes: picking → sent (sent_at set) | NS RESTlet write; OAuth1 TBA |
| 4a. NS write confirmed | ns_pending_pushes: sent → applied · proposed_actions: approved → applied | + proposed_actions_applied_mirror INSERT (audit) |
| 4b. NS error after retries | ns_pending_pushes: sent → failed · proposed_actions: approved → failed | last_error stored; DLQ row if configured |
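The drainer table describes two coupled status columns: the push row moves through its own states, and the parent proposed_actions row only changes at the end. A minimal sketch of that coupling follows; `PushStatus`, `ParentStatus`, `advance`, and `parentStatusFor` are hypothetical names chosen for illustration.

```typescript
type PushStatus = "queued" | "picking" | "sent" | "applied" | "failed";
type ParentStatus = "approved" | "applied" | "failed";

// Legal next states for an ns_pending_pushes row, per the table above.
const NEXT: Record<PushStatus, readonly PushStatus[]> = {
  queued: ["picking"],
  picking: ["sent"],
  sent: ["applied", "failed"],
  applied: [],
  failed: [],
};

// Returns the new status, or throws if the drainer attempts an illegal jump.
function advance(from: PushStatus, to: PushStatus): PushStatus {
  if (NEXT[from].indexOf(to) === -1) {
    throw new Error(`illegal push transition ${from} -> ${to}`);
  }
  return to;
}

// The parent row stays 'approved' until the push reaches a terminal state.
function parentStatusFor(push: PushStatus): ParentStatus {
  if (push === "applied") return "applied";
  if (push === "failed") return "failed";
  return "approved";
}
```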

Stub note: POST /api/ns-push/drain (src/index.ts:25124) currently has a "dry-run by default" path; production NS RESTlet wiring is gated on TBA token + dedicated restlet build (per inline comment R294).

Mirror writes — who writes what when

| storage | row(s) | trigger |
|---|---|---|
| proposed_actions_applied_mirror | 1 row per applied action | NS write confirmed (transition to applied) |
| decision_corpus | 1 row per approval (pattern_rule) | approve endpoint (in the atomic batch) |
| reflexion_log | 1 row per workflow run | workflow_runner only (post_actions stage 7), NOT the decide endpoint |
| events | 1 row hitl.approved or hitl.rejected | decide endpoint after successful state transition |

Bulk decide — R532

POST /api/proposed-actions/bulk-decide (src/index.ts:14054) lets the operator apply one decision (approved/rejected) to many action_ids at once. Rate-limited to 30/min. Uses decided_by='api:bulk-decide' in the UPDATE.

Bulk approve enqueues one ns_pending_pushes row per action in a single batch. The race fix applies per-action via the same WHERE status='pending' guard; race losers in the bulk path return as { action_id, skipped: 'already_decided' } entries in the response.
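The per-action guard in the bulk path can be pictured as the same pending check applied to each id in turn. The sketch below is an in-memory illustration, not the endpoint's code: `bulkDecide`, the `BulkEntry` type, and the `decided` field on successful entries are hypothetical; only the `{ action_id, skipped: 'already_decided' }` shape comes from the text. Ids missing from the map are not handled separately here.

```typescript
type Decision = "approved" | "rejected";

type BulkEntry =
  | { action_id: number; decided: Decision }
  | { action_id: number; skipped: "already_decided" };

// statuses stands in for proposed_actions.status; pushQueue for ns_pending_pushes.
function bulkDecide(
  statuses: Map<number, string>,
  ids: number[],
  decision: Decision,
  pushQueue: number[],
): BulkEntry[] {
  return ids.map((id): BulkEntry => {
    // Same guard as the single decide path: only a pending row can be claimed.
    if (statuses.get(id) !== "pending") {
      return { action_id: id, skipped: "already_decided" };
    }
    statuses.set(id, decision);
    // One push row per approved action; rejections enqueue nothing.
    if (decision === "approved") pushQueue.push(id);
    return { action_id: id, decided: decision };
  });
}
```

Reporting losers as entries rather than failing the whole request lets an operator approve a large selection even when some rows were decided elsewhere in the meantime.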

HITL invariant — cited references

Standing rule from CLAUDE.md: Every write to NS or business state goes through the HITL queue. Mike is always the loop step. No exceptions for "low-risk" categories beyond L1-L2 auto-approve in the workflow runner. The decide endpoint is the only authorized state-mutation path; bulk-decide is the same logic batched.

ADR-031: Defines the proposed_actions schema + queue/drain pattern. See data/decisions.json.

Source files