HITL security invariant: the 4 gates between AI and NetSuite (audit story · ADR-031)

4 gates · 3 kill switches · R560 race-fix · ADR-031

This page is the audit story for the platform's security invariant: no NetSuite (NS) write happens unless it has passed all 4 gates. Gate 1 is the Cloudflare (CF) Access JWT, enforced at the edge. Gate 2 is the role palette (X-Role-Id → tool_role_palettes). Gate 3 is the X-Edit-Token required on every write endpoint. Gate 4 is the proposed_actions human-in-the-loop (HITL) queue (ADR-031). On top of the gates, 3 KV-flag kill switches (ns_writes, proposed_apply, high_risk_ops) can turn writes off in seconds. The R560 atomic-claim race fix is shown explicitly: the drainer claims each row with UPDATE...RETURNING, so two drainer instances cannot both claim the same row.

0 · Visual flow (7 lanes · 14 nodes)

System flow
Lanes:

  01 / Gate 1: Cloudflare Access JWT (every request)
  02 / Gate 2: Role gate (X-Role-Id → tool palette)
  03 / Gate 3: Edit token (X-Edit-Token on every write)
  04 / Gate 4: proposed_actions queue (Mike approves)
  05 / Kill switches (KV-flag flippable)
  06 / Drainer (claim → NS_PUSH_QUEUE → NS → status)
  07 / Failure path (DLQ + alert)

GATE 1: CF Access JWT
Every request to api.ai-globalfoodsolutions.co must carry a valid Cloudflare Access cookie/JWT. The policy is enforced at the CF edge before the request reaches Worker code. It is configured in the CF dashboard, not in source.
  enforcement: CF edge (pre-Worker)
  policy_source: CF dashboard
  applies_to: api.ai-globalfoodsolutions.co
  identity_provider: GFS Google Workspace
  ADR: docs/DOMAIN_SETUP.md
  node label: CF Access JWT · every request to api.ai-globalfoodsolutions.co · enforced at CF edge (zero code in Worker)

GATE 2: role gate
The X-Role-Id header maps to a tool palette via tool_role_palettes (R556). filterToolsForRole strips tools not in the palette BEFORE the LLM sees the catalog. This prevents accidental tool calls.
  enforcement: filterToolsForRole (pre-LLM)
  table: tool_role_palettes
  roles: 10 (admin, pricing, ar, bid, nutrition, production, ops, relationship, order_mgmt, all)
  tool_count: 175+ tools, 50+ gated
  node label: Role gate · X-Role-Id · 50+ tools each mapped to one or more roles · R556 · tool_role_palettes table

GATE 3: edit token
Every write endpoint validates X-Edit-Token against env.EDIT_TOKEN via checkEditToken. Read endpoints don't require it. A missing or wrong token returns 403.
  enforcement: checkEditToken(request, env)
  env_var: EDIT_TOKEN
  scope: every POST/PUT/DELETE write endpoint
  failure: 403 Forbidden
  node label: X-Edit-Token · required on every write endpoint · checkEditToken(request, env) · env.EDIT_TOKEN

GATE 4: proposed_actions
Every NS write is staged as a row in proposed_actions with status='pending'. Mike reviews in /proposed-actions.html and clicks Approve or Reject. Only on Approve does the write enter NS_PUSH_QUEUE.
  table: proposed_actions
  status_flow: pending → approved | rejected → applied | failed
  risk_tiers: L1-L5 (L1 auto-applies, L3+ requires Mike)
  surface: /proposed-actions.html
  ADR: 031
  node label: proposed_actions queue · every NS write goes through HITL · ADR-031 · risk-tiered L1-L5

GATE 4: R560 race fix
The drainer claims a proposed_action via an atomic UPDATE proposed_actions SET status='applying', claimed_by='drainer_N', claimed_at=now() WHERE action_id=? AND status='approved' RETURNING *. This prevents two drainer instances from racing on the same row.
  query: UPDATE...RETURNING
  guarantee: at-most-once apply per row
  trigger: codex audit CRITICAL #1
  date: R560
  node label: R560 atomic-claim race fix · UPDATE...RETURNING claim · codex CRITICAL #1 finding closed

GATE 4: approval UI
Static admin page where Mike sees pending proposed_actions, filters by risk + entity_type, bulk-approves low-risk batches, and rejects with a reason.
  surface: /proposed-actions.html
  authn: CF Access + X-Edit-Token
  actions: approve, reject, bulk-approve, edit
  node label: /proposed-actions.html · Mike's approval surface · bulk approve · risk filter · audit panel

KILL: ns_writes
KV flag. Set to 'off' to halt ALL NS writes platform-wide. The drainer checks this flag before every NS_PUSH_QUEUE consume.
  kv_key: kill.ns_writes
  values: 'on' | 'off'
  effect: drainer skips all NS pushes when off
  node label: ns_writes · master kill

KILL: proposed_apply
KV flag. Set to 'off' to halt the drainer from consuming approved rows (new rows can still be staged and approved).
  kv_key: kill.proposed_apply
  values: 'on' | 'off'
  effect: drainer pauses, queue grows
  node label: proposed_apply · block drainer

KILL: high_risk_ops
KV flag. Set to 'off' to halt the drain of risk_level >= 4 proposed_actions only. Lower-risk items still flow.
  kv_key: kill.high_risk_ops
  values: 'on' | 'off'
  effect: drainer skips risk >= 4 rows when off
  node label: high_risk_ops · block L4+

DRAINER: staging
Approved proposed_actions are translated into ns_pending_pushes rows with status='pending'. The NS payload is assembled here.
  table: ns_pending_pushes
  status: pending → claimed → applied | failed
  node label: ns_pending_pushes · D1 staging

DRAINER: queue
Cloudflare Queue. The producer is the drainer; the consumer is the NS-API caller.
  queue: NS_PUSH_QUEUE
  batch_size: 10
  retry: 3 then DLQ
  node label: NS_PUSH_QUEUE · CF Queue

DRAINER: consumer
Consumes from NS_PUSH_QUEUE and calls the customscript_gfs_platform_query custom RESTlet via TBA OAuth1. On success it flips ns_pending_pushes.status='applied'. On failure it increments retry_count.
  endpoint: customscript_gfs_platform_query
  auth: TBA OAuth1
  success: status='applied'
  failure: retry_count++ then DLQ
  node label: NS API consumer · TBA + RESTlet

FAILURE: DLQ
Failed writes (after 3 retries) are routed to NS_PUSH_QUEUE_DLQ. Manual recovery is via /admin-dashboard.html.
  queue: NS_PUSH_QUEUE_DLQ
  trigger: retry_count >= 3
  recovery: /admin-dashboard.html
  node label: DLQ · NS_PUSH_QUEUE_DLQ · 3 retries exceeded

FAILURE: alert
A DLQ entry fires a workflow.failed event into the ledger; the observability stack drafts an email to Mike.
  event: workflow.failed
  channel: email to Mike
  node label: alert · workflow.failed event + email to Mike · observability stack
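The failure path at the end of the flow above (retry, then DLQ plus a workflow.failed event) can be sketched as a small pure function. This is only an illustration under stated assumptions: the type names, field names, and handlePushResult itself are hypothetical, and only the retry limit of 3, the retry_count >= 3 trigger, the queue name NS_PUSH_QUEUE_DLQ, and the workflow.failed event come from the flow. The source says both "after 3 retries" and "retry_count >= 3"; the sketch follows the second reading, so the third failed attempt is dead-lettered.

```typescript
// Hypothetical shapes for illustration; not the platform's actual types.
type PushStatus = "pending" | "claimed" | "applied" | "failed";

interface PendingPush {
  pushId: string;
  status: PushStatus;
  retryCount: number;
}

type Outcome =
  | { kind: "applied" }
  | { kind: "retry"; retryCount: number }
  | { kind: "dead_letter"; queue: "NS_PUSH_QUEUE_DLQ"; event: "workflow.failed" };

// From the flow: "retry: 3 then DLQ" and "trigger: retry_count >= 3".
const MAX_RETRIES = 3;

// Decide what happens to one ns_pending_pushes row after an NS call.
function handlePushResult(push: PendingPush, nsCallSucceeded: boolean): Outcome {
  if (nsCallSucceeded) {
    push.status = "applied";
    return { kind: "applied" };
  }
  push.retryCount += 1;
  if (push.retryCount >= MAX_RETRIES) {
    // Retries exhausted: route to the DLQ and emit the alert event.
    push.status = "failed";
    return { kind: "dead_letter", queue: "NS_PUSH_QUEUE_DLQ", event: "workflow.failed" };
  }
  return { kind: "retry", retryCount: push.retryCount };
}
```

In a deployed Worker the retry count would more likely be tracked by the queue runtime than by hand; the function only makes the decision rule explicit.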

1 · The 4 gates (no NS write happens without all 4)

Gate | Where enforced | Failure mode
1 · CF Access JWT | Cloudflare edge (pre-Worker) | 302 redirect to CF Access login
2 · X-Role-Id role gate | filterToolsForRole (pre-LLM) | Tool not in catalog → LLM cannot call
3 · X-Edit-Token | checkEditToken on every write endpoint | 403 Forbidden
4 · proposed_actions queue | executeWorkflowContract stages row · drainer claims atomically (R560) | Pending forever until Mike approves
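Gates 2 and 3 are the two gates enforced in Worker code, so a short sketch may help show their shape. Everything here is an assumption for illustration: the function names carry a Sketch suffix to mark that they are not the platform's filterToolsForRole or checkEditToken, PaletteRow guesses at the columns of tool_role_palettes, and the fail-closed handling of a missing role and the constant-time token comparison are choices the source does not state.

```typescript
// Hypothetical shapes; PaletteRow stands in for one row of tool_role_palettes.
interface Tool { name: string }
interface PaletteRow { roleId: string; toolName: string }

// Gate 2 sketch: keep only the tools the caller's role is mapped to,
// so the LLM never sees the rest of the catalog.
// Assumption: a missing or unknown role fails closed and sees no tools.
function filterToolsForRoleSketch(
  tools: Tool[],
  roleId: string | undefined,
  palette: PaletteRow[]
): Tool[] {
  if (!roleId) return [];
  return tools.filter((tool) =>
    palette.some((row) => row.roleId === roleId && row.toolName === tool.name)
  );
}

// Assumption: tokens are compared in constant time to avoid timing leaks.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

// Gate 3 sketch: returns 403 to reject, or null to let the request continue.
function checkEditTokenSketch(
  method: string,
  presentedToken: string | undefined,
  editToken: string
): 403 | null {
  const isWrite = ["POST", "PUT", "DELETE"].indexOf(method.toUpperCase()) !== -1;
  if (!isWrite) return null; // read endpoints do not require the token
  if (!editToken || !presentedToken) return 403; // unset secret or missing header
  return constantTimeEqual(presentedToken, editToken) ? null : 403;
}
```

Rejecting when the configured secret is empty is deliberate in this sketch: a misconfigured environment should block writes rather than accept an empty header.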

2 · Kill switches (KV-flag flippable in seconds)

Switch | KV key | Effect
ns_writes | kill.ns_writes | Halts ALL NS writes platform-wide
proposed_apply | kill.proposed_apply | Halts the drainer; queue continues to grow
high_risk_ops | kill.high_risk_ops | Halts only risk_level ≥ 4 drains
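How the three flags combine for a single row can be sketched as one decision function. The KV keys, the 'on' | 'off' values, and the risk_level >= 4 threshold come from the table above; the type names, mayDrain, and the order in which the flags are checked are assumptions. Reading the flags from KV is left out, so the sketch takes them as a plain object.

```typescript
type KillKey = "kill.ns_writes" | "kill.proposed_apply" | "kill.high_risk_ops";
type FlagValue = "on" | "off";
type KillFlags = Record<KillKey, FlagValue>;

interface DrainDecision {
  allowed: boolean;
  blockedBy: KillKey | null;
}

// Decide whether the drainer may process one approved row.
// Assumption: flags are checked from broadest to narrowest.
function mayDrain(flags: KillFlags, riskLevel: number): DrainDecision {
  if (flags["kill.ns_writes"] === "off") {
    return { allowed: false, blockedBy: "kill.ns_writes" };
  }
  if (flags["kill.proposed_apply"] === "off") {
    return { allowed: false, blockedBy: "kill.proposed_apply" };
  }
  if (flags["kill.high_risk_ops"] === "off" && riskLevel >= 4) {
    return { allowed: false, blockedBy: "kill.high_risk_ops" };
  }
  return { allowed: true, blockedBy: null };
}
```

Note the polarity: 'off' means the capability is switched off, so a flag set to 'off' blocks the drain.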

3 · The drainer chain

After Mike approves in /proposed-actions.html, the row flows:

  1. proposed_actions.status → 'approved'
  2. Drainer atomic claim: UPDATE proposed_actions SET status='applying', claimed_by=?, claimed_at=now() WHERE action_id=? AND status='approved' RETURNING * (R560 race fix)
  3. Row translated into ns_pending_pushes with NS payload
  4. Pushed to NS_PUSH_QUEUE (Cloudflare Queue, batch=10)
  5. Consumer calls customscript_gfs_platform_query RESTlet via TBA OAuth1
  6. On success: status='applied' on both rows (proposed_actions and ns_pending_pushes)
  7. On failure: retry_count++ up to 3, then DLQ + workflow.failed event + alert email to Mike
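The atomic claim in step 2 works because the status check and the status change happen in one statement: only a caller that finds the row still 'approved' gets it back. The sketch below is a single-threaded, in-memory model of that compare-and-set, not the drainer's code. The ProposedAction shape and claimAction are hypothetical, and in production the guarantee comes from the database executing the UPDATE atomically rather than from this loop. One observation that the source does not make: SQLite, which D1 is built on, has no now() function, so the timestamp is passed in here rather than copied from the query text.

```typescript
// Hypothetical row shape for illustration.
type ActionStatus = "pending" | "approved" | "rejected" | "applying" | "applied" | "failed";

interface ProposedAction {
  actionId: string;
  status: ActionStatus;
  claimedBy: string | null;
  claimedAt: string | null;
}

// In-memory model of:
//   UPDATE proposed_actions SET status='applying', claimed_by=?, claimed_at=...
//   WHERE action_id=? AND status='approved' RETURNING *
// Returns the claimed row, or null when no row matched (the RETURNING set is empty).
function claimAction(
  table: ProposedAction[],
  actionId: string,
  drainerId: string,
  claimedAt: string
): ProposedAction | null {
  for (const row of table) {
    if (row.actionId === actionId && row.status === "approved") {
      row.status = "applying";
      row.claimedBy = drainerId;
      row.claimedAt = claimedAt;
      return row;
    }
  }
  return null;
}
```

A second drainer that issues the same claim finds the row already in 'applying', matches nothing, and receives null, so it must skip the row instead of pushing it again.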

4 · How to read it

Color | Meaning
frontend | User-facing surface (chat UI, admin HTML pages)
backend | Worker logic / agent code / business rules
database | D1 table / R2 object / KV key / Vectorize index
cloud | External system (NetSuite, Anthropic, etc.)
security | Gate / policy / HITL approval / kill switch
messagebus | Event ledger, Queues, async fan-out
external | Inbound source (email, webhook, cron tick, user input)
→ solid | Synchronous call (request → response)
→ green | Approved / happy-path
→ red dashed | Policy or security check
→ grey dashed | Optional / conditional / async