What this is

The fifth pillar (alongside SO / PO / WO / Bid Center)

The Data Tagger turns inbound semi-structured documents into NS-ready records. Mike (or anyone with the visual tagger UI at /data-tagger.html) uploads a sample document, draws boxes around interesting regions, assigns each to a target NetSuite field, picks one of 9 extraction strategies per field, and saves the template. From that point forward any future document of that (customer x doc_type x ns_record_type) auto-extracts using the saved template.

Three worked use cases: Path 1 customer PO -> SO (the first deployed reference, Driscoll Foods); Path 2 vendor COA -> compliance; Path 3 bid RFP -> pipeline (which bridges into the Bid Center pillar).

Diagram: ns-data-tagger-master.html. Live tool: /data-tagger.html (Agent BB-2 owns the visual UI; Agent BB-3 owns the chat tools; Agent BB-1 owns migration 142 with the 9 strategy schemas + templates + extractions tables).

When to use it

Trigger conditions

A customer or vendor emails a structured document (PO, COA, RFP) and we want auto-extraction.
Mike wants to onboard a new document type for a customer that doesn't have a template yet (train mode).
Inbound email lands and an existing template applies (apply mode · the steady-state path).
A chat-driven extraction: a power user pastes a PDF into chat and asks the agent to tag/extract.

Two operator modes

Visual at /data-tagger.html (drag boxes onto rendered PDF) · Chat-driven via data_tagger_train, data_tagger_apply, data_tagger_save_template tools.

HITL invariant

ADR-031 holds: every NS write goes through proposed_actions with Mike's approval. Overall confidence above 0.85 auto-stages a draft; 0.85 or below surfaces a review-first card.

Worked example

Driscoll Foods PO → SO (the first deployed use case)

Scenario

Driscoll Foods (purchasing@driscoll-foods.com) sends a PO PDF (PO_8801772.pdf) to orders@ai-globalfoodsolutions.co. src/email.ts logs the email and saves the PDF to R2, and document_converter.ts parses it to markdown.

Sender domain match resolves customer_id = 478 (Driscoll Foods). doc_type classifier returns po_inbound. ns_record_type maps to SalesOrd. Template lookup finds tpl_driscoll_po_so_v3 with 8 field tags (47 prior hits, 95.7% success).
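The identification step above can be sketched as a pair of lookups. This is a minimal illustration under stated assumptions, not the production code: the in-memory maps, `resolveThreadKey`, `findTemplate`, and the `/`-joined key format are hypothetical. Only the three key fields, the sample values, and the template id come from this page.

```typescript
type ThreadKey = { customer_id: number; doc_type: string; ns_record_type: string };

// Hypothetical stand-ins for customers.email_domain and data_tagger_doc_types.
const customerByDomain = new Map<string, number>([["driscoll-foods.com", 478]]);
const recordTypeByDocType = new Map<string, string>([["po_inbound", "SalesOrd"]]);

// Hypothetical stand-in for the active rows of data_tagger_templates.
const templateByKey = new Map<string, string>([
  ["478/po_inbound/SalesOrd", "tpl_driscoll_po_so_v3"],
]);

// Resolve the 3-key thread from the sender address and the classified doc_type.
// Returns null when the domain or doc_type is unknown (the HITL "new customer" case).
function resolveThreadKey(fromAddr: string, docType: string): ThreadKey | null {
  const domain = fromAddr.split("@").pop()?.toLowerCase() ?? "";
  const customer_id = customerByDomain.get(domain);
  const ns_record_type = recordTypeByDocType.get(docType);
  if (customer_id === undefined || ns_record_type === undefined) return null;
  return { customer_id, doc_type: docType, ns_record_type };
}

// Look up the template for a thread key; null means train mode is needed.
function findTemplate(key: ThreadKey): string | null {
  return templateByKey.get(`${key.customer_id}/${key.doc_type}/${key.ns_record_type}`) ?? null;
}
```

A null from `resolveThreadKey` corresponds to the unknown-sender failure mode, and a null from `findTemplate` corresponds to train mode.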

The 8 field tags run in parallel, using five distinct strategies: regex_after_label captures P.O. # 8801772; literal_constant locks entity = Driscoll Foods; multi_line_span grabs the ship address; regex_after_label grabs the delivery date; whole_section captures memo notes; three table_with_headers tags walk the Item # / Qty / Price columns. Overall weighted confidence: 0.92.
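How the per-field confidences roll up into one overall number is not spelled out on this page. The sketch below shows one plausible reading, a weighted mean, and assumes an optional per-field `weight` that defaults to 1. The `weight` field and `stageFor` are hypothetical names; the 0.85 threshold and the strict comparison follow the developer snippet further down.

```typescript
type FieldResult = {
  ns_field: string;
  value: string | null;
  confidence: number; // 0..1, reported by the strategy
  weight?: number;    // hypothetical: relative importance of the field, default 1
};

// Weighted mean of per-field confidences. An empty result set scores 0.
function weightedAvg(results: FieldResult[]): number {
  let weighted = 0;
  let total = 0;
  for (const r of results) {
    const w = r.weight ?? 1;
    weighted += w * r.confidence;
    total += w;
  }
  return total === 0 ? 0 : weighted / total;
}

const AUTO_DRAFT_THRESHOLD = 0.85;

// Strictly above the threshold auto-stages a draft; anything else is review-first.
function stageFor(overall: number): "auto_draft" | "review_first" {
  return overall > AUTO_DRAFT_THRESHOLD ? "auto_draft" : "review_first";
}
```

Under this reading a single weak field (such as a memo at 0.83) does not block auto-drafting when the overall score is high, which is why the review screen still needs to show per-field confidence.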

The overall 0.92 is above the 0.85 threshold, so a draft SO is auto-staged in proposed_actions. Mike opens admin-dashboard, sees the side-by-side PDF + extracted form, spot-checks the memo (its field confidence was 0.83), and approves. NS_PUSH_QUEUE drains: PushMutexDO locks per customer (here 478), then POST /api/ns/push/sales-order. The NS SO is created with internal_id 1842738 and otherrefnum = "8801772", the PO# trace thread that carries through Invoice and CashSale.

Reflexion fires: hit_count: 47 -> 48, success_count: 45 -> 46. events.so.created_from_po emits; customer_health watcher recomputes.
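The counter update and the success percentage quoted for this template can be reproduced with a small sketch. It assumes a miss is a hit that did not succeed (hit_count minus success_count), which this page does not state explicitly; `recordOutcome`, `successRate`, and `needsRetrain` are illustrative names, and the 0.2 cutoff is the one used in the runbook.

```typescript
type TemplateStats = { hit_count: number; success_count: number };

// Return new counters after one apply; the input object is not mutated.
function recordOutcome(stats: TemplateStats, success: boolean): TemplateStats {
  return {
    hit_count: stats.hit_count + 1,
    success_count: stats.success_count + (success ? 1 : 0),
  };
}

// Fraction of applies that succeeded; 0 when the template has never been used.
function successRate(stats: TemplateStats): number {
  return stats.hit_count === 0 ? 0 : stats.success_count / stats.hit_count;
}

// Assumption: miss_count = hit_count - success_count.
function needsRetrain(stats: TemplateStats, maxMissRate = 0.2): boolean {
  if (stats.hit_count === 0) return false;
  const missRate = (stats.hit_count - stats.success_count) / stats.hit_count;
  return missRate > maxMissRate;
}
```

With the figures from the scenario, 45 successes over 47 hits gives the 95.7% shown at lookup time, and the approved extraction moves the counters to 46 over 48.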

Step-by-step what happens

Intake → tag/train → apply/push

Step	Stage	What happens
01	Intake (3 channels)	UI upload at /data-tagger.html, automatic inbound email from one of 5 mailboxes, or chat upload. The PDF lands in R2.
02	Parse to markdown	document_converter.ts -> markdown + span coordinates.
03	Identify (3-key thread)	customer_id + doc_type + ns_record_type, e.g. Driscoll Foods / po_inbound / SalesOrd.
04	Lookup template or train new	If a template exists -> apply. If not -> the operator visually tags, picks strategies, and saves the template (versioned).
05	Apply 9 strategies (per field)	regex_after_label, regex_before_label, fixed_region, table_with_headers, multi_line_span, whole_section, formula, llm_with_schema, literal_constant.
06	Confidence + HITL stage	Weighted overall confidence; above 0.85 -> auto-draft; otherwise -> review-first. proposed_actions INSERT.
07	Operator approve	Side-by-side PDF + editable form. Approve / edit+approve / reject / reassign.
08	NS_PUSH_QUEUE writes	Routes per ns_record_type: SalesOrd push, vendor_coas insert, bid_external_pipeline insert. PushMutexDO per customer.
09	Reflexion + events	Template hit_count/success_count increment; events fire; subscribers react (customer_health, bid pipeline, compliance).
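Because step 05 names a closed set of nine strategies, a template can be checked against that set when it is saved in step 04. The following is a minimal sketch, assuming each field tag stores its strategy as a plain string; `STRATEGIES`, `Strategy`, `isStrategy`, and `invalidFieldTags` are illustrative names, while the nine values are the ones listed in the steps.

```typescript
// The nine strategy names from migration 142, as a readonly tuple.
const STRATEGIES = [
  "regex_after_label",
  "regex_before_label",
  "fixed_region",
  "table_with_headers",
  "multi_line_span",
  "whole_section",
  "formula",
  "llm_with_schema",
  "literal_constant",
] as const;

type Strategy = (typeof STRATEGIES)[number];

// Type guard: narrows an arbitrary string to one of the nine strategies.
function isStrategy(name: string): name is Strategy {
  return (STRATEGIES as readonly string[]).includes(name);
}

// Returns the ns_field names whose strategy is not one of the nine.
function invalidFieldTags(fieldTags: { ns_field: string; strategy: string }[]): string[] {
  return fieldTags.filter((t) => !isStrategy(t.strategy)).map((t) => t.ns_field);
}
```

Rejecting a template with a non-empty `invalidFieldTags` result at save time keeps typos out of the apply path, where they would otherwise only surface as failed extractions.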

Outcomes

What's different after the cycle

Metric	Value	Note
Strategies	9	migration 142
Use cases	3	PO/SO · COA · RFP
Reference	Driscoll	Path 1 deployed
Confidence	0.85	auto-draft threshold

Inbound POs auto-create NS Sales Orders with otherrefnum threading intact.
Vendor COAs feed the compliance log (pending the vendor_coas migration).
Bid RFPs auto-log into bid_external_pipeline bridging into Bid Center.
Per-customer templates get smarter over time via reflexion.

Failure modes

What can go wrong

Unknown sender domain

Email arrives from a domain not in customers.email_domain. Surfaces in HITL with NEW customer prompt; Mike resolves manually.

Template confidence below threshold

Some field extractions fail (e.g. PDF was scanned image rather than text). System falls back to review-first; Mike corrects, which feeds reflexion.

llm_with_schema cost blowup

A misconfigured template that uses llm_with_schema for every field could run up Workers AI cost on each apply. This needs CostCapDO integration, which is not wired to the apply path yet.
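Until a cost cap is wired in, one cheap safeguard is a save-time check on how much of a template relies on llm_with_schema. The sketch below is an assumption-laden illustration, not an existing feature: `llmShare`, `isLlmHeavy`, and the 0.25 default are hypothetical and would need tuning against real templates.

```typescript
type FieldTag = { ns_field: string; strategy: string };

// Fraction of a template's field tags that use llm_with_schema (0 for an empty template).
function llmShare(fieldTags: FieldTag[]): number {
  if (fieldTags.length === 0) return 0;
  const llmCount = fieldTags.filter((t) => t.strategy === "llm_with_schema").length;
  return llmCount / fieldTags.length;
}

// Flags templates where more than maxShare of the fields go through the LLM.
// The 0.25 default is an arbitrary placeholder.
function isLlmHeavy(fieldTags: FieldTag[], maxShare = 0.25): boolean {
  return llmShare(fieldTags) > maxShare;
}
```

This does not replace a runtime cap: it only catches the misconfiguration described above, not a sudden surge of inbound documents against a legitimately LLM-backed field.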

Path 2 vendor_coas table missing

Path 2 wiki documents the use case but the destination table needs its own migration before the path can actually write. This is the blocking gap for Path 2.

For developers

Code paths + invariants

Concern	Where
Visual UI	/data-tagger.html (Agent BB-2)
Chat tools	src/chat_tools/impls.ts data_tagger_* (Agent BB-3)
Document parser	src/document_converter.ts
Email pipeline	src/email.ts (5 mailboxes)
Migration	142_data_tagger.sql (Agent BB-1)
D1 tables	data_tagger_templates, data_tagger_extractions, data_tagger_template_corrections, data_tagger_doc_types, data_tagger_uploads
Durable Object	PushMutexDO (per customer)
NS RESTlets	customscript_gfs_platform_push_so (Path 1)
R2 buckets	gfs-data-tagger-samples, gfs-inbound-attachments

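// Trace thread invariant
type ThreadKey = { customer_id: number; doc_type: string; ns_record_type: string };

type FieldTag = { ns_field: string; strategy: string; pattern: string };
type FieldResult = { ns_field: string; value: unknown; confidence: number };

// Apply template.
// runStrategy, weightedAvg, autoStageDraft and stageReviewFirst are assumed to be defined elsewhere.
async function applyTemplate(template: { field_tags: FieldTag[] }, markdown: string): Promise<void> {
  const results: FieldResult[] = [];
  for (const tag of template.field_tags) {
    // Each field tag runs its own strategy against the parsed markdown
    const { value, confidence } = await runStrategy(tag.strategy, tag.pattern, markdown);
    results.push({ ns_field: tag.ns_field, value, confidence });
  }
  const overall = weightedAvg(results);
  // Strictly above 0.85 auto-stages a draft; everything else goes review-first
  if (overall > 0.85) autoStageDraft(results);
  else stageReviewFirst(results);
}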

Changelog

Dated trail

Date	Round	Change	Touched by
`2026-05-27`	`R598`	Data Tagger 5th pillar shipped — master + 3 path diagrams + 4 wikis. 9 extraction strategies documented. Path 1 (Driscoll PO/SO) deployed reference. Threading: customer + doc_type + ns_record_type.	Mike + Claude

Schema · data contract

The machine-readable spec

Master workflow_type · data_tagger_lifecycle · risk_level 3. Sub-contracts: data_tagger_po_to_so_path, data_tagger_coa_to_compliance_path, data_tagger_bid_rfp_path.

The 3-key trace thread

Record / table	Field carrying customer + doc_type + ns_record_type	Sample
`data_tagger_templates`	`customer_id + doc_type + ns_record_type` (thread origin)	`478 / po_inbound / SalesOrd`
`data_tagger_extractions`	`template_id + extraction_id`	`tpl_driscoll_po_so_v3 / ext_2026-05-27_a8f`
`proposed_actions`	`payload.template_id + payload.extraction_id`	same
`ns_pending_pushes`	`payload.customer_id + ns_record_type`	`478 / SalesOrd`
NS `SalesOrd`	`otherrefnum` (customer PO#)	`8801772` (Path 1 secondary thread)

The 9 extraction strategies (migration 142)

Strategy	Purpose
`regex_after_label`	find label text, capture text after (e.g. "P.O. #")
`regex_before_label`	find label text, capture text before
`fixed_region`	capture whatever sits at a fixed x,y,w,h rectangle on the page
`table_with_headers`	locate table by header row, walk column
`multi_line_span`	span starts at anchor and runs N lines
`whole_section`	everything between two anchors
`formula`	compute from prior extracted values (e.g. `qty * rate`)
`llm_with_schema`	last resort, expensive, schema-constrained Workers AI call
`literal_constant`	return a constant stored in the template (trusted, not read from the document)
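Two of the strategies in the table are simple enough to sketch directly. The code below is an illustration only: it assumes document_converter.ts emits tables as markdown pipe tables and that the document contains a single table, and it uses a placeholder confidence of 1 or 0. The function names and the sample line items in the usage note are made up for the example.

```typescript
type Extraction = { value: string | null; confidence: number };

// Escape regex metacharacters so a label like "P.O. #" is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// regex_after_label: find the label, capture the first token after it on the same line.
function regexAfterLabel(label: string, markdown: string): Extraction {
  const re = new RegExp(escapeRegExp(label) + "[ \\t]*:?[ \\t]*(\\S+)");
  const value = re.exec(markdown)?.[1] ?? null;
  return { value, confidence: value === null ? 0 : 1 }; // placeholder confidence
}

// table_with_headers: locate the column by its header cell, then walk the data rows.
function tableColumn(header: string, markdown: string): string[] {
  const rows = markdown
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("|"))
    .map((line) => line.replace(/^\||\|$/g, "").split("|").map((cell) => cell.trim()));
  const head = rows[0];
  if (!head) return [];
  const col = head.indexOf(header);
  if (col === -1) return [];
  return rows
    .slice(1)
    .filter((r) => !r.every((cell) => /^:?-+:?$/.test(cell))) // skip the |---|---| separator row
    .map((r) => r[col] ?? "");
}
```

For example, `regexAfterLabel("P.O. #", "P.O. # 8801772")` yields the value `"8801772"`, and `tableColumn("Qty", table)` on a table with header `| Item # | Qty | Price |` and rows `| A100 | 12 | 4.50 |` and `| B200 | 3 | 19.00 |` yields `["12", "3"]`.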

D1 tables (migration 142 - Agent BB-1)

Table	Purpose
`data_tagger_templates`	per-(customer, doc_type, ns_record_type) template; versioned
`data_tagger_extractions`	one row per inbound doc processed
`data_tagger_template_corrections`	operator edit log for reflexion
`data_tagger_doc_types`	doc_type -> ns_record_type mapping
`data_tagger_uploads`	raw upload audit

Endpoints

Method	Path	Purpose
`POST`	`/api/data-tagger/upload`	UI upload
`POST`	`/api/data-tagger/train`	save tagged template
`POST`	`/api/data-tagger/apply`	run template against inbound doc
`POST`	`/api/proposed-actions/decide`	approve / reject extraction
`POST`	`/api/ns/push/sales-order`	NS SO write-back (Path 1)

Events fired

event_type	When
`data_tagger.extracted_to_ns`	every successful apply
`data_tagger.template_used`	every apply
`data_tagger.template_corrected`	operator edited before approve
`so.created_from_po`	Path 1 NS SO write success
`coa.received`	Path 2 vendor_coas insert success
`bid.rfp_logged`	Path 3 bid_external_pipeline insert success

Runbook · when it breaks

It broke - what now

Scenario · Driscoll PO extracted with wrong PO#

Mike opens proposed_action, sees otherrefnum is wrong (regex caught a different number on the page).

Edit the field in the side-by-side review, click Approve - correction logs to data_tagger_template_corrections
Inspect template: SELECT field_tags FROM data_tagger_templates WHERE template_id='tpl_driscoll_po_so_v3'
If frequent (miss_count/hit_count > 0.2): train a new version with tighter regex or switch strategy to fixed_region
Bump version: prior gets status='superseded', new gets status='active'
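The version bump in the last step can be sketched on in-memory rows. This assumes a numeric `version` column alongside the `status` values named above; the column name, `bumpVersion`, and the example id `tpl_driscoll_po_so_v4` are hypothetical.

```typescript
type TemplateRow = {
  template_id: string;
  version: number; // hypothetical column; the page only says templates are versioned
  status: "active" | "superseded";
};

// Supersede the currently active row (if any) and append the new version as active.
// Returns a new array; the input rows are left untouched.
function bumpVersion(rows: TemplateRow[], newTemplateId: string): TemplateRow[] {
  const current = rows.find((r) => r.status === "active");
  const nextVersion = (current?.version ?? 0) + 1;
  const updated = rows.map((r) =>
    r.status === "active" ? { ...r, status: "superseded" as const } : r
  );
  return [...updated, { template_id: newTemplateId, version: nextVersion, status: "active" }];
}
```

Against the real table, the two writes would ideally be applied together (for instance in one D1 batch) so that a thread never has zero or two active templates mid-update.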

Scenario · A vendor COA never staged (Path 2)

Vendor says they sent the COA; nothing in proposed_actions.

Check vendor_coas table exists - currently TBD; Path 2 is blocked until table lands
Check inbound_email_log: SELECT * FROM inbound_email_log WHERE mailbox='vendors@' AND from_addr LIKE '%vendor.com%'
Check classifier: did doc_type resolve to coa? Look at data_tagger_uploads.classified_as
Run manually: POST /api/data-tagger/apply with the R2 key

Scenario · llm_with_schema cost spike

Cost dashboard shows unusual Workers AI spend.

Identify culprit template: SELECT template_id, COUNT(*) FROM data_tagger_extractions WHERE created_at > ... GROUP BY template_id
Check strategy mix: any template overusing llm_with_schema?
Add CostCapDO guard to /api/data-tagger/apply
Retrain with cheaper strategies where possible

Logs to check

data_tagger_extractions · per-extraction confidence + field values
data_tagger_template_corrections · reflexion source data
inbound_email_log · intake audit
proposed_actions · HITL queue (kind=data_tagger_extraction)
events · data_tagger.*
npx wrangler tail · live Worker logs

Kill switches

kill:data_tagger_apply · stops auto-extraction
kill:ns_writes · stops every NS push (incl. Data Tagger outputs)
kill:proposed_apply · stops HITL approvals from executing fan-out

Backlog · open questions

What's not done · what's uncertain

Status	Item	Notes
STUB	Migration 142 (data_tagger_*)	Agent BB-1 owns. Templates / extractions / corrections / doc_types / uploads tables not yet landed. Path 1 contract is documented; the runtime is stubbed.
STUB	/data-tagger.html visual UI	Agent BB-2 owns. Drag-rect overlay + NS schema autocomplete + side-panel md preview not yet shipped.
STUB	data_tagger_* chat tools	Agent BB-3 owns. data_tagger_train, data_tagger_apply, data_tagger_save_template not yet registered.
STUB	Path 2 vendor_coas table	NOT YET CREATED. Path 2 cannot write until its own migration (proposed mig 144) lands. This is the explicit open TBD called out in the brief.
OPEN	Confidence threshold tuning	0.85 is a guess. First-month telemetry is needed for an evidence-based threshold.
OPEN	llm_with_schema cost cap integration	CostCapDO is not wired to the data tagger apply path yet. Needs a guard against template misconfiguration.
DECISION	Template inheritance scope	Should templates inherit across similar customers (e.g. all NYC schools share an RFP template)? Path 3 hints at yes.
DECISION	item_lots tracking design (Path 2)	Lots column on items vs separate item_lots table. Affects Path 2 step 12.