The fifth pillar (alongside SO / PO / WO / Bid Center)
The Data Tagger turns inbound semi-structured documents into NS-ready records. Mike (or anyone with the visual tagger UI at /data-tagger.html) uploads a sample document, draws boxes around interesting regions, assigns each to a target NetSuite field, picks one of 9 extraction strategies per field, and saves the template. From that point forward any future document of that (customer x doc_type x ns_record_type) auto-extracts using the saved template.
Three worked use cases: Path 1 customer PO -> SO (the first deployed reference, Driscoll Foods); Path 2 vendor COA -> compliance; Path 3 bid RFP -> pipeline (which bridges into the Bid Center pillar).
Diagram: ns-data-tagger-master.html. Live tool: /data-tagger.html (Agent BB-2 owns the visual UI; Agent BB-3 owns the chat tools; Agent BB-1 owns migration 142 with the 9 strategy schemas + templates + extractions tables).
Trigger conditions
- A customer or vendor emails a structured document (PO, COA, RFP) and we want auto-extraction.
- Mike wants to onboard a new document type for a customer that doesn't have a template yet (train mode).
- Inbound email lands and an existing template applies (apply mode · the steady-state path).
- A chat-driven extraction: power-user pastes a PDF into chat and asks the agent to tag/extract.
Visual at /data-tagger.html (drag boxes onto rendered PDF) · Chat-driven via data_tagger_train, data_tagger_apply, data_tagger_save_template tools.
ADR-031 holds: every NS write goes through proposed_actions with Mike approval. Confidence above 0.85 auto-stages a draft; below 0.85 surfaces a review-first card.
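The threshold rule above can be sketched as a small routing function. This is a minimal illustration, not the shipped code: the names `stageMode` and `AUTO_DRAFT_THRESHOLD` are hypothetical, and treating a score of exactly 0.85 as auto-draft is an assumption, since the text only says "above" and "below".

```typescript
// Threshold taken from the text; the document itself flags 0.85 as provisional.
const AUTO_DRAFT_THRESHOLD = 0.85;

type StageMode = "auto_draft" | "review_first";

// Assumption: a score of exactly 0.85 counts as auto-draft.
function stageMode(overallConfidence: number): StageMode {
  return overallConfidence >= AUTO_DRAFT_THRESHOLD ? "auto_draft" : "review_first";
}

// Either outcome still lands in proposed_actions and waits for approval (ADR-031);
// only the card the operator sees differs.
console.log(stageMode(0.92), stageMode(0.83));
```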
Driscoll Foods PO → SO (the first deployed use case)
Driscoll Foods (purchasing@driscoll-foods.com) sends a PO PDF (PO_8801772.pdf) to orders@ai-globalfoodsolutions.co. src/email.ts logs the email, saves the PDF to R2, and document_converter.ts parses it to markdown.
Sender domain match resolves customer_id = 478 (Driscoll Foods). doc_type classifier returns po_inbound. ns_record_type maps to SalesOrd. Template lookup finds tpl_driscoll_po_so_v3 with 8 field tags (47 prior hits, 95.7% success).
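The identify-and-lookup step might look roughly like the sketch below. It uses in-memory arrays as hypothetical stand-ins for the customers and data_tagger_templates tables; the field names (`emailDomain`, `docType`, and so on) and both function names are assumptions for illustration, not the real schema.

```typescript
interface Customer { customerId: number; emailDomain: string }
interface Template { templateId: string; customerId: number; docType: string; nsRecordType: string }

// Hypothetical stand-ins for the customers and data_tagger_templates tables.
const customers: Customer[] = [{ customerId: 478, emailDomain: "driscoll-foods.com" }];
const templates: Template[] = [
  { templateId: "tpl_driscoll_po_so_v3", customerId: 478, docType: "po_inbound", nsRecordType: "SalesOrd" },
];

// Match the sender's domain against known customer domains; null means "unknown sender".
function resolveCustomerId(fromAddr: string): number | null {
  const at = fromAddr.lastIndexOf("@");
  if (at < 0) return null;
  const domain = fromAddr.slice(at + 1).trim().toLowerCase();
  const hit = customers.find((c) => c.emailDomain === domain);
  return hit ? hit.customerId : null;
}

// Look up a template by the 3-key thread: customer_id + doc_type + ns_record_type.
function findTemplate(customerId: number, docType: string, nsRecordType: string): Template | null {
  return (
    templates.find(
      (t) => t.customerId === customerId && t.docType === docType && t.nsRecordType === nsRecordType,
    ) ?? null
  );
}
```

A null from `resolveCustomerId` corresponds to the unknown-domain case that surfaces in HITL; a null from `findTemplate` corresponds to train mode.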
The 8 field tags run in parallel, using five of the nine strategies: regex_after_label captures P.O. # 8801772; literal_constant locks entity = Driscoll Foods; multi_line_span grabs the ship address; regex_after_label grabs the delivery date; whole_section captures memo notes; three table_with_headers walk Item # / Qty / Price columns. Overall weighted confidence: 0.92.
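The text does not say how the overall confidence is weighted, so the following is only one plausible reading: a weighted average of per-field confidences. The `FieldResult` shape, the per-field weights, and the function name are all assumptions.

```typescript
// Hypothetical per-field result; the real extraction row shape is not shown in this page.
interface FieldResult { field: string; confidence: number; weight: number }

// Weighted average of per-field confidences; returns 0 when there is nothing to weigh.
function weightedConfidence(results: FieldResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  const weighted = results.reduce((sum, r) => sum + r.confidence * r.weight, 0);
  return weighted / totalWeight;
}

// Illustrative numbers only: a heavily weighted PO# and a lightly weighted memo.
console.log(
  weightedConfidence([
    { field: "otherrefnum", confidence: 1.0, weight: 2 },
    { field: "memo", confidence: 0.83, weight: 1 },
  ]),
);
```

Weighting lets a low-stakes field such as the memo dip below the threshold without dragging the whole document into review-first.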
Above the 0.85 threshold, so a draft SO is auto-staged in proposed_actions. Mike opens admin-dashboard, sees the side-by-side PDF + extracted form, spot-checks the memo (which was at 0.83), approves. NS_PUSH_QUEUE drains: PushMutexDO per customer 478, POST /api/ns/push/sales-order. NS SO created with internal_id 1842738 and otherrefnum = "8801772" — the PO# trace thread that carries through Invoice and CashSale.
Reflexion fires: hit_count: 47 -> 48, success_count: 45 -> 46. events.so.created_from_po emits; customer_health watcher recomputes.
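The counter update can be sketched as below. The `TemplateStats` shape and function names are hypothetical, and what exactly counts as a "success" (for example, approval without correction) is an assumption the text does not settle.

```typescript
// Hypothetical view of the two counters kept on a template row.
interface TemplateStats { hitCount: number; successCount: number }

// Every apply bumps hit_count; only successful ones bump success_count.
function recordOutcome(stats: TemplateStats, success: boolean): TemplateStats {
  return {
    hitCount: stats.hitCount + 1,
    successCount: stats.successCount + (success ? 1 : 0),
  };
}

// Success rate as a fraction; 0 for a template that has never been used.
function successRate(stats: TemplateStats): number {
  return stats.hitCount === 0 ? 0 : stats.successCount / stats.hitCount;
}

// The Driscoll template before this run: 47 hits, 45 successes (the 95.7% quoted above).
const afterRun = recordOutcome({ hitCount: 47, successCount: 45 }, true);
console.log(afterRun, successRate(afterRun));
```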
Intake → tag/train → apply/push
- 01 · Intake (3 channels): UI upload at /data-tagger.html, inbound email auto from one of 5 mailboxes, or chat upload. PDF lands in R2.
- 02 · Parse to markdown: document_converter.ts -> markdown + span coordinates.
- 03 · Identify (3-key thread): customer_id + doc_type + ns_record_type, e.g. Driscoll Foods / po_inbound / SalesOrd.
- 04 · Lookup template or train new: If template exists -> apply. If not -> operator visually tags + picks strategies + saves template (versioned).
- 05 · Apply 9 strategies (per field): regex_after_label, regex_before_label, fixed_region, table_with_headers, multi_line_span, whole_section, formula, llm_with_schema, literal_constant.
- 06 · Confidence + HITL stage: Weighted overall; threshold 0.85 -> auto-draft; below -> review-first. proposed_actions INSERT.
- 07 · Operator approve: Side-by-side PDF + editable form. Approve / edit+approve / reject / reassign.
- 08 · NS_PUSH_QUEUE writes: Routes per ns_record_type: SalesOrd push, vendor_coas insert, bid_external_pipeline insert. PushMutexDO per customer.
- 09 · Reflexion + events: Template hit_count/success_count increment; events fire; subscribers react (customer_health, bid pipeline, compliance).
What's different after the cycle
- Inbound POs auto-create NS Sales Orders with otherrefnum threading intact.
- Vendor COAs feed compliance log (pending vendor_coas migration).
- Bid RFPs auto-log into bid_external_pipeline bridging into Bid Center.
- Per-customer templates get smarter over time via reflexion.
What can go wrong
Email arrives from a domain not in customers.email_domain. Surfaces in HITL with NEW customer prompt; Mike resolves manually.
Some field extractions fail (e.g. PDF was scanned image rather than text). System falls back to review-first; Mike corrects, which feeds reflexion.
A misconfigured template that uses llm_with_schema for all fields could burn cost. Needs CostCapDO integration.
The Path 2 wiki documents the use case, but the destination table (vendor_coas) needs its own migration before the path can actually write. This is the blocking gap for Path 2.
Adjacent flows + diagrams
Code paths + invariants
| Concern | Where |
|---|---|
| Visual UI | /data-tagger.html (Agent BB-2) |
| Chat tools | src/chat_tools/impls.ts data_tagger_* (Agent BB-3) |
| Document parser | src/document_converter.ts |
| Email pipeline | src/email.ts (5 mailboxes) |
| Migration | 142_data_tagger.sql (Agent BB-1) |
| D1 tables | data_tagger_templates, data_tagger_extractions, data_tagger_template_corrections, data_tagger_doc_types, data_tagger_uploads |
| Durable Object | PushMutexDO (per customer) |
| NS RESTlets | customscript_gfs_platform_push_so (Path 1) |
| R2 buckets | gfs-data-tagger-samples, gfs-inbound-attachments |
Dated trail
| Date | Round | Change | Touched by |
|---|---|---|---|
| 2026-05-27 | R598 | Data Tagger 5th pillar shipped: master + 3 path diagrams + 4 wikis. 9 extraction strategies documented. Path 1 (Driscoll PO/SO) deployed reference. Threading: customer + doc_type + ns_record_type. | Mike + Claude |
The machine-readable spec
Master workflow_type · data_tagger_lifecycle · risk_level 3. Sub-contracts: data_tagger_po_to_so_path, data_tagger_coa_to_compliance_path, data_tagger_bid_rfp_path.
The 3-key trace thread
| Record / table | Field carrying customer + doc_type + ns_record_type | Sample |
|---|---|---|
| data_tagger_templates | customer_id + doc_type + ns_record_type (thread origin) | 478 / po_inbound / SalesOrd |
| data_tagger_extractions | template_id + extraction_id | tpl_driscoll_po_so_v3 / ext_2026-05-27_a8f |
| proposed_actions | payload.template_id + payload.extraction_id | same |
| ns_pending_pushes | payload.customer_id + ns_record_type | 478 / SalesOrd |
| NS SalesOrd | otherrefnum (customer PO#) | 8801772 (Path 1 secondary thread) |
The 9 extraction strategies (migration 142)
| Strategy | Purpose |
|---|---|
| regex_after_label | find label text, capture text after (e.g. "P.O. #") |
| regex_before_label | find label text, capture text before |
| fixed_region | value always sits at the same x, y, w, h coordinates on the page |
| table_with_headers | locate table by header row, walk column |
| multi_line_span | span starts at anchor and runs N lines |
| whole_section | everything between two anchors |
| formula | compute from prior extracted values (e.g. qty * rate) |
| llm_with_schema | last resort, expensive, schema-constrained Workers AI call |
| literal_constant | just return the constant (trusted from outside) |
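To make two of these strategies concrete, here is a minimal sketch of regex_after_label and table_with_headers operating on the markdown that document_converter.ts produces. The function names, parameters, and the assumption that tables arrive as pipe-delimited markdown with the header as the first pipe row are all hypothetical; the migration 142 strategy schemas are not shown in this page.

```typescript
// Escape a label so characters such as "." match literally inside a RegExp.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// regex_after_label: find the label, then capture the value that follows it.
function regexAfterLabel(markdown: string, label: string, valuePattern: string): string | null {
  const re = new RegExp(escapeRegExp(label) + "\\s*[:\\-]?\\s*(" + valuePattern + ")");
  const match = re.exec(markdown);
  return match && match[1] !== undefined ? match[1] : null;
}

// table_with_headers: locate the column by header text, then walk it row by row.
// Assumes a single pipe-delimited table whose first pipe row is the header.
function tableColumn(markdown: string, header: string): string[] {
  const rows = markdown
    .split("\n")
    .filter((line) => line.trim().startsWith("|"))
    .map((line) => line.trim().replace(/^\||\|$/g, "").split("|").map((cell) => cell.trim()));
  const headerRow = rows[0];
  if (!headerRow) return [];
  const col = headerRow.indexOf(header);
  if (col < 0) return [];
  return rows
    .slice(1)
    .filter((cells) => !cells.every((c) => /^:?-+:?$/.test(c))) // drop the |---| separator row
    .map((cells) => cells[col] ?? "");
}
```

Both functions return "nothing found" (null or an empty array) rather than throwing, which lets a caller translate a miss into low confidence and a review-first card.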
D1 tables (migration 142 - Agent BB-1)
| Table | Purpose |
|---|---|
| data_tagger_templates | per-(customer, doc_type, ns_record_type) template; versioned |
| data_tagger_extractions | one row per inbound doc processed |
| data_tagger_template_corrections | operator edit log for reflexion |
| data_tagger_doc_types | doc_type -> ns_record_type mapping |
| data_tagger_uploads | raw upload audit |
Endpoints
| Method | Path | Purpose |
|---|---|---|
| POST | /api/data-tagger/upload | UI upload |
| POST | /api/data-tagger/train | save tagged template |
| POST | /api/data-tagger/apply | run template against inbound doc |
| POST | /api/proposed-actions/decide | approve / reject extraction |
| POST | /api/ns/push/sales-order | NS SO write-back (Path 1) |
Events fired
| event_type | When |
|---|---|
| data_tagger.extracted_to_ns | every successful apply |
| data_tagger.template_used | every apply |
| data_tagger.template_corrected | operator edited before approve |
| so.created_from_po | Path 1 NS SO write success |
| coa.received | Path 2 vendor_coas insert success |
| bid.rfp_logged | Path 3 bid_external_pipeline insert success |
It broke - what now
Scenario · Driscoll PO extracted with wrong PO#
Mike opens proposed_action, sees otherrefnum is wrong (regex caught a different number on the page).
- Edit the field in the side-by-side review, click Approve - correction logs to data_tagger_template_corrections
- Inspect template: SELECT field_tags FROM data_tagger_templates WHERE template_id='tpl_driscoll_po_so_v3'
- If frequent (miss_count/hit_count > 0.2): train a new version with tighter regex or switch strategy to fixed_region
- Bump version: prior gets status='superseded', new gets status='active'
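The retrain check and the version bump in the steps above can be sketched as follows. The `TemplateVersion` shape, both function names, and the example ID `tpl_driscoll_po_so_v4` are hypothetical; only the 0.2 ratio and the 'superseded' / 'active' statuses come from the runbook.

```typescript
type TemplateStatus = "active" | "superseded";
interface TemplateVersion { templateId: string; version: number; status: TemplateStatus }

// Ratio from the runbook step: retrain once misses exceed 20% of hits.
const RETRAIN_RATIO = 0.2;

function needsRetrain(missCount: number, hitCount: number): boolean {
  return hitCount > 0 && missCount / hitCount > RETRAIN_RATIO;
}

// Mark every active version as superseded, then append the new version as active.
function bumpVersion(versions: TemplateVersion[], newTemplateId: string): TemplateVersion[] {
  const nextVersion = versions.reduce((max, v) => Math.max(max, v.version), 0) + 1;
  const retired = versions.map(
    (v): TemplateVersion => (v.status === "active" ? { ...v, status: "superseded" } : v),
  );
  return [...retired, { templateId: newTemplateId, version: nextVersion, status: "active" }];
}
```

Returning a new array instead of mutating in place keeps the prior version available for audit, which matches the idea of superseding rather than deleting.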
Scenario · A vendor COA never staged (Path 2)
Vendor says they sent the COA; nothing in proposed_actions.
- Check vendor_coas table exists - currently TBD; Path 2 is blocked until table lands
- Check inbound_email_log: SELECT * FROM inbound_email_log WHERE mailbox='vendors@' AND from_addr LIKE '%vendor.com%'
- Check classifier: did doc_type resolve to coa? Look at data_tagger_uploads.classified_as
- Run manually: POST /api/data-tagger/apply with the R2 key
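For the manual run in the last step, a request could be assembled along these lines. The request body shape is not documented in this page, so the `r2_key` field name, the `ApplyRequest` type, and the helper name are assumptions; only the method and path come from the endpoint table.

```typescript
// Hypothetical description of the HTTP call; pass its fields to fetch() or an HTTP client.
interface ApplyRequest {
  method: "POST";
  path: string;
  headers: Record<string, string>;
  body: string;
}

function buildApplyRequest(r2Key: string): ApplyRequest {
  if (r2Key.trim() === "") throw new Error("r2Key is required");
  return {
    method: "POST",
    path: "/api/data-tagger/apply",
    headers: { "content-type": "application/json" },
    // Assumption: the endpoint reads the object key from an `r2_key` JSON field.
    body: JSON.stringify({ r2_key: r2Key }),
  };
}
```

Building the request as data first makes it easy to log exactly what was sent when a manual rerun also fails to stage.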
Scenario · llm_with_schema cost spike
Cost dashboard shows unusual Workers AI spend.
- Identify culprit template: SELECT template_id, COUNT(*) FROM data_tagger_extractions WHERE created_at > ... GROUP BY template_id
- Check strategy mix: any template overusing llm_with_schema?
- Add CostCapDO guard to /api/data-tagger/apply
- Retrain with cheaper strategies where possible
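The strategy-mix check could be automated with something like the sketch below. It assumes field_tags deserializes to a list of `{ field, strategy }` objects, which is a guess at the stored shape, and the `maxShare` cutoff is a caller-chosen number, not a documented limit.

```typescript
type Strategy =
  | "regex_after_label" | "regex_before_label" | "fixed_region"
  | "table_with_headers" | "multi_line_span" | "whole_section"
  | "formula" | "llm_with_schema" | "literal_constant";

// Hypothetical shape of one entry in a template's field_tags.
interface FieldTag { field: string; strategy: Strategy }

// Fraction of a template's fields that rely on the expensive LLM strategy.
function llmShare(tags: FieldTag[]): number {
  if (tags.length === 0) return 0;
  const llmCount = tags.filter((t) => t.strategy === "llm_with_schema").length;
  return llmCount / tags.length;
}

// Return the template IDs whose LLM share exceeds the caller's cutoff.
function flagOverusers(byTemplate: Record<string, FieldTag[]>, maxShare: number): string[] {
  return Object.keys(byTemplate).filter((id) => llmShare(byTemplate[id] ?? []) > maxShare);
}
```

Templates flagged here are the natural candidates for retraining with cheaper strategies, since llm_with_schema is described as the last resort.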
Logs to check
- data_tagger_extractions · per-extraction confidence + field values
- data_tagger_template_corrections · reflexion source data
- inbound_email_log · intake audit
- proposed_actions · HITL queue (kind=data_tagger_extraction)
- events · data_tagger.*
- npx wrangler tail · live Worker logs
Kill switches
- kill:data_tagger_apply · stops auto-extraction
- kill:ns_writes · stops every NS push (incl. Data Tagger outputs)
- kill:proposed_apply · stops HITL approvals from executing fan-out
What's not done · what's uncertain
- STUB · Migration 142 (data_tagger_*): Agent BB-1 owns. Templates / extractions / corrections / doc_types / uploads tables not yet landed. Path 1 contract documented; runtime stubs.
- STUB · /data-tagger.html visual UI: Agent BB-2 owns. Drag-rect overlay + NS schema autocomplete + side-panel md preview not yet shipped.
- STUB · data_tagger_* chat tools: Agent BB-3 owns. data_tagger_train, data_tagger_apply, data_tagger_save_template not yet registered.
- STUB · Path 2 vendor_coas table: NOT YET CREATED. Path 2 cannot write until its own migration (proposed mig 144) lands. This is the explicit open TBD called out in the brief.
- OPEN · Confidence threshold tuning: 0.85 is a guess. First-month telemetry needed for evidence-based threshold.
- OPEN · llm_with_schema cost cap integration: CostCapDO not wired to data tagger apply path yet. Needs guard against template misconfig.
- DECISION · Template inheritance scope: Should templates inherit across similar customers (e.g. all NYC schools share an RFP template)? Path 3 hints at yes.
- DECISION · item_lots tracking design (Path 2): Lots column on items vs separate item_lots table. Affects Path 2 step 12.