Wiki · Data Tagger master · R598 · 5th pillar

GFS Data Tagger — master

The 5th pillar (alongside SO / PO / WO / Bid Center). A platform-level layer between inbound documents and downstream actions: upload a sample PDF, parse to markdown, visually tag regions to NS fields, save a per-customer template, and future inbound docs auto-extract. Trace thread: customer + doc_type + ns_record_type.

5th pillar · 3-lane intake/tag/apply · 9 extraction strategies · migration 142

What this is

The fifth pillar (alongside SO / PO / WO / Bid Center)

The Data Tagger turns inbound semi-structured documents into NS-ready records. Mike (or anyone with access to the visual tagger UI at /data-tagger.html) uploads a sample document, draws boxes around the regions of interest, assigns each to a target NetSuite field, picks one of 9 extraction strategies per field, and saves the template. From that point forward, any future document with the same (customer x doc_type x ns_record_type) combination auto-extracts using the saved template.
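
The template lookup by (customer x doc_type x ns_record_type) could be sketched roughly as follows. This is a minimal illustration, not the production code: `threadKeyOf`, `TemplateRegistry`, and the in-memory `Map` are hypothetical stand-ins for whatever the real template store does.

```typescript
// Hypothetical sketch: resolve a saved template from the 3-key trace thread.
// The Map stands in for a database lookup; all names here are illustrative.
type ThreadKey = { customer_id: number; doc_type: string; ns_record_type: string };

type Template = { template_id: string; thread: ThreadKey };

// Join the three keys into one stable lookup string.
function threadKeyOf(k: ThreadKey): string {
  return [k.customer_id, k.doc_type, k.ns_record_type].join("|");
}

class TemplateRegistry {
  private byThread = new Map<string, Template>();

  save(t: Template): void {
    this.byThread.set(threadKeyOf(t.thread), t);
  }

  // Returns undefined when no template exists, i.e. the operator must train one.
  find(k: ThreadKey): Template | undefined {
    return this.byThread.get(threadKeyOf(k));
  }
}

const registry = new TemplateRegistry();
registry.save({
  template_id: "tpl_driscoll_po_so_v3",
  thread: { customer_id: 478, doc_type: "po_inbound", ns_record_type: "SalesOrd" },
});
```

A miss on any one of the three keys returns `undefined`, which corresponds to the "train new" branch rather than a partial match.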

Three worked use cases: Path 1 customer PO -> SO (the first deployed reference, Driscoll Foods); Path 2 vendor COA -> compliance; Path 3 bid RFP -> pipeline (which bridges into the Bid Center pillar).

Diagram: ns-data-tagger-master.html. Live tool: /data-tagger.html (Agent BB-2 owns the visual UI; Agent BB-3 owns the chat tools; Agent BB-1 owns migration 142 with the 9 strategy schemas + templates + extractions tables).

When to use it

Trigger conditions

Two operator modes

Visual at /data-tagger.html (drag boxes onto rendered PDF) · Chat-driven via data_tagger_train, data_tagger_apply, data_tagger_save_template tools.

HITL invariant

ADR-031 holds: every NS write goes through proposed_actions with Mike's approval. Confidence above 0.85 auto-stages a draft; 0.85 or below surfaces a review-first card.

Worked example

Driscoll Foods PO → SO (the first deployed use case)

Scenario

Driscoll Foods (purchasing@driscoll-foods.com) sends a PO PDF (PO_8801772.pdf) to orders@ai-globalfoodsolutions.co. src/email.ts logs the email and saves the PDF to R2, and document_converter.ts parses it to markdown.

Sender domain match resolves customer_id = 478 (Driscoll Foods). doc_type classifier returns po_inbound. ns_record_type maps to SalesOrd. Template lookup finds tpl_driscoll_po_so_v3 with 8 field tags (47 prior hits, 95.7% success).
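
The sender-domain match could look something like the sketch below. It assumes a simple domain-to-customer map standing in for customers.email_domain; `domainOf`, `resolveSender`, and the `SenderResolution` type are hypothetical names, not the actual code in src/email.ts.

```typescript
// Hypothetical sketch: resolve customer_id from the sender's email domain.
// The Map stands in for the customers.email_domain column.
const customersByDomain = new Map<string, number>([["driscoll-foods.com", 478]]);

// Extract the lowercased domain part of an address, or null if malformed.
function domainOf(address: string): string | null {
  const at = address.lastIndexOf("@");
  if (at <= 0 || at === address.length - 1) return null;
  return address.slice(at + 1).trim().toLowerCase();
}

type SenderResolution =
  | { kind: "known"; customer_id: number }
  | { kind: "unknown"; domain: string | null }; // would surface as a NEW customer prompt

function resolveSender(address: string): SenderResolution {
  const domain = domainOf(address);
  const id = domain === null ? undefined : customersByDomain.get(domain);
  return id === undefined ? { kind: "unknown", domain } : { kind: "known", customer_id: id };
}
```

Returning an explicit "unknown" result, rather than throwing, keeps the unknown-sender case on the HITL path instead of dropping the email.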

The 8 field tags run in parallel: regex_after_label captures P.O. # 8801772; literal_constant locks entity = Driscoll Foods; multi_line_span grabs the ship address; a second regex_after_label grabs the delivery date; whole_section captures memo notes; three table_with_headers tags walk the Item # / Qty / Price columns. Overall weighted confidence: 0.92.
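
One way a weighted overall confidence can be computed is sketched below. The source does not say how fields are weighted, so the `weight` values and the `weightedConfidence` helper are purely illustrative assumptions, and the sample numbers are not the real Driscoll extraction.

```typescript
// Hypothetical sketch of a weighted overall confidence.
// The weights are illustrative; the real weighting scheme is not documented here.
type FieldResult = { ns_field: string; confidence: number; weight: number };

function weightedConfidence(results: FieldResult[]): number {
  let weighted = 0;
  let total = 0;
  for (const r of results) {
    weighted += r.confidence * r.weight;
    total += r.weight;
  }
  // No fields (or all-zero weights) means nothing usable was extracted: report 0.
  return total === 0 ? 0 : weighted / total;
}

// Illustrative input: a heavily weighted PO number and a lightly weighted memo.
const sample: FieldResult[] = [
  { ns_field: "otherrefnum", confidence: 0.95, weight: 2 },
  { ns_field: "memo", confidence: 0.83, weight: 1 },
];
const overall = weightedConfidence(sample); // (0.95 * 2 + 0.83 * 1) / 3
```

Weighting lets a single soft field such as a memo sit below the threshold without dragging the whole document into review-first.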

Because 0.92 is above the 0.85 threshold, a draft SO is auto-staged in proposed_actions. Mike opens admin-dashboard, sees the side-by-side PDF + extracted form, spot-checks the memo (which was at 0.83), and approves. NS_PUSH_QUEUE drains: PushMutexDO per customer 478, POST /api/ns/push/sales-order. The NS SO is created with internal_id 1842738 and otherrefnum = "8801772", the PO# trace thread that carries through Invoice and CashSale.
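
The staging decision, together with a list of fields worth a spot-check, might be sketched like this. `planStaging` is a hypothetical helper, and flagging any single field at or below 0.85 is an assumption made for illustration; the source only says the memo at 0.83 was spot-checked.

```typescript
// Hypothetical sketch of the staging decision for one extraction.
const AUTO_DRAFT_THRESHOLD = 0.85;

type FieldConfidence = { ns_field: string; confidence: number };
type Staging = { mode: "auto_draft" | "review_first"; spotCheck: string[] };

function planStaging(overall: number, fields: FieldConfidence[]): Staging {
  // Assumption: a field is worth a spot-check when it alone is at or below the threshold.
  const spotCheck = fields
    .filter((f) => f.confidence <= AUTO_DRAFT_THRESHOLD)
    .map((f) => f.ns_field);
  // Either mode still waits for approval; the mode only changes how the card is presented.
  return { mode: overall > AUTO_DRAFT_THRESHOLD ? "auto_draft" : "review_first", spotCheck };
}
```

With an overall score of 0.92 and a memo at 0.83, this yields an auto-staged draft with the memo flagged for attention.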

Reflexion fires: hit_count: 47 -> 48, success_count: 45 -> 46. events.so.created_from_po emits; customer_health watcher recomputes.
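
The counter update can be illustrated with a small sketch. `recordOutcome` and `successRate` are hypothetical helpers, and treating an approved extraction as a "success" is an assumption; only the counter values themselves come from the example above.

```typescript
// Hypothetical sketch of the reflexion counter update after an operator decision.
type TemplateStats = { hit_count: number; success_count: number };

// Every apply counts as a hit; only an approved extraction counts as a success.
// (What exactly counts as "success" is an assumption in this sketch.)
function recordOutcome(stats: TemplateStats, approved: boolean): TemplateStats {
  return {
    hit_count: stats.hit_count + 1,
    success_count: stats.success_count + (approved ? 1 : 0),
  };
}

// Success rate as a percentage rounded to one decimal place.
function successRate(stats: TemplateStats): number {
  if (stats.hit_count === 0) return 0;
  return Math.round((stats.success_count / stats.hit_count) * 1000) / 10;
}

const before: TemplateStats = { hit_count: 47, success_count: 45 };
const after = recordOutcome(before, true);
```

Here 45 of 47 corresponds to the 95.7% quoted for the template, and the approved run moves the counters to 48 and 46.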

Step-by-step what happens

Intake → tag/train → apply/push

  1. Intake (3 channels). UI upload at /data-tagger.html, automatic inbound email from one of 5 mailboxes, or chat upload. The PDF lands in R2.
  2. Parse to markdown. document_converter.ts -> markdown + span coordinates.
  3. Identify (3-key thread). customer_id + doc_type + ns_record_type, e.g. Driscoll Foods / po_inbound / SalesOrd.
  4. Lookup template or train new. If a template exists -> apply. If not -> the operator visually tags, picks strategies, and saves a versioned template.
  5. Apply 9 strategies (per field). regex_after_label, regex_before_label, fixed_region, table_with_headers, multi_line_span, whole_section, formula, llm_with_schema, literal_constant.
  6. Confidence + HITL stage. Weighted overall confidence; above the 0.85 threshold -> auto-draft; otherwise -> review-first. proposed_actions INSERT.
  7. Operator approve. Side-by-side PDF + editable form. Approve / edit+approve / reject / reassign.
  8. NS_PUSH_QUEUE writes. Routes per ns_record_type: SalesOrd push, vendor_coas insert, bid_external_pipeline insert. PushMutexDO per customer.
  9. Reflexion + events. Template hit_count/success_count increment; events fire; subscribers react (customer_health, bid pipeline, compliance).

Outcomes

What's different after the cycle

Strategies: 9 (migration 142)
Use cases: 3 (PO/SO · COA · RFP)
Reference: Driscoll (Path 1 deployed)
Confidence: 0.85 (auto-draft threshold)

Failure modes

What can go wrong

Unknown sender domain

An email arrives from a domain not in customers.email_domain. It surfaces in HITL with a NEW customer prompt; Mike resolves it manually.

Template confidence below threshold

Some field extractions fail (e.g. the PDF was a scanned image rather than text). The system falls back to review-first; Mike corrects, which feeds reflexion.

llm_with_schema cost blowup

A misconfigured template that uses llm_with_schema for every field could run up Workers AI cost. Needs CostCapDO integration.
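
Until a cost cap is wired in, a cheap pre-save check on the strategy mix could catch the worst cases. The sketch below is only an illustration: `llmShare`, `overusesLlm`, and the default `maxShare` of 0.25 are hypothetical and are not part of CostCapDO or migration 142.

```typescript
// Hypothetical sketch: flag templates that lean too heavily on llm_with_schema.
type FieldTag = { ns_field: string; strategy: string };

// Fraction of a template's field tags that use the LLM strategy (0 for an empty template).
function llmShare(tags: FieldTag[]): number {
  if (tags.length === 0) return 0;
  const llm = tags.filter((t) => t.strategy === "llm_with_schema").length;
  return llm / tags.length;
}

// maxShare is an illustrative limit, not a documented setting.
function overusesLlm(tags: FieldTag[], maxShare = 0.25): boolean {
  return llmShare(tags) > maxShare;
}
```

A check like this only inspects the template definition; it does not replace a runtime cap on actual spend.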

Path 2 vendor_coas table missing

The Path 2 wiki documents the use case, but the destination table needs its own migration before the path can actually write. This is the blocking gap for Path 2.

Related

Adjacent flows + diagrams

For developers

Code paths + invariants

| Concern | Where |
| --- | --- |
| Visual UI | /data-tagger.html (Agent BB-2) |
| Chat tools | src/chat_tools/impls.ts data_tagger_* (Agent BB-3) |
| Document parser | src/document_converter.ts |
| Email pipeline | src/email.ts (5 mailboxes) |
| Migration | 142_data_tagger.sql (Agent BB-1) |
| D1 tables | data_tagger_templates, data_tagger_extractions, data_tagger_template_corrections, data_tagger_doc_types, data_tagger_uploads |
| Durable Object | PushMutexDO (per customer) |
| NS RESTlets | customscript_gfs_platform_push_so (Path 1) |
| R2 buckets | gfs-data-tagger-samples, gfs-inbound-attachments |

    // Trace thread invariant
    type ThreadKey = {
      customer_id: number,
      doc_type: string,
      ns_record_type: string
    };

    // Apply template
    async function applyTemplate(template, markdown) {
      const results = [];
      for (const tag of template.field_tags) {
        const { value, confidence } = await runStrategy(tag.strategy, tag.pattern, markdown);
        results.push({ ns_field: tag.ns_field, value, confidence });
      }
      const overall = weightedAvg(results);
      if (overall > 0.85) autoStageDraft(results);
      else stageReviewFirst(results);
    }

Changelog

Dated trail

| Date | Round | Change | Touched by |
| --- | --- | --- | --- |
| 2026-05-27 | R598 | Data Tagger 5th pillar shipped: master + 3 path diagrams + 4 wikis. 9 extraction strategies documented. Path 1 (Driscoll PO/SO) deployed reference. Threading: customer + doc_type + ns_record_type. | Mike + Claude |

Schema · data contract

The machine-readable spec

Master workflow_type · data_tagger_lifecycle · risk_level 3. Sub-contracts: data_tagger_po_to_so_path, data_tagger_coa_to_compliance_path, data_tagger_bid_rfp_path.

The 3-key trace thread

| Record / table | Field carrying customer + doc_type + ns_record_type | Sample |
| --- | --- | --- |
| data_tagger_templates | customer_id + doc_type + ns_record_type (thread origin) | 478 / po_inbound / SalesOrd |
| data_tagger_extractions | template_id + extraction_id | tpl_driscoll_po_so_v3 / ext_2026-05-27_a8f |
| proposed_actions | payload.template_id + payload.extraction_id | same |
| ns_pending_pushes | payload.customer_id + ns_record_type | 478 / SalesOrd |
| NS SalesOrd | otherrefnum (customer PO#) | 8801772 (Path 1 secondary thread) |

The 9 extraction strategies (migration 142)

| Strategy | Purpose |
| --- | --- |
| regex_after_label | find label text, capture text after (e.g. "P.O. #") |
| regex_before_label | find label text, capture text before |
| fixed_region | coordinates always at same x,y,w,h on page |
| table_with_headers | locate table by header row, walk column |
| multi_line_span | span starts at anchor and runs N lines |
| whole_section | everything between two anchors |
| formula | compute from prior extracted values (e.g. qty * rate) |
| llm_with_schema | last resort, expensive, schema-constrained Workers AI call |
| literal_constant | just return the constant (trusted from outside) |
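
To make the cheaper strategies concrete, here is a minimal sketch of three of them operating on parsed markdown. The function names, signatures, and the fixed confidence values of 1 and 0 are assumptions for illustration; the actual strategy schemas live in migration 142 and may differ.

```typescript
// Hypothetical sketch of three of the simpler strategies over parsed markdown.
// Signatures and confidence values are illustrative assumptions.
type Extraction = { value: string | null; confidence: number };

// Escape regex metacharacters so a label such as "P.O. #" is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// regex_after_label: find the label, then capture the rest of that line.
function regexAfterLabel(markdown: string, label: string): Extraction {
  const pattern = new RegExp(escapeRegExp(label) + "[ \\t]*:?[ \\t]*([^\\r\\n]+)");
  const m = pattern.exec(markdown);
  const captured = m ? m[1] : undefined;
  return captured !== undefined
    ? { value: captured.trim(), confidence: 1 }
    : { value: null, confidence: 0 };
}

// literal_constant: return the configured value regardless of the document.
function literalConstant(constant: string): Extraction {
  return { value: constant, confidence: 1 };
}

// formula: compute from prior extracted values, e.g. qty * rate.
function lineAmount(qty: number, rate: number): number {
  return qty * rate;
}
```

Escaping the label matters for inputs like "P.O. #", where an unescaped dot would match any character and could latch onto the wrong text.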

D1 tables (migration 142 - Agent BB-1)

| Table | Purpose |
| --- | --- |
| data_tagger_templates | per-(customer, doc_type, ns_record_type) template; versioned |
| data_tagger_extractions | one row per inbound doc processed |
| data_tagger_template_corrections | operator edit log for reflexion |
| data_tagger_doc_types | doc_type -> ns_record_type mapping |
| data_tagger_uploads | raw upload audit |

Endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/data-tagger/upload | UI upload |
| POST | /api/data-tagger/train | save tagged template |
| POST | /api/data-tagger/apply | run template against inbound doc |
| POST | /api/proposed-actions/decide | approve / reject extraction |
| POST | /api/ns/push/sales-order | NS SO write-back (Path 1) |

Events fired

| event_type | When |
| --- | --- |
| data_tagger.extracted_to_ns | every successful apply |
| data_tagger.template_used | every apply |
| data_tagger.template_corrected | operator edited before approve |
| so.created_from_po | Path 1 NS SO write success |
| coa.received | Path 2 vendor_coas insert success |
| bid.rfp_logged | Path 3 bid_external_pipeline insert success |

Runbook · when it breaks

It broke - what now

Scenario · Driscoll PO extracted with wrong PO#

Mike opens the proposed_action and sees that otherrefnum is wrong (the regex caught a different number on the page).

  1. Edit the field in the side-by-side review, click Approve - correction logs to data_tagger_template_corrections
  2. Inspect template: SELECT field_tags FROM data_tagger_templates WHERE template_id='tpl_driscoll_po_so_v3'
  3. If frequent (miss_count/hit_count > 0.2): train a new version with tighter regex or switch strategy to fixed_region
  4. Bump version: prior gets status='superseded', new gets status='active'
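
Steps 3 and 4 above can be sketched as two small functions. This is an illustration over an in-memory array rather than D1; `needsRetrain`, `bumpVersion`, and the `TemplateVersion` shape are hypothetical, and only the 0.2 ratio and the status values come from the steps themselves.

```typescript
// Hypothetical sketch of the retrain decision and version bump.
type TemplateVersion = { template_id: string; version: number; status: "active" | "superseded" };

// Step 3: retrain when misses exceed 20% of hits.
function needsRetrain(missCount: number, hitCount: number): boolean {
  return hitCount > 0 && missCount / hitCount > 0.2;
}

// Step 4: mark the current active version superseded and append the new active one.
function bumpVersion(history: TemplateVersion[], newId: string): TemplateVersion[] {
  const latest = history.reduce((max, t) => Math.max(max, t.version), 0);
  const superseded = history.map((t): TemplateVersion =>
    t.status === "active" ? { ...t, status: "superseded" } : t
  );
  return [...superseded, { template_id: newId, version: latest + 1, status: "active" }];
}
```

Returning a new array instead of mutating in place keeps the prior version available for comparison if the retrained template performs worse.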

Scenario · A vendor COA never staged (Path 2)

Vendor says they sent the COA; nothing in proposed_actions.

  1. Check vendor_coas table exists - currently TBD; Path 2 is blocked until table lands
  2. Check inbound_email_log: SELECT * FROM inbound_email_log WHERE mailbox='vendors@' AND from_addr LIKE '%vendor.com%'
  3. Check classifier: did doc_type resolve to coa? Look at data_tagger_uploads.classified_as
  4. Run manually: POST /api/data-tagger/apply with the R2 key
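
For step 4, the manual re-run could be prepared with a small helper like the one below. The request body shape is not documented here, so the `r2_key` field name, the JSON content type, and the `buildApplyRequest` helper are all assumptions; only the method and path come from the endpoint list.

```typescript
// Hypothetical sketch: build the manual re-run request for a stored upload.
// The body field name `r2_key` and the JSON content type are assumptions.
type ApplyRequest = {
  url: string;
  init: { method: "POST"; headers: { [name: string]: string }; body: string };
};

function buildApplyRequest(baseUrl: string, r2Key: string): ApplyRequest {
  return {
    // Trim trailing slashes so the path is not doubled.
    url: baseUrl.replace(/\/+$/, "") + "/api/data-tagger/apply",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ r2_key: r2Key }),
    },
  };
}
```

The result can be passed to `fetch(req.url, req.init)` from an authenticated session; the helper itself sends nothing.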

Scenario · llm_with_schema cost spike

Cost dashboard shows unusual Workers AI spend.

  1. Identify culprit template: SELECT template_id, COUNT(*) FROM data_tagger_extractions WHERE created_at > ... GROUP BY template_id
  2. Check strategy mix: any template overusing llm_with_schema?
  3. Add CostCapDO guard to /api/data-tagger/apply
  4. Retrain with cheaper strategies where possible

Logs to check

Kill switches

Backlog · open questions

What's not done · what's uncertain