Pipeline — 3 lanes (intake / tag+train / apply+push) · (customer + doc_type + ns_record_type) threading


Glossary · cluster colors + thread terms

Database (templates, extractions, identity)

Backend (parser, strategies, reflexion)

Cloud (tagger UI, live surface)

Messagebus (events, NS_PUSH_QUEUE, branch)

HITL gate (proposed_actions)

★ NS write

customer + doc_type + ns_record_type: the trace thread for the Data Tagger pillar

9 strategies: regex_after_label, regex_before_label, fixed_region, table_with_headers, multi_line_span, whole_section, formula, llm_with_schema, literal_constant

data_tagger_templates: per-(customer, doc, record) template; versioned

data_tagger_extractions: one row per inbound document processed

Phase detail — 3 lanes

L1 Intake — 3 channels REAL

Documents arrive through one of three channels: UI upload via /data-tagger.html, inbound email auto-route (5 mailboxes), or chat upload (Agent BB-3 tools).

UI surface

/data-tagger.html (Agent BB-2)

Email pipeline

src/email.ts + 5 mailboxes

Chat tools

data_tagger_train, data_tagger_apply, data_tagger_save_template (Agent BB-3)

L2 Tag + Train (status: migration 142 + Agent BB-2 in flight)

Parse the document to markdown, identify the 3-key thread, and look up the template. If no template exists, the operator visually tags each field and picks one of the 9 strategies for it; the template is then saved as a new version.

Thread key

customer_id + doc_type + ns_record_type (e.g. Driscoll Foods / po_inbound / SalesOrd)
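The thread key lends itself to a small composite-key sketch. The following is a minimal illustration, not the project's code: the `ThreadKey` interface, the `|` separator, the `driscoll-foods` id, and the in-memory `Map` standing in for `data_tagger_templates` are all assumptions made for the example.

```typescript
// Hypothetical shape of the 3-key thread; field names mirror the text above.
interface ThreadKey {
  customerId: string;
  docType: string;
  nsRecordType: string;
}

// Join the three parts into one lookup string. The "|" separator is an
// assumption; any character that cannot occur in the parts would work.
function threadKeyToString(key: ThreadKey): string {
  return [key.customerId, key.docType, key.nsRecordType].join("|");
}

// In-memory stand-in for a data_tagger_templates lookup (illustrative only).
const templatesByThread = new Map<string, { version: number }>();

templatesByThread.set(
  threadKeyToString({
    customerId: "driscoll-foods", // hypothetical id for "Driscoll Foods"
    docType: "po_inbound",
    nsRecordType: "SalesOrd",
  }),
  { version: 1 }
);

// Returns the stored template, or undefined when the operator must tag first.
function lookupTemplate(key: ThreadKey): { version: number } | undefined {
  return templatesByThread.get(threadKeyToString(key));
}
```

A miss (`undefined`) corresponds to the "if absent, operator tags" branch of this lane.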

Parser

src/document_converter.ts

Strategies

regex_after_label · regex_before_label · fixed_region · table_with_headers · multi_line_span · whole_section · formula · llm_with_schema · literal_constant
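As a rough illustration of how the strategy names could be modelled, the sketch below lists the 9 strategies as a union type and implements one plausible reading of `regex_after_label`. The semantics (find a literal label, then capture a value pattern that follows it) and the function signature are assumptions; the source only names the strategies.

```typescript
// The 9 strategy names, taken verbatim from the list above.
type Strategy =
  | "regex_after_label"
  | "regex_before_label"
  | "fixed_region"
  | "table_with_headers"
  | "multi_line_span"
  | "whole_section"
  | "formula"
  | "llm_with_schema"
  | "literal_constant";

const STRATEGIES: Strategy[] = [
  "regex_after_label", "regex_before_label", "fixed_region",
  "table_with_headers", "multi_line_span", "whole_section",
  "formula", "llm_with_schema", "literal_constant",
];

// Escape regex metacharacters so the label is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Assumed behaviour of regex_after_label: locate the label, skip optional
// whitespace and a colon, then capture whatever matches valuePattern.
function regexAfterLabel(
  markdown: string,
  label: string,
  valuePattern: string
): string | null {
  const re = new RegExp(escapeRegExp(label) + "\\s*:?\\s*(" + valuePattern + ")");
  const m = re.exec(markdown);
  return m && m[1] !== undefined ? m[1] : null;
}
```

For example, `regexAfterLabel("PO Number: 12345", "PO Number", "\\d+")` yields `"12345"`, and a document without the label yields `null`.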

L3 Apply + Push (status: REAL for SO; STUB for COA / bid)

Each field's strategy runs and yields a per-field confidence plus an overall confidence. The extraction is staged for HITL review, the operator approves it, NS_PUSH_QUEUE carries the write to NS, reflexion updates the template metrics, and events fire.

HITL

ADR-031 invariant: every NS write needs a proposed_actions row
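One way to picture the invariant is a guard that runs before any NS write. This is a sketch under assumptions: the `ProposedAction` shape, the `decision` field, and the requirement that the row be approved (rather than merely present) are illustrative and not taken from ADR-031 itself.

```typescript
type Decision = "pending" | "approved" | "rejected";

// Hypothetical projection of a proposed_actions row.
interface ProposedAction {
  id: string;
  kind: "data_tagger_extraction";
  decision: Decision;
}

// Throws unless the write is backed by a proposed_actions row.
// Requiring decision === "approved" is an assumption layered on top of
// the stated invariant (which only says a row must exist).
function assertNsWriteAllowed(action: ProposedAction | undefined): void {
  if (!action) {
    throw new Error("NS write blocked: no proposed_actions row");
  }
  if (action.decision !== "approved") {
    throw new Error("NS write blocked: action " + action.id + " is " + action.decision);
  }
}
```

Calling the guard at the single choke point where the NS push happens keeps the check in one place instead of scattering it across callers.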

Confidence

threshold 0.85: at or above -> auto-draft path; below -> review-first
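The routing rule can be sketched as below. Two things here are assumptions: that the overall confidence is what gets compared to the threshold, and that the overall value is the minimum of the per-field confidences (the source does not say how it is aggregated). The names `FieldResult`, `overallConfidence`, and `route` are hypothetical.

```typescript
// Threshold from the text above; flagged elsewhere on this page as a guess.
const AUTO_DRAFT_THRESHOLD = 0.85;

interface FieldResult {
  field: string;
  confidence: number; // assumed to lie in [0, 1]
}

// Assumption: overall confidence = weakest field. An empty extraction
// scores 0 so it always lands in review.
function overallConfidence(fields: FieldResult[]): number {
  if (fields.length === 0) return 0;
  return fields.reduce((min, f) => Math.min(min, f.confidence), 1);
}

// At or above the threshold goes to the auto-draft path; below goes to review.
function route(overall: number): "auto_draft" | "review_first" {
  return overall >= AUTO_DRAFT_THRESHOLD ? "auto_draft" : "review_first";
}
```

Using the minimum is a conservative choice: a single weak field is enough to send the whole document to review-first.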

Reflexion

hit_count / success_count / miss_count per template
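A minimal sketch of how the three reflexion counters might be maintained follows. The semantics are assumed: `hit_count` increments each time the template is applied, and exactly one of `success_count` or `miss_count` increments depending on whether the operator accepted the extraction. The derived success rate is illustrative and not a documented metric.

```typescript
// Camel-cased mirror of hit_count / success_count / miss_count.
interface TemplateMetrics {
  hitCount: number;
  successCount: number;
  missCount: number;
}

// Returns updated counters without mutating the input.
function recordOutcome(
  m: TemplateMetrics,
  outcome: "success" | "miss"
): TemplateMetrics {
  return {
    hitCount: m.hitCount + 1,
    successCount: m.successCount + (outcome === "success" ? 1 : 0),
    missCount: m.missCount + (outcome === "miss" ? 1 : 0),
  };
}

// Fraction of applications that succeeded; null before the first hit.
function successRate(m: TemplateMetrics): number | null {
  return m.hitCount === 0 ? null : m.successCount / m.hitCount;
}
```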

Tables, files, endpoints, code paths

kind	name	purpose
Live tool	`/data-tagger.html`	visual tagger UI (Agent BB-2)
D1 table	`data_tagger_templates`	per-(customer,doc,record) template; versioned (migration 142)
D1 table	`data_tagger_extractions`	one row per inbound document processed
D1 table	`data_tagger_template_corrections`	operator edits for reflexion
D1 table	`data_tagger_doc_types`	doc_type -> ns_record_type mapping
D1 table	`data_tagger_uploads`	raw upload audit
D1 table	`inbound_email_log`	existing email audit
D1 table	`proposed_actions`	HITL queue (kind=`data_tagger_extraction`)
D1 table	`events`	event ledger (R549)
R2 bucket	`gfs-data-tagger-samples`	uploaded sample PDFs
R2 bucket	`gfs-inbound-attachments`	email attachments
Endpoint	`POST /api/data-tagger/upload`	UI upload
Endpoint	`POST /api/data-tagger/train`	save tagged template
Endpoint	`POST /api/data-tagger/apply`	run template against inbound doc
Endpoint	`POST /api/proposed-actions/decide`	approve / reject extraction
Endpoint	`POST /api/ns/push/sales-order`	NS SO write-back (Path 1)
Code path	`src/document_converter.ts`	PDF/DOCX/XLSX -> markdown
Code path	`src/email.ts`	5-mailbox inbound pipeline
Code path	`src/chat_tools/impls.ts`	data_tagger_* tools (Agent BB-3)
Migration	`142_data_tagger.sql`	Agent BB-1 - 9 strategies + templates + extractions schema
Durable Object	`PushMutexDO`	per-customer NS write mutex

Open gaps — honest punch list

Migration 142 (9 strategy schema + templates + extractions tables) — Agent BB-1 owns. Schemas drafted; landing in flight.
/data-tagger.html visual UI — Agent BB-2 owns. Drag-rect overlay + NS schema autocomplete + side-panel md preview not yet shipped.
Chat tools data_tagger_* — Agent BB-3 owns. Three tools (train / apply / save_template) not yet registered in src/chat_tools/impls.ts.
Path 2 vendor_coas table NOT YET CREATED — Path 2 wiki documents the use case but the destination table needs its own migration before the path can actually write.
Auto-confidence threshold tuning — threshold 0.85 is a guess; needs first-month telemetry to land on a defensible value.
llm_with_schema cost cap — this is the expensive strategy. It needs CostCapDO integration so that a misconfigured template cannot burn through the cost budget.
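For the last gap, the kind of guard a cost cap implies can be sketched as follows. This is not the CostCapDO interface, which the source does not describe; the `CostBudget` class, its method names, and the use of integer cents are assumptions chosen to keep the example self-contained.

```typescript
// Hypothetical in-memory budget; a real cap would live in shared state.
class CostBudget {
  private spentCents = 0;
  private readonly capCents: number;

  constructor(capCents: number) {
    this.capCents = capCents;
  }

  // Reserve an estimated cost before running llm_with_schema.
  // Returns false (and charges nothing) if the cap would be exceeded.
  tryCharge(estimateCents: number): boolean {
    if (estimateCents < 0) return false;
    if (this.spentCents + estimateCents > this.capCents) return false;
    this.spentCents += estimateCents;
    return true;
  }

  remainingCents(): number {
    return this.capCents - this.spentCents;
  }
}
```

A caller would check `tryCharge` first and, on `false`, skip the LLM call and fall back to review rather than fail the whole extraction; that fallback is likewise a design assumption.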

GFS Data Tagger — master