Knowledge corpus · wiki guide

What this is

One retrievable surface for everything structural

The knowledge corpus is the platform's RAG (retrieval-augmented generation) substrate. Every piece of structural knowledge the AI might need (which NetSuite records exist, which saved searches do what, which ADRs decided which architectural questions, how the system guide explains a feature) lives as a chunk in the knowledge_chunks D1 table and as a vector in the ns_knowledge Vectorize namespace.

Before the corpus, each chat tool had its own ad-hoc context source: one tool would inline the manifest, another would re-read the system guide, a third would hard-code ADR names. The corpus consolidates all of that into a single retrievable surface. A chat tool says "what do I know about NetSuite Saved Search #1247?" and gets back a focused chunk with the right metadata.

Schema: migrations/schema/124_knowledge_corpus.sql, landed R555 / Phase 57. The single Vectorize index uses BGE-base embeddings (cheap, fast, plenty accurate for structural retrieval).

Composition

What's in the 3,360 chunks

Total chunks: 3,360 (unified surface)
Manifest: 2,361 (NS records / fields / scripts)
Saved search: 963 (SS definitions)
ADR + guide: 36 (32 ADRs + 4 guide sections)

Source type	Count	What it covers
`manifest`	2,361	NS records, fields, lists, reports, roles, scripts, suitelets, workflows, dashboards, forms, KPIs
`saved_search`	963	Saved-search definitions — title, filters, columns, used_by_roles
`adr`	32	Architecture decision records from `data/decisions.json`
`guide`	4	Long-form sections from the system guide / wiki prose

Each chunk has: chunk_id (e.g. manifest:saved_search:1247), source_type, source_id, title, body (the indexable text), metadata_json (category, tags, NS id), content_hash (SHA-256, gates re-embedding), vector_id (= chunk_id in Vectorize).
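
As a rough illustration, the chunk record described above could be typed as follows. The field names come from the list; the TypeScript types, the hex encoding of content_hash, the checkChunk helper, and every value in the example row other than the chunk_id are assumptions made for the sketch, not the schema in 124_knowledge_corpus.sql.

```typescript
// Field names come from the list above; the TypeScript types are assumptions.
type SourceType = "manifest" | "saved_search" | "adr" | "guide";

interface KnowledgeChunk {
  chunk_id: string;      // deterministic, e.g. "manifest:saved_search:1247"
  source_type: SourceType;
  source_id: string;
  title: string;
  body: string;          // the indexable text
  metadata_json: string; // JSON string: category, tags, NS id
  content_hash: string;  // SHA-256 of body, assumed to be stored as 64 hex chars
  vector_id: string;     // must equal chunk_id
}

// Hypothetical helper: returns the violated invariants; an empty list means the row looks sane.
function checkChunk(chunk: KnowledgeChunk): string[] {
  const problems: string[] = [];
  if (chunk.vector_id !== chunk.chunk_id) problems.push("vector_id != chunk_id");
  if (!/^[0-9a-f]{64}$/.test(chunk.content_hash)) problems.push("content_hash is not SHA-256 hex");
  try {
    JSON.parse(chunk.metadata_json);
  } catch {
    problems.push("metadata_json is not valid JSON");
  }
  return problems;
}

// Illustrative row: every value except chunk_id is made up for the example.
const example: KnowledgeChunk = {
  chunk_id: "manifest:saved_search:1247",
  source_type: "saved_search",
  source_id: "1247",
  title: "Example saved search",
  body: "title + filters + columns, serialized",
  metadata_json: JSON.stringify({ category: "example", tags: [], ns_id: 1247 }),
  content_hash: "0".repeat(64), // placeholder, not a real digest
  vector_id: "manifest:saved_search:1247",
};
```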

The chat tool

search_netsuite_knowledge

The corpus is consumed via the search_netsuite_knowledge chat tool. The chairman selects it when a question is about structural NS knowledge — "which saved search shows me past-due AR by customer", "what ADR decided the HITL pattern", "is there a Suitelet for vendor cost upload".

Pipeline: query → BGE-base embed → Vectorize top-K lookup (default K=8) → join back to knowledge_chunks for body + metadata → return ranked. Filters can narrow by source_type when the chairman knows the answer is structural vs prose.

// Example tool invocation
search_netsuite_knowledge({
  query: "saved search past due ar by customer",
  filters: { source_type: ["saved_search"] },
  top_k: 5
})
// Returns: ranked chunks with title, body, metadata, ns_id link
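
The pipeline can be sketched in memory to show the order of operations. This is a minimal stand-in, not the code in src/chat_tools/impls.ts: the two Maps play the roles of Vectorize and knowledge_chunks, the query arrives already embedded (the BGE-base call is left out), cosine similarity is an assumed scoring function, and searchKnowledge, ChunkRow, and RankedChunk are hypothetical names.

```typescript
type Vec = number[];

interface ChunkRow { chunk_id: string; source_type: string; title: string; body: string }
interface RankedChunk extends ChunkRow { score: number }

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

function searchKnowledge(
  queryVec: Vec,                 // stand-in for the embedded query
  index: Map<string, Vec>,       // stand-in for the Vectorize namespace
  rows: Map<string, ChunkRow>,   // stand-in for the knowledge_chunks table
  opts: { top_k?: number; source_type?: string[] } = {},
): RankedChunk[] {
  const topK = opts.top_k ?? 8; // default K=8, as stated above
  const allowed = opts.source_type ? new Set(opts.source_type) : null;
  const scored: RankedChunk[] = [];
  index.forEach((vec, id) => {
    const row = rows.get(id);    // join back for body + metadata
    if (!row) return;            // vector with no matching row: skip it
    if (allowed && !allowed.has(row.source_type)) return;
    scored.push({ ...row, score: cosine(queryVec, vec) });
  });
  scored.sort((x, y) => y.score - x.score);
  return scored.slice(0, topK);
}
```

In this sketch the source_type filter is applied before ranking, so a filtered query still returns up to top_k results of the requested type.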

Worked example

"Which ADR decided we use D1 as the mirror?"

Scenario

Mike asks the chairman: "What ADR explains why D1 is the mirror and NS is system of record?"

The chairman picks search_netsuite_knowledge with filter source_type=['adr']. BGE-base embeds the query, Vectorize returns top-5. The top result is ADR-012 — D1 as read mirror, NS as system of record. The chunk includes the decision text, the alternatives considered, the date, and the link to the canonical version in data/decisions.json. The chairman narrates the answer with the citation embedded.

Wall-clock: ~280ms (BGE embed) + ~80ms (Vectorize) + ~10ms (D1 join) = ~370ms.

Step-by-step ingestion

How chunks get built and embedded

01

Source scan

scripts/build-knowledge-corpus.mjs walks the four source surfaces: system-manifest.json, the saved-search export, data/decisions.json, and the wiki prose files in R2.

Reads: manifest, SS exports, ADR JSON, R2 wiki

02

Chunk synthesis

Each source item becomes one chunk. The chunk_id is deterministic (manifest:saved_search:1247). The body is the indexable text — for an SS it's title + filters + columns serialized; for an ADR it's the prose. SHA-256 of body becomes content_hash.

Writes: knowledge_chunks staging

03

Hash-gated embed

If content_hash matches the existing row, skip — no need to re-embed. Otherwise call Workers AI BGE-base, store the vector in Vectorize with vector_id = chunk_id, and update embedded_at.

Writes: Vectorize ns_knowledge, knowledge_chunks.embedded_at
Cost: only on changed content

04

Verify + swap

After all chunks are processed, a verify pass confirms the Vectorize count matches D1 count and a random-sample query returns sensible results. If verify passes, the new corpus is live.

Time: ~3 min full rebuild on a quiet day
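
Steps 02 to 04 can be illustrated with an in-memory sketch of the hash gate and the count check. It is not the logic of scripts/build-knowledge-corpus.mjs: the Maps stand in for knowledge_chunks and Vectorize, the embed callback is synchronous where a Workers AI call would be asynchronous, the random-sample query from the verify pass is omitted, and ingest, verifyCounts, SourceItem, and ChunkRow are hypothetical names.

```typescript
import { createHash } from "node:crypto";

interface SourceItem { chunk_id: string; body: string }
interface ChunkRow { chunk_id: string; body: string; content_hash: string; embedded_at: string }

// SHA-256 of the body, hex-encoded (the encoding is an assumption).
function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Embeds only items whose content hash differs from the stored row.
// Returns how many items were (re-)embedded.
function ingest(
  items: SourceItem[],
  table: Map<string, ChunkRow>,     // stand-in for knowledge_chunks
  vectors: Map<string, number[]>,   // stand-in for Vectorize, keyed by vector_id = chunk_id
  embed: (body: string) => number[],
  now: string,
): number {
  let embedded = 0;
  for (const item of items) {
    const hash = sha256Hex(item.body);
    const existing = table.get(item.chunk_id);
    if (existing && existing.content_hash === hash) continue; // unchanged: skip the embed
    vectors.set(item.chunk_id, embed(item.body));
    table.set(item.chunk_id, {
      chunk_id: item.chunk_id,
      body: item.body,
      content_hash: hash,
      embedded_at: now,
    });
    embedded++;
  }
  return embedded;
}

// Count half of the verify pass: both stores must hold the same number of entries.
function verifyCounts(table: Map<string, ChunkRow>, vectors: Map<string, number[]>): boolean {
  return table.size === vectors.size;
}
```

Running ingest twice over the same items embeds everything on the first pass and nothing on the second, which is the behavior the hash gate is meant to provide. The sketch does not remove chunks whose source item has disappeared.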

Refresh schedule

Cron: `15 4 * * SUN`

The corpus refreshes once a week — Sunday at 4:15am UTC. Most sources (NS manifest, saved searches, ADRs) don't change frequently enough to warrant nightly rebuilds, and BGE-base embedding has per-token cost that adds up if you over-rebuild.
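
To make the schedule concrete, the helper below computes the next run time implied by `15 4 * * SUN`, read as Sunday 04:15 UTC. nextSundayRunUtc is a hypothetical function written only for this cron expression; it is not a general cron parser and is not part of the platform.

```typescript
// Next Sunday 04:15 UTC strictly after `from`.
function nextSundayRunUtc(from: Date): Date {
  // Start from 04:15 UTC on the same calendar day as `from`.
  const next = new Date(
    Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), from.getUTCDate(), 4, 15, 0, 0),
  );
  // getUTCDay(): 0 = Sunday. Move forward to the nearest Sunday (possibly today).
  const daysUntilSunday = (7 - next.getUTCDay()) % 7;
  next.setUTCDate(next.getUTCDate() + daysUntilSunday);
  // If that slot is not in the future, the next run is a week later.
  if (next.getTime() <= from.getTime()) {
    next.setUTCDate(next.getUTCDate() + 7);
  }
  return next;
}
```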

The hash-gating means a weekly run is cheap: if nothing changed, no embeddings are produced. Typical Sunday run touches ~40 chunks out of 3,360 (a few NS scripts got updated, one ADR landed, the guide had a section edit).

Manual refresh is also available: POST /admin/knowledge-corpus/refresh with optional ?source_type=adr filter to scope.
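
A scoped refresh call could be assembled like this. The path and the source_type query parameter come from the line above; the base URL is a placeholder, the set of accepted source_type values is assumed to match the four source types, and any authentication the endpoint requires is not shown because it is not described here. buildRefreshRequest is a hypothetical helper.

```typescript
type SourceType = "manifest" | "saved_search" | "adr" | "guide";

// Builds the method + URL for a manual refresh; omit sourceType for a full refresh.
function buildRefreshRequest(
  baseUrl: string,
  sourceType?: SourceType,
): { method: "POST"; url: string } {
  const url = new URL("/admin/knowledge-corpus/refresh", baseUrl);
  if (sourceType) url.searchParams.set("source_type", sourceType);
  return { method: "POST", url: url.toString() };
}

// Usage (placeholder host):
// const req = buildRefreshRequest("https://example.invalid", "adr");
// await fetch(req.url, { method: req.method });
```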

Why the hash gate matters

Without it, every Sunday would burn ~3,360 BGE-base calls. With it, we typically spend on 1-3% of that. Over a year that's a meaningful AI-cost reduction and lets us refresh more often (weekly vs monthly) at the same budget.
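
The arithmetic behind that claim, using only the figures quoted in this section (3,360 chunks, roughly 40 changed per typical Sunday) and assuming 52 weekly runs per year with one embedding call per chunk:

```typescript
const TOTAL_CHUNKS = 3360;     // corpus size from the composition table
const TYPICAL_CHANGED = 40;    // typical Sunday run, per the text above
const RUNS_PER_YEAR = 52;      // assumption: one run every week

// Embedding calls per year if every run embeds `perRun` chunks.
function yearlyEmbedCalls(perRun: number, runs: number = RUNS_PER_YEAR): number {
  return perRun * runs;
}

const ungated = yearlyEmbedCalls(TOTAL_CHUNKS);     // 174,720 calls
const gated = yearlyEmbedCalls(TYPICAL_CHANGED);    // 2,080 calls
const sharePercent = Math.round((TYPICAL_CHANGED / TOTAL_CHUNKS) * 1000) / 10; // 1.2
```

A typical run therefore lands at about 1.2% of a full re-embed, inside the 1-3% range quoted above.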

Outcomes

What the substrate enables

Any chat tool can ask "what do I know about X" and get focused structural context.
New AI tools don't have to inline manifest or guide prose — they call the search tool.
Wiki updates land in the corpus next Sunday automatically — no code change needed.
When NS schema changes (new field, new SS), the next manifest sync picks it up and the next corpus refresh embeds it.
Retrieval evals (in eval/knowledge-retrieval.yaml) gate corpus changes — if a known-good query stops returning the right chunk, the build fails.
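
The retrieval gate in the last bullet can be pictured as a loop over known-good queries. The case shape below (query, expected_chunk_id, optional top_k) is an assumed structure, not the actual format of eval/knowledge-retrieval.yaml, and runRetrievalEvals and SearchFn are hypothetical names.

```typescript
interface EvalCase { query: string; expected_chunk_id: string; top_k?: number }

// A search function that returns ranked chunk_ids for a query.
type SearchFn = (query: string, topK: number) => string[];

// Returns the cases whose expected chunk is missing from the top-K results.
// A build gate would fail when this list is non-empty.
function runRetrievalEvals(cases: EvalCase[], search: SearchFn): EvalCase[] {
  return cases.filter((c) => {
    const ids = search(c.query, c.top_k ?? 8);
    return ids.indexOf(c.expected_chunk_id) === -1;
  });
}
```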

Failure modes

What can go wrong

Vectorize / D1 drift

If embed succeeds but D1 row update fails (or vice versa), counts diverge. Verify pass catches this and refuses to swap. Manual recovery: scripts/reconcile-corpus.mjs walks the diff and re-syncs.
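
The diff that a reconcile step has to compute is a pair of set differences between the chunk_ids in D1 and the vector_ids in Vectorize. The sketch below shows only that comparison; it is not the contents of scripts/reconcile-corpus.mjs, and diffCorpus and its field names are hypothetical.

```typescript
interface CorpusDiff {
  missingVectors: string[]; // rows in D1 with no vector: need (re-)embedding
  orphanVectors: string[];  // vectors with no D1 row: candidates for deletion
}

function diffCorpus(d1Ids: string[], vectorIds: string[]): CorpusDiff {
  const inD1 = new Set(d1Ids);
  const inVectorize = new Set(vectorIds);
  return {
    missingVectors: d1Ids.filter((id) => !inVectorize.has(id)),
    orphanVectors: vectorIds.filter((id) => !inD1.has(id)),
  };
}
```

Equal counts alone do not prove the two stores agree, since one missing vector and one orphan vector cancel out; comparing ids catches that case.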

Stale manifest export

If the upstream NS manifest export hasn't run, the corpus build falls back to the previous export, so the rebuilt corpus reflects stale data. The daily digest flags manifest-export staleness so we catch it before the Sunday rebuild.

BGE-base rate limit

Mass re-embeds (when content changes everywhere — e.g., after a chunk-format rewrite) can hit Workers AI rate caps. The script paces calls at 50/s and retries with backoff.
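
One simple way to get pacing plus retry is a sequential loop with a fixed gap between calls and exponential backoff on failure. The 50/s rate is from the text above; the backoff base (200 ms), cap (5 s), retry count, and the names embedPaced and backoffDelayMs are assumptions for illustration, and the build script may pace its calls differently.

```typescript
// Exponential backoff: base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 5000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls embedOne for each body, one at a time, never faster than `perSecond`.
// A failed call is retried up to `maxRetries` times before the error is rethrown.
async function embedPaced<T>(
  bodies: string[],
  embedOne: (body: string) => Promise<T>,
  perSecond = 50,
  maxRetries = 4,
): Promise<T[]> {
  const results: T[] = [];
  const intervalMs = 1000 / perSecond; // 20 ms gap at 50/s
  for (const body of bodies) {
    for (let attempt = 0; ; attempt++) {
      try {
        results.push(await embedOne(body));
        break;
      } catch (err) {
        if (attempt >= maxRetries) throw err;
        await sleep(backoffDelayMs(attempt));
      }
    }
    await sleep(intervalMs);
  }
  return results;
}
```

Because the loop waits for each call to finish before sleeping, the effective rate stays below the configured ceiling whenever a call takes measurable time.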

For developers

Code paths + invariants

Concern	Where
Schema	migrations/schema/124_knowledge_corpus.sql
Build script	scripts/build-knowledge-corpus.mjs
Search tool	src/chat_tools/impls.ts search_netsuite_knowledge
Vectorize namespace	ns_knowledge
Embedding model	Workers AI BGE-base
Hash gate	content_hash SHA-256 — re-embed only on change
Cron schedule	15 4 * * SUN — weekly Sunday rebuild
Retrieval eval	eval/knowledge-retrieval.yaml — gates builds

Changelog

Dated trail · spot stale claims

Dated trail of when this doc was last touched, what changed, and what to look at if it feels stale.

Date	Round	Change	Touched by
`2026-05-26`	`R586`	Added CHANGELOG · SCHEMA · RUNBOOK · BACKLOG sections — wiki became best-in-class operating documentation.	Mike + Claude
`2026-05-25`	`R584/R585`	Wiki originally shipped — 8-section structure (hero / what / when / steps / outcomes / failure-modes / related / for-developers).	Mike + Claude

If today is more than 60 days past the latest changelog row, treat live system behavior as the source of truth. The doc may have drifted — verify against the workflow contract in workflow_definitions WHERE workflow_type='knowledge_corpus_substrate' before acting on these claims.
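
The 60-day rule can be checked mechanically. The helper below treats changelog dates as UTC calendar dates in YYYY-MM-DD form, matching the rows in the table above; isDocPossiblyStale is a hypothetical name.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when `today` is more than `thresholdDays` past the latest changelog date.
function isDocPossiblyStale(latestChangelogDate: string, today: Date, thresholdDays = 60): boolean {
  const latest = Date.parse(latestChangelogDate + "T00:00:00Z");
  const elapsedDays = Math.floor((today.getTime() - latest) / MS_PER_DAY);
  return elapsedDays > thresholdDays;
}
```

With the latest row dated 2026-05-26, the doc counts as possibly stale from 2026-07-26 onward.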

Schema · data contract

The machine-readable spec

Canonical fields, table names, endpoint signatures. What code should match, what tests should assert. workflow_type: knowledge_corpus_substrate · risk_level: N/A (substrate).

Inputs (required + optional)

Field	Type	Description
`query`	`string`	Natural-language question. Required.
`k`	`integer`	Top-k chunks to return; default 8.
`namespace`	`string?`	Optional filter: 'platform_docs', 'adrs', 'audits'.

D1 tables written

Table	Operation	Trigger
`kb_chunks`	INSERT (during rebuild only)	Chunk + embedding pointer
`kb_query_log`	INSERT (every query)	What got asked, what we returned

Endpoints called

Method	Path	Purpose
`POST`	`/api/kb/query`	Vector + keyword hybrid search
`POST`	`/api/kb/rebuild`	Full corpus rebuild (cron + manual)
`GET`	`/api/kb/stats`	Chunk count + last rebuild timestamp

Events fired

event_type	When	Subscribers
`kb.rebuilt`	After full rebuild completes	audit only
`kb.query.miss`	When result_count = 0	improve corpus or query rewriting

Runbook · when it breaks

It broke at 2am — what now

Different from "how do I use this." This is the page Mike pulls up when something is wrong: logs to check, recovery steps, who to escalate to.

Scenario · Chat answers "no relevant context found" when relevant docs exist

Either the doc wasn't ingested or the embedding similarity is too low.

Check ingestion: SELECT COUNT(*) FROM kb_chunks WHERE source_path LIKE '%<filename>%'
If 0: Rebuild: POST /api/kb/rebuild — and check whether the file extension is in the allowlist (.md, .html, .json).
If >0 but search misses: Inspect Vectorize index; chunks may have stale embeddings.

Scenario · Stale answers — chat cites an old version of CLAUDE.md

Corpus not rebuilt since last edit.

Check freshness: SELECT MAX(ingested_at) FROM kb_chunks vs the file's mtime.
Rebuild: POST /api/kb/rebuild — takes ~3-5 min for 3,360 chunks.
Auto-trigger: Cron rebuilds weekly (15 4 * * SUN, Sunday 04:15 UTC); run a manual rebuild for urgent updates.

Scenario · Vectorize quota exceeded — rebuild fails

CF Vectorize has per-account vector limits.

Check usage: Cloudflare dashboard → Vectorize → index sizes.
Cull: Drop low-value namespaces (e.g. old audit reports) to free vectors.
Long-term: Move to chunk pruning — only keep most-cited chunks.

Logs to check

workflow_run_log · top-level run audit
workflow_step_log · per-step trace
workflow_verify_results · post-window verify outcomes
cron_locks · stuck cron lock detection
events · workflow.completed / workflow.failed event trail
reflexion_log · per-run narrative (if reflexion_enabled)
npx wrangler tail · live Worker logs

Kill switch · emergency stop

If this workflow is misbehaving in a high-impact way (creating bad proposed_actions in volume, pushing wrong things to NS), flip a kill switch:

kill:ns_writes · stops every NS push platform-wide
kill:proposed_apply · stops HITL approvals from executing fan-out
kill:high_risk_ops · stops risk_level >= 4 fan-out

See kill-switches-state-machine.html for the full state machine + recovery procedure.

Escalation

Primary: Mike Levine (single-admin) · mikelevine@globalfoodsolutions.co. For prolonged outage during business hours, notify warehouse lead + accounting lead so they can defer dependent work.

Backlog · open questions

What's not done · what's uncertain

What's not done, what's uncertain, what we punted. Captured so it survives context switches and doesn't die in someone's head.

STUB · Full source-walking rebuild requires Mike's terminal
Current Worker-resident rebuild is a partial port. Walking all repo files (specs, attachments, NS exports) still requires running a script on Mike's machine, not the Worker.

OPEN · Per-namespace re-indexing
Today rebuild is all-or-nothing. Should be able to rebuild just 'adrs' without touching 'specs'.

DEFER · Citation in chat answers
Chat returns relevant chunks but doesn't always inline citation. Should always say "per ADR-031..."

DECISION · Should rebuilt content emit kb.rebuilt event with diff?
Today event has no diff payload. Knowing what NEW chunks were added would help debug stale answers.