One retrievable surface for everything structural
The knowledge corpus is the platform's RAG (retrieval-augmented generation) substrate. Every piece of structural knowledge the AI might need — what NetSuite records exist, what saved searches do what, which ADRs decided which architectural questions, how the system guide explains a feature — lives as a chunk in the knowledge_chunks D1 table and as a vector in the ns_knowledge Vectorize namespace.
Before the corpus, each chat tool had its own ad-hoc context source: one tool would inline the manifest, another would re-read the system guide, a third would hard-code ADR names. The corpus consolidates all of that into a single retrievable surface. A chat tool says "what do I know about NetSuite Saved Search #1247?" and gets back a focused chunk with the right metadata.
Schema: migrations/schema/124_knowledge_corpus.sql, landed R555 / Phase 57. The single Vectorize index uses BGE-base embeddings (cheap, fast, plenty accurate for structural retrieval).
What's in the 3,360 chunks
| Source type | Count | What it covers |
|---|---|---|
| manifest | 2,361 | NS records, fields, lists, reports, roles, scripts, suitelets, workflows, dashboards, forms, KPIs |
| saved_search | 963 | Saved-search definitions: title, filters, columns, used_by_roles |
| adr | 32 | Architecture decision records from data/decisions.json |
| guide | 4 | Long-form sections from the system guide / wiki prose |
Each chunk has: chunk_id (e.g. manifest:saved_search:1247), source_type, source_id, title, body (the indexable text), metadata_json (category, tags, NS id), content_hash (SHA-256, gates re-embedding), vector_id (= chunk_id in Vectorize).
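As a rough sketch, the chunk shape described above could be typed as follows. The field names come from the list; the TypeScript types, the hex encoding of content_hash, and the checkChunkInvariants helper are assumptions for illustration, not the actual schema in migrations/schema/124_knowledge_corpus.sql.

```typescript
// Source types as listed in the chunk-count table.
type SourceType = "manifest" | "saved_search" | "adr" | "guide";

// Field names follow the page; the types are assumed.
interface KnowledgeChunk {
  chunk_id: string;      // deterministic, e.g. "manifest:saved_search:1247"
  source_type: SourceType;
  source_id: string;
  title: string;
  body: string;          // the indexable text
  metadata_json: string; // serialized JSON: category, tags, NS id
  content_hash: string;  // SHA-256 of body (hex encoding is an assumption)
  vector_id: string;     // equals chunk_id in Vectorize
}

// Hypothetical helper: returns a list of violated invariants (empty when the chunk is consistent).
function checkChunkInvariants(chunk: KnowledgeChunk): string[] {
  const problems: string[] = [];
  if (chunk.vector_id !== chunk.chunk_id) {
    problems.push("vector_id must equal chunk_id");
  }
  if (!/^[0-9a-f]{64}$/.test(chunk.content_hash)) {
    problems.push("content_hash is not a 64-character hex SHA-256 digest");
  }
  try {
    JSON.parse(chunk.metadata_json);
  } catch {
    problems.push("metadata_json is not valid JSON");
  }
  return problems;
}
```

A check like this could run in the build's verify pass, since vector_id = chunk_id is what makes the join from Vectorize back to D1 possible.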
search_netsuite_knowledge
The corpus is consumed via the search_netsuite_knowledge chat tool. The chairman selects it when a question is about structural NS knowledge — "which saved search shows me past-due AR by customer", "what ADR decided the HITL pattern", "is there a Suitelet for vendor cost upload".
Pipeline: query → BGE-base embed → Vectorize top-K lookup (default K=8) → join back to knowledge_chunks for body + metadata → return ranked. Filters can narrow by source_type when the chairman knows the answer is structural vs prose.
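The pipeline above can be sketched roughly as below. This is not the code in src/chat_tools/impls.ts: the SearchDeps interface and every function name here are hypothetical stand-ins for the Workers AI, Vectorize, and D1 calls, and passing the source_type filter to the vector query (rather than applying it in D1) is an assumption.

```typescript
interface VectorMatch { id: string; score: number; }

interface ChunkRow {
  chunk_id: string;
  source_type: string;
  title: string;
  body: string;
  metadata_json: string;
}

interface RankedChunk extends ChunkRow { score: number; }

// Hypothetical dependency seams; real bindings would sit behind these.
interface SearchDeps {
  embed(text: string): Promise<number[]>;
  queryVectors(vector: number[], topK: number, sourceTypes?: string[]): Promise<VectorMatch[]>;
  loadChunks(chunkIds: string[]): Promise<ChunkRow[]>;
}

// Join vector matches back to chunk rows, keeping the vector ranking.
// Matches with no corresponding row are dropped.
function joinRanked(matches: VectorMatch[], rows: ChunkRow[]): RankedChunk[] {
  const byId = new Map<string, ChunkRow>();
  for (const row of rows) byId.set(row.chunk_id, row);
  const ranked: RankedChunk[] = [];
  for (const match of matches) {
    const row = byId.get(match.id);
    if (row) ranked.push({ ...row, score: match.score });
  }
  return ranked;
}

async function searchKnowledge(
  deps: SearchDeps,
  query: string,
  opts: { topK?: number; sourceTypes?: string[] } = {},
): Promise<RankedChunk[]> {
  const topK = opts.topK ?? 8; // default K=8, per the pipeline description
  const vector = await deps.embed(query);
  const matches = await deps.queryVectors(vector, topK, opts.sourceTypes);
  const rows = await deps.loadChunks(matches.map((m) => m.id));
  return joinRanked(matches, rows);
}
```

Keeping the join as a separate pure function makes the ranking behavior easy to test without touching any binding.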
"Which ADR decided we use D1 as the mirror?"
Mike asks the chairman: "What ADR explains why D1 is the mirror and NS is system of record?"
The chairman picks search_netsuite_knowledge with filter source_type=['adr']. BGE-base embeds the query, Vectorize returns top-5. The top result is ADR-012 — D1 as read mirror, NS as system of record. The chunk includes the decision text, the alternatives considered, the date, and the link to the canonical version in data/decisions.json. The chairman narrates the answer with the citation embedded.
Wall-clock: ~280ms (BGE embed) + ~80ms (Vectorize) + ~10ms (D1 join) = ~370ms.
How chunks get built and embedded
1. Source scan. scripts/build-knowledge-corpus.mjs walks the four source surfaces: system-manifest.json, the saved-search export, data/decisions.json, and the wiki prose files in R2.
2. Chunk synthesis. Each source item becomes one chunk. The chunk_id is deterministic (manifest:saved_search:1247). The body is the indexable text: for an SS it's title + filters + columns serialized; for an ADR it's the prose. SHA-256 of body becomes content_hash.
3. Hash-gated embed. If content_hash matches the existing row, skip it; there is no need to re-embed. Otherwise call Workers AI BGE-base, store the vector in Vectorize with vector_id = chunk_id, and update embedded_at.
4. Verify + swap. After all chunks are processed, a verify pass confirms the Vectorize count matches D1 count and a random-sample query returns sensible results. If verify passes, the new corpus is live.
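The hash gate and the count check from the steps above might look something like this. It is a minimal sketch, not the contents of scripts/build-knowledge-corpus.mjs: planEmbeds, verifyCounts, and the SourceItem shape are hypothetical, and hex encoding of the digest is assumed.

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the chunk body, hex-encoded (encoding assumed).
function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

interface SourceItem { chunk_id: string; body: string; }

interface EmbedPlan {
  toEmbed: SourceItem[]; // new or changed chunks that need a fresh vector
  skipped: string[];     // chunk_ids whose content_hash is unchanged
}

// Compare each item's hash with the stored content_hash and decide what to embed.
function planEmbeds(items: SourceItem[], existingHashes: Map<string, string>): EmbedPlan {
  const plan: EmbedPlan = { toEmbed: [], skipped: [] };
  for (const item of items) {
    if (existingHashes.get(item.chunk_id) === sha256Hex(item.body)) {
      plan.skipped.push(item.chunk_id);
    } else {
      plan.toEmbed.push(item);
    }
  }
  return plan;
}

// First half of the verify pass: D1 and Vectorize must hold the same number of entries.
function verifyCounts(d1Count: number, vectorCount: number): boolean {
  return d1Count === vectorCount;
}
```

Only the items in toEmbed would be sent to BGE-base; the random-sample query half of the verify pass is not shown.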
Cron: 15 4 * * SUN
The corpus refreshes once a week — Sunday at 4:15am UTC. Most sources (NS manifest, saved searches, ADRs) don't change frequently enough to warrant nightly rebuilds, and BGE-base embedding has per-token cost that adds up if you over-rebuild.
The hash-gating means a weekly run is cheap: if nothing changed, no embeddings are produced. Typical Sunday run touches ~40 chunks out of 3,360 (a few NS scripts got updated, one ADR landed, the guide had a section edit).
Manual refresh is also available: POST /admin/knowledge-corpus/refresh with optional ?source_type=adr filter to scope.
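For illustration, a scoped manual refresh could be triggered like this. The path and the source_type query parameter come from the line above; the base URL is a placeholder, and the sketch leaves out whatever authentication the admin route requires, since that is not described here.

```typescript
// Build the refresh URL, optionally scoped to one source type.
function buildRefreshUrl(baseUrl: string, sourceType?: string): string {
  const url = new URL("/admin/knowledge-corpus/refresh", baseUrl);
  if (sourceType) url.searchParams.set("source_type", sourceType);
  return url.toString();
}

// Fire the refresh and return the HTTP status.
// Assumption: no auth headers are shown; a real admin endpoint likely needs them.
async function refreshCorpus(baseUrl: string, sourceType?: string): Promise<number> {
  const response = await fetch(buildRefreshUrl(baseUrl, sourceType), { method: "POST" });
  return response.status;
}

// Example (placeholder host): refreshCorpus("https://platform.example", "adr")
```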
Without the hash gate, every Sunday run would burn ~3,360 BGE-base calls. With it, we typically spend 1-3% of that. Over a year that's a meaningful AI-cost reduction, and it lets us refresh more often (weekly vs monthly) at the same budget.
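The savings can be checked with back-of-envelope arithmetic using the figures quoted in this section (3,360 chunks, ~40 touched on a typical Sunday, 52 runs a year). It assumes one embedding call per chunk and that every week looks like the typical one.

```typescript
const TOTAL_CHUNKS = 3360;        // corpus size
const TYPICAL_CHANGED = 40;       // chunks touched on a typical Sunday
const RUNS_PER_YEAR = 52;         // weekly cron

// Embedding calls per year, assuming one call per embedded chunk.
function yearlyEmbedCalls(chunksPerRun: number, runsPerYear: number): number {
  return chunksPerRun * runsPerYear;
}

const ungatedCalls = yearlyEmbedCalls(TOTAL_CHUNKS, RUNS_PER_YEAR);  // 174720
const gatedCalls = yearlyEmbedCalls(TYPICAL_CHANGED, RUNS_PER_YEAR); // 2080
const gatedShare = TYPICAL_CHANGED / TOTAL_CHUNKS;                   // about 0.0119, i.e. ~1.2%

console.log({ ungatedCalls, gatedCalls, gatedSharePercent: (gatedShare * 100).toFixed(1) });
```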
What the substrate enables
- Any chat tool can ask "what do I know about X" and get focused structural context.
- New AI tools don't have to inline manifest or guide prose — they call the search tool.
- Wiki updates land in the corpus next Sunday automatically — no code change needed.
- When NS schema changes (new field, new SS), the next manifest sync picks it up and the next corpus refresh embeds it.
- Retrieval evals (in eval/knowledge-retrieval.yaml) gate corpus changes: if a known-good query stops returning the right chunk, the build fails.
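The retrieval-eval gate in the last bullet could work along these lines. The layout of eval/knowledge-retrieval.yaml is not shown here, so the EvalCase shape, failedCases, and gateBuild are all hypothetical; the sketch only illustrates "known-good query must still return the right chunk, or the build fails".

```typescript
// Hypothetical eval case: a query and the chunk it must surface within topK results.
interface EvalCase {
  query: string;
  expectedChunkId: string;
  topK: number;
}

// Given the ranked chunk ids actually returned per query, list the cases that regressed.
function failedCases(cases: EvalCase[], rankedIdsByQuery: Map<string, string[]>): EvalCase[] {
  return cases.filter((c) => {
    const ranked = rankedIdsByQuery.get(c.query) ?? [];
    return !ranked.slice(0, c.topK).includes(c.expectedChunkId);
  });
}

// Fail the build by throwing when any known-good query stops returning its chunk.
function gateBuild(failures: EvalCase[]): void {
  if (failures.length > 0) {
    const summary = failures
      .map((f) => `"${f.query}" did not return ${f.expectedChunkId}`)
      .join("; ");
    throw new Error(`retrieval eval failed: ${summary}`);
  }
}
```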
What can go wrong
If an embed succeeds but the D1 row update fails (or vice versa), the Vectorize and D1 counts diverge. The verify pass catches this and refuses to swap. Manual recovery: scripts/reconcile-corpus.mjs walks the diff and re-syncs.
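The "walks the diff" part of recovery amounts to a set difference in both directions. The sketch below is an assumption about the idea, not the contents of scripts/reconcile-corpus.mjs; diffCorpus and CorpusDiff are hypothetical names.

```typescript
interface CorpusDiff {
  missingVectors: string[]; // chunk_ids present in D1 with no vector in Vectorize
  orphanVectors: string[];  // vector_ids in Vectorize with no row in D1
}

// Because vector_id = chunk_id, the two id lists can be compared directly.
function diffCorpus(d1ChunkIds: string[], vectorIds: string[]): CorpusDiff {
  const inD1 = new Set(d1ChunkIds);
  const inVectorize = new Set(vectorIds);
  return {
    missingVectors: d1ChunkIds.filter((id) => !inVectorize.has(id)),
    orphanVectors: vectorIds.filter((id) => !inD1.has(id)),
  };
}
```

Missing vectors would then be re-embedded and orphan vectors deleted (or their rows restored), depending on which side is treated as authoritative.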
If the upstream NS manifest export hasn't run, the corpus build uses yesterday's data. Daily digest flags manifest-export staleness so we catch it before the Sunday rebuild.
Mass re-embeds (when content changes everywhere — e.g., after a chunk-format rewrite) can hit Workers AI rate caps. The script paces calls at 50/s and retries with backoff.
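Pacing with retry-and-backoff can be sketched as below. The 50/s rate is from the paragraph above; the sequential loop, the exponential schedule, the base and cap delays, and the retry limit are illustrative assumptions rather than the script's actual settings.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Exponential backoff: baseMs, 2x, 4x, ... capped at capMs (values assumed).
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 10000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Run embedOne over items one at a time, starting at most perSecond calls per second,
// and retrying each failed call up to maxRetries times with backoff.
async function embedPaced<T, R>(
  items: T[],
  embedOne: (item: T) => Promise<R>,
  perSecond = 50,
  maxRetries = 5,
): Promise<R[]> {
  const results: R[] = [];
  const intervalMs = 1000 / perSecond; // 20 ms between call starts at 50/s
  for (const item of items) {
    const startedAt = Date.now();
    for (let attempt = 0; ; attempt++) {
      try {
        results.push(await embedOne(item));
        break;
      } catch (err) {
        if (attempt >= maxRetries) throw err; // give up after the last retry
        await sleep(backoffDelayMs(attempt));
      }
    }
    const elapsed = Date.now() - startedAt;
    if (elapsed < intervalMs) await sleep(intervalMs - elapsed);
  }
  return results;
}
```

A production version would probably batch or run calls concurrently under a token bucket; the sequential form is used here only because it makes the rate bound easy to see.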
Code paths + invariants
| Concern | Where |
|---|---|
| Schema | migrations/schema/124_knowledge_corpus.sql |
| Build script | scripts/build-knowledge-corpus.mjs |
| Search tool | src/chat_tools/impls.ts (search_netsuite_knowledge) |
| Vectorize namespace | ns_knowledge |
| Embedding model | Workers AI BGE-base |
| Hash gate | content_hash SHA-256 — re-embed only on change |
| Cron schedule | 15 4 * * SUN — weekly Sunday rebuild |
| Retrieval eval | eval/knowledge-retrieval.yaml — gates builds |