Wiki · Substrate piece

Knowledge corpus

How the AI knows what it knows. 3,360 chunks unifying NetSuite manifest, saved searches, ADRs, and prose guide — embedded into one Vectorize namespace and queried by every chat tool that needs structural context.

Real · R555 Phase 57

What this is

One retrievable surface for everything structural

The knowledge corpus is the platform's RAG (retrieval-augmented generation) substrate. Every piece of structural knowledge the AI might need — what NetSuite records exist, what saved searches do what, which ADRs decided which architectural questions, how the system guide explains a feature — lives as a chunk in the knowledge_chunks D1 table and as a vector in the ns_knowledge Vectorize namespace.

Before the corpus, each chat tool had its own ad-hoc context source: one tool would inline the manifest, another would re-read the system guide, a third would hard-code ADR names. The corpus consolidates all of that into a single retrievable surface. A chat tool says "what do I know about NetSuite Saved Search #1247?" and gets back a focused chunk with the right metadata.

Schema: migrations/schema/124_knowledge_corpus.sql, landed R555 / Phase 57. The single Vectorize index uses BGE-base embeddings (cheap, fast, plenty accurate for structural retrieval).

Composition

What's in the 3,360 chunks

Total chunks: 3,360 (unified surface)
Manifest: 2,361 (NS records / fields / scripts)
Saved search: 963 (SS definitions)
ADR + guide: 36 (32 ADRs + 4 guide sections)

| Source type | Count | What it covers |
| --- | --- | --- |
| manifest | 2,361 | NS records, fields, lists, reports, roles, scripts, suitelets, workflows, dashboards, forms, KPIs |
| saved_search | 963 | Saved-search definitions: title, filters, columns, used_by_roles |
| adr | 32 | Architecture decision records from data/decisions.json |
| guide | 4 | Long-form sections from the system guide / wiki prose |

Each chunk has: chunk_id (e.g. manifest:saved_search:1247), source_type, source_id, title, body (the indexable text), metadata_json (category, tags, NS id), content_hash (SHA-256, gates re-embedding), vector_id (= chunk_id in Vectorize).
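
As a rough illustration of the row shape listed above, the fields could be typed as follows. The field names come from the list; the TypeScript types, the hex encoding of the hash, and the makeChunkId helper are assumptions for the sketch, not the actual schema in 124_knowledge_corpus.sql.

```typescript
// Sketch of one corpus row. Field names follow the list above; the types
// and the helper below are assumptions, not the real schema.
type SourceType = "manifest" | "saved_search" | "adr" | "guide";

interface KnowledgeChunk {
  chunk_id: string;       // e.g. "manifest:saved_search:1247"
  source_type: SourceType;
  source_id: string;
  title: string;
  body: string;           // the indexable text
  metadata_json: string;  // serialized category, tags, NS id
  content_hash: string;   // SHA-256 of body (hex encoding assumed)
  vector_id: string;      // same value as chunk_id in Vectorize
}

// Hypothetical helper: deterministic ids built from colon-joined segments.
function makeChunkId(...segments: string[]): string {
  return segments.join(":");
}

const exampleId = makeChunkId("manifest", "saved_search", "1247");
```

Keeping vector_id equal to chunk_id means a Vectorize match can be joined back to its D1 row without a lookup table.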

The chat tool

search_netsuite_knowledge

The corpus is consumed via the search_netsuite_knowledge chat tool. The chairman selects it when a question is about structural NS knowledge — "which saved search shows me past-due AR by customer", "what ADR decided the HITL pattern", "is there a Suitelet for vendor cost upload".

Pipeline: query → BGE-base embed → Vectorize top-K lookup (default K=8) → join back to knowledge_chunks for body + metadata → return ranked. Filters can narrow by source_type when the chairman already knows whether the answer lives in structural sources (manifest, saved_search) or in prose (adr, guide).

// Example tool invocation
search_netsuite_knowledge({
  query: "saved search past due ar by customer",
  filters: { source_type: ["saved_search"] },
  top_k: 5
})
// Returns: ranked chunks with title, body, metadata, ns_id link

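The pipeline behind that call can be sketched with the three backends injected as plain functions. This is a minimal sketch, not the code in src/chat_tools/impls.ts: SearchDeps, joinRanked, searchKnowledge, and all the type names are hypothetical, and the real Workers AI, Vectorize, and D1 calls are deliberately left behind the interface.

```typescript
// Hypothetical shapes: a vector match (id + similarity score) and a D1 row.
interface VectorMatch { id: string; score: number; }
interface ChunkRow { chunk_id: string; source_type: string; title: string; body: string; }
interface RankedChunk extends ChunkRow { score: number; }

// Injected dependencies stand in for Workers AI, Vectorize, and D1.
interface SearchDeps {
  embed(query: string): Promise<number[]>;
  topK(vector: number[], k: number, sourceTypes?: string[]): Promise<VectorMatch[]>;
  fetchChunks(ids: string[]): Promise<ChunkRow[]>;
}

// Join vector matches back to their rows, ordered by descending score.
function joinRanked(matches: VectorMatch[], rows: ChunkRow[]): RankedChunk[] {
  const byId = new Map(rows.map((r): [string, ChunkRow] => [r.chunk_id, r]));
  const ranked: RankedChunk[] = [];
  for (const m of [...matches].sort((a, b) => b.score - a.score)) {
    const row = byId.get(m.id);
    if (row) ranked.push({ ...row, score: m.score }); // drop matches with no D1 row
  }
  return ranked;
}

async function searchKnowledge(
  deps: SearchDeps,
  query: string,
  opts: { sourceTypes?: string[]; topK?: number } = {},
): Promise<RankedChunk[]> {
  const vector = await deps.embed(query);
  const matches = await deps.topK(vector, opts.topK ?? 8, opts.sourceTypes); // default K=8
  const rows = await deps.fetchChunks(matches.map((m) => m.id));
  return joinRanked(matches, rows);
}
```

Silently dropping a match that has no D1 row is a choice made for this sketch; a real implementation might prefer to log it, since it indicates Vectorize / D1 drift.
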
Worked example

"Which ADR decided we use D1 as the mirror?"

Scenario

Mike asks the chairman: "What ADR explains why D1 is the mirror and NS is system of record?"

The chairman picks search_netsuite_knowledge with filter source_type=['adr']. BGE-base embeds the query, Vectorize returns top-5. The top result is ADR-012 — D1 as read mirror, NS as system of record. The chunk includes the decision text, the alternatives considered, the date, and the link to the canonical version in data/decisions.json. The chairman narrates the answer with the citation embedded.

Wall-clock: ~280ms (BGE embed) + ~80ms (Vectorize) + ~10ms (D1 join) = ~370ms.

Step-by-step ingestion

How chunks get built and embedded

  1. Source scan

    scripts/build-knowledge-corpus.mjs walks the four source surfaces: system-manifest.json, the saved-search export, data/decisions.json, and the wiki prose files in R2.

    Reads: manifest, SS exports, ADR JSON, R2 wiki

  2. Chunk synthesis

    Each source item becomes one chunk. The chunk_id is deterministic (manifest:saved_search:1247). The body is the indexable text: for an SS it's title + filters + columns serialized; for an ADR it's the prose. SHA-256 of body becomes content_hash.

    Writes: knowledge_chunks staging

  3. Hash-gated embed

    If content_hash matches the existing row, skip; there is no need to re-embed. Otherwise call Workers AI BGE-base, store the vector in Vectorize with vector_id = chunk_id, and update embedded_at.

    Writes: Vectorize ns_knowledge, knowledge_chunks.embedded_at
    Cost: only on changed content

  4. Verify + swap

    After all chunks are processed, a verify pass confirms the Vectorize count matches D1 count and a random-sample query returns sensible results. If verify passes, the new corpus is live.

    Time: ~3 min full rebuild on a quiet day

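
Steps 2 and 3 above hinge on the content hash. The sketch below shows one way the gate could work, assuming a Node runtime (the build script is an .mjs file) and a hex-encoded digest. contentHash, planEmbeds, and the Candidate shape are hypothetical names, not taken from scripts/build-knowledge-corpus.mjs.

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the chunk body, hex-encoded (the encoding is an assumption).
function contentHash(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}

interface Candidate { chunk_id: string; body: string; }

// Split freshly synthesized chunks into those that need embedding and those
// whose stored hash already matches. `storedHashes` maps chunk_id to the
// content_hash currently in knowledge_chunks.
function planEmbeds(
  candidates: Candidate[],
  storedHashes: Map<string, string>,
): { toEmbed: Candidate[]; skipped: string[] } {
  const toEmbed: Candidate[] = [];
  const skipped: string[] = [];
  for (const c of candidates) {
    if (storedHashes.get(c.chunk_id) === contentHash(c.body)) {
      skipped.push(c.chunk_id); // unchanged body: no embedding call
    } else {
      toEmbed.push(c); // new chunk, or body changed since the last run
    }
  }
  return { toEmbed, skipped };
}
```

Only the toEmbed list would go on to the embedding call, which is why an unchanged corpus costs nothing to refresh.
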
Refresh schedule

Cron: 15 4 * * SUN

The corpus refreshes once a week, on Sunday at 04:15 UTC. Most sources (NS manifest, saved searches, ADRs) don't change frequently enough to warrant nightly rebuilds, and BGE-base embedding has a per-token cost that adds up if you over-rebuild.

The hash-gating means a weekly run is cheap: if nothing changed, no embeddings are produced. Typical Sunday run touches ~40 chunks out of 3,360 (a few NS scripts got updated, one ADR landed, the guide had a section edit).

Manual refresh is also available: POST /admin/knowledge-corpus/refresh with optional ?source_type=adr filter to scope.
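
A manual refresh request could be assembled like this. The path and the source_type query parameter come from the line above; the base URL is a placeholder, buildRefreshRequest is a hypothetical helper, and whatever authentication the admin route requires is not shown.

```typescript
// Build the manual-refresh request. The base URL is a placeholder and auth
// is omitted; both are assumptions for this sketch.
function buildRefreshRequest(
  baseUrl: string,
  sourceType?: string,
): { method: "POST"; url: string } {
  const url = new URL("/admin/knowledge-corpus/refresh", baseUrl);
  if (sourceType) url.searchParams.set("source_type", sourceType);
  return { method: "POST", url: url.toString() };
}

const req = buildRefreshRequest("https://example.invalid", "adr");
// fetch(req.url, { method: req.method }) would send it.
```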

Why the hash gate matters

Without it, every Sunday would burn ~3,360 BGE-base calls. With it, we typically spend on 1-3% of that. Over a year that's a meaningful AI-cost reduction and lets us refresh more often (weekly vs monthly) at the same budget.
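
The arithmetic behind that claim, using the figures quoted on this page and assuming one embedding call per chunk and 52 runs a year:

```typescript
const TOTAL_CHUNKS = 3360;   // corpus size from the composition table
const TYPICAL_CHANGED = 40;  // chunks touched on a typical Sunday run
const RUNS_PER_YEAR = 52;    // weekly cron (assumes no manual refreshes)

const ungatedCalls = TOTAL_CHUNKS * RUNS_PER_YEAR;   // 174,720 calls per year
const gatedCalls = TYPICAL_CHANGED * RUNS_PER_YEAR;  // 2,080 calls per year
const gatedShare = TYPICAL_CHANGED / TOTAL_CHUNKS;   // about 0.012, i.e. ~1.2%
```

At ~1.2%, a typical week sits at the low end of the 1-3% range.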

Outcomes

What the substrate enables

Failure modes

What can go wrong

Vectorize / D1 drift

If the embed succeeds but the D1 row update fails (or vice versa), the Vectorize and D1 counts diverge. The verify pass catches this and refuses to swap. Manual recovery: scripts/reconcile-corpus.mjs walks the diff and re-syncs.
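
The diff that a reconcile pass needs can be sketched as a two-way set comparison. This is an illustration only, not the logic of scripts/reconcile-corpus.mjs. diffCorpus is a hypothetical name, and how the id lists are read out of D1 and Vectorize is left out.

```typescript
// Compare the id sets on both sides. Both inputs are plain arrays here;
// fetching them from D1 and Vectorize is outside this sketch.
function diffCorpus(
  d1Ids: string[],
  vectorIds: string[],
): { missingVectors: string[]; orphanVectors: string[] } {
  const d1 = new Set(d1Ids);
  const vec = new Set(vectorIds);
  return {
    missingVectors: d1Ids.filter((id) => !vec.has(id)),   // row exists, vector does not
    orphanVectors: vectorIds.filter((id) => !d1.has(id)), // vector exists, row does not
  };
}
```

Comparing ids rather than counts matters: one missing vector and one orphan vector leave the counts equal while the two stores still disagree.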

Stale manifest export

If the upstream NS manifest export hasn't run, the corpus build falls back to the last export it has, which may be stale. The daily digest flags manifest-export staleness so we catch it before the Sunday rebuild.

BGE-base rate limit

Mass re-embeds (when content changes everywhere — e.g., after a chunk-format rewrite) can hit Workers AI rate caps. The script paces calls at 50/s and retries with backoff.
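
Pacing with retry can be sketched as below. The 50/s rate comes from the text; the sequential loop, the retry count, and the backoff base and cap are assumptions, and pacedMap and backoffDelayMs are hypothetical names rather than the script's own.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Exponential backoff with a cap; base and cap values are assumptions.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 5000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Run `task` for each item, starting at most `perSecond` items per second,
// and retry a failed call up to `maxRetries` times before giving up.
async function pacedMap<T, R>(
  items: T[],
  task: (item: T) => Promise<R>,
  perSecond = 50,
  maxRetries = 3,
): Promise<R[]> {
  const results: R[] = [];
  const intervalMs = 1000 / perSecond;
  for (const item of items) {
    const startedAt = Date.now();
    for (let attempt = 0; ; attempt++) {
      try {
        results.push(await task(item));
        break;
      } catch (err) {
        if (attempt >= maxRetries) throw err; // out of retries: surface the error
        await sleep(backoffDelayMs(attempt));
      }
    }
    // Wait out the rest of this item's time slot so starts stay under the cap.
    const elapsed = Date.now() - startedAt;
    if (elapsed < intervalMs) await sleep(intervalMs - elapsed);
  }
  return results;
}
```

A sequential loop is the simplest way to respect the cap; a concurrent token bucket would get closer to the full 50/s when individual calls are slow.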

Related

Adjacent substrate

For developers

Code paths + invariants

| Concern | Where |
| --- | --- |
| Schema | migrations/schema/124_knowledge_corpus.sql |
| Build script | scripts/build-knowledge-corpus.mjs |
| Search tool | src/chat_tools/impls.ts search_netsuite_knowledge |
| Vectorize namespace | ns_knowledge |
| Embedding model | Workers AI BGE-base |
| Hash gate | content_hash SHA-256 (re-embed only on change) |
| Cron schedule | 15 4 * * SUN (weekly Sunday rebuild) |
| Retrieval eval | eval/knowledge-retrieval.yaml (gates builds) |