06 · deep tech · technical

Embedding model, chunking, retrieval math

The knowledge vault is the answer to "what should I know about this drift signal?" The retrieval stack is a hybrid: dense vector KNN plus sparse lexical BM25, fused by Reciprocal Rank Fusion, decayed by recency rank within each author (finance only), then re-ranked by graph centrality. Four small SQLite-backed processes each serve their own HTTP API. All math is reproducible: the indexer source (the chunking, lexical, and graph-compute modules) ships inside the downloadable kit.

The embedding model

BAAI/bge-large-en-v1.5

Model	`BAAI/bge-large-en-v1.5` (Hugging Face)
Dimension	1024
Similarity metric	cosine, computed via sqlite-vec
Storage on disk	float32 little-endian blob (4096 bytes per vector)
Query-side prefix (bge convention)	`"Represent this sentence for searching relevant passages: "`
Document-side prefix	none — documents encoded raw
Why this model	Strong English retrieval benchmarks, runs on CPU at acceptable latency for a personal corpus, and 1024 dimensions is a good fit for a corpus under 100k chunks
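The storage and prefix rows above can be sketched with the standard library alone. This is a minimal illustration, not the indexer's code: the helper names (`prefix_query`, `pack_vector`, `unpack_vector`, `cosine`) are hypothetical, and the actual encoding step (the bge forward pass) is left out.

```python
import math
import struct
from typing import List

DIM = 1024
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def prefix_query(text: str) -> str:
    """Queries get the bge instruction prefix; documents are encoded raw."""
    return QUERY_PREFIX + text

def pack_vector(vec: List[float]) -> bytes:
    """Serialise to float32 little-endian: 4 bytes x 1024 dims = 4096 bytes."""
    if len(vec) != DIM:
        raise ValueError(f"expected {DIM} dims, got {len(vec)}")
    return struct.pack(f"<{DIM}f", *vec)

def unpack_vector(blob: bytes) -> List[float]:
    """Inverse of pack_vector."""
    return list(struct.unpack(f"<{DIM}f", blob))

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

In the real stack sqlite-vec computes the cosine; the pure-Python version here only makes the metric explicit.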

Chunking strategy

The chunker is a routine in the vault indexer. It is heading-aware by design, with a size target and overlap counted in whitespace words (a cheap token approximation — the config keys say "tokens", the loop counts words):

Unit	H2 section (markdown `##`)
Size target	~600 words (`chunk_target_tokens=600` — a whitespace-word approximation of tokens)
Overlap	80 words (`chunk_overlap_tokens=80`)
Algorithm	scan headings → partition per H2 → per-section size check → sliding window subdivision on long sections
Returns	`List[(text, section_label)]`
Why H2	preserves narrative context; for The Street snapshots a single ticker = one section = one chunk
Why 600 — and the 512-token caveat	most H2 sections in this corpus land well under the target, so they embed intact within bge-large's 512-token encoder window. A chunk that genuinely runs to 600 words exceeds that window: the encoder silently truncates, so the vector represents the head of the section, while FTS5 BM25 and excerpt selection still see the full chunk text. Accepted trade-off — the head of an H2 section carries its topic, and lexical recall covers the tail.
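The algorithm row above (scan headings, partition per H2, size check, sliding window) could look roughly like the sketch below. It is an approximation under stated assumptions: the function name `chunk_markdown` is hypothetical, text before the first H2 is labelled with an empty string, line breaks inside a section are collapsed to spaces, and `##` lines inside code fences are not special-cased.

```python
import re
from typing import List, Tuple

def chunk_markdown(md: str, target: int = 600, overlap: int = 80) -> List[Tuple[str, str]]:
    """Split markdown into (text, section_label) chunks, one per H2 where possible."""
    # Partition lines by H2 heading; the first bucket holds any preamble.
    sections: List[Tuple[str, List[str]]] = [("", [])]
    for line in md.splitlines():
        m = re.match(r"^##\s+(.*\S)\s*$", line)
        if m:
            sections.append((m.group(1), []))
        else:
            sections[-1][1].append(line)

    chunks: List[Tuple[str, str]] = []
    step = target - overlap
    for label, lines in sections:
        words = " ".join(lines).split()  # whitespace words, not real tokens
        if not words:
            continue
        if len(words) <= target:
            chunks.append((" ".join(words), label))
            continue
        # Long section: sliding window; each window shares `overlap` words
        # with the previous one.
        start = 0
        while True:
            chunks.append((" ".join(words[start:start + target]), label))
            if start + target >= len(words):
                break
            start += step
    return chunks
```

A 1000-word section under these defaults yields two chunks: words 0 to 599 and words 520 to 999, sharing 80 words.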

The four indexer processes

Four sibling FastAPI processes. Same code, different DOMAIN= env var, separate cache file. Physical isolation by design.

Scope derivation

Each process boots with DOMAIN=<slug> and reads _domains.yaml at the vault root. From the registry it derives INCLUDE/EXCLUDE prefixes for every walk. The finance indexer's scope is marked legacy: true, meaning it includes everything except other domains' subfolders (auto-derived).

# _domains.yaml (abridged)
finance:
  legacy: true
  taxonomy: _taxonomy.md
  review_queue: _review-queue.md
  embedding: BAAI/bge-large-en-v1.5
  embedding_dim: 1024
  decay:
    mode: ranked_grouped
    ladder: [1.0, 0.6, 0.45, 0.35, 0.25]
    recency_boost_days: 30
    recency_boost_amount: 0.05
    kind_overrides:
      Filings: { group_by_path_prefix: [1.0, 0.6] }
  lexical:
    weight: 0.5

fitness: { ... decay.mode: off, lexical.weight: 0.05 }
nutrition: { ... decay.mode: off, lexical.weight: 0.05 }
learning: { ... decay.mode: off, lexical.weight: 0.05 }
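A rough sketch of how the scope could be derived from that registry. It rests on an assumption the page does not state: that each non-legacy domain lives in a top-level folder named after its slug. The names `derive_scope` and `in_scope` are hypothetical.

```python
from typing import Dict, List, Tuple

def derive_scope(registry: Dict[str, dict], domain: str) -> Tuple[List[str], List[str]]:
    """Return (include_prefixes, exclude_prefixes) for one indexer process."""
    if domain not in registry:
        raise KeyError(f"unknown domain: {domain}")
    if registry[domain].get("legacy"):
        # Legacy scope: the whole vault minus every other domain's subfolder.
        others = sorted(slug for slug in registry if slug != domain)
        return [""], [f"{slug}/" for slug in others]
    # ASSUMPTION: a non-legacy domain owns the folder named after its slug.
    return [f"{domain}/"], []

def in_scope(path: str, include: List[str], exclude: List[str]) -> bool:
    """True if a vault-relative path falls inside the derived scope."""
    included = any(path.startswith(p) for p in include)
    excluded = any(path.startswith(p) for p in exclude)
    return included and not excluded
```

With the abridged registry above, the finance process would walk everything except `fitness/`, `learning/`, and `nutrition/`.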

Cold-start tagging. When a brand-new note is ingested, a small Claude Haiku pass proposes 1–5 controlled-vocabulary tags from the domain taxonomy. It is best-effort and operator-review-gated through the domain's _review-queue.md — the model suggests, the operator confirms before tags become canonical.

The retrieval stack — four layers

query (+ bge prefix) → vector KNN (1024-d cosine, sqlite-vec) in parallel with FTS5 BM25 (lexical) → layer 1 · RRF 1/(k + rank) → layer 2 · decay (ranked-grouped, finance only) → layer 3 · graph re-rank (0.6·vector + 0.25·PageRank + 0.15·centrality) → top-K excerpts passed to the LLM as evidence.

Two parallel retrievers → RRF merge → optional decay → graph re-rank with recency boost. This same four-layer stack serves both the fast path and deep mode; deep mode just widens the knobs (see the fast-vs-deep table below).

Layer 1 · Hybrid Reciprocal Rank Fusion

Vector KNN and FTS5 BM25 run independently. Each produces a ranked list. RRF combines them rank-wise (not score-wise):

rrf_score(d) = Σ over retrievers r:  weight(r) × 1 / (k + rank_r(d))

# k = 60 (standard RRF constant)
# rank_r(d) = position of doc d in retriever r's ranking (1-indexed)
# weight(r) per-domain in _domains.yaml
#   finance: vector=1.0, lexical=0.5   (lexical matters for tickers/filings)
#   others:  vector=1.0, lexical=0.05  (timeless concepts → vector-dominant)

RRF is robust against score-scale mismatch between vector cosine [-1, 1] and BM25 (unbounded). By using rank position rather than raw score it works well even when one retriever is dominant.
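The weighted RRF formula above translates almost directly into code. A minimal sketch: the function name `rrf_fuse` and the retriever keys `"vector"` and `"lexical"` are illustrative, and ties are broken by document id only to keep the output deterministic.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def rrf_fuse(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Fuse ranked id lists: score(d) = sum of weight(r) / (k + rank_r(d))."""
    scores: Dict[str, float] = defaultdict(float)
    for retriever, ranked_ids in rankings.items():
        w = weights.get(retriever, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):  # 1-indexed ranks
            scores[doc_id] += w / (k + rank)
    # Highest fused score first; ties broken by id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

The per-domain weights change the outcome visibly. If a document is sixth in the vector list but first in the lexical list, the finance weights (lexical 0.5) lift it to the top, while the 0.05 lexical weight used elsewhere only nudges it up to third.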

Layer 2 · Decay (finance only by default)

The finance corpus has time-sensitive material (filings, snapshots, news). The mode is ranked_grouped — and despite earlier wording on this site, it is not a time-based half-life. Results are grouped by author, ranked within each author group by published_at descending, and the ladder [1.0, 0.6, 0.45, 0.35, 0.25] is applied by rank-position, with 0.25 as the floor past the ladder's length:

An author's most recent item: rank 0 → ×1.0; each older item steps down the ladder
A single-item author group → rank 0 → ×1.0 (no penalty)
Evergreen-tagged nodes opt out — always ×1.0
The Filings kind_overrides uses a path-prefix grouping [1.0, 0.6] — current quarter vs everything else

Why rank-within-author rather than a time half-life: authors publish on wildly different cadences (a daily newsletter vs a quarterly letter). A pure time decay would bury infrequent and evergreen sources regardless of quality; ranking within each author gives every source a fair shot at the top slot on its own cadence. (Not to be confused with the operator-attention signal, which does use a 7-day half-life — different mechanism, see the Lakshmi door.)

Fitness, nutrition, and learning use mode: off — content is timeless.
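The ranked-grouped ladder can be sketched as follows. The record fields (`id`, `author`, `published_at`, `evergreen`) and the function name are hypothetical, dates are assumed to be ISO-8601 strings, and the sketch assumes an evergreen item still occupies its rank slot within the author group (the page does not say either way). The Filings path-prefix override is not modelled.

```python
from collections import defaultdict
from typing import Dict, List

LADDER = [1.0, 0.6, 0.45, 0.35, 0.25]

def ranked_grouped_weights(results: List[dict], ladder: List[float] = LADDER) -> Dict[str, float]:
    """Map result id -> decay multiplier, ranked newest-first within each author."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for r in results:
        groups[r["author"]].append(r)

    weights: Dict[str, float] = {}
    for items in groups.values():
        # ISO-8601 date strings sort chronologically, so reverse = newest first.
        ordered = sorted(items, key=lambda r: r["published_at"], reverse=True)
        for rank, r in enumerate(ordered):
            if r.get("evergreen"):
                weights[r["id"]] = 1.0  # evergreen opts out of decay
            else:
                # Past the end of the ladder, the last step (0.25) is the floor.
                weights[r["id"]] = ladder[min(rank, len(ladder) - 1)]
    return weights
```

Each multiplier would then scale the fused score from layer 1 before the graph re-rank.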

Layer 3 · Graph re-ranking

When citation edges exist (a vault note links to another), a graph re-rank applies a linear combination:

hybrid_score(d) = α × vector_score(d)
                + β × pagerank(d)
                + γ × eigenvector_centrality(d)
                + recency_bonus(d)

# α = 0.6, β = 0.25, γ = 0.15
# recency_bonus = +0.05 if modified within 30 days, else 0
# applies to ALL FOUR DOMAINS (not finance-only)
# only activates if ≥10 citation edges exist in the touched subgraph

PageRank rewards well-cited notes — the canonical references in your reading. Computed on a directed graph of citation edges only.
Eigenvector centrality rewards notes that are well-cited by other well-cited notes — the "load-bearing" concepts. Computed on an undirected graph of citations + wikilinks (also the input to Louvain community clustering); falls back to degree centrality if the eigenvector solver fails to converge.
Recency boost ensures fresh thinking is not buried under decade-old classics.

Graceful degradation. If a node has no PageRank or no centrality (e.g. brand-new, uncited), that term's weight folds back into α rather than scoring it as zero — so a fresh note still ranks on pure vector similarity instead of being penalised for lacking graph history.
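One way to read the formula together with the graceful-degradation rule is sketched below. The function signature is hypothetical, `None` stands for "no value for this node", and PageRank and centrality are assumed to be normalised to a range comparable with the vector score (the page does not specify the scaling). The ≥10-edge activation check is left out.

```python
from typing import Optional

ALPHA, BETA, GAMMA = 0.6, 0.25, 0.15
RECENCY_BONUS, RECENCY_DAYS = 0.05, 30

def hybrid_score(
    vector_score: float,
    pagerank: Optional[float],
    centrality: Optional[float],
    days_since_modified: float,
) -> float:
    """Linear blend of vector, PageRank and centrality, plus a recency bonus."""
    alpha = ALPHA
    score = 0.0
    if pagerank is None:
        alpha += BETA  # fold the missing term's weight back into alpha
    else:
        score += BETA * pagerank
    if centrality is None:
        alpha += GAMMA
    else:
        score += GAMMA * centrality
    score += alpha * vector_score
    if days_since_modified <= RECENCY_DAYS:
        score += RECENCY_BONUS
    return score
```

A brand-new, uncited note with vector score 0.8 therefore scores 1.0 × 0.8 + 0.05 = 0.85 instead of being dragged down to 0.6 × 0.8.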

Iterative-deepening graph traversal

/traverse/<path>?depth=N seeds the graph search from a specific node and walks the citation graph outward. The traversal is quality-gated, not exhaustive:

Depth cap: 4 hops (max_hops)
Beam width: 5 — at most 5 candidates carried forward per hop
Per-hop prune floor: 0.50 — a candidate below 0.50 cosine to the query is dropped at that hop
Early-stop condition: stop once the pool reaches target_k = 10 AND the pool's average similarity is ≥ 0.65 (most queries converge in 1–2 hops)

The two thresholds do different jobs: 0.50 prunes individual weak nodes at each hop; 0.65 is the average-quality bar that lets the search stop early. They are not the same floor.

Typical traversal touches ~100 embeddings per query — never bulk-loads.
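The quality-gated walk can be sketched as a small beam search. This is an illustration under assumptions: the graph is a plain adjacency dict, similarities to the query are looked up from a precomputed dict (the real indexer would score embeddings lazily), the seed itself is not counted in the pool, and the name `traverse` is hypothetical.

```python
from typing import Dict, List

def traverse(
    graph: Dict[str, List[str]],
    sim: Dict[str, float],
    seed: str,
    max_hops: int = 4,
    beam_width: int = 5,
    prune_floor: float = 0.50,
    target_k: int = 10,
    quality_floor: float = 0.65,
) -> List[str]:
    """Walk outward from `seed`, keeping at most `beam_width` nodes per hop."""
    pool: List[str] = []
    seen = {seed}
    frontier = [seed]
    for _ in range(max_hops):
        # Expand only from the nodes retained at the previous hop.
        candidates = []
        for node in frontier:
            for nb in graph.get(node, []):
                if nb not in seen:
                    seen.add(nb)
                    candidates.append(nb)
        # Per-hop prune: drop individual nodes below the cosine floor.
        kept = [n for n in candidates if sim.get(n, 0.0) >= prune_floor]
        kept.sort(key=lambda n: sim[n], reverse=True)
        frontier = kept[:beam_width]
        if not frontier:
            break
        pool.extend(frontier)
        # Early stop: enough nodes AND the pool's average quality is high enough.
        avg = sum(sim[n] for n in pool) / len(pool)
        if len(pool) >= target_k and avg >= quality_floor:
            break
    return pool
```

The two thresholds appear in different places: `prune_floor` filters each candidate, `quality_floor` gates only the stop condition.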

From a seed vault note, hop-1 expansion yields 4 candidates — 3 retained above the 0.50 per-hop prune floor, 1 pruned. Hop-2 expands from the retained hop-1 nodes only: 2 retained, 1 pruned. Five nodes reach the top-K accepted set; the search stops early once the pool's average similarity clears 0.65, after touching roughly 100 embeddings.

The fast path in action. Dashed + dimmed nodes were dropped (below the 0.50 per-hop prune floor). Solid nodes feed the next hop; the walk stops early once the retained pool's average similarity clears 0.65. (Deep mode widens all of this; see the next section.)

Deep retrieval mode — retrieve wide, filter late

The fast path above is tuned to be cheap and always-on (~100 embeddings per query). When its thesis_match comes back weak or the corpus is sparse, an opt-in deep mode runs the same graph search with the brakes off. Crucially it does not run in the always-on app — it runs on-demand in a Claude Code session (the operator runs /rx-deep-retrieve <rec-id>), where the <$2/mo and <100-embeddings/query budgets don't apply. The retrieval itself is pure Python over the local bge model + sqlite-vec — zero LLM, no API key, no billing; the curated result POSTs back to the app's write-back endpoint (/v1/rx/deep) so the panel can show it. This cost seam is what makes deep retrieval affordable (see the enrichment lane on Pipeline).

Parameter	Fast (always-on)	Deep (on-demand)
`beam_width`	5	12
`max_hops`	4	4
`seed_count`	3	8
`target_k`	10	50
`prune_threshold`	0.50	0.30 (floor only — drop pure noise)
early-stop `quality_floor`	0.65 (active)	disabled (explore full depth)
decay	filter (drops `decay≤0`)	feature (keeps, flags `decay_zero`)
pruned candidates	dropped silently	returned with a `drop_reason`

The principle is to separate retrieval from filtering. Deep mode retrieves wide and attaches metadata to every candidate, then lets the judgment layer (running where LLM judgment is ~free) filter with full visibility, so nothing is dropped before it is evaluated. Each retained candidate carries its similarity, hop distance, decay_weight, a decay_zero flag, and a retain_reason (seed · kept_despite_decay_zero · above_prune_floor). Pruned candidates are also returned, each with a drop_reason (below_prune_threshold · beam_overflow), so a dropped-but-relevant doc can be rescued by the curator instead of vanishing silently.
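To make "retrieve wide, filter late" concrete, here is a sketch of the two parameter presets from the table and of one hop's bookkeeping in deep mode. The dataclass, the function `annotate_hop`, and the candidate dict shape are hypothetical; only the parameter values and the reason strings come from this page. Seeds and hop distance are not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class TraversalParams:
    beam_width: int
    max_hops: int
    seed_count: int
    target_k: int
    prune_threshold: float
    quality_floor: Optional[float]  # None = early stop disabled

FAST = TraversalParams(5, 4, 3, 10, 0.50, 0.65)
DEEP = TraversalParams(12, 4, 8, 50, 0.30, None)

def annotate_hop(candidates: List[dict], params: TraversalParams) -> Tuple[List[dict], List[dict]]:
    """Split one hop's candidates into (retained, pruned), each tagged with a reason.

    Each candidate dict needs `id`, `similarity` and `decay_weight`.
    """
    ranked = sorted(candidates, key=lambda c: c["similarity"], reverse=True)
    retained: List[dict] = []
    pruned: List[dict] = []
    for cand in ranked:
        # Decay is a feature here, not a filter: flag it and keep going.
        c = dict(cand, decay_zero=cand["decay_weight"] <= 0)
        if c["similarity"] < params.prune_threshold:
            pruned.append(dict(c, drop_reason="below_prune_threshold"))
        elif len(retained) >= params.beam_width:
            pruned.append(dict(c, drop_reason="beam_overflow"))
        else:
            reason = "kept_despite_decay_zero" if c["decay_zero"] else "above_prune_floor"
            retained.append(dict(c, retain_reason=reason))
    return retained, pruned
```

Returning the pruned list alongside the retained one is the whole point: the curator sees what was dropped and why.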

What's proven vs pending. The deep-mode mechanism is shipped and unit-proven. The actual recall gain ("deep surfaced N docs the fast path dropped") is not yet measured — it needs the laptop's bge model + live corpus via an offline eval script (--mode compare). The in-app "Deep retrieval available" affordance is specified for the finance panel but not yet confirmed rendered there; the fitness Coach view already renders deep results (a "Deep enrichment" section).

Citation verification — is the quote actually in the source?

A separate, deterministic, app-side check (no LLM): at rec compose and at deep-result write-back, every quoted span must be a normalized substring of the chunk it cites. Normalization is, in order: NFKC unicode → smart-punctuation fold (curly quotes → straight, en/em/minus dashes → -, non-breaking space → space) → casefold → whitespace collapse. Ellipsis-elided quotes are split and each fragment must appear in order. Quotes shorter than 12 normalized characters are marked too_short — unverifiable, and deliberately not a pass, so a trivial match can't manufacture confidence.

Per-citation	`citation_verified` (bool) + `citation_reason`: `match` · `no_quote` · `no_chunk_text` · `too_short` · `not_found`
Per-rec (derived on read)	`citations_status`: `no_quotes` · `all_verified` · `has_mismatch` · `unverifiable`
On a miss	the rec still publishes — verification is annotate-only, never blocks ingest. A fabricated citation is flagged (`has_mismatch`), not suppressed.
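The check is simple enough to sketch in full. Assumptions are flagged in the comments: the 12-character minimum is applied to the whole normalized quote, elisions are detected as runs of three or more dots (NFKC turns the single-character ellipsis into three dots), and the precedence used to derive the per-rec status is a guess. The function names are hypothetical.

```python
import re
import unicodedata
from typing import List, Optional, Tuple

# Smart-punctuation fold: curly quotes, en/em/minus dashes, non-breaking space.
_FOLD = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-", "\u2212": "-", "\u00a0": " ",
})
MIN_QUOTE_CHARS = 12

def normalize(text: str) -> str:
    """NFKC -> punctuation fold -> casefold -> whitespace collapse, in that order."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_FOLD).casefold()
    return re.sub(r"\s+", " ", text).strip()

def verify_citation(quote: Optional[str], chunk_text: Optional[str]) -> Tuple[bool, str]:
    """Return (citation_verified, citation_reason) for one quoted span."""
    if not quote or not quote.strip():
        return False, "no_quote"
    if not chunk_text:
        return False, "no_chunk_text"
    norm_quote = normalize(quote)
    # ASSUMPTION: the length floor applies to the whole normalized quote.
    if len(norm_quote) < MIN_QUOTE_CHARS:
        return False, "too_short"
    fragments = [f.strip() for f in re.split(r"\.{3,}", norm_quote) if f.strip()]
    if not fragments:
        return False, "too_short"
    haystack = normalize(chunk_text)
    pos = 0
    for frag in fragments:  # every fragment must appear, in order
        found = haystack.find(frag, pos)
        if found < 0:
            return False, "not_found"
        pos = found + len(frag)
    return True, "match"

def citations_status(reasons: List[str]) -> str:
    """Derive the per-rec status. ASSUMPTION: a mismatch outranks unverifiable."""
    quoted = [r for r in reasons if r != "no_quote"]
    if not quoted:
        return "no_quotes"
    if "not_found" in quoted:
        return "has_mismatch"
    if all(r == "match" for r in quoted):
        return "all_verified"
    return "unverifiable"
```

Note that nothing here raises or blocks: the result is an annotation, matching the annotate-only rule above.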

Excerpt extraction

After top-K chunks are picked, each one's excerpt (the snippet shown in the rec's "Sources" section) is selected sentence-by-sentence:

1. Split chunk into sentences
2. Encode each sentence with the same bge prefix
3. Cosine-score against the query embedding
4. Return the top-2 sentences in document order

This keeps citations focused — instead of a 600-word chunk wall, the operator sees the two sentences that best match the rec's framing question.
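The four steps above can be sketched with the encoder passed in as a callable, so the example stays independent of any model library. The regex sentence splitter is a naive stand-in, and the names `select_excerpt` and `embed` are hypothetical.

```python
import math
import re
from typing import Callable, List, Sequence

QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_excerpt(
    chunk_text: str,
    query_vec: Sequence[float],
    embed: Callable[[str], Sequence[float]],
    top_n: int = 2,
) -> str:
    """Pick the top_n sentences by cosine to the query, returned in document order."""
    # 1. Naive split on sentence-ending punctuation followed by whitespace.
    sentences: List[str] = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk_text) if s.strip()]
    # 2 + 3. Encode each sentence with the same prefix, score against the query.
    scored = [(cosine(embed(QUERY_PREFIX + s), query_vec), i) for i, s in enumerate(sentences)]
    # 4. Take the best top_n, then restore document order.
    best = sorted(scored, reverse=True)[:top_n]
    keep = sorted(i for _, i in best)
    return " ".join(sentences[i] for i in keep)
```

In the real pipeline `embed` would be the bge-large forward pass, which is why this step shows up in the latency budget.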

End-to-end query latency budget

Step	Typical time	Bottleneck
Encode query (bge-large)	40–120 ms	CPU forward pass
Vector KNN (sqlite-vec)	5–20 ms	linear scan over ~50k vectors
FTS5 BM25	2–8 ms	tokeniser + inverted index
RRF merge	< 1 ms	—
Decay weighting	< 1 ms	—
Graph re-rank	20–80 ms	networkx PageRank on subgraph
Excerpt selection	30–60 ms	sentence-level encoding for top-K chunks
Total	100–290 ms	encoder forward passes dominate

Why this model + stack?

Four design choices.

Why bge-large-en-v1.5?

Near the top of the MTEB English retrieval leaderboard among sub-1B-parameter models (≈335M params)
Runs on CPU at acceptable latency for a personal corpus (~50k chunks)
1024-d is the sweet spot — bigger doesn't help on sub-100k corpora; smaller (768-d) loses recall on subtle queries
Single encoder for queries and documents (only the query gets a prefix), which is simpler to operate than a two-model retriever

Why sqlite-vec instead of pgvector?

Per-domain physical isolation — one cache file per indexer
Zero-server local-first deploys (no postgres dependency for the indexer)
Atomic cp-able caches → trivial backup/restore
Crash-safe by SQLite WAL
Performance sufficient up to ~100k vectors; this corpus is well under that

Why hybrid (vector + lexical)?

Vector retrieval misses exact ticker symbols, model names, formula notation
BM25 nails proper nouns + technical terms
Finance especially needs both: "AAPL Q3" is a ticker (lexical) and "Apple's growth quarter" is a concept (vector)
Fitness leans vector-heavy (0.05 lexical) because timeless concepts dominate

Why graph re-rank?

A vault is not a flat corpus — it has citation structure
PageRank surfaces canonical references the operator has already validated by citing them
Centrality finds "load-bearing" notes — those well-cited by other well-cited notes
Recency boost prevents the system from being trapped by old foundational notes

Caches: what's inside each domain's cache file

Table	Purpose
`vault_node`	One row per markdown file. Frontmatter, evergreen flag, domain tag, last-modified.
`vault_chunk`	One row per chunk (~600-word target). Section label, embedding (1024-d float32 blob).
`vault_chunk_vec`	sqlite-vec virtual table — KNN index over `vault_chunk.embedding`.
`vault_chunk_fts`	FTS5 virtual table — BM25 over `vault_chunk.text`.
`vault_edge`	Citation graph edges (forward links + backlinks). Source for PageRank + centrality.
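For orientation, the three plain tables might be declared roughly as below. Only the table names and the 4096-byte embedding size come from this page; every column name is a guess made for illustration. The two virtual tables are omitted because they depend on the sqlite-vec extension and an FTS5-enabled SQLite build.

```python
import sqlite3

# HYPOTHETICAL column layout; only the table names are from the page.
SCHEMA = """
CREATE TABLE vault_node (
    path        TEXT PRIMARY KEY,            -- one row per markdown file
    frontmatter TEXT,
    evergreen   INTEGER NOT NULL DEFAULT 0,
    domain      TEXT NOT NULL,
    modified_at TEXT NOT NULL
);
CREATE TABLE vault_chunk (
    id            INTEGER PRIMARY KEY,
    node_path     TEXT NOT NULL REFERENCES vault_node(path),
    section_label TEXT,
    text          TEXT NOT NULL,
    -- 1024 float32 values = 4096 bytes
    embedding     BLOB NOT NULL CHECK (length(embedding) = 4096)
);
CREATE TABLE vault_edge (
    src_path TEXT NOT NULL,
    dst_path TEXT NOT NULL,
    PRIMARY KEY (src_path, dst_path)
);
"""

def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Create a fresh cache database with the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

The `CHECK` constraint is an optional guard added for the sketch; it rejects any blob that is not exactly one 1024-d float32 vector.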

Endpoints — the indexer HTTP surface

Endpoint	Returns
`GET /health`	Vault path, DB path, embedding model, embedding_dim
`POST /reload`	Full vault rescan (idempotent, diff-aware)
`GET /search?q=&k=`	Hybrid top-K with excerpts
`GET /traverse/<path>?depth=`	Local subgraph from a seed node
`GET /node/<path>`	Full chunk listing for a single file
`POST /promote`	Apply ticks from `_review-queue.md`, regenerate queue
`POST /apply-renames`	Scope-aware atomic taxonomy renames

TradingV's vault module proxies these endpoints from the FastAPI app to the indexer. The frontend never hits Port 1 directly — it hits /v1/vault/search which forwards.
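A client call against the proxied route might look like the sketch below. It assumes the proxy accepts the same `q` and `k` query parameters as the indexer's `/search` and returns JSON; neither is confirmed here, and the base URL in the usage note is a placeholder.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def search_url(base_url: str, q: str, k: int = 5) -> str:
    """Build the proxied search URL (the app forwards it to the indexer)."""
    query = urlencode({"q": q, "k": k})
    return f"{base_url.rstrip('/')}/v1/vault/search?{query}"

def search(base_url: str, q: str, k: int = 5, timeout: float = 5.0):
    """GET the search route and decode the body. ASSUMPTION: the body is JSON."""
    with urlopen(search_url(base_url, q, k), timeout=timeout) as resp:
        return json.load(resp)
```

For example, `search_url("http://localhost:8000", "AAPL Q3", 3)` yields `http://localhost:8000/v1/vault/search?q=AAPL+Q3&k=3`.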
