06 · deep tech · technical

Embedding model, chunking, retrieval math

The knowledge vault is the answer to "what should I know about this drift signal?" The retrieval stack is a hybrid: dense vector KNN plus sparse lexical BM25, fused by Reciprocal Rank Fusion, decayed by age, then re-ranked by graph centrality. Four small SQLite-backed processes serve one HTTP endpoint each. All math is reproducible: the indexer source — the chunking, lexical, and graph-compute modules — ships inside the downloadable kit.

The embedding model

Immutability per cache. The embedding_dim is written into the SQLite schema at cache creation. Swapping the EMBEDDING_MODEL env var without deleting the cache crashes startup (the dimension wouldn't match). To swap models cleanly: stop the indexer → delete each domain's cache file → restart → full re-embed.
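A minimal sketch of the kind of startup guard that produces this behaviour. The cache_meta table and the check_embedding_dim function are hypothetical names used only for illustration; the real indexer may record the dimension differently (for example in the vector table's own declaration).

```python
import sqlite3


def check_embedding_dim(conn: sqlite3.Connection, expected_dim: int) -> None:
    """Record the dimension on first use; refuse to start on a mismatch.

    Assumption: the dimension lives in a small key/value table (hypothetical).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache_meta (key TEXT PRIMARY KEY, value TEXT)"
    )
    row = conn.execute(
        "SELECT value FROM cache_meta WHERE key = 'embedding_dim'"
    ).fetchone()
    if row is None:
        # Fresh cache: pin the dimension for the lifetime of this file.
        conn.execute(
            "INSERT INTO cache_meta (key, value) VALUES ('embedding_dim', ?)",
            (str(expected_dim),),
        )
        conn.commit()
        return
    stored = int(row[0])
    if stored != expected_dim:
        raise RuntimeError(
            f"cache was built with embedding_dim={stored}, "
            f"model produces {expected_dim}; delete the cache and re-embed"
        )
```

Failing fast here is the point: a cache holding vectors of two different dimensions could not be searched meaningfully.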

BAAI/bge-large-en-v1.5

Model: BAAI/bge-large-en-v1.5 (Hugging Face)
Dimension: 1024
Similarity metric: cosine, computed via sqlite-vec
Storage on disk: float32 little-endian blob (4096 bytes per vector)
Query-side prefix (bge convention): "Represent this sentence for searching relevant passages: "
Document-side prefix: none; documents encoded raw
Why this model: Strong English retrieval benchmarks, runs on CPU at acceptable latency for a personal corpus, dimensionality matches the MTEB sweet spot for sub-100k chunks
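The storage figure follows from 1024 float32 values at 4 bytes each. A small sketch of that packing and of the query-side prefix, using only the standard library; the function names are illustrative, not the indexer's own.

```python
import struct

EMBEDDING_DIM = 1024
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def to_blob(vector):
    """Pack a 1024-d vector as little-endian float32 (4 bytes per value)."""
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(vector)}")
    return struct.pack(f"<{EMBEDDING_DIM}f", *vector)


def from_blob(blob):
    """Inverse of to_blob: bytes back to a list of floats."""
    return list(struct.unpack(f"<{EMBEDDING_DIM}f", blob))


def query_text(query):
    """Only queries get the bge prefix; documents are encoded raw."""
    return QUERY_PREFIX + query
```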

Chunking strategy

"Tokens" here means whitespace words. The config keys are named chunk_target_tokens=600 and chunk_overlap_tokens=80, but the chunker is a deliberately naive token approximation: it counts whitespace-split words, and no tokenizer runs at chunking time. Earlier drafts of this doc flip-flopped between "words" and "tokens"; the code is the referee.

The chunker is a routine in the vault indexer. It is heading-aware by design, with a size target and an overlap both counted in whitespace words:

Unit: H2 section (markdown ##)
Size target: ~600 words (chunk_target_tokens=600, a whitespace-word approximation of tokens)
Overlap: 80 words (chunk_overlap_tokens=80)
Algorithm: scan headings → partition per H2 → per-section size check → sliding window subdivision on long sections
Returns: List[(text, section_label)]
Why H2: preserves narrative context; for The Street snapshots a single ticker = one section = one chunk
Why 600, and the 512-token caveat: most H2 sections in this corpus land well under the target, so they embed intact within bge-large's 512-token encoder window. A chunk that genuinely runs to 600 words exceeds that window: the encoder silently truncates, so the vector represents the head of the section, while FTS5 BM25 and excerpt selection still see the full chunk text. This is an accepted trade-off: the head of an H2 section carries its topic, and lexical recall covers the tail.
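The algorithm row can be sketched as follows. This is an illustration under assumptions, not the shipped chunker: the function names are invented here, text before the first H2 is given an empty label, and re-joining words with single spaces discards the original line breaks.

```python
import re
from typing import List, Tuple

H2 = re.compile(r"^##\s+(.*)$")  # matches "## Title" but not "### Title"


def split_h2_sections(markdown: str) -> List[Tuple[str, str]]:
    """Partition a note into (section_label, body) pairs, one per H2."""
    sections, label, lines = [], "", []
    for line in markdown.splitlines():
        match = H2.match(line)
        if match:
            if lines:
                sections.append((label, "\n".join(lines)))
            label, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if lines:
        sections.append((label, "\n".join(lines)))
    return sections


def chunk_markdown(markdown: str, target: int = 600, overlap: int = 80):
    """Return List[(text, section_label)], counting whitespace words."""
    chunks = []
    step = target - overlap
    for label, body in split_h2_sections(markdown):
        words = body.split()
        if not words:
            continue
        if len(words) <= target:
            chunks.append((" ".join(words), label))  # short section: one chunk
            continue
        start = 0
        while True:  # long section: sliding window with overlap
            chunks.append((" ".join(words[start:start + target]), label))
            if start + target >= len(words):
                break
            start += step
    return chunks
```

With these defaults a 1,300-word section yields three chunks starting at words 0, 520 and 1040, each sharing 80 words with its predecessor.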

The four indexer processes

Process | Cache | Decay | Lexical weight | Notes
Port 1 · finance | finance cache · 18 MB | ranked_grouped | 0.5 | includes Filings kind_overrides
Port 2 · fitness | fitness cache · 14 MB | off | 0.05 | vector-heavy retrieval
Port 3 · nutrition | nutrition cache · 6.3 MB | off | 0.05 | vector-heavy retrieval
Port 4 · learning | learning cache · 4 KB | off | 0.05 | empty corpus today

Four sibling FastAPI processes. Same code, different DOMAIN= env var, separate cache file. Physical isolation by design.

Scope derivation

Each process boots with DOMAIN=<slug> and reads _domains.yaml at the vault root. From the registry it derives INCLUDE/EXCLUDE prefixes for every walk. The finance domain is marked legacy: true, so its indexer includes everything except the other domains' subfolders (the exclusions are auto-derived).

# _domains.yaml (abridged)
finance:
  legacy: true
  taxonomy: _taxonomy.md
  review_queue: _review-queue.md
  embedding: BAAI/bge-large-en-v1.5
  embedding_dim: 1024
  decay:
    mode: ranked_grouped
    ladder: [1.0, 0.6, 0.45, 0.35, 0.25]
    recency_boost_days: 30
    recency_boost_amount: 0.05
    kind_overrides:
      Filings: { group_by_path_prefix: [1.0, 0.6] }
  lexical:
    weight: 0.5

fitness: { ... decay.mode: off, lexical.weight: 0.05 }
nutrition: { ... decay.mode: off, lexical.weight: 0.05 }
learning: { ... decay.mode: off, lexical.weight: 0.05 }
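One way the include/exclude derivation could work, as a sketch only. It assumes each non-legacy domain lives in a top-level folder named after its slug, which the registry excerpt above does not confirm; derive_scope and in_scope are hypothetical names.

```python
def derive_scope(registry: dict, domain: str):
    """Return (include_prefixes, exclude_prefixes) for one indexer process.

    Assumption: a non-legacy domain owns the folder "<slug>/"; a legacy
    domain owns everything that no other domain claims.
    """
    entry = registry[domain]
    if entry.get("legacy"):
        include = [""]  # empty prefix matches every path in the vault
        exclude = sorted(f"{slug}/" for slug in registry if slug != domain)
    else:
        include = [f"{domain}/"]
        exclude = []
    return include, exclude


def in_scope(path: str, include, exclude) -> bool:
    """True if a vault-relative path belongs to this indexer's walk."""
    return any(path.startswith(p) for p in include) and not any(
        path.startswith(p) for p in exclude
    )
```

Deriving the exclusions from the registry means adding a fifth domain narrows the legacy scope automatically, with no edit to the finance entry.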

Cold-start tagging. When a brand-new note is ingested, a small Claude Haiku pass proposes 1–5 controlled-vocabulary tags from the domain taxonomy. It is best-effort and operator-review-gated through the domain's _review-queue.md — the model suggests, the operator confirms before tags become canonical.

The retrieval stack — four layers

query (+ bge prefix) → vector KNN (1024-d cosine, sqlite-vec) in parallel with FTS5 BM25 (lexical) → layer 1 · RRF 1/(k + rank) → layer 2 · decay (ranked-grouped, finance only) → layer 3 · graph re-rank (0.6·vector + 0.25·PageRank + 0.15·centrality) → top-K excerpts passed to the LLM as evidence.

Two parallel retrievers → RRF merge → optional decay → graph re-rank with recency boost. This same four-layer stack serves both the fast path and deep mode: deep mode just widens the knobs (see the fast-vs-deep table below).

Layer 1 · Hybrid Reciprocal Rank Fusion

Vector KNN and FTS5 BM25 run independently. Each produces a ranked list. RRF combines them rank-wise (not score-wise):

rrf_score(d) = Σ over retrievers r:  weight(r) × 1 / (k + rank_r(d))

# k = 60 (standard RRF constant)
# rank_r(d) = position of doc d in retriever r's ranking (1-indexed)
# weight(r) per-domain in _domains.yaml
#   finance: vector=1.0, lexical=0.5   (lexical matters for tickers/filings)
#   others:  vector=1.0, lexical=0.05  (timeless concepts → vector-dominant)

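RRF is robust against the score-scale mismatch between vector cosine (bounded in [-1, 1]) and BM25 (unbounded). Because it fuses rank positions rather than raw scores, a retriever with larger raw scores cannot swamp the other; each retriever's influence is set only by its weight in _domains.yaml.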

Layer 2 · Decay (finance only by default)

The finance corpus has time-sensitive material (filings, snapshots, news). The mode is ranked_grouped and, despite earlier wording on this site, it is not a time-based half-life. Results are grouped by author and ranked within each author group by published_at descending. The ladder [1.0, 0.6, 0.45, 0.35, 0.25] is then applied by rank position, with 0.25 as the floor past the ladder's length: an author's newest matching result keeps full weight, the second-newest gets 0.6, and the sixth and later all get 0.25.

Why rank-within-author rather than a time half-life: authors publish on wildly different cadences (a daily newsletter vs a quarterly letter). A pure time decay would bury infrequent and evergreen sources regardless of quality; ranking within each author gives every source a fair shot at the top slot on its own cadence. (Not to be confused with the operator-attention signal, which does use a 7-day half-life — different mechanism, see the Lakshmi door.)
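The ladder logic can be sketched as below. The result shape (dicts with id, author and published_at as ISO date strings) is an assumption for illustration, and the sketch covers only the weight assignment, not how the weight is combined with the fused score or the Filings kind_overrides.

```python
from collections import defaultdict

LADDER = [1.0, 0.6, 0.45, 0.35, 0.25]


def ranked_grouped_weights(results, ladder=LADDER):
    """Map result id -> decay weight by recency rank within each author."""
    by_author = defaultdict(list)
    for result in results:
        by_author[result["author"]].append(result)

    weights = {}
    for items in by_author.values():
        # Newest first; ISO dates sort correctly as strings.
        items.sort(key=lambda r: r["published_at"], reverse=True)
        for position, result in enumerate(items):
            # Past the end of the ladder, the last rung acts as the floor.
            rung = ladder[position] if position < len(ladder) else ladder[-1]
            weights[result["id"]] = rung
    return weights
```

Note that a quarterly letter from two years ago still receives 1.0 if it is that author's newest matching result, which is exactly the property a time half-life would not give.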

Fitness, nutrition, and learning use mode: off — content is timeless.

Layer 3 · Graph re-ranking

When citation edges exist (a vault note links to another), a graph re-rank applies a linear combination:

hybrid_score(d) = α × vector_score(d)
                + β × pagerank(d)
                + γ × eigenvector_centrality(d)
                + recency_bonus(d)

# α = 0.6, β = 0.25, γ = 0.15
# recency_bonus = +0.05 if modified within 30 days, else 0
# applies to ALL FOUR DOMAINS (not finance-only)
# only activates if ≥10 citation edges exist in the touched subgraph

Graceful degradation. If a node has no PageRank or no centrality (e.g. brand-new, uncited), that term's weight folds back into α rather than scoring it as zero — so a fresh note still ranks on pure vector similarity instead of being penalised for lacking graph history.
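A sketch of the scoring rule with the fold-back behaviour described above. It assumes PageRank and centrality arrive already normalised to a range comparable with the vector score, and it treats "within 30 days" as inclusive; both are assumptions, as are the function names.

```python
ALPHA, BETA, GAMMA = 0.6, 0.25, 0.15
RECENCY_BONUS, RECENCY_DAYS = 0.05, 30
MIN_CITATION_EDGES = 10


def graph_rerank_enabled(edge_count: int) -> bool:
    """The re-rank only activates on a sufficiently connected subgraph."""
    return edge_count >= MIN_CITATION_EDGES


def hybrid_score(vector_score, pagerank=None, centrality=None,
                 days_since_modified=None):
    """Linear blend; a missing graph term hands its weight back to alpha."""
    alpha = ALPHA
    score = 0.0
    if pagerank is None:
        alpha += BETA          # fold beta into the vector term
    else:
        score += BETA * pagerank
    if centrality is None:
        alpha += GAMMA         # fold gamma into the vector term
    else:
        score += GAMMA * centrality
    score += alpha * vector_score
    if days_since_modified is not None and days_since_modified <= RECENCY_DAYS:
        score += RECENCY_BONUS
    return score
```

For an uncited note with no graph metrics the effective α is 1.0, so the score reduces to the vector similarity (plus the recency bonus if it applies).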

Iterative-deepening graph traversal

/traverse/<path>?depth=N seeds the graph search from a specific node and walks the citation graph outward. The traversal is quality-gated, not exhaustive, and two thresholds do different jobs: 0.50 prunes individual weak nodes at each hop; 0.65 is the average-quality bar that lets the search stop early. They are not the same floor.

Typical traversal touches ~100 embeddings per query — never bulk-loads.
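A minimal sketch of a quality-gated walk with both thresholds. The exact control flow is an assumption: here the early-stop check runs once per hop over every node retained so far, and neighbors and similarity are caller-supplied callables standing in for the citation graph and the embedding comparison.

```python
from statistics import mean


def traverse(seed, neighbors, similarity, max_hops=4, beam_width=5,
             prune_threshold=0.50, quality_floor=0.65, target_k=10):
    """Walk outward from seed; return up to target_k (node, sim, hop) tuples."""
    visited = {seed}
    frontier = [seed]
    pool = []
    for hop in range(1, max_hops + 1):
        candidates = []
        for node in frontier:
            for linked in neighbors(node):
                if linked not in visited:
                    visited.add(linked)
                    candidates.append((linked, similarity(linked)))
        # Per-hop prune floor: weak nodes never seed the next hop.
        kept = [(n, s) for n, s in candidates if s >= prune_threshold]
        kept.sort(key=lambda pair: pair[1], reverse=True)
        kept = kept[:beam_width]
        if not kept:
            break
        pool.extend((n, s, hop) for n, s in kept)
        frontier = [n for n, _ in kept]
        # Early stop: the retained pool is already good enough on average.
        if mean([s for _, s, _ in pool]) >= quality_floor:
            break
    pool.sort(key=lambda item: item[1], reverse=True)
    return pool[:target_k]
```

Because pruned nodes are never expanded, the number of embeddings touched grows with the beam width and hop count rather than with the size of the vault.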

From a seed vault note, hop-1 expansion yields 4 candidates — 3 retained above the 0.50 per-hop prune floor, 1 pruned. Hop-2 expands from the retained hop-1 nodes only: 2 retained, 1 pruned. Five nodes reach the top-K accepted set; the search stops early once the pool's average similarity clears 0.65, after touching roughly 100 embeddings.

The fast path in action. Dashed + dimmed nodes were dropped (below the 0.50 per-hop prune floor). Solid nodes feed the next hop; the walk stops early once the retained pool's average similarity clears 0.65. (Deep mode widens all of this; see the next section.)

Deep retrieval mode — retrieve wide, filter late

The fast path above is tuned to be cheap and always-on (~100 embeddings per query). When its thesis_match comes back weak or the corpus is sparse, an opt-in deep mode runs the same graph search with the brakes off. Crucially it does not run in the always-on app — it runs on-demand in a Claude Code session (the operator runs /rx-deep-retrieve <rec-id>), where the <$2/mo and <100-embeddings/query budgets don't apply. The retrieval itself is pure Python over the local bge model + sqlite-vec — zero LLM, no API key, no billing; the curated result POSTs back to the app's write-back endpoint (/v1/rx/deep) so the panel can show it. This cost seam is what makes deep retrieval affordable (see the enrichment lane on Pipeline).

Parameter | Fast (always-on) | Deep (on-demand)
beam_width | 5 | 12
max_hops | 4 | 4
seed_count | 3 | 8
target_k | 10 | 50
prune_threshold | 0.50 | 0.30 (floor only; drop pure noise)
early-stop quality_floor | 0.65 (active) | disabled (explore full depth)
decay | filter (drops decay≤0) | feature (keeps, flags decay_zero)
pruned candidates | dropped silently | returned with a drop_reason

The principle is separate retrieval from filtering. Deep mode retrieves wide and attaches metadata to every candidate, then lets the judgment layer (running where LLM judgment is ~free) filter with full visibility — so nothing is dropped before it is evaluated. Each retained candidate carries its similarity, hop distance, decay_weight, a decay_zero flag, and a retain_reason (seed · kept_despite_decay_zero · above_prune_floor). Pruned candidates are also returned — each with a drop_reason (below_prune_threshold · beam_overflow) — so a dropped-but-relevant doc can be rescued by the curator instead of vanishing silently.
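The "attach metadata, filter late" idea can be sketched as a per-hop annotation pass. The field names mirror the metadata listed above, but the Candidate class, the annotate function and the precedence between reasons are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    path: str
    similarity: float
    hop: int
    decay_weight: float = 1.0
    is_seed: bool = False
    decay_zero: bool = False
    retain_reason: Optional[str] = None
    drop_reason: Optional[str] = None


def annotate(candidates, beam_width, prune_threshold=0.30):
    """Label every candidate; return (retained, dropped), losing none."""
    retained, dropped = [], []
    slots = beam_width
    for cand in sorted(candidates, key=lambda c: c.similarity, reverse=True):
        cand.decay_zero = cand.decay_weight <= 0
        if cand.is_seed:
            cand.retain_reason = "seed"  # seeds never compete for beam slots
            retained.append(cand)
        elif cand.similarity < prune_threshold:
            cand.drop_reason = "below_prune_threshold"
            dropped.append(cand)
        elif slots == 0:
            cand.drop_reason = "beam_overflow"
            dropped.append(cand)
        else:
            slots -= 1
            cand.retain_reason = (
                "kept_despite_decay_zero" if cand.decay_zero
                else "above_prune_floor"
            )
            retained.append(cand)
    return retained, dropped
```

Returning the dropped list alongside the retained one is what lets a curator rescue a beam_overflow candidate that happens to be the relevant document.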

What's proven vs pending. The deep-mode mechanism is shipped and unit-proven. The actual recall gain ("deep surfaced N docs the fast path dropped") is not yet measured — it needs the laptop's bge model + live corpus via an offline eval script (--mode compare). The in-app "Deep retrieval available" affordance is specified for the finance panel but not yet confirmed rendered there; the fitness Coach view already renders deep results (a "Deep enrichment" section).

Citation verification — is the quote actually in the source?

A separate, deterministic, app-side check (no LLM): at rec compose and at deep-result write-back, every quoted span must be a normalized substring of the chunk it cites. Normalization is, in order: NFKC unicode → smart-punctuation fold (curly quotes → straight, en/em/minus dashes → -, non-breaking space → space) → casefold → whitespace collapse. Ellipsis-elided quotes are split and each fragment must appear in order. Quotes shorter than 12 normalized characters are marked too_short — unverifiable, and deliberately not a pass, so a trivial match can't manufacture confidence.

Per-citation: citation_verified (bool) + citation_reason: match · no_quote · no_chunk_text · too_short · not_found
Per-rec (derived on read): citations_status: no_quotes · all_verified · has_mismatch · unverifiable
On a miss: the rec still publishes; verification is annotate-only and never blocks ingest. A fabricated citation is flagged (has_mismatch), not suppressed.
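The normalization order and the substring check can be sketched with the standard library alone. This is an illustration, not the app's code: the function names are invented, the 12-character rule is applied to the whole normalized quote, and the precedence between no_quote and no_chunk_text is an assumption.

```python
import re
import unicodedata

MIN_QUOTE_CHARS = 12
_PUNCT = str.maketrans({
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "-", "\u2212": "-",  # en, em, minus
    "\u00a0": " ",                   # non-breaking space
})


def normalize(text: str) -> str:
    """NFKC, then punctuation fold, then casefold, then whitespace collapse."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_PUNCT)
    text = text.casefold()
    return " ".join(text.split())


def verify_citation(quote, chunk_text):
    """Return (citation_verified, citation_reason)."""
    if not quote or not quote.strip():
        return False, "no_quote"
    if not chunk_text or not chunk_text.strip():
        return False, "no_chunk_text"
    norm_quote = normalize(quote)
    if len(norm_quote) < MIN_QUOTE_CHARS:
        return False, "too_short"
    haystack = normalize(chunk_text)
    # NFKC turns the single-character ellipsis into "...", so one split
    # pattern covers both spellings.
    fragments = [f.strip() for f in re.split(r"\.{3,}", norm_quote) if f.strip()]
    if not fragments:
        return False, "too_short"
    position = 0
    for fragment in fragments:  # each fragment must appear, in order
        found = haystack.find(fragment, position)
        if found < 0:
            return False, "not_found"
        position = found + len(fragment)
    return True, "match"
```

Since both sides go through the same normalize function, a quote copied with straight quotes still matches a source typeset with curly ones.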

Excerpt extraction

After top-K chunks are picked, each one's excerpt (the snippet shown in the rec's "Sources" section) is selected sentence-by-sentence:

1. Split chunk into sentences
2. Encode each sentence with the same bge prefix
3. Cosine-score against the query embedding
4. Return the top-2 sentences in document order

This keeps citations focused — instead of a 600-word chunk wall, the operator sees the two sentences that best match the rec's framing question.
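The four steps above can be sketched as follows. The embed argument is a stand-in for the bge encoder and is expected to apply the prefix itself; the regex sentence splitter is a simplification, and select_excerpt is an illustrative name.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_excerpt(chunk_text, query_vec, embed, top_n=2):
    """Pick the top_n sentences by similarity, returned in document order."""
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk_text) if s.strip()
    ]
    scored = [(cosine(embed(s), query_vec), i) for i, s in enumerate(sentences)]
    best = sorted(scored, reverse=True)[:top_n]
    keep = sorted(index for _, index in best)  # restore document order
    return [sentences[i] for i in keep]
```

Returning the winners in document order rather than score order keeps the excerpt readable when the two sentences come from the same passage.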

End-to-end query latency budget

Step | Typical time | Bottleneck
Encode query (bge-large) | 40–120 ms | CPU forward pass
Vector KNN (sqlite-vec) | 5–20 ms | linear scan over ~50k vectors
FTS5 BM25 | 2–8 ms | tokeniser + inverted index
RRF merge | < 1 ms |
Decay weighting | < 1 ms |
Graph re-rank | 20–80 ms | networkx PageRank on subgraph
Excerpt selection | 30–60 ms | sentence-level encoding for top-K chunks
Total | 100–290 ms | encoder forward passes dominate

Why this model + stack?

Four design choices.

Why bge-large-en-v1.5?
  • Top of the MTEB English retrieval leaderboard in the sub-1B-parameter tier (≈335M params)
  • Runs on CPU at acceptable latency for a personal corpus (~50k chunks)
  • 1024-d is the sweet spot — bigger doesn't help on sub-100k corpora; smaller (768-d) loses recall on subtle queries
  • Same model encodes both queries and documents (only the query gets a prefix), so ops are simpler than with separate query and document encoders
Why sqlite-vec instead of pgvector?
  • Per-domain physical isolation — one cache file per indexer
  • Zero-server local-first deploys (no postgres dependency for the indexer)
  • Atomic cp-able caches → trivial backup/restore
  • Crash-safe by SQLite WAL
  • Performance sufficient up to ~100k vectors; this corpus is well under that
Why hybrid (vector + lexical)?
  • Vector retrieval misses exact ticker symbols, model names, formula notation
  • BM25 nails proper nouns + technical terms
  • Finance especially needs both: "AAPL Q3" is a ticker (lexical) and "Apple's growth quarter" is a concept (vector)
  • Fitness leans vector-heavy (0.05 lexical) because timeless concepts dominate
Why graph re-rank?
  • A vault is not a flat corpus — it has citation structure
  • PageRank surfaces canonical references the operator has already validated by citing them
  • Centrality finds "load-bearing" notes — those well-cited by other well-cited notes
  • Recency boost prevents the system from being trapped by old foundational notes

Caches: what's inside each domain's cache file

Table | Purpose
vault_node | One row per markdown file. Frontmatter, evergreen flag, domain tag, last-modified.
vault_chunk | One row per chunk (~600-word target). Section label, embedding (1024-d float32 blob).
vault_chunk_vec | sqlite-vec virtual table: KNN index over vault_chunk.embedding.
vault_chunk_fts | FTS5 virtual table: BM25 over vault_chunk.text.
vault_edge | Citation graph edges (forward links + backlinks). Source for PageRank + centrality.

Endpoints — the indexer HTTP surface

Endpoint | Returns
GET /health | Vault path, DB path, embedding model, embedding_dim
POST /reload | Full vault rescan (idempotent, diff-aware)
GET /search?q=&k= | Hybrid top-K with excerpts
GET /traverse/<path>?depth= | Local subgraph from a seed node
GET /node/<path> | Full chunk listing for a single file
POST /promote | Apply ticks from _review-queue.md, regenerate queue
POST /apply-renames | Scope-aware atomic taxonomy renames

TradingV's vault module proxies these endpoints from the FastAPI app to the indexer. The frontend never hits Port 1 directly — it hits /v1/vault/search which forwards.
