06 · deep tech · technical

Embedding model, chunking, retrieval math

The knowledge vault answers the question "what should I know about this drift signal?" The retrieval stack is a hybrid: dense vector KNN plus sparse lexical BM25, fused by Reciprocal Rank Fusion, decayed by age, then re-ranked by graph centrality. Four small SQLite-backed processes, one per domain, each serve the same HTTP API on their own port. All math is reproducible from the source code in tools/vault_indexer/.

The embedding model

BAAI/bge-large-en-v1.5

Model: BAAI/bge-large-en-v1.5 (Hugging Face)
Dimension: 1024
Similarity metric: cosine, computed via sqlite-vec
Storage on disk: float32 little-endian blob (4096 bytes per vector)
Query-side prefix (bge convention): "Represent this sentence for searching relevant passages: "
Document-side prefix: none; documents are encoded raw
Why this model: strong English retrieval benchmarks, runs on CPU at acceptable latency for a personal corpus, dimensionality matches the MTEB sweet spot for sub-100k chunks
Immutability per cache. The embedding_dim is written into the SQLite schema at cache creation. Swapping the EMBEDDING_MODEL env var without deleting the cache crashes startup (the dimension wouldn't match). To swap models cleanly: stop the indexer → delete cache-<domain>.db → restart → full re-embed.
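The storage format and prefix convention above can be illustrated with a minimal sketch. It uses only the standard library; the function names (pack_vector, unpack_vector, prepare_query, prepare_document) are hypothetical and are not taken from tools/vault_indexer/.

```python
import struct
from typing import List, Sequence

EMBEDDING_DIM = 1024  # dimension of bge-large-en-v1.5
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def pack_vector(vec: Sequence[float]) -> bytes:
    """Serialise a vector as little-endian float32 (4 bytes per component)."""
    if len(vec) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dims, got {len(vec)}")
    return struct.pack(f"<{EMBEDDING_DIM}f", *vec)


def unpack_vector(blob: bytes) -> List[float]:
    """Inverse of pack_vector; rejects blobs whose size implies another dimension."""
    if len(blob) != 4 * EMBEDDING_DIM:
        raise ValueError(f"expected {4 * EMBEDDING_DIM} bytes, got {len(blob)}")
    return list(struct.unpack(f"<{EMBEDDING_DIM}f", blob))


def prepare_query(text: str) -> str:
    """Queries get the bge instruction prefix before encoding."""
    return QUERY_PREFIX + text


def prepare_document(text: str) -> str:
    """Documents are encoded raw, with no prefix."""
    return text
```

The size check in unpack_vector mirrors the reason a model swap needs a fresh cache: a blob written at one dimension cannot be read back at another.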

Chunking strategy

The chunker lives in vault.py:chunk_body(). It is heading-aware by design, with a token target and overlap:

Unit: H2 section (markdown ##)
Token target: 600
Overlap: 80 tokens
Algorithm: scan headings → partition per H2 → per-section token check → sliding window subdivision on long sections
Returns: List[(text, section_label)]
Why H2: preserves narrative context; for The Street snapshots a single ticker = one section = one chunk
Why 600: close to bge-large's input limit of 512 tokens. If chunk_target_tokens counts model tokens, the encoder truncates anything past that limit, so the tail of a full-length chunk may not be embedded.
Tokens, not words. The config keys are chunk_target_tokens=600 and chunk_overlap_tokens=80. Earlier drafts of this doc said "words" — they are tokens.
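The algorithm row above can be sketched as follows. This is an illustration, not the code in vault.py: the name chunk_body_sketch is hypothetical, and whitespace splitting stands in for whatever tokenizer the real chunker uses (an assumption, so counts will differ from model tokens).

```python
import re
from typing import List, Tuple

H2_RE = re.compile(r"^##\s+(.*)$")  # matches "## Title" but not "### Title"


def chunk_body_sketch(body: str, target: int = 600, overlap: int = 80) -> List[Tuple[str, str]]:
    """Partition markdown by H2, then window long sections with overlap."""
    # 1. Scan headings and partition per H2 (text before the first H2 gets label "").
    sections: List[Tuple[str, List[str]]] = []
    label, lines = "", []
    for line in body.splitlines():
        match = H2_RE.match(line)
        if match:
            sections.append((label, lines))
            label, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    sections.append((label, lines))

    chunks: List[Tuple[str, str]] = []
    step = target - overlap
    for label, lines in sections:
        tokens = " ".join(lines).split()  # ASSUMPTION: whitespace tokens
        if not tokens:
            continue
        # 2. Short sections become a single chunk.
        if len(tokens) <= target:
            chunks.append((" ".join(tokens), label))
            continue
        # 3. Long sections are subdivided by a sliding window.
        start = 0
        while True:
            chunks.append((" ".join(tokens[start:start + target]), label))
            if start + target >= len(tokens):
                break
            start += step
    return chunks
```

With the defaults, a 1300-token section yields three windows starting at tokens 0, 520, and 1040, and consecutive windows share 80 tokens.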

The four indexer processes

| Port | Domain | Cache file | Decay | Lexical weight | Notes |
| --- | --- | --- | --- | --- | --- |
| :8001 | finance | cache-finance.db (18 MB) | ranked_grouped | 0.5 | includes Filings kind_overrides |
| :8002 | fitness | cache-fitness.db (14 MB) | off | 0.05 | vector-heavy retrieval |
| :8003 | nutrition | cache-nutrition.db (6.3 MB) | off | 0.05 | vector-heavy retrieval |
| :8004 | learning | cache-learning.db (4 KB) | off | 0.05 | empty corpus today |
Four sibling FastAPI processes. Same code, different DOMAIN= env var, separate cache file. Physical isolation by design.

Scope derivation

Each process boots with DOMAIN=<slug> and reads _domains.yaml at the vault root. From the registry it derives INCLUDE/EXCLUDE prefixes for every walk. The finance entry is marked legacy: true, so its scope is everything except the other domains' subfolders (auto-derived).

# _domains.yaml (abridged)
finance:
  legacy: true
  taxonomy: _taxonomy.md
  review_queue: _review-queue.md
  embedding: BAAI/bge-large-en-v1.5
  embedding_dim: 1024
  decay:
    mode: ranked_grouped
    ladder: [1.0, 0.6, 0.45, 0.35, 0.25]
    recency_boost_days: 30
    recency_boost_amount: 0.05
    kind_overrides:
      Filings: { group_by_path_prefix: [1.0, 0.6] }
  lexical:
    weight: 0.5

fitness: { ... decay.mode: off, lexical.weight: 0.05 }
nutrition: { ... decay.mode: off, lexical.weight: 0.05 }
learning: { ... decay.mode: off, lexical.weight: 0.05 }
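One way the INCLUDE/EXCLUDE derivation could work is sketched below. It assumes each non-legacy domain lives in a top-level folder named after its slug, which the registry excerpt does not confirm; derive_scope and in_scope are hypothetical names, and the registry is passed as an already-parsed dict.

```python
from typing import Dict, List, Tuple


def derive_scope(registry: Dict[str, dict], domain: str) -> Tuple[List[str], List[str]]:
    """Return (include_prefixes, exclude_prefixes) for one indexer process."""
    if registry[domain].get("legacy"):
        # Legacy scope: the whole vault minus every other domain's subfolder.
        include = [""]
        exclude = sorted(f"{slug}/" for slug in registry if slug != domain)
    else:
        # ASSUMPTION: a domain's folder is named after its slug.
        include = [f"{domain}/"]
        exclude = []
    return include, exclude


def in_scope(path: str, include: List[str], exclude: List[str]) -> bool:
    """A vault-relative path is walked if it is included and not excluded."""
    included = any(path.startswith(prefix) for prefix in include)
    excluded = any(path.startswith(prefix) for prefix in exclude)
    return included and not excluded
```

Deriving the legacy exclusions from the registry keys means adding a fifth domain narrows the finance scope automatically, without editing the finance entry.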

The retrieval stack — four layers

Two parallel retrievers → RRF merge → optional decay → graph re-rank with recency boost.

Layer 1 · Hybrid Reciprocal Rank Fusion

Vector KNN and FTS5 BM25 run independently. Each produces a ranked list. RRF combines them rank-wise (not score-wise):

rrf_score(d) = Σ over retrievers r:  weight(r) × 1 / (k + rank_r(d))

# k = 60 (standard RRF constant)
# rank_r(d) = position of doc d in retriever r's ranking (1-indexed)
# weight(r) per-domain in _domains.yaml
#   finance: vector=0.5, lexical=0.5
#   others:  vector=0.95, lexical=0.05

RRF is robust to the score-scale mismatch between vector cosine similarity (in [-1, 1]) and BM25 (unbounded). Because it uses rank positions rather than raw scores, neither retriever can swamp the other simply by producing larger numbers.
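The weighted RRF formula can be written out as a short sketch. The function name rrf_fuse and the retriever keys "vector" and "lexical" are illustrative choices; the constant k = 60 and the weights come from the formula above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_fuse(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Fuse ranked lists: score(d) = sum over r of weight(r) / (k + rank_r(d))."""
    scores: Dict[str, float] = defaultdict(float)
    for retriever, ranked_docs in rankings.items():
        for rank, doc in enumerate(ranked_docs, start=1):  # ranks are 1-indexed
            scores[doc] += weights[retriever] / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = rrf_fuse(
    {"vector": ["a", "b", "c"], "lexical": ["b", "d"]},
    {"vector": 0.5, "lexical": 0.5},  # finance weights
)
```

In this example "b" wins because it appears in both lists, even though "a" is the top vector hit. A document missing from one retriever's list simply contributes nothing for that retriever.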

Layer 2 · Decay (finance only by default)

The finance corpus has time-sensitive material (filings, snapshots, news). Decay groups results into recency tiers and multiplies the RRF score by a tier weight. The default ladder is [1.0, 0.6, 0.45, 0.35, 0.25], so a result in the first tier keeps its full score and one in the fifth tier keeps 25% of it.

Fitness, nutrition, and learning use mode: off — content is timeless.
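The ladder can be illustrated with a sketch of one plausible reading of ranked_grouped: results are grouped by a key, ordered newest first within each group, and the i-th newest is multiplied by ladder[i], with the last weight reused for anything older. The grouping rule and the function name apply_ladder_decay are assumptions, not the indexer's confirmed behaviour.

```python
from collections import defaultdict
from datetime import date
from typing import Callable, Dict, List, Sequence, Tuple

LADDER = [1.0, 0.6, 0.45, 0.35, 0.25]
Hit = Tuple[str, float, date]  # (path, rrf_score, last_modified)


def apply_ladder_decay(
    hits: List[Hit],
    group_key: Callable[[str], str],
    ladder: Sequence[float] = LADDER,
) -> List[Tuple[str, float]]:
    """Multiply each hit's score by the ladder weight of its recency tier."""
    groups: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        groups[group_key(hit[0])].append(hit)

    decayed: List[Tuple[str, float]] = []
    for members in groups.values():
        members.sort(key=lambda hit: hit[2], reverse=True)  # newest first
        for tier, (path, score, _modified) in enumerate(members):
            weight = ladder[min(tier, len(ladder) - 1)]  # clamp to the last rung
            decayed.append((path, score * weight))
    return sorted(decayed, key=lambda item: (-item[1], item[0]))
```

Under this reading, an older filing with a slightly better RRF score still drops below the newest filing in the same group, which is the intended effect for time-sensitive material.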

Layer 3 · Graph re-ranking

When citation edges exist (a vault note links to another), a graph re-rank applies a linear combination:

hybrid_score(d) = α × vector_score(d)
                + β × pagerank(d)
                + γ × eigenvector_centrality(d)
                + recency_bonus(d)

# α = 0.6, β = 0.25, γ = 0.15
# recency_bonus = +0.05 if modified within 30 days, else 0
# applies to ALL FOUR DOMAINS (not finance-only)
# only activates if ≥10 citation edges exist in the touched subgraph
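The linear combination can be sketched with plain inputs. PageRank and eigenvector centrality are taken as precomputed numbers here (the latency table mentions networkx for that step); how they are normalised is not stated in this document, so the sketch uses them as given. The names hybrid_score and rerank, and the candidate dict layout, are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List

ALPHA, BETA, GAMMA = 0.6, 0.25, 0.15
RECENCY_DAYS, RECENCY_BONUS = 30, 0.05
MIN_EDGES = 10  # graph re-rank only activates at or above this edge count


def hybrid_score(vector_score: float, pagerank: float, centrality: float,
                 modified: datetime, now: datetime) -> float:
    """alpha*vector + beta*pagerank + gamma*centrality + recency bonus."""
    recent = now - modified <= timedelta(days=RECENCY_DAYS)
    bonus = RECENCY_BONUS if recent else 0.0
    return ALPHA * vector_score + BETA * pagerank + GAMMA * centrality + bonus


def rerank(candidates: Dict[str, dict], edge_count: int, now: datetime) -> List[str]:
    """Order candidate paths; fall back to vector_score when the graph is too sparse."""
    if edge_count < MIN_EDGES:
        return sorted(candidates, key=lambda p: (-candidates[p]["vector_score"], p))
    scored = {
        path: hybrid_score(c["vector_score"], c["pagerank"], c["centrality"], c["modified"], now)
        for path, c in candidates.items()
    }
    return sorted(scored, key=lambda p: (-scored[p], p))
```

Since α + β + γ = 1, the combined score stays on the same scale as its inputs when all three are in [0, 1]; the recency bonus is the only term that can push it above that range.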

Iterative-deepening graph traversal

/traverse/<path>?depth=N seeds the graph search from a specific node and walks the citation graph outward. The traversal is quality-gated, not exhaustive: at each hop, beam-width pruning drops nodes whose similarity falls below the 0.65 quality floor, and only the surviving nodes feed the next hop toward the top-K output.

A typical traversal touches ~100 embeddings per query and never bulk-loads.
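The quality gate can be illustrated with a hop-by-hop expansion sketch. It is a simplified stand-in rather than the indexer's traversal: similarities are passed in as a precomputed dict, the beam_width default of 8 is a made-up value, and traverse_sketch is a hypothetical name. Only the 0.65 floor comes from the text above.

```python
from typing import Dict, List, Sequence


def traverse_sketch(
    graph: Dict[str, Sequence[str]],
    similarity: Dict[str, float],
    seed: str,
    depth: int,
    floor: float = 0.65,
    beam_width: int = 8,  # ASSUMPTION: illustrative value
) -> List[str]:
    """Walk outward from seed, keeping only neighbours above the quality floor."""
    visited = {seed}
    frontier = [seed]
    kept = [seed]
    for _ in range(depth):
        # Collect unseen neighbours of the current frontier.
        candidates = {
            neighbour
            for node in frontier
            for neighbour in graph.get(node, ())
            if neighbour not in visited
        }
        visited |= candidates
        # Drop anything below the floor, then keep the best beam_width nodes.
        survivors = sorted(
            (n for n in candidates if similarity.get(n, 0.0) >= floor),
            key=lambda n: (-similarity.get(n, 0.0), n),
        )[:beam_width]
        if not survivors:
            break
        kept.extend(survivors)
        frontier = survivors  # only survivors feed the next hop
    return kept
```

A dropped node also cuts off everything reachable only through it, which is what keeps the number of embeddings touched small.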

Excerpt extraction

After top-K chunks are picked, each one's excerpt (the snippet shown in the rec's "Sources" section) is selected sentence-by-sentence:

1. Split chunk into sentences
2. Encode each sentence with the same bge prefix
3. Cosine-score against the query embedding
4. Return the top-2 sentences in document order

This keeps citations focused — instead of a 600-token chunk wall, the operator sees the two sentences that best match the rec's framing question.
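The four steps can be sketched as below. The encoder is passed in as a callable so the example runs without a model; in the real stack that role is played by bge-large. The regex sentence splitter and the name select_excerpt are assumptions for illustration.

```python
import math
import re
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_excerpt(
    chunk: str,
    query_vec: Vector,
    embed: Callable[[str], Vector],
    top_n: int = 2,
) -> List[str]:
    """Pick the top_n sentences by similarity, returned in document order."""
    # 1. Split the chunk into sentences (naive punctuation-based splitter).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    # 2-3. Encode each sentence and score it against the query embedding.
    scored = [(cosine(embed(s), query_vec), i) for i, s in enumerate(sentences)]
    # 4. Keep the best top_n, then restore document order.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_n]
    return [sentences[i] for i in sorted(i for _, i in best)]
```

Returning the winners in document order rather than score order keeps the excerpt readable when the two sentences build on each other.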

End-to-end query latency budget

| Step | Typical time | Bottleneck |
| --- | --- | --- |
| Encode query (bge-large) | 40–120 ms | CPU forward pass |
| Vector KNN (sqlite-vec) | 5–20 ms | linear scan over ~50k vectors |
| FTS5 BM25 | 2–8 ms | tokeniser + inverted index |
| RRF merge | < 1 ms | |
| Decay weighting | < 1 ms | |
| Graph re-rank | 20–80 ms | networkx PageRank on subgraph |
| Excerpt selection | 30–60 ms | sentence-level encoding for top-K chunks |
| Total | 100–290 ms | encoder forward passes dominate |

Why this model + stack?

Four design choices.

Why bge-large-en-v1.5?
  • Top of the MTEB English retrieval leaderboard in the sub-1B-param tier (≈335M params)
  • Runs on CPU at acceptable latency for a personal corpus (~50k chunks)
  • 1024-d is the sweet spot — bigger doesn't help on sub-100k corpora; smaller (768-d) loses recall on subtle queries
  • Single encoder for queries and documents (only the query gets a prefix), which is simpler to operate than a two-model retriever
Why sqlite-vec instead of pgvector?
  • Per-domain physical isolation — one cache file per indexer
  • Zero-server local-first deploys (no postgres dependency for the indexer)
  • Atomic cp-able caches → trivial backup/restore
  • Crash-safe by SQLite WAL
  • Performance sufficient up to ~100k vectors; this corpus is well under that
Why hybrid (vector + lexical)?
  • Vector retrieval misses exact ticker symbols, model names, formula notation
  • BM25 nails proper nouns + technical terms
  • Finance especially needs both: "AAPL Q3" is a ticker (lexical) and "Apple's growth quarter" is a concept (vector)
  • Fitness leans vector-heavy (0.05 lexical) because timeless concepts dominate
Why graph re-rank?
  • A vault is not a flat corpus — it has citation structure
  • PageRank surfaces canonical references the operator has already validated by citing them
  • Centrality finds "load-bearing" notes — those well-cited by other well-cited notes
  • Recency boost prevents the system from being trapped by old foundational notes

Caches: what's inside cache-<domain>.db

| Table | Purpose |
| --- | --- |
| vault_node | One row per markdown file. Frontmatter, evergreen flag, domain tag, last-modified. |
| vault_chunk | One row per 600-token chunk. Section label, embedding (1024-d float32 blob). |
| vault_chunk_vec | sqlite-vec virtual table: KNN index over vault_chunk.embedding. |
| vault_chunk_fts | FTS5 virtual table: BM25 over vault_chunk.text. |
| vault_edge | Citation graph edges (forward links + backlinks). Source for PageRank + centrality. |
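A rough picture of the three ordinary tables, as a sketch only: every column name below is a guess based on the purposes listed above, not the actual schema, and the two virtual tables (vault_chunk_vec, vault_chunk_fts) are left out because they depend on the sqlite-vec extension and an FTS5-enabled SQLite build.

```python
import sqlite3

# ASSUMPTION: all column names are illustrative, inferred from the table purposes.
SCHEMA = """
CREATE TABLE vault_node (
    path          TEXT PRIMARY KEY,
    frontmatter   TEXT,
    evergreen     INTEGER NOT NULL DEFAULT 0,
    domain        TEXT,
    last_modified TEXT
);
CREATE TABLE vault_chunk (
    id            INTEGER PRIMARY KEY,
    node_path     TEXT NOT NULL REFERENCES vault_node(path),
    section_label TEXT,
    text          TEXT NOT NULL,
    -- 1024 float32 values = 4096 bytes; other sizes are rejected
    embedding     BLOB NOT NULL CHECK (length(embedding) = 4096)
);
CREATE TABLE vault_edge (
    src TEXT NOT NULL,
    dst TEXT NOT NULL,
    PRIMARY KEY (src, dst)
);
"""


def open_cache_sketch(path: str = ":memory:") -> sqlite3.Connection:
    """Create an empty cache with the sketched schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

The CHECK constraint is one way to make a dimension mismatch fail loudly at write time instead of surfacing later as a corrupt KNN result.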

Endpoints — the indexer HTTP surface

| Endpoint | Returns |
| --- | --- |
| GET /health | Vault path, DB path, embedding model, embedding_dim |
| POST /reload | Full vault rescan (idempotent, diff-aware) |
| GET /search?q=&k= | Hybrid top-K with excerpts |
| GET /traverse/<path>?depth= | Local subgraph from a seed node |
| GET /node/<path> | Full chunk listing for a single file |
| POST /promote | Apply ticks from _review-queue.md, regenerate queue |
| POST /apply-renames | Scope-aware atomic taxonomy renames |

TradingV's app/vault/ module proxies these endpoints from the FastAPI app to the indexer. The frontend never hits :8001 directly — it hits /v1/vault/search which forwards.
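For orientation, here is a small sketch of how a caller might build indexer URLs for two of the endpoints. The port mapping comes from the process list above; the host 127.0.0.1 and the helper names search_url and traverse_url are assumptions, and the response format is not shown because this document does not specify it.

```python
from urllib.parse import quote, urlencode

INDEXER_PORTS = {"finance": 8001, "fitness": 8002, "nutrition": 8003, "learning": 8004}


def search_url(domain: str, query: str, k: int = 5, host: str = "127.0.0.1") -> str:
    """Build GET /search?q=&k= for one domain's indexer (host is an assumption)."""
    params = urlencode({"q": query, "k": k})
    return f"http://{host}:{INDEXER_PORTS[domain]}/search?{params}"


def traverse_url(domain: str, path: str, depth: int = 2, host: str = "127.0.0.1") -> str:
    """Build GET /traverse/<path>?depth= with the vault path percent-encoded."""
    params = urlencode({"depth": depth})
    return f"http://{host}:{INDEXER_PORTS[domain]}/traverse/{quote(path)}?{params}"
```

Only the backend proxy would call these directly; browser code goes through /v1/vault/search as described above.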
