Embedding model, chunking, retrieval math
The knowledge vault is the answer to "what should I know about this drift
signal?" The retrieval stack is a hybrid: dense vector KNN plus sparse
lexical BM25, fused by Reciprocal Rank Fusion, decayed by age, then
re-ranked by graph centrality. Four small SQLite-backed processes serve
one HTTP endpoint each. All math is reproducible from the source code in
tools/vault_indexer/.
The embedding model
BAAI/bge-large-en-v1.5
| Model | BAAI/bge-large-en-v1.5 (Hugging Face) |
|---|---|
| Dimension | 1024 |
| Similarity metric | cosine, computed via sqlite-vec |
| Storage on disk | float32 little-endian blob (4096 bytes per vector) |
| Query-side prefix (bge convention) | "Represent this sentence for searching relevant passages: " |
| Document-side prefix | none — documents encoded raw |
| Why this model | Strong English retrieval benchmarks, runs on CPU at acceptable latency for a personal corpus, dimensionality matches the MTEB sweet spot for sub-100k chunks |
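The storage and prefix rows above can be illustrated with a minimal sketch. It assumes nothing about the indexer's real code: `prepare_query`, `prepare_document`, `pack_vector`, and `unpack_vector` are hypothetical helper names, and only the dimension, prefix string, and byte layout come from the table.

```python
import struct

EMBEDDING_DIM = 1024  # from the table above
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def prepare_query(text: str) -> str:
    """Queries get the bge instruction prefix before encoding."""
    return QUERY_PREFIX + text


def prepare_document(text: str) -> str:
    """Documents are encoded raw, with no prefix."""
    return text


def pack_vector(vec) -> bytes:
    """Serialise a vector as little-endian float32 (4 bytes per component)."""
    if len(vec) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} components, got {len(vec)}")
    return struct.pack(f"<{EMBEDDING_DIM}f", *vec)


def unpack_vector(blob: bytes) -> list:
    """Inverse of pack_vector: decode a 4096-byte blob back into floats."""
    return list(struct.unpack(f"<{EMBEDDING_DIM}f", blob))
```

A 1024-component vector at 4 bytes per float32 gives the 4096-byte blob size quoted in the table.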
embedding_dim is
written into the SQLite schema at cache creation. Swapping the
EMBEDDING_MODEL env var without deleting the cache crashes
startup (the dimension wouldn't match). To swap models cleanly: stop
the indexer → delete cache-<domain>.db → restart →
full re-embed.
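To make the failure mode concrete, here is a hedged sketch of an explicit dimension guard at startup. The `cache_meta` table and the `check_embedding_dim` function are hypothetical and invented for illustration; the real indexer records the dimension in its schema and may detect the mismatch differently.

```python
import sqlite3


def check_embedding_dim(conn: sqlite3.Connection, expected_dim: int) -> None:
    """Record the dimension on first use; refuse to start if it later differs."""
    # Hypothetical key/value table standing in for wherever the real schema
    # stores embedding_dim.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache_meta (key TEXT PRIMARY KEY, value TEXT)"
    )
    row = conn.execute(
        "SELECT value FROM cache_meta WHERE key = 'embedding_dim'"
    ).fetchone()
    if row is None:
        # Fresh cache: remember the dimension it was created with.
        conn.execute(
            "INSERT INTO cache_meta (key, value) VALUES ('embedding_dim', ?)",
            (str(expected_dim),),
        )
        conn.commit()
        return
    stored = int(row[0])
    if stored != expected_dim:
        raise RuntimeError(
            f"cache was built with embedding_dim={stored}, "
            f"but the configured model produces {expected_dim}; "
            "delete the cache file and re-embed"
        )
```

Failing fast with a clear message like this is one way to point the operator at the delete-and-re-embed procedure described above.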
Chunking strategy
The chunker lives in vault.py:chunk_body(). It is heading-aware
by design, with a token target and overlap:
| Unit | H2 section (markdown ##) |
|---|---|
| Token target | 600 |
| Overlap | 80 tokens |
| Algorithm | scan headings → partition per H2 → per-section token check → sliding window subdivision on long sections |
| Returns | List[(text, section_label)] |
| Why H2 | preserves narrative context; for The Street snapshots a single ticker = one section = one chunk |
| Why 600 | sized around bge-large's sweet spot (max 512 input tokens before encoder pool); the 600 target counts tokens, not words |
chunk_target_tokens=600 and chunk_overlap_tokens=80.
Earlier drafts of this doc said "words" — they are tokens.
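The algorithm row in the table can be sketched roughly as follows. This is not the code in vault.py: `chunk_body_sketch` is a hypothetical name, and whitespace-separated words stand in for model tokens so the example runs without a tokenizer.

```python
import re
from typing import List, Tuple


def chunk_body_sketch(
    body: str, target_tokens: int = 600, overlap_tokens: int = 80
) -> List[Tuple[str, str]]:
    """Split markdown into (text, section_label) chunks, one H2 section at a time."""
    # 1. Scan headings and partition the body per H2 ("## ...").
    sections, label, lines = [], "", []
    for line in body.splitlines():
        match = re.match(r"^##\s+(.*)$", line)
        if match:
            if lines:
                sections.append((label, "\n".join(lines)))
            label, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if lines:
        sections.append((label, "\n".join(lines)))

    # 2. Per-section token check; long sections get a sliding window.
    chunks = []
    step = target_tokens - overlap_tokens
    for label, text in sections:
        tokens = text.split()  # ASSUMPTION: whitespace tokens, not model tokens
        if not tokens:
            continue
        if len(tokens) <= target_tokens:
            chunks.append((" ".join(tokens), label))
            continue
        for start in range(0, len(tokens), step):
            chunks.append((" ".join(tokens[start:start + target_tokens]), label))
            if start + target_tokens >= len(tokens):
                break  # the window already reached the end of the section
    return chunks
```

With the defaults, each window advances 520 tokens, so consecutive chunks of a long section share 80 tokens.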
The four indexer processes
One process per domain, each with its own DOMAIN= env var and a separate cache file. Physical isolation by design.

Scope derivation
Each process boots with DOMAIN=<slug> and reads
_domains.yaml at the vault root. From the registry it derives
INCLUDE/EXCLUDE prefixes for every walk. The finance indexer's scope
is legacy: true: it includes everything except other
domains' subfolders (auto-derived).
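As a rough illustration of the derivation, the sketch below computes include and exclude prefixes from an in-memory registry. It assumes each non-legacy domain lives in a top-level folder named after its slug, which the source does not confirm; `derive_scope` and `in_scope` are hypothetical names.

```python
def derive_scope(registry: dict, domain: str) -> dict:
    """Return include/exclude path prefixes for one indexer process."""
    entry = registry[domain]
    if entry.get("legacy"):
        # Legacy scope: everything, minus the other domains' subfolders.
        exclude = sorted(f"{slug}/" for slug in registry if slug != domain)
        return {"include": [""], "exclude": exclude}
    # ASSUMPTION: a regular domain owns the folder named after its slug.
    return {"include": [f"{domain}/"], "exclude": []}


def in_scope(path: str, scope: dict) -> bool:
    """Exclusions win; otherwise the path must match an include prefix."""
    if any(path.startswith(prefix) for prefix in scope["exclude"]):
        return False
    return any(path.startswith(prefix) for prefix in scope["include"])
```

Under these assumptions, adding a new domain to the registry automatically removes its folder from the legacy finance walk.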
    # _domains.yaml (abridged)
    finance:
      legacy: true
      taxonomy: _taxonomy.md
      review_queue: _review-queue.md
      embedding: BAAI/bge-large-en-v1.5
      embedding_dim: 1024
      decay:
        mode: ranked_grouped
        ladder: [1.0, 0.6, 0.45, 0.35, 0.25]
        recency_boost_days: 30
        recency_boost_amount: 0.05
        kind_overrides:
          Filings: { group_by_path_prefix: [1.0, 0.6] }
      lexical:
        weight: 0.5
    fitness: { ... decay.mode: off, lexical.weight: 0.05 }
    nutrition: { ... decay.mode: off, lexical.weight: 0.05 }
    learning: { ... decay.mode: off, lexical.weight: 0.05 }

The retrieval stack — four layers
Layer 1 · Hybrid Reciprocal Rank Fusion
Vector KNN and FTS5 BM25 run independently. Each produces a ranked list. RRF combines them rank-wise (not score-wise):
    rrf_score(d) = Σ over retrievers r: weight(r) × 1 / (k + rank_r(d))
    # k = 60 (standard RRF constant)
    # rank_r(d) = position of doc d in retriever r's ranking (1-indexed)
    # weight(r) per-domain in _domains.yaml
    #   finance: vector=0.5, lexical=0.5
    #   others: vector=0.95, lexical=0.05
RRF is robust against score-scale mismatch between vector cosine [-1, 1] and BM25 (unbounded). By using rank position rather than raw score it works well even when one retriever is dominant.
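The formula translates almost directly into code. The sketch below is a minimal weighted RRF, not the indexer's implementation; `rrf_fuse` is a hypothetical name, and a document missing from one retriever's list simply contributes nothing for that retriever.

```python
def rrf_fuse(rankings: dict, weights: dict, k: int = 60) -> list:
    """Fuse ranked lists: score(d) = sum_r weight(r) / (k + rank_r(d)).

    rankings maps retriever name -> list of doc ids, best first.
    weights maps retriever name -> per-domain weight.
    """
    scores = {}
    for name, ranked in rankings.items():
        weight = weights[name]
        for rank, doc in enumerate(ranked, start=1):  # ranks are 1-indexed
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = rrf_fuse(
    {"vector": ["a", "b", "c"], "lexical": ["b", "d"]},
    {"vector": 0.5, "lexical": 0.5},  # the finance weights
)
```

Here `b` ranks first because it appears in both lists (0.5/62 + 0.5/61), ahead of `a`, which is top of the vector list only (0.5/61).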
Layer 2 · Decay (finance only by default)
The finance corpus has time-sensitive material (filings, snapshots, news).
Decay groups results into recency tiers and multiplies the RRF score by a
tier weight. The default ladder is [1.0, 0.6, 0.45, 0.35, 0.25]:
- Most recent group: weight ×1.0
- Each older group: progressively reduced multiplier
- Evergreen-tagged nodes opt out — they always get ×1.0
- The Filings kind_overrides uses a path-prefix grouping [1.0, 0.6] — current quarter vs everything else
Fitness, nutrition, and learning use mode: off — content is timeless.
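One way to picture the ladder is the sketch below. How the real indexer forms recency groups is not described here, so this assumes each result already carries a sortable `group` key (newer sorts higher) and that groups beyond the end of the ladder reuse its last weight; `apply_decay` and the field names are hypothetical.

```python
DEFAULT_LADDER = (1.0, 0.6, 0.45, 0.35, 0.25)


def apply_decay(results: list, ladder=DEFAULT_LADDER) -> list:
    """Multiply each result's RRF score by the weight of its recency tier."""
    # Rank the distinct groups newest-first; evergreen results are not tiered.
    groups = sorted(
        {r["group"] for r in results if not r.get("evergreen")}, reverse=True
    )
    # ASSUMPTION: groups past the end of the ladder get the last weight.
    tier = {g: min(i, len(ladder) - 1) for i, g in enumerate(groups)}
    decayed = []
    for r in results:
        weight = 1.0 if r.get("evergreen") else ladder[tier[r["group"]]]
        decayed.append({**r, "score": r["score"] * weight})
    return sorted(decayed, key=lambda r: -r["score"])


ranked = apply_decay([
    {"id": "a", "score": 1.0, "group": "2024-Q3"},
    {"id": "b", "score": 1.0, "group": "2024-Q2"},
    {"id": "c", "score": 0.9, "group": "2023-Q1", "evergreen": True},
])
```

In this example the evergreen note `c` keeps its full 0.9 and overtakes `b`, whose score drops to 0.6 in the second tier.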
Layer 3 · Graph re-ranking
When citation edges exist (a vault note links to another), a graph re-rank applies a linear combination:
    hybrid_score(d) = α × vector_score(d)
                    + β × pagerank(d)
                    + γ × eigenvector_centrality(d)
                    + recency_bonus(d)
    # α = 0.6, β = 0.25, γ = 0.15
    # recency_bonus = +0.05 if modified within 30 days, else 0
    # applies to ALL FOUR DOMAINS (not finance-only)
    # only activates if ≥10 citation edges exist in the touched subgraph

- PageRank rewards well-cited notes — the canonical references in your reading.
- Eigenvector centrality rewards notes that are well-cited by other well-cited notes — the "load-bearing" concepts.
- Recency boost ensures fresh thinking is not buried under decade-old classics.
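The linear combination and its activation gate can be sketched as below. The centrality values are assumed to be precomputed and already on a scale comparable to the vector score; how the indexer normalises them is not stated. `graph_rerank` and the candidate field names are hypothetical.

```python
from datetime import datetime, timedelta

ALPHA, BETA, GAMMA = 0.6, 0.25, 0.15
RECENCY_BONUS = 0.05
RECENCY_WINDOW = timedelta(days=30)
MIN_EDGES = 10  # re-rank only when the touched subgraph has enough edges


def hybrid_score(vector_score, pagerank, eigenvector, modified, now):
    """alpha*vector + beta*pagerank + gamma*eigenvector + recency bonus."""
    bonus = RECENCY_BONUS if now - modified <= RECENCY_WINDOW else 0.0
    return ALPHA * vector_score + BETA * pagerank + GAMMA * eigenvector + bonus


def graph_rerank(candidates: list, edge_count: int, now: datetime) -> list:
    """Re-order candidates by hybrid score, or leave them alone below the gate."""
    if edge_count < MIN_EDGES:
        return list(candidates)  # too few citation edges: keep incoming order
    return sorted(
        candidates,
        key=lambda c: -hybrid_score(
            c["vector_score"], c["pagerank"], c["eigenvector"], c["modified"], now
        ),
    )
```

For example, a note with vector score 0.8, PageRank 0.4, eigenvector centrality 0.2, and an edit within the last 30 days scores 0.48 + 0.10 + 0.03 + 0.05 = 0.66.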
Iterative-deepening graph traversal
/traverse/<path>?depth=N seeds the graph search from a
specific node and walks the citation graph outward. The traversal is
quality-gated, not exhaustive:
- Depth cap: 4 hops default
- Beam width: pruned at each hop
- Quality floor: 0.65 vector similarity to query
- Target K: 10 results
- Early stop: when average similarity plateaus
Typical traversal touches ~100 embeddings per query — never bulk-loads.
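The gates listed above can be read as a beam-pruned, hop-by-hop expansion. The sketch below is one such reading, not the indexer's code: the beam width of 8, the plateau tolerance, and the `neighbors` and `similarity` callables are all assumptions made for illustration.

```python
def traverse(seed, neighbors, similarity, depth=4, beam_width=8,
             floor=0.65, target_k=10, plateau_eps=0.01):
    """Walk outward from seed, keeping only nodes similar enough to the query.

    neighbors(node) -> iterable of linked nodes
    similarity(node) -> vector similarity of that node to the query
    """
    visited, frontier, results = {seed}, [seed], []
    prev_avg = None
    for _ in range(depth):  # depth cap
        candidates = []
        for node in frontier:
            for nxt in neighbors(node):
                if nxt in visited:
                    continue
                visited.add(nxt)
                score = similarity(nxt)
                if score >= floor:  # quality floor
                    candidates.append((score, nxt))
        if not candidates:
            break
        candidates.sort(key=lambda c: (-c[0], c[1]))
        kept = candidates[:beam_width]  # beam pruning at each hop
        results.extend(kept)
        frontier = [node for _, node in kept]
        if len(results) >= target_k:  # enough results collected
            break
        avg = sum(score for score, _ in results) / len(results)
        # ASSUMPTION: "plateau" means the running average stopped moving.
        if prev_avg is not None and abs(avg - prev_avg) < plateau_eps:
            break
        prev_avg = avg
    results.sort(key=lambda c: (-c[0], c[1]))
    return results[:target_k]
```

Because similarity is evaluated only for nodes reached through the beam, the number of embeddings touched stays small relative to the whole cache.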
Excerpt extraction
After top-K chunks are picked, each one's excerpt (the snippet shown in the rec's "Sources" section) is selected sentence-by-sentence:
1. Split chunk into sentences
2. Encode each sentence with the same bge prefix
3. Cosine-score against the query embedding
4. Return the top-2 sentences in document order
This keeps citations focused — instead of a 600-token chunk wall, the operator sees the two sentences that best match the rec's framing question.
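A compact sketch of that selection follows. The encoder is passed in as a callable so the example stays self-contained; the regex sentence splitter and the `select_excerpt` name are assumptions, and the real indexer would call bge-large where the toy encoder is used here.

```python
import math
import re


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_excerpt(chunk: str, query_vec, encode, top_n: int = 2) -> list:
    """Pick the top_n sentences most similar to the query, in document order."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    scored = [(cosine(encode(s), query_vec), i) for i, s in enumerate(sentences)]
    best = sorted(scored, key=lambda t: (-t[0], t[1]))[:top_n]
    return [sentences[i] for i in sorted(i for _, i in best)]


# Toy stand-in for the embedding model: keyword counts over a tiny vocabulary.
VOCAB = ["revenue", "growth", "weather"]


def toy_encode(text: str) -> list:
    return [text.lower().count(word) for word in VOCAB]


excerpt = select_excerpt(
    "Revenue rose sharply. The weather was mild. Growth in revenue continued.",
    toy_encode("revenue growth"),
    toy_encode,
)
```

The two revenue sentences are returned in their original order, and the unrelated weather sentence is dropped.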
End-to-end query latency budget
| Step | Typical time | Bottleneck |
|---|---|---|
| Encode query (bge-large) | 40–120 ms | CPU forward pass |
| Vector KNN (sqlite-vec) | 5–20 ms | linear scan over ~50k vectors |
| FTS5 BM25 | 2–8 ms | tokeniser + inverted index |
| RRF merge | < 1 ms | — |
| Decay weighting | < 1 ms | — |
| Graph re-rank | 20–80 ms | networkx PageRank on subgraph |
| Excerpt selection | 30–60 ms | sentence-level encoding for top-K chunks |
| Total | 100–290 ms | encoder forward passes dominate |
Why this model + stack?
Four design choices.
Why bge-large-en-v1.5?
- Top of MTEB English retrieval leaderboard in the sub-1B-parameter tier (≈335M params)
- Runs on CPU at acceptable latency for a personal corpus (~50k chunks)
- 1024-d is the sweet spot — bigger doesn't help on sub-100k corpora; smaller (768-d) loses recall on subtle queries
- Single encoder for both queries and documents (only the query-side prefix differs), which is simpler to operate than retrievers with separate query and document encoders
Why sqlite-vec instead of pgvector?
- Per-domain physical isolation — one cache file per indexer
- Zero-server local-first deploys (no postgres dependency for the indexer)
- Atomic cp-able caches → trivial backup/restore
- Crash-safe by SQLite WAL
- Performance sufficient up to ~100k vectors; this corpus is well under that
Why hybrid (vector + lexical)?
- Vector retrieval misses exact ticker symbols, model names, formula notation
- BM25 nails proper nouns + technical terms
- Finance especially needs both: "AAPL Q3" is a ticker (lexical) and "Apple's growth quarter" is a concept (vector)
- Fitness leans vector-heavy (0.05 lexical) because timeless concepts dominate
Why graph re-rank?
- A vault is not a flat corpus — it has citation structure
- PageRank surfaces canonical references the operator has already validated by citing them
- Centrality finds "load-bearing" notes — those well-cited by other well-cited notes
- Recency boost prevents the system from being trapped by old foundational notes
Caches: what's inside cache-<domain>.db
| Table | Purpose |
|---|---|
| vault_node | One row per markdown file. Frontmatter, evergreen flag, domain tag, last-modified. |
| vault_chunk | One row per 600-token chunk. Section label, embedding (1024-d float32 blob). |
| vault_chunk_vec | sqlite-vec virtual table — KNN index over vault_chunk.embedding. |
| vault_chunk_fts | FTS5 virtual table — BM25 over vault_chunk.text. |
| vault_edge | Citation graph edges (forward links + backlinks). Source for PageRank + centrality. |
Endpoints — the indexer HTTP surface
| Endpoint | Returns |
|---|---|
| GET /health | Vault path, DB path, embedding model, embedding_dim |
| POST /reload | Full vault rescan (idempotent, diff-aware) |
| GET /search?q=&k= | Hybrid top-K with excerpts |
| GET /traverse/<path>?depth= | Local subgraph from a seed node |
| GET /node/<path> | Full chunk listing for a single file |
| POST /promote | Apply ticks from _review-queue.md, regenerate queue |
| POST /apply-renames | Scope-aware atomic taxonomy renames |
TradingV's app/vault/ module proxies these endpoints from the
FastAPI app to the indexer. The frontend never hits :8001
directly — it hits /v1/vault/search which forwards.
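For local debugging against an indexer, request URLs for the two read endpoints can be built as sketched below. The host `localhost` is an assumption (only the port appears above), the helper names are hypothetical, and the response shape is not documented here, so `fetch_json` just decodes whatever JSON comes back.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

INDEXER_BASE = "http://localhost:8001"  # ASSUMPTION: host; port is from the text


def search_url(q: str, k: int = 10, base: str = INDEXER_BASE) -> str:
    """Build GET /search?q=&k= with the query string properly escaped."""
    return f"{base}/search?{urlencode({'q': q, 'k': k})}"


def traverse_url(path: str, depth: int = 4, base: str = INDEXER_BASE) -> str:
    """Build GET /traverse/<path>?depth=, keeping '/' in the vault path."""
    return f"{base}/traverse/{quote(path)}?{urlencode({'depth': depth})}"


def fetch_json(url: str, timeout: float = 5.0):
    """Perform the GET and decode the JSON body (needs a running indexer)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

In normal operation the frontend would use the proxied /v1/vault/search route instead of calling the indexer port directly.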