Embedding model, chunking, retrieval math
The knowledge vault is the answer to "what should I know about this drift signal?" The retrieval stack is a hybrid: dense vector KNN plus sparse lexical BM25, fused by Reciprocal Rank Fusion, down-weighted by a rank-based recency decay (not a time half-life), then re-ranked by graph centrality. Four small SQLite-backed processes serve one HTTP endpoint each. All math is reproducible: the indexer source (the chunking, lexical, and graph-compute modules) ships inside the downloadable kit.
The embedding model
Immutability per cache. The embedding_dim is written into the SQLite schema at cache creation. Swapping the EMBEDDING_MODEL env var without deleting the cache crashes startup (the dimension wouldn't match). To swap models cleanly: stop the indexer → delete each domain's cache file → restart → full re-embed.
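The startup guard described above can be sketched as follows. This is a minimal illustration, not the indexer's actual code: the `cache_meta` table, the `open_cache` function, and the exception name are all hypothetical, and the real schema may record the dimension differently.

```python
import sqlite3


class EmbeddingDimMismatch(RuntimeError):
    """Raised when the configured model's dimension differs from the cache's."""


def open_cache(conn: sqlite3.Connection, model_dim: int) -> int:
    """Record the dimension on first use; refuse to start on a mismatch."""
    # `cache_meta` is a hypothetical table name used only for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache_meta (key TEXT PRIMARY KEY, value TEXT)"
    )
    row = conn.execute(
        "SELECT value FROM cache_meta WHERE key = 'embedding_dim'"
    ).fetchone()
    if row is None:
        # First boot against this cache file: pin the dimension.
        conn.execute(
            "INSERT INTO cache_meta (key, value) VALUES ('embedding_dim', ?)",
            (str(model_dim),),
        )
        conn.commit()
        return model_dim
    stored = int(row[0])
    if stored != model_dim:
        raise EmbeddingDimMismatch(
            f"cache was built with dim={stored}, model produces dim={model_dim}; "
            "delete the cache file and re-embed"
        )
    return stored
```

Failing loudly at startup is the point of the design: a silent mismatch would leave vectors of two different shapes in one KNN index.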
BAAI/bge-large-en-v1.5
| Model | BAAI/bge-large-en-v1.5 (Hugging Face) |
|---|---|
| Dimension | 1024 |
| Similarity metric | cosine, computed via sqlite-vec |
| Storage on disk | float32 little-endian blob (4096 bytes per vector) |
| Query-side prefix (bge convention) | "Represent this sentence for searching relevant passages: " |
| Document-side prefix | none — documents encoded raw |
| Why this model | Strong English retrieval benchmarks, runs on CPU at acceptable latency for a personal corpus, dimensionality matches the MTEB sweet spot for sub-100k chunks |
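A minimal sketch of the two conventions in the table, the query-side prefix and the float32 little-endian blob layout. The function names are hypothetical; only the prefix string, the dimension, and the byte layout come from the table above.

```python
import struct
from typing import List, Sequence

EMBEDDING_DIM = 1024
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def prepare_query(text: str) -> str:
    """Queries get the bge instruction prefix before encoding."""
    return QUERY_PREFIX + text


def prepare_document(text: str) -> str:
    """Documents are encoded raw, with no prefix."""
    return text


def pack_vector(vec: Sequence[float]) -> bytes:
    """Serialize a vector as float32 little-endian (4 bytes per component)."""
    if len(vec) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} components, got {len(vec)}")
    return struct.pack(f"<{EMBEDDING_DIM}f", *vec)


def unpack_vector(blob: bytes) -> List[float]:
    """Inverse of pack_vector."""
    return list(struct.unpack(f"<{EMBEDDING_DIM}f", blob))
```

With 1024 components at 4 bytes each, every stored vector occupies exactly 4096 bytes, which matches the storage row in the table.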
Chunking strategy
"Tokens" here means whitespace words. The config keys are named chunk_target_tokens=600 and chunk_overlap_tokens=80, but the chunker is a deliberately naive token-approximation: it counts whitespace-split words, and no tokenizer runs at chunking time. Earlier drafts of this doc flip-flopped between "words" and "tokens"; the code is the referee.
The chunker is a routine in the vault indexer. It is heading-aware by design, with a size target and overlap counted in whitespace words (a cheap token approximation — the config keys say "tokens", the loop counts words):
| Unit | H2 section (markdown ##) |
|---|---|
| Size target | ~600 words (chunk_target_tokens=600 — a whitespace-word approximation of tokens) |
| Overlap | 80 words (chunk_overlap_tokens=80) |
| Algorithm | scan headings → partition per H2 → per-section size check → sliding window subdivision on long sections |
| Returns | List[(text, section_label)] |
| Why H2 | preserves narrative context; for The Street snapshots a single ticker = one section = one chunk |
| Why 600 — and the 512-token caveat | most H2 sections in this corpus land well under the target, so they embed intact within bge-large's 512-token encoder window. A chunk that genuinely runs to 600 words exceeds that window: the encoder silently truncates, so the vector represents the head of the section, while FTS5 BM25 and excerpt selection still see the full chunk text. Accepted trade-off — the head of an H2 section carries its topic, and lexical recall covers the tail. |
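The algorithm row above can be sketched roughly as below. This is an illustration under assumptions, not the shipped chunker: the function name is hypothetical, text before the first H2 is given an empty label, the window advances by target minus overlap, and the sketch collapses whitespace when it rejoins words, which the real routine may not do.

```python
import re
from typing import List, Tuple

# Matches "## Heading" but not "### Heading" (the third "#" is not whitespace).
H2_RE = re.compile(r"^##\s+(.*\S)\s*$")


def chunk_markdown(
    text: str, target: int = 600, overlap: int = 80
) -> List[Tuple[str, str]]:
    """Return (chunk_text, section_label) pairs, counting whitespace words."""
    if overlap >= target:
        raise ValueError("overlap must be smaller than target")

    # 1. Scan headings and partition the file per H2 section.
    sections: List[Tuple[str, List[str]]] = []
    label, buf = "", []  # assumption: preamble before the first H2 gets label ""
    for line in text.splitlines():
        match = H2_RE.match(line)
        if match:
            if buf:
                sections.append((label, buf))
            label, buf = match.group(1), []
        else:
            buf.append(line)
    if buf:
        sections.append((label, buf))

    # 2. Per-section size check, sliding window only on long sections.
    chunks: List[Tuple[str, str]] = []
    step = target - overlap
    for label, lines in sections:
        words = " ".join(lines).split()
        if not words:
            continue
        if len(words) <= target:
            chunks.append((" ".join(words), label))
            continue
        start = 0
        while True:
            chunks.append((" ".join(words[start:start + target]), label))
            if start + target >= len(words):
                break
            start += step
    return chunks
```

For a 1,000-word section this yields two chunks, words 0 to 599 and words 520 to 999, so the 80-word overlap falls on words 520 to 599.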
The four indexer processes
Each process gets its own DOMAIN= env var and a separate cache file. Physical isolation by design.
Scope derivation
Each process boots with DOMAIN=<slug> and reads
_domains.yaml at the vault root. From the registry it derives
INCLUDE/EXCLUDE prefixes for every walk. The finance indexer's scope
is legacy: true: it includes everything except other
domains' subfolders (auto-derived).
# _domains.yaml (abridged)
finance:
  legacy: true
  taxonomy: _taxonomy.md
  review_queue: _review-queue.md
  embedding: BAAI/bge-large-en-v1.5
  embedding_dim: 1024
  decay:
    mode: ranked_grouped
    ladder: [1.0, 0.6, 0.45, 0.35, 0.25]
    recency_boost_days: 30
    recency_boost_amount: 0.05
    kind_overrides:
      Filings: { group_by_path_prefix: [1.0, 0.6] }
  lexical:
    weight: 0.5
fitness: { ... decay.mode: off, lexical.weight: 0.05 }
nutrition: { ... decay.mode: off, lexical.weight: 0.05 }
learning: { ... decay.mode: off, lexical.weight: 0.05 }
Cold-start tagging. When a brand-new note is ingested, a small Claude Haiku pass proposes 1–5 controlled-vocabulary tags from the domain taxonomy. It is best-effort and operator-review-gated through the domain's _review-queue.md: the model suggests, the operator confirms before tags become canonical.
The retrieval stack — four layers
query (+ bge prefix) → vector KNN (1024-d cosine,
sqlite-vec) in parallel with FTS5 BM25 (lexical) →
layer 1 · RRF 1/(k + rank) →
layer 2 · decay (ranked-grouped, finance only) →
layer 3 · graph re-rank
(0.6·vector + 0.25·PageRank + 0.15·centrality) →
top-K excerpts passed to the LLM as evidence.
Layer 1 · Hybrid Reciprocal Rank Fusion
Vector KNN and FTS5 BM25 run independently. Each produces a ranked list. RRF combines them rank-wise (not score-wise):
rrf_score(d) = Σ over retrievers r: weight(r) × 1 / (k + rank_r(d))
# k = 60 (standard RRF constant)
# rank_r(d) = position of doc d in retriever r's ranking (1-indexed)
# weight(r) per-domain in _domains.yaml
#   finance: vector=1.0, lexical=0.5 (lexical matters for tickers/filings)
#   others: vector=1.0, lexical=0.05 (timeless concepts → vector-dominant)
RRF is robust against score-scale mismatch between vector cosine [-1, 1] and BM25 (unbounded). By using rank position rather than raw score it works well even when one retriever is dominant.
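The formula above translates almost directly into code. The sketch below is illustrative: the function name and the dict-based inputs are assumptions, and ties are broken alphabetically only to keep the output deterministic.

```python
from typing import Dict, List, Tuple


def rrf_fuse(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Weighted Reciprocal Rank Fusion over per-retriever ranked doc ids."""
    scores: Dict[str, float] = {}
    for retriever, ranked_docs in rankings.items():
        weight = weights.get(retriever, 1.0)
        # Ranks are 1-indexed, as in the formula above.
        for rank, doc_id in enumerate(ranked_docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; doc id as a deterministic tie-break.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


rankings = {"vector": ["a", "b", "c", "d", "e"], "lexical": ["e"]}
finance = rrf_fuse(rankings, {"vector": 1.0, "lexical": 0.5})
fitness = rrf_fuse(rankings, {"vector": 1.0, "lexical": 0.05})
```

In this toy input, document "e" is last in the vector list but first in the lexical list. With the finance weights the lexical hit lifts it to the top; with the 0.05 lexical weight it only climbs to second place, behind the best vector match.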
Layer 2 · Decay (finance only by default)
The finance corpus has time-sensitive material (filings, snapshots, news).
The mode is ranked_grouped — and despite earlier wording on this
site, it is not a time-based half-life. Results are grouped
by author, ranked within each author group by published_at
descending, and the ladder [1.0, 0.6, 0.45, 0.35, 0.25] is applied
by rank-position, with 0.25 as the floor past the ladder's length:
- An author's most recent item: rank 0 → ×1.0; each older item steps down the ladder
- A single-item author group → rank 0 → ×1.0 (no penalty)
- Evergreen-tagged nodes opt out — always ×1.0
- The Filings kind_overrides uses a path-prefix grouping [1.0, 0.6]: current quarter vs everything else
Why rank-within-author rather than a time half-life: authors publish on wildly different cadences (a daily newsletter vs a quarterly letter). A pure time decay would bury infrequent and evergreen sources regardless of quality; ranking within each author gives every source a fair shot at the top slot on its own cadence. (Not to be confused with the operator-attention signal, which does use a 7-day half-life — different mechanism, see the Lakshmi door.)
Fitness, nutrition, and learning use mode: off — content is timeless.
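A rough sketch of the ranked_grouped mode for the default author grouping. The `Hit` type and function name are hypothetical, and one detail is an assumption the text does not settle: here an evergreen item still occupies a rank slot within its author group even though its own multiplier stays at 1.0. The Filings path-prefix override is not modeled.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Sequence

LADDER = (1.0, 0.6, 0.45, 0.35, 0.25)


@dataclass
class Hit:
    doc_id: str
    author: str
    published_at: date
    evergreen: bool = False


def decay_weights(
    hits: Sequence[Hit], ladder: Sequence[float] = LADDER
) -> Dict[str, float]:
    """Map doc_id to a multiplier based on rank within the author group."""
    groups: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        groups[hit.author].append(hit)

    weights: Dict[str, float] = {}
    for members in groups.values():
        # Newest first; rank 0 is the author's most recent item.
        members.sort(key=lambda h: h.published_at, reverse=True)
        for rank, hit in enumerate(members):
            if hit.evergreen:
                weights[hit.doc_id] = 1.0  # evergreen nodes opt out
            else:
                # Past the ladder's length, the last rung acts as the floor.
                weights[hit.doc_id] = ladder[min(rank, len(ladder) - 1)]
    return weights
```

Note that no timestamp arithmetic appears anywhere: a quarterly letter from last October and a newsletter from this morning both receive ×1.0 if each is its author's latest item.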
Layer 3 · Graph re-ranking
When citation edges exist (a vault note links to another), a graph re-rank applies a linear combination:
hybrid_score(d) = α × vector_score(d)
+ β × pagerank(d)
+ γ × eigenvector_centrality(d)
+ recency_bonus(d)
# α = 0.6, β = 0.25, γ = 0.15
# recency_bonus = +0.05 if modified within 30 days, else 0
# applies to ALL FOUR DOMAINS (not finance-only)
# only activates if ≥10 citation edges exist in the touched subgraph
- PageRank rewards well-cited notes: the canonical references in your reading. Computed on a directed graph of citation edges only.
- Eigenvector centrality rewards notes that are well-cited by other well-cited notes — the "load-bearing" concepts. Computed on an undirected graph of citations + wikilinks (also the input to Louvain community clustering); falls back to degree centrality if the eigenvector solver fails to converge.
- Recency boost ensures fresh thinking is not buried under decade-old classics.
Graceful degradation. If a node has no PageRank or no
centrality (e.g. brand-new, uncited), that term's weight folds back into
α rather than scoring it as zero — so a fresh note still ranks
on pure vector similarity instead of being penalised for lacking graph
history.
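The scoring rule and its graceful-degradation behavior can be sketched as below. Function names are hypothetical, and two details are assumptions: graph scores are taken to be already normalized to a range comparable with cosine similarity, and "within 30 days" is read as 30 days inclusive.

```python
from typing import Optional

ALPHA, BETA, GAMMA = 0.6, 0.25, 0.15
RECENCY_BONUS = 0.05
RECENCY_DAYS = 30
MIN_CITATION_EDGES = 10


def rerank_enabled(edges_in_subgraph: int) -> bool:
    """The graph re-rank only activates on a sufficiently connected subgraph."""
    return edges_in_subgraph >= MIN_CITATION_EDGES


def hybrid_score(
    vector_score: float,
    pagerank: Optional[float],
    centrality: Optional[float],
    days_since_modified: int,
) -> float:
    """Linear blend; a missing graph term hands its weight back to alpha."""
    alpha = ALPHA
    score = 0.0
    if pagerank is None:
        alpha += BETA  # fold beta into alpha instead of scoring zero
    else:
        score += BETA * pagerank
    if centrality is None:
        alpha += GAMMA  # fold gamma into alpha instead of scoring zero
    else:
        score += GAMMA * centrality
    score += alpha * vector_score
    if days_since_modified <= RECENCY_DAYS:
        score += RECENCY_BONUS
    return score
```

For a brand-new, uncited note both graph terms are missing, so the effective alpha becomes 1.0 and the score reduces to the vector similarity plus any recency bonus.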
Iterative-deepening graph traversal
/traverse/<path>?depth=N seeds the graph search from a
specific node and walks the citation graph outward. The traversal is
quality-gated, not exhaustive:
- Depth cap: 4 hops (max_hops)
- Beam width: 5 (at most 5 candidates carried forward per hop)
- Per-hop prune floor: 0.50 (a candidate below 0.50 cosine to the query is dropped at that hop)
- Early-stop condition: stop once the pool reaches target_k = 10 AND the pool's average similarity is ≥ 0.65 (most queries converge in 1–2 hops)
The two thresholds do different jobs: 0.50 prunes individual weak nodes at each hop; 0.65 is the average-quality bar that lets the search stop early. They are not the same floor.
Typical traversal touches ~100 embeddings per query — never bulk-loads.
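A simplified sketch of the quality-gated walk, showing how the two thresholds interact. It is not the indexer's implementation: the adjacency dict and the precomputed `sim` map stand in for the citation graph and on-demand cosine scoring, the function name is hypothetical, and the seed itself is assumed not to count toward the pool.

```python
from statistics import mean
from typing import Dict, List, Sequence


def traverse(
    seed: str,
    neighbors: Dict[str, Sequence[str]],
    sim: Dict[str, float],
    beam_width: int = 5,
    max_hops: int = 4,
    prune_floor: float = 0.50,
    target_k: int = 10,
    quality_floor: float = 0.65,
) -> List[str]:
    """Beam-limited expansion with a per-hop prune and an early stop."""
    pool: List[str] = []
    seen = {seed}
    frontier = [seed]
    for _hop in range(max_hops):
        # Expand only from nodes retained at the previous hop.
        candidates = []
        for node in frontier:
            for nxt in neighbors.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    candidates.append(nxt)
        # Per-node floor: drop anything below prune_floor at this hop.
        kept = [n for n in candidates if sim[n] >= prune_floor]
        kept.sort(key=lambda n: sim[n], reverse=True)
        frontier = kept[:beam_width]
        pool.extend(frontier)
        if not frontier:
            break
        # Pool-level bar: enough results AND good average quality.
        if len(pool) >= target_k and mean(sim[n] for n in pool) >= quality_floor:
            break
    return pool


graph = {"seed": ["a", "b", "c", "d"], "a": ["e", "f"], "b": ["g"], "d": ["z"]}
scores = {"a": 0.9, "b": 0.7, "c": 0.6, "d": 0.4,
          "e": 0.8, "f": 0.3, "g": 0.55, "z": 0.99}
result = traverse("seed", graph, scores, target_k=5)
```

In the toy graph, "d" is pruned at hop 1, so its strong neighbor "z" is never scored. That is the cost of the fast path, and it is the case the deep mode is meant to recover.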
From a seed vault note, hop-1 expansion yields 4 candidates — 3 retained above the 0.50 per-hop prune floor, 1 pruned. Hop-2 expands from the retained hop-1 nodes only: 2 retained, 1 pruned. Five nodes reach the top-K accepted set; the search stops early once the pool's average similarity clears 0.65, after touching roughly 100 embeddings.
Deep retrieval mode — retrieve wide, filter late
The fast path above is tuned to be cheap and always-on (~100 embeddings per
query). When its thesis_match comes back weak or the corpus is
sparse, an opt-in deep mode runs the same graph search with
the brakes off. Crucially it does not run in the always-on app —
it runs on-demand in a Claude Code session (the operator
runs /rx-deep-retrieve <rec-id>), where the
<$2/mo and <100-embeddings/query budgets don't apply. The retrieval
itself is pure Python over the local bge model + sqlite-vec — zero
LLM, no API key, no billing; the curated result POSTs back to the
app's write-back endpoint (/v1/rx/deep) so the panel can show
it. This cost seam is what makes deep retrieval affordable (see
the enrichment lane on Pipeline).
| Parameter | Fast (always-on) | Deep (on-demand) |
|---|---|---|
| beam_width | 5 | 12 |
| max_hops | 4 | 4 |
| seed_count | 3 | 8 |
| target_k | 10 | 50 |
| prune_threshold | 0.50 | 0.30 (floor only: drop pure noise) |
| early-stop quality_floor | 0.65 (active) | disabled (explore full depth) |
| decay | filter (drops decay≤0) | feature (keeps, flags decay_zero) |
| pruned candidates | dropped silently | returned with a drop_reason |
The principle is separate retrieval from filtering. Deep
mode retrieves wide and attaches metadata to every candidate, then lets the
judgment layer (running where LLM judgment is ~free) filter with full
visibility — so nothing is dropped before it is evaluated. Each retained
candidate carries its similarity, hop distance,
decay_weight, a decay_zero flag, and a
retain_reason (seed ·
kept_despite_decay_zero · above_prune_floor).
Pruned candidates are also returned — each with a
drop_reason (below_prune_threshold ·
beam_overflow) — so a dropped-but-relevant doc can be rescued
by the curator instead of vanishing silently.
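To make the "nothing vanishes silently" idea concrete, here is a sketch of how one hop's candidates might be split into retained and pruned lists with reasons attached. The field and reason names come from the text above; the `Candidate` type, the function name, the precedence between reasons, and the choice that seeds do not consume beam slots are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Candidate:
    path: str
    similarity: float
    hop: int
    decay_weight: float
    is_seed: bool = False


def classify(
    candidates: List[Candidate],
    prune_threshold: float = 0.30,
    beam_width: int = 12,
) -> Tuple[List[Dict], List[Dict]]:
    """Return (retained, pruned); every entry carries its metadata and a reason."""
    retained: List[Dict] = []
    pruned: List[Dict] = []
    slots = beam_width
    for cand in sorted(candidates, key=lambda c: c.similarity, reverse=True):
        meta = {
            "path": cand.path,
            "similarity": cand.similarity,
            "hop": cand.hop,
            "decay_weight": cand.decay_weight,
            "decay_zero": cand.decay_weight <= 0,  # a flag, not a filter
        }
        if cand.is_seed:
            retained.append({**meta, "retain_reason": "seed"})
        elif cand.similarity < prune_threshold:
            pruned.append({**meta, "drop_reason": "below_prune_threshold"})
        elif slots == 0:
            pruned.append({**meta, "drop_reason": "beam_overflow"})
        else:
            slots -= 1
            reason = (
                "kept_despite_decay_zero" if meta["decay_zero"] else "above_prune_floor"
            )
            retained.append({**meta, "retain_reason": reason})
    return retained, pruned
```

The curator then sees both lists, so a document dropped for beam_overflow with a high similarity is an obvious candidate for rescue.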
--mode compare). The
in-app "Deep retrieval available" affordance is specified for the finance
panel but not yet confirmed rendered there; the fitness Coach view already
renders deep results (a "Deep enrichment" section).
Citation verification — is the quote actually in the source?
A separate, deterministic, app-side check (no LLM): at rec
compose and at deep-result write-back, every quoted span must be a
normalized substring of the chunk it cites. Normalization is, in
order: NFKC unicode → smart-punctuation fold (curly quotes → straight,
en/em/minus dashes → -, non-breaking space → space) → casefold
→ whitespace collapse. Ellipsis-elided quotes are split and each fragment
must appear in order. Quotes shorter than 12
normalized characters are marked too_short — unverifiable, and
deliberately not a pass, so a trivial match can't manufacture
confidence.
| Per-citation | citation_verified (bool) + citation_reason: match · no_quote · no_chunk_text · too_short · not_found |
|---|---|
| Per-rec (derived on read) | citations_status: no_quotes · all_verified · has_mismatch · unverifiable |
| On a miss | the rec still publishes — verification is annotate-only, never blocks ingest. A fabricated citation is flagged (has_mismatch), not suppressed. |
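The per-citation check is deterministic enough to sketch in a few lines. The normalization order and reason codes follow the description above; the function names are hypothetical, and two details are assumptions: the 12-character minimum is measured after ellipses are removed, and a missing quote is reported before a missing chunk.

```python
import re
import unicodedata
from typing import Optional, Tuple

MIN_QUOTE_CHARS = 12

# Smart-punctuation fold: curly quotes, en/em/minus dashes, non-breaking space.
_PUNCT_FOLD = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-", "\u2212": "-", "\u00a0": " ",
})


def normalize(text: str) -> str:
    """NFKC, then punctuation fold, then casefold, then whitespace collapse."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_PUNCT_FOLD)
    text = text.casefold()
    return re.sub(r"\s+", " ", text).strip()


def verify_citation(
    quote: Optional[str], chunk_text: Optional[str]
) -> Tuple[bool, str]:
    """Return (citation_verified, citation_reason)."""
    if not quote or not quote.strip():
        return False, "no_quote"
    if not chunk_text or not chunk_text.strip():
        return False, "no_chunk_text"
    # NFKC already turns U+2026 into "...", so one split pattern suffices.
    fragments = [f.strip() for f in re.split(r"\.{3,}", normalize(quote))]
    fragments = [f for f in fragments if f]
    if sum(len(f) for f in fragments) < MIN_QUOTE_CHARS:
        return False, "too_short"  # unverifiable, deliberately not a pass
    haystack = normalize(chunk_text)
    position = 0
    for fragment in fragments:
        found = haystack.find(fragment, position)
        if found < 0:
            return False, "not_found"
        position = found + len(fragment)  # later fragments must come after
    return True, "match"
```

Because the search position only moves forward, an elided quote whose fragments appear in the wrong order is reported as not_found rather than match.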
Excerpt extraction
After top-K chunks are picked, each one's excerpt (the snippet shown in the rec's "Sources" section) is selected sentence-by-sentence:
1. Split chunk into sentences
2. Encode each sentence with the same bge prefix
3. Cosine-score against the query embedding
4. Return the top-2 sentences in document order
This keeps citations focused — instead of a 600-word chunk wall, the operator sees the two sentences that best match the rec's framing question.
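The four steps can be sketched as follows. The encoder is passed in as an `embed` callable so the example runs without the bge model; in the real pipeline that callable would wrap the model and handle its prefix. The regex sentence splitter and all function names are assumptions made for illustration.

```python
import math
import re
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; zero vectors score 0.0 rather than dividing by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def select_excerpt(
    chunk: str,
    query_vec: Vector,
    embed: Callable[[str], Vector],
    top_n: int = 2,
) -> List[str]:
    """Pick the top_n best-matching sentences, returned in document order."""
    sentences = split_sentences(chunk)
    scored = [(cosine(embed(s), query_vec), i) for i, s in enumerate(sentences)]
    # Best score first; earlier sentence wins a tie.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_n]
    return [sentences[i] for i in sorted(i for _, i in best)]
```

Sorting the winners back into document order keeps the excerpt readable even when the second-best sentence precedes the best one in the chunk.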
End-to-end query latency budget
| Step | Typical time | Bottleneck |
|---|---|---|
| Encode query (bge-large) | 40–120 ms | CPU forward pass |
| Vector KNN (sqlite-vec) | 5–20 ms | linear scan over ~50k vectors |
| FTS5 BM25 | 2–8 ms | tokeniser + inverted index |
| RRF merge | < 1 ms | — |
| Decay weighting | < 1 ms | — |
| Graph re-rank | 20–80 ms | networkx PageRank on subgraph |
| Excerpt selection | 30–60 ms | sentence-level encoding for top-K chunks |
| Total | 100–290 ms | encoder forward passes dominate |
Why this model + stack?
Four design choices.
Why bge-large-en-v1.5?
- Top of the MTEB English retrieval leaderboard in the sub-1B-parameter tier (≈335M params)
- Runs on CPU at acceptable latency for a personal corpus (~50k chunks)
- 1024-d is the sweet spot — bigger doesn't help on sub-100k corpora; smaller (768-d) loses recall on subtle queries
- Symmetric retrieval (same model for query + doc) — simpler ops than asymmetric retrievers
Why sqlite-vec instead of pgvector?
- Per-domain physical isolation — one cache file per indexer
- Zero-server local-first deploys (no postgres dependency for the indexer)
- Atomic cp-able caches → trivial backup/restore
- Crash-safe via SQLite WAL
- Performance sufficient up to ~100k vectors; this corpus is well under that
Why hybrid (vector + lexical)?
- Vector retrieval misses exact ticker symbols, model names, formula notation
- BM25 nails proper nouns + technical terms
- Finance especially needs both: "AAPL Q3" is a ticker (lexical) and "Apple's growth quarter" is a concept (vector)
- Fitness leans vector-heavy (0.05 lexical) because timeless concepts dominate
Why graph re-rank?
- A vault is not a flat corpus — it has citation structure
- PageRank surfaces canonical references the operator has already validated by citing them
- Centrality finds "load-bearing" notes — those well-cited by other well-cited notes
- Recency boost prevents the system from being trapped by old foundational notes
Caches: what's inside each domain's cache file
| Table | Purpose |
|---|---|
| vault_node | One row per markdown file. Frontmatter, evergreen flag, domain tag, last-modified. |
| vault_chunk | One row per chunk (~600-word target). Section label, embedding (1024-d float32 blob). |
| vault_chunk_vec | sqlite-vec virtual table: KNN index over vault_chunk.embedding. |
| vault_chunk_fts | FTS5 virtual table: BM25 over vault_chunk.text. |
| vault_edge | Citation graph edges (forward links + backlinks). Source for PageRank + centrality. |
Endpoints — the indexer HTTP surface
| Endpoint | Returns |
|---|---|
| GET /health | Vault path, DB path, embedding model, embedding_dim |
| POST /reload | Full vault rescan (idempotent, diff-aware) |
| GET /search?q=&k= | Hybrid top-K with excerpts |
| GET /traverse/<path>?depth= | Local subgraph from a seed node |
| GET /node/<path> | Full chunk listing for a single file |
| POST /promote | Apply ticks from _review-queue.md, regenerate queue |
| POST /apply-renames | Scope-aware atomic taxonomy renames |
TradingV's vault module proxies these endpoints from the
FastAPI app to the indexer. The frontend never hits Port 1
directly — it hits /v1/vault/search which forwards.