Search Concepts
Last updated:
LumoSearch is a full-text search engine that combines BM25F ranking, trigram-based fuzzy matching, and inverted indexes to deliver typo-tolerant, relevance-ranked results in under 5ms. Unlike linear-scan libraries (Fuse.js, Lunr), it uses a multi-stage retrieval pipeline that stays fast as datasets scale to 100K+ documents.
Architecture Overview
LumoSearch uses a multi-stage retrieval and ranking pipeline:
- Index — Normalize and tokenize configured fields. Build token, trigram, and prefix inverted indexes.
- Retrieve — For each query, gather candidates from postings (token 4x, prefix 3x, trigram 1x weight).
- Prune — Keep only the top candidateLimit candidates based on weighted postings.
- Score — Rank with BM25F + exact/prefix/phrase/proximity boosts.
- Return — Top-k results with character-level highlights.
Why this approach? Instead of scoring every document (like Fuse.js), LumoSearch uses inverted indexes to narrow candidates first. This makes search stay fast as your dataset grows, while still delivering fuzzy matching and typo tolerance.
BM25F Ranking
BM25F is an industry-standard ranking algorithm that considers:
- Term Frequency (TF) — How often query terms appear in a document
- Inverse Document Frequency (IDF) — Rarity of terms across all documents
- Field Weights — Custom weights per field (e.g., title 3x, body 1x)
- Length Normalization — Prevents long documents from dominating
The "F" in BM25F means field-weighted: you can boost importance of specific fields like titles, tags, or names.
// Title matches are 3x more important than body matches
const search = new LumoSearch(docs, {
keys: [
{ name: 'title', weight: 3 },
{ name: 'body', weight: 1 }
]
})Fuzzy Matching
LumoSearch handles typos and misspellings using trigram-based fuzzy matching:
- Each token is split into 3-character sequences (trigrams)
- Example: "javascript" → ["jav", "ava", "vas", "asc", "scr", "cri", "rip", "ipt"]
- Query tokens are matched against indexed trigrams using Jaccard similarity
- Allows "javscrippt" to match "javascript" even with typos
Trigram vs Levenshtein: Trigrams are faster and scale better than edit-distance algorithms. They also naturally handle character transpositions and missing/extra characters.
Candidate Pruning
To stay fast, LumoSearch doesn't score every document. Instead:
- Gather candidate documents from inverted indexes
- Score candidates using weighted postings (token matches 4x, prefix 3x, trigram 1x)
- Keep only top
candidateLimitcandidates (default: 250) - Run full BM25F scoring on these candidates
- Return top
limitresults (default: 10)
// Higher candidateLimit = better recall, slower performance
const search = new LumoSearch(docs, {
keys: ['title', 'body'],
candidateLimit: 500, // Consider more candidates
limit: 20 // Return top 20 results
})Inverted Indexes
LumoSearch builds three inverted indexes at initialization:
- Token Index — Exact token matches (e.g., "javascript")
- Prefix Index — Token prefixes (e.g., "java" matches "javascript")
- Trigram Index — 3-character sequences for fuzzy matching
Each index maps tokens/trigrams to document IDs and field positions. This enables fast candidate retrieval.
Scoring Boosts
On top of BM25F, LumoSearch applies several heuristic boosts:
- Exact Match Boost (2.0x) — Full token matches get double weight
- Prefix Boost (1.25x) — Prefix matches get 25% bonus
- Phrase Proximity — Query terms appearing close together score higher
- Field Weight — User-configured per-field weights
Text Normalization
Before indexing or querying, text is normalized:
- Convert to lowercase
- Apply Unicode NFD normalization
- Remove diacritics (é → e, ñ → n)
- Tokenize on whitespace and punctuation
This ensures "Café" matches "cafe" and "résumé" matches "resume".
Synonyms
Token-level synonyms expand query terms at search time:
const search = new LumoSearch(docs, {
keys: ['title', 'body'],
synonyms: {
js: ['javascript'],
auth: ['authentication', 'authorization']
}
})
// Searching "js auth" will also match:
// "javascript authentication"
// "javascript authorization"
// etc.Hybrid Reranking
For semantic search, use searchAsync() with a custom reranker:
const results = await search.searchAsync('modern UI frameworks', {
rerankLimit: 20,
reranker: {
async rerank({ query, candidates }) {
// 1. LumoSearch retrieves top 20 lexical candidates
// 2. Your reranker scores them semantically
// 3. Final results combine both scores
const embeddings = await getEmbeddings([
query,
...candidates.map(c => c.text)
])
return candidates.map((c, i) => ({
refIndex: c.refIndex,
score: cosineSimilarity(embeddings[0], embeddings[i + 1])
}))
}
}
})This two-stage approach combines the speed of lexical search with the semantic understanding of embeddings.
Best of both worlds: Lexical search is fast and interpretable. Semantic search understands meaning. Hybrid reranking gives you both.
Frequently Asked Questions
What is BM25F and how does LumoSearch use it?
BM25F is an extension of the BM25 ranking algorithm that supports multiple weighted fields. LumoSearch uses BM25F to score documents by combining term frequency, inverse document frequency, field weights, and length normalization — the same approach used by Elasticsearch and Apache Lucene.
How does trigram fuzzy matching work?
LumoSearch splits each token into overlapping 3-character sequences (trigrams). When searching, query trigrams are compared against indexed trigrams using Jaccard similarity. This allows 'javscrippt' to match 'javascript' because they share most trigrams despite the typo.
What is candidate pruning in search?
Candidate pruning narrows results before expensive scoring. LumoSearch gathers candidates from inverted indexes, scores them with lightweight posting weights (token 4x, prefix 3x, trigram 1x), keeps the top candidateLimit (default 250), then runs full BM25F only on those candidates.
How is LumoSearch different from linear-scan search libraries?
Libraries like Fuse.js score every document against every query. LumoSearch builds inverted indexes at initialization, then retrieves only relevant candidates per query. This makes search time nearly constant regardless of collection size — under 5ms for 100K documents.
Can LumoSearch do semantic search?
LumoSearch supports hybrid semantic reranking via the searchAsync() method. It first retrieves candidates using fast lexical BM25F scoring, then passes them to a user-provided reranker (e.g., an embedding model) for semantic rescoring — combining speed with meaning.