Search Concepts

Last updated:

LumoSearch is a full-text search engine that combines BM25F ranking, trigram-based fuzzy matching, and inverted indexes to deliver typo-tolerant, relevance-ranked results in under 5ms. Unlike linear-scan libraries (Fuse.js, Lunr), it uses a multi-stage retrieval pipeline that stays fast as datasets scale to 100K+ documents.

Architecture Overview

LumoSearch uses a multi-stage retrieval and ranking pipeline:

  1. Index — Normalize and tokenize configured fields. Build token, trigram, and prefix inverted indexes.
  2. Retrieve — For each query, gather candidates from postings (token 4x, prefix 3x, trigram 1x weight).
  3. Prune — Keep only the top candidateLimit candidates based on weighted postings.
  4. Score — Rank with BM25F + exact/prefix/phrase/proximity boosts.
  5. Return — Top-k results with character-level highlights.

Why this approach? Instead of scoring every document (like Fuse.js), LumoSearch uses inverted indexes to narrow candidates first. This makes search stay fast as your dataset grows, while still delivering fuzzy matching and typo tolerance.

BM25F Ranking

BM25F is an industry-standard ranking algorithm that considers:

  • Term Frequency (TF) — How often query terms appear in a document
  • Inverse Document Frequency (IDF) — Rarity of terms across all documents
  • Field Weights — Custom weights per field (e.g., title 3x, body 1x)
  • Length Normalization — Prevents long documents from dominating

The "F" in BM25F means field-weighted: you can boost importance of specific fields like titles, tags, or names.

// Title matches are 3x more important than body matches
const search = new LumoSearch(docs, {
  keys: [
    { name: 'title', weight: 3 },
    { name: 'body', weight: 1 }
  ]
})

Fuzzy Matching

LumoSearch handles typos and misspellings using trigram-based fuzzy matching:

  • Each token is split into 3-character sequences (trigrams)
  • Example: "javascript" → ["jav", "ava", "vas", "asc", "scr", "cri", "rip", "ipt"]
  • Query tokens are matched against indexed trigrams using Jaccard similarity
  • Allows "javscrippt" to match "javascript" even with typos

Trigram vs Levenshtein: Trigrams are faster and scale better than edit-distance algorithms. They also naturally handle character transpositions and missing/extra characters.

Candidate Pruning

To stay fast, LumoSearch doesn't score every document. Instead:

  1. Gather candidate documents from inverted indexes
  2. Score candidates using weighted postings (token matches 4x, prefix 3x, trigram 1x)
  3. Keep only top candidateLimit candidates (default: 250)
  4. Run full BM25F scoring on these candidates
  5. Return top limit results (default: 10)
// Higher candidateLimit = better recall, slower performance
const search = new LumoSearch(docs, {
  keys: ['title', 'body'],
  candidateLimit: 500,  // Consider more candidates
  limit: 20             // Return top 20 results
})

Inverted Indexes

LumoSearch builds three inverted indexes at initialization:

  • Token Index — Exact token matches (e.g., "javascript")
  • Prefix Index — Token prefixes (e.g., "java" matches "javascript")
  • Trigram Index — 3-character sequences for fuzzy matching

Each index maps tokens/trigrams to document IDs and field positions. This enables fast candidate retrieval.

Scoring Boosts

On top of BM25F, LumoSearch applies several heuristic boosts:

  • Exact Match Boost (2.0x) — Full token matches get double weight
  • Prefix Boost (1.25x) — Prefix matches get 25% bonus
  • Phrase Proximity — Query terms appearing close together score higher
  • Field Weight — User-configured per-field weights

Text Normalization

Before indexing or querying, text is normalized:

  1. Convert to lowercase
  2. Apply Unicode NFD normalization
  3. Remove diacritics (é → e, ñ → n)
  4. Tokenize on whitespace and punctuation

This ensures "Café" matches "cafe" and "résumé" matches "resume".

Synonyms

Token-level synonyms expand query terms at search time:

const search = new LumoSearch(docs, {
  keys: ['title', 'body'],
  synonyms: {
    js: ['javascript'],
    auth: ['authentication', 'authorization']
  }
})

// Searching "js auth" will also match:
// "javascript authentication"
// "javascript authorization"
// etc.

Hybrid Reranking

For semantic search, use searchAsync() with a custom reranker:

const results = await search.searchAsync('modern UI frameworks', {
  rerankLimit: 20,
  reranker: {
    async rerank({ query, candidates }) {
      // 1. LumoSearch retrieves top 20 lexical candidates
      // 2. Your reranker scores them semantically
      // 3. Final results combine both scores
      const embeddings = await getEmbeddings([
        query,
        ...candidates.map(c => c.text)
      ])
      return candidates.map((c, i) => ({
        refIndex: c.refIndex,
        score: cosineSimilarity(embeddings[0], embeddings[i + 1])
      }))
    }
  }
})

This two-stage approach combines the speed of lexical search with the semantic understanding of embeddings.

Best of both worlds: Lexical search is fast and interpretable. Semantic search understands meaning. Hybrid reranking gives you both.

Frequently Asked Questions

What is BM25F and how does LumoSearch use it?

BM25F is an extension of the BM25 ranking algorithm that supports multiple weighted fields. LumoSearch uses BM25F to score documents by combining term frequency, inverse document frequency, field weights, and length normalization — the same approach used by Elasticsearch and Apache Lucene.

How does trigram fuzzy matching work?

LumoSearch splits each token into overlapping 3-character sequences (trigrams). When searching, query trigrams are compared against indexed trigrams using Jaccard similarity. This allows 'javscrippt' to match 'javascript' because they share most trigrams despite the typo.

What is candidate pruning in search?

Candidate pruning narrows results before expensive scoring. LumoSearch gathers candidates from inverted indexes, scores them with lightweight posting weights (token 4x, prefix 3x, trigram 1x), keeps the top candidateLimit (default 250), then runs full BM25F only on those candidates.

How is LumoSearch different from linear-scan search libraries?

Libraries like Fuse.js score every document against every query. LumoSearch builds inverted indexes at initialization, then retrieves only relevant candidates per query. This makes search time nearly constant regardless of collection size — under 5ms for 100K documents.

Can LumoSearch do semantic search?

LumoSearch supports hybrid semantic reranking via the searchAsync() method. It first retrieves candidates using fast lexical BM25F scoring, then passes them to a user-provided reranker (e.g., an embedding model) for semantic rescoring — combining speed with meaning.

Related