Files
nextcloud-mcp-server/docs/ADR-014-bm25-search.md
Chris Coutinho 6cfd7e2729 feat: add configurable fusion algorithms for BM25 hybrid search
Added support for two fusion algorithms (RRF and DBSF) to combine dense
semantic and sparse BM25 search results, with comprehensive documentation
and unit tests.

Changes:
- Added fusion parameter to nc_semantic_search and nc_semantic_search_answer tools
- Updated ADR-014 with detailed comparison of RRF vs DBSF fusion algorithms
- Added unit tests for fusion algorithm initialization and validation
- Updated search_method in responses to include fusion type (e.g., "bm25_hybrid_rrf")

Fusion Algorithms:
- RRF (Reciprocal Rank Fusion): Default, rank-based, general-purpose
- DBSF (Distribution-Based Score Fusion): Score normalization using statistics

RRF is recommended for most use cases due to its robustness and established
track record. DBSF may provide better results when retrieval systems have
very different score distributions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-17 06:48:43 +01:00

12 KiB
Raw Permalink Blame History

ADR-014: Replace Custom Keyword Search with BM25 Hybrid Search via Qdrant

Date: 2025-11-16

Status: Implemented


1. Context

Our RAG application currently employs two separate retrieval mechanisms:

  1. Dense (Semantic) Search: Using vector embeddings stored in our Qdrant database to find semantically similar context.
  2. Keyword Search: A custom-built fuzzy/character-based search to match-specific keywords, acronyms, and product codes that semantic search often misses.

This dual-system approach has several drawbacks:

  • Poor Relevance: Our current keyword search is basic (e.g., LIKE queries or simple fuzzy matching). It is not as effective as modern full-text search algorithms like BM25.
  • Clunky Fusion: We lack a robust, principled method to combine the results from the two systems. This leads to disjointed logic in the application layer and suboptimal context being passed to the LLM.
  • Architectural Complexity: We must maintain two separate search pathways (one to Qdrant, one to the keyword search mechanism), increasing code complexity and maintenance overhead.

Our vector database, Qdrant, natively supports hybrid search by combining dense vectors with BM25-based sparse vectors in a single collection.

2. Decision

We will deprecate and remove the existing custom keyword/fuzzy search functionality.

We will replace it by implementing native hybrid search within Qdrant. This involves:

  1. Modifying the Qdrant Collection: Updating our collection to support a named sparse vector index configured for BM25.
  2. Updating the Ingestion Pipeline: For every document chunk, we will generate and upsert both:
    • Its dense vector (from our existing embedding model).
    • Its sparse vector (generated using a BM25-compatible model, e.g., Qdrant/bm25 from fastembed).
  3. Refactoring Retrieval Logic: All retrieval calls will be consolidated into a single Qdrant query using the query_points endpoint. This query will use the prefetch parameter to execute both dense and sparse searches, and Qdrant's built-in Reciprocal Rank Fusion (RRF) to automatically merge the results into a single, relevance-ranked list.
  4. Backfilling: A one-time migration script will be created to generate and add sparse vectors for all existing documents in the Qdrant collection.

3. Considered Options

Option 1: Native Qdrant Hybrid Search (Chosen)

  • Use Qdrant's built-in sparse vector and RRF capabilities.
  • Pros:
    • Consolidated Architecture: Manages both dense and sparse indexes in one database.
    • No Data Sync Issues: Updates are atomic. A single upsert updates both representations.
    • Built-in Fusion: RRF is handled natively and efficiently by the database.
    • Superior Relevance: Replaces our brittle custom search with the industry-standard BM25.
  • Cons:
    • Requires a one-time data backfill which may be time-consuming.
    • Adds a new step (sparse vector generation) to the ingestion pipeline.

Option 2: External Full-Text Search (e.g., Elasticsearch)

  • Keep Qdrant for dense search and add a separate Elasticsearch/OpenSearch cluster for BM25.
  • Pros:
    • Provides a very powerful, dedicated full-text search engine.
  • Cons:
    • High Complexity: Introduces a new, stateful service to deploy, manage, and scale.
    • Data Sync Nightmare: We would be responsible for ensuring that the document IDs and content in Qdrant and Elasticsearch are always perfectly synchronized. This is a major source of bugs.
    • Manual Fusion: The application would have to query both systems and perform RRF manually.

Option 3: Keep Current System

  • Make no changes.
  • Pros:
    • No engineering effort required.
  • Cons:
    • Fails to address the known relevance and architectural problems.
    • Our RAG application's performance will remain suboptimal, especially for keyword-sensitive queries.

4. Rationale

Option 1 is the clear winner. It directly solves our primary problem (poor keyword matching) by adopting the industry-standard BM25.

Critically, it achieves this while simplifying our overall architecture, not complicating it. By leveraging features already present in our existing database (Qdrant), we avoid the massive operational and synchronization overhead of adding a second search system (Option 2).

This decision consolidates our retrieval logic, eliminates the data consistency problem, and moves the complex fusion logic (RRF) from the application layer into the database, where it can be performed more efficiently.

5. Consequences

New Work:

  • Ingestion: The data ingestion pipeline must be updated to add the fastembed library (or similar), generate sparse vectors, and upsert them to the new named vector field in Qdrant.
  • Retrieval: The application's retrieval service must be refactored to use the query_points endpoint with prefetch and fusion=models.Fusion.RRF.
  • Migration: A one-time backfill script must be written and executed to add sparse vectors for all existing documents.
  • Infrastructure: The Qdrant collection schema must be updated (or re-created) to add the sparse_vectors_config.

Positive:

  • Improved Accuracy: Retrieval will be significantly more accurate, handling both semantic and keyword queries robustly.
  • Simplified Code: The application's retrieval logic will be cleaner and simpler, with one endpoint instead of two.
  • Reduced Maintenance: We will remove the custom fuzzy-search code, which is brittle and difficult to maintain.

Negative:

  • The data backfill process will require careful management to avoid downtime.
  • Ingestion time will slightly increase due to the extra step of sparse vector generation. This is considered a negligible trade-off for the gains in relevance.

6. Implementation Notes

Implementation completed on 2025-11-16

Key Changes:

  1. Dependencies (pyproject.toml:25):

    • Added fastembed>=0.4.2 for BM25 sparse vector embeddings
    • Adjusted pillow version constraint to be compatible with fastembed
  2. Qdrant Collection Schema (nextcloud_mcp_server/vector/qdrant_client.py:113-128):

    • Updated to named vectors: {"dense": VectorParams(...), "sparse": SparseVectorParams(...)}
    • Added sparse vector configuration with BM25 index
    • Maintains backward compatibility with existing collections (detects legacy schema)
  3. BM25 Embedding Provider (nextcloud_mcp_server/embedding/bm25_provider.py):

    • Created BM25SparseEmbeddingProvider using FastEmbed's Qdrant/bm25 model
    • Implements encode() and encode_batch() methods
    • Returns sparse vectors as {indices: list[int], values: list[float]} format
  4. Document Indexing Pipeline (nextcloud_mcp_server/vector/processor.py:229-255):

    • Generates both dense (semantic) and sparse (BM25) embeddings for each document chunk
    • Updates PointStruct to use named vectors: vector={"dense": ..., "sparse": ...}
    • Maintains same chunking strategy (512 words, 50-word overlap)
  5. BM25 Hybrid Search Algorithm (nextcloud_mcp_server/search/bm25_hybrid.py):

    • Implements BM25HybridSearchAlgorithm using Qdrant's native RRF fusion
    • Uses prefetch parameter for parallel dense + sparse search
    • Applies fusion=models.Fusion.RRF for automatic result merging
    • Maintains same deduplication and filtering logic as semantic search
  6. MCP Tool Updates (nextcloud_mcp_server/server/semantic.py:39-68):

    • Simplified nc_semantic_search() to use BM25 hybrid only
    • Removed algorithm, semantic_weight, keyword_weight, fuzzy_weight parameters
    • Updated default score_threshold=0.0 for RRF scoring
    • Returns search_method="bm25_hybrid" in responses
  7. Legacy Algorithm Removal:

    • Deleted nextcloud_mcp_server/search/keyword.py (278 lines)
    • Deleted nextcloud_mcp_server/search/fuzzy.py (220 lines)
    • Deleted nextcloud_mcp_server/search/hybrid.py (238 lines - custom RRF)
    • Updated nextcloud_mcp_server/search/__init__.py to export only BM25 hybrid

Migration Strategy:

  • No migration required (vector sync feature is experimental)
  • New documents automatically indexed with both dense + sparse vectors
  • Collection re-creation on first startup with updated schema

Test Results:

  • All unit tests passing (118 passed)
  • All integration tests passing (7 semantic search tests)
  • Code formatting verified with ruff

Benefits Realized:

  • Consolidated architecture (single Qdrant database for both dense + sparse)
  • Native fusion algorithms (database-level, more efficient)
  • Industry-standard BM25 (replaces custom keyword search)
  • Simplified codebase (removed 736 lines of legacy code)
  • Better relevance (handles both semantic and keyword queries)
  • Configurable fusion methods (RRF and DBSF)

7. Fusion Algorithm Options

Update: 2025-11-16

The BM25 hybrid search now supports two fusion algorithms for combining dense (semantic) and sparse (BM25) search results:

Reciprocal Rank Fusion (RRF)

Default fusion method. RRF is a widely-used, well-established algorithm that combines rankings from multiple retrieval systems using the reciprocal rank formula:

RRF(doc) = Σ 1/(k + rank_i(doc))

where k is a constant (typically 60) and rank_i(doc) is the rank of the document in retrieval system i.

Characteristics:

  • General-purpose: Works well across diverse query types and document collections
  • Rank-based: Focuses on relative rankings rather than absolute scores
  • Established: Well-tested, documented, and understood in IR literature
  • Robust: Less sensitive to score distribution differences between systems

When to use RRF:

  • Default choice for most use cases
  • When you have mixed query types (semantic + keyword)
  • When retrieval systems have very different score ranges
  • When you want predictable, well-understood behavior

Distribution-Based Score Fusion (DBSF)

Alternative fusion method. DBSF normalizes scores from each retrieval system using distribution statistics before combining them:

  1. Normalization: For each query, calculates mean (μ) and standard deviation (σ) of scores
  2. Outlier handling: Uses μ ± 3σ as normalization bounds
  3. Fusion: Sums normalized scores across systems

Characteristics:

  • Score-aware: Uses actual relevance scores, not just rankings
  • Statistical: Normalizes based on score distribution properties
  • ⚠️ Experimental: Newer algorithm, less battle-tested than RRF
  • ⚠️ Sensitive: May behave differently depending on score distributions

When to use DBSF:

  • When retrieval systems have vastly different score ranges that RRF doesn't balance well
  • When you want to experiment with score-based (vs rank-based) fusion
  • When statistical normalization better matches your use case
  • For A/B testing against RRF to measure retrieval quality improvements

Configuration

Both fusion algorithms are exposed via the fusion parameter in MCP tools:

# Use RRF (default)
response = await nc_semantic_search(
    query="async programming",
    fusion="rrf"  # Can be omitted, RRF is default
)

# Use DBSF
response = await nc_semantic_search(
    query="async programming",
    fusion="dbsf"
)

The nc_semantic_search_answer tool also supports the fusion parameter and passes it through to the underlying search.

Future: Configurable Weights

Current limitation: Neither RRF nor DBSF currently support per-system weights (e.g., 0.8 for semantic, 0.2 for BM25). This is a Qdrant platform limitation tracked in qdrant/qdrant#6067.

When Qdrant adds weight support, the fusion parameter can be extended to accept weight configurations:

# Hypothetical future API
response = await nc_semantic_search(
    query="async programming",
    fusion="rrf",
    fusion_weights={"dense": 0.7, "sparse": 0.3}  # Not yet implemented
)

Recommendation: Start with RRF (default). If you encounter cases where keyword matches are under- or over-weighted, experiment with DBSF. Monitor qdrant/qdrant#6067 for configurable weight support.