Added support for two fusion algorithms (RRF and DBSF) to combine dense semantic and sparse BM25 search results, with comprehensive documentation and unit tests. Changes: - Added fusion parameter to nc_semantic_search and nc_semantic_search_answer tools - Updated ADR-014 with detailed comparison of RRF vs DBSF fusion algorithms - Added unit tests for fusion algorithm initialization and validation - Updated search_method in responses to include fusion type (e.g., "bm25_hybrid_rrf") Fusion Algorithms: - RRF (Reciprocal Rank Fusion): Default, rank-based, general-purpose - DBSF (Distribution-Based Score Fusion): Score normalization using statistics RRF is recommended for most use cases due to its robustness and established track record. DBSF may provide better results when retrieval systems have very different score distributions. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
12 KiB
ADR-014: Replace Custom Keyword Search with BM25 Hybrid Search via Qdrant
Date: 2025-11-16
Status: Implemented
1. Context
Our RAG application currently employs two separate retrieval mechanisms:
- Dense (Semantic) Search: Using vector embeddings stored in our Qdrant database to find semantically similar context.
- Keyword Search: A custom-built fuzzy/character-based search to match-specific keywords, acronyms, and product codes that semantic search often misses.
This dual-system approach has several drawbacks:
- Poor Relevance: Our current keyword search is basic (e.g.,
LIKEqueries or simple fuzzy matching). It is not as effective as modern full-text search algorithms like BM25. - Clunky Fusion: We lack a robust, principled method to combine the results from the two systems. This leads to disjointed logic in the application layer and suboptimal context being passed to the LLM.
- Architectural Complexity: We must maintain two separate search pathways (one to Qdrant, one to the keyword search mechanism), increasing code complexity and maintenance overhead.
Our vector database, Qdrant, natively supports hybrid search by combining dense vectors with BM25-based sparse vectors in a single collection.
2. Decision
We will deprecate and remove the existing custom keyword/fuzzy search functionality.
We will replace it by implementing native hybrid search within Qdrant. This involves:
- Modifying the Qdrant Collection: Updating our collection to support a named sparse vector index configured for BM25.
- Updating the Ingestion Pipeline: For every document chunk, we will generate and upsert both:
- Its dense vector (from our existing embedding model).
- Its sparse vector (generated using a BM25-compatible model, e.g.,
Qdrant/bm25fromfastembed).
- Refactoring Retrieval Logic: All retrieval calls will be consolidated into a single Qdrant query using the
query_pointsendpoint. This query will use theprefetchparameter to execute both dense and sparse searches, and Qdrant's built-in Reciprocal Rank Fusion (RRF) to automatically merge the results into a single, relevance-ranked list. - Backfilling: A one-time migration script will be created to generate and add sparse vectors for all existing documents in the Qdrant collection.
3. Considered Options
Option 1: Native Qdrant Hybrid Search (Chosen)
- Use Qdrant's built-in sparse vector and RRF capabilities.
- Pros:
- Consolidated Architecture: Manages both dense and sparse indexes in one database.
- No Data Sync Issues: Updates are atomic. A single
upsertupdates both representations. - Built-in Fusion: RRF is handled natively and efficiently by the database.
- Superior Relevance: Replaces our brittle custom search with the industry-standard BM25.
- Cons:
- Requires a one-time data backfill which may be time-consuming.
- Adds a new step (sparse vector generation) to the ingestion pipeline.
Option 2: External Full-Text Search (e.g., Elasticsearch)
- Keep Qdrant for dense search and add a separate Elasticsearch/OpenSearch cluster for BM25.
- Pros:
- Provides a very powerful, dedicated full-text search engine.
- Cons:
- High Complexity: Introduces a new, stateful service to deploy, manage, and scale.
- Data Sync Nightmare: We would be responsible for ensuring that the document IDs and content in Qdrant and Elasticsearch are always perfectly synchronized. This is a major source of bugs.
- Manual Fusion: The application would have to query both systems and perform RRF manually.
Option 3: Keep Current System
- Make no changes.
- Pros:
- No engineering effort required.
- Cons:
- Fails to address the known relevance and architectural problems.
- Our RAG application's performance will remain suboptimal, especially for keyword-sensitive queries.
4. Rationale
Option 1 is the clear winner. It directly solves our primary problem (poor keyword matching) by adopting the industry-standard BM25.
Critically, it achieves this while simplifying our overall architecture, not complicating it. By leveraging features already present in our existing database (Qdrant), we avoid the massive operational and synchronization overhead of adding a second search system (Option 2).
This decision consolidates our retrieval logic, eliminates the data consistency problem, and moves the complex fusion logic (RRF) from the application layer into the database, where it can be performed more efficiently.
5. Consequences
New Work:
- Ingestion: The data ingestion pipeline must be updated to add the
fastembedlibrary (or similar), generate sparse vectors, and upsert them to the new named vector field in Qdrant. - Retrieval: The application's retrieval service must be refactored to use the
query_pointsendpoint withprefetchandfusion=models.Fusion.RRF. - Migration: A one-time backfill script must be written and executed to add sparse vectors for all existing documents.
- Infrastructure: The Qdrant collection schema must be updated (or re-created) to add the
sparse_vectors_config.
Positive:
- Improved Accuracy: Retrieval will be significantly more accurate, handling both semantic and keyword queries robustly.
- Simplified Code: The application's retrieval logic will be cleaner and simpler, with one endpoint instead of two.
- Reduced Maintenance: We will remove the custom fuzzy-search code, which is brittle and difficult to maintain.
Negative:
- The data backfill process will require careful management to avoid downtime.
- Ingestion time will slightly increase due to the extra step of sparse vector generation. This is considered a negligible trade-off for the gains in relevance.
6. Implementation Notes
Implementation completed on 2025-11-16
Key Changes:
-
Dependencies (pyproject.toml:25):
- Added
fastembed>=0.4.2for BM25 sparse vector embeddings - Adjusted
pillowversion constraint to be compatible with fastembed
- Added
-
Qdrant Collection Schema (nextcloud_mcp_server/vector/qdrant_client.py:113-128):
- Updated to named vectors:
{"dense": VectorParams(...), "sparse": SparseVectorParams(...)} - Added sparse vector configuration with BM25 index
- Maintains backward compatibility with existing collections (detects legacy schema)
- Updated to named vectors:
-
BM25 Embedding Provider (nextcloud_mcp_server/embedding/bm25_provider.py):
- Created
BM25SparseEmbeddingProviderusing FastEmbed'sQdrant/bm25model - Implements
encode()andencode_batch()methods - Returns sparse vectors as
{indices: list[int], values: list[float]}format
- Created
-
Document Indexing Pipeline (nextcloud_mcp_server/vector/processor.py:229-255):
- Generates both dense (semantic) and sparse (BM25) embeddings for each document chunk
- Updates
PointStructto use named vectors:vector={"dense": ..., "sparse": ...} - Maintains same chunking strategy (512 words, 50-word overlap)
-
BM25 Hybrid Search Algorithm (nextcloud_mcp_server/search/bm25_hybrid.py):
- Implements
BM25HybridSearchAlgorithmusing Qdrant's native RRF fusion - Uses
prefetchparameter for parallel dense + sparse search - Applies
fusion=models.Fusion.RRFfor automatic result merging - Maintains same deduplication and filtering logic as semantic search
- Implements
-
MCP Tool Updates (nextcloud_mcp_server/server/semantic.py:39-68):
- Simplified
nc_semantic_search()to use BM25 hybrid only - Removed
algorithm,semantic_weight,keyword_weight,fuzzy_weightparameters - Updated default
score_threshold=0.0for RRF scoring - Returns
search_method="bm25_hybrid"in responses
- Simplified
-
Legacy Algorithm Removal:
- Deleted
nextcloud_mcp_server/search/keyword.py(278 lines) - Deleted
nextcloud_mcp_server/search/fuzzy.py(220 lines) - Deleted
nextcloud_mcp_server/search/hybrid.py(238 lines - custom RRF) - Updated
nextcloud_mcp_server/search/__init__.pyto export only BM25 hybrid
- Deleted
Migration Strategy:
- No migration required (vector sync feature is experimental)
- New documents automatically indexed with both dense + sparse vectors
- Collection re-creation on first startup with updated schema
Test Results:
- All unit tests passing (118 passed)
- All integration tests passing (7 semantic search tests)
- Code formatting verified with ruff
Benefits Realized:
- ✅ Consolidated architecture (single Qdrant database for both dense + sparse)
- ✅ Native fusion algorithms (database-level, more efficient)
- ✅ Industry-standard BM25 (replaces custom keyword search)
- ✅ Simplified codebase (removed 736 lines of legacy code)
- ✅ Better relevance (handles both semantic and keyword queries)
- ✅ Configurable fusion methods (RRF and DBSF)
7. Fusion Algorithm Options
Update: 2025-11-16
The BM25 hybrid search now supports two fusion algorithms for combining dense (semantic) and sparse (BM25) search results:
Reciprocal Rank Fusion (RRF)
Default fusion method. RRF is a widely-used, well-established algorithm that combines rankings from multiple retrieval systems using the reciprocal rank formula:
RRF(doc) = Σ 1/(k + rank_i(doc))
where k is a constant (typically 60) and rank_i(doc) is the rank of the document in retrieval system i.
Characteristics:
- ✅ General-purpose: Works well across diverse query types and document collections
- ✅ Rank-based: Focuses on relative rankings rather than absolute scores
- ✅ Established: Well-tested, documented, and understood in IR literature
- ✅ Robust: Less sensitive to score distribution differences between systems
When to use RRF:
- Default choice for most use cases
- When you have mixed query types (semantic + keyword)
- When retrieval systems have very different score ranges
- When you want predictable, well-understood behavior
Distribution-Based Score Fusion (DBSF)
Alternative fusion method. DBSF normalizes scores from each retrieval system using distribution statistics before combining them:
- Normalization: For each query, calculates mean (μ) and standard deviation (σ) of scores
- Outlier handling: Uses μ ± 3σ as normalization bounds
- Fusion: Sums normalized scores across systems
Characteristics:
- ✅ Score-aware: Uses actual relevance scores, not just rankings
- ✅ Statistical: Normalizes based on score distribution properties
- ⚠️ Experimental: Newer algorithm, less battle-tested than RRF
- ⚠️ Sensitive: May behave differently depending on score distributions
When to use DBSF:
- When retrieval systems have vastly different score ranges that RRF doesn't balance well
- When you want to experiment with score-based (vs rank-based) fusion
- When statistical normalization better matches your use case
- For A/B testing against RRF to measure retrieval quality improvements
Configuration
Both fusion algorithms are exposed via the fusion parameter in MCP tools:
# Use RRF (default)
response = await nc_semantic_search(
query="async programming",
fusion="rrf" # Can be omitted, RRF is default
)
# Use DBSF
response = await nc_semantic_search(
query="async programming",
fusion="dbsf"
)
The nc_semantic_search_answer tool also supports the fusion parameter and passes it through to the underlying search.
Future: Configurable Weights
Current limitation: Neither RRF nor DBSF currently support per-system weights (e.g., 0.8 for semantic, 0.2 for BM25). This is a Qdrant platform limitation tracked in qdrant/qdrant#6067.
When Qdrant adds weight support, the fusion parameter can be extended to accept weight configurations:
# Hypothetical future API
response = await nc_semantic_search(
query="async programming",
fusion="rrf",
fusion_weights={"dense": 0.7, "sparse": 0.3} # Not yet implemented
)
Recommendation: Start with RRF (default). If you encounter cases where keyword matches are under- or over-weighted, experiment with DBSF. Monitor qdrant/qdrant#6067 for configurable weight support.