feat: Implement BM25 hybrid search with native Qdrant RRF fusion
Replace custom keyword/fuzzy search algorithms with industry-standard BM25 sparse vectors, combined with dense semantic vectors using Qdrant's native Reciprocal Rank Fusion (RRF). This consolidates search architecture and improves relevance for both semantic and keyword queries. Key changes: - Add fastembed dependency for BM25 sparse vector generation - Update Qdrant collection schema to support named vectors (dense + sparse) - Create BM25SparseEmbeddingProvider using FastEmbed's Qdrant/bm25 model - Implement BM25HybridSearchAlgorithm with native Qdrant RRF prefetch - Update document processor to generate both dense and sparse embeddings - Simplify nc_semantic_search() tool to use BM25 hybrid only - Remove legacy keyword.py, fuzzy.py, and custom hybrid.py (736 lines) - Update ADR-014 with implementation notes and test results Benefits: - Consolidated architecture (single Qdrant database) - Native database-level RRF fusion (more efficient) - Industry-standard BM25 (replaces brittle custom keyword search) - Better relevance across semantic and keyword queries - Simplified codebase (-285 net lines) Tests: All 125 tests passing (118 unit, 7 integration) Implements ADR-014: Replace Custom Keyword Search with BM25 Hybrid Search 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
@@ -14,7 +14,7 @@ from qdrant_client.models import FieldCondition, Filter, MatchValue, PointStruct
|
||||
|
||||
from nextcloud_mcp_server.client import NextcloudClient
|
||||
from nextcloud_mcp_server.config import get_settings
|
||||
from nextcloud_mcp_server.embedding import get_embedding_service
|
||||
from nextcloud_mcp_server.embedding import get_bm25_service, get_embedding_service
|
||||
from nextcloud_mcp_server.observability.metrics import (
|
||||
record_qdrant_operation,
|
||||
record_vector_sync_processing,
|
||||
@@ -226,15 +226,21 @@ async def _index_document(
|
||||
)
|
||||
chunks = chunker.chunk_text(content)
|
||||
|
||||
# Generate embeddings (I/O bound - external API call)
|
||||
# Generate dense embeddings (I/O bound - external API call)
|
||||
embedding_service = get_embedding_service()
|
||||
embeddings = await embedding_service.embed_batch(chunks)
|
||||
dense_embeddings = await embedding_service.embed_batch(chunks)
|
||||
|
||||
# Generate sparse embeddings (BM25 for keyword matching)
|
||||
bm25_service = get_bm25_service()
|
||||
sparse_embeddings = bm25_service.encode_batch(chunks)
|
||||
|
||||
# Prepare Qdrant points
|
||||
indexed_at = int(time.time())
|
||||
points = []
|
||||
|
||||
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
|
||||
for i, (chunk, dense_emb, sparse_emb) in enumerate(
|
||||
zip(chunks, dense_embeddings, sparse_embeddings)
|
||||
):
|
||||
# Generate deterministic UUID for point ID
|
||||
# Using uuid5 with DNS namespace and combining doc info
|
||||
point_name = f"{doc_task.doc_type}:{doc_task.doc_id}:chunk:{i}"
|
||||
@@ -243,7 +249,10 @@ async def _index_document(
|
||||
points.append(
|
||||
PointStruct(
|
||||
id=point_id,
|
||||
vector=embedding,
|
||||
vector={
|
||||
"dense": dense_emb,
|
||||
"sparse": sparse_emb,
|
||||
},
|
||||
payload={
|
||||
"user_id": doc_task.user_id,
|
||||
"doc_id": doc_task.doc_id,
|
||||
|
||||
Reference in New Issue
Block a user