Commit Graph

4 Commits

Author SHA1 Message Date
Chris Coutinho 6cfd7e2729 feat: add configurable fusion algorithms for BM25 hybrid search
Added support for two fusion algorithms (RRF and DBSF) to combine dense
semantic and sparse BM25 search results, with comprehensive documentation
and unit tests.

Changes:
- Added fusion parameter to nc_semantic_search and nc_semantic_search_answer tools
- Updated ADR-014 with detailed comparison of RRF vs DBSF fusion algorithms
- Added unit tests for fusion algorithm initialization and validation
- Updated search_method in responses to include fusion type (e.g., "bm25_hybrid_rrf")

Fusion Algorithms:
- RRF (Reciprocal Rank Fusion): Default, rank-based, general-purpose
- DBSF (Distribution-Based Score Fusion): Score normalization using statistics

RRF is recommended for most use cases due to its robustness and established
track record. DBSF may provide better results when retrieval systems have
very different score distributions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-17 06:48:43 +01:00
Chris Coutinho ed33b39062 docs: fix ADR-014 template text and numbering
- Remove template instruction text from line 1
- Fix ADR numbering from 007 to 014 to match filename
2025-11-16 12:08:37 +01:00
Chris Coutinho 6fe5596c13 feat: Implement BM25 hybrid search with native Qdrant RRF fusion
Replace custom keyword/fuzzy search algorithms with industry-standard BM25
sparse vectors, combined with dense semantic vectors using Qdrant's native
Reciprocal Rank Fusion (RRF). This consolidates search architecture and
improves relevance for both semantic and keyword queries.

Key changes:
- Add fastembed dependency for BM25 sparse vector generation
- Update Qdrant collection schema to support named vectors (dense + sparse)
- Create BM25SparseEmbeddingProvider using FastEmbed's Qdrant/bm25 model
- Implement BM25HybridSearchAlgorithm with native Qdrant RRF prefetch
- Update document processor to generate both dense and sparse embeddings
- Simplify nc_semantic_search() tool to use BM25 hybrid only
- Remove legacy keyword.py, fuzzy.py, and custom hybrid.py (736 lines)
- Update ADR-014 with implementation notes and test results

Benefits:
- Consolidated architecture (single Qdrant database)
- Native database-level RRF fusion (more efficient)
- Industry-standard BM25 (replaces brittle custom keyword search)
- Better relevance across semantic and keyword queries
- Simplified codebase (-285 net lines)

Tests: All 125 tests passing (118 unit, 7 integration)

Implements ADR-014: Replace Custom Keyword Search with BM25 Hybrid Search

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-16 06:59:44 +01:00
Chris Coutinho f5bc3e3bc3 docs: init ADR 2025-11-16 06:24:25 +01:00