Commit Graph

3 Commits

Author SHA1 Message Date
Chris Coutinho 6fe5596c13 feat: Implement BM25 hybrid search with native Qdrant RRF fusion
Replace custom keyword/fuzzy search algorithms with industry-standard BM25
sparse vectors, combined with dense semantic vectors using Qdrant's native
Reciprocal Rank Fusion (RRF). This consolidates search architecture and
improves relevance for both semantic and keyword queries.

Key changes:
- Add fastembed dependency for BM25 sparse vector generation
- Update Qdrant collection schema to support named vectors (dense + sparse)
- Create BM25SparseEmbeddingProvider using FastEmbed's Qdrant/bm25 model
- Implement BM25HybridSearchAlgorithm with native Qdrant RRF prefetch
- Update document processor to generate both dense and sparse embeddings
- Simplify nc_semantic_search() tool to use BM25 hybrid only
- Remove legacy keyword.py, fuzzy.py, and custom hybrid.py (736 lines)
- Update ADR-014 with implementation notes and test results

Benefits:
- Consolidated architecture (single Qdrant database)
- Native database-level RRF fusion (more efficient)
- Industry-standard BM25 (replaces brittle custom keyword search)
- Better relevance across semantic and keyword queries
- Simplified codebase (-285 net lines)

Tests: All 125 tests passing (118 unit, 7 integration)

Implements ADR-014: Replace Custom Keyword Search with BM25 Hybrid Search

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-16 06:59:44 +01:00
Chris Coutinho b5b03bfd78 feat: Add multi-document Protocol with cross-app search support
Implements NextcloudClientProtocol for multi-document type search following
user requirement that document types are not 1:1 with apps (e.g., Notes app
specializes in markdown, while Files/WebDAV handles multiple file types).

Key Changes:
- NextcloudClientProtocol: Generic protocol with app-specific client properties
- get_indexed_doc_types(): Query Qdrant for actually-indexed document types
- Document dispatch: All algorithms check Qdrant before attempting access
- Cross-type deduplication: Use (doc_id, doc_type) tuples in hybrid RRF

Search Algorithm Updates:
- Semantic: Added _verify_document_access() with dispatch to appropriate client
  - Deduplication by (doc_id, doc_type) tuple
  - Only "note" verification implemented, others return None with info log
- Keyword: Added _fetch_documents() dispatch method
  - Queries Qdrant for available types before fetching
  - Supports cross-app search when doc_type=None
- Fuzzy: Same pattern as keyword search
- Hybrid: Already uses (doc_id, doc_type) for deduplication (no changes needed)

Future-Proof Design:
- File/calendar verification stubs in place
- Clear logging when unsupported types found
- Easy to extend when processor indexes new document types

Currently Supported:
- "note" documents fully implemented and tested
- Other types gracefully handled (logged but skipped)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-15 01:19:29 +01:00
Chris Coutinho 11e620f2d1 feat: Implement unified search algorithm module
Creates shared search module with four algorithms implementing ADR-012:
- Semantic search (vector similarity via Qdrant)
- Keyword search (token-based matching from ADR-001)
- Fuzzy search (character overlap matching)
- Hybrid search (RRF fusion from ADR-003)

Architecture:
- Base SearchAlgorithm interface for consistent API
- SearchResult dataclass for unified result format
- All algorithms async and independently testable
- Proper logging and error handling throughout

Semantic Search (search/semantic.py):
- Extracted from server/semantic.py
- Vector similarity using Qdrant query_points
- Dual-phase authorization (vector filter + API verification)
- Deduplication of document chunks
- Configurable score threshold (default: 0.7)

Keyword Search (search/keyword.py):
- Implements ADR-001 token-based matching
- Title matches weighted 3x higher than content
- Case-insensitive token matching
- Relevance scoring with normalization
- Excerpt extraction with context

Fuzzy Search (search/fuzzy.py):
- Simple character overlap calculation
- Configurable threshold (default: 70%)
- Typo-tolerant matching
- Fast and dependency-free

Hybrid Search (search/hybrid.py):
- Reciprocal Rank Fusion (RRF) from ADR-003
- Parallel execution of sub-algorithms
- Configurable weights per algorithm
- RRF constant k=60 (standard value)
- Weight validation (must sum ≤1.0)

All algorithms:
- Share NextcloudClient for document access
- Support user_id filtering (multi-tenant)
- Support doc_type filtering (currently notes only)
- Return consistent SearchResult objects
- Properly formatted with ruff and type-checked

Next steps: Update MCP tool to use these algorithms

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-15 00:10:19 +01:00