nextcloud-mcp-server

Author	SHA1	Message	Date
Chris Coutinho	6fe5596c13	feat: Implement BM25 hybrid search with native Qdrant RRF fusion Replace custom keyword/fuzzy search algorithms with industry-standard BM25 sparse vectors, combined with dense semantic vectors using Qdrant's native Reciprocal Rank Fusion (RRF). This consolidates search architecture and improves relevance for both semantic and keyword queries. Key changes: - Add fastembed dependency for BM25 sparse vector generation - Update Qdrant collection schema to support named vectors (dense + sparse) - Create BM25SparseEmbeddingProvider using FastEmbed's Qdrant/bm25 model - Implement BM25HybridSearchAlgorithm with native Qdrant RRF prefetch - Update document processor to generate both dense and sparse embeddings - Simplify nc_semantic_search() tool to use BM25 hybrid only - Remove legacy keyword.py, fuzzy.py, and custom hybrid.py (736 lines) - Update ADR-014 with implementation notes and test results Benefits: - Consolidated architecture (single Qdrant database) - Native database-level RRF fusion (more efficient) - Industry-standard BM25 (replaces brittle custom keyword search) - Better relevance across semantic and keyword queries - Simplified codebase (-285 net lines) Tests: All 125 tests passing (118 unit, 7 integration) Implements ADR-014: Replace Custom Keyword Search with BM25 Hybrid Search 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-16 06:59:44 +01:00
Chris Coutinho	eaeb8eae28	feat: Normalize hybrid search RRF scores to 0-1 range Improve user comprehension by scaling RRF scores to match the intuitive 0-1 range used by other search algorithms. ## Problem RRF (Reciprocal Rank Fusion) scores had a drastically different scale than semantic/keyword/fuzzy scores: - Semantic similarity: 0.0 to 1.0 (typical: 0.5-0.9) - RRF scores: 0.0 to ~0.016 (typical: 0.005-0.015) This caused user confusion - a score of 0.0078 looked terrible but was actually excellent (near theoretical maximum). ## Solution Normalize RRF scores using the formula: `normalized_score = rrf_score * (rrf_k + 1) / total_weight` Where: - rrf_k = 60 (RRF constant) - total_weight = sum of algorithm weights (default: 1.0) Example transformation: - Before: 0.0078 (confusing) - After: 0.477 (intuitive) ## Changes nextcloud_mcp_server/search/hybrid.py: - Store total_weight as instance variable (line 63) - Calculate normalization factor in _reciprocal_rank_fusion() (line 209) - Apply normalization to all RRF scores (line 217) - Preserve raw RRF score in metadata for debugging (line 222) ## Impact User Experience: - Hybrid search scores now comparable with semantic/keyword/fuzzy - Score of 0.5 indicates good match across all algorithms - Consistent scale improves score threshold usability Backward Compatibility: - Raw RRF scores preserved in metadata["rrf_score_raw"] - Result ordering unchanged (normalization is linear transformation) - Breaking change: Existing score thresholds need adjustment Performance: - Negligible overhead (single multiplication per result) ## Testing Verified with nc_semantic_search and nc_semantic_search_answer: - Hybrid scores now 0.47-0.7 range (was 0.003-0.011) - Semantic scores unchanged (0.75) - Result ordering preserved 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-15 06:48:58 +01:00
Chris Coutinho	42376483ab	refactor: Optimize Nextcloud access verification with centralized filtering Move access verification from individual search algorithms to final output stage, eliminating redundant API calls and improving performance. ## Changes New: - `search/verification.py`: Centralized verification using anyio task groups - Deduplicates results by (doc_id, doc_type) before verification - Verifies all unique documents in parallel using structured concurrency - Filters out inaccessible documents in single pass Modified Search Algorithms: - `search/semantic.py`: Removed _deduplicate_and_verify() and _verify_document_access() - `search/keyword.py`: Removed _verify_access() and parallel verification - `search/fuzzy.py`: Removed _verify_access() and parallel verification - `search/hybrid.py`: Removed nextcloud_client parameter passing All algorithms now return unverified results from Qdrant payload. Modified Output Stages: - `server/semantic.py`: Added verify_search_results() call after search - `auth/viz_routes.py`: Added verify_search_results() call after search Both endpoints now verify access once at final stage with deduplication. ## Performance Impact Before: - Hybrid mode (limit=10): 30 API calls (10 per algorithm × 3 algorithms) - Single algorithm: 10-20 API calls (with verification buffer) After: - Hybrid mode (limit=10): 10 API calls (deduplicated verification) - Single algorithm: 10 API calls (deduplicated verification) Performance Gain: 3x reduction in API calls for hybrid search ## Architecture Benefits - Separation of concerns: Algorithms handle scoring, output stage handles security - Deduplication: Each document verified exactly once - Parallel execution: All verifications run concurrently via anyio task groups - Consistency: Same verification logic across MCP tools and viz endpoints 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-15 06:21:06 +01:00
Chris Coutinho	ed0825e661	feat: Enhance vector visualization UI and parallelize search verification Vector Visualization Improvements: - Add interactive vector viz tab with Alpine.js and Plotly.js to user info page - Refactor viz route CSS for better scoping and maintainability - Remove unused nextcloud_host variable Performance Optimizations: - Parallelize access verification in fuzzy and keyword search algorithms - Use asyncio.gather() to verify multiple documents concurrently - Add exception handling with return_exceptions=True for resilience Dependencies: - Update third_party/oidc submodule to include RFC 9728 resource_url support 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-15 05:39:07 +01:00
Chris Coutinho	e3153822f7	perf: Exclude vector-sync status polling from distributed tracing Skip tracing for /app/vector-sync/status to reduce noise from HTMX polling. Metrics collection continues for this endpoint. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-15 05:19:35 +01:00
Chris Coutinho	2a078093ed	refactor!: Make all search algorithms query Qdrant payload, not Nextcloud BREAKING CHANGE: Search algorithms now require Qdrant to be populated. Vector sync must be enabled and documents indexed for search to work. - Keyword and fuzzy search now query Qdrant scroll API for title/excerpt - Remove inefficient Nextcloud API fetching pattern - Add optional Nextcloud verification for security - Deduplicate by (doc_id, doc_type) tuple, keeping chunk_index=0 - Align with document processor pattern that already stores text in Qdrant	2025-11-15 01:56:41 +01:00
Chris Coutinho	b5b03bfd78	feat: Add multi-document Protocol with cross-app search support Implements NextcloudClientProtocol for multi-document type search following user requirement that document types are not 1:1 with apps (e.g., Notes app specializes in markdown, while Files/WebDAV handles multiple file types). Key Changes: - NextcloudClientProtocol: Generic protocol with app-specific client properties - get_indexed_doc_types(): Query Qdrant for actually-indexed document types - Document dispatch: All algorithms check Qdrant before attempting access - Cross-type deduplication: Use (doc_id, doc_type) tuples in hybrid RRF Search Algorithm Updates: - Semantic: Added _verify_document_access() with dispatch to appropriate client - Deduplication by (doc_id, doc_type) tuple - Only "note" verification implemented, others return None with info log - Keyword: Added _fetch_documents() dispatch method - Queries Qdrant for available types before fetching - Supports cross-app search when doc_type=None - Fuzzy: Same pattern as keyword search - Hybrid: Already uses (doc_id, doc_type) for deduplication (no changes needed) Future-Proof Design: - File/calendar verification stubs in place - Clear logging when unsupported types found - Easy to extend when processor indexes new document types Currently Supported: - "note" documents fully implemented and tested - Other types gracefully handled (logged but skipped) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-15 01:19:29 +01:00
Chris Coutinho	11e620f2d1	feat: Implement unified search algorithm module Creates shared search module with four algorithms implementing ADR-012: - Semantic search (vector similarity via Qdrant) - Keyword search (token-based matching from ADR-001) - Fuzzy search (character overlap matching) - Hybrid search (RRF fusion from ADR-003) Architecture: - Base SearchAlgorithm interface for consistent API - SearchResult dataclass for unified result format - All algorithms async and independently testable - Proper logging and error handling throughout Semantic Search (search/semantic.py): - Extracted from server/semantic.py - Vector similarity using Qdrant query_points - Dual-phase authorization (vector filter + API verification) - Deduplication of document chunks - Configurable score threshold (default: 0.7) Keyword Search (search/keyword.py): - Implements ADR-001 token-based matching - Title matches weighted 3x higher than content - Case-insensitive token matching - Relevance scoring with normalization - Excerpt extraction with context Fuzzy Search (search/fuzzy.py): - Simple character overlap calculation - Configurable threshold (default: 70%) - Typo-tolerant matching - Fast and dependency-free Hybrid Search (search/hybrid.py): - Reciprocal Rank Fusion (RRF) from ADR-003 - Parallel execution of sub-algorithms - Configurable weights per algorithm - RRF constant k=60 (standard value) - Weight validation (must sum ≤1.0) All algorithms: - Share NextcloudClient for document access - Support user_id filtering (multi-tenant) - Support doc_type filtering (currently notes only) - Return consistent SearchResult objects - Properly formatted with ruff and type-checked Next steps: Update MCP tool to use these algorithms 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-15 00:10:19 +01:00
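
Note on the RRF commits (11e620f2d1, eaeb8eae28): the messages above describe weighted Reciprocal Rank Fusion with k=60 and the normalization `normalized_score = rrf_score * (rrf_k + 1) / total_weight`. The following is a minimal sketch of that arithmetic, not the code from `search/hybrid.py`. The function name, its arguments, and the choice of 1-based ranks are assumptions made for illustration.

```python
from collections import defaultdict

RRF_K = 60  # RRF constant quoted in the commit messages


def reciprocal_rank_fusion(rankings, weights, k=RRF_K):
    """Fuse ranked lists with weighted RRF and normalize scores to 0-1.

    rankings: dict mapping algorithm name -> list of (doc_id, doc_type)
              keys, best match first (hypothetical input shape).
    weights:  dict mapping algorithm name -> weight.
    Returns a list of (key, normalized_score, raw_score), best first.
    """
    total_weight = sum(weights[name] for name in rankings)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")

    # Raw RRF: each algorithm contributes weight / (k + rank), rank from 1.
    raw = defaultdict(float)
    for name, ranked in rankings.items():
        for rank, key in enumerate(ranked, start=1):
            raw[key] += weights[name] / (k + rank)

    # A document ranked first by every algorithm scores total_weight / (k + 1),
    # so multiplying by (k + 1) / total_weight maps that maximum to 1.0.
    factor = (k + 1) / total_weight
    fused = [(key, score * factor, score) for key, score in raw.items()]
    fused.sort(key=lambda item: item[1], reverse=True)
    return fused


rankings = {
    "semantic": [(1, "note"), (2, "note")],
    "keyword": [(1, "note"), (3, "note")],
}
weights = {"semantic": 0.5, "keyword": 0.5}
fused = reciprocal_rank_fusion(rankings, weights)
```

Because the normalization is a single multiplication by a positive constant, it leaves the result ordering unchanged, which matches the backward-compatibility note in eaeb8eae28.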

8 Commits
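
Note on the verification commits (42376483ab, ed0825e661): the messages describe deduplicating results by (doc_id, doc_type) and then verifying each unique document once, in parallel. The sketch below shows that pattern under stated assumptions. It uses `asyncio.gather(..., return_exceptions=True)`, as mentioned in ed0825e661, rather than the anyio task groups that 42376483ab adopted, to stay within the standard library. The dict-shaped results, the `check_access` callback, and the keep-first-occurrence rule are hypothetical, and the signature of `verify_search_results` here is not taken from `search/verification.py`.

```python
import asyncio


async def verify_search_results(results, check_access):
    """Return only results whose documents pass an access check.

    results:      list of dicts with "doc_id" and "doc_type" keys
                  (hypothetical shape for illustration).
    check_access: async callable (doc_id, doc_type) -> bool
                  (hypothetical stand-in for a Nextcloud API call).
    """
    # Deduplicate by (doc_id, doc_type), keeping the first occurrence,
    # so each document is verified exactly once.
    unique = {}
    for result in results:
        key = (result["doc_id"], result["doc_type"])
        unique.setdefault(key, result)

    keys = list(unique)
    # Run all checks concurrently; a failed check yields an exception
    # object instead of cancelling the other checks.
    outcomes = await asyncio.gather(
        *(check_access(doc_id, doc_type) for doc_id, doc_type in keys),
        return_exceptions=True,
    )

    # Keep a document only when its check explicitly returned True;
    # exceptions and False both count as inaccessible.
    return [unique[key] for key, ok in zip(keys, outcomes) if ok is True]
```

With three algorithms each returning the same ten documents, this pattern issues ten checks instead of thirty, which is the reduction the 42376483ab message reports for hybrid mode.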