nextcloud-mcp-server

Author	SHA1	Message	Date
Chris Coutinho	eaeb8eae28	feat: Normalize hybrid search RRF scores to 0-1 range Improve user comprehension by scaling RRF scores to match the intuitive 0-1 range used by other search algorithms. ## Problem RRF (Reciprocal Rank Fusion) scores had a drastically different scale than semantic/keyword/fuzzy scores: - Semantic similarity: 0.0 to 1.0 (typical: 0.5-0.9) - RRF scores: 0.0 to ~0.016 (typical: 0.005-0.015) This caused user confusion - a score of 0.0078 looked terrible but was actually excellent (near theoretical maximum). ## Solution Normalize RRF scores using the formula: `normalized_score = rrf_score * (rrf_k + 1) / total_weight` Where: - rrf_k = 60 (RRF constant) - total_weight = sum of algorithm weights (default: 1.0) Example transformation: - Before: 0.0078 (confusing) - After: 0.477 (intuitive) ## Changes nextcloud_mcp_server/search/hybrid.py: - Store total_weight as instance variable (line 63) - Calculate normalization factor in _reciprocal_rank_fusion() (line 209) - Apply normalization to all RRF scores (line 217) - Preserve raw RRF score in metadata for debugging (line 222) ## Impact User Experience: - Hybrid search scores now comparable with semantic/keyword/fuzzy - Score of 0.5 indicates good match across all algorithms - Consistent scale improves score threshold usability Backward Compatibility: - Raw RRF scores preserved in metadata["rrf_score_raw"] - Result ordering unchanged (normalization is linear transformation) - Breaking change: Existing score thresholds need adjustment Performance: - Negligible overhead (single multiplication per result) ## Testing Verified with nc_semantic_search and nc_semantic_search_answer: - Hybrid scores now 0.47-0.7 range (was 0.003-0.011) - Semantic scores unchanged (0.75) - Result ordering preserved 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-15 06:48:58 +01:00
Chris Coutinho	42376483ab	refactor: Optimize Nextcloud access verification with centralized filtering Move access verification from individual search algorithms to final output stage, eliminating redundant API calls and improving performance. ## Changes New: - `search/verification.py`: Centralized verification using anyio task groups - Deduplicates results by (doc_id, doc_type) before verification - Verifies all unique documents in parallel using structured concurrency - Filters out inaccessible documents in single pass Modified Search Algorithms: - `search/semantic.py`: Removed _deduplicate_and_verify() and _verify_document_access() - `search/keyword.py`: Removed _verify_access() and parallel verification - `search/fuzzy.py`: Removed _verify_access() and parallel verification - `search/hybrid.py`: Removed nextcloud_client parameter passing All algorithms now return unverified results from Qdrant payload. Modified Output Stages: - `server/semantic.py`: Added verify_search_results() call after search - `auth/viz_routes.py`: Added verify_search_results() call after search Both endpoints now verify access once at final stage with deduplication. ## Performance Impact Before: - Hybrid mode (limit=10): 30 API calls (10 per algorithm × 3 algorithms) - Single algorithm: 10-20 API calls (with verification buffer) After: - Hybrid mode (limit=10): 10 API calls (deduplicated verification) - Single algorithm: 10 API calls (deduplicated verification) Performance Gain: 3x reduction in API calls for hybrid search ## Architecture Benefits - Separation of concerns: Algorithms handle scoring, output stage handles security - Deduplication: Each document verified exactly once - Parallel execution: All verifications run concurrently via anyio task groups - Consistency: Same verification logic across MCP tools and viz endpoints 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-15 06:21:06 +01:00
Chris Coutinho	ed0825e661	feat: Enhance vector visualization UI and parallelize search verification Vector Visualization Improvements: - Add interactive vector viz tab with Alpine.js and Plotly.js to user info page - Refactor viz route CSS for better scoping and maintainability - Remove unused nextcloud_host variable Performance Optimizations: - Parallelize access verification in fuzzy and keyword search algorithms - Use asyncio.gather() to verify multiple documents concurrently - Add exception handling with return_exceptions=True for resilience Dependencies: - Update third_party/oidc submodule to include RFC 9728 resource_url support 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-15 05:39:07 +01:00
Chris Coutinho	e3153822f7	perf: Exclude vector-sync status polling from distributed tracing Skip tracing for /app/vector-sync/status to reduce noise from HTMX polling. Metrics collection continues for this endpoint. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-15 05:19:35 +01:00
Chris Coutinho	2a078093ed	refactor!: Make all search algorithms query Qdrant payload, not Nextcloud BREAKING CHANGE: Search algorithms now require Qdrant to be populated. Vector sync must be enabled and documents indexed for search to work. - Keyword and fuzzy search now query Qdrant scroll API for title/excerpt - Remove inefficient Nextcloud API fetching pattern - Add optional Nextcloud verification for security - Deduplicate by (doc_id, doc_type) tuple, keeping chunk_index=0 - Align with document processor pattern that already stores text in Qdrant	2025-11-15 01:56:41 +01:00
Chris Coutinho	b5b03bfd78	feat: Add multi-document Protocol with cross-app search support Implements NextcloudClientProtocol for multi-document type search following user requirement that document types are not 1:1 with apps (e.g., Notes app specializes in markdown, while Files/WebDAV handles multiple file types). Key Changes: - NextcloudClientProtocol: Generic protocol with app-specific client properties - get_indexed_doc_types(): Query Qdrant for actually-indexed document types - Document dispatch: All algorithms check Qdrant before attempting access - Cross-type deduplication: Use (doc_id, doc_type) tuples in hybrid RRF Search Algorithm Updates: - Semantic: Added _verify_document_access() with dispatch to appropriate client - Deduplication by (doc_id, doc_type) tuple - Only "note" verification implemented, others return None with info log - Keyword: Added _fetch_documents() dispatch method - Queries Qdrant for available types before fetching - Supports cross-app search when doc_type=None - Fuzzy: Same pattern as keyword search - Hybrid: Already uses (doc_id, doc_type) for deduplication (no changes needed) Future-Proof Design: - File/calendar verification stubs in place - Clear logging when unsupported types found - Easy to extend when processor indexes new document types Currently Supported: - "note" documents fully implemented and tested - Other types gracefully handled (logged but skipped) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-15 01:19:29 +01:00
Chris Coutinho	11e620f2d1	feat: Implement unified search algorithm module Creates shared search module with four algorithms implementing ADR-012: - Semantic search (vector similarity via Qdrant) - Keyword search (token-based matching from ADR-001) - Fuzzy search (character overlap matching) - Hybrid search (RRF fusion from ADR-003) Architecture: - Base SearchAlgorithm interface for consistent API - SearchResult dataclass for unified result format - All algorithms async and independently testable - Proper logging and error handling throughout Semantic Search (search/semantic.py): - Extracted from server/semantic.py - Vector similarity using Qdrant query_points - Dual-phase authorization (vector filter + API verification) - Deduplication of document chunks - Configurable score threshold (default: 0.7) Keyword Search (search/keyword.py): - Implements ADR-001 token-based matching - Title matches weighted 3x higher than content - Case-insensitive token matching - Relevance scoring with normalization - Excerpt extraction with context Fuzzy Search (search/fuzzy.py): - Simple character overlap calculation - Configurable threshold (default: 70%) - Typo-tolerant matching - Fast and dependency-free Hybrid Search (search/hybrid.py): - Reciprocal Rank Fusion (RRF) from ADR-003 - Parallel execution of sub-algorithms - Configurable weights per algorithm - RRF constant k=60 (standard value) - Weight validation (must sum ≤1.0) All algorithms: - Share NextcloudClient for document access - Support user_id filtering (multi-tenant) - Support doc_type filtering (currently notes only) - Return consistent SearchResult objects - Properly formatted with ruff and type-checked Next steps: Update MCP tool to use these algorithms 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-15 00:10:19 +01:00

7 Commits