nextcloud-mcp-server

Author	SHA1	Message	Date
Chris Coutinho	b0612cfa0f	perf: Optimize vector viz search performance - Replace sequential Qdrant scroll calls with batch retrieve (50 HTTP requests → 1 request, ~50x faster vector fetch) - Add point_id to SearchResult to enable batch retrieval by Qdrant point ID - Reuse query embedding from search algorithm in viz_routes (eliminates redundant embedding call, saves ~30ms) - Make BM25 encode() async with thread pool to avoid blocking event loop (~4.4s was blocking, now properly async) - Run PCA computation in thread pool to avoid blocking event loop (~1.2s was blocking, now properly async) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-22 19:47:43 +01:00
Chris Coutinho	13b2d0048c	feat: Implement Qdrant placeholder state management Introduces a placeholder-based state tracking system to prevent duplicate document processing during the gap between scanner queuing and processor completion. Key Changes: 1. Placeholder Helper Functions (`vector/placeholder.py`): - `write_placeholder_point()` - Creates zero-vector placeholder when queuing - `query_document_metadata()` - Queries for existing entry (placeholder or real) - `delete_placeholder_point()` - Removes placeholder before writing real vectors - `get_placeholder_filter()` - Filters placeholders from user-facing queries 2. Scanner Updates (`vector/scanner.py`): - Replace `indexed_at` comparison with `modified_at` comparison - Write placeholder before queuing each document - Query per-document metadata instead of bulk-querying indexed_at - Fixes bug where files were resubmitted every scan cycle 3. Processor Updates (`vector/processor.py`): - Delete placeholder before upserting real vectors - Ensures no duplicate points in Qdrant 4. Query Filters (all search files): - Add `get_placeholder_filter()` to all user-facing queries - Ensures placeholders never appear in search results or visualizations - Applied to: bm25_hybrid.py, semantic.py, viz_routes.py, algorithms.py Architecture: - Placeholders use zero vectors with dimension from embedding service - Payload includes `is_placeholder: True` flag for filtering - Status field tracks: "pending", "processing", "completed", "failed" - Deterministic UUIDs using uuid5 for consistent point IDs Impact: - Eliminates duplicate processing of same documents - Fixes race condition where long-running documents get queued multiple times - Prevents scanner from resubmitting files every scan cycle - Maintains clean separation between in-flight and indexed documents 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-20 15:04:00 +01:00
Chris Coutinho	327d843f64	feat: Implement per-chunk vector visualization with context expansion Major improvements to vector visualization page: - Refactor PCA to display individual chunks instead of averaged documents - Add context expansion module for fetching surrounding text from notes and PDFs - Update deduplication to use (doc_id, doc_type, chunk_start, chunk_end) keys - Fix Alpine.js rendering with chunk-specific keys including offsets - Refactor authentication helper to return NextcloudClient for better reuse - Add async context manager support to NextcloudClient Technical details: - viz_routes.py: Fetch specific chunk vectors instead of averaging per document - context.py: New module supporting both notes and PDF text extraction via PyMuPDF - search algorithms: Extract page_number, chunk_index, total_chunks from Qdrant - vector-viz.js/html: Use chunk positions in expansion tracking keys This enables users to see which specific chunks match their query and view them with surrounding context in the PCA visualization. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-20 11:22:20 +01:00
Chris Coutinho	b8010270c1	fix: Add async/await, PDF metadata, and type safety fixes This commit addresses multiple issues with async operations, PDF metadata extraction, and type safety in document processing and search. ## Async/Await Fixes - processor.py:259 - Added await for chunker.chunk_text(content) - processor.py:270 - Added await for bm25_service.encode_batch(chunk_texts) - tests/unit/test_document_chunker.py - Converted all 12 test methods to async ## PDF Metadata Enhancement - pymupdf.py:143 - Added file_size metadata extraction - pymupdf.py:145-206 - Refactored to extract text page-by-page - Manually loop through pages instead of using page_chunks=True - Generate page_boundaries metadata for precise page tracking - Works around pymupdf.layout.activate() breaking page_chunks=True - processor.py:32-66 - Added assign_page_numbers() helper function - Assigns page numbers to chunks based on overlap with page boundaries - Handles chunks spanning multiple pages - processor.py:298-300 - Call assign_page_numbers() for PDF files ## Type Safety Fixes - bm25_hybrid.py:184 - Removed int() conversion of doc_id - semantic.py:131 - Removed int() conversion of doc_id - viz_routes.py:275 - Removed int() conversion of doc_id - Added comments documenting that doc_id can be int (notes) or str (file paths) ## Testing - All 18 tests passing (12 unit + 6 integration) - No type errors in modified files - Container logs show successful processing - Vector viz searches working correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-20 02:37:07 +01:00
Chris Coutinho	3464b21845	fix: Relax SearchResult validation to support DBSF fusion scores > 1.0 Fix false-positive validation error where DBSF (Distribution-Based Score Fusion) correctly produces scores > 1.0 but SearchResult validation incorrectly rejected them. Root Cause: SearchResult.__post_init__() enforced scores in [0.0, 1.0] range, but DBSF sums normalized scores from multiple retrieval systems (dense semantic + sparse BM25), resulting in scores like 1.55 when both systems strongly agree a document is relevant. Changes: - Relaxed validation to allow any score ≥ 0.0 (algorithms.py:147-157) - Updated SearchResult and SemanticSearchResult documentation to explain score ranges for RRF ([0.0, 1.0]) vs DBSF (unbounded) - Added comprehensive test coverage for both fusion methods - Added DBSF fusion option to vector visualization UI - Updated viz routes and vizApp() to support fusion parameter selection Testing: All 157 unit tests pass, type checking passes, ruff passes Fixes error: "Configuration error: Score must be between 0.0 and 1.0, got 1.1528953" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-17 06:32:30 +01:00
Chris Coutinho	6fe5596c13	feat: Implement BM25 hybrid search with native Qdrant RRF fusion Replace custom keyword/fuzzy search algorithms with industry-standard BM25 sparse vectors, combined with dense semantic vectors using Qdrant's native Reciprocal Rank Fusion (RRF). This consolidates search architecture and improves relevance for both semantic and keyword queries. Key changes: - Add fastembed dependency for BM25 sparse vector generation - Update Qdrant collection schema to support named vectors (dense + sparse) - Create BM25SparseEmbeddingProvider using FastEmbed's Qdrant/bm25 model - Implement BM25HybridSearchAlgorithm with native Qdrant RRF prefetch - Update document processor to generate both dense and sparse embeddings - Simplify nc_semantic_search() tool to use BM25 hybrid only - Remove legacy keyword.py, fuzzy.py, and custom hybrid.py (736 lines) - Update ADR-014 with implementation notes and test results Benefits: - Consolidated architecture (single Qdrant database) - Native database-level RRF fusion (more efficient) - Industry-standard BM25 (replaces brittle custom keyword search) - Better relevance across semantic and keyword queries - Simplified codebase (-285 net lines) Tests: All 125 tests passing (118 unit, 7 integration) Implements ADR-014: Replace Custom Keyword Search with BM25 Hybrid Search 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-16 06:59:44 +01:00
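
Commit 13b2d0048c describes placeholders as zero-vector points with an `is_placeholder: True` payload flag, a status field, and deterministic uuid5 point IDs, with the scanner comparing `modified_at` to decide whether to queue a document. The sketch below illustrates that idea with the standard library only. The namespace, the key format, the payload layout, and the helper names `placeholder_point_id`, `build_placeholder`, and `needs_processing` are assumptions made for illustration; they are not taken from `vector/placeholder.py`.

```python
import uuid

# Assumption: the commit does not name the uuid5 namespace or the key format.
PLACEHOLDER_NAMESPACE = uuid.NAMESPACE_URL


def placeholder_point_id(doc_id, doc_type):
    """Derive a stable point ID so re-queuing a document reuses the same point."""
    return str(uuid.uuid5(PLACEHOLDER_NAMESPACE, f"{doc_type}:{doc_id}"))


def build_placeholder(doc_id, doc_type, modified_at, dimension):
    """Build a zero-vector placeholder record (hypothetical field layout)."""
    return {
        "id": placeholder_point_id(doc_id, doc_type),
        "vector": [0.0] * dimension,  # dimension would come from the embedding service
        "payload": {
            "doc_id": doc_id,
            "doc_type": doc_type,
            "modified_at": modified_at,
            "is_placeholder": True,  # lets user-facing queries filter it out
            "status": "pending",
        },
    }


def needs_processing(existing_payload, modified_at):
    """Queue a document only if it is unknown or changed since it was last seen."""
    if existing_payload is None:
        return True
    return modified_at > existing_payload["modified_at"]


if __name__ == "__main__":
    point = build_placeholder(doc_id=42, doc_type="note", modified_at=1700000000, dimension=4)
    print(point["id"], point["payload"]["status"])
    # The same document with an unchanged timestamp is not queued again.
    print(needs_processing(point["payload"], modified_at=1700000000))
```

Because the ID is derived from the document identity rather than generated randomly, writing a placeholder twice for the same document overwrites one point instead of creating a duplicate.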

6 Commits
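
Commit b8010270c1 mentions an `assign_page_numbers()` helper that assigns page numbers to chunks based on their overlap with page boundaries, including chunks that span multiple pages. A minimal sketch of one way to do this follows. The dictionary keys (`start`, `end`, `page`, `page_number`) and the rule of picking the page with the largest character overlap are assumptions; the actual helper in `processor.py` may use different data structures or a different tie-breaking rule.

```python
def assign_page_numbers(chunks, page_boundaries):
    """Tag each chunk with the page that covers most of its characters.

    chunks: list of dicts with half-open character offsets 'start' and 'end'.
    page_boundaries: list of dicts with 'page', 'start', 'end' (same offset space).
    Both layouts are hypothetical and chosen for illustration.
    """
    for chunk in chunks:
        best_page = None
        best_overlap = 0
        for page in page_boundaries:
            # Length of the intersection of the two half-open ranges.
            overlap = min(chunk["end"], page["end"]) - max(chunk["start"], page["start"])
            if overlap > best_overlap:
                best_page = page["page"]
                best_overlap = overlap
        # Stays None when the chunk overlaps no page at all.
        chunk["page_number"] = best_page
    return chunks


if __name__ == "__main__":
    pages = [
        {"page": 1, "start": 0, "end": 100},
        {"page": 2, "start": 100, "end": 250},
    ]
    chunks = [
        {"start": 0, "end": 50},    # entirely on page 1
        {"start": 80, "end": 130},  # 20 characters on page 1, 30 on page 2
    ]
    for chunk in assign_page_numbers(chunks, pages):
        print(chunk)
```

With this rule a chunk that straddles a page break is attributed to the page holding most of its text; ties go to the earlier page because only a strictly larger overlap replaces the current best.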