nextcloud-mcp-server

Author	SHA1	Message	Date
Chris Coutinho	056414752e	fix(mcp): Move all imports to the top of modules	2025-12-26 10:05:27 -06:00
Chris Coutinho	8baa07db84	fix: Remove pymupdf.layout.activate() to fix page_chunks behavior pymupdf.layout.activate() causes pymupdf4llm.to_markdown() to ignore the page_chunks=True option, returning a single string instead of list[dict]. This broke per-page chunking needed for semantic search indexing. See: https://github.com/pymupdf/pymupdf4llm/issues/323 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-22 16:58:35 +01:00
Chris Coutinho	ba8a53803a	refactor: Simplify PDF text extraction with single to_markdown call Replace parallel per-page extraction with single to_markdown(page_chunks=True) call. This is more efficient as pymupdf4llm can optimize internally for full-document processing instead of making N separate calls for N pages. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-22 03:52:02 +01:00
Chris Coutinho	31fade9730	perf: Optimize PDF processing with parallel extraction and single-render highlights Phase 1 - PDF Highlighting Optimization: - Render each page ONCE instead of once per chunk (N chunks = 1 render, not N) - Use PIL to draw bounding boxes on copied base images (fast) instead of re-rendering page via pymupdf (slow) - Add _find_chunk_bbox() to extract bbox without modifying page Phase 2 - Parallel Page Extraction: - Use anyio task group with run_sync() for parallel page extraction - Each page extracted in separate thread via anyio.to_thread.run_sync() - Event loop stays responsive during extraction - Remove obsolete _process_sync() method Expected improvement: 30-50% reduction in total PDF processing time. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-22 03:11:56 +01:00
Chris Coutinho	fffe483c02	fix: Centralize PDF processing and generate separate images per chunk Previously, pymupdf4llm.to_markdown() was called twice - once in PyMuPDFProcessor during indexing and again in PDFHighlighter during visualization. Different image path lengths caused different character offsets, leading to highlighted pages not matching their chunks. Also fixed issue where all chunks on the same page showed all highlights instead of just their own highlight. Now restores original page contents between chunks using xref stream caching. Changes: - Add PDFHighlighter class requiring pre-computed page_boundaries and full_text from document processor (no fallback extraction) - Pass pre-computed data from processor to highlighter - Extract page-relative portion of chunk text for cross-page chunks - Add bounding box highlighting using text anchor search - Run highlight generation in parallel with embedding/BM25 - Cache and restore page contents to isolate highlights per chunk Results: Highlighting success rate improved from 51% to 95% (121/128). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-22 02:46:30 +01:00
Chris Coutinho	a62a007c87	feat: Add context expansion to semantic search with chunk overlap removal Implements optional context expansion for semantic search results that fetches adjacent chunks (N-1 and N+1) from Qdrant to provide before/after context. Removes configurable chunk overlap (default 200 chars) to avoid duplicate text appearing in both context and excerpt. Key changes: - Add include_context and context_chars parameters to nc_semantic_search and nc_semantic_search_answer tools - Implement Qdrant cache fast path for chunk retrieval (avoids re-fetching and re-parsing documents, especially important for PDFs) - Add _get_chunk_by_index_from_qdrant() to fetch adjacent chunks - Remove chunk overlap from before_context (last N chars) and after_context (first N chars) to prevent duplicate text - Fetch context in parallel with anyio.Semaphore (max 20 concurrent) - Pass through page_number from SearchResult to SemanticSearchResult - Remove document-level deduplication (keep chunk-level dedup from algorithm) Context expansion is opt-in via include_context=true parameter. When enabled: - Populates has_context_expansion, marked_text, before_context, after_context - Adds truncation flags when context exceeds context_chars limit - Falls back to document fetch for legacy data with truncated excerpts Related: nextcloud_mcp_server/search/context.py:87-382, nextcloud_mcp_server/server/semantic.py:161-255	2025-11-21 01:02:22 +01:00
Chris Coutinho	b8010270c1	fix: Add async/await, PDF metadata, and type safety fixes This commit addresses multiple issues with async operations, PDF metadata extraction, and type safety in document processing and search. ## Async/Await Fixes - processor.py:259 - Added await for chunker.chunk_text(content) - processor.py:270 - Added await for bm25_service.encode_batch(chunk_texts) - tests/unit/test_document_chunker.py - Converted all 12 test methods to async ## PDF Metadata Enhancement - pymupdf.py:143 - Added file_size metadata extraction - pymupdf.py:145-206 - Refactored to extract text page-by-page - Manually loop through pages instead of using page_chunks=True - Generate page_boundaries metadata for precise page tracking - Works around pymupdf.layout.activate() breaking page_chunks=True - processor.py:32-66 - Added assign_page_numbers() helper function - Assigns page numbers to chunks based on overlap with page boundaries - Handles chunks spanning multiple pages - processor.py:298-300 - Call assign_page_numbers() for PDF files ## Type Safety Fixes - bm25_hybrid.py:184 - Removed int() conversion of doc_id - semantic.py:131 - Removed int() conversion of doc_id - viz_routes.py:275 - Removed int() conversion of doc_id - Added comments documenting that doc_id can be int (notes) or str (file paths) ## Testing - All 18 tests passing (12 unit + 6 integration) - No type errors in modified files - Container logs show successful processing - Vector viz searches working correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-20 02:37:07 +01:00
Chris Coutinho	6cccd92b3b	build: Add type checking	2025-11-05 15:19:55 +01:00
Chris Coutinho	a36038422b	feat: Add text processing background worker for telling client about progress	2025-10-25 19:52:45 +02:00
Chris Coutinho	2147fc1696	refactor: Transform document parsing into pluggable processor architecture Refactors PR #190's hardcoded Unstructured.io integration into a flexible, extensible plugin system supporting multiple text extraction engines. - `DocumentProcessor` ABC: Abstract interface for all processors - `ProcessorRegistry`: Central registry for discovery and routing - `ProcessingResult`: Standardized output format across processors - `UnstructuredProcessor`: Refactored from `UnstructuredClient` - `TesseractProcessor`: Local OCR for images (lightweight alternative) - `CustomHTTPProcessor`: Generic wrapper for custom HTTP APIs - New `get_document_processor_config()` returns structured config - Supports enabling/disabling individual processors - Per-processor configuration via environment variables - Breaking Change: `ENABLE_UNSTRUCTURED_PARSING` replaced with: - `ENABLE_DOCUMENT_PROCESSING=true/false` (master switch) - `ENABLE_UNSTRUCTURED=true/false` (per-processor) - `ENABLE_TESSERACT=true/false` - `ENABLE_CUSTOM_PROCESSOR=true/false` - `parse_document()` now uses `ProcessorRegistry` - Auto-selects appropriate processor based on MIME type - Processor priority system (Unstructured=10, Tesseract=5, Custom=1) - `initialize_document_processors()` registers processors at startup - Integrated into both BasicAuth and OAuth lifespans - Graceful degradation if processors fail to initialize ```env ENABLE_DOCUMENT_PROCESSING=false ENABLE_UNSTRUCTURED=false UNSTRUCTURED_API_URL=http://unstructured:8000 UNSTRUCTURED_STRATEGY=auto # auto\|fast\|hi_res UNSTRUCTURED_LANGUAGES=eng,deu ENABLE_TESSERACT=false TESSERACT_LANG=eng ENABLE_CUSTOM_PROCESSOR=false CUSTOM_PROCESSOR_URL=http://localhost:9000/process CUSTOM_PROCESSOR_TYPES=application/pdf,image/jpeg ``` - Removed: `tests/test_unstructured_config.py` (legacy tests) - Added: `tests/unit/test_document_processor_config.py` - 7 unit tests for new config system - Tests individual and multi-processor configurations - Added: - `nextcloud_mcp_server/document_processors/__init__.py` - `nextcloud_mcp_server/document_processors/base.py` - `nextcloud_mcp_server/document_processors/registry.py` - `nextcloud_mcp_server/document_processors/unstructured.py` - `nextcloud_mcp_server/document_processors/tesseract.py` - `nextcloud_mcp_server/document_processors/custom_http.py` - `tests/unit/test_document_processor_config.py` - Modified: - `nextcloud_mcp_server/config.py` - New plugin config system - `nextcloud_mcp_server/app.py` - Processor initialization - `nextcloud_mcp_server/utils/document_parser.py` - Uses registry - `nextcloud_mcp_server/server/webdav.py` - Import updates - `env.sample` - New configuration format - `docker-compose.yml` - (profile changes from previous work) - Removed: - `nextcloud_mcp_server/client/unstructured_client.py` - Replaced by UnstructuredProcessor - `tests/test_unstructured_config.py` - Replaced with new tests ✅ Extensible: Add processors without modifying core code ✅ Testable: Mock processors for unit tests ✅ Configurable: Enable only needed processors ✅ Flexible: Choose fast (Tesseract) vs accurate (Unstructured) ✅ Opt-in: Disabled by default, no mandatory dependencies Users upgrading from PR #190 need to update environment variables: ```bash ENABLE_UNSTRUCTURED_PARSING=true ENABLE_DOCUMENT_PROCESSING=true ENABLE_UNSTRUCTURED=true ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-25 19:28:35 +02:00

10 Commits