nextcloud-mcp-server

Author	SHA1	Message	Date
Chris Coutinho	a62a007c87	feat: Add context expansion to semantic search with chunk overlap removal Implements optional context expansion for semantic search results that fetches adjacent chunks (N-1 and N+1) from Qdrant to provide before/after context. Removes configurable chunk overlap (default 200 chars) to avoid duplicate text appearing in both context and excerpt. Key changes: - Add include_context and context_chars parameters to nc_semantic_search and nc_semantic_search_answer tools - Implement Qdrant cache fast path for chunk retrieval (avoids re-fetching and re-parsing documents, especially important for PDFs) - Add _get_chunk_by_index_from_qdrant() to fetch adjacent chunks - Remove chunk overlap from before_context (last N chars) and after_context (first N chars) to prevent duplicate text - Fetch context in parallel with anyio.Semaphore (max 20 concurrent) - Pass through page_number from SearchResult to SemanticSearchResult - Remove document-level deduplication (keep chunk-level dedup from algorithm) Context expansion is opt-in via include_context=true parameter. When enabled: - Populates has_context_expansion, marked_text, before_context, after_context - Adds truncation flags when context exceeds context_chars limit - Falls back to document fetch for legacy data with truncated excerpts Related: nextcloud_mcp_server/search/context.py:87-382, nextcloud_mcp_server/server/semantic.py:161-255	2025-11-21 01:02:22 +01:00
Chris Coutinho	b8010270c1	fix: Add async/await, PDF metadata, and type safety fixes This commit addresses multiple issues with async operations, PDF metadata extraction, and type safety in document processing and search. ## Async/Await Fixes - processor.py:259 - Added await for chunker.chunk_text(content) - processor.py:270 - Added await for bm25_service.encode_batch(chunk_texts) - tests/unit/test_document_chunker.py - Converted all 12 test methods to async ## PDF Metadata Enhancement - pymupdf.py:143 - Added file_size metadata extraction - pymupdf.py:145-206 - Refactored to extract text page-by-page - Manually loop through pages instead of using page_chunks=True - Generate page_boundaries metadata for precise page tracking - Works around pymupdf.layout.activate() breaking page_chunks=True - processor.py:32-66 - Added assign_page_numbers() helper function - Assigns page numbers to chunks based on overlap with page boundaries - Handles chunks spanning multiple pages - processor.py:298-300 - Call assign_page_numbers() for PDF files ## Type Safety Fixes - bm25_hybrid.py:184 - Removed int() conversion of doc_id - semantic.py:131 - Removed int() conversion of doc_id - viz_routes.py:275 - Removed int() conversion of doc_id - Added comments documenting that doc_id can be int (notes) or str (file paths) ## Testing - All 18 tests passing (12 unit + 6 integration) - No type errors in modified files - Container logs show successful processing - Vector viz searches working correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-20 02:37:07 +01:00
Chris Coutinho	6cccd92b3b	build: Add type checking	2025-11-05 15:19:55 +01:00
Chris Coutinho	a36038422b	feat: Add text processing background worker for telling client about progress	2025-10-25 19:52:45 +02:00
Chris Coutinho	2147fc1696	refactor: Transform document parsing into pluggable processor architecture Refactors PR #190's hardcoded Unstructured.io integration into a flexible, extensible plugin system supporting multiple text extraction engines. - `DocumentProcessor` ABC: Abstract interface for all processors - `ProcessorRegistry`: Central registry for discovery and routing - `ProcessingResult`: Standardized output format across processors - `UnstructuredProcessor`: Refactored from `UnstructuredClient` - `TesseractProcessor`: Local OCR for images (lightweight alternative) - `CustomHTTPProcessor`: Generic wrapper for custom HTTP APIs - New `get_document_processor_config()` returns structured config - Supports enabling/disabling individual processors - Per-processor configuration via environment variables - Breaking Change: `ENABLE_UNSTRUCTURED_PARSING` replaced with: - `ENABLE_DOCUMENT_PROCESSING=true/false` (master switch) - `ENABLE_UNSTRUCTURED=true/false` (per-processor) - `ENABLE_TESSERACT=true/false` - `ENABLE_CUSTOM_PROCESSOR=true/false` - `parse_document()` now uses `ProcessorRegistry` - Auto-selects appropriate processor based on MIME type - Processor priority system (Unstructured=10, Tesseract=5, Custom=1) - `initialize_document_processors()` registers processors at startup - Integrated into both BasicAuth and OAuth lifespans - Graceful degradation if processors fail to initialize ```env ENABLE_DOCUMENT_PROCESSING=false ENABLE_UNSTRUCTURED=false UNSTRUCTURED_API_URL=http://unstructured:8000 UNSTRUCTURED_STRATEGY=auto # auto\|fast\|hi_res UNSTRUCTURED_LANGUAGES=eng,deu ENABLE_TESSERACT=false TESSERACT_LANG=eng ENABLE_CUSTOM_PROCESSOR=false CUSTOM_PROCESSOR_URL=http://localhost:9000/process CUSTOM_PROCESSOR_TYPES=application/pdf,image/jpeg ``` - Removed: `tests/test_unstructured_config.py` (legacy tests) - Added: `tests/unit/test_document_processor_config.py` - 7 unit tests for new config system - Tests individual and multi-processor configurations - Added: - `nextcloud_mcp_server/document_processors/__init__.py` - `nextcloud_mcp_server/document_processors/base.py` - `nextcloud_mcp_server/document_processors/registry.py` - `nextcloud_mcp_server/document_processors/unstructured.py` - `nextcloud_mcp_server/document_processors/tesseract.py` - `nextcloud_mcp_server/document_processors/custom_http.py` - `tests/unit/test_document_processor_config.py` - Modified: - `nextcloud_mcp_server/config.py` - New plugin config system - `nextcloud_mcp_server/app.py` - Processor initialization - `nextcloud_mcp_server/utils/document_parser.py` - Uses registry - `nextcloud_mcp_server/server/webdav.py` - Import updates - `env.sample` - New configuration format - `docker-compose.yml` - (profile changes from previous work) - Removed: - `nextcloud_mcp_server/client/unstructured_client.py` - Replaced by UnstructuredProcessor - `tests/test_unstructured_config.py` - Replaced with new tests ✅ Extensible: Add processors without modifying core code ✅ Testable: Mock processors for unit tests ✅ Configurable: Enable only needed processors ✅ Flexible: Choose fast (Tesseract) vs accurate (Unstructured) ✅ Opt-in: Disabled by default, no mandatory dependencies Users upgrading from PR #190 need to update environment variables: ```bash ENABLE_UNSTRUCTURED_PARSING=true ENABLE_DOCUMENT_PROCESSING=true ENABLE_UNSTRUCTURED=true ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-25 19:28:35 +02:00

5 Commits