nextcloud-mcp-server

Author	SHA1	Message	Date
Chris Coutinho	cb39b3fca4	feat(vector): Add configurable chunk size and overlap for document embedding Enable users to tune document chunking parameters to match their embedding model and content type by adding DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP environment variables. - config.py: Added `document_chunk_size` (default: 512) and `document_chunk_overlap` (default: 50) configuration fields with validation: - Ensures overlap < chunk_size - Warns if chunk_size < 100 words - Prevents negative overlap values - processor.py: Updated DocumentChunker instantiation to use config settings instead of hardcoded values (line 174-177) - tests/unit/test_config.py: Added TestChunkConfigValidation class with 9 tests covering: - Default values - Valid configurations - Validation errors (overlap >= chunk_size, negative overlap) - Warning for small chunk sizes - Environment variable loading - docs/configuration.md: Added comprehensive "Document Chunking Configuration" section with: - Chunk size selection guidance (256-384 vs 512 vs 768-1024 words) - Overlap recommendations (10-20% of chunk size) - Configuration examples for different use cases - Added env vars to reference table - docs/semantic-search-architecture.md: Added "Document Chunking Strategy" section with: - Chunking process explanation - Example showing sliding window behavior - Search behavior with chunks - Tuning recommendations - env.sample: Added complete "Semantic Search & Vector Sync Configuration" section with: - Vector sync settings - Qdrant configuration (3 modes) - Ollama embedding service - Document chunking configuration - docker-compose.yml: Added commented examples for DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP with usage notes \`\`\`bash DOCUMENT_CHUNK_SIZE=512 DOCUMENT_CHUNK_OVERLAP=50 \`\`\` 1. \`overlap\` must be less than \`chunk_size\` 2. \`overlap\` cannot be negative 3. Warning issued if \`chunk_size\` < 100 words Precise matching (small notes, specific queries): \`\`\`bash DOCUMENT_CHUNK_SIZE=256 DOCUMENT_CHUNK_OVERLAP=25 \`\`\` Balanced (default, general purpose): \`\`\`bash DOCUMENT_CHUNK_SIZE=512 DOCUMENT_CHUNK_OVERLAP=50 \`\`\` Contextual (long documents, broader topics): \`\`\`bash DOCUMENT_CHUNK_SIZE=1024 DOCUMENT_CHUNK_OVERLAP=100 \`\`\` ✅ User control - Tune chunking to match embedding model capabilities ✅ Experimentation - Test different chunk sizes for optimal results ✅ Model alignment - Match chunk size to embedding context window ✅ Backward compatible - Defaults maintain existing behavior ✅ Well validated - Comprehensive tests prevent misconfiguration All 22 config validation tests pass (9 new tests for chunking): - Default values work correctly - Validation prevents invalid configurations - Environment variables load properly - Warning system works as expected With configurable chunk sizes, users can now experiment with different Ollama embedding models and tune chunk parameters for optimal semantic search quality. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-10 02:47:57 +01:00
Chris Coutinho	15113dbb03	fix: remove Hybrid Flow, make Progressive Consent default (ADR-004) Eliminates scope escalation security vulnerability by removing Hybrid Flow and making Progressive Consent the only OAuth mode. Changes: - Delete oauth_callback() and oauth_token() (Hybrid Flow only, ~314 lines) - Fix scope flows: Flow 1 requests resource scopes, Flow 2 requests identity+offline - Remove ENABLE_PROGRESSIVE_CONSENT flag (always enabled in OAuth mode) - Update documentation to reflect Progressive Consent as default - Delete test_adr004_hybrid_flow.py test file - Remove unused variables (ruff lint fixes) Security improvements: - No scope escalation: client gets exactly what it requests - Clear separation: MCP session tokens vs Nextcloud offline tokens - OAuth2 compliant: follows best practices for scope handling 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-04 00:26:07 +01:00
Chris Coutinho	d16bcdcfbb	feat: Implement ADR-004 Progressive Consent foundation components - Token Broker Service manages Nextcloud access tokens with audience validation - Implements short-lived token caching (5-minute TTL) with early refresh - Enhanced token storage schema with ADR-004 fields (flow_type, audience, provisioning) - MCP provisioning tools for explicit Flow 2 resource authorization - Comprehensive unit tests for Token Broker Service (14 tests, all passing) - Environment configuration for Progressive Consent mode This implements the foundation for the dual OAuth flow architecture where: - Flow 1: MCP clients authenticate to MCP server (aud: "mcp-server") - Flow 2: MCP server gets delegated Nextcloud access (aud: "nextcloud") Users must explicitly call provision_nextcloud_access tool to grant resource access, implementing the "stateless by default" principle from ADR-004. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-03 07:51:07 +01:00
Chris Coutinho	849c67c32a	fix: Complete Keycloak external IdP integration with all tests passing This commit completes the Keycloak external IdP integration for the MCP server, implementing ADR-002 Tier 2 (External Identity Provider) with full Bearer token authentication support. Key Changes: 1. Keycloak backchannel-dynamic configuration - Added --hostname-strict=false and --hostname-backchannel-dynamic=true - Allows external issuer (localhost:8888) with internal endpoints (keycloak:8080) - Solves Docker networking issue where containers can't reach localhost 2. CORSMiddleware Bearer token patch - Created app-hooks/patches/cors-bearer-token.patch from upstream commit 8fb5e77db82 - Allows Bearer tokens to bypass CORS/CSRF checks (stateless authentication) - Applied via post-installation hook 20-apply-cors-bearer-token-patch.sh - Enables app-specific APIs (Notes, Calendar, etc.) to work with Bearer tokens 3. Patch organization - Moved patches to app-hooks/patches/ directory - Updated docker-compose.yml to mount entire app-hooks directory - Consolidated patch management for better maintainability 4. Test improvements - All 11 Keycloak integration tests passing - Tests validate OAuth token acquisition, MCP connectivity, token validation, tool execution, token persistence, user provisioning, scope filtering, and error handling Architecture: - Keycloak acts as external OAuth/OIDC identity provider - MCP server uses Keycloak tokens to access Nextcloud APIs - Nextcloud user_oidc app validates Bearer tokens from Keycloak - No admin credentials needed - all API access uses user's OAuth tokens Cache Note: - Discovery and JWKS caches must be cleared when switching Keycloak configurations - Use: docker compose exec redis redis-cli DEL "<cache-key>" - Or: docker compose exec app php occ user_oidc:provider keycloak --clientid nextcloud Related: - ADR-002: Vector sync background jobs authentication - Validates external IdP integration pattern - Demonstrates offline_access with refresh tokens (Tier 1 & 2) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-02 22:03:20 +01:00
Chris Coutinho	a36038422b	feat: Add text processing background worker for telling client about progress	2025-10-25 19:52:45 +02:00
Chris Coutinho	2147fc1696	refactor: Transform document parsing into pluggable processor architecture Refactors PR #190's hardcoded Unstructured.io integration into a flexible, extensible plugin system supporting multiple text extraction engines. - `DocumentProcessor` ABC: Abstract interface for all processors - `ProcessorRegistry`: Central registry for discovery and routing - `ProcessingResult`: Standardized output format across processors - `UnstructuredProcessor`: Refactored from `UnstructuredClient` - `TesseractProcessor`: Local OCR for images (lightweight alternative) - `CustomHTTPProcessor`: Generic wrapper for custom HTTP APIs - New `get_document_processor_config()` returns structured config - Supports enabling/disabling individual processors - Per-processor configuration via environment variables - Breaking Change: `ENABLE_UNSTRUCTURED_PARSING` replaced with: - `ENABLE_DOCUMENT_PROCESSING=true/false` (master switch) - `ENABLE_UNSTRUCTURED=true/false` (per-processor) - `ENABLE_TESSERACT=true/false` - `ENABLE_CUSTOM_PROCESSOR=true/false` - `parse_document()` now uses `ProcessorRegistry` - Auto-selects appropriate processor based on MIME type - Processor priority system (Unstructured=10, Tesseract=5, Custom=1) - `initialize_document_processors()` registers processors at startup - Integrated into both BasicAuth and OAuth lifespans - Graceful degradation if processors fail to initialize ```env ENABLE_DOCUMENT_PROCESSING=false ENABLE_UNSTRUCTURED=false UNSTRUCTURED_API_URL=http://unstructured:8000 UNSTRUCTURED_STRATEGY=auto # auto\|fast\|hi_res UNSTRUCTURED_LANGUAGES=eng,deu ENABLE_TESSERACT=false TESSERACT_LANG=eng ENABLE_CUSTOM_PROCESSOR=false CUSTOM_PROCESSOR_URL=http://localhost:9000/process CUSTOM_PROCESSOR_TYPES=application/pdf,image/jpeg ``` - Removed: `tests/test_unstructured_config.py` (legacy tests) - Added: `tests/unit/test_document_processor_config.py` - 7 unit tests for new config system - Tests individual and multi-processor configurations - Added: - `nextcloud_mcp_server/document_processors/__init__.py` - `nextcloud_mcp_server/document_processors/base.py` - `nextcloud_mcp_server/document_processors/registry.py` - `nextcloud_mcp_server/document_processors/unstructured.py` - `nextcloud_mcp_server/document_processors/tesseract.py` - `nextcloud_mcp_server/document_processors/custom_http.py` - `tests/unit/test_document_processor_config.py` - Modified: - `nextcloud_mcp_server/config.py` - New plugin config system - `nextcloud_mcp_server/app.py` - Processor initialization - `nextcloud_mcp_server/utils/document_parser.py` - Uses registry - `nextcloud_mcp_server/server/webdav.py` - Import updates - `env.sample` - New configuration format - `docker-compose.yml` - (profile changes from previous work) - Removed: - `nextcloud_mcp_server/client/unstructured_client.py` - Replaced by UnstructuredProcessor - `tests/test_unstructured_config.py` - Replaced with new tests ✅ Extensible: Add processors without modifying core code ✅ Testable: Mock processors for unit tests ✅ Configurable: Enable only needed processors ✅ Flexible: Choose fast (Tesseract) vs accurate (Unstructured) ✅ Opt-in: Disabled by default, no mandatory dependencies Users upgrading from PR #190 need to update environment variables: ```bash ENABLE_UNSTRUCTURED_PARSING=true ENABLE_DOCUMENT_PROCESSING=true ENABLE_UNSTRUCTURED=true ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-25 19:28:35 +02:00
yuisheaven	64649c902d	Merge branch 'master' into feature/introduce_files_parsing_with_unstructured_service_for_webdav_files_retrieval	2025-10-21 20:37:00 +02:00
Chris Coutinho	4d7e4b9a4b	feat(server): Experimental support for OAuth2/OIDC authentication	2025-10-14 01:22:15 +02:00
yuisheaven	c9a687171a	added envs for unstructured to control OCR quality and OCR languages	2025-10-04 05:21:02 +02:00
yuisheaven	642108ee91	added new "unstructured" docker service to compose stack and introduced new envs	2025-10-04 04:27:31 +02:00
Chris Coutinho	0d8666a2d7	Initial commit	2025-05-04 23:24:55 +02:00

11 Commits