Chris Coutinho
cb39b3fca4
feat(vector): Add configurable chunk size and overlap for document embedding
...
Enable users to tune document chunking parameters to match their embedding
model and content type by adding DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP
environment variables.
- **config.py**: Added `document_chunk_size` (default: 512) and
`document_chunk_overlap` (default: 50) configuration fields with validation:
- Ensures overlap < chunk_size
- Warns if chunk_size < 100 words
- Prevents negative overlap values
- **processor.py**: Updated DocumentChunker instantiation to use config
settings instead of hardcoded values (line 174-177)
- **tests/unit/test_config.py**: Added TestChunkConfigValidation class with
9 tests covering:
- Default values
- Valid configurations
- Validation errors (overlap >= chunk_size, negative overlap)
- Warning for small chunk sizes
- Environment variable loading
- **docs/configuration.md**: Added comprehensive "Document Chunking
Configuration" section with:
- Chunk size selection guidance (256-384 vs 512 vs 768-1024 words)
- Overlap recommendations (10-20% of chunk size)
- Configuration examples for different use cases
- Added env vars to reference table
- **docs/semantic-search-architecture.md**: Added "Document Chunking Strategy"
section with:
- Chunking process explanation
- Example showing sliding window behavior
- Search behavior with chunks
- Tuning recommendations
- **env.sample**: Added complete "Semantic Search & Vector Sync Configuration"
section with:
- Vector sync settings
- Qdrant configuration (3 modes)
- Ollama embedding service
- Document chunking configuration
- **docker-compose.yml**: Added commented examples for DOCUMENT_CHUNK_SIZE and
DOCUMENT_CHUNK_OVERLAP with usage notes
\`\`\`bash
DOCUMENT_CHUNK_SIZE=512
DOCUMENT_CHUNK_OVERLAP=50
\`\`\`
1. \`overlap\` must be less than \`chunk_size\`
2. \`overlap\` cannot be negative
3. Warning issued if \`chunk_size\` < 100 words
**Precise matching** (small notes, specific queries):
\`\`\`bash
DOCUMENT_CHUNK_SIZE=256
DOCUMENT_CHUNK_OVERLAP=25
\`\`\`
**Balanced** (default, general purpose):
\`\`\`bash
DOCUMENT_CHUNK_SIZE=512
DOCUMENT_CHUNK_OVERLAP=50
\`\`\`
**Contextual** (long documents, broader topics):
\`\`\`bash
DOCUMENT_CHUNK_SIZE=1024
DOCUMENT_CHUNK_OVERLAP=100
\`\`\`
✅ **User control** - Tune chunking to match embedding model capabilities
✅ **Experimentation** - Test different chunk sizes for optimal results
✅ **Model alignment** - Match chunk size to embedding context window
✅ **Backward compatible** - Defaults maintain existing behavior
✅ **Well validated** - Comprehensive tests prevent misconfiguration
All 22 config validation tests pass (9 new tests for chunking):
- Default values work correctly
- Validation prevents invalid configurations
- Environment variables load properly
- Warning system works as expected
With configurable chunk sizes, users can now experiment with different Ollama
embedding models and tune chunk parameters for optimal semantic search quality.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-10 02:47:57 +01:00
Chris Coutinho
15113dbb03
fix: remove Hybrid Flow, make Progressive Consent default (ADR-004)
...
Eliminates scope escalation security vulnerability by removing Hybrid Flow
and making Progressive Consent the only OAuth mode.
Changes:
- Delete oauth_callback() and oauth_token() (Hybrid Flow only, ~314 lines)
- Fix scope flows: Flow 1 requests resource scopes, Flow 2 requests identity+offline
- Remove ENABLE_PROGRESSIVE_CONSENT flag (always enabled in OAuth mode)
- Update documentation to reflect Progressive Consent as default
- Delete test_adr004_hybrid_flow.py test file
- Remove unused variables (ruff lint fixes)
Security improvements:
- No scope escalation: client gets exactly what it requests
- Clear separation: MCP session tokens vs Nextcloud offline tokens
- OAuth2 compliant: follows best practices for scope handling
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-04 00:26:07 +01:00
Chris Coutinho
d16bcdcfbb
feat: Implement ADR-004 Progressive Consent foundation components
...
- Token Broker Service manages Nextcloud access tokens with audience validation
- Implements short-lived token caching (5-minute TTL) with early refresh
- Enhanced token storage schema with ADR-004 fields (flow_type, audience, provisioning)
- MCP provisioning tools for explicit Flow 2 resource authorization
- Comprehensive unit tests for Token Broker Service (14 tests, all passing)
- Environment configuration for Progressive Consent mode
This implements the foundation for the dual OAuth flow architecture where:
- Flow 1: MCP clients authenticate to MCP server (aud: "mcp-server")
- Flow 2: MCP server gets delegated Nextcloud access (aud: "nextcloud")
Users must explicitly call provision_nextcloud_access tool to grant resource access,
implementing the "stateless by default" principle from ADR-004.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-03 07:51:07 +01:00
Chris Coutinho
849c67c32a
fix: Complete Keycloak external IdP integration with all tests passing
...
This commit completes the Keycloak external IdP integration for the MCP
server, implementing ADR-002 Tier 2 (External Identity Provider) with
full Bearer token authentication support.
Key Changes:
1. **Keycloak backchannel-dynamic configuration**
- Added --hostname-strict=false and --hostname-backchannel-dynamic=true
- Allows external issuer (localhost:8888) with internal endpoints (keycloak:8080)
- Solves Docker networking issue where containers can't reach localhost
2. **CORSMiddleware Bearer token patch**
- Created app-hooks/patches/cors-bearer-token.patch from upstream commit 8fb5e77db82
- Allows Bearer tokens to bypass CORS/CSRF checks (stateless authentication)
- Applied via post-installation hook 20-apply-cors-bearer-token-patch.sh
- Enables app-specific APIs (Notes, Calendar, etc.) to work with Bearer tokens
3. **Patch organization**
- Moved patches to app-hooks/patches/ directory
- Updated docker-compose.yml to mount entire app-hooks directory
- Consolidated patch management for better maintainability
4. **Test improvements**
- All 11 Keycloak integration tests passing
- Tests validate OAuth token acquisition, MCP connectivity, token validation,
tool execution, token persistence, user provisioning, scope filtering,
and error handling
Architecture:
- Keycloak acts as external OAuth/OIDC identity provider
- MCP server uses Keycloak tokens to access Nextcloud APIs
- Nextcloud user_oidc app validates Bearer tokens from Keycloak
- No admin credentials needed - all API access uses user's OAuth tokens
Cache Note:
- Discovery and JWKS caches must be cleared when switching Keycloak configurations
- Use: docker compose exec redis redis-cli DEL "<cache-key>"
- Or: docker compose exec app php occ user_oidc:provider keycloak --clientid nextcloud
Related:
- ADR-002: Vector sync background jobs authentication
- Validates external IdP integration pattern
- Demonstrates offline_access with refresh tokens (Tier 1 & 2)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-02 22:03:20 +01:00
Chris Coutinho
a36038422b
feat: Add text processing background worker for telling client about progress
2025-10-25 19:52:45 +02:00
Chris Coutinho
2147fc1696
refactor: Transform document parsing into pluggable processor architecture
...
Refactors PR #190 's hardcoded Unstructured.io integration into a flexible,
extensible plugin system supporting multiple text extraction engines.
- **`DocumentProcessor` ABC**: Abstract interface for all processors
- **`ProcessorRegistry`**: Central registry for discovery and routing
- **`ProcessingResult`**: Standardized output format across processors
- **`UnstructuredProcessor`**: Refactored from `UnstructuredClient`
- **`TesseractProcessor`**: Local OCR for images (lightweight alternative)
- **`CustomHTTPProcessor`**: Generic wrapper for custom HTTP APIs
- New `get_document_processor_config()` returns structured config
- Supports enabling/disabling individual processors
- Per-processor configuration via environment variables
- **Breaking Change**: `ENABLE_UNSTRUCTURED_PARSING` replaced with:
- `ENABLE_DOCUMENT_PROCESSING=true/false` (master switch)
- `ENABLE_UNSTRUCTURED=true/false` (per-processor)
- `ENABLE_TESSERACT=true/false`
- `ENABLE_CUSTOM_PROCESSOR=true/false`
- `parse_document()` now uses `ProcessorRegistry`
- Auto-selects appropriate processor based on MIME type
- Processor priority system (Unstructured=10, Tesseract=5, Custom=1)
- `initialize_document_processors()` registers processors at startup
- Integrated into both BasicAuth and OAuth lifespans
- Graceful degradation if processors fail to initialize
```env
ENABLE_DOCUMENT_PROCESSING=false
ENABLE_UNSTRUCTURED=false
UNSTRUCTURED_API_URL=http://unstructured:8000
UNSTRUCTURED_STRATEGY=auto # auto|fast|hi_res
UNSTRUCTURED_LANGUAGES=eng,deu
ENABLE_TESSERACT=false
TESSERACT_LANG=eng
ENABLE_CUSTOM_PROCESSOR=false
CUSTOM_PROCESSOR_URL=http://localhost:9000/process
CUSTOM_PROCESSOR_TYPES=application/pdf,image/jpeg
```
- **Removed**: `tests/test_unstructured_config.py` (legacy tests)
- **Added**: `tests/unit/test_document_processor_config.py`
- 7 unit tests for new config system
- Tests individual and multi-processor configurations
- **Added**:
- `nextcloud_mcp_server/document_processors/__init__.py`
- `nextcloud_mcp_server/document_processors/base.py`
- `nextcloud_mcp_server/document_processors/registry.py`
- `nextcloud_mcp_server/document_processors/unstructured.py`
- `nextcloud_mcp_server/document_processors/tesseract.py`
- `nextcloud_mcp_server/document_processors/custom_http.py`
- `tests/unit/test_document_processor_config.py`
- **Modified**:
- `nextcloud_mcp_server/config.py` - New plugin config system
- `nextcloud_mcp_server/app.py` - Processor initialization
- `nextcloud_mcp_server/utils/document_parser.py` - Uses registry
- `nextcloud_mcp_server/server/webdav.py` - Import updates
- `env.sample` - New configuration format
- `docker-compose.yml` - (profile changes from previous work)
- **Removed**:
- `nextcloud_mcp_server/client/unstructured_client.py` - Replaced by UnstructuredProcessor
- `tests/test_unstructured_config.py` - Replaced with new tests
✅ **Extensible**: Add processors without modifying core code
✅ **Testable**: Mock processors for unit tests
✅ **Configurable**: Enable only needed processors
✅ **Flexible**: Choose fast (Tesseract) vs accurate (Unstructured)
✅ **Opt-in**: Disabled by default, no mandatory dependencies
Users upgrading from PR #190 need to update environment variables:
```bash
ENABLE_UNSTRUCTURED_PARSING=true
ENABLE_DOCUMENT_PROCESSING=true
ENABLE_UNSTRUCTURED=true
```
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-10-25 19:28:35 +02:00
yuisheaven
64649c902d
Merge branch 'master' into feature/introduce_files_parsing_with_unstructured_service_for_webdav_files_retrieval
2025-10-21 20:37:00 +02:00
Chris Coutinho
4d7e4b9a4b
feat(server): Experimental support for OAuth2/OIDC authentication
2025-10-14 01:22:15 +02:00
yuisheaven
c9a687171a
added envs for unstructured to control OCR quality and OCR languages
2025-10-04 05:21:02 +02:00
yuisheaven
642108ee91
added new "unstructured" docker service to compose stack and introduced new envs
2025-10-04 04:27:31 +02:00
Chris Coutinho
0d8666a2d7
Initial commit
2025-05-04 23:24:55 +02:00