2147fc1696
Refactors PR #190's hardcoded Unstructured.io integration into a flexible, extensible plugin system supporting multiple text extraction engines.

- **`DocumentProcessor` ABC**: Abstract interface for all processors
- **`ProcessorRegistry`**: Central registry for discovery and routing
- **`ProcessingResult`**: Standardized output format across processors

- **`UnstructuredProcessor`**: Refactored from `UnstructuredClient`
- **`TesseractProcessor`**: Local OCR for images (lightweight alternative)
- **`CustomHTTPProcessor`**: Generic wrapper for custom HTTP APIs

- New `get_document_processor_config()` returns structured config
- Supports enabling/disabling individual processors
- Per-processor configuration via environment variables
- **Breaking Change**: `ENABLE_UNSTRUCTURED_PARSING` replaced with:
  - `ENABLE_DOCUMENT_PROCESSING=true/false` (master switch)
  - `ENABLE_UNSTRUCTURED=true/false` (per-processor)
  - `ENABLE_TESSERACT=true/false`
  - `ENABLE_CUSTOM_PROCESSOR=true/false`

- `parse_document()` now uses `ProcessorRegistry`
- Auto-selects appropriate processor based on MIME type
- Processor priority system (Unstructured=10, Tesseract=5, Custom=1)

- `initialize_document_processors()` registers processors at startup
- Integrated into both BasicAuth and OAuth lifespans
- Graceful degradation if processors fail to initialize

```env
ENABLE_DOCUMENT_PROCESSING=false
ENABLE_UNSTRUCTURED=false
UNSTRUCTURED_API_URL=http://unstructured:8000
UNSTRUCTURED_STRATEGY=auto # auto|fast|hi_res
UNSTRUCTURED_LANGUAGES=eng,deu
ENABLE_TESSERACT=false
TESSERACT_LANG=eng
ENABLE_CUSTOM_PROCESSOR=false
CUSTOM_PROCESSOR_URL=http://localhost:9000/process
CUSTOM_PROCESSOR_TYPES=application/pdf,image/jpeg
```

- **Removed**: `tests/test_unstructured_config.py` (legacy tests)
- **Added**: `tests/unit/test_document_processor_config.py`
  - 7 unit tests for new config system
  - Tests individual and multi-processor configurations

- **Added**:
  - `nextcloud_mcp_server/document_processors/__init__.py`
  - `nextcloud_mcp_server/document_processors/base.py`
  - `nextcloud_mcp_server/document_processors/registry.py`
  - `nextcloud_mcp_server/document_processors/unstructured.py`
  - `nextcloud_mcp_server/document_processors/tesseract.py`
  - `nextcloud_mcp_server/document_processors/custom_http.py`
  - `tests/unit/test_document_processor_config.py`
- **Modified**:
  - `nextcloud_mcp_server/config.py` - New plugin config system
  - `nextcloud_mcp_server/app.py` - Processor initialization
  - `nextcloud_mcp_server/utils/document_parser.py` - Uses registry
  - `nextcloud_mcp_server/server/webdav.py` - Import updates
  - `env.sample` - New configuration format
  - `docker-compose.yml` - (profile changes from previous work)
- **Removed**:
  - `nextcloud_mcp_server/client/unstructured_client.py` - Replaced by UnstructuredProcessor
  - `tests/test_unstructured_config.py` - Replaced with new tests

✅ **Extensible**: Add processors without modifying core code
✅ **Testable**: Mock processors for unit tests
✅ **Configurable**: Enable only needed processors
✅ **Flexible**: Choose fast (Tesseract) vs accurate (Unstructured)
✅ **Opt-in**: Disabled by default, no mandatory dependencies

Users upgrading from PR #190 need to update environment variables:

```bash
# Before (PR #190)
ENABLE_UNSTRUCTURED_PARSING=true

# After
ENABLE_DOCUMENT_PROCESSING=true
ENABLE_UNSTRUCTURED=true
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
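The routing described in the commit message (a registry that picks a processor by MIME type and priority) can be illustrated with a minimal sketch. Only the class names `DocumentProcessor`, `ProcessorRegistry`, and `ProcessingResult` and the priority values come from the commit message; every attribute and method name below (`supports`, `process`, `register`, `find_processor`, the `ProcessingResult` fields) and the `FakeProcessor` stand-in are assumptions made for illustration. The sketch is also synchronous, which may not match the actual code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ProcessingResult:
    """Standardized output. Field names are illustrative assumptions."""
    text: str
    processor: str
    metadata: dict = field(default_factory=dict)


class DocumentProcessor(ABC):
    """Abstract interface. Attribute and method names are hypothetical."""
    name: str
    priority: int
    supported_mime_types: frozenset

    def supports(self, mime_type: str) -> bool:
        return mime_type in self.supported_mime_types

    @abstractmethod
    def process(self, content: bytes, mime_type: str) -> ProcessingResult:
        ...


class ProcessorRegistry:
    """Keeps registered processors and routes by MIME type and priority."""

    def __init__(self):
        self._processors = {}

    def register(self, processor: DocumentProcessor) -> None:
        self._processors[processor.name] = processor

    def find_processor(self, mime_type: str):
        # Among processors that accept the MIME type, prefer the highest priority.
        candidates = [p for p in self._processors.values() if p.supports(mime_type)]
        return max(candidates, key=lambda p: p.priority, default=None)


class FakeProcessor(DocumentProcessor):
    """Stand-in for a real engine; it only decodes the bytes as UTF-8."""

    def __init__(self, name, priority, mime_types):
        self.name = name
        self.priority = priority
        self.supported_mime_types = frozenset(mime_types)

    def process(self, content, mime_type):
        text = content.decode("utf-8", errors="replace")
        return ProcessingResult(text=text, processor=self.name)


# Priorities taken from the commit message: Unstructured=10, Tesseract=5, Custom=1.
registry = ProcessorRegistry()
registry.register(FakeProcessor("unstructured", 10, {"application/pdf", "image/png"}))
registry.register(FakeProcessor("tesseract", 5, {"image/png", "image/jpeg"}))
registry.register(FakeProcessor("custom", 1, {"application/pdf", "image/jpeg"}))
```

With these registrations, `registry.find_processor("image/png")` returns the higher-priority Unstructured stand-in even though the Tesseract stand-in also accepts PNG, and an unsupported type such as `text/plain` yields `None`, which a caller could treat as "no processing available".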
93 lines
3.2 KiB
Plaintext
# Nextcloud Instance
NEXTCLOUD_HOST=

# ===== AUTHENTICATION MODE =====
# Choose ONE of the following:

# Option 1: OAuth2/OIDC (RECOMMENDED - More Secure)
# - Requires Nextcloud OIDC app installed and configured
# - Admin must enable "Dynamic Client Registration" in OIDC app settings
# - Leave NEXTCLOUD_USERNAME and NEXTCLOUD_PASSWORD empty to use OAuth mode
# - Optional: Pre-register client and provide credentials (otherwise auto-registers)
NEXTCLOUD_OIDC_CLIENT_ID=
NEXTCLOUD_OIDC_CLIENT_SECRET=
NEXTCLOUD_OIDC_CLIENT_STORAGE=.nextcloud_oauth_client.json
NEXTCLOUD_MCP_SERVER_URL=http://localhost:8000

# Option 2: Basic Authentication (LEGACY - Less Secure)
# - Requires username and password
# - Credentials stored in environment variables
# - Use only for backward compatibility or if OAuth unavailable
# - If these are set, OAuth mode is disabled
NEXTCLOUD_USERNAME=
NEXTCLOUD_PASSWORD=

# ============================================
# Document Processing Configuration
# ============================================
# Enable document processing (PDF, DOCX, images, etc.)
# Set to false to disable all document processing
ENABLE_DOCUMENT_PROCESSING=false

# Default processor to use when multiple are available
# Options: unstructured, tesseract, custom
DOCUMENT_PROCESSOR=unstructured

# ============================================
# Unstructured.io Processor
# ============================================
# Enable Unstructured processor (requires unstructured service in docker-compose)
# This is a cloud-based/API processor supporting many document types
ENABLE_UNSTRUCTURED=false

# Unstructured API endpoint
UNSTRUCTURED_API_URL=http://unstructured:8000

# Request timeout in seconds (default: 120)
# OCR operations can take 30-120 seconds for large documents
UNSTRUCTURED_TIMEOUT=120

# Parsing strategy: auto, fast, hi_res
# - auto: Automatically choose based on document type
# - fast: Fast parsing without OCR
# - hi_res: High-resolution with OCR (slowest, most accurate)
UNSTRUCTURED_STRATEGY=auto

# OCR languages (comma-separated ISO 639-3 codes)
# Common: eng=English, deu=German, fra=French, spa=Spanish
UNSTRUCTURED_LANGUAGES=eng,deu

# ============================================
# Tesseract Processor (Local OCR)
# ============================================
# Enable Tesseract processor (requires tesseract binary installed)
# This is a local, lightweight OCR solution for images only
ENABLE_TESSERACT=false

# Path to tesseract executable (optional, auto-detected if in PATH)
#TESSERACT_CMD=/usr/bin/tesseract

# OCR language (e.g., eng, deu, eng+deu for multiple)
TESSERACT_LANG=eng

# ============================================
# Custom Processor (Your own API)
# ============================================
# Enable custom document processor via HTTP API
ENABLE_CUSTOM_PROCESSOR=false

# Unique name for your processor
#CUSTOM_PROCESSOR_NAME=my_ocr

# Your custom processor API endpoint
#CUSTOM_PROCESSOR_URL=http://localhost:9000/process

# Optional API key for authentication
#CUSTOM_PROCESSOR_API_KEY=your-api-key-here

# Request timeout in seconds
#CUSTOM_PROCESSOR_TIMEOUT=60

# Comma-separated MIME types your processor supports
#CUSTOM_PROCESSOR_TYPES=application/pdf,image/jpeg,image/png
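As a rough illustration of how the variables in this file might be turned into a structured config, here is a sketch in Python. It is not the project's `get_document_processor_config()`: the function name `load_processor_config`, the helper `_flag`, the returned dictionary layout, and the set of strings accepted as "true" are all assumptions. The fallback values mirror the sample values above and are not confirmed defaults, apart from the 120-second timeout that the comments document.

```python
import os
from typing import Mapping


def _flag(env: Mapping[str, str], key: str, default: bool = False) -> bool:
    # Assumption: these spellings count as "true"; anything else is false.
    return env.get(key, str(default)).strip().lower() in {"1", "true", "yes", "on"}


def _csv(value: str) -> list:
    # Split a comma-separated value and drop empty entries.
    return [item.strip() for item in value.split(",") if item.strip()]


def load_processor_config(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical loader for the document-processing variables."""
    config = {
        "enabled": _flag(env, "ENABLE_DOCUMENT_PROCESSING"),
        "default_processor": env.get("DOCUMENT_PROCESSOR", "unstructured"),
        "processors": {},
    }
    if not config["enabled"]:
        # Master switch off: per-processor flags are ignored.
        return config

    if _flag(env, "ENABLE_UNSTRUCTURED"):
        config["processors"]["unstructured"] = {
            "api_url": env.get("UNSTRUCTURED_API_URL", "http://unstructured:8000"),
            "timeout": int(env.get("UNSTRUCTURED_TIMEOUT", "120")),
            "strategy": env.get("UNSTRUCTURED_STRATEGY", "auto"),
            "languages": _csv(env.get("UNSTRUCTURED_LANGUAGES", "eng")),
        }
    if _flag(env, "ENABLE_TESSERACT"):
        config["processors"]["tesseract"] = {
            "cmd": env.get("TESSERACT_CMD"),  # None means "look it up in PATH"
            "lang": env.get("TESSERACT_LANG", "eng"),
        }
    if _flag(env, "ENABLE_CUSTOM_PROCESSOR"):
        config["processors"]["custom"] = {
            "name": env.get("CUSTOM_PROCESSOR_NAME", "custom"),
            "url": env.get("CUSTOM_PROCESSOR_URL"),
            "api_key": env.get("CUSTOM_PROCESSOR_API_KEY"),
            "timeout": int(env.get("CUSTOM_PROCESSOR_TIMEOUT", "60")),
            "types": _csv(env.get("CUSTOM_PROCESSOR_TYPES", "")),
        }
    return config
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to unit test, for example `load_processor_config({"ENABLE_DOCUMENT_PROCESSING": "true", "ENABLE_TESSERACT": "true"})` yields a config whose only processor entry is `tesseract`.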