Files
nextcloud-mcp-server/nextcloud_mcp_server/document_processors/base.py
T
Chris Coutinho 2147fc1696 refactor: Transform document parsing into pluggable processor architecture
Refactors PR #190's hardcoded Unstructured.io integration into a flexible,
extensible plugin system supporting multiple text extraction engines.

- **`DocumentProcessor` ABC**: Abstract interface for all processors
- **`ProcessorRegistry`**: Central registry for discovery and routing
- **`ProcessingResult`**: Standardized output format across processors

- **`UnstructuredProcessor`**: Refactored from `UnstructuredClient`
- **`TesseractProcessor`**: Local OCR for images (lightweight alternative)
- **`CustomHTTPProcessor`**: Generic wrapper for custom HTTP APIs

- New `get_document_processor_config()` returns structured config
- Supports enabling/disabling individual processors
- Per-processor configuration via environment variables
- **Breaking Change**: `ENABLE_UNSTRUCTURED_PARSING` replaced with:
  - `ENABLE_DOCUMENT_PROCESSING=true/false` (master switch)
  - `ENABLE_UNSTRUCTURED=true/false` (per-processor)
  - `ENABLE_TESSERACT=true/false`
  - `ENABLE_CUSTOM_PROCESSOR=true/false`

- `parse_document()` now uses `ProcessorRegistry`
- Auto-selects appropriate processor based on MIME type
- Processor priority system (Unstructured=10, Tesseract=5, Custom=1)

- `initialize_document_processors()` registers processors at startup
- Integrated into both BasicAuth and OAuth lifespans
- Graceful degradation if processors fail to initialize

```env
ENABLE_DOCUMENT_PROCESSING=false

ENABLE_UNSTRUCTURED=false
UNSTRUCTURED_API_URL=http://unstructured:8000
UNSTRUCTURED_STRATEGY=auto  # auto|fast|hi_res
UNSTRUCTURED_LANGUAGES=eng,deu

ENABLE_TESSERACT=false
TESSERACT_LANG=eng

ENABLE_CUSTOM_PROCESSOR=false
CUSTOM_PROCESSOR_URL=http://localhost:9000/process
CUSTOM_PROCESSOR_TYPES=application/pdf,image/jpeg
```

- **Removed**: `tests/test_unstructured_config.py` (legacy tests)
- **Added**: `tests/unit/test_document_processor_config.py`
  - 7 unit tests for new config system
  - Tests individual and multi-processor configurations

- **Added**:
  - `nextcloud_mcp_server/document_processors/__init__.py`
  - `nextcloud_mcp_server/document_processors/base.py`
  - `nextcloud_mcp_server/document_processors/registry.py`
  - `nextcloud_mcp_server/document_processors/unstructured.py`
  - `nextcloud_mcp_server/document_processors/tesseract.py`
  - `nextcloud_mcp_server/document_processors/custom_http.py`
  - `tests/unit/test_document_processor_config.py`

- **Modified**:
  - `nextcloud_mcp_server/config.py` - New plugin config system
  - `nextcloud_mcp_server/app.py` - Processor initialization
  - `nextcloud_mcp_server/utils/document_parser.py` - Uses registry
  - `nextcloud_mcp_server/server/webdav.py` - Import updates
  - `env.sample` - New configuration format
  - `docker-compose.yml` - (profile changes from previous work)

- **Removed**:
  - `nextcloud_mcp_server/client/unstructured_client.py` - Replaced by UnstructuredProcessor
  - `tests/test_unstructured_config.py` - Replaced with new tests

 **Extensible**: Add processors without modifying core code
 **Testable**: Mock processors for unit tests
 **Configurable**: Enable only needed processors
 **Flexible**: Choose fast (Tesseract) vs accurate (Unstructured)
 **Opt-in**: Disabled by default, no mandatory dependencies

Users upgrading from PR #190 need to update environment variables:
```bash
ENABLE_UNSTRUCTURED_PARSING=true

ENABLE_DOCUMENT_PROCESSING=true
ENABLE_UNSTRUCTURED=true
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-25 19:28:35 +02:00

118 lines
3.3 KiB
Python

"""Abstract base class for document processing plugins."""
from abc import ABC, abstractmethod
from typing import Any, Optional
from pydantic import BaseModel
class ProcessingResult(BaseModel):
"""Standardized result from any document processor."""
text: str
"""Extracted text content"""
metadata: dict[str, Any]
"""Processor-specific metadata"""
processor: str
"""Name of processor that handled this (e.g., 'unstructured', 'tesseract')"""
success: bool = True
"""Whether processing succeeded"""
error: Optional[str] = None
"""Error message if processing failed"""
class DocumentProcessor(ABC):
"""Abstract base class for document processing plugins.
Document processors extract text from various file formats (PDF, DOCX, images, etc.).
Each processor implements this interface and can be registered with the ProcessorRegistry.
Example:
class MyProcessor(DocumentProcessor):
@property
def name(self) -> str:
return "my_processor"
@property
def supported_mime_types(self) -> set[str]:
return {"application/pdf", "image/jpeg"}
async def process(self, content: bytes, content_type: str, **kwargs) -> ProcessingResult:
# Extract text from content
return ProcessingResult(text="...", metadata={}, processor=self.name)
async def health_check(self) -> bool:
return True
"""
@property
@abstractmethod
def name(self) -> str:
"""Unique identifier for this processor (e.g., 'unstructured', 'tesseract')."""
pass
@property
@abstractmethod
def supported_mime_types(self) -> set[str]:
"""Set of MIME types this processor can handle.
Examples: {"application/pdf", "image/jpeg", "image/png"}
"""
pass
@abstractmethod
async def process(
self,
content: bytes,
content_type: str,
filename: Optional[str] = None,
options: Optional[dict[str, Any]] = None,
) -> ProcessingResult:
"""Process a document and extract text.
Args:
content: Document bytes
content_type: MIME type of the document
filename: Optional filename for format detection
options: Processor-specific options (e.g., OCR language, strategy)
Returns:
ProcessingResult with extracted text and metadata
Raises:
ProcessorError: If processing fails
"""
pass
@abstractmethod
async def health_check(self) -> bool:
"""Check if processor is available and healthy.
Returns:
True if processor is ready to use, False otherwise
"""
pass
def supports(self, content_type: str) -> bool:
"""Check if this processor supports the given MIME type.
Args:
content_type: MIME type (may include parameters like "application/pdf; charset=utf-8")
Returns:
True if this processor can handle the type
"""
# Strip parameters from content type
base_type = content_type.split(";")[0].strip().lower()
return base_type in self.supported_mime_types
class ProcessorError(Exception):
"""Raised when document processing fails."""
pass