refactor: Transform document parsing into pluggable processor architecture

Refactors PR #190's hardcoded Unstructured.io integration into a flexible,
extensible plugin system supporting multiple text extraction engines.

- **`DocumentProcessor` ABC**: Abstract interface for all processors
- **`ProcessorRegistry`**: Central registry for discovery and routing
- **`ProcessingResult`**: Standardized output format across processors

- **`UnstructuredProcessor`**: Refactored from `UnstructuredClient`
- **`TesseractProcessor`**: Local OCR for images (lightweight alternative)
- **`CustomHTTPProcessor`**: Generic wrapper for custom HTTP APIs

- New `get_document_processor_config()` returns structured config
- Supports enabling/disabling individual processors
- Per-processor configuration via environment variables
- **Breaking Change**: `ENABLE_UNSTRUCTURED_PARSING` replaced with:
  - `ENABLE_DOCUMENT_PROCESSING=true/false` (master switch)
  - `ENABLE_UNSTRUCTURED=true/false` (per-processor)
  - `ENABLE_TESSERACT=true/false`
  - `ENABLE_CUSTOM_PROCESSOR=true/false`

- `parse_document()` now uses `ProcessorRegistry`
- Auto-selects appropriate processor based on MIME type
- Processor priority system (Unstructured=10, Tesseract=5, Custom=1)
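
The routing described above can be sketched as follows. This is a minimal, self-contained illustration and not the registry's actual code: the `PROCESSORS` table, its reduced MIME sets, and `select_processor` are hypothetical names; only the priority values (10, 5, 1) come from this commit.

```python
from typing import Optional

# Hypothetical stand-in data: (name, priority, supported MIME types).
# The MIME sets are shortened for illustration.
PROCESSORS = [
    ("custom", 1, {"application/pdf", "image/jpeg"}),
    ("tesseract", 5, {"image/jpeg", "image/png"}),
    ("unstructured", 10, {"application/pdf", "image/jpeg", "image/png"}),
]


def select_processor(content_type: str, processors=PROCESSORS) -> Optional[str]:
    """Return the name of the highest-priority processor supporting the type."""
    # Strip parameters such as "; charset=utf-8" before matching.
    base_type = content_type.split(";")[0].strip().lower()
    ordered = sorted(processors, key=lambda p: p[1], reverse=True)
    for name, _priority, types in ordered:
        if base_type in types:
            return name
    return None
```

With all three processors registered, a JPEG would go to `unstructured`; if that processor is disabled, the same lookup falls through to `tesseract`.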

- `initialize_document_processors()` registers processors at startup
- Integrated into both BasicAuth and OAuth lifespans
- Graceful degradation if processors fail to initialize
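
One way graceful degradation at startup could look is sketched below. The body of `initialize_document_processors()` is not shown in this message, so `initialize_processors`, the `factories` mapping, and `broken_factory` are assumptions made purely for illustration.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def initialize_processors(factories: dict[str, Callable[[], object]]) -> dict[str, object]:
    """Build each enabled processor, skipping any whose setup raises."""
    ready: dict[str, object] = {}
    for name, factory in factories.items():
        try:
            ready[name] = factory()
        except Exception as exc:  # deliberately broad: startup must not abort
            logger.warning("Skipping processor '%s': %s", name, exc)
    return ready


def broken_factory() -> object:
    # Simulates a missing dependency, e.g. no tesseract binary on PATH.
    raise RuntimeError("Tesseract not found in PATH")


# "object" is used here as a trivial factory that always succeeds.
ready = initialize_processors({"unstructured": object, "tesseract": broken_factory})
```

The server keeps starting with whichever processors came up; a failed one is logged and left out instead of raising.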

```env
ENABLE_DOCUMENT_PROCESSING=false

ENABLE_UNSTRUCTURED=false
UNSTRUCTURED_API_URL=http://unstructured:8000
UNSTRUCTURED_STRATEGY=auto  # auto|fast|hi_res
UNSTRUCTURED_LANGUAGES=eng,deu

ENABLE_TESSERACT=false
TESSERACT_LANG=eng

ENABLE_CUSTOM_PROCESSOR=false
CUSTOM_PROCESSOR_URL=http://localhost:9000/process
CUSTOM_PROCESSOR_TYPES=application/pdf,image/jpeg
```
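
As a rough sketch of how these variables might be turned into a structured config: the function below is not `get_document_processor_config()` itself, whose body is not shown here. The name `load_processor_config`, the returned dictionary shape, the fallback defaults, and the rule that the master switch overrides per-processor flags are all assumptions.

```python
from typing import Any, Mapping


def _flag(env: Mapping[str, str], key: str) -> bool:
    """Treat 'true' (case-insensitive) as enabled; anything else is off."""
    return env.get(key, "false").strip().lower() == "true"


def _csv(value: str) -> list[str]:
    """Split a comma-separated value, dropping empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_processor_config(env: Mapping[str, str]) -> dict[str, Any]:
    config: dict[str, Any] = {
        "enabled": _flag(env, "ENABLE_DOCUMENT_PROCESSING"),
        "processors": {},
    }
    if not config["enabled"]:
        return config  # assumed: master switch off ignores per-processor flags
    if _flag(env, "ENABLE_UNSTRUCTURED"):
        config["processors"]["unstructured"] = {
            "api_url": env.get("UNSTRUCTURED_API_URL", "http://unstructured:8000"),
            "strategy": env.get("UNSTRUCTURED_STRATEGY", "auto"),
            "languages": _csv(env.get("UNSTRUCTURED_LANGUAGES", "eng")),
        }
    if _flag(env, "ENABLE_TESSERACT"):
        config["processors"]["tesseract"] = {"lang": env.get("TESSERACT_LANG", "eng")}
    if _flag(env, "ENABLE_CUSTOM_PROCESSOR"):
        config["processors"]["custom"] = {
            "url": env.get("CUSTOM_PROCESSOR_URL", ""),
            "supported_types": set(_csv(env.get("CUSTOM_PROCESSOR_TYPES", ""))),
        }
    return config
```

Passing `os.environ` as `env` would read the live environment; taking a mapping keeps the function easy to unit test.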

- **Removed**: `tests/test_unstructured_config.py` (legacy tests)
- **Added**: `tests/unit/test_document_processor_config.py`
  - 7 unit tests for new config system
  - Tests individual and multi-processor configurations

- **Added**:
  - `nextcloud_mcp_server/document_processors/__init__.py`
  - `nextcloud_mcp_server/document_processors/base.py`
  - `nextcloud_mcp_server/document_processors/registry.py`
  - `nextcloud_mcp_server/document_processors/unstructured.py`
  - `nextcloud_mcp_server/document_processors/tesseract.py`
  - `nextcloud_mcp_server/document_processors/custom_http.py`
  - `tests/unit/test_document_processor_config.py`

- **Modified**:
  - `nextcloud_mcp_server/config.py` - New plugin config system
  - `nextcloud_mcp_server/app.py` - Processor initialization
  - `nextcloud_mcp_server/utils/document_parser.py` - Uses registry
  - `nextcloud_mcp_server/server/webdav.py` - Import updates
  - `env.sample` - New configuration format
  - `docker-compose.yml` - (profile changes from previous work)

- **Removed**:
  - `nextcloud_mcp_server/client/unstructured_client.py` - Replaced by UnstructuredProcessor
  - `tests/test_unstructured_config.py` - Replaced with new tests

- **Extensible**: Add processors without modifying core code
- **Testable**: Mock processors for unit tests
- **Configurable**: Enable only needed processors
- **Flexible**: Choose fast (Tesseract) vs accurate (Unstructured)
- **Opt-in**: Disabled by default, no mandatory dependencies
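
To illustrate the testability point, here is a minimal sketch of a mock processor. To stay self-contained it uses simplified stand-ins (`Result`, `Processor`) instead of the real `ProcessingResult` and `DocumentProcessor`; `MockProcessor` and its `calls` list are hypothetical and not part of this commit.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Result:  # simplified stand-in for ProcessingResult
    text: str
    processor: str
    metadata: dict[str, Any] = field(default_factory=dict)


class Processor(ABC):  # simplified stand-in for DocumentProcessor
    @property
    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    async def process(self, content: bytes, content_type: str) -> Result: ...


class MockProcessor(Processor):
    """Returns canned text and records calls; needs no network or OCR."""

    def __init__(self, canned_text: str) -> None:
        self.canned_text = canned_text
        self.calls: list[str] = []

    @property
    def name(self) -> str:
        return "mock"

    async def process(self, content: bytes, content_type: str) -> Result:
        self.calls.append(content_type)
        return Result(text=self.canned_text, processor=self.name)


mock = MockProcessor("hello")
result = asyncio.run(mock.process(b"%PDF-", "application/pdf"))
```

A test can then assert on `result.text` and `mock.calls` without starting an Unstructured container or installing Tesseract.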

Users upgrading from PR #190 need to update environment variables:
```bash
# Before (PR #190)
ENABLE_UNSTRUCTURED_PARSING=true

# After
ENABLE_DOCUMENT_PROCESSING=true
ENABLE_UNSTRUCTURED=true
```
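
For deployments that template their environment programmatically, the rename could be applied with a small helper like the one below. This is only a sketch: `migrate_env` is a hypothetical function that this commit does not provide, and it assumes the legacy value should be copied to both new variables unless they are already set.

```python
def migrate_env(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of env with the legacy flag replaced by the new pair."""
    migrated = dict(env)
    legacy = migrated.pop("ENABLE_UNSTRUCTURED_PARSING", None)
    if legacy is not None:
        # Keep any explicitly set new-style values; only fill in the gaps.
        migrated.setdefault("ENABLE_DOCUMENT_PROCESSING", legacy)
        migrated.setdefault("ENABLE_UNSTRUCTURED", legacy)
    return migrated
```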

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Chris Coutinho
2025-10-25 19:21:38 +02:00
parent a19017c686
commit 2147fc1696
20 changed files with 2027 additions and 529 deletions
@@ -0,0 +1,12 @@
"""Document processing plugins for extracting text from various file formats."""

from .base import DocumentProcessor, ProcessingResult, ProcessorError
from .registry import ProcessorRegistry, get_registry

__all__ = [
    "DocumentProcessor",
    "ProcessingResult",
    "ProcessorError",
    "ProcessorRegistry",
    "get_registry",
]
@@ -0,0 +1,117 @@
"""Abstract base class for document processing plugins."""

from abc import ABC, abstractmethod
from typing import Any, Optional

from pydantic import BaseModel


class ProcessingResult(BaseModel):
    """Standardized result from any document processor."""

    text: str
    """Extracted text content"""

    metadata: dict[str, Any]
    """Processor-specific metadata"""

    processor: str
    """Name of processor that handled this (e.g., 'unstructured', 'tesseract')"""

    success: bool = True
    """Whether processing succeeded"""

    error: Optional[str] = None
    """Error message if processing failed"""


class DocumentProcessor(ABC):
    """Abstract base class for document processing plugins.

    Document processors extract text from various file formats (PDF, DOCX, images, etc.).
    Each processor implements this interface and can be registered with the ProcessorRegistry.

    Example:
        class MyProcessor(DocumentProcessor):
            @property
            def name(self) -> str:
                return "my_processor"

            @property
            def supported_mime_types(self) -> set[str]:
                return {"application/pdf", "image/jpeg"}

            async def process(self, content: bytes, content_type: str, **kwargs) -> ProcessingResult:
                # Extract text from content
                return ProcessingResult(text="...", metadata={}, processor=self.name)

            async def health_check(self) -> bool:
                return True
    """

    @property
    @abstractmethod
    def name(self) -> str:
        """Unique identifier for this processor (e.g., 'unstructured', 'tesseract')."""
        pass

    @property
    @abstractmethod
    def supported_mime_types(self) -> set[str]:
        """Set of MIME types this processor can handle.

        Examples: {"application/pdf", "image/jpeg", "image/png"}
        """
        pass

    @abstractmethod
    async def process(
        self,
        content: bytes,
        content_type: str,
        filename: Optional[str] = None,
        options: Optional[dict[str, Any]] = None,
    ) -> ProcessingResult:
        """Process a document and extract text.

        Args:
            content: Document bytes
            content_type: MIME type of the document
            filename: Optional filename for format detection
            options: Processor-specific options (e.g., OCR language, strategy)

        Returns:
            ProcessingResult with extracted text and metadata

        Raises:
            ProcessorError: If processing fails
        """
        pass

    @abstractmethod
    async def health_check(self) -> bool:
        """Check if processor is available and healthy.

        Returns:
            True if processor is ready to use, False otherwise
        """
        pass

    def supports(self, content_type: str) -> bool:
        """Check if this processor supports the given MIME type.

        Args:
            content_type: MIME type (may include parameters like "application/pdf; charset=utf-8")

        Returns:
            True if this processor can handle the type
        """
        # Strip parameters from content type
        base_type = content_type.split(";")[0].strip().lower()
        return base_type in self.supported_mime_types


class ProcessorError(Exception):
    """Raised when document processing fails."""

    pass
@@ -0,0 +1,146 @@
"""Generic HTTP API processor wrapper for custom document processing services."""

import logging
from typing import Any, Optional

import httpx

from .base import DocumentProcessor, ProcessingResult, ProcessorError

logger = logging.getLogger(__name__)


class CustomHTTPProcessor(DocumentProcessor):
    """Generic HTTP API processor wrapper.

    Allows integration with any custom document processing API that follows
    a simple request/response pattern. This makes it easy to integrate your
    own text extraction services without writing a full processor.

    Expected API Contract:
        - POST request with file as multipart/form-data
        - Response: {"text": "extracted text", "metadata": {...}}

    Example:
        processor = CustomHTTPProcessor(
            name="my_ocr",
            api_url="https://my-ocr-service.com/process",
            api_key="secret",
            supported_types={"application/pdf", "image/jpeg"},
        )
        result = await processor.process(pdf_bytes, "application/pdf")
    """

    def __init__(
        self,
        api_url: str,
        api_key: Optional[str] = None,
        timeout: int = 60,
        supported_types: Optional[set[str]] = None,
        name: str = "custom",
    ):
        """Initialize custom HTTP processor.

        Args:
            api_url: Your API endpoint (should accept POST with multipart/form-data)
            api_key: Optional API key for authentication (sent as Bearer token)
            timeout: Request timeout in seconds (default: 60)
            supported_types: MIME types your API supports
            name: Unique name for this processor (default: "custom")
        """
        self.api_url = api_url
        self.api_key = api_key
        self.timeout = timeout
        self._name = name
        self._supported_types = supported_types or set()

        logger.info(f"Initialized CustomHTTPProcessor: {name} -> {api_url}")

    @property
    def name(self) -> str:
        return self._name

    @property
    def supported_mime_types(self) -> set[str]:
        return self._supported_types

    async def process(
        self,
        content: bytes,
        content_type: str,
        filename: Optional[str] = None,
        options: Optional[dict[str, Any]] = None,
    ) -> ProcessingResult:
        """Process via custom HTTP API.

        Args:
            content: Document bytes
            content_type: MIME type
            filename: Optional filename
            options: Custom options (passed as form data to API)

        Returns:
            ProcessingResult with extracted text and metadata

        Raises:
            ProcessorError: If API call fails
        """
        options = options or {}

        # Prepare request
        files = {"file": (filename or "document", content, content_type)}
        headers = {}
        if self.api_key:
            headers["Authorization"] = f"Bearer {self.api_key}"

        try:
            async with httpx.AsyncClient(timeout=self.timeout) as client:
                response = await client.post(
                    self.api_url,
                    files=files,
                    headers=headers,
                    data=options,  # Pass options as form data
                )
                response.raise_for_status()

                # Parse response
                result = response.json()
                text = result.get("text", "")
                metadata = result.get("metadata", {})

                logger.debug(
                    f"Custom processor '{self.name}' extracted {len(text)} characters"
                )

                return ProcessingResult(
                    text=text,
                    metadata=metadata,
                    processor=self.name,
                    success=True,
                )

        except httpx.HTTPError as e:
            logger.error(f"Custom processor '{self.name}' HTTP error: {e}")
            raise ProcessorError(f"API call failed: {str(e)}") from e
        except Exception as e:
            logger.error(f"Custom processor '{self.name}' failed: {e}")
            raise ProcessorError(f"Processing failed: {str(e)}") from e

    async def health_check(self) -> bool:
        """Check if custom API is available.

        Returns:
            True if API responds with status < 500
        """
        try:
            async with httpx.AsyncClient(timeout=5) as client:
                # Try GET request to check availability
                response = await client.get(
                    self.api_url,
                    headers={"User-Agent": "nextcloud-mcp-server"},
                )
                return response.status_code < 500
        except Exception as e:
            logger.warning(f"Custom processor '{self.name}' health check failed: {e}")
            return False
@@ -0,0 +1,164 @@
"""Central registry for document processors."""

import logging
from typing import Any, Optional

from .base import DocumentProcessor, ProcessingResult, ProcessorError

logger = logging.getLogger(__name__)


class ProcessorRegistry:
    """Central registry for document processors.

    Manages registration and routing of document processing requests to
    appropriate processors based on MIME types and priorities.

    Example:
        registry = ProcessorRegistry()
        registry.register(UnstructuredProcessor(...), priority=10)
        registry.register(TesseractProcessor(...), priority=5)

        # Auto-select processor based on MIME type
        result = await registry.process(pdf_bytes, "application/pdf")

        # Force specific processor
        result = await registry.process(img_bytes, "image/png", processor_name="tesseract")
    """

    def __init__(self):
        self._processors: dict[str, tuple[DocumentProcessor, int]] = {}
        self._priority_order: list[str] = []

    def register(self, processor: DocumentProcessor, priority: int = 0):
        """Register a document processor.

        Args:
            processor: Processor instance to register
            priority: Higher priority processors are tried first (default: 0)
        """
        name = processor.name
        if name in self._processors:
            logger.warning(f"Processor '{name}' already registered, replacing")

        self._processors[name] = (processor, priority)

        # Update priority order
        if name in self._priority_order:
            self._priority_order.remove(name)

        # Insert in priority order (higher priority first)
        inserted = False
        for i, existing_name in enumerate(self._priority_order):
            existing_priority = self._processors[existing_name][1]
            if priority > existing_priority:
                self._priority_order.insert(i, name)
                inserted = True
                break

        if not inserted:
            self._priority_order.append(name)

        logger.info(
            f"Registered processor: {name} "
            f"(priority={priority}, supports={len(processor.supported_mime_types)} types)"
        )

    def get_processor(self, name: str) -> Optional[DocumentProcessor]:
        """Get a processor by name.

        Args:
            name: Processor name

        Returns:
            DocumentProcessor instance or None if not found
        """
        if name in self._processors:
            return self._processors[name][0]
        return None

    def find_processor(self, content_type: str) -> Optional[DocumentProcessor]:
        """Find the first processor that supports the given MIME type.

        Processors are checked in priority order (highest priority first).

        Args:
            content_type: MIME type to match

        Returns:
            First matching processor or None
        """
        for name in self._priority_order:
            processor = self._processors[name][0]
            if processor.supports(content_type):
                logger.debug(f"Found processor '{name}' for type '{content_type}'")
                return processor

        logger.debug(f"No processor found for type '{content_type}'")
        return None

    def list_processors(self) -> list[str]:
        """List all registered processor names in priority order.

        Returns:
            List of processor names (highest priority first)
        """
        return list(self._priority_order)

    async def process(
        self,
        content: bytes,
        content_type: str,
        filename: Optional[str] = None,
        processor_name: Optional[str] = None,
        options: Optional[dict[str, Any]] = None,
    ) -> ProcessingResult:
        """Process a document using available processors.

        Args:
            content: Document bytes
            content_type: MIME type
            filename: Optional filename for format detection
            processor_name: Force specific processor (or None for auto-select)
            options: Processing options passed to processor

        Returns:
            ProcessingResult with extracted text and metadata

        Raises:
            ProcessorError: If no processor found or processing fails
        """
        # Find processor
        if processor_name:
            processor = self.get_processor(processor_name)
            if not processor:
                raise ProcessorError(
                    f"Processor '{processor_name}' not found. "
                    f"Available: {', '.join(self.list_processors())}"
                )
        else:
            processor = self.find_processor(content_type)
            if not processor:
                raise ProcessorError(
                    f"No processor found for type: {content_type}. "
                    f"Registered processors: {', '.join(self.list_processors())}"
                )

        logger.info(f"Processing with '{processor.name}' processor")

        # Process
        return await processor.process(content, content_type, filename, options)


# Global registry instance
_registry = ProcessorRegistry()


def get_registry() -> ProcessorRegistry:
    """Get the global processor registry.

    Returns:
        Singleton ProcessorRegistry instance
    """
    return _registry
@@ -0,0 +1,161 @@
"""Document processor using Tesseract OCR (local)."""

import logging
import shutil
from typing import Any, Optional

from .base import DocumentProcessor, ProcessingResult, ProcessorError

logger = logging.getLogger(__name__)

try:
    import io

    import pytesseract
    from PIL import Image

    TESSERACT_AVAILABLE = True
except ImportError:
    TESSERACT_AVAILABLE = False


class TesseractProcessor(DocumentProcessor):
    """Document processor using Tesseract OCR (local).

    This processor runs OCR locally using the Tesseract engine, which is
    faster and more lightweight than cloud-based solutions but requires
    Tesseract to be installed on the system.

    Requirements:
        - tesseract binary installed (e.g., apt install tesseract-ocr)
        - Python packages: pip install pytesseract pillow

    Example:
        processor = TesseractProcessor(default_lang="eng+deu")
        result = await processor.process(image_bytes, "image/jpeg")
    """

    SUPPORTED_TYPES = {
        "image/jpeg",
        "image/png",
        "image/tiff",
        "image/bmp",
        "image/gif",
    }

    def __init__(
        self,
        tesseract_cmd: Optional[str] = None,
        default_lang: str = "eng",
    ):
        """Initialize Tesseract processor.

        Args:
            tesseract_cmd: Path to tesseract executable (None = auto-detect)
            default_lang: Default OCR language (e.g., "eng", "deu", "eng+deu")

        Raises:
            ProcessorError: If Tesseract or required packages not available
        """
        if not TESSERACT_AVAILABLE:
            raise ProcessorError(
                "Tesseract processor requires: pip install pytesseract pillow"
            )

        if tesseract_cmd:
            pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
        elif not shutil.which("tesseract"):
            raise ProcessorError(
                "Tesseract not found in PATH. Install with: apt install tesseract-ocr"
            )

        self.default_lang = default_lang
        logger.info(f"Initialized TesseractProcessor: lang={default_lang}")

    @property
    def name(self) -> str:
        return "tesseract"

    @property
    def supported_mime_types(self) -> set[str]:
        return self.SUPPORTED_TYPES

    async def process(
        self,
        content: bytes,
        content_type: str,
        filename: Optional[str] = None,
        options: Optional[dict[str, Any]] = None,
    ) -> ProcessingResult:
        """Process image via Tesseract OCR.

        Args:
            content: Image bytes
            content_type: Image MIME type
            filename: Optional filename
            options: Processing options:
                - lang: OCR language(s) (default: from init)
                - config: Tesseract config string

        Returns:
            ProcessingResult with extracted text and metadata

        Raises:
            ProcessorError: If OCR fails
        """
        options = options or {}
        lang = options.get("lang", self.default_lang)
        config = options.get("config", "")

        try:
            # Load image
            image = Image.open(io.BytesIO(content))

            # Run OCR
            text = pytesseract.image_to_string(image, lang=lang, config=config)

            # Get additional data for confidence scores, using the same
            # config so the scores describe the same OCR run as the text
            data = pytesseract.image_to_data(
                image, lang=lang, config=config, output_type=pytesseract.Output.DICT
            )

            # Calculate average confidence. Values may arrive as ints, floats,
            # or strings depending on the pytesseract/Tesseract version, and
            # -1 marks boxes that are not words, so coerce before filtering.
            confidences = []
            for raw_conf in data["conf"]:
                try:
                    conf = float(raw_conf)
                except (TypeError, ValueError):
                    continue
                if conf >= 0:
                    confidences.append(conf)
            avg_confidence = sum(confidences) / len(confidences) if confidences else 0.0

            metadata = {
                "text_length": len(text),
                "language": lang,
                "image_size": image.size,
                "image_mode": image.mode,
                "confidence": round(avg_confidence, 2),
                "words_detected": len(confidences),
            }

            logger.debug(
                f"Tesseract OCR completed: {len(text)} chars, "
                f"confidence={avg_confidence:.1f}%"
            )

            return ProcessingResult(
                text=text.strip(),
                metadata=metadata,
                processor=self.name,
                success=True,
            )

        except Exception as e:
            logger.error(f"Tesseract processing failed: {e}")
            raise ProcessorError(f"OCR failed: {str(e)}") from e

    async def health_check(self) -> bool:
        """Check if Tesseract is available.

        Returns:
            True if Tesseract is installed and working
        """
        try:
            pytesseract.get_tesseract_version()
            return True
        except Exception:
            return False
@@ -0,0 +1,193 @@
"""Document processor using Unstructured.io API."""

import io
import logging
from typing import Any, Optional

import httpx

from .base import DocumentProcessor, ProcessingResult, ProcessorError

logger = logging.getLogger(__name__)


class UnstructuredProcessor(DocumentProcessor):
    """Document processor using Unstructured.io API.

    The Unstructured API provides document parsing capabilities for various formats
    including PDF, DOCX, images with OCR, and more.

    API Documentation: https://docs.unstructured.io/api-reference/api-services/api-parameters
    """

    # Supported MIME types for Unstructured
    SUPPORTED_TYPES = {
        "application/pdf",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "application/msword",
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
        "application/vnd.ms-powerpoint",
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "application/vnd.ms-excel",
        "application/rtf",
        "text/rtf",
        "application/vnd.oasis.opendocument.text",
        "application/epub+zip",
        "message/rfc822",
        "application/vnd.ms-outlook",
        "image/jpeg",
        "image/png",
        "image/tiff",
        "image/bmp",
    }

    def __init__(
        self,
        api_url: str,
        timeout: int = 120,
        default_strategy: str = "auto",
        default_languages: Optional[list[str]] = None,
    ):
        """Initialize Unstructured processor.

        Args:
            api_url: Unstructured API endpoint
            timeout: Request timeout in seconds (default: 120)
            default_strategy: Default parsing strategy - "auto", "fast", or "hi_res"
            default_languages: Default OCR language codes (e.g., ["eng", "deu"])
        """
        self.api_url = api_url
        self.timeout = timeout
        self.default_strategy = default_strategy
        self.default_languages = default_languages or ["eng"]

        logger.info(
            f"Initialized UnstructuredProcessor: {api_url}, "
            f"strategy={default_strategy}, languages={self.default_languages}"
        )

    @property
    def name(self) -> str:
        return "unstructured"

    @property
    def supported_mime_types(self) -> set[str]:
        return self.SUPPORTED_TYPES

    async def process(
        self,
        content: bytes,
        content_type: str,
        filename: Optional[str] = None,
        options: Optional[dict[str, Any]] = None,
    ) -> ProcessingResult:
        """Process document via Unstructured API.

        Args:
            content: Document bytes
            content_type: MIME type
            filename: Optional filename for format detection
            options: Processing options:
                - strategy: "auto", "fast", or "hi_res" (default: from init)
                - languages: List of language codes (default: from init)
                - extract_image_block_types: Types of image elements to extract

        Returns:
            ProcessingResult with extracted text and metadata

        Raises:
            ProcessorError: If processing fails
        """
        options = options or {}

        # Extract options with defaults
        strategy = options.get("strategy", self.default_strategy)
        languages = options.get("languages", self.default_languages)
        extract_image_block_types = options.get("extract_image_block_types")

        # Prepare multipart request
        files = {
            "files": (
                filename or "document",
                io.BytesIO(content),
                content_type or "application/octet-stream",
            )
        }

        data = {
            "strategy": strategy,
            "languages": ",".join(languages),
        }

        if extract_image_block_types:
            data["extract_image_block_types"] = ",".join(extract_image_block_types)

        logger.debug(
            f"Processing with Unstructured API: strategy={strategy}, languages={languages}"
        )

        try:
            async with httpx.AsyncClient(timeout=self.timeout) as client:
                response = await client.post(
                    f"{self.api_url}/general/v0/general",
                    files=files,
                    data=data,
                )
                response.raise_for_status()

                # Parse response
                elements = response.json()

                # Extract text and metadata
                texts = []
                element_types: dict[str, int] = {}

                for element in elements:
                    if "text" in element and element["text"]:
                        texts.append(element["text"])

                    el_type = element.get("type", "unknown")
                    element_types[el_type] = element_types.get(el_type, 0) + 1

                parsed_text = "\n\n".join(texts)

                metadata = {
                    "element_count": len(elements),
                    "text_length": len(parsed_text),
                    "element_types": element_types,
                    "strategy": strategy,
                    "languages": languages,
                }

                logger.debug(
                    f"Successfully processed: {len(elements)} elements, "
                    f"{len(parsed_text)} characters"
                )

                return ProcessingResult(
                    text=parsed_text,
                    metadata=metadata,
                    processor=self.name,
                    success=True,
                )

        except httpx.HTTPError as e:
            logger.error(f"Unstructured API HTTP error: {e}")
            raise ProcessorError(f"HTTP error: {str(e)}") from e
        except Exception as e:
            logger.error(f"Unstructured API processing failed: {e}")
            raise ProcessorError(f"Processing failed: {str(e)}") from e

    async def health_check(self) -> bool:
        """Check if Unstructured API is available.

        Returns:
            True if API is healthy, False otherwise
        """
        try:
            async with httpx.AsyncClient(timeout=5) as client:
                response = await client.get(f"{self.api_url}/healthcheck")
                return response.status_code == 200
        except Exception as e:
            logger.warning(f"Unstructured health check failed: {e}")
            return False