refactor: Transform document parsing into pluggable processor architecture

Refactors PR #190's hardcoded Unstructured.io integration into a flexible,
extensible plugin system supporting multiple text extraction engines.

- **`DocumentProcessor` ABC**: Abstract interface for all processors
- **`ProcessorRegistry`**: Central registry for discovery and routing
- **`ProcessingResult`**: Standardized output format across processors

- **`UnstructuredProcessor`**: Refactored from `UnstructuredClient`
- **`TesseractProcessor`**: Local OCR for images (lightweight alternative)
- **`CustomHTTPProcessor`**: Generic wrapper for custom HTTP APIs

- New `get_document_processor_config()` returns structured config
- Supports enabling/disabling individual processors
- Per-processor configuration via environment variables
- **Breaking Change**: `ENABLE_UNSTRUCTURED_PARSING` replaced with:
  - `ENABLE_DOCUMENT_PROCESSING=true/false` (master switch)
  - `ENABLE_UNSTRUCTURED=true/false` (per-processor)
  - `ENABLE_TESSERACT=true/false`
  - `ENABLE_CUSTOM_PROCESSOR=true/false`

- `parse_document()` now uses `ProcessorRegistry`
- Auto-selects appropriate processor based on MIME type
- Processor priority system (Unstructured=10, Tesseract=5, Custom=1)

- `initialize_document_processors()` registers processors at startup
- Integrated into both BasicAuth and OAuth lifespans
- Graceful degradation if processors fail to initialize

```env
ENABLE_DOCUMENT_PROCESSING=false

ENABLE_UNSTRUCTURED=false
UNSTRUCTURED_API_URL=http://unstructured:8000
UNSTRUCTURED_STRATEGY=auto  # auto|fast|hi_res
UNSTRUCTURED_LANGUAGES=eng,deu

ENABLE_TESSERACT=false
TESSERACT_LANG=eng

ENABLE_CUSTOM_PROCESSOR=false
CUSTOM_PROCESSOR_URL=http://localhost:9000/process
CUSTOM_PROCESSOR_TYPES=application/pdf,image/jpeg
```

- **Removed**: `tests/test_unstructured_config.py` (legacy tests)
- **Added**: `tests/unit/test_document_processor_config.py`
  - 7 unit tests for new config system
  - Tests individual and multi-processor configurations

- **Added**:
  - `nextcloud_mcp_server/document_processors/__init__.py`
  - `nextcloud_mcp_server/document_processors/base.py`
  - `nextcloud_mcp_server/document_processors/registry.py`
  - `nextcloud_mcp_server/document_processors/unstructured.py`
  - `nextcloud_mcp_server/document_processors/tesseract.py`
  - `nextcloud_mcp_server/document_processors/custom_http.py`
  - `tests/unit/test_document_processor_config.py`

- **Modified**:
  - `nextcloud_mcp_server/config.py` - New plugin config system
  - `nextcloud_mcp_server/app.py` - Processor initialization
  - `nextcloud_mcp_server/utils/document_parser.py` - Uses registry
  - `nextcloud_mcp_server/server/webdav.py` - Import updates
  - `env.sample` - New configuration format
  - `docker-compose.yml` - (profile changes from previous work)

- **Removed**:
  - `nextcloud_mcp_server/client/unstructured_client.py` - Replaced by UnstructuredProcessor
  - `tests/test_unstructured_config.py` - Replaced with new tests

 **Extensible**: Add processors without modifying core code
 **Testable**: Mock processors for unit tests
 **Configurable**: Enable only needed processors
 **Flexible**: Choose fast (Tesseract) vs accurate (Unstructured)
 **Opt-in**: Disabled by default, no mandatory dependencies

Users upgrading from PR #190 need to update environment variables:
```bash
ENABLE_UNSTRUCTURED_PARSING=true

ENABLE_DOCUMENT_PROCESSING=true
ENABLE_UNSTRUCTURED=true
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
Chris Coutinho
2025-10-25 19:21:38 +02:00
parent a19017c686
commit 2147fc1696
20 changed files with 2027 additions and 529 deletions
+97 -1
View File
@@ -23,8 +23,13 @@ from nextcloud_mcp_server.auth import (
is_jwt_token,
)
from nextcloud_mcp_server.client import NextcloudClient
from nextcloud_mcp_server.config import LOGGING_CONFIG, setup_logging
from nextcloud_mcp_server.config import (
LOGGING_CONFIG,
get_document_processor_config,
setup_logging,
)
from nextcloud_mcp_server.context import get_client as get_nextcloud_client
from nextcloud_mcp_server.document_processors import get_registry
from nextcloud_mcp_server.server import (
configure_calendar_tools,
configure_contacts_tools,
@@ -39,6 +44,91 @@ from nextcloud_mcp_server.server import (
logger = logging.getLogger(__name__)
def initialize_document_processors():
"""Initialize and register document processors based on configuration.
This function reads the environment configuration and registers available
processors (Unstructured, Tesseract, Custom HTTP) with the global registry.
"""
config = get_document_processor_config()
if not config["enabled"]:
logger.info("Document processing disabled")
return
registry = get_registry()
registered_count = 0
# Register Unstructured processor
if "unstructured" in config["processors"]:
unst_config = config["processors"]["unstructured"]
try:
from nextcloud_mcp_server.document_processors.unstructured import (
UnstructuredProcessor,
)
processor = UnstructuredProcessor(
api_url=unst_config["api_url"],
timeout=unst_config["timeout"],
default_strategy=unst_config["strategy"],
default_languages=unst_config["languages"],
)
registry.register(processor, priority=10)
logger.info(f"Registered Unstructured processor: {unst_config['api_url']}")
registered_count += 1
except Exception as e:
logger.warning(f"Failed to register Unstructured processor: {e}")
# Register Tesseract processor
if "tesseract" in config["processors"]:
tess_config = config["processors"]["tesseract"]
try:
from nextcloud_mcp_server.document_processors.tesseract import (
TesseractProcessor,
)
processor = TesseractProcessor(
tesseract_cmd=tess_config.get("tesseract_cmd"),
default_lang=tess_config["lang"],
)
registry.register(processor, priority=5)
logger.info(f"Registered Tesseract processor: lang={tess_config['lang']}")
registered_count += 1
except Exception as e:
logger.warning(f"Failed to register Tesseract processor: {e}")
# Register custom processor
if "custom" in config["processors"]:
custom_config = config["processors"]["custom"]
try:
from nextcloud_mcp_server.document_processors.custom_http import (
CustomHTTPProcessor,
)
processor = CustomHTTPProcessor(
name=custom_config["name"],
api_url=custom_config["api_url"],
api_key=custom_config.get("api_key"),
timeout=custom_config["timeout"],
supported_types=custom_config["supported_types"],
)
registry.register(processor, priority=1)
logger.info(
f"Registered Custom processor '{custom_config['name']}': {custom_config['api_url']}"
)
registered_count += 1
except Exception as e:
logger.warning(f"Failed to register Custom processor: {e}")
if registered_count > 0:
logger.info(
f"Document processing initialized with {registered_count} processor(s): "
f"{', '.join(registry.list_processors())}"
)
else:
logger.warning("Document processing enabled but no processors registered")
def validate_pkce_support(discovery: dict, discovery_url: str) -> None:
"""
Validate that the OIDC provider properly advertises PKCE support.
@@ -257,6 +347,9 @@ async def app_lifespan_basic(server: FastMCP) -> AsyncIterator[AppContext]:
client = NextcloudClient.from_env()
logger.info("Client initialization complete")
# Initialize document processors
initialize_document_processors()
try:
yield AppContext(client=client)
finally:
@@ -317,6 +410,9 @@ async def app_lifespan_oauth(server: FastMCP) -> AsyncIterator[OAuthAppContext]:
logger.info("OAuth initialization complete")
# Initialize document processors
initialize_document_processors()
try:
yield OAuthAppContext(
nextcloud_host=nextcloud_host, token_verifier=token_verifier
@@ -1,170 +0,0 @@
"""HTTP client for Unstructured API."""
import io
import logging
from typing import Optional, Tuple
import httpx
from nextcloud_mcp_server.config import (
get_unstructured_api_url,
get_unstructured_languages,
get_unstructured_strategy,
)
logger = logging.getLogger(__name__)
class UnstructuredClient:
"""Client for interacting with the Unstructured API.
The Unstructured API provides document parsing capabilities for various formats
including PDF, DOCX, images with OCR, and more.
API Documentation: https://docs.unstructured.io/api-reference/api-services/api-parameters
"""
def __init__(self, api_url: Optional[str] = None, timeout: int = 120):
"""Initialize the Unstructured API client.
Args:
api_url: Base URL of the Unstructured API. If None, will use config.
timeout: Request timeout in seconds (default: 120 for large documents)
"""
self.api_url = api_url or get_unstructured_api_url()
self.timeout = timeout
if not self.api_url:
raise ValueError(
"Unstructured API URL not configured. "
"Set ENABLE_UNSTRUCTURED_PARSING=true and UNSTRUCTURED_API_URL in environment."
)
logger.info(f"Initialized UnstructuredClient with API URL: {self.api_url}")
async def partition_document(
self,
content: bytes,
filename: str,
content_type: Optional[str] = None,
strategy: Optional[str] = None,
languages: Optional[list[str]] = None,
extract_image_block_types: Optional[list[str]] = None,
) -> Tuple[str, dict]:
"""Parse a document using the Unstructured API.
Args:
content: The document content as bytes
filename: The filename (used for format detection)
content_type: Optional MIME type
strategy: Parsing strategy - "auto", "fast", or "hi_res" (OCR-based).
If None, uses the value from UNSTRUCTURED_STRATEGY env var.
languages: List of language codes for OCR (e.g., ["eng", "deu"]).
If None, uses the value from UNSTRUCTURED_LANGUAGES env var.
extract_image_block_types: Types of elements to extract from images
Returns:
Tuple of (parsed_text, metadata) where:
- parsed_text: The extracted text content
- metadata: Additional metadata about the parsing
Raises:
httpx.HTTPError: If the API request fails
Exception: If parsing fails
"""
# Use environment configuration as defaults
if strategy is None:
strategy = get_unstructured_strategy()
if languages is None:
languages = get_unstructured_languages()
# Prepare the multipart form data
files = {
"files": (
filename,
io.BytesIO(content),
content_type or "application/octet-stream",
)
}
# Prepare the request data
data = {
"strategy": strategy,
"languages": ",".join(languages),
}
if extract_image_block_types:
data["extract_image_block_types"] = ",".join(extract_image_block_types)
logger.debug(
f"Partitioning document '{filename}' with strategy '{strategy}', "
f"languages: {languages}"
)
try:
async with httpx.AsyncClient(timeout=self.timeout) as client:
response = await client.post(
f"{self.api_url}/general/v0/general",
files=files,
data=data,
)
response.raise_for_status()
# Parse the response
elements = response.json()
# Extract text from elements
# Each element has a "text" field
texts = []
element_types = {}
for element in elements:
if "text" in element and element["text"]:
texts.append(element["text"])
# Track element types
el_type = element.get("type", "unknown")
element_types[el_type] = element_types.get(el_type, 0) + 1
parsed_text = "\n\n".join(texts)
# Collect metadata
metadata = {
"element_count": len(elements),
"text_length": len(parsed_text),
"element_types": element_types,
"strategy": strategy,
"languages": languages,
"parsing_method": "unstructured_api",
}
logger.debug(
f"Successfully parsed document: {len(elements)} elements, "
f"{len(parsed_text)} characters"
)
return parsed_text, metadata
except httpx.HTTPError as e:
logger.error(f"HTTP error calling Unstructured API: {e}")
raise Exception(
f"Failed to parse document via Unstructured API: {str(e)}"
) from e
except Exception as e:
logger.error(f"Unexpected error parsing document: {e}")
raise Exception(f"Failed to parse document: {str(e)}") from e
async def health_check(self) -> bool:
"""Check if the Unstructured API is available.
Returns:
True if the API is healthy, False otherwise.
"""
try:
async with httpx.AsyncClient(timeout=5) as client:
response = await client.get(f"{self.api_url}/healthcheck")
return response.status_code == 200
except Exception as e:
logger.warning(f"Unstructured API health check failed: {e}")
return False
+55 -76
View File
@@ -1,6 +1,6 @@
import logging.config
import os
from typing import Optional
from typing import Any
LOGGING_CONFIG = {
"version": 1,
@@ -55,86 +55,65 @@ def setup_logging():
logging.config.dictConfig(LOGGING_CONFIG)
# Document Parsing Configuration
def get_unstructured_api_url() -> Optional[str]:
"""Get the Unstructured API URL from environment variables.
# Document Processing Configuration
def get_document_processor_config() -> dict[str, Any]:
"""Get document processor configuration from environment.
Returns:
The Unstructured API URL if parsing is enabled, None otherwise.
Dict with processor configs:
{
"enabled": bool,
"default_processor": str,
"processors": {
"unstructured": {...},
"tesseract": {...},
"custom": {...},
}
}
"""
enabled = os.getenv("ENABLE_UNSTRUCTURED_PARSING", "true").lower() == "true"
if not enabled:
return None
config: dict[str, Any] = {
"enabled": os.getenv("ENABLE_DOCUMENT_PROCESSING", "false").lower() == "true",
"default_processor": os.getenv("DOCUMENT_PROCESSOR", "unstructured"),
"processors": {},
}
return os.getenv("UNSTRUCTURED_API_URL", "http://unstructured:8000")
# Unstructured configuration
if os.getenv("ENABLE_UNSTRUCTURED", "false").lower() == "true":
config["processors"]["unstructured"] = {
"api_url": os.getenv("UNSTRUCTURED_API_URL", "http://unstructured:8000"),
"timeout": int(os.getenv("UNSTRUCTURED_TIMEOUT", "120")),
"strategy": os.getenv("UNSTRUCTURED_STRATEGY", "auto"),
"languages": [
lang.strip()
for lang in os.getenv("UNSTRUCTURED_LANGUAGES", "eng,deu").split(",")
if lang.strip()
],
}
# Tesseract configuration
if os.getenv("ENABLE_TESSERACT", "false").lower() == "true":
config["processors"]["tesseract"] = {
"tesseract_cmd": os.getenv("TESSERACT_CMD"), # None = auto-detect
"lang": os.getenv("TESSERACT_LANG", "eng"),
}
def is_unstructured_parsing_enabled() -> bool:
"""Check if unstructured document parsing is enabled.
# Custom processor (via HTTP API)
if os.getenv("ENABLE_CUSTOM_PROCESSOR", "false").lower() == "true":
custom_url = os.getenv("CUSTOM_PROCESSOR_URL")
if custom_url:
supported_types_str = os.getenv("CUSTOM_PROCESSOR_TYPES", "application/pdf")
supported_types = {
t.strip() for t in supported_types_str.split(",") if t.strip()
}
Returns:
True if enabled, False otherwise.
"""
return os.getenv("ENABLE_UNSTRUCTURED_PARSING", "true").lower() == "true"
config["processors"]["custom"] = {
"name": os.getenv("CUSTOM_PROCESSOR_NAME", "custom"),
"api_url": custom_url,
"api_key": os.getenv("CUSTOM_PROCESSOR_API_KEY"),
"timeout": int(os.getenv("CUSTOM_PROCESSOR_TIMEOUT", "60")),
"supported_types": supported_types,
}
def get_unstructured_strategy() -> str:
"""Get the parsing strategy for the Unstructured API.
Valid values are:
- 'auto': Automatically choose the best strategy (default)
- 'fast': Fast parsing without OCR
- 'hi_res': High-resolution parsing with OCR for better accuracy
Returns:
The parsing strategy to use.
"""
strategy = os.getenv("UNSTRUCTURED_STRATEGY", "auto").lower()
valid_strategies = ["auto", "fast", "hi_res"]
if strategy not in valid_strategies:
logging.warning(
f"Invalid UNSTRUCTURED_STRATEGY '{strategy}'. Using 'hi_res'. "
f"Valid options: {', '.join(valid_strategies)}"
)
return "hi_res"
return strategy
def get_unstructured_languages() -> list[str]:
"""Get the OCR languages for the Unstructured API.
Languages should be specified as ISO 639-3 codes (e.g., 'eng', 'deu', 'fra').
Multiple languages can be specified separated by commas.
Default languages: English (eng) and German (deu)
Common language codes:
- eng: English
- deu: German
- fra: French
- spa: Spanish
- ita: Italian
- por: Portuguese
- rus: Russian
- ara: Arabic
- zho: Chinese
- jpn: Japanese
- kor: Korean
Returns:
List of language codes for OCR processing.
"""
languages_str = os.getenv("UNSTRUCTURED_LANGUAGES", "eng,deu")
# Split by comma and clean up whitespace
languages = [lang.strip() for lang in languages_str.split(",") if lang.strip()]
if not languages:
logging.warning(
"No languages specified in UNSTRUCTURED_LANGUAGES. Using default: eng,deu"
)
return ["eng", "deu"]
return languages
return config
@@ -0,0 +1,12 @@
"""Document processing plugins for extracting text from various file formats."""
from .base import DocumentProcessor, ProcessingResult, ProcessorError
from .registry import ProcessorRegistry, get_registry
__all__ = [
"DocumentProcessor",
"ProcessingResult",
"ProcessorError",
"ProcessorRegistry",
"get_registry",
]
@@ -0,0 +1,117 @@
"""Abstract base class for document processing plugins."""
from abc import ABC, abstractmethod
from typing import Any, Optional
from pydantic import BaseModel
class ProcessingResult(BaseModel):
"""Standardized result from any document processor."""
text: str
"""Extracted text content"""
metadata: dict[str, Any]
"""Processor-specific metadata"""
processor: str
"""Name of processor that handled this (e.g., 'unstructured', 'tesseract')"""
success: bool = True
"""Whether processing succeeded"""
error: Optional[str] = None
"""Error message if processing failed"""
class DocumentProcessor(ABC):
"""Abstract base class for document processing plugins.
Document processors extract text from various file formats (PDF, DOCX, images, etc.).
Each processor implements this interface and can be registered with the ProcessorRegistry.
Example:
class MyProcessor(DocumentProcessor):
@property
def name(self) -> str:
return "my_processor"
@property
def supported_mime_types(self) -> set[str]:
return {"application/pdf", "image/jpeg"}
async def process(self, content: bytes, content_type: str, **kwargs) -> ProcessingResult:
# Extract text from content
return ProcessingResult(text="...", metadata={}, processor=self.name)
async def health_check(self) -> bool:
return True
"""
@property
@abstractmethod
def name(self) -> str:
"""Unique identifier for this processor (e.g., 'unstructured', 'tesseract')."""
pass
@property
@abstractmethod
def supported_mime_types(self) -> set[str]:
"""Set of MIME types this processor can handle.
Examples: {"application/pdf", "image/jpeg", "image/png"}
"""
pass
@abstractmethod
async def process(
self,
content: bytes,
content_type: str,
filename: Optional[str] = None,
options: Optional[dict[str, Any]] = None,
) -> ProcessingResult:
"""Process a document and extract text.
Args:
content: Document bytes
content_type: MIME type of the document
filename: Optional filename for format detection
options: Processor-specific options (e.g., OCR language, strategy)
Returns:
ProcessingResult with extracted text and metadata
Raises:
ProcessorError: If processing fails
"""
pass
@abstractmethod
async def health_check(self) -> bool:
"""Check if processor is available and healthy.
Returns:
True if processor is ready to use, False otherwise
"""
pass
def supports(self, content_type: str) -> bool:
"""Check if this processor supports the given MIME type.
Args:
content_type: MIME type (may include parameters like "application/pdf; charset=utf-8")
Returns:
True if this processor can handle the type
"""
# Strip parameters from content type
base_type = content_type.split(";")[0].strip().lower()
return base_type in self.supported_mime_types
class ProcessorError(Exception):
"""Raised when document processing fails."""
pass
@@ -0,0 +1,146 @@
"""Generic HTTP API processor wrapper for custom document processing services."""
import logging
from typing import Any, Optional
import httpx
from .base import DocumentProcessor, ProcessingResult, ProcessorError
logger = logging.getLogger(__name__)
class CustomHTTPProcessor(DocumentProcessor):
"""Generic HTTP API processor wrapper.
Allows integration with any custom document processing API that follows
a simple request/response pattern. This makes it easy to integrate your
own text extraction services without writing a full processor.
Expected API Contract:
- POST request with file as multipart/form-data
- Response: {"text": "extracted text", "metadata": {...}}
Example:
processor = CustomHTTPProcessor(
name="my_ocr",
api_url="https://my-ocr-service.com/process",
api_key="secret",
supported_types={"application/pdf", "image/jpeg"},
)
result = await processor.process(pdf_bytes, "application/pdf")
"""
def __init__(
self,
api_url: str,
api_key: Optional[str] = None,
timeout: int = 60,
supported_types: Optional[set[str]] = None,
name: str = "custom",
):
"""Initialize custom HTTP processor.
Args:
api_url: Your API endpoint (should accept POST with multipart/form-data)
api_key: Optional API key for authentication (sent as Bearer token)
timeout: Request timeout in seconds (default: 60)
supported_types: MIME types your API supports
name: Unique name for this processor (default: "custom")
"""
self.api_url = api_url
self.api_key = api_key
self.timeout = timeout
self._name = name
self._supported_types = supported_types or set()
logger.info(f"Initialized CustomHTTPProcessor: {name} -> {api_url}")
@property
def name(self) -> str:
return self._name
@property
def supported_mime_types(self) -> set[str]:
return self._supported_types
async def process(
self,
content: bytes,
content_type: str,
filename: Optional[str] = None,
options: Optional[dict[str, Any]] = None,
) -> ProcessingResult:
"""Process via custom HTTP API.
Args:
content: Document bytes
content_type: MIME type
filename: Optional filename
options: Custom options (passed as form data to API)
Returns:
ProcessingResult with extracted text and metadata
Raises:
ProcessorError: If API call fails
"""
options = options or {}
# Prepare request
files = {"file": (filename or "document", content, content_type)}
headers = {}
if self.api_key:
headers["Authorization"] = f"Bearer {self.api_key}"
try:
async with httpx.AsyncClient(timeout=self.timeout) as client:
response = await client.post(
self.api_url,
files=files,
headers=headers,
data=options, # Pass options as form data
)
response.raise_for_status()
# Parse response
result = response.json()
text = result.get("text", "")
metadata = result.get("metadata", {})
logger.debug(
f"Custom processor '{self.name}' extracted {len(text)} characters"
)
return ProcessingResult(
text=text,
metadata=metadata,
processor=self.name,
success=True,
)
except httpx.HTTPError as e:
logger.error(f"Custom processor '{self.name}' HTTP error: {e}")
raise ProcessorError(f"API call failed: {str(e)}") from e
except Exception as e:
logger.error(f"Custom processor '{self.name}' failed: {e}")
raise ProcessorError(f"Processing failed: {str(e)}") from e
async def health_check(self) -> bool:
"""Check if custom API is available.
Returns:
True if API responds with status < 500
"""
try:
async with httpx.AsyncClient(timeout=5) as client:
# Try GET request to check availability
response = await client.get(
self.api_url,
headers={"User-Agent": "nextcloud-mcp-server"},
)
return response.status_code < 500
except Exception as e:
logger.warning(f"Custom processor '{self.name}' health check failed: {e}")
return False
@@ -0,0 +1,164 @@
"""Central registry for document processors."""
import logging
from typing import Any, Optional
from .base import DocumentProcessor, ProcessingResult, ProcessorError
logger = logging.getLogger(__name__)
class ProcessorRegistry:
"""Central registry for document processors.
Manages registration and routing of document processing requests to
appropriate processors based on MIME types and priorities.
Example:
registry = ProcessorRegistry()
registry.register(UnstructuredProcessor(...), priority=10)
registry.register(TesseractProcessor(...), priority=5)
# Auto-select processor based on MIME type
result = await registry.process(pdf_bytes, "application/pdf")
# Force specific processor
result = await registry.process(img_bytes, "image/png", processor_name="tesseract")
"""
def __init__(self):
self._processors: dict[str, tuple[DocumentProcessor, int]] = {}
self._priority_order: list[str] = []
def register(self, processor: DocumentProcessor, priority: int = 0):
"""Register a document processor.
Args:
processor: Processor instance to register
priority: Higher priority processors are tried first (default: 0)
"""
name = processor.name
if name in self._processors:
logger.warning(f"Processor '{name}' already registered, replacing")
self._processors[name] = (processor, priority)
# Update priority order
if name in self._priority_order:
self._priority_order.remove(name)
# Insert in priority order (higher priority first)
inserted = False
for i, existing_name in enumerate(self._priority_order):
existing_priority = self._processors[existing_name][1]
if priority > existing_priority:
self._priority_order.insert(i, name)
inserted = True
break
if not inserted:
self._priority_order.append(name)
logger.info(
f"Registered processor: {name} "
f"(priority={priority}, supports={len(processor.supported_mime_types)} types)"
)
def get_processor(self, name: str) -> Optional[DocumentProcessor]:
"""Get a processor by name.
Args:
name: Processor name
Returns:
DocumentProcessor instance or None if not found
"""
if name in self._processors:
return self._processors[name][0]
return None
def find_processor(self, content_type: str) -> Optional[DocumentProcessor]:
"""Find the first processor that supports the given MIME type.
Processors are checked in priority order (highest priority first).
Args:
content_type: MIME type to match
Returns:
First matching processor or None
"""
for name in self._priority_order:
processor = self._processors[name][0]
if processor.supports(content_type):
logger.debug(f"Found processor '{name}' for type '{content_type}'")
return processor
logger.debug(f"No processor found for type '{content_type}'")
return None
def list_processors(self) -> list[str]:
"""List all registered processor names in priority order.
Returns:
List of processor names (highest priority first)
"""
return list(self._priority_order)
async def process(
self,
content: bytes,
content_type: str,
filename: Optional[str] = None,
processor_name: Optional[str] = None,
options: Optional[dict[str, Any]] = None,
) -> ProcessingResult:
"""Process a document using available processors.
Args:
content: Document bytes
content_type: MIME type
filename: Optional filename for format detection
processor_name: Force specific processor (or None for auto-select)
options: Processing options passed to processor
Returns:
ProcessingResult with extracted text and metadata
Raises:
ProcessorError: If no processor found or processing fails
"""
# Find processor
if processor_name:
processor = self.get_processor(processor_name)
if not processor:
raise ProcessorError(
f"Processor '{processor_name}' not found. "
f"Available: {', '.join(self.list_processors())}"
)
else:
processor = self.find_processor(content_type)
if not processor:
raise ProcessorError(
f"No processor found for type: {content_type}. "
f"Registered processors: {', '.join(self.list_processors())}"
)
logger.info(f"Processing with '{processor.name}' processor")
# Process
return await processor.process(content, content_type, filename, options)
# Global registry instance
_registry = ProcessorRegistry()
def get_registry() -> ProcessorRegistry:
"""Get the global processor registry.
Returns:
Singleton ProcessorRegistry instance
"""
return _registry
@@ -0,0 +1,161 @@
"""Document processor using Tesseract OCR (local)."""
import logging
import shutil
from typing import Any, Optional
from .base import DocumentProcessor, ProcessingResult, ProcessorError
logger = logging.getLogger(__name__)
try:
import io
import pytesseract
from PIL import Image
TESSERACT_AVAILABLE = True
except ImportError:
TESSERACT_AVAILABLE = False
class TesseractProcessor(DocumentProcessor):
"""Document processor using Tesseract OCR (local).
This processor runs OCR locally using the Tesseract engine, which is
faster and more lightweight than cloud-based solutions but requires
Tesseract to be installed on the system.
Requirements:
- tesseract binary installed (e.g., apt install tesseract-ocr)
- Python packages: pip install pytesseract pillow
Example:
processor = TesseractProcessor(default_lang="eng+deu")
result = await processor.process(image_bytes, "image/jpeg")
"""
SUPPORTED_TYPES = {
"image/jpeg",
"image/png",
"image/tiff",
"image/bmp",
"image/gif",
}
def __init__(
self,
tesseract_cmd: Optional[str] = None,
default_lang: str = "eng",
):
"""Initialize Tesseract processor.
Args:
tesseract_cmd: Path to tesseract executable (None = auto-detect)
default_lang: Default OCR language (e.g., "eng", "deu", "eng+deu")
Raises:
ProcessorError: If Tesseract or required packages not available
"""
if not TESSERACT_AVAILABLE:
raise ProcessorError(
"Tesseract processor requires: pip install pytesseract pillow"
)
if tesseract_cmd:
pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
elif not shutil.which("tesseract"):
raise ProcessorError(
"Tesseract not found in PATH. Install with: apt install tesseract-ocr"
)
self.default_lang = default_lang
logger.info(f"Initialized TesseractProcessor: lang={default_lang}")
@property
def name(self) -> str:
return "tesseract"
@property
def supported_mime_types(self) -> set[str]:
return self.SUPPORTED_TYPES
async def process(
self,
content: bytes,
content_type: str,
filename: Optional[str] = None,
options: Optional[dict[str, Any]] = None,
) -> ProcessingResult:
"""Process image via Tesseract OCR.
Args:
content: Image bytes
content_type: Image MIME type
filename: Optional filename
options: Processing options:
- lang: OCR language(s) (default: from init)
- config: Tesseract config string
Returns:
ProcessingResult with extracted text and metadata
Raises:
ProcessorError: If OCR fails
"""
options = options or {}
lang = options.get("lang", self.default_lang)
config = options.get("config", "")
try:
# Load image
image = Image.open(io.BytesIO(content))
# Run OCR
text = pytesseract.image_to_string(image, lang=lang, config=config)
# Get additional data for confidence scores
data = pytesseract.image_to_data(
image, lang=lang, output_type=pytesseract.Output.DICT
)
# Calculate average confidence
confidences = [c for c in data["conf"] if c != -1]
avg_confidence = sum(confidences) / len(confidences) if confidences else 0
metadata = {
"text_length": len(text),
"language": lang,
"image_size": image.size,
"image_mode": image.mode,
"confidence": round(avg_confidence, 2),
"words_detected": len([c for c in data["conf"] if c != -1]),
}
logger.debug(
f"Tesseract OCR completed: {len(text)} chars, "
f"confidence={avg_confidence:.1f}%"
)
return ProcessingResult(
text=text.strip(),
metadata=metadata,
processor=self.name,
success=True,
)
except Exception as e:
logger.error(f"Tesseract processing failed: {e}")
raise ProcessorError(f"OCR failed: {str(e)}") from e
async def health_check(self) -> bool:
"""Check if Tesseract is available.
Returns:
True if Tesseract is installed and working
"""
try:
pytesseract.get_tesseract_version()
return True
except Exception:
return False
@@ -0,0 +1,193 @@
"""Document processor using Unstructured.io API."""
import io
import logging
from typing import Any, Optional
import httpx
from .base import DocumentProcessor, ProcessingResult, ProcessorError
logger = logging.getLogger(__name__)
class UnstructuredProcessor(DocumentProcessor):
"""Document processor using Unstructured.io API.
The Unstructured API provides document parsing capabilities for various formats
including PDF, DOCX, images with OCR, and more.
API Documentation: https://docs.unstructured.io/api-reference/api-services/api-parameters
"""
# Supported MIME types for Unstructured
SUPPORTED_TYPES = {
"application/pdf",
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"application/msword",
"application/vnd.openxmlformats-officedocument.presentationml.presentation",
"application/vnd.ms-powerpoint",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"application/vnd.ms-excel",
"application/rtf",
"text/rtf",
"application/vnd.oasis.opendocument.text",
"application/epub+zip",
"message/rfc822",
"application/vnd.ms-outlook",
"image/jpeg",
"image/png",
"image/tiff",
"image/bmp",
}
def __init__(
self,
api_url: str,
timeout: int = 120,
default_strategy: str = "auto",
default_languages: Optional[list[str]] = None,
):
"""Initialize Unstructured processor.
Args:
api_url: Unstructured API endpoint
timeout: Request timeout in seconds (default: 120)
default_strategy: Default parsing strategy - "auto", "fast", or "hi_res"
default_languages: Default OCR language codes (e.g., ["eng", "deu"])
"""
self.api_url = api_url
self.timeout = timeout
self.default_strategy = default_strategy
self.default_languages = default_languages or ["eng"]
logger.info(
f"Initialized UnstructuredProcessor: {api_url}, "
f"strategy={default_strategy}, languages={self.default_languages}"
)
@property
def name(self) -> str:
return "unstructured"
@property
def supported_mime_types(self) -> set[str]:
return self.SUPPORTED_TYPES
async def process(
self,
content: bytes,
content_type: str,
filename: Optional[str] = None,
options: Optional[dict[str, Any]] = None,
) -> ProcessingResult:
"""Process document via Unstructured API.
Args:
content: Document bytes
content_type: MIME type
filename: Optional filename for format detection
options: Processing options:
- strategy: "auto", "fast", or "hi_res" (default: from init)
- languages: List of language codes (default: from init)
- extract_image_block_types: Types of image elements to extract
Returns:
ProcessingResult with extracted text and metadata
Raises:
ProcessorError: If processing fails
"""
options = options or {}
# Extract options with defaults
strategy = options.get("strategy", self.default_strategy)
languages = options.get("languages", self.default_languages)
extract_image_block_types = options.get("extract_image_block_types")
# Prepare multipart request
files = {
"files": (
filename or "document",
io.BytesIO(content),
content_type or "application/octet-stream",
)
}
data = {
"strategy": strategy,
"languages": ",".join(languages),
}
if extract_image_block_types:
data["extract_image_block_types"] = ",".join(extract_image_block_types)
logger.debug(
f"Processing with Unstructured API: strategy={strategy}, languages={languages}"
)
try:
async with httpx.AsyncClient(timeout=self.timeout) as client:
response = await client.post(
f"{self.api_url}/general/v0/general",
files=files,
data=data,
)
response.raise_for_status()
# Parse response
elements = response.json()
# Extract text and metadata
texts = []
element_types: dict[str, int] = {}
for element in elements:
if "text" in element and element["text"]:
texts.append(element["text"])
el_type = element.get("type", "unknown")
element_types[el_type] = element_types.get(el_type, 0) + 1
parsed_text = "\n\n".join(texts)
metadata = {
"element_count": len(elements),
"text_length": len(parsed_text),
"element_types": element_types,
"strategy": strategy,
"languages": languages,
}
logger.debug(
f"Successfully processed: {len(elements)} elements, "
f"{len(parsed_text)} characters"
)
return ProcessingResult(
text=parsed_text,
metadata=metadata,
processor=self.name,
success=True,
)
except httpx.HTTPError as e:
logger.error(f"Unstructured API HTTP error: {e}")
raise ProcessorError(f"HTTP error: {str(e)}") from e
except Exception as e:
logger.error(f"Unstructured API processing failed: {e}")
raise ProcessorError(f"Processing failed: {str(e)}") from e
async def health_check(self) -> bool:
"""Check if Unstructured API is available.
Returns:
True if API is healthy, False otherwise
"""
try:
async with httpx.AsyncClient(timeout=5) as client:
response = await client.get(f"{self.api_url}/healthcheck")
return response.status_code == 200
except Exception as e:
logger.warning(f"Unstructured health check failed: {e}")
return False
+5 -6
View File
@@ -2,15 +2,13 @@ import logging
from mcp.server.fastmcp import Context, FastMCP
from nextcloud_mcp_server.client import NextcloudClient
from nextcloud_mcp_server.auth import require_scopes
from nextcloud_mcp_server.context import get_client
from nextcloud_mcp_server.models import DirectoryListing, FileInfo, SearchFilesResponse
from nextcloud_mcp_server.utils.document_parser import (
is_parseable_document,
parse_document,
)
from nextcloud_mcp_server.config import is_unstructured_parsing_enabled
from nextcloud_mcp_server.auth import require_scopes
from nextcloud_mcp_server.context import get_client
from nextcloud_mcp_server.models import DirectoryListing, FileInfo, SearchFilesResponse
logger = logging.getLogger(__name__)
@@ -82,7 +80,8 @@ def configure_webdav_tools(mcp: FastMCP):
content, content_type = await client.webdav.read_file(path)
# Check if this is a parseable document (PDF, DOCX, etc.)
if is_unstructured_parsing_enabled() and is_parseable_document(content_type):
# is_parseable_document() checks if document processing is enabled
if is_parseable_document(content_type):
try:
logger.info(f"Parsing document '{path}' of type '{content_type}'")
parsed_text, metadata = await parse_document(
+41 -79
View File
@@ -1,62 +1,46 @@
"""Document parsing utilities based on the "unstructured" microservice"""
"""Document parsing utilities using pluggable processor registry."""
import base64
import logging
from typing import Optional, Tuple
from nextcloud_mcp_server.config import is_unstructured_parsing_enabled
from nextcloud_mcp_server.config import get_document_processor_config
from nextcloud_mcp_server.document_processors import (
ProcessorError,
get_registry,
)
logger = logging.getLogger(__name__)
# Mapping of MIME types to their corresponding parsing strategies
PARSEABLE_MIME_TYPES = {
# PDF documents
"application/pdf": "pdf",
# Microsoft Word documents
"application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
"application/msword": "doc",
# Microsoft PowerPoint
"application/vnd.openxmlformats-officedocument.presentationml.presentation": "pptx",
"application/vnd.ms-powerpoint": "ppt",
# Microsoft Excel
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
"application/vnd.ms-excel": "xls",
# Other document formats
"application/rtf": "rtf",
"text/rtf": "rtf",
"application/vnd.oasis.opendocument.text": "odt",
"application/epub+zip": "epub",
# Email formats
"message/rfc822": "eml",
"application/vnd.ms-outlook": "msg",
# Image formats (for OCR)
"image/jpeg": "image",
"image/png": "image",
"image/tiff": "image",
"image/bmp": "image",
}
def is_parseable_document(content_type: Optional[str]) -> bool:
"""Check if a document type can be parsed.
"""Check if a document type can be parsed by any registered processor.
Args:
content_type: The MIME type of the document
Returns:
True if the document can be parsed, False otherwise
True if any processor can handle this type, False otherwise
"""
if not content_type:
return False
# Handle content types with additional parameters (e.g., "application/pdf; charset=utf-8")
base_content_type = content_type.split(";")[0].strip().lower()
return base_content_type in PARSEABLE_MIME_TYPES
config = get_document_processor_config()
if not config["enabled"]:
return False
registry = get_registry()
processor = registry.find_processor(content_type)
return processor is not None
async def parse_document(
content: bytes, content_type: Optional[str], filename: Optional[str] = None
) -> Tuple[str, dict]:
"""Parse a document using the Unstructured API.
"""Parse a document using registered processors.
This function uses the processor registry to find an appropriate
processor for the given document type and extract text from it.
Args:
content: The document content as bytes
@@ -72,59 +56,37 @@ async def parse_document(
ValueError: If the document type is not supported
Exception: If parsing fails
"""
if not is_parseable_document(content_type):
raise ValueError(f"Document type '{content_type}' is not supported for parsing")
if not content_type:
raise ValueError("Content type is required for document parsing")
base_content_type = (
content_type.split(";")[0].strip().lower() if content_type else ""
)
doc_type = PARSEABLE_MIME_TYPES.get(base_content_type, "unknown")
config = get_document_processor_config()
if not config["enabled"]:
raise ValueError("Document processing is disabled")
logger.debug(f"Parsing document of type '{doc_type}' (MIME: {content_type})")
registry = get_registry()
# Check if unstructured parsing is enabled via environment
if is_unstructured_parsing_enabled():
logger.debug("Using Unstructured API for parsing")
try:
from nextcloud_mcp_server.client.unstructured_client import (
UnstructuredClient,
)
logger.debug(f"Parsing document of type '{content_type}'")
client = UnstructuredClient()
# The client will automatically use environment configuration
# (UNSTRUCTURED_STRATEGY and UNSTRUCTURED_LANGUAGES)
return await client.partition_document(
content=content,
filename=filename or f"document.{doc_type}",
content_type=content_type,
)
except Exception as e:
logger.error(f"Unstructured API parsing failed: {e}")
# If unstructured parsing fails, return base64 as fallback
import base64
parsed_text = f"Document could not be parsed. Base64 content: {base64.b64encode(content).decode('ascii')[:200]}..."
metadata = {
"document_type": doc_type,
"mime_type": content_type,
"element_count": 0,
"text_length": len(parsed_text),
"parsing_method": "fallback_base64",
"error": str(e),
}
return parsed_text, metadata
else:
logger.debug(
"Unstructured parsing is disabled, returning base64 encoded content as fallback"
try:
# Process using registry (auto-selects processor based on MIME type)
result = await registry.process(
content=content,
content_type=content_type,
filename=filename,
)
import base64
logger.info(f"Successfully parsed document with '{result.processor}' processor")
return result.text, result.metadata
except ProcessorError as e:
logger.error(f"Document processing failed: {e}")
# Fallback to base64 with error metadata
parsed_text = f"Document could not be parsed. Base64 content: {base64.b64encode(content).decode('ascii')[:200]}..."
metadata = {
"document_type": doc_type,
"mime_type": content_type,
"element_count": 0,
"text_length": len(parsed_text),
"parsing_method": "fallback_base64",
"error": str(e),
}
return parsed_text, metadata