Files
nextcloud-mcp-server/nextcloud_mcp_server/document_processors/tesseract.py
T
Chris Coutinho 2147fc1696 refactor: Transform document parsing into pluggable processor architecture
Refactors PR #190's hardcoded Unstructured.io integration into a flexible,
extensible plugin system supporting multiple text extraction engines.

- **`DocumentProcessor` ABC**: Abstract interface for all processors
- **`ProcessorRegistry`**: Central registry for discovery and routing
- **`ProcessingResult`**: Standardized output format across processors

- **`UnstructuredProcessor`**: Refactored from `UnstructuredClient`
- **`TesseractProcessor`**: Local OCR for images (lightweight alternative)
- **`CustomHTTPProcessor`**: Generic wrapper for custom HTTP APIs

- New `get_document_processor_config()` returns structured config
- Supports enabling/disabling individual processors
- Per-processor configuration via environment variables
- **Breaking Change**: `ENABLE_UNSTRUCTURED_PARSING` replaced with:
  - `ENABLE_DOCUMENT_PROCESSING=true/false` (master switch)
  - `ENABLE_UNSTRUCTURED=true/false` (per-processor)
  - `ENABLE_TESSERACT=true/false`
  - `ENABLE_CUSTOM_PROCESSOR=true/false`

- `parse_document()` now uses `ProcessorRegistry`
- Auto-selects appropriate processor based on MIME type
- Processor priority system (Unstructured=10, Tesseract=5, Custom=1)

- `initialize_document_processors()` registers processors at startup
- Integrated into both BasicAuth and OAuth lifespans
- Graceful degradation if processors fail to initialize

```env
ENABLE_DOCUMENT_PROCESSING=false

ENABLE_UNSTRUCTURED=false
UNSTRUCTURED_API_URL=http://unstructured:8000
UNSTRUCTURED_STRATEGY=auto  # auto|fast|hi_res
UNSTRUCTURED_LANGUAGES=eng,deu

ENABLE_TESSERACT=false
TESSERACT_LANG=eng

ENABLE_CUSTOM_PROCESSOR=false
CUSTOM_PROCESSOR_URL=http://localhost:9000/process
CUSTOM_PROCESSOR_TYPES=application/pdf,image/jpeg
```

- **Removed**: `tests/test_unstructured_config.py` (legacy tests)
- **Added**: `tests/unit/test_document_processor_config.py`
  - 7 unit tests for new config system
  - Tests individual and multi-processor configurations

- **Added**:
  - `nextcloud_mcp_server/document_processors/__init__.py`
  - `nextcloud_mcp_server/document_processors/base.py`
  - `nextcloud_mcp_server/document_processors/registry.py`
  - `nextcloud_mcp_server/document_processors/unstructured.py`
  - `nextcloud_mcp_server/document_processors/tesseract.py`
  - `nextcloud_mcp_server/document_processors/custom_http.py`
  - `tests/unit/test_document_processor_config.py`

- **Modified**:
  - `nextcloud_mcp_server/config.py` - New plugin config system
  - `nextcloud_mcp_server/app.py` - Processor initialization
  - `nextcloud_mcp_server/utils/document_parser.py` - Uses registry
  - `nextcloud_mcp_server/server/webdav.py` - Import updates
  - `env.sample` - New configuration format
  - `docker-compose.yml` - (profile changes from previous work)

- **Removed**:
  - `nextcloud_mcp_server/client/unstructured_client.py` - Replaced by UnstructuredProcessor
  - `tests/test_unstructured_config.py` - Replaced with new tests

 **Extensible**: Add processors without modifying core code
 **Testable**: Mock processors for unit tests
 **Configurable**: Enable only needed processors
 **Flexible**: Choose fast (Tesseract) vs accurate (Unstructured)
 **Opt-in**: Disabled by default, no mandatory dependencies

Users upgrading from PR #190 need to update environment variables:
```bash
ENABLE_UNSTRUCTURED_PARSING=true

ENABLE_DOCUMENT_PROCESSING=true
ENABLE_UNSTRUCTURED=true
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-25 19:28:35 +02:00

162 lines
4.7 KiB
Python

"""Document processor using Tesseract OCR (local)."""
import logging
import shutil
from typing import Any, Optional
from .base import DocumentProcessor, ProcessingResult, ProcessorError
logger = logging.getLogger(__name__)
try:
import io
import pytesseract
from PIL import Image
TESSERACT_AVAILABLE = True
except ImportError:
TESSERACT_AVAILABLE = False
class TesseractProcessor(DocumentProcessor):
"""Document processor using Tesseract OCR (local).
This processor runs OCR locally using the Tesseract engine, which is
faster and more lightweight than cloud-based solutions but requires
Tesseract to be installed on the system.
Requirements:
- tesseract binary installed (e.g., apt install tesseract-ocr)
- Python packages: pip install pytesseract pillow
Example:
processor = TesseractProcessor(default_lang="eng+deu")
result = await processor.process(image_bytes, "image/jpeg")
"""
SUPPORTED_TYPES = {
"image/jpeg",
"image/png",
"image/tiff",
"image/bmp",
"image/gif",
}
def __init__(
self,
tesseract_cmd: Optional[str] = None,
default_lang: str = "eng",
):
"""Initialize Tesseract processor.
Args:
tesseract_cmd: Path to tesseract executable (None = auto-detect)
default_lang: Default OCR language (e.g., "eng", "deu", "eng+deu")
Raises:
ProcessorError: If Tesseract or required packages not available
"""
if not TESSERACT_AVAILABLE:
raise ProcessorError(
"Tesseract processor requires: pip install pytesseract pillow"
)
if tesseract_cmd:
pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
elif not shutil.which("tesseract"):
raise ProcessorError(
"Tesseract not found in PATH. Install with: apt install tesseract-ocr"
)
self.default_lang = default_lang
logger.info(f"Initialized TesseractProcessor: lang={default_lang}")
@property
def name(self) -> str:
return "tesseract"
@property
def supported_mime_types(self) -> set[str]:
return self.SUPPORTED_TYPES
async def process(
self,
content: bytes,
content_type: str,
filename: Optional[str] = None,
options: Optional[dict[str, Any]] = None,
) -> ProcessingResult:
"""Process image via Tesseract OCR.
Args:
content: Image bytes
content_type: Image MIME type
filename: Optional filename
options: Processing options:
- lang: OCR language(s) (default: from init)
- config: Tesseract config string
Returns:
ProcessingResult with extracted text and metadata
Raises:
ProcessorError: If OCR fails
"""
options = options or {}
lang = options.get("lang", self.default_lang)
config = options.get("config", "")
try:
# Load image
image = Image.open(io.BytesIO(content))
# Run OCR
text = pytesseract.image_to_string(image, lang=lang, config=config)
# Get additional data for confidence scores
data = pytesseract.image_to_data(
image, lang=lang, output_type=pytesseract.Output.DICT
)
# Calculate average confidence
confidences = [c for c in data["conf"] if c != -1]
avg_confidence = sum(confidences) / len(confidences) if confidences else 0
metadata = {
"text_length": len(text),
"language": lang,
"image_size": image.size,
"image_mode": image.mode,
"confidence": round(avg_confidence, 2),
"words_detected": len([c for c in data["conf"] if c != -1]),
}
logger.debug(
f"Tesseract OCR completed: {len(text)} chars, "
f"confidence={avg_confidence:.1f}%"
)
return ProcessingResult(
text=text.strip(),
metadata=metadata,
processor=self.name,
success=True,
)
except Exception as e:
logger.error(f"Tesseract processing failed: {e}")
raise ProcessorError(f"OCR failed: {str(e)}") from e
async def health_check(self) -> bool:
"""Check if Tesseract is available.
Returns:
True if Tesseract is installed and working
"""
try:
pytesseract.get_tesseract_version()
return True
except Exception:
return False