d67aa6ae5c
This commit fixes two critical issues with PDF processing:
1. **Text extraction mismatch (context expansion bug)**:
- Indexing used pymupdf4llm.to_markdown() producing markdown text
- Context expansion used page.get_text() producing plain text
- Different text formats caused character offset misalignment
- Search would find correct chunk, but expansion showed wrong section
- Fixed by making context.py use pymupdf4llm.to_markdown() consistently
2. **Diagnostic logging for page number assignment**:
- Added logging to verify page_boundaries exist in metadata
- Added logging to verify assign_page_numbers() assigns values
- Helps diagnose why page numbers show as null in search results
3. **mime_type storage bug**:
- Fixed incorrect field reference in processor.py:405
- Was using file_metadata.get("content_type", "")
- Should use content_type from WebDAV response
Changes:
- nextcloud_mcp_server/search/context.py: Use pymupdf4llm.to_markdown()
for PDF text extraction to match indexing method
- nextcloud_mcp_server/vector/processor.py: Add diagnostic logging for
page boundaries and assignment, fix mime_type storage
- tests/unit/client/test_webdav.py: Fix import sorting
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>