nextcloud-mcp-server

Files

Chris Coutinho d67aa6ae5c fix: Align PDF text extraction between indexing and context expansion

This commit addresses three issues with PDF processing:

1. **Text extraction mismatch (context expansion bug)**:
   - Indexing used pymupdf4llm.to_markdown() producing markdown text
   - Context expansion used page.get_text() producing plain text
   - Different text formats caused character offset misalignment
   - Search would find correct chunk, but expansion showed wrong section
   - Fixed by making context.py use pymupdf4llm.to_markdown() consistently

2. **Diagnostic logging for page number assignment**:
   - Added logging to verify page_boundaries exist in metadata
   - Added logging to verify assign_page_numbers() assigns values
   - Helps diagnose why page numbers show as null in search results

3. **mime_type storage bug**:
   - Fixed incorrect field reference in processor.py:405
   - Was using file_metadata.get("content_type", "")
   - Should use content_type from WebDAV response
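
The offset misalignment in item 1 can be illustrated with a minimal sketch. The two strings below are hypothetical stand-ins for what a markdown extractor and a plain-text extractor might return for the same page; they are not real output from pymupdf4llm or PyMuPDF, and `expand` is an illustrative helper, not a function from this repository.

```python
# Hypothetical stand-ins for two extractions of the same PDF page.
# Markdown output carries extra syntax characters ("## ", "**"), so every
# offset after them is shifted relative to the plain-text output.
markdown_text = "## Results\n\n**Accuracy** improved by 12% over the baseline.\n"
plain_text = "Results\nAccuracy improved by 12% over the baseline.\n"

# At indexing time, a chunk is located by character offsets in the markdown text.
chunk = "improved by 12%"
start = markdown_text.index(chunk)
end = start + len(chunk)


def expand(text: str, start: int, end: int) -> str:
    """Illustrative helper: return the span a stored offset pair points at."""
    return text[start:end]


# Same extraction method on both sides: the offsets recover the chunk.
same_source = expand(markdown_text, start, end)

# Different extraction method: the same offsets land on different text.
other_source = expand(plain_text, start, end)

print(repr(same_source))
print(repr(other_source))
```

Under these assumptions, the stored offsets only round-trip when the text they are applied to was produced by the same extraction method used at indexing time.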

Changes:
- nextcloud_mcp_server/search/context.py: Use pymupdf4llm.to_markdown()
  for PDF text extraction to match indexing method
- nextcloud_mcp_server/vector/processor.py: Add diagnostic logging for
  page boundaries and assignment, fix mime_type storage
- tests/unit/client/test_webdav.py: Fix import sorting
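
For context on item 2, here is a rough sketch of how page boundaries could be mapped to chunk page numbers, with the kind of diagnostic logging described above. The function name `assign_page_numbers_sketch`, the shape of `page_boundaries` (a sorted list of page start offsets), and the chunk dict keys are all assumptions made for illustration; the real `assign_page_numbers()` in processor.py may differ.

```python
import bisect
import logging

logger = logging.getLogger("pdf_pages_sketch")


def assign_page_numbers_sketch(chunks, page_boundaries):
    """Assign a 1-based page number to each chunk (hypothetical data shapes).

    chunks: list of dicts, each with a "start_offset" character offset.
    page_boundaries: sorted list of start offsets, one per page (assumed).
    """
    if not page_boundaries:
        # Diagnostic: missing boundaries are one way page numbers end up null.
        logger.warning("page_boundaries missing; page numbers stay None")
        for chunk in chunks:
            chunk.setdefault("page_number", None)
        return chunks

    for chunk in chunks:
        # bisect_right counts the pages that start at or before the offset,
        # which is the 1-based page number when the first boundary is 0.
        page = bisect.bisect_right(page_boundaries, chunk["start_offset"])
        chunk["page_number"] = max(page, 1)
        logger.debug(
            "offset=%d -> page=%d", chunk["start_offset"], chunk["page_number"]
        )
    return chunks


example = assign_page_numbers_sketch(
    [{"start_offset": 0}, {"start_offset": 120}, {"start_offset": 260}],
    [0, 100, 250],
)
pages = [c["page_number"] for c in example]
no_boundaries = assign_page_numbers_sketch([{"start_offset": 5}], [])
```

Logging both the presence of boundaries and each assignment makes it possible to tell whether null page numbers come from missing metadata or from the assignment step itself.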

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-20 13:57:50 +01:00

auth

feat: add webhook management UI and BeforeNodeDeletedEvent support

2025-11-11 20:35:08 +01:00

client

feat: add webhook management UI and BeforeNodeDeletedEvent support

2025-11-11 20:35:08 +01:00

fixtures

test: Replace http server for recipes with nginx container

2025-10-17 04:30:03 +02:00

integration

fix: Add async/await, PDF metadata, and type safety fixes