Commit Graph

299 Commits

Author SHA1 Message Date
Chris Coutinho a58a14111b feat(vector-sync): enable background sync in OAuth mode
Add multi-user background vector synchronization when running in OAuth
mode with ENABLE_OFFLINE_ACCESS=true. Key changes:

Architecture (oauth_sync.py):
- User Manager task polls RefreshTokenStorage for provisioned users
- Per-user scanner tasks fetch documents using OAuth tokens
- Shared processor pool indexes documents from all users

Token Broker improvements:
- Accept client_id/client_secret instead of encryption_key
- Remove redundant token audience pre-validation (Nextcloud validates)
- Add _rewrite_token_endpoint for Docker internal URL routing
- Remove double-decryption (storage handles encryption internally)

Browser OAuth flow fixes:
- Add 'resource' parameter to request Nextcloud-scoped tokens
- Store and retrieve next_url for proper redirect after consent
- Rewrite token endpoint URLs for internal Docker access

Configuration:
- Add vector_sync_user_poll_interval setting (default: 60s)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-14 20:00:41 +01:00
Chris Coutinho e0320e761c perf(deck): optimize card lookup by storing board_id/stack_id in metadata
Addresses reviewer feedback on PR #395 about O(n²) performance issue.

Changes:
- scanner.py: Add metadata field to DocumentTask with board_id/stack_id
- scanner.py: Populate metadata during deck card scanning (both initial and incremental sync)
- processor.py: Use metadata for O(1) card lookup via get_card() API when available
- processor.py: Fallback to iteration for legacy data without metadata
- context.py: Add _get_deck_metadata_from_qdrant() helper to retrieve metadata from Qdrant
- context.py: Use metadata for fast path lookup in chunk context expansion
- context.py: Add user_id parameter to _fetch_document_text() for metadata retrieval

Performance Impact:
- Before: O(boards × stacks × cards) iteration for each card lookup
- After: O(1) direct API call using stored board_id/stack_id
- Graceful degradation: Falls back to iteration for legacy data

Testing:
- All existing integration tests pass (test_deck_vector_search.py)
- Type checking passes with no new errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 00:23:12 +01:00
Chris Coutinho 20404cf3f2 feat(vector): add Deck card vector search with visualization support
Adds comprehensive vector search support for Nextcloud Deck cards,
including semantic search indexing, chunk preview in the vector viz UI,
and proper deep linking to cards.

**Vector Search Indexing**
- Add deck_card scanning in scanner.py (scan_deck_cards function)
- Index cards from non-archived, non-deleted boards
- Store metadata: board_id, board_title, stack_id, stack_title, card_type, duedate, owner
- Content structure: title + "\n\n" + description (matches indexing format)
- Incremental sync based on lastModified timestamp
- Deletion tracking with grace period

**Vector Visualization Support**
- Add deck_card handler in context.py for chunk preview expansion
- Include board_id in search result metadata (bm25_hybrid.py, semantic.py)
- Expose metadata in viz_routes.py JSON responses
- Update vector-viz.js to construct proper Deck URLs: /apps/deck/board/{board_id}/card/{card_id}
- Update vector_viz.html filter label from "Deck" to "Deck Cards"

**Bug Fixes**
- Skip soft-deleted boards (deletedAt > 0) to prevent 403 Forbidden errors
- Applies to scanner, processor, and context expansion code paths
- Deck API returns deleted boards but rejects stack access with 403

**Testing**
- Add integration tests in test_deck_vector_search.py:
  - test_deck_card_semantic_search: Filtered search with doc_type="deck_card"
  - test_deck_card_appears_in_cross_app_search: Cross-app search includes deck cards
  - test_deck_card_chunk_context: Chunk context fetching for viz preview

**Documentation**
- Update README.md: Add Deck cards to semantic search feature list
- Update semantic-search-architecture.md: Document deck_card support
- Update nc_semantic_search tool documentation

**Type Safety**
- Fix type narrowing for page_boundaries (could be None) using cast()
- Fix scanner.py payload None check for type safety

Resolves vector search for Deck cards across indexing, search, and visualization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-13 23:51:18 +01:00
Chris Coutinho 9d0a993c2a feat(vector-viz): add news_item support for links and chunk expansion
Add support for news_item document type in the vector visualization page:

- Add "News" checkbox to document type filter options
- Add URL handler to link news items to /apps/news/item/{id}
- Add content fetching for news items in chunk context expansion

This enables users to search and view news articles in the vector
visualization, with clickable links back to Nextcloud News and the
ability to expand chunks to see full article context.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-13 21:34:47 +01:00
Chris Coutinho d61e33113c fix(news): revert get_item() to use get_items() + filter
Reverts the "perf(news): use direct API endpoint for get_item()" change
from commit 92c4bf3 which incorrectly assumed GET /items/{itemId} exists.

The News API (v1-2, v1-3, v2) does not provide a direct endpoint to
retrieve individual items. The only /items/{itemId} routes are POST
operations for marking items read/unread/starred.

Changes:
- Restore original get_item() implementation that fetches all items
  and filters in Python
- Update exception from HTTPStatusError to ValueError
- Restore documentation explaining API limitation
- Update unit tests to mock get_items() instead of _make_request()
- Add test for ValueError when item not found

Fixes vector processor 405 errors when indexing news items.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-13 15:47:27 +01:00
Chris Coutinho c8da826ef7 Merge pull request #382 from cbcoutinho/renovate/mcp-1.x
fix(deps): update dependency mcp to >=1.23,<1.24
2025-12-12 18:00:04 +01:00
Chris Coutinho ec70e70a5d fix: Disable DNS rebinding protection for containerized deployments
MCP Python SDK 1.23.0 introduced automatic DNS rebinding protection that
auto-enables when host="127.0.0.1" (the default). This breaks containerized
deployments (Kubernetes, Docker) because the protection rejects requests
with Host headers like "nextcloud-mcp-server.default.svc.cluster.local:8000".

Root cause:
- FastMCP defaults to host="127.0.0.1"
- SDK auto-enables DNS rebinding protection with allowed_hosts=["127.0.0.1:*", "localhost:*", "[::1]:*"]
- K8s/Docker requests use service DNS names or proxied hostnames
- Protection middleware rejects these requests (421 Misdirected Request)

Solution:
- Explicitly pass transport_security=TransportSecuritySettings(enable_dns_rebinding_protection=False)
- Applied to all three FastMCP initializations (OAuth, Smithery, BasicAuth)
- DNS rebinding attacks mitigated by OAuth authentication and network isolation

This fixes issue #373 and enables MCP 1.23.x upgrade in PR #382.

For detailed analysis, see docs/MCP-1.23-DNS-REBINDING-FIX.md
2025-12-12 17:30:22 +01:00
Chris Coutinho 19183ad14a fix: address PR review feedback
Address all reviewer comments from PR #387:

1.  Add unit tests for annotations (tests/server/test_annotations.py)
   - 10 comprehensive test functions validating all annotation patterns
   - Tests for titles, read-only, destructive, idempotent operations
   - Validates specific ADR-017 decisions (webdav write, semantic search)
   - Cross-category consistency checks

2.  Fix nc_webdav_write_file idempotency classification
   - Changed from idempotentHint=False to idempotentHint=True
   - Rationale: Uses HTTP PUT without version control
   - Writing same content to same path = same end state (idempotent)

3.  Fix semantic search openWorldHint inconsistency
   - Changed from openWorldHint=False to openWorldHint=True
   - Rationale: Consistent with other Nextcloud tools
   - Nextcloud is external to MCP server (indexed data is implementation detail)

4.  Update ADR-017 with resolved decisions
   - Converted Open Questions to Resolved Questions
   - Added detailed rationale for webdav write and semantic search
   - Updated status from Proposed to Implemented
   - Added decision timeline with dates

5.  Add MCP Tool Annotations guidelines to CLAUDE.md
   - Comprehensive section with code examples for all patterns
   - Key principles documented (idempotency, destructive, open world)
   - References ADR-017 for detailed rationale

All OAuth tools verified to have proper annotations (oauth_tools.py lines 686-751).
2025-12-11 13:50:55 +01:00
Chris Coutinho e1412320a7 feat: add MCP tool annotations for enhanced UX
Add ToolAnnotations to all 105+ MCP tools across 13 modules to enable
better client-side UX with human-readable titles and behavioral hints.

Changes:
- Add title and ToolAnnotations to all @mcp.tool() decorators
- Apply correct idempotency classification per ADR-017
- Add destructiveHint for delete operations
- Set openWorldHint=False for semantic search (internal data only)

Modules updated:
- OAuth (4 tools): Authentication and provisioning
- Notes (7 tools): Note management
- WebDAV (11 tools): File operations
- Semantic (3 tools): Semantic search and RAG
- Calendar (16 tools): Events and todos
- Contacts (7 tools): Address book management
- Sharing (5 tools): File/folder sharing
- Tables (6 tools): Structured data
- Deck (25 tools): Kanban board management
- Cookbook (13 tools): Recipe management
- News (8 tools): RSS feed reader

Annotation patterns:
- Read operations: readOnlyHint=True, openWorldHint=True
- Create operations: idempotentHint=False, openWorldHint=True
- Update operations: idempotentHint=False, openWorldHint=True
- Delete operations: destructiveHint=True, idempotentHint=True, openWorldHint=True

See docs/ADR-017-mcp-tool-annotations.md for rationale and implementation details.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-11 12:45:02 +01:00
Chris Coutinho 3f06e2ee77 fix: resolve all type checking errors (8 errors fixed)
Fixed 8 type checker errors across the codebase:

- vector/scanner.py: Handle None scroll results with null-safe iteration
- search/{bm25_hybrid,semantic}.py: Add None checks for result.payload
- auth/{unified_verifier,webhook_routes}.py: Assert non-None auth credentials
- client/webdav.py: Add None checks before int() conversions
- providers/openai.py: Assert embedding_model is not None
- search/algorithms.py: Explicitly type doc_types set and cast values
- observability/logging_config.py: Match parent class signature (log_data)

Also fixed test_create_tag_creates_system_tag to match WebDAV implementation
(was testing OCS API endpoint, now tests correct WebDAV endpoint with
Content-Location header).

Type checker: 0 errors (down from 8), 20 warnings (ignored)
Tests: All 192 unit tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-08 01:09:02 +01:00
Chris Coutinho 92c4bf36f6 perf(news): use direct API endpoint for get_item()
Replace O(n) fetch-all-and-filter approach with O(1) direct API call.
The News API v1-3 supports GET /items/{id} for single-item retrieval.

- Update get_item() to use direct endpoint
- Add unit test for get_item() method
- Fixes critical performance issue identified in code review

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 17:22:51 +01:00
Chris Coutinho a5cb6e1242 refactor(news): simplify vector sync to fetch all items
Remove the complex starred+unread filtering logic in scan_news_items().
The News app's auto-purge feature (default: 200 items per feed) already
limits the total number of items, making explicit filtering unnecessary.

Changes:
- Replace two API calls (starred + unread) with single all-items call
- Remove deduplication logic that merged both lists
- Update docstring to explain the simpler approach

This reduces code complexity while maintaining the same effective coverage.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 15:05:34 +01:00
Chris Coutinho a33f6a2f15 feat(news): add Nextcloud News app integration
Add full integration for the Nextcloud News (RSS/Atom reader) app:

- Add NewsClient with complete CRUD operations for folders, feeds, and items
- Add 8 read-only MCP tools for listing/getting folders, feeds, items
- Add Pydantic models for News entities with camelCase alias support
- Add vector sync support for starred + unread items
- Add HTML to Markdown converter using markdownify for better embeddings
- Add Docker post-install hook to enable News app
- Add 25 unit tests for NewsClient API methods

Vector sync indexes starred and unread items, providing a balanced approach
that captures important (starred) and current (unread) content without
indexing the entire article history.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 14:39:31 +01:00
Chris Coutinho 92b97bda00 fix: Add rate limit retry logic to OpenAI provider
Add exponential backoff retry handling for OpenAI API rate limits
(429 errors). This is needed for GitHub Models API which has stricter
rate limits than standard OpenAI API.

- Add retry_on_rate_limit decorator with exponential backoff
- Max 5 retries with delays: 2s → 4s → 8s → 16s → 32s
- Apply to embed(), _embed_batch_request(), and generate() methods

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-23 17:24:48 +01:00
Chris Coutinho 5c73b85f65 fix: Increase MCP sampling timeout to 5 minutes for slower LLMs
- Increase sampling timeout from 30s to 300s in semantic.py to accommodate
  slower local LLMs like Ollama
- Refactor RAG integration tests to support multiple providers (ollama,
  openai, anthropic, bedrock)
- Remove unnecessary embedding_provider fixture since MCP server handles
  embeddings internally
- Add --provider flag via tests/integration/conftest.py
- Add provider_fixtures.py with factory functions for generation providers

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-23 05:43:48 +01:00
Chris Coutinho 4e9859117c fix: Share vector sync state with FastMCP session lifespan via module singleton
The refactor in fafeaf3 moved background tasks to Starlette server lifespan
but broke nc_get_vector_sync_status because it still looked for streams in
FastMCP's AppContext (lifespan_context).

Add VectorSyncState module singleton to bridge the lifespans:
- starlette_lifespan sets the singleton when starting background tasks
- app_lifespan_basic reads from singleton and includes in AppContext
- MCP tools can now access document_receive_stream for pending count

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-23 04:20:47 +01:00
Chris Coutinho fafeaf3d83 refactor: Move background tasks to server lifespan and deprecate SSE transport
- Move scanner/processor tasks from FastMCP session lifespan to Starlette
  server lifespan (correct architecture: background tasks run once at
  server level, not per-session)
- Change default CLI transport from SSE to streamable-http
- Remove SSE transport option from CLI (SSE is deprecated)
- Remove SSE client session factory from test fixtures
- Add tracing instrumentation to BM25 hybrid search operations for
  better observability

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-23 04:02:30 +01:00
Chris Coutinho 2ab8dad6a5 fix: Use WebDAV for tag creation and add LLM-as-a-judge for RAG tests
- Change create_tag() to use WebDAV POST instead of OCS API which
  returned 404 in some Nextcloud versions
- Add llm_judge() helper that evaluates system output against ground
  truth with simple TRUE/FALSE prompt
- Replace keyword-based assertions in RAG tests with LLM judge for
  more flexible semantic evaluation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-23 02:24:01 +01:00
Chris Coutinho 2896fa1dc9 feat: Add tag management methods to WebDAV client
- Add get_file_info() to get file info including file ID via PROPFIND
- Add create_tag() to create system tags via OCS API
- Add get_or_create_tag() for idempotent tag creation
- Add assign_tag_to_file() to assign tags to files via WebDAV
- Add remove_tag_from_file() to remove tags from files

Also refactors RAG evaluation:
- Add indexed_manual_pdf fixture using existing nc_client/nc_mcp_client
- Remove manual tag creation steps from workflow (now handled by fixture)
- Add comprehensive unit tests for new WebDAV methods

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-23 01:51:42 +01:00
Chris Coutinho 208365cd3d feat: Add OpenAI provider support for embeddings and generation
Adds OpenAI provider to the unified provider architecture (ADR-015),
supporting:
- OpenAI API (api.openai.com)
- GitHub Models API (models.github.ai/inference)
- OpenAI-compatible endpoints (Fireworks, Together, etc.)

Features:
- Embedding support with text-embedding-3-small/large models
- Text generation via chat completions API
- Automatic retry with exponential backoff for rate limits
- Provider auto-detection in registry (priority after Bedrock)

Environment variables:
- OPENAI_API_KEY: API key (required)
- OPENAI_BASE_URL: Base URL override (optional)
- OPENAI_EMBEDDING_MODEL: Embedding model (default: text-embedding-3-small)
- OPENAI_GENERATION_MODEL: Generation model (default: gpt-4o-mini)

Also adds:
- Integration tests for RAG pipeline with MCP sampling
- MCP client sampling support for integration tests
- Ground truth Q&A pairs for Nextcloud User Manual

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-23 00:33:32 +01:00
Chris Coutinho 03b984d5a7 fix(smithery): Enable JSON response format for scanner compatibility
The Smithery scanner was reporting "0 tools" despite the server returning
valid tool definitions. Root cause: the server was returning SSE-formatted
responses (event: message\ndata: {...}) which the scanner couldn't parse.

Changes:
- Add json_response=True to FastMCP for Smithery stateless mode
- Clean up verbose docstring examples in semantic.py and webdav.py

The MCP spec allows both SSE and plain JSON responses for HTTP transport.
Setting json_response=True returns Content-Type: application/json with
plain JSON-RPC instead of text/event-stream with SSE format.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 22:01:18 +01:00
Chris Coutinho ea79e94842 Merge pull request #343 from cbcoutinho/fix/vector-viz-search
perf: Optimize vector viz search performance
2025-11-22 19:53:40 +01:00
Chris Coutinho b0612cfa0f perf: Optimize vector viz search performance
- Replace sequential Qdrant scroll calls with batch retrieve
  (50 HTTP requests → 1 request, ~50x faster vector fetch)

- Add point_id to SearchResult to enable batch retrieval by Qdrant point ID

- Reuse query embedding from search algorithm in viz_routes
  (eliminates redundant embedding call, saves ~30ms)

- Make BM25 encode() async with thread pool to avoid blocking event loop
  (~4.4s was blocking, now properly async)

- Run PCA computation in thread pool to avoid blocking event loop
  (~1.2s was blocking, now properly async)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 19:47:43 +01:00
Chris Coutinho 3b41776110 Merge pull request #342 from cbcoutinho/feature/smithery
feat: Add Smithery stateless deployment support (ADR-016)
2025-11-22 19:39:53 +01:00
Chris Coutinho 39fba49cfe fix(smithery): Add JSON Schema metadata to mcp-config endpoint
Add proper JSON Schema metadata fields per Smithery documentation:
- $schema: JSON Schema draft-07
- $id: Schema identifier URL
- title: Human-readable title
- description: Schema description
- x-query-style: "flat" (no nested objects in our schema)
- additionalProperties: false

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 18:28:05 +01:00
Chris Coutinho 706a15f0bc fix(smithery): Use container runtime pattern for config discovery
ADR-016: For container runtime deployment, Smithery does not auto-generate
the .well-known/mcp-config endpoint like it does for Python CLI runtime.

Changes:
- Remove [tool.smithery] from pyproject.toml (not used in container mode)
- Remove smithery_server.py (Python CLI runtime specific)
- Add .well-known/mcp-config endpoint to return JSON Schema config
- Add SmitheryConfigMiddleware to extract config from URL query params
- Use ContextVar to pass session config to tool handlers

The container runtime passes config as URL query parameters to /mcp:
  GET /mcp?nextcloud_url=...&username=...&app_password=...

Tested:
- All 164 unit tests passing
- Docker container builds successfully
- .well-known/mcp-config returns valid JSON Schema
- Health endpoints working

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 18:22:55 +01:00
Chris Coutinho b8dc413b73 feat: Add Smithery CLI deployment support
- Add smithery package as dependency
- Create smithery_server.py with @smithery.server() decorator
- Add SmitheryConfigSchema for session config (nextcloud_url, username, app_password)
- Add [tool.smithery] section to pyproject.toml
- Remove manual .well-known/mcp-config endpoint (Smithery handles this)

Smithery CLI will automatically:
- Extract config schema from the decorated function
- Handle session config parsing from query parameters
- Make config accessible via ctx.session_config in tools

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 18:05:33 +01:00
Chris Coutinho 8d29ce0122 fix: Add Smithery lifespan and auth mode detection
- Add SmitheryAppContext dataclass for stateless mode
- Add app_lifespan_smithery() with minimal lifespan (no shared state)
- Update is_oauth_mode() to detect Smithery mode and return BasicAuth
- Use Smithery lifespan when SMITHERY_DEPLOYMENT=true
- Add .well-known/mcp-config endpoint for config discovery
- Skip document processors in Smithery mode (not enabled)

Fixes startup issues in Smithery mode where missing env credentials
would incorrectly trigger OAuth mode.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 17:48:53 +01:00
Chris Coutinho f93d650992 feat: Implement ADR-016 Smithery stateless deployment mode
Adds support for Smithery hosted deployment with stateless operation:

- Add DeploymentMode enum with SELF_HOSTED and SMITHERY_STATELESS modes
- Add get_deployment_mode() to detect mode from SMITHERY_DEPLOYMENT env var
- Update get_client() to create per-request clients from session config
- Add conditional tool registration (skip semantic search in Smithery mode)
- Add conditional /app admin UI mounting (skip in Smithery mode)
- Create smithery.yaml with configSchema for user credentials
- Create Dockerfile.smithery for minimal stateless container
- Create smithery_main.py entrypoint for Smithery deployment

In Smithery mode:
- Users provide nextcloud_url, username, app_password via session config
- Each request creates a fresh NextcloudClient (no state between requests)
- Semantic search tools are disabled (no vector database)
- Admin UI (/app) is disabled (no webhooks, vector viz)

All existing self-hosted functionality remains unchanged.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 17:30:42 +01:00
Chris Coutinho 34fd17ba55 fix: Use alpha_composite for proper RGBA highlight blending
Drawing directly with ImageDraw on RGBA mode doesn't blend alpha
properly. Use Image.alpha_composite() with a transparent overlay
to achieve correct semi-transparent highlight fills.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 17:04:29 +01:00
Chris Coutinho 8baa07db84 fix: Remove pymupdf.layout.activate() to fix page_chunks behavior
pymupdf.layout.activate() causes pymupdf4llm.to_markdown() to ignore the
page_chunks=True option, returning a single string instead of list[dict].
This broke per-page chunking needed for semantic search indexing.

See: https://github.com/pymupdf/pymupdf4llm/issues/323

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 16:58:35 +01:00
Chris Coutinho ba8a53803a refactor: Simplify PDF text extraction with single to_markdown call
Replace parallel per-page extraction with single to_markdown(page_chunks=True)
call. This is more efficient as pymupdf4llm can optimize internally for
full-document processing instead of making N separate calls for N pages.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 03:52:02 +01:00
Chris Coutinho 31fade9730 perf: Optimize PDF processing with parallel extraction and single-render highlights
Phase 1 - PDF Highlighting Optimization:
- Render each page ONCE instead of once per chunk (N chunks = 1 render, not N)
- Use PIL to draw bounding boxes on copied base images (fast) instead of
  re-rendering page via pymupdf (slow)
- Add _find_chunk_bbox() to extract bbox without modifying page

Phase 2 - Parallel Page Extraction:
- Use anyio task group with run_sync() for parallel page extraction
- Each page extracted in separate thread via anyio.to_thread.run_sync()
- Event loop stays responsive during extraction
- Remove obsolete _process_sync() method

Expected improvement: 30-50% reduction in total PDF processing time.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 03:11:56 +01:00
Chris Coutinho fffe483c02 fix: Centralize PDF processing and generate separate images per chunk
Previously, pymupdf4llm.to_markdown() was called twice - once in
PyMuPDFProcessor during indexing and again in PDFHighlighter during
visualization. Different image path lengths caused different character
offsets, leading to highlighted pages not matching their chunks.

Also fixed issue where all chunks on the same page showed all highlights
instead of just their own highlight. Now restores original page contents
between chunks using xref stream caching.

Changes:
- Add PDFHighlighter class requiring pre-computed page_boundaries and
  full_text from document processor (no fallback extraction)
- Pass pre-computed data from processor to highlighter
- Extract page-relative portion of chunk text for cross-page chunks
- Add bounding box highlighting using text anchor search
- Run highlight generation in parallel with embedding/BM25
- Cache and restore page contents to isolate highlights per chunk

Results: Highlighting success rate improved from 51% to 95% (121/128).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 02:46:30 +01:00
Chris Coutinho a62a007c87 feat: Add context expansion to semantic search with chunk overlap removal
Implements optional context expansion for semantic search results that
fetches adjacent chunks (N-1 and N+1) from Qdrant to provide before/after
context. Removes configurable chunk overlap (default 200 chars) to avoid
duplicate text appearing in both context and excerpt.

Key changes:
- Add include_context and context_chars parameters to nc_semantic_search
  and nc_semantic_search_answer tools
- Implement Qdrant cache fast path for chunk retrieval (avoids re-fetching
  and re-parsing documents, especially important for PDFs)
- Add _get_chunk_by_index_from_qdrant() to fetch adjacent chunks
- Remove chunk overlap from before_context (last N chars) and after_context
  (first N chars) to prevent duplicate text
- Fetch context in parallel with anyio.Semaphore (max 20 concurrent)
- Pass through page_number from SearchResult to SemanticSearchResult
- Remove document-level deduplication (keep chunk-level dedup from algorithm)

Context expansion is opt-in via include_context=true parameter. When enabled:
- Populates has_context_expansion, marked_text, before_context, after_context
- Adds truncation flags when context exceeds context_chars limit
- Falls back to document fetch for legacy data with truncated excerpts

Related: nextcloud_mcp_server/search/context.py:87-382,
         nextcloud_mcp_server/server/semantic.py:161-255
2025-11-21 01:02:22 +01:00
Chris Coutinho 5a251a99e6 fix: Set is_placeholder=False in processor to fix search filtering
The processor was not setting is_placeholder field when writing real
document chunks to Qdrant. This caused the placeholder filter to exclude
all documents (since None != False), resulting in 0 search results.

Now explicitly sets is_placeholder: False in payload when writing real
indexed chunks, allowing search filters to correctly distinguish between
placeholders and real documents.
2025-11-20 17:15:19 +01:00
Chris Coutinho 25ef33de7f feat: Use Ollama native batch API in embed_batch()
- Switch from sequential loop to /api/embed batch endpoint
- Use 'input' array parameter instead of individual 'prompt' requests
- Process in chunks of 32 to avoid quality degradation (issue #6262)
- Reduces HTTP overhead: 128 texts = 4 requests instead of 128
- Maintains backward compatibility with embed() for single embeddings

Ref: ollama/ollama#6262
2025-11-20 16:50:13 +01:00
Chris Coutinho ec2c274cd9 fix: Increase placeholder staleness threshold to 5x scan interval
- Changed from 2x (120s) to 5x (300s) scan interval
- Large PDFs take 3-4 minutes to process, need longer threshold
- Prevents premature requeuing of in-flight documents
2025-11-20 15:36:49 +01:00
Chris Coutinho 47f0b3db9a fix: Add placeholder staleness check to prevent duplicate processing
- Only requeue documents if placeholder is older than 2x scan interval (120s default)
- Prevents scanner from immediately requeuing in-flight documents
- Fixes issue where PDFs were being reprocessed every 60 seconds
- Staleness check applied to both notes and files scanning logic
2025-11-20 15:30:10 +01:00
Chris Coutinho 233de3508f fix: Use empty SparseVector instead of None for placeholders
Qdrant validation rejects None for sparse vectors in named vector dicts.
Use models.SparseVector(indices=[], values=[]) instead to create valid
empty sparse vectors for placeholder points.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 15:15:10 +01:00
Chris Coutinho 13b2d0048c feat: Implement Qdrant placeholder state management
Introduces a placeholder-based state tracking system to prevent duplicate
document processing during the gap between scanner queuing and processor
completion.

**Key Changes:**

1. **Placeholder Helper Functions** (`vector/placeholder.py`):
   - `write_placeholder_point()` - Creates zero-vector placeholder when queuing
   - `query_document_metadata()` - Queries for existing entry (placeholder or real)
   - `delete_placeholder_point()` - Removes placeholder before writing real vectors
   - `get_placeholder_filter()` - Filters placeholders from user-facing queries

2. **Scanner Updates** (`vector/scanner.py`):
   - Replace `indexed_at` comparison with `modified_at` comparison
   - Write placeholder before queuing each document
   - Query per-document metadata instead of bulk-querying indexed_at
   - Fixes bug where files were resubmitted every scan cycle

3. **Processor Updates** (`vector/processor.py`):
   - Delete placeholder before upserting real vectors
   - Ensures no duplicate points in Qdrant

4. **Query Filters** (all search files):
   - Add `get_placeholder_filter()` to all user-facing queries
   - Ensures placeholders never appear in search results or visualizations
   - Applied to: bm25_hybrid.py, semantic.py, viz_routes.py, algorithms.py

**Architecture:**
- Placeholders use zero vectors with dimension from embedding service
- Payload includes `is_placeholder: True` flag for filtering
- Status field tracks: "pending", "processing", "completed", "failed"
- Deterministic UUIDs using uuid5 for consistent point IDs

**Impact:**
- Eliminates duplicate processing of same documents
- Fixes race condition where long-running documents get queued multiple times
- Prevents scanner from resubmitting files every scan cycle
- Maintains clean separation between in-flight and indexed documents

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 15:04:00 +01:00
Chris Coutinho 944dd760ca fix: Return empty array instead of null for query_coords when no results
When vector visualization search returns zero results, the code was returning
query_coords: null, which caused JavaScript error "can't access property 0,
queryCoords is null" when the frontend tried to access the array.

Changed to return empty array [] to match expected type and prevent crash.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 14:18:02 +01:00
Chris Coutinho d67aa6ae5c fix: Align PDF text extraction between indexing and context expansion
This commit fixes two critical issues with PDF processing:

1. **Text extraction mismatch (context expansion bug)**:
   - Indexing used pymupdf4llm.to_markdown() producing markdown text
   - Context expansion used page.get_text() producing plain text
   - Different text formats caused character offset misalignment
   - Search would find correct chunk, but expansion showed wrong section
   - Fixed by making context.py use pymupdf4llm.to_markdown() consistently

2. **Diagnostic logging for page number assignment**:
   - Added logging to verify page_boundaries exist in metadata
   - Added logging to verify assign_page_numbers() assigns values
   - Helps diagnose why page numbers show as null in search results

3. **mime_type storage bug**:
   - Fixed incorrect field reference in processor.py:405
   - Was using file_metadata.get("content_type", "")
   - Should use content_type from WebDAV response

Changes:
- nextcloud_mcp_server/search/context.py: Use pymupdf4llm.to_markdown()
  for PDF text extraction to match indexing method
- nextcloud_mcp_server/vector/processor.py: Add diagnostic logging for
  page boundaries and assignment, fix mime_type storage
- tests/unit/client/test_webdav.py: Fix import sorting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 13:57:50 +01:00
Chris Coutinho f1a5fac1b9 fix: Update models and viz to use int-only doc_id
- algorithms.py: Revert SearchResult.id to int (all docs use int IDs now)
- semantic.py: Revert SemanticSearchResult.id to int, remove Union import
- viz_routes.py: Remove str() conversion when querying doc_id from Qdrant
- viz_routes.py: Convert doc_id from query param to int in chunk context

Fixes vector visualization which was collapsing all chunks to a single
point because Qdrant queries were failing to match doc_id (string vs int).
2025-11-20 12:32:27 +01:00
Chris Coutinho d0691d5aa0 feat: Switch files to use numeric IDs with file_path resolution
- scanner.py: Use file_info['id'] as doc_id instead of file_path
- scanner.py: Pass file_path in DocumentTask for content retrieval
- processor.py: Store file_path in Qdrant payload for later lookup
- context.py: Add _get_file_path_from_qdrant() to resolve file_id → file_path
- context.py: Update get_chunk_with_context() to handle file ID resolution

This makes the system resilient to file renames since file IDs are stable
identifiers in Nextcloud, while file paths can change.
2025-11-20 12:00:47 +01:00
Chris Coutinho f1610bbd2e fix: Reconstruct full content for notes to match indexed offsets
Notes are indexed as "{title}\n\n{content}" in processor.py but were
being retrieved as just content during chunk expansion, causing
chunk_start_offset and chunk_end_offset to be misaligned.

This fix reconstructs the full content structure when fetching notes
for chunk expansion, ensuring the displayed chunks match the excerpts
shown in search results.

Fixes chunk/excerpt mismatch reported in vector visualization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 11:33:12 +01:00
Chris Coutinho 327d843f64 feat: Implement per-chunk vector visualization with context expansion
Major improvements to vector visualization page:
- Refactor PCA to display individual chunks instead of averaged documents
- Add context expansion module for fetching surrounding text from notes and PDFs
- Update deduplication to use (doc_id, doc_type, chunk_start, chunk_end) keys
- Fix Alpine.js rendering with chunk-specific keys including offsets
- Refactor authentication helper to return NextcloudClient for better reuse
- Add async context manager support to NextcloudClient

Technical details:
- viz_routes.py: Fetch specific chunk vectors instead of averaging per document
- context.py: New module supporting both notes and PDF text extraction via PyMuPDF
- search algorithms: Extract page_number, chunk_index, total_chunks from Qdrant
- vector-viz.js/html: Use chunk positions in expansion tracking keys

This enables users to see which specific chunks match their query
and view them with surrounding context in the PCA visualization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 11:22:20 +01:00
Chris Coutinho b8010270c1 fix: Add async/await, PDF metadata, and type safety fixes
This commit addresses multiple issues with async operations, PDF metadata
extraction, and type safety in document processing and search.

## Async/Await Fixes
- processor.py:259 - Added await for chunker.chunk_text(content)
- processor.py:270 - Added await for bm25_service.encode_batch(chunk_texts)
- tests/unit/test_document_chunker.py - Converted all 12 test methods to async

## PDF Metadata Enhancement
- pymupdf.py:143 - Added file_size metadata extraction
- pymupdf.py:145-206 - Refactored to extract text page-by-page
  - Manually loop through pages instead of using page_chunks=True
  - Generate page_boundaries metadata for precise page tracking
  - Works around pymupdf.layout.activate() breaking page_chunks=True
- processor.py:32-66 - Added assign_page_numbers() helper function
  - Assigns page numbers to chunks based on overlap with page boundaries
  - Handles chunks spanning multiple pages
- processor.py:298-300 - Call assign_page_numbers() for PDF files

## Type Safety Fixes
- bm25_hybrid.py:184 - Removed int() conversion of doc_id
- semantic.py:131 - Removed int() conversion of doc_id
- viz_routes.py:275 - Removed int() conversion of doc_id
- Added comments documenting that doc_id can be int (notes) or str (file paths)

## Testing
- All 18 tests passing (12 unit + 6 integration)
- No type errors in modified files
- Container logs show successful processing
- Vector viz searches working correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 02:37:07 +01:00
Chris Coutinho c4ce28f05d fix: Improve 3D plot rendering with explicit dimensions and window resize support
- Get container dimensions before creating Plotly layout to render at correct size immediately
- Add init() method with window resize listener for responsive plot sizing
- Remove post-render resize call (no longer needed with explicit dimensions)
- Improve colorbar positioning and scene domain configuration

This eliminates the visual "jump" during initial render and ensures the plot resizes smoothly when the browser window changes size.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 19:43:20 +01:00
Chris Coutinho c126c3ec03 fix: Preserve 3D plot camera and improve documentation
This commit addresses PR feedback and fixes plot camera behavior.

## JavaScript Fix - Camera Preservation
- Changed plot update strategy from recreating layout to using Plotly.restyle()
- Query point visibility now toggles via restyle() which only modifies trace visibility
- Camera position/zoom naturally preserved since layout remains untouched
- Resolves jumpy plot behavior when toggling "Show Query Point" checkbox

Related: nextcloud_mcp_server/auth/static/vector-viz.js:58-73

## Documentation Improvements
- Condensed vector-sync-ui.md from 316 to 94 lines (~70% reduction)
- Removed redundant FAQ section (content merged into main sections)
- Simplified use cases from 4 detailed sections to 3 focused paragraphs
- Streamlined troubleshooting to 3 common issues
- Merged technical details into overview section
- Retained all essential information while improving readability

## Screenshot Updates
Removed old/outdated images (5 files):
- rag-workflow-bidirectional-final.png
- rag-workflow-prominent-llm.png
- rag-workflow-simple-final.png
- vector-viz-interface.png
- welcome-page.png

Replaced with current screenshots (3 files):
- vector-viz-document-types-2col.png - Now shows plot + results
- vector-viz-chunk-context.png - Centered content view
- vector-viz-results.png - Updated results list

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 14:10:53 +01:00