feat: Add context expansion to semantic search with chunk overlap removal

Implements optional context expansion for semantic search results that
fetches adjacent chunks (N-1 and N+1) from Qdrant to provide before/after
context. Trims the chunk overlap (configurable, default 200 chars) from
the context so that no text appears in both the context and the excerpt.

Key changes:
- Add include_context and context_chars parameters to nc_semantic_search
  and nc_semantic_search_answer tools
- Implement Qdrant cache fast path for chunk retrieval (avoids re-fetching
  and re-parsing documents, especially important for PDFs)
- Add _get_chunk_by_index_from_qdrant() to fetch adjacent chunks
- Remove chunk overlap from before_context (last N chars) and after_context
  (first N chars) to prevent duplicate text
- Fetch context in parallel with anyio.Semaphore (max 20 concurrent)
- Pass through page_number from SearchResult to SemanticSearchResult
- Remove document-level deduplication (keep chunk-level dedup from algorithm)
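
The bounded parallel fetch described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: it uses the standard library's asyncio.Semaphore as a stand-in for anyio.Semaphore, and fetch_chunk, neighbour_requests, and fetch_adjacent are hypothetical names (the real lookup is _get_chunk_by_index_from_qdrant(), whose signature is not shown here).

```python
import asyncio
from typing import List, Tuple

MAX_CONCURRENT = 20  # concurrency limit named in the commit message


async def fetch_chunk(doc_id: str, chunk_index: int) -> str:
    """Hypothetical stand-in for fetching one chunk from Qdrant."""
    await asyncio.sleep(0)  # simulate an I/O round trip
    return f"{doc_id}#{chunk_index}"


def neighbour_requests(doc_id: str, chunk_index: int) -> List[Tuple[str, int]]:
    """Return requests for chunks N-1 and N+1, skipping a negative index."""
    candidates = [chunk_index - 1, chunk_index + 1]
    return [(doc_id, i) for i in candidates if i >= 0]


async def fetch_adjacent(
    requests: List[Tuple[str, int]], max_concurrent: int = MAX_CONCURRENT
) -> List[str]:
    """Fetch all requested chunks, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(doc_id: str, chunk_index: int) -> str:
        async with semaphore:  # waits while max_concurrent fetches are running
            return await fetch_chunk(doc_id, chunk_index)

    # gather returns results in the same order as the requests
    return await asyncio.gather(*(bounded(d, i) for d, i in requests))
```

A caller would build the request list for each search hit with neighbour_requests() and pass the combined list to fetch_adjacent(), so one slow lookup does not serialize the rest.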

Context expansion is opt-in via the include_context=true parameter. When enabled:
- Populates has_context_expansion, marked_text, before_context, after_context
- Adds truncation flags when context exceeds context_chars limit
- Falls back to document fetch for legacy data with truncated excerpts
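
As a rough sketch of how overlap trimming and the context_chars limit might combine, assuming the previous chunk's tail and the next chunk's head duplicate the excerpt: the before_context and after_context names come from this commit, while ContextWindow, build_context, and the before_truncated/after_truncated flag names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContextWindow:
    before_context: str
    after_context: str
    before_truncated: bool  # hypothetical flag name
    after_truncated: bool   # hypothetical flag name


def build_context(
    prev_chunk: Optional[str],
    next_chunk: Optional[str],
    context_chars: int,
    overlap: int = 200,  # default overlap named in the commit message
) -> ContextWindow:
    before = prev_chunk or ""
    after = next_chunk or ""

    # Drop the overlapping region: the last `overlap` chars of chunk N-1
    # and the first `overlap` chars of chunk N+1 repeat text in chunk N.
    before = before[: max(0, len(before) - overlap)]
    after = after[overlap:]

    # Enforce the limit, keeping the text closest to the excerpt.
    before_truncated = len(before) > context_chars
    after_truncated = len(after) > context_chars
    if before_truncated:
        before = before[len(before) - context_chars:]
    if after_truncated:
        after = after[:context_chars]

    return ContextWindow(before, after, before_truncated, after_truncated)
```

Slicing with explicit lengths rather than negative indices avoids the edge case where an overlap or limit of 0 would otherwise select the wrong end of the string.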

Related: nextcloud_mcp_server/search/context.py:87-382,
         nextcloud_mcp_server/server/semantic.py:161-255

Chris Coutinho

2025-11-21 01:02:22 +01:00

parent 5a251a99e6

commit a62a007c87

5 changed files with 359 additions and 19 deletions

									
										nextcloud_mcp_server/document_processors/pymupdf.py
									
		+1
		-1
	
												View File
												
@@ -8,12 +8,12 @@ from typing import Any, Optional
 import pymupdf
 import pymupdf.layout
-import pymupdf4llm
 from .base import DocumentProcessor, ProcessingResult, ProcessorError
 # Activate layout analysis for better text extraction
 pymupdf.layout.activate()
+import pymupdf4llm  # noqa
 logger = logging.getLogger(__name__)