Files
nextcloud-mcp-server/nextcloud_mcp_server/vector/document_chunker.py
T
Chris Coutinho 8f45e996e8 feat: implement vector sync scanner and processor (ADR-007 Phase 2)
Implements background vector database synchronization using anyio
TaskGroups for BasicAuth mode with single-user credentials.

Scanner Implementation:
- Periodic document discovery (hourly, configurable)
- Timestamp-based change detection (Nextcloud vs Qdrant)
- Wake event for immediate scanning on-demand
- Supports both initial sync (all docs) and incremental sync (changes only)
- Detects deleted documents and queues for removal

Processor Implementation:
- Concurrent document processing pool (3 workers default)
- I/O-bound embedding generation via Ollama API
- Retry logic with exponential backoff (3 retries)
- Document chunking (512 words, 50-word overlap)
- Handles both index and delete operations
- Upserts vectors to Qdrant with rich metadata

App Lifespan Integration:
- Extended AppContext with background task state
- Modified app_lifespan_basic() to start tasks via anyio TaskGroups
- Graceful shutdown with coordinated task cancellation
- Only activates when VECTOR_SYNC_ENABLED=true

Embedding Service:
- OllamaEmbeddingProvider with TLS support
- Singleton pattern for shared client instances
- Batch embedding support for efficiency
- Auto-detects embedding dimension (768 for nomic-embed-text)

Qdrant Client:
- Async client wrapper with singleton pattern
- Auto-creates collection on first use
- COSINE distance metric for semantic similarity
- Integrates with embedding service for dimension detection

Health Check Enhancement:
- Added Qdrant status check to /health/ready endpoint
- Only checks when VECTOR_SYNC_ENABLED=true
- 2-second timeout for health probe
- Reports connection errors with details

Configuration:
- VECTOR_SYNC_ENABLED: Enable background sync
- VECTOR_SYNC_SCAN_INTERVAL: Scanner frequency (3600s default)
- VECTOR_SYNC_PROCESSOR_WORKERS: Concurrent processors (3 default)
- QDRANT_URL, QDRANT_API_KEY, QDRANT_COLLECTION: Vector DB config
- OLLAMA_BASE_URL, OLLAMA_EMBEDDING_MODEL: Embedding service config

Dependencies Added:
- qdrant-client>=1.7.0: Vector database client

Docker Compose:
- Added Qdrant service with health check
- Exposed ports 6333 (REST) and 6334 (gRPC)
- Configured MCP service with vector sync environment
- Added qdrant-data volume for persistence

Known Issue:
- FastMCP lifespan not triggering for streamable-http transport
- Background tasks will start once lifespan integration is complete
- Lifespan triggers on MCP session establishment, not server startup

Related: ADR-007 Background Vector Database Synchronization

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 21:14:38 +01:00

52 lines
1.4 KiB
Python

"""Document chunking for large texts."""
import logging
logger = logging.getLogger(__name__)
class DocumentChunker:
"""Chunk large documents for optimal embedding."""
def __init__(self, chunk_size: int = 512, overlap: int = 50):
"""
Initialize document chunker.
Args:
chunk_size: Number of words per chunk (default: 512)
overlap: Number of overlapping words between chunks (default: 50)
"""
self.chunk_size = chunk_size
self.overlap = overlap
def chunk_text(self, content: str) -> list[str]:
"""
Split text into overlapping chunks.
Uses simple word-based chunking with configurable overlap to preserve
context across chunk boundaries.
Args:
content: Text content to chunk
Returns:
List of text chunks (may be single item if content is small)
"""
# Simple word-based chunking
words = content.split()
if len(words) <= self.chunk_size:
return [content]
chunks = []
start = 0
while start < len(words):
end = start + self.chunk_size
chunk_words = words[start:end]
chunks.append(" ".join(chunk_words))
start = end - self.overlap
logger.debug(f"Chunked document into {len(chunks)} chunks ({len(words)} words)")
return chunks