8f45e996e8
Implements background vector database synchronization using anyio TaskGroups for BasicAuth mode with single-user credentials.

Scanner Implementation:
- Periodic document discovery (hourly, configurable)
- Timestamp-based change detection (Nextcloud vs Qdrant)
- Wake event for immediate scanning on-demand
- Supports both initial sync (all docs) and incremental sync (changes only)
- Detects deleted documents and queues for removal

Processor Implementation:
- Concurrent document processing pool (3 workers default)
- I/O-bound embedding generation via Ollama API
- Retry logic with exponential backoff (3 retries)
- Document chunking (512 words, 50-word overlap)
- Handles both index and delete operations
- Upserts vectors to Qdrant with rich metadata

App Lifespan Integration:
- Extended AppContext with background task state
- Modified app_lifespan_basic() to start tasks via anyio TaskGroups
- Graceful shutdown with coordinated task cancellation
- Only activates when VECTOR_SYNC_ENABLED=true

Embedding Service:
- OllamaEmbeddingProvider with TLS support
- Singleton pattern for shared client instances
- Batch embedding support for efficiency
- Auto-detects embedding dimension (768 for nomic-embed-text)

Qdrant Client:
- Async client wrapper with singleton pattern
- Auto-creates collection on first use
- COSINE distance metric for semantic similarity
- Integrates with embedding service for dimension detection

Health Check Enhancement:
- Added Qdrant status check to /health/ready endpoint
- Only checks when VECTOR_SYNC_ENABLED=true
- 2-second timeout for health probe
- Reports connection errors with details

Configuration:
- VECTOR_SYNC_ENABLED: Enable background sync
- VECTOR_SYNC_SCAN_INTERVAL: Scanner frequency (3600s default)
- VECTOR_SYNC_PROCESSOR_WORKERS: Concurrent processors (3 default)
- QDRANT_URL, QDRANT_API_KEY, QDRANT_COLLECTION: Vector DB config
- OLLAMA_BASE_URL, OLLAMA_EMBEDDING_MODEL: Embedding service config

Dependencies Added:
- qdrant-client>=1.7.0: Vector database client

Docker Compose:
- Added Qdrant service with health check
- Exposed ports 6333 (REST) and 6334 (gRPC)
- Configured MCP service with vector sync environment
- Added qdrant-data volume for persistence

Known Issue:
- FastMCP lifespan not triggering for streamable-http transport
- Background tasks will start once lifespan integration is complete
- Lifespan triggers on MCP session establishment, not server startup

Related: ADR-007 Background Vector Database Synchronization

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
52 lines
1.4 KiB
Python
"""Document chunking for large texts."""

import logging

logger = logging.getLogger(__name__)


class DocumentChunker:
    """Chunk large documents for optimal embedding."""

    def __init__(self, chunk_size: int = 512, overlap: int = 50):
        """
        Initialize document chunker.

        Args:
            chunk_size: Number of words per chunk (default: 512)
            overlap: Number of overlapping words between chunks (default: 50)

        Raises:
            ValueError: If overlap is negative or not smaller than chunk_size
        """
        # The window advances by chunk_size - overlap words per step, so that
        # step must be positive or chunk_text() would never terminate.
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk_text(self, content: str) -> list[str]:
        """
        Split text into overlapping chunks.

        Uses simple word-based chunking with configurable overlap to preserve
        context across chunk boundaries.

        Args:
            content: Text content to chunk

        Returns:
            List of text chunks (may be single item if content is small)
        """
        # Simple word-based chunking: split on any run of whitespace
        words = content.split()

        if len(words) <= self.chunk_size:
            return [content]

        chunks = []
        start = 0

        while start < len(words):
            end = start + self.chunk_size
            chunk_words = words[start:end]
            chunks.append(" ".join(chunk_words))
            if end >= len(words):
                # The last word is covered; another pass would only emit a
                # tail made up entirely of already-chunked overlap words.
                break
            start = end - self.overlap

        logger.debug(f"Chunked document into {len(chunks)} chunks ({len(words)} words)")
        return chunks
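
The window arithmetic behind `chunk_text` can be checked in isolation. The following is a standalone sketch, not part of the module: `sliding_windows` is a hypothetical helper written only for illustration, and it assumes the same defaults as the commit message (512-word chunks, 50-word overlap). Each window starts `chunk_size - overlap` words after the previous one, so consecutive windows share exactly `overlap` words.

```python
def sliding_windows(
    words: list[str], chunk_size: int = 512, overlap: int = 50
) -> list[list[str]]:
    """Return overlapping word windows (hypothetical helper, illustration only)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window start advances
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            # The final word is covered, so stop before emitting a
            # window that contains only overlap words.
            break
    return windows


if __name__ == "__main__":
    words = [f"w{i}" for i in range(1000)]
    windows = sliding_windows(words)

    # Window starts are 0, 462, 924 for a 1000-word input.
    print([len(w) for w in windows])            # [512, 512, 76]
    # The last 50 words of one window open the next window.
    print(windows[0][-50:] == windows[1][:50])  # True
```

With the defaults, the effective stride is 462 words, so a document of N words (N > 512) yields roughly N / 462 chunks; a larger overlap preserves more context at chunk boundaries at the cost of more embeddings per document.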