feat(vector): Add configurable chunk size and overlap for document embedding
Enable users to tune document chunking parameters to match their embedding model and content type by adding DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP environment variables. - **config.py**: Added `document_chunk_size` (default: 512) and `document_chunk_overlap` (default: 50) configuration fields with validation: - Ensures overlap < chunk_size - Warns if chunk_size < 100 words - Prevents negative overlap values - **processor.py**: Updated DocumentChunker instantiation to use config settings instead of hardcoded values (line 174-177) - **tests/unit/test_config.py**: Added TestChunkConfigValidation class with 9 tests covering: - Default values - Valid configurations - Validation errors (overlap >= chunk_size, negative overlap) - Warning for small chunk sizes - Environment variable loading - **docs/configuration.md**: Added comprehensive "Document Chunking Configuration" section with: - Chunk size selection guidance (256-384 vs 512 vs 768-1024 words) - Overlap recommendations (10-20% of chunk size) - Configuration examples for different use cases - Added env vars to reference table - **docs/semantic-search-architecture.md**: Added "Document Chunking Strategy" section with: - Chunking process explanation - Example showing sliding window behavior - Search behavior with chunks - Tuning recommendations - **env.sample**: Added complete "Semantic Search & Vector Sync Configuration" section with: - Vector sync settings - Qdrant configuration (3 modes) - Ollama embedding service - Document chunking configuration - **docker-compose.yml**: Added commented examples for DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP with usage notes \`\`\`bash DOCUMENT_CHUNK_SIZE=512 DOCUMENT_CHUNK_OVERLAP=50 \`\`\` 1. \`overlap\` must be less than \`chunk_size\` 2. \`overlap\` cannot be negative 3. Warning issued if \`chunk_size\` < 100 words **Precise matching** (small notes, specific queries): \`\`\`bash DOCUMENT_CHUNK_SIZE=256 DOCUMENT_CHUNK_OVERLAP=25 \`\`\` **Balanced** (default, general purpose): \`\`\`bash DOCUMENT_CHUNK_SIZE=512 DOCUMENT_CHUNK_OVERLAP=50 \`\`\` **Contextual** (long documents, broader topics): \`\`\`bash DOCUMENT_CHUNK_SIZE=1024 DOCUMENT_CHUNK_OVERLAP=100 \`\`\` ✅ **User control** - Tune chunking to match embedding model capabilities ✅ **Experimentation** - Test different chunk sizes for optimal results ✅ **Model alignment** - Match chunk size to embedding context window ✅ **Backward compatible** - Defaults maintain existing behavior ✅ **Well validated** - Comprehensive tests prevent misconfiguration All 22 config validation tests pass (9 new tests for chunking): - Default values work correctly - Validation prevents invalid configurations - Environment variables load properly - Warning system works as expected With configurable chunk sizes, users can now experiment with different Ollama embedding models and tune chunk parameters for optimal semantic search quality. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
+72
@@ -124,3 +124,75 @@ ENABLE_CUSTOM_PROCESSOR=false
|
||||
|
||||
# Comma-separated MIME types your processor supports
|
||||
#CUSTOM_PROCESSOR_TYPES=application/pdf,image/jpeg,image/png
|
||||
|
||||
# ============================================
|
||||
# Semantic Search & Vector Sync Configuration
|
||||
# ============================================
|
||||
# EXPERIMENTAL: Semantic search for Notes app (multi-app support planned)
|
||||
# Requires: Qdrant vector database + Ollama embedding service
|
||||
# Disabled by default
|
||||
|
||||
# Enable background vector indexing
|
||||
VECTOR_SYNC_ENABLED=false
|
||||
|
||||
# Document scan interval in seconds (default: 300 = 5 minutes)
|
||||
# How often to check for new/updated documents
|
||||
#VECTOR_SYNC_SCAN_INTERVAL=300
|
||||
|
||||
# Concurrent indexing workers (default: 3)
|
||||
# Number of parallel workers for embedding generation
|
||||
#VECTOR_SYNC_PROCESSOR_WORKERS=3
|
||||
|
||||
# Max queued documents (default: 10000)
|
||||
# Maximum documents waiting to be processed
|
||||
#VECTOR_SYNC_QUEUE_MAX_SIZE=10000
|
||||
|
||||
# ============================================
|
||||
# Qdrant Vector Database Configuration
|
||||
# ============================================
|
||||
# Choose ONE of three modes:
|
||||
# 1. In-memory mode (default): Set neither QDRANT_URL nor QDRANT_LOCATION
|
||||
# 2. Persistent local: Set QDRANT_LOCATION=/path/to/data
|
||||
# 3. Network mode: Set QDRANT_URL=http://qdrant:6333
|
||||
|
||||
# Network mode: URL to Qdrant service
|
||||
#QDRANT_URL=http://qdrant:6333
|
||||
|
||||
# Local mode: Path to store vectors (use :memory: for in-memory)
|
||||
#QDRANT_LOCATION=:memory:
|
||||
|
||||
# API key for network mode (optional)
|
||||
#QDRANT_API_KEY=
|
||||
|
||||
# Collection name (optional - auto-generated if not set)
|
||||
# Auto-generation format: {deployment-id}-{model-name}
|
||||
# Allows safe model switching and multi-server deployments
|
||||
#QDRANT_COLLECTION=nextcloud_content
|
||||
|
||||
# ============================================
|
||||
# Ollama Embedding Service Configuration
|
||||
# ============================================
|
||||
# Ollama endpoint for embeddings (if not set, uses SimpleEmbeddingProvider fallback)
|
||||
#OLLAMA_BASE_URL=http://ollama:11434
|
||||
|
||||
# Embedding model to use (default: nomic-embed-text, 768 dimensions)
|
||||
# Changing this creates a new collection (requires re-embedding all documents)
|
||||
#OLLAMA_EMBEDDING_MODEL=nomic-embed-text
|
||||
|
||||
# Verify SSL certificates (default: true)
|
||||
#OLLAMA_VERIFY_SSL=true
|
||||
|
||||
# ============================================
|
||||
# Document Chunking Configuration
|
||||
# ============================================
|
||||
# Configure how documents are split before embedding
|
||||
|
||||
# Words per chunk (default: 512)
|
||||
# Smaller chunks (256-384): More precise, less context, more storage
|
||||
# Larger chunks (768-1024): More context, less precise, less storage
|
||||
#DOCUMENT_CHUNK_SIZE=512
|
||||
|
||||
# Overlapping words between chunks (default: 50)
|
||||
# Recommended: 10-20% of chunk size
|
||||
# Preserves context across chunk boundaries
|
||||
#DOCUMENT_CHUNK_OVERLAP=50
|
||||
|
||||
Reference in New Issue
Block a user