feat(vector): Add configurable chunk size and overlap for document embedding
Enable users to tune document chunking parameters to match their embedding model and content type by adding DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP environment variables. - **config.py**: Added `document_chunk_size` (default: 512) and `document_chunk_overlap` (default: 50) configuration fields with validation: - Ensures overlap < chunk_size - Warns if chunk_size < 100 words - Prevents negative overlap values - **processor.py**: Updated DocumentChunker instantiation to use config settings instead of hardcoded values (line 174-177) - **tests/unit/test_config.py**: Added TestChunkConfigValidation class with 9 tests covering: - Default values - Valid configurations - Validation errors (overlap >= chunk_size, negative overlap) - Warning for small chunk sizes - Environment variable loading - **docs/configuration.md**: Added comprehensive "Document Chunking Configuration" section with: - Chunk size selection guidance (256-384 vs 512 vs 768-1024 words) - Overlap recommendations (10-20% of chunk size) - Configuration examples for different use cases - Added env vars to reference table - **docs/semantic-search-architecture.md**: Added "Document Chunking Strategy" section with: - Chunking process explanation - Example showing sliding window behavior - Search behavior with chunks - Tuning recommendations - **env.sample**: Added complete "Semantic Search & Vector Sync Configuration" section with: - Vector sync settings - Qdrant configuration (3 modes) - Ollama embedding service - Document chunking configuration - **docker-compose.yml**: Added commented examples for DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP with usage notes \`\`\`bash DOCUMENT_CHUNK_SIZE=512 DOCUMENT_CHUNK_OVERLAP=50 \`\`\` 1. \`overlap\` must be less than \`chunk_size\` 2. \`overlap\` cannot be negative 3. Warning issued if \`chunk_size\` < 100 words **Precise matching** (small notes, specific queries): \`\`\`bash DOCUMENT_CHUNK_SIZE=256 DOCUMENT_CHUNK_OVERLAP=25 \`\`\` **Balanced** (default, general purpose): \`\`\`bash DOCUMENT_CHUNK_SIZE=512 DOCUMENT_CHUNK_OVERLAP=50 \`\`\` **Contextual** (long documents, broader topics): \`\`\`bash DOCUMENT_CHUNK_SIZE=1024 DOCUMENT_CHUNK_OVERLAP=100 \`\`\` ✅ **User control** - Tune chunking to match embedding model capabilities ✅ **Experimentation** - Test different chunk sizes for optimal results ✅ **Model alignment** - Match chunk size to embedding context window ✅ **Backward compatible** - Defaults maintain existing behavior ✅ **Well validated** - Comprehensive tests prevent misconfiguration All 22 config validation tests pass (9 new tests for chunking): - Default values work correctly - Validation prevents invalid configurations - Environment variables load properly - Warning system works as expected With configurable chunk sizes, users can now experiment with different Ollama embedding models and tune chunk parameters for optimal semantic search quality. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
+8
-4
@@ -107,10 +107,14 @@ services:
|
||||
- QDRANT_COLLECTION=nextcloud_content
|
||||
|
||||
# Ollama configuration (optional - uses SimpleEmbeddingProvider if not set)
|
||||
#- OLLAMA_BASE_URL=https://ollama.internal.coutinho.io:443
|
||||
#- OLLAMA_EMBEDDING_MODEL=nomic-embed-text # Changing this creates new collection
|
||||
#- OLLAMA_EMBEDDING_MODEL=embeddinggemma:300m
|
||||
#- OLLAMA_VERIFY_SSL=false
|
||||
# - OLLAMA_BASE_URL=https://ollama.internal.coutinho.io:443
|
||||
# - OLLAMA_EMBEDDING_MODEL=nomic-embed-text # Changing this creates new collection
|
||||
# - OLLAMA_VERIFY_SSL=false
|
||||
|
||||
# Document chunking configuration (for vector embeddings)
|
||||
# Tune these based on your embedding model and content type
|
||||
# - DOCUMENT_CHUNK_SIZE=512 # Words per chunk (default: 512)
|
||||
# - DOCUMENT_CHUNK_OVERLAP=50 # Overlapping words (default: 50, recommended: 10-20% of chunk size)
|
||||
|
||||
mcp-oauth:
|
||||
build: .
|
||||
|
||||
Reference in New Issue
Block a user