feat(vector): Add configurable chunk size and overlap for document embedding
Enable users to tune document chunking parameters to match their embedding model and content type by adding DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP environment variables. - **config.py**: Added `document_chunk_size` (default: 512) and `document_chunk_overlap` (default: 50) configuration fields with validation: - Ensures overlap < chunk_size - Warns if chunk_size < 100 words - Prevents negative overlap values - **processor.py**: Updated DocumentChunker instantiation to use config settings instead of hardcoded values (line 174-177) - **tests/unit/test_config.py**: Added TestChunkConfigValidation class with 9 tests covering: - Default values - Valid configurations - Validation errors (overlap >= chunk_size, negative overlap) - Warning for small chunk sizes - Environment variable loading - **docs/configuration.md**: Added comprehensive "Document Chunking Configuration" section with: - Chunk size selection guidance (256-384 vs 512 vs 768-1024 words) - Overlap recommendations (10-20% of chunk size) - Configuration examples for different use cases - Added env vars to reference table - **docs/semantic-search-architecture.md**: Added "Document Chunking Strategy" section with: - Chunking process explanation - Example showing sliding window behavior - Search behavior with chunks - Tuning recommendations - **env.sample**: Added complete "Semantic Search & Vector Sync Configuration" section with: - Vector sync settings - Qdrant configuration (3 modes) - Ollama embedding service - Document chunking configuration - **docker-compose.yml**: Added commented examples for DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP with usage notes \`\`\`bash DOCUMENT_CHUNK_SIZE=512 DOCUMENT_CHUNK_OVERLAP=50 \`\`\` 1. \`overlap\` must be less than \`chunk_size\` 2. \`overlap\` cannot be negative 3. Warning issued if \`chunk_size\` < 100 words **Precise matching** (small notes, specific queries): \`\`\`bash DOCUMENT_CHUNK_SIZE=256 DOCUMENT_CHUNK_OVERLAP=25 \`\`\` **Balanced** (default, general purpose): \`\`\`bash DOCUMENT_CHUNK_SIZE=512 DOCUMENT_CHUNK_OVERLAP=50 \`\`\` **Contextual** (long documents, broader topics): \`\`\`bash DOCUMENT_CHUNK_SIZE=1024 DOCUMENT_CHUNK_OVERLAP=100 \`\`\` ✅ **User control** - Tune chunking to match embedding model capabilities ✅ **Experimentation** - Test different chunk sizes for optimal results ✅ **Model alignment** - Match chunk size to embedding context window ✅ **Backward compatible** - Defaults maintain existing behavior ✅ **Well validated** - Comprehensive tests prevent misconfiguration All 22 config validation tests pass (9 new tests for chunking): - Default values work correctly - Validation prevents invalid configurations - Environment variables load properly - Warning system works as expected With configurable chunk sizes, users can now experiment with different Ollama embedding models and tune chunk parameters for optimal semantic search quality. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
@@ -174,6 +174,10 @@ class Settings:
|
||||
ollama_embedding_model: str = "nomic-embed-text"
|
||||
ollama_verify_ssl: bool = True
|
||||
|
||||
# Document chunking settings (for vector embeddings)
|
||||
document_chunk_size: int = 512 # Words per chunk
|
||||
document_chunk_overlap: int = 50 # Overlapping words between chunks
|
||||
|
||||
# Observability settings
|
||||
metrics_enabled: bool = True
|
||||
metrics_port: int = 9090
|
||||
@@ -209,6 +213,25 @@ class Settings:
|
||||
"API key is only relevant for network mode and will be ignored."
|
||||
)
|
||||
|
||||
# Validate chunking configuration
|
||||
if self.document_chunk_overlap >= self.document_chunk_size:
|
||||
raise ValueError(
|
||||
f"DOCUMENT_CHUNK_OVERLAP ({self.document_chunk_overlap}) must be less than "
|
||||
f"DOCUMENT_CHUNK_SIZE ({self.document_chunk_size}). "
|
||||
f"Overlap should be 10-20% of chunk size for optimal results."
|
||||
)
|
||||
|
||||
if self.document_chunk_size < 100:
|
||||
logger.warning(
|
||||
f"DOCUMENT_CHUNK_SIZE is set to {self.document_chunk_size} words, which is quite small. "
|
||||
f"Smaller chunks may lose context. Consider using at least 256 words."
|
||||
)
|
||||
|
||||
if self.document_chunk_overlap < 0:
|
||||
raise ValueError(
|
||||
f"DOCUMENT_CHUNK_OVERLAP ({self.document_chunk_overlap}) cannot be negative."
|
||||
)
|
||||
|
||||
def get_collection_name(self) -> str:
|
||||
"""
|
||||
Get Qdrant collection name.
|
||||
@@ -305,6 +328,9 @@ def get_settings() -> Settings:
|
||||
ollama_base_url=os.getenv("OLLAMA_BASE_URL"),
|
||||
ollama_embedding_model=os.getenv("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text"),
|
||||
ollama_verify_ssl=os.getenv("OLLAMA_VERIFY_SSL", "true").lower() == "true",
|
||||
# Document chunking settings
|
||||
document_chunk_size=int(os.getenv("DOCUMENT_CHUNK_SIZE", "512")),
|
||||
document_chunk_overlap=int(os.getenv("DOCUMENT_CHUNK_OVERLAP", "50")),
|
||||
# Observability settings
|
||||
metrics_enabled=os.getenv("METRICS_ENABLED", "true").lower() == "true",
|
||||
metrics_port=int(os.getenv("METRICS_PORT", "9090")),
|
||||
|
||||
@@ -170,8 +170,11 @@ async def _index_document(
|
||||
else:
|
||||
raise ValueError(f"Unsupported doc_type: {doc_task.doc_type}")
|
||||
|
||||
# Tokenize and chunk
|
||||
chunker = DocumentChunker(chunk_size=512, overlap=50)
|
||||
# Tokenize and chunk (using configured chunk size and overlap)
|
||||
chunker = DocumentChunker(
|
||||
chunk_size=settings.document_chunk_size,
|
||||
overlap=settings.document_chunk_overlap,
|
||||
)
|
||||
chunks = chunker.chunk_text(content)
|
||||
|
||||
# Generate embeddings (I/O bound - external API call)
|
||||
|
||||
Reference in New Issue
Block a user