feat(vector): Add configurable chunk size and overlap for document embedding

Enable users to tune document chunking parameters to match their embedding
model and content type by adding DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP
environment variables.

- **config.py**: Added `document_chunk_size` (default: 512) and
  `document_chunk_overlap` (default: 50) configuration fields with validation:
  - Ensures overlap < chunk_size
  - Warns if chunk_size < 100 words
  - Prevents negative overlap values

- **processor.py**: Updated DocumentChunker instantiation to use config
  settings instead of hardcoded values (lines 174-177)

- **tests/unit/test_config.py**: Added TestChunkConfigValidation class with
  9 tests covering:
  - Default values
  - Valid configurations
  - Validation errors (overlap >= chunk_size, negative overlap)
  - Warning for small chunk sizes
  - Environment variable loading

- **docs/configuration.md**: Added comprehensive "Document Chunking
  Configuration" section and added the new env vars to the reference table.
  The section covers:
  - Chunk size selection guidance (256-384 vs 512 vs 768-1024 words)
  - Overlap recommendations (10-20% of chunk size)
  - Configuration examples for different use cases

- **docs/semantic-search-architecture.md**: Added "Document Chunking Strategy"
  section with:
  - Chunking process explanation
  - Example showing sliding window behavior
  - Search behavior with chunks
  - Tuning recommendations

- **env.sample**: Added complete "Semantic Search & Vector Sync Configuration"
  section with:
  - Vector sync settings
  - Qdrant configuration (3 modes)
  - Ollama embedding service
  - Document chunking configuration

- **docker-compose.yml**: Added commented examples for DOCUMENT_CHUNK_SIZE and
  DOCUMENT_CHUNK_OVERLAP with usage notes

Configuration (defaults shown):
```bash
DOCUMENT_CHUNK_SIZE=512
DOCUMENT_CHUNK_OVERLAP=50
```
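
As a rough illustration of how these two variables could be read with their defaults, here is a minimal sketch that assumes plain `os.environ` lookups. The helper name `load_chunk_settings` is hypothetical, and the actual config.py may load the values differently.

```python
import os


def load_chunk_settings(env=None):
    """Read chunking settings from the environment (hypothetical helper).

    Falls back to the defaults named in this commit: 512 and 50 words.
    """
    env = os.environ if env is None else env
    chunk_size = int(env.get("DOCUMENT_CHUNK_SIZE", "512"))
    chunk_overlap = int(env.get("DOCUMENT_CHUNK_OVERLAP", "50"))
    return chunk_size, chunk_overlap
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to exercise in isolation.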

Validation rules:
1. `overlap` must be less than `chunk_size`
2. `overlap` cannot be negative
3. Warning issued if `chunk_size` < 100 words
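
The three rules above can be sketched as a small standalone check. This is only an illustration: the function name `validate_chunk_settings` and the message texts are assumptions, not the code added to config.py.

```python
import warnings


def validate_chunk_settings(chunk_size, overlap):
    """Apply the three rules listed above (hypothetical helper)."""
    if overlap < 0:
        raise ValueError("overlap cannot be negative")
    if overlap >= chunk_size:
        raise ValueError("overlap must be less than chunk_size")
    if chunk_size < 100:
        # Small chunk sizes are allowed, but flagged with a warning.
        warnings.warn("chunk_size is below 100 words")
    return chunk_size, overlap
```

The first two rules reject the configuration outright, while the third only warns, so a small chunk size still loads.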

**Precise matching** (small notes, specific queries):
```bash
DOCUMENT_CHUNK_SIZE=256
DOCUMENT_CHUNK_OVERLAP=25
```

**Balanced** (default, general purpose):
```bash
DOCUMENT_CHUNK_SIZE=512
DOCUMENT_CHUNK_OVERLAP=50
```

**Contextual** (long documents, broader topics):
```bash
DOCUMENT_CHUNK_SIZE=1024
DOCUMENT_CHUNK_OVERLAP=100
```
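
To make the interaction between the two settings concrete, the sketch below shows a word-based sliding window in which each chunk starts `chunk_size - overlap` words after the previous one. It assumes whitespace splitting and uses a hypothetical function name, `chunk_words`; the real DocumentChunker may tokenize and handle boundaries differently.

```python
def chunk_words(text, chunk_size=512, overlap=50):
    """Split text into word-based chunks with a sliding window (illustrative only)."""
    words = text.split()
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Under these assumptions the balanced preset advances 462 words per chunk (512 - 50), and the last 50 words of each chunk reappear at the start of the next one.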

✅ **User control** - Tune chunking to match embedding model capabilities
✅ **Experimentation** - Test different chunk sizes for optimal results
✅ **Model alignment** - Match chunk size to embedding context window
✅ **Backward compatible** - Defaults maintain existing behavior
✅ **Well validated** - Comprehensive tests prevent misconfiguration

All 22 config validation tests pass (9 new tests for chunking):
- Default values work correctly
- Validation prevents invalid configurations
- Environment variables load properly
- Warning system works as expected

With configurable chunk sizes, users can now experiment with different Ollama
embedding models and tune chunk parameters for optimal semantic search quality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Author: Chris Coutinho
Date: 2025-11-10 02:45:05 +01:00
Parent: f3050e9b45
Commit: cb39b3fca4

8 changed files with 321 additions and 7 deletions

									
docker-compose.yml (+8 -4)
@@ -107,10 +107,14 @@ services:
       - QDRANT_COLLECTION=nextcloud_content
       # Ollama configuration (optional - uses SimpleEmbeddingProvider if not set)
-      #- OLLAMA_BASE_URL=https://ollama.internal.coutinho.io:443
-      #- OLLAMA_EMBEDDING_MODEL=nomic-embed-text  # Changing this creates new collection
-      #- OLLAMA_EMBEDDING_MODEL=embeddinggemma:300m
-      #- OLLAMA_VERIFY_SSL=false
+      # - OLLAMA_BASE_URL=https://ollama.internal.coutinho.io:443
+      # - OLLAMA_EMBEDDING_MODEL=nomic-embed-text  # Changing this creates new collection
+      # - OLLAMA_VERIFY_SSL=false
+      # Document chunking configuration (for vector embeddings)
+      # Tune these based on your embedding model and content type
+      # - DOCUMENT_CHUNK_SIZE=512      # Words per chunk (default: 512)
+      # - DOCUMENT_CHUNK_OVERLAP=50    # Overlapping words (default: 50, recommended: 10-20% of chunk size)
   mcp-oauth:
     build: .