feat(vector): Add configurable chunk size and overlap for document embedding

Enable users to tune document chunking parameters to match their embedding
model and content type by adding DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP
environment variables.

- **config.py**: Added `document_chunk_size` (default: 512) and
  `document_chunk_overlap` (default: 50) configuration fields with validation:
  - Ensures overlap < chunk_size
  - Warns if chunk_size < 100 words
  - Prevents negative overlap values

- **processor.py**: Updated DocumentChunker instantiation to use config
  settings instead of hardcoded values (line 174-177)

- **tests/unit/test_config.py**: Added TestChunkConfigValidation class with
  9 tests covering:
  - Default values
  - Valid configurations
  - Validation errors (overlap >= chunk_size, negative overlap)
  - Warning for small chunk sizes
  - Environment variable loading

- **docs/configuration.md**: Added comprehensive "Document Chunking
  Configuration" section with:
  - Chunk size selection guidance (256-384 vs 512 vs 768-1024 words)
  - Overlap recommendations (10-20% of chunk size)
  - Configuration examples for different use cases
  - Added env vars to reference table

- **docs/semantic-search-architecture.md**: Added "Document Chunking Strategy"
  section with:
  - Chunking process explanation
  - Example showing sliding window behavior
  - Search behavior with chunks
  - Tuning recommendations

- **env.sample**: Added complete "Semantic Search & Vector Sync Configuration"
  section with:
  - Vector sync settings
  - Qdrant configuration (3 modes)
  - Ollama embedding service
  - Document chunking configuration

- **docker-compose.yml**: Added commented examples for DOCUMENT_CHUNK_SIZE and
  DOCUMENT_CHUNK_OVERLAP with usage notes

\`\`\`bash
DOCUMENT_CHUNK_SIZE=512

DOCUMENT_CHUNK_OVERLAP=50
\`\`\`

1. \`overlap\` must be less than \`chunk_size\`
2. \`overlap\` cannot be negative
3. Warning issued if \`chunk_size\` < 100 words

**Precise matching** (small notes, specific queries):
\`\`\`bash
DOCUMENT_CHUNK_SIZE=256
DOCUMENT_CHUNK_OVERLAP=25
\`\`\`

**Balanced** (default, general purpose):
\`\`\`bash
DOCUMENT_CHUNK_SIZE=512
DOCUMENT_CHUNK_OVERLAP=50
\`\`\`

**Contextual** (long documents, broader topics):
\`\`\`bash
DOCUMENT_CHUNK_SIZE=1024
DOCUMENT_CHUNK_OVERLAP=100
\`\`\`

 **User control** - Tune chunking to match embedding model capabilities
 **Experimentation** - Test different chunk sizes for optimal results
 **Model alignment** - Match chunk size to embedding context window
 **Backward compatible** - Defaults maintain existing behavior
 **Well validated** - Comprehensive tests prevent misconfiguration

All 22 config validation tests pass (9 new tests for chunking):
- Default values work correctly
- Validation prevents invalid configurations
- Environment variables load properly
- Warning system works as expected

With configurable chunk sizes, users can now experiment with different Ollama
embedding models and tune chunk parameters for optimal semantic search quality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
Chris Coutinho
2025-11-10 02:45:05 +01:00
parent f3050e9b45
commit cb39b3fca4
8 changed files with 321 additions and 7 deletions
+1 -1
View File
@@ -28,7 +28,7 @@ docker run -p 127.0.0.1:8000:8000 --env-file .env --rm \
ghcr.io/cbcoutinho/nextcloud-mcp-server:latest
# 3. Test the connection
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/health/ready
```
**Next Steps:**
+8 -4
View File
@@ -107,10 +107,14 @@ services:
- QDRANT_COLLECTION=nextcloud_content
# Ollama configuration (optional - uses SimpleEmbeddingProvider if not set)
#- OLLAMA_BASE_URL=https://ollama.internal.coutinho.io:443
#- OLLAMA_EMBEDDING_MODEL=nomic-embed-text # Changing this creates new collection
#- OLLAMA_EMBEDDING_MODEL=embeddinggemma:300m
#- OLLAMA_VERIFY_SSL=false
# - OLLAMA_BASE_URL=https://ollama.internal.coutinho.io:443
# - OLLAMA_EMBEDDING_MODEL=nomic-embed-text # Changing this creates new collection
# - OLLAMA_VERIFY_SSL=false
# Document chunking configuration (for vector embeddings)
# Tune these based on your embedding model and content type
# - DOCUMENT_CHUNK_SIZE=512 # Words per chunk (default: 512)
# - DOCUMENT_CHUNK_OVERLAP=50 # Overlapping words (default: 50, recommended: 10-20% of chunk size)
mcp-oauth:
build: .
+54
View File
@@ -293,6 +293,10 @@ VECTOR_SYNC_ENABLED=true # Enable background indexing
VECTOR_SYNC_SCAN_INTERVAL=300 # Scan interval in seconds (default: 5 minutes)
VECTOR_SYNC_PROCESSOR_WORKERS=3 # Concurrent indexing workers (default: 3)
VECTOR_SYNC_QUEUE_MAX_SIZE=10000 # Max queued documents (default: 10000)
# Document chunking settings (for vector embeddings)
DOCUMENT_CHUNK_SIZE=512 # Words per chunk (default: 512)
DOCUMENT_CHUNK_OVERLAP=50 # Overlapping words between chunks (default: 50)
```
### Embedding Service Configuration
@@ -313,6 +317,54 @@ OLLAMA_VERIFY_SSL=true # Verify SSL certificates
If `OLLAMA_BASE_URL` is not set, the server uses a simple random embedding provider for testing. This is **not suitable for production** as it generates random embeddings with no semantic meaning.
### Document Chunking Configuration
The server chunks documents before embedding to handle documents larger than the embedding model's context window. Chunk size and overlap can be tuned based on your embedding model and content type.
#### Choosing Chunk Size
**Smaller chunks (256-384 words)**:
- More precise matching
- Less context per chunk
- Better for finding specific information
- Higher storage requirements (more vectors)
**Larger chunks (768-1024 words)**:
- More context per chunk
- Less precise matching
- Better for understanding broader topics
- Lower storage requirements (fewer vectors)
**Default (512 words)**:
- Balanced approach suitable for most use cases
- Works well with typical note lengths
- Good compromise between precision and context
#### Choosing Overlap
Overlap preserves context across chunk boundaries. Recommended settings:
- **10-20% of chunk size** (e.g., 50-100 words for 512-word chunks)
- **Too small** (<10%): May lose context at boundaries
- **Too large** (>20%): Redundant storage, diminishing returns
**Examples**:
```dotenv
# Precise matching for short notes
DOCUMENT_CHUNK_SIZE=256
DOCUMENT_CHUNK_OVERLAP=25
# Default balanced configuration
DOCUMENT_CHUNK_SIZE=512
DOCUMENT_CHUNK_OVERLAP=50
# More context for long documents
DOCUMENT_CHUNK_SIZE=1024
DOCUMENT_CHUNK_OVERLAP=100
```
**Important**: Changing chunk size requires re-embedding all documents. The collection naming strategy (see "Qdrant Collection Naming" above) helps manage this by creating separate collections for different configurations.
### Environment Variables Reference
| Variable | Required | Default | Description |
@@ -328,6 +380,8 @@ If `OLLAMA_BASE_URL` is not set, the server uses a simple random embedding provi
| `OLLAMA_BASE_URL` | ⚠️ Optional | - | Ollama API endpoint for embeddings |
| `OLLAMA_EMBEDDING_MODEL` | ⚠️ Optional | `nomic-embed-text` | Embedding model to use |
| `OLLAMA_VERIFY_SSL` | ⚠️ Optional | `true` | Verify SSL certificates |
| `DOCUMENT_CHUNK_SIZE` | ⚠️ Optional | `512` | Words per chunk for document embedding |
| `DOCUMENT_CHUNK_OVERLAP` | ⚠️ Optional | `50` | Overlapping words between chunks (must be < chunk size) |
### Docker Compose Example
+47
View File
@@ -177,6 +177,53 @@ Currently only `NotesScanner` is implemented. Future: `CalendarScanner`, `DeckSc
- `user_id`: Multi-tenancy filtering (each user's vectors isolated)
- `doc_type`: App identifier ("note", "event", "card", etc.)
- `etag`: Change detection for incremental updates
- `chunk_index`: Position of this chunk within the document (0-indexed)
- `total_chunks`: Total number of chunks for this document
- `excerpt`: First 200 characters of chunk (for display)
### Document Chunking Strategy
Documents are chunked before embedding to handle content larger than the embedding model's context window and to improve search precision.
**Configuration:**
```dotenv
DOCUMENT_CHUNK_SIZE=512 # Words per chunk (default)
DOCUMENT_CHUNK_OVERLAP=50 # Overlapping words between chunks (default)
```
**Chunking Process:**
1. **Text combination**: Document title + content (e.g., `"Note Title\n\nNote content..."`)
2. **Word-based splitting**: Simple whitespace tokenization
3. **Sliding window**: Create overlapping chunks
4. **Individual embedding**: Each chunk gets its own vector
5. **Separate storage**: Each chunk stored as distinct point in Qdrant
**Example:**
```
Document (1000 words):
→ Chunk 0: words 0-511
→ Chunk 1: words 462-973 (overlaps by 50 words)
→ Chunk 2: words 924-999 (last chunk, partial)
Each chunk stored as separate vector with metadata:
- chunk_index: 0, 1, 2
- total_chunks: 3
- excerpt: First 200 chars of each chunk
```
**Search Behavior:**
- **Vector search** operates on chunks (not whole documents)
- **Deduplication** collapses multiple matching chunks from same document
- **Best match** returns highest-scoring chunk's excerpt
- **Access verification** still performed at document level
**Tuning Recommendations:**
- **Small chunks (256-384 words)**: More precise, less context, more storage
- **Large chunks (768-1024 words)**: More context, less precise, less storage
- **Overlap (10-20% of chunk size)**: Preserves context across boundaries
- **Match to embedding model**: Consider model's context window when sizing
**Important**: Changing chunk size requires re-embedding all documents. Use the collection naming strategy to manage different chunking configurations.
### Collection Naming and Model Switching
+72
View File
@@ -124,3 +124,75 @@ ENABLE_CUSTOM_PROCESSOR=false
# Comma-separated MIME types your processor supports
#CUSTOM_PROCESSOR_TYPES=application/pdf,image/jpeg,image/png
# ============================================
# Semantic Search & Vector Sync Configuration
# ============================================
# EXPERIMENTAL: Semantic search for Notes app (multi-app support planned)
# Requires: Qdrant vector database + Ollama embedding service
# Disabled by default
# Enable background vector indexing
VECTOR_SYNC_ENABLED=false
# Document scan interval in seconds (default: 300 = 5 minutes)
# How often to check for new/updated documents
#VECTOR_SYNC_SCAN_INTERVAL=300
# Concurrent indexing workers (default: 3)
# Number of parallel workers for embedding generation
#VECTOR_SYNC_PROCESSOR_WORKERS=3
# Max queued documents (default: 10000)
# Maximum documents waiting to be processed
#VECTOR_SYNC_QUEUE_MAX_SIZE=10000
# ============================================
# Qdrant Vector Database Configuration
# ============================================
# Choose ONE of three modes:
# 1. In-memory mode (default): Set neither QDRANT_URL nor QDRANT_LOCATION
# 2. Persistent local: Set QDRANT_LOCATION=/path/to/data
# 3. Network mode: Set QDRANT_URL=http://qdrant:6333
# Network mode: URL to Qdrant service
#QDRANT_URL=http://qdrant:6333
# Local mode: Path to store vectors (use :memory: for in-memory)
#QDRANT_LOCATION=:memory:
# API key for network mode (optional)
#QDRANT_API_KEY=
# Collection name (optional - auto-generated if not set)
# Auto-generation format: {deployment-id}-{model-name}
# Allows safe model switching and multi-server deployments
#QDRANT_COLLECTION=nextcloud_content
# ============================================
# Ollama Embedding Service Configuration
# ============================================
# Ollama endpoint for embeddings (if not set, uses SimpleEmbeddingProvider fallback)
#OLLAMA_BASE_URL=http://ollama:11434
# Embedding model to use (default: nomic-embed-text, 768 dimensions)
# Changing this creates a new collection (requires re-embedding all documents)
#OLLAMA_EMBEDDING_MODEL=nomic-embed-text
# Verify SSL certificates (default: true)
#OLLAMA_VERIFY_SSL=true
# ============================================
# Document Chunking Configuration
# ============================================
# Configure how documents are split before embedding
# Words per chunk (default: 512)
# Smaller chunks (256-384): More precise, less context, more storage
# Larger chunks (768-1024): More context, less precise, less storage
#DOCUMENT_CHUNK_SIZE=512
# Overlapping words between chunks (default: 50)
# Recommended: 10-20% of chunk size
# Preserves context across chunk boundaries
#DOCUMENT_CHUNK_OVERLAP=50
+26
View File
@@ -174,6 +174,10 @@ class Settings:
ollama_embedding_model: str = "nomic-embed-text"
ollama_verify_ssl: bool = True
# Document chunking settings (for vector embeddings)
document_chunk_size: int = 512 # Words per chunk
document_chunk_overlap: int = 50 # Overlapping words between chunks
# Observability settings
metrics_enabled: bool = True
metrics_port: int = 9090
@@ -209,6 +213,25 @@ class Settings:
"API key is only relevant for network mode and will be ignored."
)
# Validate chunking configuration
if self.document_chunk_overlap >= self.document_chunk_size:
raise ValueError(
f"DOCUMENT_CHUNK_OVERLAP ({self.document_chunk_overlap}) must be less than "
f"DOCUMENT_CHUNK_SIZE ({self.document_chunk_size}). "
f"Overlap should be 10-20% of chunk size for optimal results."
)
if self.document_chunk_size < 100:
logger.warning(
f"DOCUMENT_CHUNK_SIZE is set to {self.document_chunk_size} words, which is quite small. "
f"Smaller chunks may lose context. Consider using at least 256 words."
)
if self.document_chunk_overlap < 0:
raise ValueError(
f"DOCUMENT_CHUNK_OVERLAP ({self.document_chunk_overlap}) cannot be negative."
)
def get_collection_name(self) -> str:
"""
Get Qdrant collection name.
@@ -305,6 +328,9 @@ def get_settings() -> Settings:
ollama_base_url=os.getenv("OLLAMA_BASE_URL"),
ollama_embedding_model=os.getenv("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text"),
ollama_verify_ssl=os.getenv("OLLAMA_VERIFY_SSL", "true").lower() == "true",
# Document chunking settings
document_chunk_size=int(os.getenv("DOCUMENT_CHUNK_SIZE", "512")),
document_chunk_overlap=int(os.getenv("DOCUMENT_CHUNK_OVERLAP", "50")),
# Observability settings
metrics_enabled=os.getenv("METRICS_ENABLED", "true").lower() == "true",
metrics_port=int(os.getenv("METRICS_PORT", "9090")),
+5 -2
View File
@@ -170,8 +170,11 @@ async def _index_document(
else:
raise ValueError(f"Unsupported doc_type: {doc_task.doc_type}")
# Tokenize and chunk
chunker = DocumentChunker(chunk_size=512, overlap=50)
# Tokenize and chunk (using configured chunk size and overlap)
chunker = DocumentChunker(
chunk_size=settings.document_chunk_size,
overlap=settings.document_chunk_overlap,
)
chunks = chunker.chunk_text(content)
# Generate embeddings (I/O bound - external API call)
+108
View File
@@ -151,3 +151,111 @@ class TestGetSettings:
assert settings.vector_sync_scan_interval == 600
assert settings.vector_sync_processor_workers == 5
assert settings.vector_sync_queue_max_size == 5000
class TestChunkConfigValidation:
"""Test document chunking configuration validation."""
def test_default_chunk_settings(self):
"""Test default chunk size and overlap values."""
settings = Settings()
assert settings.document_chunk_size == 512
assert settings.document_chunk_overlap == 50
def test_valid_chunk_settings(self):
"""Test valid chunk size and overlap configuration."""
settings = Settings(
document_chunk_size=1024,
document_chunk_overlap=100,
)
assert settings.document_chunk_size == 1024
assert settings.document_chunk_overlap == 100
def test_overlap_greater_than_or_equal_to_chunk_size_raises_error(self):
"""Test that overlap >= chunk size raises ValueError."""
with pytest.raises(
ValueError,
match="DOCUMENT_CHUNK_OVERLAP .* must be less than DOCUMENT_CHUNK_SIZE",
):
Settings(
document_chunk_size=512,
document_chunk_overlap=512,
)
def test_overlap_larger_than_chunk_size_raises_error(self):
"""Test that overlap > chunk size raises ValueError."""
with pytest.raises(
ValueError,
match="DOCUMENT_CHUNK_OVERLAP .* must be less than DOCUMENT_CHUNK_SIZE",
):
Settings(
document_chunk_size=256,
document_chunk_overlap=300,
)
def test_negative_overlap_raises_error(self):
"""Test that negative overlap raises ValueError."""
with pytest.raises(
ValueError,
match="DOCUMENT_CHUNK_OVERLAP .* cannot be negative",
):
Settings(
document_chunk_size=512,
document_chunk_overlap=-10,
)
def test_small_chunk_size_warning(self, caplog):
"""Test that chunk size < 100 triggers warning."""
import logging
caplog.set_level(logging.WARNING, logger="nextcloud_mcp_server.config")
Settings(
document_chunk_size=64,
document_chunk_overlap=10,
)
assert (
"DOCUMENT_CHUNK_SIZE is set to 64 words, which is quite small"
in caplog.text
)
assert "Consider using at least 256 words" in caplog.text
def test_reasonable_chunk_size_no_warning(self, caplog):
"""Test that chunk size >= 100 doesn't trigger warning."""
import logging
caplog.set_level(logging.WARNING, logger="nextcloud_mcp_server.config")
Settings(
document_chunk_size=256,
document_chunk_overlap=25,
)
assert "DOCUMENT_CHUNK_SIZE" not in caplog.text
@patch.dict(
os.environ,
{
"DOCUMENT_CHUNK_SIZE": "1024",
"DOCUMENT_CHUNK_OVERLAP": "102",
},
clear=True,
)
def test_get_settings_chunk_config(self):
"""Test get_settings() with chunk configuration."""
settings = get_settings()
assert settings.document_chunk_size == 1024
assert settings.document_chunk_overlap == 102
@patch.dict(
os.environ,
{
"DOCUMENT_CHUNK_SIZE": "256",
"DOCUMENT_CHUNK_OVERLAP": "256",
},
clear=True,
)
def test_get_settings_invalid_chunk_config_raises_error(self):
"""Test get_settings() raises error for invalid chunk config."""
with pytest.raises(
ValueError,
match="DOCUMENT_CHUNK_OVERLAP .* must be less than DOCUMENT_CHUNK_SIZE",
):
get_settings()