diff --git a/README.md b/README.md
index d3c632e..577c0b2 100644
--- a/README.md
+++ b/README.md
@@ -28,7 +28,7 @@ docker run -p 127.0.0.1:8000:8000 --env-file .env --rm \
   ghcr.io/cbcoutinho/nextcloud-mcp-server:latest
 
 # 3. Test the connection
-curl http://127.0.0.1:8000/health
+curl http://127.0.0.1:8000/health/ready
 ```
 
 **Next Steps:**
diff --git a/docker-compose.yml b/docker-compose.yml
index 16592a1..f8bd866 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -107,10 +107,14 @@ services:
       - QDRANT_COLLECTION=nextcloud_content
 
       # Ollama configuration (optional - uses SimpleEmbeddingProvider if not set)
-      #- OLLAMA_BASE_URL=https://ollama.internal.coutinho.io:443
-      #- OLLAMA_EMBEDDING_MODEL=nomic-embed-text  # Changing this creates new collection
-      #- OLLAMA_EMBEDDING_MODEL=embeddinggemma:300m
-      #- OLLAMA_VERIFY_SSL=false
+      # - OLLAMA_BASE_URL=https://ollama.internal.coutinho.io:443
+      # - OLLAMA_EMBEDDING_MODEL=nomic-embed-text  # Changing this creates new collection
+      # - OLLAMA_VERIFY_SSL=false
+
+      # Document chunking configuration (for vector embeddings)
+      # Tune these based on your embedding model and content type
+      # - DOCUMENT_CHUNK_SIZE=512  # Words per chunk (default: 512)
+      # - DOCUMENT_CHUNK_OVERLAP=50  # Overlapping words (default: 50, recommended: 10-20% of chunk size)
 
   mcp-oauth:
     build: .
diff --git a/docs/configuration.md b/docs/configuration.md
index f7a6d6a..e451c3f 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -293,6 +293,10 @@ VECTOR_SYNC_ENABLED=true # Enable background indexing
 VECTOR_SYNC_SCAN_INTERVAL=300  # Scan interval in seconds (default: 5 minutes)
 VECTOR_SYNC_PROCESSOR_WORKERS=3  # Concurrent indexing workers (default: 3)
 VECTOR_SYNC_QUEUE_MAX_SIZE=10000  # Max queued documents (default: 10000)
+
+# Document chunking settings (for vector embeddings)
+DOCUMENT_CHUNK_SIZE=512  # Words per chunk (default: 512)
+DOCUMENT_CHUNK_OVERLAP=50  # Overlapping words between chunks (default: 50)
 ```
 
 ### Embedding Service Configuration
@@ -313,6 +317,54 @@ OLLAMA_VERIFY_SSL=true # Verify SSL certificates
 
 If `OLLAMA_BASE_URL` is not set, the server uses a simple random embedding provider for testing. This is **not suitable for production** as it generates random embeddings with no semantic meaning.
 
+### Document Chunking Configuration
+
+The server chunks documents before embedding to handle documents larger than the embedding model's context window. Chunk size and overlap can be tuned based on your embedding model and content type.
+
+#### Choosing Chunk Size
+
+**Smaller chunks (256-384 words)**:
+- More precise matching
+- Less context per chunk
+- Better for finding specific information
+- Higher storage requirements (more vectors)
+
+**Larger chunks (768-1024 words)**:
+- More context per chunk
+- Less precise matching
+- Better for understanding broader topics
+- Lower storage requirements (fewer vectors)
+
+**Default (512 words)**:
+- Balanced approach suitable for most use cases
+- Works well with typical note lengths
+- Good compromise between precision and context
+
+#### Choosing Overlap
+
+Overlap preserves context across chunk boundaries. Recommended settings:
+
+- **10-20% of chunk size** (e.g., 50-100 words for 512-word chunks)
+- **Too small** (<10%): May lose context at boundaries
+- **Too large** (>20%): Redundant storage, diminishing returns
+
+**Examples**:
+```dotenv
+# Precise matching for short notes
+DOCUMENT_CHUNK_SIZE=256
+DOCUMENT_CHUNK_OVERLAP=25
+
+# Default balanced configuration
+DOCUMENT_CHUNK_SIZE=512
+DOCUMENT_CHUNK_OVERLAP=50
+
+# More context for long documents
+DOCUMENT_CHUNK_SIZE=1024
+DOCUMENT_CHUNK_OVERLAP=100
+```
+
+**Important**: Changing chunk size requires re-embedding all documents. The collection naming strategy (see "Qdrant Collection Naming" above) helps manage this by creating separate collections for different configurations.
+
 ### Environment Variables Reference
 
 | Variable | Required | Default | Description |
@@ -328,6 +380,8 @@ If `OLLAMA_BASE_URL` is not set, the server uses a simple random embedding provi
 | `OLLAMA_BASE_URL` | ⚠️ Optional | - | Ollama API endpoint for embeddings |
 | `OLLAMA_EMBEDDING_MODEL` | ⚠️ Optional | `nomic-embed-text` | Embedding model to use |
 | `OLLAMA_VERIFY_SSL` | ⚠️ Optional | `true` | Verify SSL certificates |
+| `DOCUMENT_CHUNK_SIZE` | ⚠️ Optional | `512` | Words per chunk for document embedding |
+| `DOCUMENT_CHUNK_OVERLAP` | ⚠️ Optional | `50` | Overlapping words between chunks (must be < chunk size) |
 
 ### Docker Compose Example
 
diff --git a/docs/semantic-search-architecture.md b/docs/semantic-search-architecture.md
index 8738cd8..c8e663a 100644
--- a/docs/semantic-search-architecture.md
+++ b/docs/semantic-search-architecture.md
@@ -177,6 +177,53 @@ Currently only `NotesScanner` is implemented. Future: `CalendarScanner`, `DeckSc
 - `user_id`: Multi-tenancy filtering (each user's vectors isolated)
 - `doc_type`: App identifier ("note", "event", "card", etc.)
 - `etag`: Change detection for incremental updates
+- `chunk_index`: Position of this chunk within the document (0-indexed)
+- `total_chunks`: Total number of chunks for this document
+- `excerpt`: First 200 characters of chunk (for display)
+
+### Document Chunking Strategy
+
+Documents are chunked before embedding to handle content larger than the embedding model's context window and to improve search precision.
+
+**Configuration:**
+```dotenv
+DOCUMENT_CHUNK_SIZE=512  # Words per chunk (default)
+DOCUMENT_CHUNK_OVERLAP=50  # Overlapping words between chunks (default)
+```
+
+**Chunking Process:**
+1. **Text combination**: Document title + content (e.g., `"Note Title\n\nNote content..."`)
+2. **Word-based splitting**: Simple whitespace tokenization
+3. **Sliding window**: Create overlapping chunks
+4. **Individual embedding**: Each chunk gets its own vector
+5. **Separate storage**: Each chunk stored as distinct point in Qdrant
+
+**Example:**
+```
+Document (1000 words):
+→ Chunk 0: words 0-511
+→ Chunk 1: words 462-973 (overlaps by 50 words)
+→ Chunk 2: words 924-999 (last chunk, partial)
+
+Each chunk stored as separate vector with metadata:
+- chunk_index: 0, 1, 2
+- total_chunks: 3
+- excerpt: First 200 chars of each chunk
+```
+
+**Search Behavior:**
+- **Vector search** operates on chunks (not whole documents)
+- **Deduplication** collapses multiple matching chunks from same document
+- **Best match** returns highest-scoring chunk's excerpt
+- **Access verification** still performed at document level
+
+**Tuning Recommendations:**
+- **Small chunks (256-384 words)**: More precise, less context, more storage
+- **Large chunks (768-1024 words)**: More context, less precise, less storage
+- **Overlap (10-20% of chunk size)**: Preserves context across boundaries
+- **Match to embedding model**: Consider model's context window when sizing
+
+**Important**: Changing chunk size requires re-embedding all documents. Use the collection naming strategy to manage different chunking configurations.
 
 ### Collection Naming and Model Switching
 
diff --git a/env.sample b/env.sample
index ad46abc..5cd983e 100644
--- a/env.sample
+++ b/env.sample
@@ -124,3 +124,75 @@ ENABLE_CUSTOM_PROCESSOR=false
 
 # Comma-separated MIME types your processor supports
 #CUSTOM_PROCESSOR_TYPES=application/pdf,image/jpeg,image/png
+
+# ============================================
+# Semantic Search & Vector Sync Configuration
+# ============================================
+# EXPERIMENTAL: Semantic search for Notes app (multi-app support planned)
+# Requires: Qdrant vector database + Ollama embedding service
+# Disabled by default
+
+# Enable background vector indexing
+VECTOR_SYNC_ENABLED=false
+
+# Document scan interval in seconds (default: 300 = 5 minutes)
+# How often to check for new/updated documents
+#VECTOR_SYNC_SCAN_INTERVAL=300
+
+# Concurrent indexing workers (default: 3)
+# Number of parallel workers for embedding generation
+#VECTOR_SYNC_PROCESSOR_WORKERS=3
+
+# Max queued documents (default: 10000)
+# Maximum documents waiting to be processed
+#VECTOR_SYNC_QUEUE_MAX_SIZE=10000
+
+# ============================================
+# Qdrant Vector Database Configuration
+# ============================================
+# Choose ONE of three modes:
+# 1. In-memory mode (default): Set neither QDRANT_URL nor QDRANT_LOCATION
+# 2. Persistent local: Set QDRANT_LOCATION=/path/to/data
+# 3. Network mode: Set QDRANT_URL=http://qdrant:6333
+
+# Network mode: URL to Qdrant service
+#QDRANT_URL=http://qdrant:6333
+
+# Local mode: Path to store vectors (use :memory: for in-memory)
+#QDRANT_LOCATION=:memory:
+
+# API key for network mode (optional)
+#QDRANT_API_KEY=
+
+# Collection name (optional - auto-generated if not set)
+# Auto-generation format: {deployment-id}-{model-name}
+# Allows safe model switching and multi-server deployments
+#QDRANT_COLLECTION=nextcloud_content
+
+# ============================================
+# Ollama Embedding Service Configuration
+# ============================================
+# Ollama endpoint for embeddings (if not set, uses SimpleEmbeddingProvider fallback)
+#OLLAMA_BASE_URL=http://ollama:11434
+
+# Embedding model to use (default: nomic-embed-text, 768 dimensions)
+# Changing this creates a new collection (requires re-embedding all documents)
+#OLLAMA_EMBEDDING_MODEL=nomic-embed-text
+
+# Verify SSL certificates (default: true)
+#OLLAMA_VERIFY_SSL=true
+
+# ============================================
+# Document Chunking Configuration
+# ============================================
+# Configure how documents are split before embedding
+
+# Words per chunk (default: 512)
+# Smaller chunks (256-384): More precise, less context, more storage
+# Larger chunks (768-1024): More context, less precise, less storage
+#DOCUMENT_CHUNK_SIZE=512
+
+# Overlapping words between chunks (default: 50)
+# Recommended: 10-20% of chunk size
+# Preserves context across chunk boundaries
+#DOCUMENT_CHUNK_OVERLAP=50
diff --git a/nextcloud_mcp_server/config.py b/nextcloud_mcp_server/config.py
index 603d28a..61b4ea0 100644
--- a/nextcloud_mcp_server/config.py
+++ b/nextcloud_mcp_server/config.py
@@ -174,6 +174,10 @@ class Settings:
     ollama_embedding_model: str = "nomic-embed-text"
     ollama_verify_ssl: bool = True
 
+    # Document chunking settings (for vector embeddings)
+    document_chunk_size: int = 512  # Words per chunk
+    document_chunk_overlap: int = 50  # Overlapping words between chunks
+
     # Observability settings
     metrics_enabled: bool = True
     metrics_port: int = 9090
@@ -209,6 +213,25 @@ class Settings:
                 "API key is only relevant for network mode and will be ignored."
             )
 
+        # Validate chunking configuration
+        if self.document_chunk_overlap >= self.document_chunk_size:
+            raise ValueError(
+                f"DOCUMENT_CHUNK_OVERLAP ({self.document_chunk_overlap}) must be less than "
+                f"DOCUMENT_CHUNK_SIZE ({self.document_chunk_size}). "
+                f"Overlap should be 10-20% of chunk size for optimal results."
+            )
+
+        if self.document_chunk_size < 100:
+            logger.warning(
+                f"DOCUMENT_CHUNK_SIZE is set to {self.document_chunk_size} words, which is quite small. "
+                f"Smaller chunks may lose context. Consider using at least 256 words."
+            )
+
+        if self.document_chunk_overlap < 0:
+            raise ValueError(
+                f"DOCUMENT_CHUNK_OVERLAP ({self.document_chunk_overlap}) cannot be negative."
+            )
+
     def get_collection_name(self) -> str:
         """
         Get Qdrant collection name.
@@ -305,6 +328,9 @@ def get_settings() -> Settings:
         ollama_base_url=os.getenv("OLLAMA_BASE_URL"),
         ollama_embedding_model=os.getenv("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text"),
         ollama_verify_ssl=os.getenv("OLLAMA_VERIFY_SSL", "true").lower() == "true",
+        # Document chunking settings
+        document_chunk_size=int(os.getenv("DOCUMENT_CHUNK_SIZE", "512")),
+        document_chunk_overlap=int(os.getenv("DOCUMENT_CHUNK_OVERLAP", "50")),
         # Observability settings
         metrics_enabled=os.getenv("METRICS_ENABLED", "true").lower() == "true",
         metrics_port=int(os.getenv("METRICS_PORT", "9090")),
diff --git a/nextcloud_mcp_server/vector/processor.py b/nextcloud_mcp_server/vector/processor.py
index 9105070..424e716 100644
--- a/nextcloud_mcp_server/vector/processor.py
+++ b/nextcloud_mcp_server/vector/processor.py
@@ -170,8 +170,11 @@ async def _index_document(
         else:
             raise ValueError(f"Unsupported doc_type: {doc_task.doc_type}")
 
-        # Tokenize and chunk
-        chunker = DocumentChunker(chunk_size=512, overlap=50)
+        # Tokenize and chunk (using configured chunk size and overlap)
+        chunker = DocumentChunker(
+            chunk_size=settings.document_chunk_size,
+            overlap=settings.document_chunk_overlap,
+        )
         chunks = chunker.chunk_text(content)
 
         # Generate embeddings (I/O bound - external API call)
diff --git a/tests/unit/test_config.py b/tests/unit/test_config.py
index f24e040..2caaa05 100644
--- a/tests/unit/test_config.py
+++ b/tests/unit/test_config.py
@@ -151,3 +151,111 @@ class TestGetSettings:
         assert settings.vector_sync_scan_interval == 600
         assert settings.vector_sync_processor_workers == 5
         assert settings.vector_sync_queue_max_size == 5000
+
+
+class TestChunkConfigValidation:
+    """Test document chunking configuration validation."""
+
+    def test_default_chunk_settings(self):
+        """Test default chunk size and overlap values."""
+        settings = Settings()
+        assert settings.document_chunk_size == 512
+        assert settings.document_chunk_overlap == 50
+
+    def test_valid_chunk_settings(self):
+        """Test valid chunk size and overlap configuration."""
+        settings = Settings(
+            document_chunk_size=1024,
+            document_chunk_overlap=100,
+        )
+        assert settings.document_chunk_size == 1024
+        assert settings.document_chunk_overlap == 100
+
+    def test_overlap_greater_than_or_equal_to_chunk_size_raises_error(self):
+        """Test that overlap >= chunk size raises ValueError."""
+        with pytest.raises(
+            ValueError,
+            match="DOCUMENT_CHUNK_OVERLAP .* must be less than DOCUMENT_CHUNK_SIZE",
+        ):
+            Settings(
+                document_chunk_size=512,
+                document_chunk_overlap=512,
+            )
+
+    def test_overlap_larger_than_chunk_size_raises_error(self):
+        """Test that overlap > chunk size raises ValueError."""
+        with pytest.raises(
+            ValueError,
+            match="DOCUMENT_CHUNK_OVERLAP .* must be less than DOCUMENT_CHUNK_SIZE",
+        ):
+            Settings(
+                document_chunk_size=256,
+                document_chunk_overlap=300,
+            )
+
+    def test_negative_overlap_raises_error(self):
+        """Test that negative overlap raises ValueError."""
+        with pytest.raises(
+            ValueError,
+            match="DOCUMENT_CHUNK_OVERLAP .* cannot be negative",
+        ):
+            Settings(
+                document_chunk_size=512,
+                document_chunk_overlap=-10,
+            )
+
+    def test_small_chunk_size_warning(self, caplog):
+        """Test that chunk size < 100 triggers warning."""
+        import logging
+
+        caplog.set_level(logging.WARNING, logger="nextcloud_mcp_server.config")
+        Settings(
+            document_chunk_size=64,
+            document_chunk_overlap=10,
+        )
+        assert (
+            "DOCUMENT_CHUNK_SIZE is set to 64 words, which is quite small"
+            in caplog.text
+        )
+        assert "Consider using at least 256 words" in caplog.text
+
+    def test_reasonable_chunk_size_no_warning(self, caplog):
+        """Test that chunk size >= 100 doesn't trigger warning."""
+        import logging
+
+        caplog.set_level(logging.WARNING, logger="nextcloud_mcp_server.config")
+        Settings(
+            document_chunk_size=256,
+            document_chunk_overlap=25,
+        )
+        assert "DOCUMENT_CHUNK_SIZE" not in caplog.text
+
+    @patch.dict(
+        os.environ,
+        {
+            "DOCUMENT_CHUNK_SIZE": "1024",
+            "DOCUMENT_CHUNK_OVERLAP": "102",
+        },
+        clear=True,
+    )
+    def test_get_settings_chunk_config(self):
+        """Test get_settings() with chunk configuration."""
+        settings = get_settings()
+        assert settings.document_chunk_size == 1024
+        assert settings.document_chunk_overlap == 102
+
+    @patch.dict(
+        os.environ,
+        {
+            "DOCUMENT_CHUNK_SIZE": "256",
+            "DOCUMENT_CHUNK_OVERLAP": "256",
+        },
+        clear=True,
+    )
+    def test_get_settings_invalid_chunk_config_raises_error(self):
+        """Test get_settings() raises error for invalid chunk config."""
+        with pytest.raises(
+            ValueError,
+            match="DOCUMENT_CHUNK_OVERLAP .* must be less than DOCUMENT_CHUNK_SIZE",
+        ):
+            get_settings()
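
Note on the chunking arithmetic documented in `docs/semantic-search-architecture.md`: the diff does not show `DocumentChunker` itself, so the following is only a minimal sketch of a word-based sliding window under the documented settings. The function name `chunk_words` and its stopping rule are assumptions made for illustration, not the project's implementation. It reproduces the 1000-word example from the docs (step = chunk size minus overlap = 462).

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Hypothetical sliding-window chunker (not the project's DocumentChunker)."""
    # Same constraint the new Settings validation enforces.
    if overlap < 0 or overlap >= chunk_size:
        raise ValueError("overlap must be >= 0 and less than chunk_size")

    words = text.split()          # simple whitespace tokenization
    step = chunk_size - overlap   # how far the window advances each time
    chunks: list[list[str]] = []
    start = 0
    while start < len(words):
        chunks.append(words[start:start + chunk_size])
        # Assumed stopping rule: once a window reaches the end of the text,
        # stop, so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(words):
            break
        start += step
    return chunks


# A 1000-word document whose words encode their own position: w0 ... w999.
doc = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(doc)
spans = [(int(c[0][1:]), int(c[-1][1:])) for c in chunks]
print(spans)  # [(0, 511), (462, 973), (924, 999)]
```

With the defaults, words 462-511 appear in both chunk 0 and chunk 1, which is the 50-word overlap described in the docs; a larger overlap shrinks the step and so increases the number of vectors stored per document.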