nextcloud-mcp-server/docker-compose.yml
Chris Coutinho cb39b3fca4 feat(vector): Add configurable chunk size and overlap for document embedding
Enable users to tune document chunking parameters to match their embedding
model and content type by adding DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP
environment variables.

- **config.py**: Added `document_chunk_size` (default: 512) and
  `document_chunk_overlap` (default: 50) configuration fields with validation:
  - Ensures overlap < chunk_size
  - Warns if chunk_size < 100 words
  - Prevents negative overlap values

- **processor.py**: Updated DocumentChunker instantiation to use config
  settings instead of hardcoded values (lines 174-177)

- **tests/unit/test_config.py**: Added TestChunkConfigValidation class with
  9 tests covering:
  - Default values
  - Valid configurations
  - Validation errors (overlap >= chunk_size, negative overlap)
  - Warning for small chunk sizes
  - Environment variable loading

- **docs/configuration.md**: Added comprehensive "Document Chunking
  Configuration" section with:
  - Chunk size selection guidance (256-384 vs 512 vs 768-1024 words)
  - Overlap recommendations (10-20% of chunk size)
  - Configuration examples for different use cases
  - Added env vars to reference table

- **docs/semantic-search-architecture.md**: Added "Document Chunking Strategy"
  section with:
  - Chunking process explanation
  - Example showing sliding window behavior
  - Search behavior with chunks
  - Tuning recommendations

- **env.sample**: Added complete "Semantic Search & Vector Sync Configuration"
  section with:
  - Vector sync settings
  - Qdrant configuration (3 modes)
  - Ollama embedding service
  - Document chunking configuration

- **docker-compose.yml**: Added commented examples for DOCUMENT_CHUNK_SIZE and
  DOCUMENT_CHUNK_OVERLAP with usage notes

```bash
DOCUMENT_CHUNK_SIZE=512

DOCUMENT_CHUNK_OVERLAP=50
```
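
As a rough illustration of how these two variables could be read with the documented defaults (512 and 50), here is a minimal Python sketch. The function name `load_chunk_settings` is hypothetical and not taken from `config.py`; only the variable names and defaults come from this commit message.

```python
import os


def load_chunk_settings(env=None):
    """Return (chunk_size, chunk_overlap) in words.

    Hypothetical helper: the real config.py may load these differently.
    Falls back to the defaults documented in this commit (512 and 50).
    """
    env = os.environ if env is None else env
    chunk_size = int(env.get("DOCUMENT_CHUNK_SIZE", "512"))
    chunk_overlap = int(env.get("DOCUMENT_CHUNK_OVERLAP", "50"))
    return chunk_size, chunk_overlap
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in tests without touching the process environment.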

1. `overlap` must be less than `chunk_size`
2. `overlap` cannot be negative
3. Warning issued if `chunk_size` < 100 words
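
The three rules above could be expressed roughly as follows. This is a sketch under assumptions, not the code in `config.py`: the function name `validate_chunk_config`, the exception type, and the message wording are all illustrative.

```python
import warnings


def validate_chunk_config(chunk_size: int, overlap: int) -> None:
    """Apply the three documented rules (hypothetical implementation)."""
    # Rule 2: overlap cannot be negative.
    if overlap < 0:
        raise ValueError(f"overlap must be >= 0, got {overlap}")
    # Rule 1: overlap must be strictly less than chunk_size.
    if overlap >= chunk_size:
        raise ValueError(
            f"overlap ({overlap}) must be less than chunk_size ({chunk_size})"
        )
    # Rule 3: small chunks are allowed but trigger a warning.
    if chunk_size < 100:
        warnings.warn(
            f"chunk_size of {chunk_size} words is small; results may lack context"
        )
```

Checking the negative case first means a value such as `-1` is reported as negative rather than as "less than chunk_size", which gives a clearer error message.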

**Precise matching** (small notes, specific queries):
```bash
DOCUMENT_CHUNK_SIZE=256
DOCUMENT_CHUNK_OVERLAP=25
```

**Balanced** (default, general purpose):
```bash
DOCUMENT_CHUNK_SIZE=512
DOCUMENT_CHUNK_OVERLAP=50
```

**Contextual** (long documents, broader topics):
```bash
DOCUMENT_CHUNK_SIZE=1024
DOCUMENT_CHUNK_OVERLAP=100
```
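
To make the effect of these presets concrete, the following Python sketch shows one way a word-based sliding window can work: each chunk starts `chunk_size - overlap` words after the previous one, so neighbouring chunks share `overlap` words. The function `chunk_words` is hypothetical and only assumes whitespace-delimited words; the actual `DocumentChunker` in `processor.py` may differ.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (illustrative sketch only)."""
    step = chunk_size - overlap
    if overlap < 0 or step <= 0:
        raise ValueError("overlap must be >= 0 and less than chunk_size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this window already reaches the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks


# Example with a tiny window so the overlap is easy to see.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

With `chunk_size=4` and `overlap=1`, the ten-word sample yields three chunks, and the last word of each chunk reappears as the first word of the next.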

- **User control** - Tune chunking to match embedding model capabilities
- **Experimentation** - Test different chunk sizes for optimal results
- **Model alignment** - Match chunk size to embedding context window
- **Backward compatible** - Defaults maintain existing behavior
- **Well validated** - Comprehensive tests prevent misconfiguration

All 22 config validation tests pass (9 new tests for chunking):
- Default values work correctly
- Validation prevents invalid configurations
- Environment variables load properly
- Warning system works as expected

With configurable chunk sizes, users can now experiment with different Ollama
embedding models and tune chunk parameters for optimal semantic search quality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-10 02:47:57 +01:00


services:
  # Note: MariaDB is an external service. You can find more information about the configuration here:
  # https://hub.docker.com/_/mariadb
  db:
    # Note: Check the recommended version here: https://docs.nextcloud.com/server/latest/admin_manual/installation/system_requirements.html#server
    image: docker.io/library/mariadb:lts@sha256:ae6119716edac6998ae85508431b3d2e666530ddf4e94c61a10710caec9b0f71
    restart: always
    command: --transaction-isolation=READ-COMMITTED
    volumes:
      - db:/var/lib/mysql
    environment:
      - MYSQL_ROOT_PASSWORD=password
      - MYSQL_PASSWORD=password
      - MYSQL_DATABASE=nextcloud
      - MYSQL_USER=nextcloud
  # Note: Redis is an external service. You can find more information about the configuration here:
  # https://hub.docker.com/_/redis
  redis:
    image: docker.io/library/redis:alpine@sha256:28c9c4d7596949a24b183eaaab6455f8e5d55ecbf72d02ff5e2c17fe72671d31
    restart: always
  app:
    image: docker.io/library/nextcloud:32.0.1@sha256:5b043f7ea2f609d5ff5635f475c30d303bec17775a5c3f7fa435e3818e669120
    restart: always
    ports:
      - 0.0.0.0:8080:80
    depends_on:
      - redis
      - db
      - keycloak
    volumes:
      - nextcloud:/var/www/html
      - ./app-hooks:/docker-entrypoint-hooks.d:ro
      # Mount OIDC development directory outside /var/www/html to avoid rsync conflicts
      # The post-installation hook will register /opt/apps as an additional app directory
      - ./third_party:/opt/apps:ro
    environment:
      - NEXTCLOUD_TRUSTED_DOMAINS=app
      - NEXTCLOUD_ADMIN_USER=admin
      - NEXTCLOUD_ADMIN_PASSWORD=admin
      - MYSQL_PASSWORD=password
      - MYSQL_DATABASE=nextcloud
      - MYSQL_USER=nextcloud
      - MYSQL_HOST=db
      - REDIS_HOST=redis
    healthcheck:
      test: ["CMD-SHELL", "curl -Ss http://localhost/status.php | grep '\"installed\":true' || exit 1"]
      interval: 10s
      timeout: 30s
      retries: 30
  recipes:
    image: docker.io/library/nginx:alpine@sha256:b3c656d55d7ad751196f21b7fd2e8d4da9cb430e32f646adcf92441b72f82b14
    restart: always
    volumes:
      - ./tests/fixtures/test_recipe.html:/usr/share/nginx/html/test_recipe.html:ro
      - ./tests/fixtures/nginx.conf:/etc/nginx/nginx.conf:ro
  unstructured:
    image: downloads.unstructured.io/unstructured-io/unstructured-api:latest@sha256:54282d3a25f33fd6cf69bc45b3d37770f213593f58b6dfe5e85fe546376b2807
    restart: always
    ports:
      - 127.0.0.1:8002:8000
    # Unstructured API runs on port 8000 internally
    # We expose it on 8002 externally to avoid conflict
    profiles:
      - unstructured
  mcp:
    build: .
    command: ["--transport", "streamable-http"]
    restart: always
    depends_on:
      app:
        condition: service_healthy
    ports:
      - 127.0.0.1:8000:8000
    volumes:
      - mcp-data:/app/data
    environment:
      - NEXTCLOUD_HOST=http://app:80
      - NEXTCLOUD_USERNAME=admin
      - NEXTCLOUD_PASSWORD=admin
      # Vector sync configuration (ADR-007)
      - VECTOR_SYNC_ENABLED=true
      - VECTOR_SYNC_SCAN_INTERVAL=10
      - VECTOR_SYNC_PROCESSOR_WORKERS=1
      - LOG_FORMAT=text
      # Qdrant configuration (three modes):
      # 1. Network mode: Set QDRANT_URL=http://qdrant:6333 (requires qdrant service)
      # 2. In-memory mode: Set QDRANT_LOCATION=:memory: (default if nothing set)
      # 3. Persistent local: Set QDRANT_LOCATION=/app/data/qdrant (stored in mcp-data volume)
      #- QDRANT_LOCATION=/app/data/qdrant
      - QDRANT_URL=http://qdrant:6333 # Uncomment for network mode
      - QDRANT_API_KEY=${QDRANT_API_KEY:-my_secret_api_key} # Only for network mode
      # Collection naming: Auto-generated as {deployment-id}-{model-name}
      # - Deployment ID: OTEL_SERVICE_NAME (if set) or hostname (fallback)
      # - Model name: OLLAMA_EMBEDDING_MODEL
      # - Example: "nextcloud-mcp-server-nomic-embed-text"
      # - Changing models creates new collection (requires re-embedding)
      # - Set QDRANT_COLLECTION to override auto-generation:
      - QDRANT_COLLECTION=nextcloud_content
      # Ollama configuration (optional - uses SimpleEmbeddingProvider if not set)
      # - OLLAMA_BASE_URL=https://ollama.internal.coutinho.io:443
      # - OLLAMA_EMBEDDING_MODEL=nomic-embed-text # Changing this creates new collection
      # - OLLAMA_VERIFY_SSL=false
      # Document chunking configuration (for vector embeddings)
      # Tune these based on your embedding model and content type
      # - DOCUMENT_CHUNK_SIZE=512 # Words per chunk (default: 512)
      # - DOCUMENT_CHUNK_OVERLAP=50 # Overlapping words (default: 50, recommended: 10-20% of chunk size)
  mcp-oauth:
    build: .
    command: ["--transport", "streamable-http", "--oauth", "--port", "8001", "--oauth-token-type", "jwt"]
    restart: always
    depends_on:
      app:
        condition: service_healthy
    ports:
      - 127.0.0.1:8001:8001
    environment:
      # Generic OIDC configuration (integrated mode - Nextcloud OIDC app)
      # OIDC_DISCOVERY_URL not set - defaults to NEXTCLOUD_HOST/.well-known/openid-configuration
      # OIDC_CLIENT_ID not set - uses Dynamic Client Registration (DCR)
      - NEXTCLOUD_HOST=http://app:80
      - NEXTCLOUD_MCP_SERVER_URL=http://localhost:8001
      - NEXTCLOUD_RESOURCE_URI=http://localhost:8080 # ADR-005: Nextcloud resource identifier for audience validation
      - NEXTCLOUD_PUBLIC_ISSUER_URL=http://localhost:8080
      - NEXTCLOUD_OIDC_SCOPES=openid profile email notes:read notes:write calendar:read calendar:write contacts:read contacts:write cookbook:read cookbook:write deck:read deck:write tables:read tables:write files:read files:write sharing:read sharing:write todo:read todo:write
      # Refresh token storage (ADR-002 Tier 1)
      - ENABLE_OFFLINE_ACCESS=true
      - TOKEN_ENCRYPTION_KEY=ESF1BvEQdGYsCluwMx9Cxvw3uh5pFowPH7Rg_nIliyo=
      - TOKEN_STORAGE_DB=/app/data/tokens.db
      # ADR-005: Multi-audience mode (default - ENABLE_TOKEN_EXCHANGE=false)
      # Tokens must contain BOTH MCP and Nextcloud audiences
      # No token exchange needed - tokens work for both MCP auth and Nextcloud APIs
      # NO admin credentials - using OAuth with Dynamic Client Registration (DCR)
      # Client credentials registered via RFC 7591 and stored in volume
      # JWT token type is used for testing (faster validation, scopes embedded in token)
    volumes:
      - oauth-client-storage:/app/.oauth
      - oauth-tokens:/app/data
  keycloak:
    image: quay.io/keycloak/keycloak:26.4.4@sha256:c6459d5fae1b759f5d667ebdc6237ab3121379c3494e213898569014ede1846d
    command:
      - "start-dev"
      - "--import-realm"
      - "--hostname=http://localhost:8888"
      - "--hostname-strict=false"
      - "--hostname-backchannel-dynamic=true"
      - "--features=preview" # Enable Legacy V1 token exchange (supports both Standard V2 and Legacy V1)
    ports:
      - 127.0.0.1:8888:8080
    environment:
      - KC_BOOTSTRAP_ADMIN_USERNAME=admin
      - KC_BOOTSTRAP_ADMIN_PASSWORD=admin
    volumes:
      - ./keycloak/realm-export.json:/opt/keycloak/data/import/realm.json:ro
    healthcheck:
      test: ["CMD-SHELL", "exec 3<>/dev/tcp/localhost/8080 && echo -e 'GET /realms/nextcloud-mcp HTTP/1.1\\r\\nHost: localhost\\r\\nConnection: close\\r\\n\\r\\n' >&3 && cat <&3 | grep -q 'HTTP/1.1 200'"]
      interval: 10s
      timeout: 5s
      retries: 30
  mcp-keycloak:
    build: .
    command: ["--transport", "streamable-http", "--oauth", "--port", "8002"]
    restart: always
    depends_on:
      keycloak:
        condition: service_healthy
      app:
        condition: service_started
    ports:
      - 127.0.0.1:8002:8002
    environment:
      # Generic OIDC configuration (external IdP mode - Keycloak)
      # Provider auto-detected from OIDC_DISCOVERY_URL issuer
      # Using internal Docker hostname for discovery to get consistent issuer
      - OIDC_DISCOVERY_URL=http://keycloak:8080/realms/nextcloud-mcp/.well-known/openid-configuration
      - OIDC_CLIENT_ID=nextcloud-mcp-server
      - OIDC_CLIENT_SECRET=mcp-secret-change-in-production
      - OIDC_JWKS_URI=http://keycloak:8080/realms/nextcloud-mcp/protocol/openid-connect/certs
      # Nextcloud API endpoint (for accessing APIs with validated token)
      - NEXTCLOUD_HOST=http://app:80
      - NEXTCLOUD_MCP_SERVER_URL=http://localhost:8002
      - NEXTCLOUD_RESOURCE_URI=nextcloud # ADR-005: Keycloak uses client IDs as audiences, not URLs
      - NEXTCLOUD_PUBLIC_ISSUER_URL=http://localhost:8888/realms/nextcloud-mcp
      # Refresh token storage (ADR-002 Tier 1 & 2)
      - ENABLE_OFFLINE_ACCESS=true
      - TOKEN_ENCRYPTION_KEY=ESF1BvEQdGYsCluwMx9Cxvw3uh5pFowPH7Rg_nIliyo=
      - TOKEN_STORAGE_DB=/app/data/tokens.db
      # ADR-005: Token exchange mode (RFC 8693)
      # Exchange MCP tokens (aud: nextcloud-mcp-server) for Nextcloud tokens (aud: http://localhost:8080)
      # Provides strict audience separation between MCP session and Nextcloud API access
      - ENABLE_TOKEN_EXCHANGE=true
      - TOKEN_EXCHANGE_CACHE_TTL=300 # Cache exchanged tokens for 5 minutes (default)
      # OAuth scopes (optional - uses defaults if not specified)
      - NEXTCLOUD_OIDC_SCOPES=openid profile email offline_access notes:read notes:write calendar:read calendar:write contacts:read contacts:write cookbook:read cookbook:write deck:read deck:write tables:read tables:write files:read files:write sharing:read sharing:write todo:read todo:write
      # NO admin credentials - using external IdP OAuth only!
    volumes:
      - keycloak-tokens:/app/data
      - keycloak-oauth-storage:/app/.oauth
  qdrant:
    image: qdrant/qdrant:v1.15.5
    restart: always
    ports:
      - 127.0.0.1:6333:6333 # REST API
      - 127.0.0.1:6334:6334 # gRPC (optional)
    volumes:
      - qdrant-data:/qdrant/storage
    environment:
      - QDRANT__SERVICE__API_KEY=${QDRANT_API_KEY:-my_secret_api_key}
    healthcheck:
      test: ["CMD-SHELL", "test -f /qdrant/.qdrant-initialized"]
      interval: 10s
      timeout: 5s
      retries: 10
    profiles:
      - qdrant
volumes:
  nextcloud:
  db:
  oauth-client-storage:
  oauth-tokens:
  keycloak-tokens:
  keycloak-oauth-storage:
  qdrant-data:
  mcp-data: