Chris Coutinho cb39b3fca4 feat(vector): Add configurable chunk size and overlap for document embedding

Enable users to tune document chunking parameters to match their embedding
model and content type by adding DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP
environment variables.

- **config.py**: Added `document_chunk_size` (default: 512) and
  `document_chunk_overlap` (default: 50) configuration fields with validation:
  - Ensures overlap < chunk_size
  - Warns if chunk_size < 100 words
  - Prevents negative overlap values

- **processor.py**: Updated DocumentChunker instantiation to use config
  settings instead of hardcoded values (line 174-177)

- **tests/unit/test_config.py**: Added TestChunkConfigValidation class with
  9 tests covering:
  - Default values
  - Valid configurations
  - Validation errors (overlap >= chunk_size, negative overlap)
  - Warning for small chunk sizes
  - Environment variable loading

- **docs/configuration.md**: Added comprehensive "Document Chunking
  Configuration" section with:
  - Chunk size selection guidance (256-384 vs 512 vs 768-1024 words)
  - Overlap recommendations (10-20% of chunk size)
  - Configuration examples for different use cases
  - Added env vars to reference table

- **docs/semantic-search-architecture.md**: Added "Document Chunking Strategy"
  section with:
  - Chunking process explanation
  - Example showing sliding window behavior
  - Search behavior with chunks
  - Tuning recommendations

- **env.sample**: Added complete "Semantic Search & Vector Sync Configuration"
  section with:
  - Vector sync settings
  - Qdrant configuration (3 modes)
  - Ollama embedding service
  - Document chunking configuration

- **docker-compose.yml**: Added commented examples for DOCUMENT_CHUNK_SIZE and
  DOCUMENT_CHUNK_OVERLAP with usage notes

```bash
DOCUMENT_CHUNK_SIZE=512

DOCUMENT_CHUNK_OVERLAP=50
```

1. `overlap` must be less than `chunk_size`
2. `overlap` cannot be negative
3. Warning issued if `chunk_size` < 100 words

**Precise matching** (small notes, specific queries):
```bash
DOCUMENT_CHUNK_SIZE=256
DOCUMENT_CHUNK_OVERLAP=25
```

**Balanced** (default, general purpose):
```bash
DOCUMENT_CHUNK_SIZE=512
DOCUMENT_CHUNK_OVERLAP=50
```

**Contextual** (long documents, broader topics):
```bash
DOCUMENT_CHUNK_SIZE=1024
DOCUMENT_CHUNK_OVERLAP=100
```

✅ **User control** - Tune chunking to match embedding model capabilities
✅ **Experimentation** - Test different chunk sizes for optimal results
✅ **Model alignment** - Match chunk size to embedding context window
✅ **Backward compatible** - Defaults maintain existing behavior
✅ **Well validated** - Comprehensive tests prevent misconfiguration

All 22 config validation tests pass (9 new tests for chunking):
- Default values work correctly
- Validation prevents invalid configurations
- Environment variables load properly
- Warning system works as expected

With configurable chunk sizes, users can now experiment with different Ollama
embedding models and tune chunk parameters for optimal semantic search quality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-10 02:47:57 +01:00

Configuration

The Nextcloud MCP server requires configuration to connect to your Nextcloud instance. Configuration is provided through environment variables, typically stored in a .env file.

Quick Start

Create a .env file based on env.sample:

```bash
cp env.sample .env
# Edit .env with your Nextcloud details
```

Then choose your authentication mode:

OAuth2/OIDC Configuration (Recommended)
Basic Authentication Configuration

OAuth2/OIDC Configuration

OAuth2/OIDC is the recommended authentication mode for production deployments.

Minimal Configuration (Auto-registration)

```bash
# .env file for OAuth with auto-registration
NEXTCLOUD_HOST=https://your.nextcloud.instance.com

# Leave these EMPTY for OAuth mode
NEXTCLOUD_USERNAME=
NEXTCLOUD_PASSWORD=
```

This minimal configuration uses dynamic client registration to automatically register an OAuth client at startup.

Full Configuration (Pre-configured Client)

```bash
# .env file for OAuth with pre-configured client
NEXTCLOUD_HOST=https://your.nextcloud.instance.com

# OAuth Client Credentials (optional - auto-registers if not provided)
NEXTCLOUD_OIDC_CLIENT_ID=your-client-id
NEXTCLOUD_OIDC_CLIENT_SECRET=your-client-secret

# OAuth Callback Settings (optional)
NEXTCLOUD_MCP_SERVER_URL=http://localhost:8000

# Leave these EMPTY for OAuth mode
NEXTCLOUD_USERNAME=
NEXTCLOUD_PASSWORD=
```

Environment Variables Reference

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `NEXTCLOUD_HOST` | ✅ Yes | - | Full URL of your Nextcloud instance (e.g., `https://cloud.example.com`) |
| `NEXTCLOUD_OIDC_CLIENT_ID` | ⚠️ Optional | - | OAuth client ID (auto-registers if empty) |
| `NEXTCLOUD_OIDC_CLIENT_SECRET` | ⚠️ Optional | - | OAuth client secret (auto-registers if empty) |
| `NEXTCLOUD_MCP_SERVER_URL` | ⚠️ Optional | `http://localhost:8000` | MCP server URL for OAuth callbacks |
| `NEXTCLOUD_USERNAME` | ❌ Must be empty | - | Leave empty to enable OAuth mode |
| `NEXTCLOUD_PASSWORD` | ❌ Must be empty | - | Leave empty to enable OAuth mode |

Prerequisites

Before using OAuth configuration:

1. Install required Nextcloud apps (both are required):
   - oidc - OIDC Identity Provider (Apps → Security)
   - user_oidc - OpenID Connect user backend (Apps → Security)
2. Configure the apps:
   - Enable dynamic client registration (if using auto-registration) - Settings → OIDC
   - Enable Bearer token validation: `php occ config:system:set user_oidc oidc_provider_bearer_validation --value=true --type=boolean`
3. Apply Bearer token patch - The user_oidc app requires a patch for non-OCS endpoints. See Upstream Status for details.

See the OAuth Setup Guide for detailed step-by-step instructions, or OAuth Quick Start for a 5-minute setup.

Basic Authentication (Legacy)

Basic Authentication is maintained for backward compatibility. It uses username and password credentials.

Warning

Security Notice: Basic Authentication stores credentials in environment variables and is less secure than OAuth. Use OAuth for production deployments.

Configuration

```bash
# .env file for BasicAuth mode
NEXTCLOUD_HOST=https://your.nextcloud.instance.com
NEXTCLOUD_USERNAME=your_nextcloud_username
NEXTCLOUD_PASSWORD=your_app_password_or_password
```

Environment Variables Reference

| Variable | Required | Description |
|----------|----------|-------------|
| `NEXTCLOUD_HOST` | ✅ Yes | Full URL of your Nextcloud instance |
| `NEXTCLOUD_USERNAME` | ✅ Yes | Your Nextcloud username |
| `NEXTCLOUD_PASSWORD` | ✅ Yes | Recommended: Use a dedicated Nextcloud App Password. Generate one in Nextcloud Security settings. Alternatively, use your login password (less secure). |
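
Taken together, the OAuth and Basic Authentication sections suggest that the mode follows from whether the credential variables are filled in. The sketch below illustrates that rule only; `detect_auth_mode` is a hypothetical helper, and treating a half-filled pair as an error is an assumption, not documented server behavior.

```python
from typing import Mapping


def detect_auth_mode(env: Mapping[str, str]) -> str:
    """Return "oauth" or "basic" based on the credential variables.

    Hypothetical helper: the server's real detection logic may differ.
    """
    username = env.get("NEXTCLOUD_USERNAME", "").strip()
    password = env.get("NEXTCLOUD_PASSWORD", "").strip()

    if not username and not password:
        return "oauth"  # both empty: OAuth mode, as described above
    if username and password:
        return "basic"  # both set: BasicAuth mode
    # Assumption: a half-filled pair is treated as a misconfiguration.
    raise ValueError("Set both NEXTCLOUD_USERNAME and NEXTCLOUD_PASSWORD, or neither")


print(detect_auth_mode({"NEXTCLOUD_USERNAME": "", "NEXTCLOUD_PASSWORD": ""}))  # oauth
```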

Semantic Search Configuration (Optional)

The MCP server includes semantic search capabilities powered by vector embeddings. This feature requires a vector database (Qdrant) and an embedding service.

Qdrant Vector Database Modes

The server supports three Qdrant deployment modes:

In-Memory Mode (Default) - Simplest for development and testing
Persistent Local Mode - For single-instance deployments with persistence
Network Mode - For production with dedicated Qdrant service

1. In-Memory Mode (Default)

No configuration needed! If neither QDRANT_URL nor QDRANT_LOCATION is set, the server defaults to in-memory mode:

```bash
# No Qdrant configuration needed - defaults to :memory:
VECTOR_SYNC_ENABLED=true
```

Pros:

Zero configuration
Fast startup
Perfect for testing

Cons:

Data lost on restart
Limited to available RAM

2. Persistent Local Mode

For single-instance deployments that need persistence without a separate Qdrant service:

```bash
# Local persistent storage
QDRANT_LOCATION=/app/data/qdrant  # Or any writable path
VECTOR_SYNC_ENABLED=true
```

Pros:

Data persists across restarts
No separate service needed
Suitable for small/medium deployments

Cons:

Limited to single instance
Shares resources with MCP server

3. Network Mode

For production deployments with a dedicated Qdrant service:

```bash
# Network mode configuration
QDRANT_URL=http://qdrant:6333
QDRANT_API_KEY=your-secret-api-key  # Optional
QDRANT_COLLECTION=nextcloud_content  # Optional
VECTOR_SYNC_ENABLED=true
```

Pros:

Scalable and performant
Can be shared across multiple MCP instances
Supports clustering and replication

Cons:

Requires separate Qdrant service
More complex deployment
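
The selection between the three modes can be pictured as a small decision function. This is a sketch of the rules stated above, not the server's code: `resolve_qdrant_mode` is a hypothetical name, and rejecting a configuration that sets both variables is an assumption based on them being described as mutually exclusive.

```python
from typing import Mapping, Tuple


def resolve_qdrant_mode(env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (mode, target) where mode is "memory", "local", or "network"."""
    url = env.get("QDRANT_URL", "").strip()
    location = env.get("QDRANT_LOCATION", "").strip()

    if url and location:
        # Assumption: setting both is an error rather than one silently winning.
        raise ValueError("QDRANT_URL and QDRANT_LOCATION are mutually exclusive")
    if url:
        return ("network", url)
    if location and location != ":memory:":
        return ("local", location)
    return ("memory", ":memory:")  # default when neither variable is set


print(resolve_qdrant_mode({}))                                       # ('memory', ':memory:')
print(resolve_qdrant_mode({"QDRANT_LOCATION": "/app/data/qdrant"}))  # ('local', '/app/data/qdrant')
print(resolve_qdrant_mode({"QDRANT_URL": "http://qdrant:6333"}))     # ('network', 'http://qdrant:6333')
```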

Qdrant Collection Naming

Collection names are automatically generated to include the embedding model, ensuring safe model switching and preventing dimension mismatches.

Auto-Generated Naming (Default)

Format: {deployment-id}-{model-name}

Components:

Deployment ID: OTEL_SERVICE_NAME (if configured) or hostname (fallback)
Model name: OLLAMA_EMBEDDING_MODEL

Examples:

```bash
# With OTEL service name configured
OTEL_SERVICE_NAME=my-mcp-server
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
# → Collection: "my-mcp-server-nomic-embed-text"

# Simple Docker deployment (OTEL not configured)
# hostname=mcp-container
OLLAMA_EMBEDDING_MODEL=all-minilm
# → Collection: "mcp-container-all-minilm"
```

Switching Embedding Models

When you change OLLAMA_EMBEDDING_MODEL, a new collection is automatically created:

```bash
# Initial setup
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
# Collection: "my-server-nomic-embed-text" (768 dimensions)

# Change model
OLLAMA_EMBEDDING_MODEL=all-minilm
# Collection: "my-server-all-minilm" (384 dimensions)
# → New collection created, full re-embedding occurs
```

Important:

Collections are mutually exclusive - vectors cannot be shared between different embedding models
Switching models requires re-embedding all documents (may take time for large note collections)
Old collection remains in Qdrant and can be deleted manually if no longer needed

Explicit Override

Set QDRANT_COLLECTION to use a specific collection name:

```bash
QDRANT_COLLECTION=my-custom-collection  # Bypasses auto-generation
```

Use cases:

Backward compatibility with existing deployments
Custom naming schemes
Sharing a collection across deployments (advanced)
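
The naming rules above (explicit override first, then `{deployment-id}-{model-name}`) can be sketched as follows. `collection_name` is a hypothetical helper written for illustration; the fallback model `nomic-embed-text` is taken from the documented default, and the real implementation may normalize names differently.

```python
import socket
from typing import Mapping, Optional


def collection_name(env: Mapping[str, str], hostname: Optional[str] = None) -> str:
    """Derive the Qdrant collection name from the environment (illustrative only)."""
    explicit = env.get("QDRANT_COLLECTION", "").strip()
    if explicit:
        return explicit  # explicit override bypasses auto-generation

    # Deployment ID: OTEL_SERVICE_NAME if configured, otherwise the hostname.
    deployment_id = (
        env.get("OTEL_SERVICE_NAME", "").strip() or hostname or socket.gethostname()
    )
    model = env.get("OLLAMA_EMBEDDING_MODEL", "").strip() or "nomic-embed-text"
    return f"{deployment_id}-{model}"


print(collection_name({"OTEL_SERVICE_NAME": "my-mcp-server",
                       "OLLAMA_EMBEDDING_MODEL": "nomic-embed-text"}))
# my-mcp-server-nomic-embed-text
print(collection_name({"OLLAMA_EMBEDDING_MODEL": "all-minilm"}, hostname="mcp-container"))
# mcp-container-all-minilm
```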

Multi-Server Deployments

Each server should have a unique deployment ID to avoid collection collisions:

```bash
# Server 1 (Production)
OTEL_SERVICE_NAME=mcp-prod
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
# → Collection: "mcp-prod-nomic-embed-text"

# Server 2 (Staging)
OTEL_SERVICE_NAME=mcp-staging
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
# → Collection: "mcp-staging-nomic-embed-text"

# Server 3 (Different model)
OTEL_SERVICE_NAME=mcp-experimental
OLLAMA_EMBEDDING_MODEL=bge-large
# → Collection: "mcp-experimental-bge-large"
```

Benefits:

Multiple MCP servers can share one Qdrant instance safely
No naming collisions between deployments
Clear collection ownership (can see which deployment and model)

Dimension Validation

The server validates collection dimensions on startup:

```
Dimension mismatch for collection 'my-server-nomic-embed-text':
  Expected: 384 (from embedding model 'all-minilm')
  Found: 768
This usually means you changed the embedding model.
Solutions:
  1. Delete the old collection: Collection will be recreated with new dimensions
  2. Set QDRANT_COLLECTION to use a different collection name
  3. Revert OLLAMA_EMBEDDING_MODEL to the original model
```

What this prevents:

Runtime errors from dimension mismatches
Data corruption in Qdrant
Confusing error messages during indexing
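
A startup check of this kind can be as small as one comparison. The sketch below assumes the caller has already obtained the collection's stored vector size and the model's output size; `check_dimensions` and `DimensionMismatchError` are hypothetical names, and the message only mirrors the first lines of the error shown above.

```python
class DimensionMismatchError(ValueError):
    """Raised when a collection's vector size differs from the model's output size."""


def check_dimensions(collection: str, model: str, expected: int, found: int) -> None:
    """Fail fast before indexing if the collection cannot hold the model's vectors."""
    if expected != found:
        raise DimensionMismatchError(
            f"Dimension mismatch for collection '{collection}':\n"
            f"  Expected: {expected} (from embedding model '{model}')\n"
            f"  Found: {found}"
        )


# Matching sizes pass silently; a mismatch raises before any vector is written.
check_dimensions("my-server-all-minilm", "all-minilm", expected=384, found=384)
```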

Vector Sync Configuration

Control background indexing behavior:

```bash
# Vector sync settings (ADR-007)
VECTOR_SYNC_ENABLED=true              # Enable background indexing
VECTOR_SYNC_SCAN_INTERVAL=300         # Scan interval in seconds (default: 5 minutes)
VECTOR_SYNC_PROCESSOR_WORKERS=3       # Concurrent indexing workers (default: 3)
VECTOR_SYNC_QUEUE_MAX_SIZE=10000      # Max queued documents (default: 10000)

# Document chunking settings (for vector embeddings)
DOCUMENT_CHUNK_SIZE=512               # Words per chunk (default: 512)
DOCUMENT_CHUNK_OVERLAP=50             # Overlapping words between chunks (default: 50)
```
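
To make the worker and queue settings more concrete, here is a generic bounded-queue pattern using `asyncio`. It is only an illustration of how a worker count and a maximum queue size usually interact; `run_indexing` is a hypothetical function, the `sleep` stands in for real indexing work, and whether the server blocks or drops documents when its queue is full is not stated in this document.

```python
import asyncio
from typing import List


async def run_indexing(doc_ids: List[str], workers: int = 3,
                       queue_max_size: int = 10000) -> List[str]:
    """Process doc_ids with a fixed number of workers reading from a bounded queue."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_max_size)
    indexed: List[str] = []

    async def worker() -> None:
        while True:
            doc_id = await queue.get()
            try:
                await asyncio.sleep(0)  # stand-in for chunking, embedding, and upserting
                indexed.append(doc_id)
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    for doc_id in doc_ids:
        # Assumption: the producer waits when the queue is full (backpressure).
        await queue.put(doc_id)
    await queue.join()  # wait until every queued document has been processed

    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return indexed


if __name__ == "__main__":
    done = asyncio.run(run_indexing([f"note-{i}" for i in range(10)],
                                    workers=3, queue_max_size=4))
    print(len(done))  # 10
```

More workers raise throughput only while the embedding service can keep up; the queue bound caps memory use during a large initial scan.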

Embedding Service Configuration

The server uses an embedding service to generate vector representations. Two options are available:

Ollama (Recommended)

Use a local Ollama instance for embeddings:

```bash
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_EMBEDDING_MODEL=nomic-embed-text  # Default model
OLLAMA_VERIFY_SSL=true                   # Verify SSL certificates
```

Simple Embedding Provider (Fallback)

If OLLAMA_BASE_URL is not set, the server uses a simple random embedding provider for testing. This is not suitable for production as it generates random embeddings with no semantic meaning.
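
The following sketch shows why a random provider is useful only for testing. `pick_provider` and `random_embedding` are hypothetical names; the 384-dimension default and the choice to seed the generator with the input text are assumptions made for this example, not a description of the server's fallback provider.

```python
import random
from typing import List, Mapping


def pick_provider(env: Mapping[str, str]) -> str:
    """Return "ollama" when OLLAMA_BASE_URL is set, otherwise the testing fallback."""
    return "ollama" if env.get("OLLAMA_BASE_URL", "").strip() else "simple"


def random_embedding(text: str, dimension: int = 384) -> List[float]:
    """Produce a repeatable but meaningless vector for the given text."""
    rng = random.Random(text)  # seeded with the text, so equal inputs give equal vectors
    return [rng.uniform(-1.0, 1.0) for _ in range(dimension)]


print(pick_provider({}))                                          # simple
print(pick_provider({"OLLAMA_BASE_URL": "http://ollama:11434"}))  # ollama

# Related sentences get unrelated vectors, so similarity search returns noise.
a = random_embedding("meeting notes for Monday")
b = random_embedding("notes from the Monday meeting")
print(len(a), a == b)  # 384 False
```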

Document Chunking Configuration

The server chunks documents before embedding to handle documents larger than the embedding model's context window. Chunk size and overlap can be tuned based on your embedding model and content type.

Choosing Chunk Size

Smaller chunks (256-384 words):

More precise matching
Less context per chunk
Better for finding specific information
Higher storage requirements (more vectors)

Larger chunks (768-1024 words):

More context per chunk
Less precise matching
Better for understanding broader topics
Lower storage requirements (fewer vectors)

Default (512 words):

Balanced approach suitable for most use cases
Works well with typical note lengths
Good compromise between precision and context
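
The storage trade-off can be estimated with a little arithmetic. Assuming a plain sliding window in which each chunk starts `chunk_size - overlap` words after the previous one (the server's chunker may handle the final chunk differently), the number of vectors per document is roughly:

```python
import math


def estimate_chunks(total_words: int, chunk_size: int = 512, overlap: int = 50) -> int:
    """Estimate how many chunks (vectors) a document of total_words produces."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_words <= 0:
        return 0
    if total_words <= chunk_size:
        return 1
    step = chunk_size - overlap  # words by which the window advances
    return 1 + math.ceil((total_words - chunk_size) / step)


# A 2000-word document under the three example configurations:
print(estimate_chunks(2000, 256, 25))    # 9
print(estimate_chunks(2000, 512, 50))    # 5
print(estimate_chunks(2000, 1024, 100))  # 3
```

Under these assumptions, halving the chunk size roughly doubles the number of vectors stored for long documents.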

Choosing Overlap

Overlap preserves context across chunk boundaries. Recommended settings:

Roughly 10-20% of chunk size (e.g., 50-100 words for 512-word chunks)
Too small (well under 10%): May lose context at boundaries
Too large (>20%): Redundant storage, diminishing returns

Examples:

```bash
# Precise matching for short notes
DOCUMENT_CHUNK_SIZE=256
DOCUMENT_CHUNK_OVERLAP=25

# Default balanced configuration
DOCUMENT_CHUNK_SIZE=512
DOCUMENT_CHUNK_OVERLAP=50

# More context for long documents
DOCUMENT_CHUNK_SIZE=1024
DOCUMENT_CHUNK_OVERLAP=100
```
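
How size and overlap interact is easiest to see on a tiny input. The function below is a minimal word-based sliding window written for illustration; `chunk_words` is a hypothetical name and is not the project's `DocumentChunker`, which may split text differently.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """Split text into chunks of chunk_size words, each sharing overlap words
    with the previous chunk."""
    if overlap < 0 or overlap >= chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    if not words:
        return []

    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while True:
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the document
        start += step
    return chunks


# Ten "words", 4 per chunk, 1 shared between neighbours:
print(chunk_words("0 1 2 3 4 5 6 7 8 9", chunk_size=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Each boundary word (`3` and `6` here) appears in two chunks, which is what keeps a sentence that straddles a boundary searchable from either side.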

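Important: Changing chunk size or overlap requires re-embedding all documents. Auto-generated collection names (see "Qdrant Collection Naming" above) encode only the deployment ID and the embedding model, not the chunk settings, so changing chunk parameters alone does not appear to produce a new collection. To keep differently chunked vectors apart, set QDRANT_COLLECTION to a distinct name for each configuration.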

Environment Variables Reference

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `QDRANT_URL` | ⚠️ Optional | - | Qdrant service URL (network mode) - mutually exclusive with `QDRANT_LOCATION` |
| `QDRANT_LOCATION` | ⚠️ Optional | `:memory:` | Local Qdrant path (`:memory:` or `/path/to/data`) - mutually exclusive with `QDRANT_URL` |
| `QDRANT_API_KEY` | ⚠️ Optional | - | Qdrant API key (network mode only) |
| `QDRANT_COLLECTION` | ⚠️ Optional | auto-generated (`{deployment-id}-{model-name}`) | Explicit Qdrant collection name; overrides auto-generated naming |
| `VECTOR_SYNC_ENABLED` | ⚠️ Optional | `false` | Enable background vector indexing |
| `VECTOR_SYNC_SCAN_INTERVAL` | ⚠️ Optional | `300` | Document scan interval (seconds) |
| `VECTOR_SYNC_PROCESSOR_WORKERS` | ⚠️ Optional | `3` | Concurrent indexing workers |
| `VECTOR_SYNC_QUEUE_MAX_SIZE` | ⚠️ Optional | `10000` | Max queued documents |
| `OLLAMA_BASE_URL` | ⚠️ Optional | - | Ollama API endpoint for embeddings |
| `OLLAMA_EMBEDDING_MODEL` | ⚠️ Optional | `nomic-embed-text` | Embedding model to use |
| `OLLAMA_VERIFY_SSL` | ⚠️ Optional | `true` | Verify SSL certificates |
| `DOCUMENT_CHUNK_SIZE` | ⚠️ Optional | `512` | Words per chunk for document embedding |
| `DOCUMENT_CHUNK_OVERLAP` | ⚠️ Optional | `50` | Overlapping words between chunks (must be < chunk size) |
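
The constraints on the two chunking variables (overlap below chunk size, no negative overlap, a warning for chunk sizes under 100 words) can be expressed as a small loader. This is a sketch of those rules, not the contents of `config.py`; `load_chunk_settings` is a hypothetical name and the exact error messages are invented for the example.

```python
import warnings
from typing import Mapping, Tuple


def load_chunk_settings(env: Mapping[str, str]) -> Tuple[int, int]:
    """Read and validate DOCUMENT_CHUNK_SIZE and DOCUMENT_CHUNK_OVERLAP."""
    chunk_size = int(env.get("DOCUMENT_CHUNK_SIZE") or "512")
    overlap = int(env.get("DOCUMENT_CHUNK_OVERLAP") or "50")

    if overlap < 0:
        raise ValueError("DOCUMENT_CHUNK_OVERLAP cannot be negative")
    if overlap >= chunk_size:
        raise ValueError("DOCUMENT_CHUNK_OVERLAP must be less than DOCUMENT_CHUNK_SIZE")
    if chunk_size < 100:
        # Small chunks are allowed but flagged, since they carry little context.
        warnings.warn(f"DOCUMENT_CHUNK_SIZE={chunk_size} is below 100 words")
    return chunk_size, overlap


print(load_chunk_settings({}))  # (512, 50)
print(load_chunk_settings({"DOCUMENT_CHUNK_SIZE": "1024",
                           "DOCUMENT_CHUNK_OVERLAP": "100"}))  # (1024, 100)
```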

Docker Compose Example

Enable network mode Qdrant with docker-compose:

```yaml
services:
  mcp:
    environment:
      - QDRANT_URL=http://qdrant:6333
      - VECTOR_SYNC_ENABLED=true

  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - 127.0.0.1:6333:6333
    volumes:
      - qdrant-data:/qdrant/storage
    profiles:
      - qdrant  # Optional service

volumes:
  qdrant-data:
```

Start with Qdrant service:

```bash
docker-compose --profile qdrant up
```

Or use default in-memory mode (no --profile needed):

```bash
docker-compose up
```

Loading Environment Variables

After creating your .env file, load the environment variables:

On Linux/macOS

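```bash
# Load all variables from .env (the shell handles inline comments and quoted values)
set -a          # auto-export every variable assigned from here on
. ./.env
set +a
```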

On Windows (PowerShell)

```powershell
# Load variables from .env
Get-Content .env | ForEach-Object {
    if ($_ -match '^\s*([^#][^=]*)\s*=\s*(.*)$') {
        [Environment]::SetEnvironmentVariable($matches[1].Trim(), $matches[2].Trim(), "Process")
    }
}
```

Via Docker

```bash
# Docker automatically loads .env when using --env-file
docker run -p 127.0.0.1:8000:8000 --env-file .env --rm \
  ghcr.io/cbcoutinho/nextcloud-mcp-server:latest
```

CLI Configuration

Some configuration options can also be provided via CLI arguments. CLI arguments take precedence over environment variables.

```bash
uv run nextcloud-mcp-server --help
```

OAuth-related CLI Options

```
Options:
  --oauth / --no-oauth            Force OAuth mode (if enabled) or
                                  BasicAuth mode (if disabled). By default,
                                  auto-detected based on environment
                                  variables.
  --oauth-client-id TEXT          OAuth client ID (can also use
                                  NEXTCLOUD_OIDC_CLIENT_ID env var)
  --oauth-client-secret TEXT      OAuth client secret (can also use
                                  NEXTCLOUD_OIDC_CLIENT_SECRET env var)
  --mcp-server-url TEXT           MCP server URL for OAuth callbacks (can
                                  also use NEXTCLOUD_MCP_SERVER_URL env
                                  var)  [default: http://localhost:8000]
```

Server Options

```
Options:
  -h, --host TEXT                 Server host  [default: 127.0.0.1]
  -p, --port INTEGER              Server port  [default: 8000]
  -w, --workers INTEGER           Number of worker processes
  -r, --reload                    Enable auto-reload
  -l, --log-level [critical|error|warning|info|debug|trace]
                                  Logging level  [default: info]
  -t, --transport [sse|streamable-http|http]
                                  MCP transport protocol  [default: sse]
```

App Selection

```
Options:
  -e, --enable-app [notes|tables|webdav|calendar|contacts|deck]
                                  Enable specific Nextcloud app APIs. Can
                                  be specified multiple times. If not
                                  specified, all apps are enabled.
```

Example CLI Usage

```bash
# OAuth mode with custom client and port
uv run nextcloud-mcp-server --oauth \
  --oauth-client-id abc123 \
  --oauth-client-secret xyz789 \
  --port 8080

# BasicAuth mode with specific apps only
uv run nextcloud-mcp-server --no-oauth \
  --enable-app notes \
  --enable-app calendar
```
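
The precedence rule (CLI argument first, then environment variable, then built-in default) can be sketched as a single lookup. `resolve_option` is a hypothetical helper for illustration; the actual CLI is presumably wired through its option parser rather than a function like this.

```python
from typing import Mapping, Optional


def resolve_option(cli_value: Optional[str], env: Mapping[str, str],
                   env_name: str, default: Optional[str] = None) -> Optional[str]:
    """Pick a setting: CLI argument, else environment variable, else default."""
    if cli_value is not None:
        return cli_value  # an explicit CLI argument always wins
    env_value = env.get(env_name, "").strip()
    if env_value:
        return env_value
    return default


env = {"NEXTCLOUD_MCP_SERVER_URL": "https://mcp.example.com"}
print(resolve_option("http://localhost:8080", env, "NEXTCLOUD_MCP_SERVER_URL"))
# http://localhost:8080
print(resolve_option(None, env, "NEXTCLOUD_MCP_SERVER_URL"))
# https://mcp.example.com
print(resolve_option(None, {}, "NEXTCLOUD_MCP_SERVER_URL", "http://localhost:8000"))
# http://localhost:8000
```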

Configuration Best Practices

For Development

Use BasicAuth for quick setup and testing
Or use OAuth with auto-registration (dynamic client registration)
Store .env file in your project directory
Add .env to .gitignore

For Production

Always use OAuth2/OIDC with pre-configured clients
Store OAuth client credentials securely
Inject configuration through your deployment platform: Docker secrets or Kubernetes Secrets for credentials, ConfigMaps for non-sensitive settings
Never commit credentials to version control
SQLite database permissions are handled automatically by the server

For Docker

Mount OAuth client storage as a volume for persistence:

```bash
docker run -v $(pwd)/.oauth:/app/.oauth --env-file .env \
  ghcr.io/cbcoutinho/nextcloud-mcp-server:latest
```

Use Docker secrets for sensitive values in production
