feat(vector): Support multiple embedding models with auto-generated collection names

This PR enables safe switching between embedding models and multi-server
deployments by implementing auto-generated Qdrant collection names based on
deployment ID and model name.

## Problem

Previously, all deployments used a single hardcoded collection name
"nextcloud_content", which caused two critical issues:

1. **Dimension mismatches when switching models**: Changing
   OLLAMA_EMBEDDING_MODEL (e.g., nomic-embed-text at 768D → all-minilm at
   384D) would cause runtime errors as vectors couldn't be inserted into a
   collection with incompatible dimensions.

2. **Collection collisions in multi-server setups**: Multiple MCP servers
   sharing a single Qdrant instance would overwrite each other's data,
   making horizontal scaling impossible.

## Solution

### Auto-Generated Collection Naming

Collections are now automatically named using the pattern:
\`{deployment-id}-{model-name}\`

**Deployment ID**: Uses \`OTEL_SERVICE_NAME\` if configured (and not default
value), otherwise falls back to \`hostname\` for simple Docker deployments.

**Model Name**: From \`OLLAMA_EMBEDDING_MODEL\` with path separators sanitized.

**Examples**:
- \`my-mcp-server-nomic-embed-text\` (with OTEL_SERVICE_NAME=my-mcp-server)
- \`mcp-container-all-minilm\` (simple Docker, hostname=mcp-container)

**Override**: Users can still set \`QDRANT_COLLECTION\` explicitly to bypass
auto-generation for backward compatibility.

### Dimension Validation

Added startup validation that checks collection dimensions match the
embedding service. If a mismatch is detected, the server fails fast with a
clear error message explaining:
- Expected vs actual dimensions
- Likely cause (model change)
- Solutions (delete collection, use different name, or revert model)

### Improved Sampling Error Handling

Enhanced MCP sampling rejection handling to treat user rejections as normal
behavior rather than errors:

- **User rejections** ("rejected", "denied") → INFO log, no traceback
- **Unsupported clients** → INFO log, no traceback
- **Other MCP errors** → WARNING log, no traceback
- **Unexpected errors** → ERROR log WITH traceback

This aligns with the MCP specification where clients SHOULD prompt users for
approval/denial of sampling requests.

## Changes

### Core Implementation

- **nextcloud_mcp_server/config.py**: Added \`get_collection_name()\` method
  with deployment ID detection and model name sanitization
- **nextcloud_mcp_server/vector/qdrant_client.py**: Dimension validation on
  collection open with helpful error messages
- **nextcloud_mcp_server/vector/{scanner,processor}.py**: Updated to use
  \`get_collection_name()\`
- **nextcloud_mcp_server/auth/userinfo_routes.py**: Vector sync status uses
  \`get_collection_name()\`
- **nextcloud_mcp_server/server/semantic.py**:
  - Updated semantic search tools to use \`get_collection_name()\`
  - Improved sampling rejection error handling (McpError vs Exception)

### Documentation

- **docs/semantic-search-architecture.md**: New comprehensive architecture
  document (557 lines) covering background sync, semantic search flow, RAG
  implementation, and deployment modes
- **docs/configuration.md**: Added detailed "Qdrant Collection Naming"
  section with examples and multi-server deployment guidance
- **docker-compose.yml**: Added comments explaining collection naming behavior
- **README.md**: Updated semantic search descriptions to clarify
  experimental status, Notes-only support, and infrastructure requirements

## Migration Guide

**For existing single-server deployments:**

Option 1 (Recommended): Use explicit collection name for continuity
\`\`\`bash
QDRANT_COLLECTION=nextcloud_content  # Keep existing collection
\`\`\`

Option 2: Allow auto-generation and re-embed
\`\`\`bash
# Remove QDRANT_COLLECTION override
# New collection will be created based on deployment ID + model
# Requires re-embedding all documents (may take time)
\`\`\`

**For new multi-server deployments:**

Set unique OTEL service names per server:
\`\`\`bash
# Server 1
OTEL_SERVICE_NAME=mcp-prod
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
# → Collection: "mcp-prod-nomic-embed-text"

# Server 2
OTEL_SERVICE_NAME=mcp-staging
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
# → Collection: "mcp-staging-nomic-embed-text"
\`\`\`

## Benefits

 **Safe model switching**: Each model gets its own collection, preventing
   dimension mismatch errors
 **Multi-server support**: Multiple MCP servers can share one Qdrant
   instance without conflicts
 **Clear ownership**: Collection names show which deployment and model owns
   the data
 **Better error messages**: Dimension validation provides actionable
   guidance
 **Backward compatible**: Existing deployments can continue using
   \`QDRANT_COLLECTION\` override

## Testing

Validated with:
- Single-server deployments (default hostname-based naming)
- Multi-server deployments (OTEL service name-based naming)
- Model switching scenarios (dimension validation)
- Collection override scenarios (backward compatibility)

Next steps: Testing various Ollama embedding models to investigate optimal
chunk sizes and performance characteristics.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
Chris Coutinho
2025-11-10 01:18:30 +01:00
parent 4a6c60113b
commit e575c8e57b
10 changed files with 1361 additions and 339 deletions
+164 -32
View File
@@ -68,17 +68,25 @@ def configure_semantic_tools(mcp: FastMCP):
client = await get_client(ctx)
username = client.username
logger.info(
f"Semantic search: query='{query}', user={username}, "
f"limit={limit}, score_threshold={score_threshold}"
)
try:
# Generate embedding for query
embedding_service = get_embedding_service()
query_embedding = await embedding_service.embed(query)
logger.debug(
f"Generated embedding for query (dimension={len(query_embedding)})"
)
# Search Qdrant with user filtering
# Note: Currently only searching notes (doc_type="note")
# Future: Remove doc_type filter to search all apps
qdrant_client = await get_qdrant_client()
search_response = await qdrant_client.query_points(
collection_name=settings.qdrant_collection,
collection_name=settings.get_collection_name(),
query=query_embedding,
query_filter=Filter(
must=[
@@ -98,6 +106,15 @@ def configure_semantic_tools(mcp: FastMCP):
with_vectors=False, # Don't return vectors to save bandwidth
)
logger.info(
f"Qdrant returned {len(search_response.points)} results "
f"(before deduplication and access verification)"
)
if search_response.points:
# Log top 3 scores to help with threshold tuning
top_scores = [p.score for p in search_response.points[:3]]
logger.debug(f"Top 3 similarity scores: {top_scores}")
# Deduplicate by document ID (multiple chunks per document)
seen_doc_ids = set()
results = []
@@ -137,9 +154,14 @@ def configure_semantic_tools(mcp: FastMCP):
except HTTPStatusError as e:
if e.response.status_code == 403:
# User lost access, skip this document
logger.debug(f"Skipping note {doc_id}: access denied (403)")
continue
elif e.response.status_code == 404:
# Document was deleted but not yet removed from vector DB
logger.debug(
f"Skipping note {doc_id}: not found (404), "
f"likely deleted after indexing"
)
continue
else:
# Log other errors but continue processing
@@ -148,6 +170,16 @@ def configure_semantic_tools(mcp: FastMCP):
)
continue
logger.info(
f"Returning {len(results)} results after deduplication and access verification"
)
if results:
result_details = [
f"note_{r.id} (score={r.score:.3f}, title='{r.title}')"
for r in results[:5] # Show top 5
]
logger.debug(f"Top results: {', '.join(result_details)}")
return SemanticSearchResponse(
results=results,
query=query,
@@ -259,7 +291,47 @@ def configure_semantic_tools(mcp: FastMCP):
success=True,
)
# 3. Construct context from retrieved documents
# 3. Check if client supports sampling
from mcp.types import ClientCapabilities, SamplingCapability
client_has_sampling = ctx.session.check_client_capability(
ClientCapabilities(sampling=SamplingCapability())
)
# Log capability check result for debugging
logger.info(
f"Sampling capability check: client_has_sampling={client_has_sampling}, "
f"query='{query}'"
)
if hasattr(ctx.session, "_client_params") and ctx.session._client_params:
client_caps = ctx.session._client_params.capabilities
logger.debug(
f"Client advertised capabilities: "
f"roots={client_caps.roots is not None}, "
f"sampling={client_caps.sampling is not None}, "
f"experimental={client_caps.experimental is not None}"
)
if not client_has_sampling:
logger.info(
f"Client does not support sampling (query: '{query}'), "
f"returning {len(search_response.results)} documents"
)
return SamplingSearchResponse(
query=query,
generated_answer=(
f"[Sampling not supported by client]\n\n"
f"Your MCP client doesn't support answer generation. "
f"Found {search_response.total_found} relevant documents. "
f"Please review the sources below."
),
sources=search_response.results,
total_found=search_response.total_found,
search_method="semantic_sampling_unsupported",
success=True,
)
# 4. Construct context from retrieved documents
context_parts = []
for idx, result in enumerate(search_response.results, 1):
context_parts.append(
@@ -273,7 +345,7 @@ def configure_semantic_tools(mcp: FastMCP):
context = "\n".join(context_parts)
# 4. Construct prompt - reuse user's query, add context and instructions
# 5. Construct prompt - reuse user's query, add context and instructions
prompt = (
f"{query}\n\n"
f"Here are relevant documents from Nextcloud (notes, calendar events, deck cards, files, contacts):\n\n"
@@ -282,31 +354,35 @@ def configure_semantic_tools(mcp: FastMCP):
f"Cite the document numbers when referencing specific information."
)
logger.debug(
f"Requesting sampling for query: {query} "
f"({len(search_response.results)} documents retrieved)"
logger.info(
f"Initiating sampling request: query_length={len(query)}, "
f"documents={len(search_response.results)}, "
f"prompt_length={len(prompt)}, max_tokens={max_answer_tokens}"
)
# 5. Request LLM completion via MCP sampling
try:
sampling_result = await ctx.session.create_message(
messages=[
SamplingMessage(
role="user",
content=TextContent(type="text", text=prompt),
)
],
max_tokens=max_answer_tokens,
temperature=0.7,
model_preferences=ModelPreferences(
hints=[ModelHint(name="claude-3-5-sonnet")],
intelligencePriority=0.8,
speedPriority=0.5,
),
include_context="thisServer",
)
# 6. Request LLM completion via MCP sampling with timeout
import anyio
# 6. Extract answer from sampling response
try:
with anyio.fail_after(30):
sampling_result = await ctx.session.create_message(
messages=[
SamplingMessage(
role="user",
content=TextContent(type="text", text=prompt),
)
],
max_tokens=max_answer_tokens,
temperature=0.7,
model_preferences=ModelPreferences(
hints=[ModelHint(name="claude-3-5-sonnet")],
intelligencePriority=0.8,
speedPriority=0.5,
),
include_context="thisServer",
)
# 7. Extract answer from sampling response
if sampling_result.content.type == "text":
generated_answer = sampling_result.content.text
else:
@@ -318,7 +394,8 @@ def configure_semantic_tools(mcp: FastMCP):
logger.info(
f"Sampling successful: model={sampling_result.model}, "
f"stop_reason={sampling_result.stopReason}"
f"stop_reason={sampling_result.stopReason}, "
f"answer_length={len(generated_answer)}"
)
return SamplingSearchResponse(
@@ -332,23 +409,78 @@ def configure_semantic_tools(mcp: FastMCP):
success=True,
)
except Exception as e:
# Fallback: Return documents without generated answer
except TimeoutError:
logger.warning(
f"Sampling failed ({type(e).__name__}: {e}), "
f"Sampling request timed out after 30 seconds for query: '{query}', "
f"returning search results only"
)
return SamplingSearchResponse(
query=query,
generated_answer=(
f"[Sampling request timed out]\n\n"
f"The answer generation took too long (>30s). "
f"Found {search_response.total_found} relevant documents. "
f"Please review the sources below or try a simpler query."
),
sources=search_response.results,
total_found=search_response.total_found,
search_method="semantic_sampling_timeout",
success=True,
)
except McpError as e:
# Expected MCP protocol errors (user rejection, unsupported, etc.)
error_msg = str(e)
if "rejected" in error_msg.lower() or "denied" in error_msg.lower():
# User explicitly declined - this is normal, not an error
logger.info(f"User declined sampling request for query: '{query}'")
search_method = "semantic_sampling_user_declined"
user_message = "User declined to generate an answer"
elif "not supported" in error_msg.lower():
# Client doesn't support sampling - also normal
logger.info(f"Sampling not supported by client for query: '{query}'")
search_method = "semantic_sampling_unsupported"
user_message = "Sampling not supported by this client"
else:
# Other MCP protocol errors
logger.warning(
f"MCP error during sampling for query '{query}': {error_msg}"
)
search_method = "semantic_sampling_mcp_error"
user_message = f"Sampling unavailable: {error_msg}"
return SamplingSearchResponse(
query=query,
generated_answer=(
f"[Sampling unavailable: {str(e)}]\n\n"
f"[{user_message}]\n\n"
f"Found {search_response.total_found} relevant documents. "
f"Please review the sources below."
),
sources=search_response.results,
total_found=search_response.total_found,
search_method="semantic_sampling_fallback",
search_method=search_method,
success=True,
)
except Exception as e:
# Truly unexpected errors - these SHOULD have tracebacks
logger.error(
f"Unexpected error during sampling for query '{query}': "
f"{type(e).__name__}: {e}",
exc_info=True,
)
return SamplingSearchResponse(
query=query,
generated_answer=(
f"[Unexpected error during sampling]\n\n"
f"Found {search_response.total_found} relevant documents. "
f"Please review the sources below."
),
sources=search_response.results,
total_found=search_response.total_found,
search_method="semantic_sampling_error",
success=True,
)
@@ -413,7 +545,7 @@ def configure_semantic_tools(mcp: FastMCP):
# Count documents in collection
count_result = await qdrant_client.count(
collection_name=settings.qdrant_collection
collection_name=settings.get_collection_name()
)
indexed_count = count_result.count