feat(vector): Support multiple embedding models with auto-generated collection names

This PR enables safe switching between embedding models and multi-server deployments by implementing auto-generated Qdrant collection names based on deployment ID and model name.

## Problem

Previously, all deployments used a single hardcoded collection name, "nextcloud_content", which caused two critical issues:

1. **Dimension mismatches when switching models**: Changing OLLAMA_EMBEDDING_MODEL (e.g., nomic-embed-text at 768D → all-minilm at 384D) would cause runtime errors, because vectors couldn't be inserted into a collection with incompatible dimensions.
2. **Collection collisions in multi-server setups**: Multiple MCP servers sharing a single Qdrant instance would overwrite each other's data, making horizontal scaling impossible.

## Solution

### Auto-Generated Collection Naming

Collections are now automatically named using the pattern `{deployment-id}-{model-name}`.

**Deployment ID**: Uses `OTEL_SERVICE_NAME` if configured (and not the default value); otherwise falls back to `hostname` for simple Docker deployments.

**Model Name**: From `OLLAMA_EMBEDDING_MODEL`, with path separators sanitized.

**Examples**:

- `my-mcp-server-nomic-embed-text` (with OTEL_SERVICE_NAME=my-mcp-server)
- `mcp-container-all-minilm` (simple Docker, hostname=mcp-container)

**Override**: Users can still set `QDRANT_COLLECTION` explicitly to bypass auto-generation for backward compatibility.

### Dimension Validation

Added startup validation that checks that collection dimensions match the embedding service. If a mismatch is detected, the server fails fast with a clear error message explaining:

- Expected vs actual dimensions
- Likely cause (model change)
- Solutions (delete collection, use different name, or revert model)

### Improved Sampling Error Handling

Enhanced MCP sampling rejection handling to treat user rejections as normal behavior rather than errors:

- **User rejections** ("rejected", "denied") → INFO log, no traceback
- **Unsupported clients** → INFO log, no traceback
- **Other MCP errors** → WARNING log, no traceback
- **Unexpected errors** → ERROR log WITH traceback

This aligns with the MCP specification, where clients SHOULD prompt users for approval/denial of sampling requests.

## Changes

### Core Implementation

- **nextcloud_mcp_server/config.py**: Added `get_collection_name()` method with deployment ID detection and model name sanitization
- **nextcloud_mcp_server/vector/qdrant_client.py**: Dimension validation on collection open with helpful error messages
- **nextcloud_mcp_server/vector/{scanner,processor}.py**: Updated to use `get_collection_name()`
- **nextcloud_mcp_server/auth/userinfo_routes.py**: Vector sync status uses `get_collection_name()`
- **nextcloud_mcp_server/server/semantic.py**:
  - Updated semantic search tools to use `get_collection_name()`
  - Improved sampling rejection error handling (McpError vs Exception)

### Documentation

- **docs/semantic-search-architecture.md**: New comprehensive architecture document (557 lines) covering background sync, semantic search flow, RAG implementation, and deployment modes
- **docs/configuration.md**: Added detailed "Qdrant Collection Naming" section with examples and multi-server deployment guidance
- **docker-compose.yml**: Added comments explaining collection naming behavior
- **README.md**: Updated semantic search descriptions to clarify experimental status, Notes-only support, and infrastructure requirements

## Migration Guide

**For existing single-server deployments:**

Option 1 (Recommended): Use explicit collection name for continuity

```bash
QDRANT_COLLECTION=nextcloud_content  # Keep existing collection
```

Option 2: Allow auto-generation and re-embed

```bash
# Remove QDRANT_COLLECTION override
# New collection will be created based on deployment ID + model
# Requires re-embedding all documents (may take time)
```

**For new multi-server deployments:**

Set unique OTEL service names per server:

```bash
# Server 1
OTEL_SERVICE_NAME=mcp-prod
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
# → Collection: "mcp-prod-nomic-embed-text"

# Server 2
OTEL_SERVICE_NAME=mcp-staging
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
# → Collection: "mcp-staging-nomic-embed-text"
```

## Benefits

✅ **Safe model switching**: Each model gets its own collection, preventing dimension mismatch errors
✅ **Multi-server support**: Multiple MCP servers can share one Qdrant instance without conflicts
✅ **Clear ownership**: Collection names show which deployment and model owns the data
✅ **Better error messages**: Dimension validation provides actionable guidance
✅ **Backward compatible**: Existing deployments can continue using `QDRANT_COLLECTION` override

## Testing

Validated with:

- Single-server deployments (default hostname-based naming)
- Multi-server deployments (OTEL service name-based naming)
- Model switching scenarios (dimension validation)
- Collection override scenarios (backward compatibility)

Next steps: Testing various Ollama embedding models to investigate optimal chunk sizes and performance characteristics.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
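
The naming rule described in the commit message (explicit override first, then deployment ID plus sanitized model name) can be sketched roughly as follows. This is an illustration only, not the code added to `config.py`: the function name `sketch_collection_name`, the `env` and `hostname` parameters, the assumed default service name, the assumed default model, and the choice to replace `/` and `\` with `-` are all assumptions, since the commit view does not show the real `get_collection_name()` body.

```python
import socket
from typing import Mapping, Optional

# Assumption: the project's actual default OTEL service name is not shown here.
ASSUMED_DEFAULT_SERVICE_NAME = "nextcloud-mcp-server"
# Assumption: fallback model used when OLLAMA_EMBEDDING_MODEL is unset.
ASSUMED_DEFAULT_MODEL = "nomic-embed-text"


def sketch_collection_name(
    env: Mapping[str, str], hostname: Optional[str] = None
) -> str:
    """Hypothetical stand-in for get_collection_name(); not the project's code."""
    # An explicit QDRANT_COLLECTION always wins (backward compatibility).
    override = env.get("QDRANT_COLLECTION", "").strip()
    if override:
        return override

    # Deployment ID: OTEL_SERVICE_NAME unless unset or left at the default,
    # otherwise the machine (or container) hostname.
    service = env.get("OTEL_SERVICE_NAME", "").strip()
    if service and service != ASSUMED_DEFAULT_SERVICE_NAME:
        deployment_id = service
    else:
        deployment_id = hostname or socket.gethostname()

    # Model name: replace path separators so "org/model" becomes a flat name.
    model = env.get("OLLAMA_EMBEDDING_MODEL", ASSUMED_DEFAULT_MODEL)
    model_name = model.replace("/", "-").replace("\\", "-")

    return f"{deployment_id}-{model_name}"


if __name__ == "__main__":
    print(
        sketch_collection_name(
            {
                "OTEL_SERVICE_NAME": "my-mcp-server",
                "OLLAMA_EMBEDDING_MODEL": "nomic-embed-text",
            }
        )
    )  # prints "my-mcp-server-nomic-embed-text"
```

Taking the environment as a plain mapping, rather than reading `os.environ` directly, keeps the sketch easy to test; passing `os.environ` gives the behavior a running server would see.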
2025-11-10 01:18:30 +01:00
parent 4a6c60113b
commit e575c8e57b
10 changed files with 1361 additions and 339 deletions
@@ -68,17 +68,25 @@ def configure_semantic_tools(mcp: FastMCP):
        client = await get_client(ctx)
        username = client.username

+        logger.info(
+            f"Semantic search: query='{query}', user={username}, "
+            f"limit={limit}, score_threshold={score_threshold}"
+        )
+
        try:
            # Generate embedding for query
            embedding_service = get_embedding_service()
            query_embedding = await embedding_service.embed(query)
+            logger.debug(
+                f"Generated embedding for query (dimension={len(query_embedding)})"
+            )

            # Search Qdrant with user filtering
            # Note: Currently only searching notes (doc_type="note")
            # Future: Remove doc_type filter to search all apps
            qdrant_client = await get_qdrant_client()
            search_response = await qdrant_client.query_points(
-                collection_name=settings.qdrant_collection,
+                collection_name=settings.get_collection_name(),
                query=query_embedding,
                query_filter=Filter(
                    must=[
@@ -98,6 +106,15 @@ def configure_semantic_tools(mcp: FastMCP):
                with_vectors=False,  # Don't return vectors to save bandwidth
            )

+            logger.info(
+                f"Qdrant returned {len(search_response.points)} results "
+                f"(before deduplication and access verification)"
+            )
+            if search_response.points:
+                # Log top 3 scores to help with threshold tuning
+                top_scores = [p.score for p in search_response.points[:3]]
+                logger.debug(f"Top 3 similarity scores: {top_scores}")
+
            # Deduplicate by document ID (multiple chunks per document)
            seen_doc_ids = set()
            results = []
@@ -137,9 +154,14 @@ def configure_semantic_tools(mcp: FastMCP):
                    except HTTPStatusError as e:
                        if e.response.status_code == 403:
                            # User lost access, skip this document
+                            logger.debug(f"Skipping note {doc_id}: access denied (403)")
                            continue
                        elif e.response.status_code == 404:
                            # Document was deleted but not yet removed from vector DB
+                            logger.debug(
+                                f"Skipping note {doc_id}: not found (404), "
+                                f"likely deleted after indexing"
+                            )
                            continue
                        else:
                            # Log other errors but continue processing
@@ -148,6 +170,16 @@ def configure_semantic_tools(mcp: FastMCP):
                            )
                            continue

+            logger.info(
+                f"Returning {len(results)} results after deduplication and access verification"
+            )
+            if results:
+                result_details = [
+                    f"note_{r.id} (score={r.score:.3f}, title='{r.title}')"
+                    for r in results[:5]  # Show top 5
+                ]
+                logger.debug(f"Top results: {', '.join(result_details)}")
+
            return SemanticSearchResponse(
                results=results,
                query=query,
@@ -259,7 +291,47 @@ def configure_semantic_tools(mcp: FastMCP):
                success=True,
            )

-        # 3. Construct context from retrieved documents
+        # 3. Check if client supports sampling
+        from mcp.types import ClientCapabilities, SamplingCapability
+
+        client_has_sampling = ctx.session.check_client_capability(
+            ClientCapabilities(sampling=SamplingCapability())
+        )
+
+        # Log capability check result for debugging
+        logger.info(
+            f"Sampling capability check: client_has_sampling={client_has_sampling}, "
+            f"query='{query}'"
+        )
+        if hasattr(ctx.session, "_client_params") and ctx.session._client_params:
+            client_caps = ctx.session._client_params.capabilities
+            logger.debug(
+                f"Client advertised capabilities: "
+                f"roots={client_caps.roots is not None}, "
+                f"sampling={client_caps.sampling is not None}, "
+                f"experimental={client_caps.experimental is not None}"
+            )
+
+        if not client_has_sampling:
+            logger.info(
+                f"Client does not support sampling (query: '{query}'), "
+                f"returning {len(search_response.results)} documents"
+            )
+            return SamplingSearchResponse(
+                query=query,
+                generated_answer=(
+                    f"[Sampling not supported by client]\n\n"
+                    f"Your MCP client doesn't support answer generation. "
+                    f"Found {search_response.total_found} relevant documents. "
+                    f"Please review the sources below."
+                ),
+                sources=search_response.results,
+                total_found=search_response.total_found,
+                search_method="semantic_sampling_unsupported",
+                success=True,
+            )
+
+        # 4. Construct context from retrieved documents
        context_parts = []
        for idx, result in enumerate(search_response.results, 1):
            context_parts.append(
@@ -273,7 +345,7 @@ def configure_semantic_tools(mcp: FastMCP):

        context = "\n".join(context_parts)

-        # 4. Construct prompt - reuse user's query, add context and instructions
+        # 5. Construct prompt - reuse user's query, add context and instructions
        prompt = (
            f"{query}\n\n"
            f"Here are relevant documents from Nextcloud (notes, calendar events, deck cards, files, contacts):\n\n"
@@ -282,31 +354,35 @@ def configure_semantic_tools(mcp: FastMCP):
            f"Cite the document numbers when referencing specific information."
        )

-        logger.debug(
-            f"Requesting sampling for query: {query} "
-            f"({len(search_response.results)} documents retrieved)"
+        logger.info(
+            f"Initiating sampling request: query_length={len(query)}, "
+            f"documents={len(search_response.results)}, "
+            f"prompt_length={len(prompt)}, max_tokens={max_answer_tokens}"
        )

-        # 5. Request LLM completion via MCP sampling
-        try:
-            sampling_result = await ctx.session.create_message(
-                messages=[
-                    SamplingMessage(
-                        role="user",
-                        content=TextContent(type="text", text=prompt),
-                    )
-                ],
-                max_tokens=max_answer_tokens,
-                temperature=0.7,
-                model_preferences=ModelPreferences(
-                    hints=[ModelHint(name="claude-3-5-sonnet")],
-                    intelligencePriority=0.8,
-                    speedPriority=0.5,
-                ),
-                include_context="thisServer",
-            )
+        # 6. Request LLM completion via MCP sampling with timeout
+        import anyio

-            # 6. Extract answer from sampling response
+        try:
+            with anyio.fail_after(30):
+                sampling_result = await ctx.session.create_message(
+                    messages=[
+                        SamplingMessage(
+                            role="user",
+                            content=TextContent(type="text", text=prompt),
+                        )
+                    ],
+                    max_tokens=max_answer_tokens,
+                    temperature=0.7,
+                    model_preferences=ModelPreferences(
+                        hints=[ModelHint(name="claude-3-5-sonnet")],
+                        intelligencePriority=0.8,
+                        speedPriority=0.5,
+                    ),
+                    include_context="thisServer",
+                )
+
+            # 7. Extract answer from sampling response
            if sampling_result.content.type == "text":
                generated_answer = sampling_result.content.text
            else:
@@ -318,7 +394,8 @@ def configure_semantic_tools(mcp: FastMCP):

            logger.info(
                f"Sampling successful: model={sampling_result.model}, "
-                f"stop_reason={sampling_result.stopReason}"
+                f"stop_reason={sampling_result.stopReason}, "
+                f"answer_length={len(generated_answer)}"
            )

            return SamplingSearchResponse(
@@ -332,23 +409,78 @@ def configure_semantic_tools(mcp: FastMCP):
                success=True,
            )

-        except Exception as e:
-            # Fallback: Return documents without generated answer
+        except TimeoutError:
            logger.warning(
-                f"Sampling failed ({type(e).__name__}: {e}), "
+                f"Sampling request timed out after 30 seconds for query: '{query}', "
                f"returning search results only"
            )
+            return SamplingSearchResponse(
+                query=query,
+                generated_answer=(
+                    f"[Sampling request timed out]\n\n"
+                    f"The answer generation took too long (>30s). "
+                    f"Found {search_response.total_found} relevant documents. "
+                    f"Please review the sources below or try a simpler query."
+                ),
+                sources=search_response.results,
+                total_found=search_response.total_found,
+                search_method="semantic_sampling_timeout",
+                success=True,
+            )
+
+        except McpError as e:
+            # Expected MCP protocol errors (user rejection, unsupported, etc.)
+            error_msg = str(e)
+
+            if "rejected" in error_msg.lower() or "denied" in error_msg.lower():
+                # User explicitly declined - this is normal, not an error
+                logger.info(f"User declined sampling request for query: '{query}'")
+                search_method = "semantic_sampling_user_declined"
+                user_message = "User declined to generate an answer"
+            elif "not supported" in error_msg.lower():
+                # Client doesn't support sampling - also normal
+                logger.info(f"Sampling not supported by client for query: '{query}'")
+                search_method = "semantic_sampling_unsupported"
+                user_message = "Sampling not supported by this client"
+            else:
+                # Other MCP protocol errors
+                logger.warning(
+                    f"MCP error during sampling for query '{query}': {error_msg}"
+                )
+                search_method = "semantic_sampling_mcp_error"
+                user_message = f"Sampling unavailable: {error_msg}"

            return SamplingSearchResponse(
                query=query,
                generated_answer=(
-                    f"[Sampling unavailable: {str(e)}]\n\n"
+                    f"[{user_message}]\n\n"
                    f"Found {search_response.total_found} relevant documents. "
                    f"Please review the sources below."
                ),
                sources=search_response.results,
                total_found=search_response.total_found,
-                search_method="semantic_sampling_fallback",
+                search_method=search_method,
+                success=True,
+            )
+
+        except Exception as e:
+            # Truly unexpected errors - these SHOULD have tracebacks
+            logger.error(
+                f"Unexpected error during sampling for query '{query}': "
+                f"{type(e).__name__}: {e}",
+                exc_info=True,
+            )
+
+            return SamplingSearchResponse(
+                query=query,
+                generated_answer=(
+                    f"[Unexpected error during sampling]\n\n"
+                    f"Found {search_response.total_found} relevant documents. "
+                    f"Please review the sources below."
+                ),
+                sources=search_response.results,
+                total_found=search_response.total_found,
+                search_method="semantic_sampling_error",
                success=True,
            )

@@ -413,7 +545,7 @@ def configure_semantic_tools(mcp: FastMCP):

                # Count documents in collection
                count_result = await qdrant_client.count(
-                    collection_name=settings.qdrant_collection
+                    collection_name=settings.get_collection_name()
                )
                indexed_count = count_result.count