+
Algorithm
- Semantic (Dense Vectors)
- BM25 Hybrid (Dense + Sparse)
+ Semantic (Dense)
+ BM25 Hybrid
-
-
Fusion Method
+
+ Fusion
- RRF (Reciprocal Rank Fusion)
- DBSF (Distribution-Based Score Fusion)
+ RRF
+ DBSF
-
-
Search & Visualize
+
+
+ Search
-
-
-
Advanced Options
-
-
+
+
-
Document Types
-
-
-
-
-
-
- BM25 Hybrid Search: Combines dense semantic vectors with sparse BM25 keyword vectors.
-
-
- RRF: Reciprocal Rank Fusion - Rank-based fusion producing scores in [0.0, 1.0]
-
-
- DBSF: Distribution-Based Score Fusion - Sums normalized scores (can exceed 1.0)
-
-
-
-
-
-
-
-
-
- Executing search and computing PCA projection...
-
-
+
-
-
-
Search Results ( )
+
+
+
+
+ Executing search and computing PCA projection...
+
+
+
+
+
+
+
+
Search Results ( )
Loading results...
@@ -335,5 +160,6 @@
-
-
+
+
+
diff --git a/nextcloud_mcp_server/auth/templates/welcome.html b/nextcloud_mcp_server/auth/templates/welcome.html
new file mode 100644
index 0000000..85a3f52
--- /dev/null
+++ b/nextcloud_mcp_server/auth/templates/welcome.html
@@ -0,0 +1,392 @@
+{% extends "base.html" %}
+
+{% block title %}Welcome - Nextcloud MCP Server{% endblock %}
+
+{% block extra_head %}
+
+
+{% endblock %}
+
+{% block extra_styles %}
+ /* Welcome page specific styles */
+ .hero-section {
+ background: linear-gradient(135deg, var(--color-primary-element) 0%, #0082c9 100%);
+ color: white;
+ padding: 60px 24px;
+ margin: -24px -24px 40px -24px;
+ border-radius: 0 0 var(--border-radius-large) var(--border-radius-large);
+ text-align: center;
+ }
+
+ .hero-section h1 {
+ color: white;
+ font-size: 36px;
+ margin: 0 0 16px 0;
+ font-weight: 600;
+ }
+
+ .hero-section p {
+ font-size: 18px;
+ opacity: 0.95;
+ max-width: 700px;
+ margin: 0 auto;
+ line-height: 1.6;
+ }
+
+ .feature-grid {
+ display: grid;
+ grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
+ gap: 24px;
+ margin: 32px 0;
+ }
+
+ .feature-card {
+ background: var(--color-main-background);
+ border: 2px solid var(--color-border);
+ border-radius: var(--border-radius-large);
+ padding: 24px;
+ transition: all 0.2s;
+ cursor: pointer;
+ text-decoration: none;
+ color: inherit;
+ display: block;
+ }
+
+ .feature-card:hover {
+ border-color: var(--color-primary-element);
+ box-shadow: 0 4px 12px rgba(0, 103, 158, 0.15);
+ transform: translateY(-2px);
+ }
+
+ .feature-card h3 {
+ color: var(--color-primary-element);
+ font-size: 20px;
+ margin: 12px 0 8px 0;
+ font-weight: 600;
+ display: flex;
+ align-items: center;
+ gap: 12px;
+ }
+
+ .feature-card p {
+ color: var(--color-text-maxcontrast);
+ font-size: 14px;
+ line-height: 1.6;
+ margin: 8px 0 0 0;
+ }
+
+ .feature-icon {
+ width: 48px;
+ height: 48px;
+ background: var(--color-primary-element-light);
+ border-radius: var(--border-radius);
+ display: flex;
+ align-items: center;
+ justify-content: center;
+ margin-bottom: 8px;
+ }
+
+ .feature-icon svg {
+ width: 28px;
+ height: 28px;
+ fill: var(--color-primary-element);
+ }
+
+ .info-section {
+ background: var(--color-background-hover);
+ border-radius: var(--border-radius-large);
+ padding: 32px;
+ margin: 32px 0;
+ }
+
+ .info-section h2 {
+ color: var(--color-main-text);
+ font-size: 24px;
+ margin: 0 0 16px 0;
+ border: none;
+ padding: 0;
+ }
+
+ .info-section p {
+ color: var(--color-text-maxcontrast);
+ line-height: 1.7;
+ margin: 12px 0;
+ }
+
+ .info-section ul {
+ margin: 12px 0;
+ padding-left: 24px;
+ }
+
+ .info-section li {
+ color: var(--color-text-maxcontrast);
+ line-height: 1.7;
+ margin: 8px 0;
+ }
+
+ .info-section code {
+ background: var(--color-main-background);
+ padding: 2px 8px;
+ border-radius: var(--border-radius);
+ font-size: 13px;
+ }
+
+ .auth-status {
+ background: var(--color-primary-element-light);
+ border-left: 4px solid var(--color-primary-element);
+ padding: 16px 20px;
+ margin: 24px 0;
+ border-radius: var(--border-radius);
+ display: flex;
+ align-items: center;
+ gap: 12px;
+ }
+
+ .auth-status svg {
+ width: 24px;
+ height: 24px;
+ fill: var(--color-primary-element);
+ flex-shrink: 0;
+ }
+
+ .auth-status-text {
+ flex: 1;
+ }
+
+ .auth-status-text strong {
+ display: block;
+ color: var(--color-main-text);
+ font-size: 14px;
+ margin-bottom: 4px;
+ }
+
+ .auth-status-text span {
+ color: var(--color-text-maxcontrast);
+ font-size: 13px;
+ }
+{% endblock %}
+
+{% block content %}
+
+
+
+
+
+
+
Welcome to Nextcloud MCP Server
+
+ Interactive user interface for semantic search and document retrieval.
+ Test queries, visualize results, and explore your Nextcloud content using RAG workflows.
+
+
+
+
+
+
+
+
+
+ Authenticated as: {{ username }}
+ Authentication mode: {{ auth_mode }}
+
+
+
+ {% if vector_sync_enabled %}
+
+
+
About Semantic Search
+
+ This interface provides access to semantic search capabilities powered by vector embeddings.
+ Unlike traditional keyword search, semantic search understands the meaning of your queries and finds
+ conceptually similar content across your Nextcloud apps.
+
+
+ How it works:
+
+
+ Documents from Notes, Calendar, Files, Contacts, and Deck are indexed into a vector database
+ Each document chunk is converted to a 768-dimensional vector embedding that captures semantic meaning
+ Queries are also converted to embeddings and matched against document vectors using similarity search
+ Results can be retrieved using pure semantic search or hybrid BM25 search combining keywords and semantics
+
+
+
+
+
RAG Workflow Integration
+
+ This UI allows you to test the same queries that Large Language Models (LLMs) would use in a
+ Retrieval-Augmented Generation (RAG) workflow. When an AI assistant needs to answer questions about your data:
+
+
+ Step 1: The assistant converts your question into a search query
+ Step 2: The MCP server retrieves relevant document chunks using semantic search
+ Step 3: Retrieved context is passed to the LLM to generate an informed answer
+
+
+
+
+
+ MCP Sampling RAG Workflow
+
+
+┌─────────────────┐
+│ MCP Client │ User asks: "What are health benefits of coffee?"
+│ (Claude Code) │
+└────────┬────────┘
+ │ (1) User question
+ ↓
+┌────────────────────────────────────────────────────────────────────────┐
+│ Nextcloud MCP Server │
+│ ┌──────────────────────────────────────────────────────────────────┐ │
+│ │ nc_semantic_search_answer Tool (MCP Sampling-enabled) │ │
+│ │ │ │
+│ │ (2) Semantic Search │ │
+│ │ ┌────────────────────────────────────────────────────────┐ │ │
+│ │ │ Query: "health benefits of coffee" │ │ │
+│ │ │ → Convert to 768D vector embedding │ │ │
+│ │ │ → Search Qdrant (BM25 Hybrid + RRF fusion) │ │ │
+│ │ │ → Retrieve top 5 relevant document chunks │ │ │
+│ │ └────────────────────────────────────────────────────────┘ │ │
+│ │ │ │
+│ │ (3) Construct Prompt with Context │ │
+│ │ ┌────────────────────────────────────────────────────────┐ │ │
+│ │ │ "What are health benefits of coffee? │ │ │
+│ │ │ │ │ │
+│ │ │ Documents: │ │ │
+│ │ │ - [MED-2155] Effects of habitual coffee consumption...│ │ │
+│ │ │ - [MED-1646] Beverage consumption guidance... │ │ │
+│ │ │ - [MED-1627] Coffee and depression risk... │ │ │
+│ │ │ ... │ │ │
+│ │ │ │ │ │
+│ │ │ Provide answer with citations." │ │ │
+│ │ └────────────────────────────────────────────────────────┘ │ │
+│ │ │ │
+│ │ (4) MCP Sampling Request │ │
+│ │ ─────────────────────────────────────────────────────────────> │ │
+│ └──────────────────────────────────────────────────────────────────┘ │
+└────────────────────────────────────────────────────────────────────────┘
+ │
+ │ Sampling request with prompt + context
+ ↓
+┌─────────────────┐
+│ MCP Client │ (5) Client's LLM generates answer using retrieved context
+│ (Claude) │ → "Coffee consumption (2-3 cups/day) is associated with
+└────────┬────────┘ reduced risk of type 2 diabetes, cardiovascular disease,
+ │ and improved liver health (Document 1, 2)..."
+ │
+ │ (6) Answer with citations
+ ↓
+┌─────────────────┐
+│ User │ Receives comprehensive answer with source citations
+└─────────────────┘
+
+
+
+ Key Point: The MCP server retrieves context but doesn't generate answers itself.
+ Through MCP sampling, it requests the client's LLM to generate responses, giving users
+ full control over which model is used and keeping answer generation on the client side.
+
+
+
+ By using this interface, you can preview search results, understand relevance scores, and verify
+ that the system retrieves the right information before it reaches the LLM.
+
+
+
+
+
Available Features
+
+
+ {% else %}
+
+
+
Vector Sync is Disabled
+
+ Semantic search and vector visualization features are currently disabled.
+ To enable these features, set VECTOR_SYNC_ENABLED=true in your environment configuration.
+
+
+ Learn more:
+
+ Configuration Guide
+
+
+
+
+
+
Available Features
+
+ {% endif %}
+
+
+
+
Documentation
+
+ For detailed information about configuration, authentication modes, and advanced features,
+ please refer to the project documentation:
+
+
+
+
+
+
+{% endblock %}
diff --git a/nextcloud_mcp_server/auth/userinfo_routes.py b/nextcloud_mcp_server/auth/userinfo_routes.py
index d57806c..4a015da 100644
--- a/nextcloud_mcp_server/auth/userinfo_routes.py
+++ b/nextcloud_mcp_server/auth/userinfo_routes.py
@@ -9,15 +9,21 @@ For OAuth mode: Requires browser-based OAuth login to establish session.
import logging
import os
+from pathlib import Path
from typing import Any
import httpx
+from jinja2 import Environment, FileSystemLoader
from starlette.authentication import requires
from starlette.requests import Request
from starlette.responses import HTMLResponse, JSONResponse
logger = logging.getLogger(__name__)
+# Set up Jinja2 environment for templates
+_template_dir = Path(__file__).parent / "templates"
+_jinja_env = Environment(loader=FileSystemLoader(_template_dir))
+
async def _get_authenticated_client_for_userinfo(request: Request) -> httpx.AsyncClient:
"""Get an authenticated HTTP client for user info page operations.
@@ -431,51 +437,14 @@ async def user_info_html(request: Request) -> HTMLResponse:
oauth_ctx = getattr(request.app.state, "oauth_context", None)
login_url = str(request.url_for("oauth_login")) if oauth_ctx else "/oauth/login"
- error_html = f"""
-
-
-
-
-
-
Error - Nextcloud MCP Server
-
-
-
-
-
Error Retrieving User Info
-
- Error: {user_context["error"]}
-
-
Login again
-
-
-
- """
- return HTMLResponse(content=error_html)
+ template = _jinja_env.get_template("error.html")
+ return HTMLResponse(
+ content=template.render(
+ error_title="Error Retrieving User Info",
+ error_message=user_context["error"],
+ login_url=login_url,
+ )
+ )
# Build HTML response
auth_mode = user_context.get("auth_mode", "unknown")
@@ -654,457 +623,26 @@ async def user_info_html(request: Request) -> HTMLResponse:
"""
- html_content = f"""
-
-
-
-
-
-
Nextcloud MCP Server
+ # Check if vector sync is enabled (needed for Welcome tab)
+ vector_sync_enabled = os.getenv("VECTOR_SYNC_ENABLED", "false").lower() == "true"
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Nextcloud MCP Server
-
-
-
-
- User Info
-
- {
- ""
- if not show_vector_sync_tab
- else '''
-
- Vector Sync
-
- '''
- }
- {
- ""
- if not show_vector_sync_tab
- else '''
-
- Vector Viz
-
- '''
- }
- {
- ""
- if not show_webhooks_tab
- else '''
-
- Webhooks
-
- '''
- }
-
-
-
-
-
-
- {user_info_tab_html}
-
-
- {
- ""
- if not show_vector_sync_tab
- else f'''
-
-
- {vector_sync_tab_html}
-
- '''
- }
-
- {
- ""
- if not show_vector_sync_tab
- else '''
-
-
-
-
Loading vector visualization...
-
-
- '''
- }
-
- {
- ""
- if not show_webhooks_tab
- else f'''
-
-
- {webhooks_tab_html}
-
- '''
- }
-
-
- {
- f'
'
- if auth_mode == "oauth"
- else ""
- }
-
-
-
- """
-
- return HTMLResponse(content=html_content)
+ # Render template
+ template = _jinja_env.get_template("user_info.html")
+ return HTMLResponse(
+ content=template.render(
+ user_info_tab_html=user_info_tab_html,
+ vector_sync_tab_html=vector_sync_tab_html,
+ webhooks_tab_html=webhooks_tab_html,
+ show_vector_sync_tab=show_vector_sync_tab,
+ show_webhooks_tab=show_webhooks_tab,
+ logout_url=logout_url if auth_mode == "oauth" else None,
+ nextcloud_host_for_links=nextcloud_host_for_links,
+ # Additional context for Welcome tab
+ vector_sync_enabled=vector_sync_enabled,
+ username=username,
+ auth_mode=auth_mode,
+ )
+ )
@requires("authenticated", redirect="oauth_login")
@@ -1124,17 +662,12 @@ async def revoke_session(request: Request) -> HTMLResponse:
oauth_ctx = getattr(request.app.state, "oauth_context", None)
if not oauth_ctx:
+ template = _jinja_env.get_template("error.html")
return HTMLResponse(
- """
-
-
-
Error
-
-
Error
-
OAuth mode not enabled
-
-
- """,
+ content=template.render(
+ error_title="Error",
+ error_message="OAuth mode not enabled",
+ ),
status_code=400,
)
@@ -1142,17 +675,12 @@ async def revoke_session(request: Request) -> HTMLResponse:
session_id = request.cookies.get("mcp_session")
if not storage or not session_id:
+ template = _jinja_env.get_template("error.html")
return HTMLResponse(
- """
-
-
-
Error
-
-
Error
-
Session not found
-
-
- """,
+ content=template.render(
+ error_title="Error",
+ error_message="Session not found",
+ ),
status_code=400,
)
@@ -1165,57 +693,26 @@ async def revoke_session(request: Request) -> HTMLResponse:
# Redirect back to user page
user_page_url = str(request.url_for("user_info_html"))
+ template = _jinja_env.get_template("success.html")
return HTMLResponse(
- f"""
-
-
-
-
-
-
Background Access Revoked
-
-
-
-
-
✓ Background Access Revoked
-
Your refresh token has been deleted successfully.
-
Browser session remains active.
-
Redirecting back to user page...
-
-
-
- """
+ content=template.render(
+ success_title="✓ Background Access Revoked",
+ success_messages=[
+ "Your refresh token has been deleted successfully.",
+ "Browser session remains active.",
+ ],
+ redirect_url=user_page_url,
+ redirect_delay=2,
+ )
)
except Exception as e:
logger.error(f"Failed to revoke background access: {e}")
+ template = _jinja_env.get_template("error.html")
return HTMLResponse(
- f"""
-
-
-
Error
-
-
Error
-
Failed to revoke background access: {e}
-
-
- """,
+ content=template.render(
+ error_title="Error",
+ error_message=f"Failed to revoke background access: {e}",
+ ),
status_code=500,
)
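
The template-based error rendering introduced in userinfo_routes.py can be sketched in isolation as follows. This is a minimal illustration, not the project's code: the inline `error.html` body and the `render_error` helper are hypothetical stand-ins (the real template is not shown in this diff), and `DictLoader` replaces `FileSystemLoader` only to keep the example self-contained. It also enables autoescaping, which `Environment` leaves off by default; that is a suggestion rather than what the diff does, and it would require pre-rendered fragments such as `user_info_tab_html` to be marked safe in the templates.

```python
from jinja2 import DictLoader, Environment, select_autoescape

# Hypothetical stand-in for templates/error.html (not shown in the diff).
_TEMPLATES = {
    "error.html": (
        "<h1>{{ error_title }}</h1>"
        "<p>{{ error_message }}</p>"
        '{% if login_url %}<a href="{{ login_url }}">Login again</a>{% endif %}'
    )
}

# Assumption: autoescape is switched on for .html templates so that
# error messages containing markup are escaped instead of rendered.
env = Environment(
    loader=DictLoader(_TEMPLATES),
    autoescape=select_autoescape(["html"]),
)


def render_error(error_title, error_message, login_url=None):
    """Render the error page; login_url is optional, as in revoke_session."""
    template = env.get_template("error.html")
    return template.render(
        error_title=error_title,
        error_message=error_message,
        login_url=login_url,
    )


page = render_error("Error", "Failed: <script>alert(1)</script>")
```

Because exception text is interpolated into `error_message` in `revoke_session`, escaping at the template layer avoids echoing raw markup back to the browser.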
diff --git a/nextcloud_mcp_server/auth/viz_routes.py b/nextcloud_mcp_server/auth/viz_routes.py
index d6776f4..3497084 100644
--- a/nextcloud_mcp_server/auth/viz_routes.py
+++ b/nextcloud_mcp_server/auth/viz_routes.py
@@ -1,13 +1,14 @@
"""Vector visualization routes for testing search algorithms.
Provides a web UI for users to test different search algorithms on their own
-indexed documents and visualize results in 2D space using PCA.
+indexed documents and visualize results in 3D space using PCA.
All processing happens server-side following ADR-012:
- Search execution via shared search/algorithms.py
-- PCA dimensionality reduction (768-dim → 2D)
-- Only 2D coordinates + metadata sent to client
-- Bandwidth-efficient (2 floats per doc vs 768)
+- Query embedding generation
+- PCA dimensionality reduction (768-dim → 3D)
+- Only 3D coordinates + metadata sent to client
+- Bandwidth-efficient (3 floats per doc vs 768)
"""
import logging
@@ -77,19 +78,20 @@ async def vector_visualization_html(request: Request) -> HTMLResponse:
@requires("authenticated", redirect="oauth_login")
async def vector_visualization_search(request: Request) -> JSONResponse:
- """Execute server-side search and return 2D coordinates + results.
+ """Execute server-side search and return 3D coordinates + results.
All processing happens server-side:
1. Execute search via shared algorithm module
- 2. Fetch matching vectors from Qdrant
- 3. Apply PCA reduction (768-dim → 2D)
- 4. Return coordinates + metadata only
+ 2. Generate query embedding
+ 3. Fetch matching vectors from Qdrant
+ 4. Apply PCA reduction (768-dim → 3D) to query + documents
+ 5. Return coordinates + metadata only
Args:
request: Starlette request with query parameters
Returns:
- JSON response with coordinates_2d and results
+ JSON response with coordinates_3d and results (including query point)
"""
settings = get_settings()
@@ -209,7 +211,8 @@ async def vector_visualization_search(request: Request) -> JSONResponse:
{
"success": True,
"results": [],
- "coordinates_2d": [],
+ "coordinates_3d": [],
+ "query_coords": None,
"message": "No results found",
}
)
@@ -253,7 +256,7 @@ async def vector_visualization_search(request: Request) -> JSONResponse:
}
)
- # Extract dense vectors (handle both named and unnamed vectors)
+ # Extract dense vectors and group by document
def extract_dense_vector(point):
if point.vector is None:
return None
@@ -263,13 +266,21 @@ async def vector_visualization_search(request: Request) -> JSONResponse:
# If unnamed vector (array), use directly
return point.vector
- vectors = np.array(
- [v for v in (extract_dense_vector(p) for p in points) if v is not None]
- )
+ # Group chunk vectors by doc_id
+ from collections import defaultdict
+
+ doc_chunks = defaultdict(list)
+ for point in points:
+ if point.payload:
+ doc_id = int(point.payload.get("doc_id", 0))
+ vector = extract_dense_vector(point)
+ if vector is not None:
+ doc_chunks[doc_id].append(vector)
+
vector_fetch_duration = time.perf_counter() - vector_fetch_start
- if len(vectors) < 2:
- # Not enough points for PCA
+ if len(doc_chunks) < 2:
+ # Not enough documents for PCA
return JSONResponse(
{
"success": True,
@@ -283,35 +294,131 @@ async def vector_visualization_search(request: Request) -> JSONResponse:
}
for r in search_results
],
- "coordinates_2d": [[0, 0]] * len(search_results),
- "message": "Not enough vectors for PCA",
+ "coordinates_3d": [[0, 0, 0]] * len(search_results),
+ "query_coords": [0, 0, 0],
+ "message": "Not enough documents for PCA",
}
)
- # Apply PCA dimensionality reduction (768-dim → 2D)
+ # Detect embedding dimension from first available vector
+ embedding_dim = None
+ for chunks in doc_chunks.values():
+ if chunks:
+ embedding_dim = len(chunks[0])
+ break
+
+ if embedding_dim is None:
+ return JSONResponse(
+ {
+ "success": False,
+ "error": "Could not determine embedding dimension",
+ },
+ status_code=500,
+ )
+
+ logger.info(f"Detected embedding dimension: {embedding_dim}")
+
+ # Average chunk vectors per document to create document-level embeddings
+ # Maintain order of search_results for coordinate mapping
+ doc_vectors = []
+ for result in search_results:
+ if result.id in doc_chunks:
+ # Average all chunk embeddings for this document
+ chunk_vectors = np.array(doc_chunks[result.id])
+ avg_vector = np.mean(chunk_vectors, axis=0)
+ doc_vectors.append(avg_vector)
+ logger.debug(f"Doc {result.id}: averaged {len(chunk_vectors)} chunks")
+ else:
+ # Document not found in vectors (shouldn't happen)
+ logger.warning(f"Doc {result.id} not found in fetched vectors")
+ # Use zero vector as fallback with detected dimension
+ doc_vectors.append(np.zeros(embedding_dim))
+
+ doc_vectors = np.array(doc_vectors)
+
+ # Generate query embedding for visualization
+ query_embed_start = time.perf_counter()
+ from nextcloud_mcp_server.embedding.service import get_embedding_service
+
+ embedding_service = get_embedding_service()
+ query_embedding = await embedding_service.embed(query)
+ query_embed_duration = time.perf_counter() - query_embed_start
+
+ logger.info(f"Generated query embedding (dimension={len(query_embedding)})")
+
+ # Combine query vector with document vectors for PCA
+ # Query will be the last point in the array
+ all_vectors = np.vstack([doc_vectors, np.array([query_embedding])])
+
+ # Normalize vectors to unit length (L2 normalization)
+ # This is critical because Qdrant uses COSINE distance, which only measures
+ # vector direction (angle), not magnitude. PCA uses Euclidean distance which
+ # considers both direction and magnitude. For unit vectors, squared Euclidean
+ # distance equals 2 * (1 - cosine similarity), so the two orderings agree.
+ norms = np.linalg.norm(all_vectors, axis=1, keepdims=True)
+
+ # Check for zero-norm vectors (can happen with empty/corrupted embeddings)
+ zero_norm_mask = norms[:, 0] < 1e-10
+ if zero_norm_mask.any():
+ zero_indices = np.where(zero_norm_mask)[0]
+ logger.warning(
+ f"Found {zero_norm_mask.sum()} zero-norm vectors at indices {zero_indices.tolist()}. "
+ "Replacing with small epsilon to avoid division by zero."
+ )
+ # Replace zero norms with small epsilon to avoid NaN
+ norms[zero_norm_mask] = 1e-10
+
+ all_vectors_normalized = all_vectors / norms
+ logger.info(
+ f"Normalized vectors: query_norm={norms[-1][0]:.3f}, "
+ f"doc_norm_range=[{norms[:-1].min():.3f}, {norms[:-1].max():.3f}]"
+ )
+
+ # Apply PCA dimensionality reduction (768-dim → 3D) on normalized vectors
pca_start = time.perf_counter()
- pca = PCA(n_components=2)
- coords_2d = pca.fit_transform(vectors)
+ pca = PCA(n_components=3)
+ coords_3d = pca.fit_transform(all_vectors_normalized)
pca_duration = time.perf_counter() - pca_start
# After fit, these attributes are guaranteed to be set
assert pca.explained_variance_ratio_ is not None
- logger.info(
- f"PCA explained variance: PC1={pca.explained_variance_ratio_[0]:.3f}, "
- f"PC2={pca.explained_variance_ratio_[1]:.3f}"
+ # Check for NaN values in PCA output (numerical instability)
+ nan_mask = np.isnan(coords_3d)
+ if nan_mask.any():
+ nan_rows = np.where(nan_mask.any(axis=1))[0]
+ logger.error(
+ f"Found NaN values in PCA output at {len(nan_rows)} points: {nan_rows.tolist()[:10]}. "
+ "Replacing NaN with 0.0 to prevent JSON serialization error."
+ )
+ # Replace NaN with 0 to allow JSON serialization
+ coords_3d = np.nan_to_num(coords_3d, nan=0.0)
+
+ # Split query coords from document coords
+ # Round to 2 decimal places for cleaner display
+ query_coords_3d = [
+ round(float(x), 2) for x in coords_3d[-1]
+ ] # Last point is query
+ doc_coords_3d = coords_3d[:-1] # All but last are documents
+
+ total_chunks = sum(len(chunks) for chunks in doc_chunks.values())
+ avg_chunks_per_doc = (
+ total_chunks / len(doc_vectors) if doc_vectors.size > 0 else 0
)
- # Map results to coordinates (use first chunk per document)
- result_coords = []
- seen_doc_ids = set()
+ logger.info(
+ f"PCA explained variance: PC1={pca.explained_variance_ratio_[0]:.3f}, "
+ f"PC2={pca.explained_variance_ratio_[1]:.3f}, "
+ f"PC3={pca.explained_variance_ratio_[2]:.3f}"
+ )
+ logger.info(
+ f"Embedding stats: documents={len(doc_vectors)}, "
+ f"total_chunks={total_chunks}, avg_chunks_per_doc={avg_chunks_per_doc:.1f}, "
+ f"query_dim={len(query_embedding)}, doc_vector_dim={doc_vectors.shape[1] if doc_vectors.size > 0 else 0}"
+ )
- for point, coord in zip(points, coords_2d):
- if point.payload:
- doc_id = int(point.payload.get("doc_id", 0))
- if doc_id not in seen_doc_ids and doc_id in doc_ids:
- seen_doc_ids.add(doc_id)
- result_coords.append(coord.tolist())
+ # Coordinates already match search_results order (1:1 mapping)
+ result_coords = [[round(float(x), 2) for x in coord] for coord in doc_coords_3d]
# Build response
response_results = [
@@ -338,26 +445,30 @@ async def vector_visualization_search(request: Request) -> JSONResponse:
f"Viz search timing: total={total_duration * 1000:.1f}ms, "
f"search={search_duration * 1000:.1f}ms ({search_duration / total_duration * 100:.1f}%), "
f"vector_fetch={vector_fetch_duration * 1000:.1f}ms ({vector_fetch_duration / total_duration * 100:.1f}%), "
+ f"query_embed={query_embed_duration * 1000:.1f}ms ({query_embed_duration / total_duration * 100:.1f}%), "
f"pca={pca_duration * 1000:.1f}ms ({pca_duration / total_duration * 100:.1f}%), "
- f"results={len(search_results)}, vectors={len(vectors)}"
+ f"results={len(search_results)}, doc_vectors={len(doc_vectors)}"
)
return JSONResponse(
{
"success": True,
"results": response_results,
- "coordinates_2d": result_coords[: len(search_results)],
+ "coordinates_3d": result_coords[: len(search_results)],
+ "query_coords": query_coords_3d,
"pca_variance": {
"pc1": float(pca.explained_variance_ratio_[0]),
"pc2": float(pca.explained_variance_ratio_[1]),
+ "pc3": float(pca.explained_variance_ratio_[2]),
},
"timing": {
"total_ms": round(total_duration * 1000, 2),
"search_ms": round(search_duration * 1000, 2),
"vector_fetch_ms": round(vector_fetch_duration * 1000, 2),
+ "query_embed_ms": round(query_embed_duration * 1000, 2),
"pca_ms": round(pca_duration * 1000, 2),
"num_results": len(search_results),
- "num_vectors": len(vectors),
+ "num_doc_vectors": len(doc_vectors),
},
}
)
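
The normalize-then-project step in viz_routes.py can be sketched on its own as below. This is a simplified illustration under stated assumptions: `project_to_3d` is a hypothetical helper name, the random vectors stand in for real embeddings, and `PCA` is assumed to be scikit-learn's `sklearn.decomposition.PCA` (the import is not visible in the hunks above).

```python
import numpy as np
from sklearn.decomposition import PCA


def project_to_3d(doc_vectors, query_vector, eps=1e-10):
    """L2-normalize documents + query together, then reduce to 3D with PCA.

    Returns (doc_coords, query_coords); the query is appended last, as in the diff.
    """
    all_vectors = np.vstack([doc_vectors, query_vector[np.newaxis, :]])

    # Guard against zero-norm rows so the division cannot produce NaN.
    norms = np.linalg.norm(all_vectors, axis=1, keepdims=True)
    norms = np.where(norms < eps, eps, norms)
    unit_vectors = all_vectors / norms

    coords = PCA(n_components=3).fit_transform(unit_vectors)
    coords = np.nan_to_num(coords, nan=0.0)  # keep the result JSON-serializable
    return coords[:-1], coords[-1]


# Stand-in data: 5 document embeddings and 1 query embedding of dimension 768.
rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 768))
query = rng.normal(size=768)

doc_coords, query_coords = project_to_3d(docs, query)
```

Fitting PCA on documents and query together places the query point in the same coordinate system as the results. Note that a 3D projection only approximates the original distances; how well it does so is indicated by the explained variance ratios the endpoint returns.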
diff --git a/nextcloud_mcp_server/vector/document_chunker.py b/nextcloud_mcp_server/vector/document_chunker.py
index 0104c8f..b2c1c3d 100644
--- a/nextcloud_mcp_server/vector/document_chunker.py
+++ b/nextcloud_mcp_server/vector/document_chunker.py
@@ -3,7 +3,7 @@
import logging
from dataclasses import dataclass
-from langchain_text_splitters import MarkdownTextSplitter
+from langchain_text_splitters import RecursiveCharacterTextSplitter
logger = logging.getLogger(__name__)
@@ -20,9 +20,9 @@ class ChunkWithPosition:
class DocumentChunker:
"""Chunk large documents for optimal embedding using LangChain text splitters.
- Uses MarkdownTextSplitter which is optimized for Markdown content like
- Nextcloud Notes. Respects markdown structure (headers, code blocks, lists)
- while maintaining semantic boundaries.
+ Uses RecursiveCharacterTextSplitter which preserves semantic boundaries
+ by splitting on paragraph and line boundaries before resorting to
+ word- and character-level splitting.
"""
def __init__(self, chunk_size: int = 2048, overlap: int = 200):
@@ -36,15 +36,14 @@ class DocumentChunker:
self.chunk_size = chunk_size
self.overlap = overlap
- # Initialize LangChain MarkdownTextSplitter
- # Optimized for Markdown content with special handling for:
- # - Headers (# ## ###)
- # - Code blocks (``` ```)
- # - Lists (- * 1.)
- # - Horizontal rules (---)
- # - Paragraphs and sentences
- # This preserves both markdown structure and semantic boundaries
- self.splitter = MarkdownTextSplitter(
+ # Initialize LangChain RecursiveCharacterTextSplitter
+ # Uses hierarchical splitting with the library's default separators:
+ # - Paragraphs (\n\n)
+ # - Lines (\n)
+ # - Words (spaces)
+ # - Characters (last resort)
+ # This prefers paragraph and line breaks; sentence-level separators are not a default
+ self.splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=overlap,
add_start_index=True, # Enable position tracking
@@ -55,14 +54,14 @@ class DocumentChunker:
"""
Split text into overlapping chunks with position tracking.
- Uses LangChain's MarkdownTextSplitter to create chunks that respect
- both markdown structure and semantic boundaries. Optimized for Nextcloud
- Notes content with special handling for headers, code blocks, lists, etc.
- Preserves character positions for each chunk to enable precise document
- retrieval.
+ Uses LangChain's RecursiveCharacterTextSplitter to create chunks that
+ preserve semantic boundaries by splitting at paragraphs and line breaks
+ before resorting to word or character-level splitting. This keeps
+ paragraphs intact when they fit. Preserves character positions for each chunk
+ to enable precise document retrieval.
Args:
- content: Markdown text content to chunk
+ content: Text content to chunk
Returns:
List of chunks with their character positions in the original content
diff --git a/tests/unit/test_config.py b/tests/unit/test_config.py
index 2caaa05..be396c3 100644
--- a/tests/unit/test_config.py
+++ b/tests/unit/test_config.py
@@ -159,8 +159,8 @@ class TestChunkConfigValidation:
def test_default_chunk_settings(self):
"""Test default chunk size and overlap values."""
settings = Settings()
- assert settings.document_chunk_size == 512
- assert settings.document_chunk_overlap == 50
+ assert settings.document_chunk_size == 2048
+ assert settings.document_chunk_overlap == 200
def test_valid_chunk_settings(self):
"""Test valid chunk size and overlap configuration."""
@@ -205,7 +205,7 @@ class TestChunkConfigValidation:
)
def test_small_chunk_size_warning(self, caplog):
- """Test that chunk size < 100 triggers warning."""
+ """Test that chunk size < 512 triggers warning."""
import logging
caplog.set_level(logging.WARNING, logger="nextcloud_mcp_server.config")
@@ -214,19 +214,19 @@ class TestChunkConfigValidation:
document_chunk_overlap=10,
)
assert (
- "DOCUMENT_CHUNK_SIZE is set to 64 words, which is quite small"
+ "DOCUMENT_CHUNK_SIZE is set to 64 characters, which is quite small"
in caplog.text
)
- assert "Consider using at least 256 words" in caplog.text
+ assert "Consider using at least 1024 characters" in caplog.text
def test_reasonable_chunk_size_no_warning(self, caplog):
- """Test that chunk size >= 100 doesn't trigger warning."""
+ """Test that chunk size >= 512 doesn't trigger warning."""
import logging
caplog.set_level(logging.WARNING, logger="nextcloud_mcp_server.config")
Settings(
- document_chunk_size=256,
- document_chunk_overlap=25,
+ document_chunk_size=1024,
+ document_chunk_overlap=100,
)
assert "DOCUMENT_CHUNK_SIZE" not in caplog.text