feat: Add Grafana dashboard and vector sync metric instrumentation

Implement comprehensive observability for vector database synchronization
with Grafana dashboard and Prometheus metrics.

## Part 1: Grafana Dashboard

Created all-in-one operations dashboard with 7 rows and 34 panels:

### Dashboard Structure:
- **Overview Row**: Request rate, error rate, P95 latency, active requests
- **HTTP Metrics (RED)**: Request/error rates by endpoint, latency percentiles
- **MCP Tools**: Call volume, error rates, execution duration by tool
- **Nextcloud API**: API calls/latency by app, retry patterns
- **OAuth & Authentication**: Token validations, exchanges, cache hit rate
- **Dependencies & Health**: Status for Nextcloud/Qdrant/Keycloak/Unstructured
- **Vector Sync**: Processing throughput, queue depth, Qdrant operations

### Helm Chart Integration:
- Added dashboard-configmap.yaml template for automatic provisioning
- Configured Grafana sidecar auto-discovery (label: grafana_dashboard="1")
- Added dashboards configuration section in values.yaml (opt-in)
- Updated Chart.yaml with dashboard annotations
- Enhanced NOTES.txt with dashboard deployment instructions
- Comprehensive documentation in dashboards/README.md

Dashboard supports dynamic filtering via variables:
- datasource: Prometheus data source selection
- namespace: Filter by Kubernetes namespace
- pod: Multi-select pod filtering
- interval: Query interval (1m/5m/10m/30m/1h)

## Part 2: Vector Sync Metric Instrumentation

Implemented metric recording throughout vector sync pipeline:

### metrics.py:
Added convenience functions:
- record_vector_sync_scan() - Track documents per scan
- record_vector_sync_processing() - Track processing duration/status
- record_qdrant_operation() - Track database operations
- update_vector_sync_queue_size() - Track queue depth

### scanner.py:
- Record number of documents found in each scan
- Enables monitoring of scan throughput

### processor.py:
- Record processing duration for each document
- Track success/failure status with timing
- Record Qdrant upsert/delete operations
- Handle all code paths (success, deletion, error)

### semantic.py:
- Wrap Qdrant query_points with try/except
- Record search operation success/failure

## Metrics Exposed:

- mcp_vector_sync_documents_scanned_total
- mcp_vector_sync_documents_processed_total{status}
- mcp_vector_sync_processing_duration_seconds (histogram)
- mcp_vector_sync_queue_size (gauge)
- mcp_qdrant_operations_total{operation,status}

This enables monitoring of:
- Scan and processing throughput
- Processing latency (P50/P95/P99)
- Error rates for processing and Qdrant operations
- Queue depth trends
- Complete observability of vector sync pipeline

## Testing:

Verified locally that metrics are recorded correctly:
- 36 documents scanned
- 3 documents processed (avg 7.5s each)
- 3 successful Qdrant upsert operations
- Search operations tracked

## Deployment:

Enable dashboard provisioning in Helm values:
```yaml
dashboards:
  enabled: true
  grafanaFolder: "Nextcloud MCP"
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
Chris Coutinho
2025-11-13 11:49:20 +01:00
parent 619ba5684d
commit 4ea5ed72d4
11 changed files with 1976 additions and 562 deletions
+75 -48
View File
@@ -15,6 +15,10 @@ from qdrant_client.models import FieldCondition, Filter, MatchValue, PointStruct
from nextcloud_mcp_server.client import NextcloudClient
from nextcloud_mcp_server.config import get_settings
from nextcloud_mcp_server.embedding import get_embedding_service
from nextcloud_mcp_server.observability.metrics import (
record_qdrant_operation,
record_vector_sync_processing,
)
from nextcloud_mcp_server.observability.tracing import trace_operation
from nextcloud_mcp_server.vector.document_chunker import DocumentChunker
from nextcloud_mcp_server.vector.qdrant_client import get_qdrant_client
@@ -90,6 +94,8 @@ async def process_document(doc_task: DocumentTask, nc_client: NextcloudClient):
doc_task: Document task to process
nc_client: Authenticated Nextcloud client
"""
start_time = time.time()
logger.debug(
f"Processing {doc_task.doc_type}_{doc_task.doc_id} "
f"for {doc_task.user_id} ({doc_task.operation})"
@@ -105,58 +111,79 @@ async def process_document(doc_task: DocumentTask, nc_client: NextcloudClient):
"vector_sync.doc_operation": doc_task.operation,
},
):
qdrant_client = await get_qdrant_client()
settings = get_settings()
try:
qdrant_client = await get_qdrant_client()
settings = get_settings()
# Handle deletion
if doc_task.operation == "delete":
await qdrant_client.delete(
collection_name=settings.get_collection_name(),
points_selector=Filter(
must=[
FieldCondition(
key="user_id",
match=MatchValue(value=doc_task.user_id),
),
FieldCondition(
key="doc_id",
match=MatchValue(value=doc_task.doc_id),
),
FieldCondition(
key="doc_type",
match=MatchValue(value=doc_task.doc_type),
),
]
),
)
logger.info(
f"Deleted {doc_task.doc_type}_{doc_task.doc_id} for {doc_task.user_id}"
)
return
# Handle deletion
if doc_task.operation == "delete":
await qdrant_client.delete(
collection_name=settings.get_collection_name(),
points_selector=Filter(
must=[
FieldCondition(
key="user_id",
match=MatchValue(value=doc_task.user_id),
),
FieldCondition(
key="doc_id",
match=MatchValue(value=doc_task.doc_id),
),
FieldCondition(
key="doc_type",
match=MatchValue(value=doc_task.doc_type),
),
]
),
)
logger.info(
f"Deleted {doc_task.doc_type}_{doc_task.doc_id} for {doc_task.user_id}"
)
# Handle indexing with retry
max_retries = 3
retry_delay = 1.0
# Record successful deletion metrics
duration = time.time() - start_time
record_qdrant_operation("delete", "success")
record_vector_sync_processing(duration, "success")
return
for attempt in range(max_retries):
try:
await _index_document(doc_task, nc_client, qdrant_client)
return # Success
# Handle indexing with retry
max_retries = 3
retry_delay = 1.0
except (HTTPStatusError, Exception) as e:
if attempt < max_retries - 1:
logger.warning(
f"Retry {attempt + 1}/{max_retries} for "
f"{doc_task.doc_type}_{doc_task.doc_id}: {e}"
)
await anyio.sleep(retry_delay)
retry_delay *= 2 # Exponential backoff
else:
logger.error(
f"Failed to index {doc_task.doc_type}_{doc_task.doc_id} "
f"after {max_retries} retries: {e}"
)
raise
for attempt in range(max_retries):
try:
await _index_document(doc_task, nc_client, qdrant_client)
# Record successful processing metrics
duration = time.time() - start_time
record_qdrant_operation("upsert", "success")
record_vector_sync_processing(duration, "success")
return # Success
except (HTTPStatusError, Exception) as e:
if attempt < max_retries - 1:
logger.warning(
f"Retry {attempt + 1}/{max_retries} for "
f"{doc_task.doc_type}_{doc_task.doc_id}: {e}"
)
await anyio.sleep(retry_delay)
retry_delay *= 2 # Exponential backoff
else:
logger.error(
f"Failed to index {doc_task.doc_type}_{doc_task.doc_id} "
f"after {max_retries} retries: {e}"
)
# Record failed processing metrics
duration = time.time() - start_time
record_qdrant_operation("upsert", "error")
record_vector_sync_processing(duration, "error")
raise
except Exception:
# Catch any other unexpected errors
duration = time.time() - start_time
record_vector_sync_processing(duration, "error")
raise
async def _index_document(