feat: Add Grafana dashboard and vector sync metric instrumentation
Implement comprehensive observability for vector database synchronization
with Grafana dashboard and Prometheus metrics.
## Part 1: Grafana Dashboard
Created all-in-one operations dashboard with 7 rows and 34 panels:
### Dashboard Structure:
- **Overview Row**: Request rate, error rate, P95 latency, active requests
- **HTTP Metrics (RED)**: Request/error rates by endpoint, latency percentiles
- **MCP Tools**: Call volume, error rates, execution duration by tool
- **Nextcloud API**: API calls/latency by app, retry patterns
- **OAuth & Authentication**: Token validations, exchanges, cache hit rate
- **Dependencies & Health**: Status for Nextcloud/Qdrant/Keycloak/Unstructured
- **Vector Sync**: Processing throughput, queue depth, Qdrant operations
### Helm Chart Integration:
- Added dashboard-configmap.yaml template for automatic provisioning
- Configured Grafana sidecar auto-discovery (label: grafana_dashboard="1")
- Added dashboards configuration section in values.yaml (opt-in)
- Updated Chart.yaml with dashboard annotations
- Enhanced NOTES.txt with dashboard deployment instructions
- Comprehensive documentation in dashboards/README.md
Dashboard supports dynamic filtering via variables:
- datasource: Prometheus data source selection
- namespace: Filter by Kubernetes namespace
- pod: Multi-select pod filtering
- interval: Query interval (1m/5m/10m/30m/1h)
## Part 2: Vector Sync Metric Instrumentation
Implemented metric recording throughout vector sync pipeline:
### metrics.py:
Added convenience functions:
- record_vector_sync_scan() - Track documents per scan
- record_vector_sync_processing() - Track processing duration/status
- record_qdrant_operation() - Track database operations
- update_vector_sync_queue_size() - Track queue depth
### scanner.py:
- Record number of documents found in each scan
- Enables monitoring of scan throughput
### processor.py:
- Record processing duration for each document
- Track success/failure status with timing
- Record Qdrant upsert/delete operations
- Handle all code paths (success, deletion, error)
### semantic.py:
- Wrap Qdrant query_points with try/except
- Record search operation success/failure
## Metrics Exposed:
- mcp_vector_sync_documents_scanned_total
- mcp_vector_sync_documents_processed_total{status}
- mcp_vector_sync_processing_duration_seconds (histogram)
- mcp_vector_sync_queue_size (gauge)
- mcp_qdrant_operations_total{operation,status}
This enables monitoring of:
- Scan and processing throughput
- Processing latency (P50/P95/P99)
- Error rates for processing and Qdrant operations
- Queue depth trends
- Complete observability of vector sync pipeline
## Testing:
Verified locally that metrics are recorded correctly:
- 36 documents scanned
- 3 documents processed (avg 7.5s each)
- 3 successful Qdrant upsert operations
- Search operations tracked
## Deployment:
Enable dashboard provisioning in Helm values:
```yaml
dashboards:
enabled: true
grafanaFolder: "Nextcloud MCP"
```
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
@@ -15,6 +15,10 @@ from qdrant_client.models import FieldCondition, Filter, MatchValue, PointStruct
|
||||
from nextcloud_mcp_server.client import NextcloudClient
|
||||
from nextcloud_mcp_server.config import get_settings
|
||||
from nextcloud_mcp_server.embedding import get_embedding_service
|
||||
from nextcloud_mcp_server.observability.metrics import (
|
||||
record_qdrant_operation,
|
||||
record_vector_sync_processing,
|
||||
)
|
||||
from nextcloud_mcp_server.observability.tracing import trace_operation
|
||||
from nextcloud_mcp_server.vector.document_chunker import DocumentChunker
|
||||
from nextcloud_mcp_server.vector.qdrant_client import get_qdrant_client
|
||||
@@ -90,6 +94,8 @@ async def process_document(doc_task: DocumentTask, nc_client: NextcloudClient):
|
||||
doc_task: Document task to process
|
||||
nc_client: Authenticated Nextcloud client
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
logger.debug(
|
||||
f"Processing {doc_task.doc_type}_{doc_task.doc_id} "
|
||||
f"for {doc_task.user_id} ({doc_task.operation})"
|
||||
@@ -105,58 +111,79 @@ async def process_document(doc_task: DocumentTask, nc_client: NextcloudClient):
|
||||
"vector_sync.doc_operation": doc_task.operation,
|
||||
},
|
||||
):
|
||||
qdrant_client = await get_qdrant_client()
|
||||
settings = get_settings()
|
||||
try:
|
||||
qdrant_client = await get_qdrant_client()
|
||||
settings = get_settings()
|
||||
|
||||
# Handle deletion
|
||||
if doc_task.operation == "delete":
|
||||
await qdrant_client.delete(
|
||||
collection_name=settings.get_collection_name(),
|
||||
points_selector=Filter(
|
||||
must=[
|
||||
FieldCondition(
|
||||
key="user_id",
|
||||
match=MatchValue(value=doc_task.user_id),
|
||||
),
|
||||
FieldCondition(
|
||||
key="doc_id",
|
||||
match=MatchValue(value=doc_task.doc_id),
|
||||
),
|
||||
FieldCondition(
|
||||
key="doc_type",
|
||||
match=MatchValue(value=doc_task.doc_type),
|
||||
),
|
||||
]
|
||||
),
|
||||
)
|
||||
logger.info(
|
||||
f"Deleted {doc_task.doc_type}_{doc_task.doc_id} for {doc_task.user_id}"
|
||||
)
|
||||
return
|
||||
# Handle deletion
|
||||
if doc_task.operation == "delete":
|
||||
await qdrant_client.delete(
|
||||
collection_name=settings.get_collection_name(),
|
||||
points_selector=Filter(
|
||||
must=[
|
||||
FieldCondition(
|
||||
key="user_id",
|
||||
match=MatchValue(value=doc_task.user_id),
|
||||
),
|
||||
FieldCondition(
|
||||
key="doc_id",
|
||||
match=MatchValue(value=doc_task.doc_id),
|
||||
),
|
||||
FieldCondition(
|
||||
key="doc_type",
|
||||
match=MatchValue(value=doc_task.doc_type),
|
||||
),
|
||||
]
|
||||
),
|
||||
)
|
||||
logger.info(
|
||||
f"Deleted {doc_task.doc_type}_{doc_task.doc_id} for {doc_task.user_id}"
|
||||
)
|
||||
|
||||
# Handle indexing with retry
|
||||
max_retries = 3
|
||||
retry_delay = 1.0
|
||||
# Record successful deletion metrics
|
||||
duration = time.time() - start_time
|
||||
record_qdrant_operation("delete", "success")
|
||||
record_vector_sync_processing(duration, "success")
|
||||
return
|
||||
|
||||
for attempt in range(max_retries):
|
||||
try:
|
||||
await _index_document(doc_task, nc_client, qdrant_client)
|
||||
return # Success
|
||||
# Handle indexing with retry
|
||||
max_retries = 3
|
||||
retry_delay = 1.0
|
||||
|
||||
except (HTTPStatusError, Exception) as e:
|
||||
if attempt < max_retries - 1:
|
||||
logger.warning(
|
||||
f"Retry {attempt + 1}/{max_retries} for "
|
||||
f"{doc_task.doc_type}_{doc_task.doc_id}: {e}"
|
||||
)
|
||||
await anyio.sleep(retry_delay)
|
||||
retry_delay *= 2 # Exponential backoff
|
||||
else:
|
||||
logger.error(
|
||||
f"Failed to index {doc_task.doc_type}_{doc_task.doc_id} "
|
||||
f"after {max_retries} retries: {e}"
|
||||
)
|
||||
raise
|
||||
for attempt in range(max_retries):
|
||||
try:
|
||||
await _index_document(doc_task, nc_client, qdrant_client)
|
||||
|
||||
# Record successful processing metrics
|
||||
duration = time.time() - start_time
|
||||
record_qdrant_operation("upsert", "success")
|
||||
record_vector_sync_processing(duration, "success")
|
||||
return # Success
|
||||
|
||||
except (HTTPStatusError, Exception) as e:
|
||||
if attempt < max_retries - 1:
|
||||
logger.warning(
|
||||
f"Retry {attempt + 1}/{max_retries} for "
|
||||
f"{doc_task.doc_type}_{doc_task.doc_id}: {e}"
|
||||
)
|
||||
await anyio.sleep(retry_delay)
|
||||
retry_delay *= 2 # Exponential backoff
|
||||
else:
|
||||
logger.error(
|
||||
f"Failed to index {doc_task.doc_type}_{doc_task.doc_id} "
|
||||
f"after {max_retries} retries: {e}"
|
||||
)
|
||||
# Record failed processing metrics
|
||||
duration = time.time() - start_time
|
||||
record_qdrant_operation("upsert", "error")
|
||||
record_vector_sync_processing(duration, "error")
|
||||
raise
|
||||
|
||||
except Exception:
|
||||
# Catch any other unexpected errors
|
||||
duration = time.time() - start_time
|
||||
record_vector_sync_processing(duration, "error")
|
||||
raise
|
||||
|
||||
|
||||
async def _index_document(
|
||||
|
||||
Reference in New Issue
Block a user