Observability and Monitoring
The Nextcloud MCP Server includes comprehensive observability features for production deployments:
- Prometheus metrics for monitoring performance and health
- OpenTelemetry distributed tracing for debugging request flows
- Structured JSON logging with trace correlation
- Kubernetes integration via ServiceMonitor and PrometheusRule
Quick Start
Local Development with Prometheus
# Enable metrics (enabled by default)
export METRICS_ENABLED=true
export METRICS_PORT=9090
# Enable tracing (optional - tracing is enabled when OTEL_EXPORTER_OTLP_ENDPOINT is set)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
# Start the server
docker-compose up -d mcp
Access metrics at: http://localhost:9090/metrics
Kubernetes Deployment
Metrics are automatically scraped if you have Prometheus Operator installed:
helm install nextcloud-mcp charts/nextcloud-mcp-server \
--set observability.metrics.enabled=true \
--set observability.tracing.enabled=true \
--set observability.tracing.endpoint=http://opentelemetry-collector:4317 \
--set serviceMonitor.enabled=true
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| METRICS_ENABLED | true | Enable Prometheus metrics |
| METRICS_PORT | 9090 | Port for metrics endpoint |
| OTEL_EXPORTER_OTLP_ENDPOINT | - | OTLP gRPC endpoint (e.g., http://otel-collector:4317). Tracing is enabled when this is set. |
| OTEL_SERVICE_NAME | nextcloud-mcp-server | Service name in traces |
| OTEL_TRACES_SAMPLER | always_on | Trace sampling strategy |
| OTEL_TRACES_SAMPLER_ARG | 1.0 | Sampling rate (0.0-1.0) |
| LOG_FORMAT | json | Log format (json or text) |
| LOG_LEVEL | INFO | Minimum log level |
| LOG_INCLUDE_TRACE_CONTEXT | true | Include trace IDs in logs |
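As an illustration of how these variables and their defaults fit together, here is a minimal sketch of a settings loader. The `ObservabilityConfig` class, the `load_config` function, and the accepted boolean spellings are assumptions made for this example, not the server's actual code; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as "true"; the server may accept others.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ObservabilityConfig:  # hypothetical name, for illustration only
    metrics_enabled: bool
    metrics_port: int
    otlp_endpoint: Optional[str]
    service_name: str
    sampler: str
    sampler_arg: float
    log_format: str
    log_level: str
    log_include_trace_context: bool

    @property
    def tracing_enabled(self) -> bool:
        # Tracing is enabled when OTEL_EXPORTER_OTLP_ENDPOINT is set.
        return bool(self.otlp_endpoint)


def load_config(env: Mapping[str, str] = os.environ) -> ObservabilityConfig:
    """Read the documented variables, falling back to the documented defaults."""
    return ObservabilityConfig(
        metrics_enabled=_as_bool(env.get("METRICS_ENABLED", "true")),
        metrics_port=int(env.get("METRICS_PORT", "9090")),
        otlp_endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or None,
        service_name=env.get("OTEL_SERVICE_NAME", "nextcloud-mcp-server"),
        sampler=env.get("OTEL_TRACES_SAMPLER", "always_on"),
        sampler_arg=float(env.get("OTEL_TRACES_SAMPLER_ARG", "1.0")),
        log_format=env.get("LOG_FORMAT", "json"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        log_include_trace_context=_as_bool(env.get("LOG_INCLUDE_TRACE_CONTEXT", "true")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.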
Helm Chart Configuration
observability:
  metrics:
    enabled: true
    port: 9090
    path: /metrics
  tracing:
    enabled: true
    endpoint: "http://opentelemetry-collector:4317"
    samplingRate: 1.0
  logging:
    format: json
    level: INFO
    includeTraceContext: true
serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s
Metrics
HTTP Server Metrics (RED)
- mcp_http_requests_total - Total HTTP requests
- mcp_http_request_duration_seconds - Request latency histogram
- mcp_http_requests_in_progress - In-flight requests gauge
MCP Tool Metrics
- mcp_tool_calls_total - Tool invocation count by status
- mcp_tool_duration_seconds - Tool execution latency
- mcp_tool_errors_total - Tool errors by type
Nextcloud API Metrics
- mcp_nextcloud_api_requests_total - API calls by app and status
- mcp_nextcloud_api_duration_seconds - API latency by app
- mcp_nextcloud_api_retries_total - Retry count (429, timeout, etc.)
OAuth Flow Metrics
- mcp_oauth_token_validations_total - Token validation count
- mcp_oauth_token_exchange_total - Token exchange operations
- mcp_oauth_token_cache_hits_total - Cache hit/miss rate
- mcp_oauth_refresh_token_operations_total - Refresh token storage ops
Vector Sync Metrics (when enabled)
- mcp_vector_sync_documents_scanned_total - Documents discovered
- mcp_vector_sync_documents_processed_total - Processing results
- mcp_vector_sync_processing_duration_seconds - Processing latency
- mcp_vector_sync_queue_size - Current queue depth
- mcp_qdrant_operations_total - Qdrant DB operations
Database Metrics
- mcp_db_operations_total - DB operations (SQLite, Qdrant)
- mcp_db_operation_duration_seconds - DB latency
Dependency Health
- mcp_dependency_health - External dependency status (1=up, 0=down)
- mcp_dependency_check_duration_seconds - Health check latency
Distributed Tracing
Span Hierarchy
HTTP POST /messages
├── mcp.tool.nc_notes_create_note
│   └── nextcloud.api.notes.POST
│       └── httpx request (auto-instrumented)
└── oauth.token.validate (if OAuth mode)
    └── httpx request to IdP
Span Attributes
- MCP tools: mcp.tool.name, mcp.tool.args (sanitized)
- Nextcloud API: nextcloud.app, http.method, http.status_code
- OAuth: oauth.operation, oauth.method
- Vector sync: vector_sync.operation, vector_sync.document_count
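To make "sanitized" concrete, the following is a minimal sketch of how tool arguments might be filtered before being attached to a span as mcp.tool.args. The function name `sanitize_tool_args`, the exact-match comparison on keys, and the conversion of values with `str()` are assumptions for illustration. The deny-list (password, token, secret, api_key, etag, ctx) and the 500-character cap are taken as given here and may differ from what the decorator actually does.

```python
from typing import Any, Dict, Mapping

# Assumed deny-list of argument names that must never reach a span.
SENSITIVE_KEYS = frozenset({"password", "token", "secret", "api_key", "etag", "ctx"})

# Assumed cap on each stringified value, to keep spans small.
MAX_VALUE_LENGTH = 500


def sanitize_tool_args(args: Mapping[str, Any]) -> Dict[str, str]:
    """Drop sensitive arguments and truncate long values (illustrative only)."""
    sanitized: Dict[str, str] = {}
    for key, value in args.items():
        if key.lower() in SENSITIVE_KEYS:
            continue  # excluded entirely, not masked
        text = str(value)
        if len(text) > MAX_VALUE_LENGTH:
            text = text[:MAX_VALUE_LENGTH]
        sanitized[key] = text
    return sanitized
```

Dropping sensitive keys outright, rather than masking their values, avoids leaking even the length of a secret into the trace backend.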
Trace Context in Logs
When tracing is enabled, all logs include trace_id and span_id:
{
  "timestamp": "2025-01-09T12:34:56.789Z",
  "level": "INFO",
  "logger": "nextcloud_mcp_server.server.notes",
  "message": "Note created successfully",
  "trace_id": "a1b2c3d4e5f6...",
  "span_id": "123456789abc...",
  "note_id": 42
}
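A log record of this shape could be produced by a formatter along the following lines. This is a standard-library sketch, not the server's logging code: the `current_trace` context variable is a hypothetical stand-in for the active OpenTelemetry span context, and `TraceContextJsonFormatter` is a name invented for this example. The 32- and 16-digit hex widths match how OpenTelemetry trace and span IDs are conventionally rendered.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Hypothetical stand-in for the active span context: (trace_id, span_id) as ints.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextJsonFormatter(logging.Formatter):
    """Render records as JSON, adding trace_id/span_id when a trace is active."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        ctx = current_trace.get()
        if ctx is not None:
            trace_id, span_id = ctx
            payload["trace_id"] = format(trace_id, "032x")  # 128-bit ID as hex
            payload["span_id"] = format(span_id, "016x")    # 64-bit ID as hex
        return json.dumps(payload)
```

With a formatter like this, records logged outside any span simply omit the two trace fields rather than emitting empty values.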
Dashboards
Prometheus Queries
Request Rate (req/s):
sum(rate(mcp_http_requests_total[5m])) by (method, endpoint)
Error Rate (%):
sum(rate(mcp_http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(mcp_http_requests_total[5m])) * 100
P95 Latency:
histogram_quantile(0.95,
sum(rate(mcp_http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)
Top Tools by Volume:
topk(10, sum(rate(mcp_tool_calls_total[5m])) by (tool_name))
Nextcloud API Health:
sum(rate(mcp_nextcloud_api_requests_total{status_code!~"2.."}[5m])) by (app)
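The P95 latency query above depends on how histogram_quantile estimates a quantile from cumulative bucket counts. The sketch below reproduces the general idea (linear interpolation inside the bucket that contains the target rank) in plain Python. It is a simplified model that assumes non-negative bucket bounds and classic histograms; it is not Prometheus's implementation, and the bucket values in the usage note are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The list is expected to include the +Inf bucket, as Prometheus histograms do.
    """
    buckets = sorted(buckets)
    if len(buckets) < 2:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper):
                # The quantile lies beyond the last finite bound; report that bound.
                return prev_bound
            in_bucket = cumulative - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation within the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, cumulative
    return math.nan
```

For example, with buckets (0.1, 50), (0.5, 90), (1.0, 100), (+Inf, 100), the 95th observation falls halfway through the 0.5 to 1.0 bucket, so the estimate is 0.75 seconds. This is why bucket boundaries should bracket the latency thresholds you alert on: the estimate can never be more precise than the bucket it lands in.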
Alerts
Recommended Alert Rules
Critical:
- Server down for >5min
- Error rate >5% for >5min
- P95 latency >1s for >5min
- Dependency down for >2min
Warning:
- Token validation errors >1% for >10min
- Vector sync queue >100 for >15min
- Qdrant slow (p95 >500ms) for >10min
See charts/nextcloud-mcp-server/templates/prometheusrule.yaml for complete definitions.
Troubleshooting
Metrics Not Appearing
- Check metrics are enabled: curl http://localhost:9090/metrics
- Verify ServiceMonitor labels match Prometheus selector
- Check Prometheus target status: http://prometheus:9090/targets
Traces Not Appearing
- Verify OTLP endpoint is reachable: curl http://otel-collector:4317
- Check collector logs for errors
- Verify sampling rate is not 0.0
- Check trace backend (Jaeger/Tempo) connectivity
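Because port 4317 serves gRPC rather than plain HTTP, curl output against it can be hard to interpret. As an alternative for the first check, here is a small sketch that only tests whether a TCP connection can be opened. The function name `otlp_endpoint_reachable` is made up for this example, and a successful connection says nothing about whether the collector accepts or forwards spans.

```python
import socket
from urllib.parse import urlparse


def otlp_endpoint_reachable(endpoint: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parsed = urlparse(endpoint)
    host = parsed.hostname
    if host is None:
        return False
    port = parsed.port or 4317  # assume the OTLP gRPC port when none is given
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

Run it from inside the same pod or container as the server, for example with `otlp_endpoint_reachable("http://otel-collector:4317")`, so that DNS and network policy match what the exporter sees.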
High Cardinality Metrics
If you see cardinality warnings:
- Middleware normalizes endpoints (e.g., /user/123 → /user/*)
- OAuth tokens are never included in metric labels
- User IDs are not tracked (use tracing for per-user debugging)
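Endpoint normalization of the kind mentioned in the first point can be sketched as follows. The rule used here (replace purely numeric or UUID-shaped path segments with `*`) and the name `normalize_endpoint` are assumptions for illustration; the middleware's actual rules may differ.

```python
import re

# Assumed definition of an "ID-like" segment: all digits, or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalize_endpoint(path: str) -> str:
    """Collapse ID-like path segments so metric labels stay low-cardinality."""
    segments = path.split("/")
    return "/".join("*" if _ID_SEGMENT.match(segment) else segment for segment in segments)
```

With this rule, `/user/123` and `/user/456` both map to `/user/*`, so they share one label value instead of creating a new time series per user.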
Performance Impact
- Metrics: <1% overhead (counters/histograms are very fast)
- Tracing: ~2-5% overhead at 100% sampling
- JSON logging: <1% overhead vs text logging
Recommendation: Always enable metrics. Enable tracing in staging/production with 10-50% sampling.
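To show what a 10-50% sampling rate means in practice, here is a sketch of a ratio-based sampling decision, modeled loosely on trace-ID ratio sampling in OpenTelemetry. Whether the server uses this exact scheme is an assumption; in the standard OpenTelemetry SDK configuration, a ratio argument only takes effect with a ratio-based sampler (for example traceidratio) rather than always_on. The function name `should_sample` is made up for this example.

```python
TRACE_ID_MASK = (1 << 64) - 1  # consider only the low 64 bits of the trace ID


def should_sample(trace_id: int, rate: float) -> bool:
    """Deterministically keep roughly `rate` of all traces, keyed on the trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    bound = round(rate * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict, which keeps sampled traces complete across service boundaries.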
Architecture
The observability stack integrates at multiple layers:
- HTTP Layer: ObservabilityMiddleware tracks all HTTP requests
- MCP Layer: Tools use @instrument_tool for automatic metrics and trace span creation
- Client Layer: BaseNextcloudClient tracks all API calls
- OAuth Layer: Token operations are traced and metered
- Background Tasks: Vector sync operations emit metrics/traces
All components share a single Prometheus registry and a single OpenTelemetry TracerProvider.
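The MCP layer's decorator pattern can be sketched as below. This is not the project's @instrument_tool: the in-memory `tool_calls_total`, `tool_durations`, and `finished_spans` containers are hypothetical stand-ins for a Prometheus counter, a histogram, and exported spans, and the status values "success" and "error" are assumed. Only the mcp.tool.* span naming follows the span hierarchy documented above.

```python
import functools
import time
from collections import Counter
from typing import Any, Awaitable, Callable, List, Tuple

# Hypothetical in-memory stand-ins for the real metrics and tracer.
tool_calls_total: Counter = Counter()          # keyed by (tool_name, status)
tool_durations: List[Tuple[str, float]] = []   # (tool_name, seconds)
finished_spans: List[str] = []                 # span names, in completion order


def instrument_tool(func: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
    """Wrap an async tool so every call records a status, a duration, and a span."""
    tool_name = func.__name__
    span_name = f"mcp.tool.{tool_name}"

    @functools.wraps(func)
    async def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        status = "success"
        try:
            return await func(*args, **kwargs)
        except Exception:
            status = "error"
            raise  # the caller still sees the original exception
        finally:
            tool_durations.append((tool_name, time.perf_counter() - start))
            tool_calls_total[(tool_name, status)] += 1
            finished_spans.append(span_name)

    return wrapper


@instrument_tool
async def nc_notes_create_note(title: str) -> dict:
    # Placeholder tool body for the example.
    return {"title": title}
```

Recording in the `finally` block means failed calls are counted and timed too, which is what makes error-rate and latency queries over tool metrics meaningful.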