Files
Chris Coutinho c4bf077050 feat: Add OpenTelemetry tracing to @instrument_tool decorator
Enhances the @instrument_tool decorator to create distributed traces
for all MCP tool executions, improving observability and debugging.

Changes:
- Modified @instrument_tool to wrap tool execution in trace_operation
- Added automatic span creation with mcp.tool.* span names
- Sanitized tool arguments before adding to span attributes
  (excludes password, token, secret, api_key, etag, ctx)
- Limited argument strings to 500 characters to prevent huge spans
- Maintained existing Prometheus metrics functionality
- Updated docs/observability.md to reflect correct decorator name
- Added comprehensive unit tests

All ~50+ MCP tools now emit traces automatically without code changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-16 11:16:05 +01:00

7.4 KiB

Observability and Monitoring

The Nextcloud MCP Server includes comprehensive observability features for production deployments:

  • Prometheus metrics for monitoring performance and health
  • OpenTelemetry distributed tracing for debugging request flows
  • Structured JSON logging with trace correlation
  • Kubernetes integration via ServiceMonitor and PrometheusRule

Quick Start

Local Development with Prometheus

# Enable metrics (enabled by default)
export METRICS_ENABLED=true
export METRICS_PORT=9090

# Enable tracing (optional - tracing is enabled when OTEL_EXPORTER_OTLP_ENDPOINT is set)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Start the server
docker-compose up -d mcp

Access metrics at: http://localhost:9090/metrics

Kubernetes Deployment

Metrics are automatically scraped if you have Prometheus Operator installed:

helm install nextcloud-mcp charts/nextcloud-mcp-server \
  --set observability.metrics.enabled=true \
  --set observability.tracing.enabled=true \
  --set observability.tracing.endpoint=http://opentelemetry-collector:4317 \
  --set serviceMonitor.enabled=true

Configuration

Environment Variables

Variable Default Description
METRICS_ENABLED true Enable Prometheus metrics
METRICS_PORT 9090 Port for metrics endpoint
OTEL_EXPORTER_OTLP_ENDPOINT - OTLP gRPC endpoint (e.g., http://otel-collector:4317). Tracing is enabled when this is set.
OTEL_SERVICE_NAME nextcloud-mcp-server Service name in traces
OTEL_TRACES_SAMPLER always_on Trace sampling strategy
OTEL_TRACES_SAMPLER_ARG 1.0 Sampling rate (0.0-1.0)
LOG_FORMAT json Log format (json or text)
LOG_LEVEL INFO Minimum log level
LOG_INCLUDE_TRACE_CONTEXT true Include trace IDs in logs

Helm Chart Configuration

observability:
  metrics:
    enabled: true
    port: 9090
    path: /metrics

  tracing:
    enabled: true
    endpoint: "http://opentelemetry-collector:4317"
    samplingRate: 1.0

  logging:
    format: json
    level: INFO
    includeTraceContext: true

serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s

Metrics

HTTP Server Metrics (RED)

  • mcp_http_requests_total - Total HTTP requests
  • mcp_http_request_duration_seconds - Request latency histogram
  • mcp_http_requests_in_progress - In-flight requests gauge

MCP Tool Metrics

  • mcp_tool_calls_total - Tool invocation count by status
  • mcp_tool_duration_seconds - Tool execution latency
  • mcp_tool_errors_total - Tool errors by type

Nextcloud API Metrics

  • mcp_nextcloud_api_requests_total - API calls by app and status
  • mcp_nextcloud_api_duration_seconds - API latency by app
  • mcp_nextcloud_api_retries_total - Retry count (429, timeout, etc.)

OAuth Flow Metrics

  • mcp_oauth_token_validations_total - Token validation count
  • mcp_oauth_token_exchange_total - Token exchange operations
  • mcp_oauth_token_cache_hits_total - Cache hit/miss rate
  • mcp_oauth_refresh_token_operations_total - Refresh token storage ops

Vector Sync Metrics (when enabled)

  • mcp_vector_sync_documents_scanned_total - Documents discovered
  • mcp_vector_sync_documents_processed_total - Processing results
  • mcp_vector_sync_processing_duration_seconds - Processing latency
  • mcp_vector_sync_queue_size - Current queue depth
  • mcp_qdrant_operations_total - Qdrant DB operations

Database Metrics

  • mcp_db_operations_total - DB operations (SQLite, Qdrant)
  • mcp_db_operation_duration_seconds - DB latency

Dependency Health

  • mcp_dependency_health - External dependency status (1=up, 0=down)
  • mcp_dependency_check_duration_seconds - Health check latency

Distributed Tracing

Span Hierarchy

HTTP POST /messages
├── mcp.tool.nc_notes_create_note
│   └── nextcloud.api.notes.POST
│       └── httpx request (auto-instrumented)
└── oauth.token.validate (if OAuth mode)
    └── httpx request to IdP

Span Attributes

  • MCP tools: mcp.tool.name, mcp.tool.args (sanitized)
  • Nextcloud API: nextcloud.app, http.method, http.status_code
  • OAuth: oauth.operation, oauth.method
  • Vector sync: vector_sync.operation, vector_sync.document_count

Trace Context in Logs

When tracing is enabled, all logs include trace_id and span_id:

{
  "timestamp": "2025-01-09T12:34:56.789Z",
  "level": "INFO",
  "logger": "nextcloud_mcp_server.server.notes",
  "message": "Note created successfully",
  "trace_id": "a1b2c3d4e5f6...",
  "span_id": "123456789abc...",
  "note_id": 42
}

Dashboards

Prometheus Queries

Request Rate (req/s):

sum(rate(mcp_http_requests_total[5m])) by (method, endpoint)

Error Rate (%):

sum(rate(mcp_http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(mcp_http_requests_total[5m])) * 100

P95 Latency:

histogram_quantile(0.95,
  sum(rate(mcp_http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)

Top Tools by Volume:

topk(10, sum(rate(mcp_tool_calls_total[5m])) by (tool_name))

Nextcloud API Health:

sum(rate(mcp_nextcloud_api_requests_total{status_code!~"2.."}[5m])) by (app)

Alerts

Critical:

  • Server down for >5min
  • Error rate >5% for >5min
  • P95 latency >1s for >5min
  • Dependency down for >2min

Warning:

  • Token validation errors >1% for >10min
  • Vector sync queue >100 for >15min
  • Qdrant slow (p95 >500ms) for >10min

See charts/nextcloud-mcp-server/templates/prometheusrule.yaml for complete definitions.

Troubleshooting

Metrics Not Appearing

  1. Check metrics are enabled: curl http://localhost:9090/metrics
  2. Verify ServiceMonitor labels match Prometheus selector
  3. Check Prometheus target status: http://prometheus:9090/targets

Traces Not Appearing

  1. Verify OTLP endpoint is reachable: curl http://otel-collector:4317
  2. Check collector logs for errors
  3. Verify sampling rate is not 0.0
  4. Check trace backend (Jaeger/Tempo) connectivity

High Cardinality Metrics

If you see cardinality warnings:

  • Middleware normalizes endpoints (e.g., /user/123/user/*)
  • OAuth tokens are never included in metric labels
  • User IDs are not tracked (use tracing for per-user debugging)

Performance Impact

  • Metrics: <1% overhead (counters/histograms are very fast)
  • Tracing: ~2-5% overhead at 100% sampling
  • JSON logging: <1% overhead vs text logging

Recommendation: Always enable metrics. Enable tracing in staging/production with 10-50% sampling.

Architecture

The observability stack integrates at multiple layers:

  1. HTTP Layer: ObservabilityMiddleware tracks all HTTP requests
  2. MCP Layer: Tools use @instrument_tool for automatic metrics and trace span creation
  3. Client Layer: BaseNextcloudClient tracks all API calls
  4. OAuth Layer: Token operations are traced and metered
  5. Background Tasks: Vector sync operations emit metrics/traces

All components use shared Prometheus Registry and OpenTelemetry TracerProvider.

References