Files

T

Chris Coutinho c4bf077050 feat: Add OpenTelemetry tracing to @instrument_tool decorator

Enhances the @instrument_tool decorator to create distributed traces
for all MCP tool executions, improving observability and debugging.

Changes:
- Modified @instrument_tool to wrap tool execution in trace_operation
- Added automatic span creation with mcp.tool.* span names
- Sanitized tool arguments before adding to span attributes
  (excludes password, token, secret, api_key, etag, ctx)
- Limited argument strings to 500 characters to prevent huge spans
- Maintained existing Prometheus metrics functionality
- Updated docs/observability.md to reflect correct decorator name
- Added comprehensive unit tests

All ~50+ MCP tools now emit traces automatically without code changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-16 11:16:05 +01:00

7.4 KiB

Raw Permalink Blame History

Observability and Monitoring

The Nextcloud MCP Server includes comprehensive observability features for production deployments:

Prometheus metrics for monitoring performance and health
OpenTelemetry distributed tracing for debugging request flows
Structured JSON logging with trace correlation
Kubernetes integration via ServiceMonitor and PrometheusRule

Quick Start

Local Development with Prometheus

# Enable metrics (enabled by default)
export METRICS_ENABLED=true
export METRICS_PORT=9090

# Enable tracing (optional - tracing is enabled when OTEL_EXPORTER_OTLP_ENDPOINT is set)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Start the server
docker-compose up -d mcp

Access metrics at: http://localhost:9090/metrics

Kubernetes Deployment

Metrics are automatically scraped if you have Prometheus Operator installed:

helm install nextcloud-mcp charts/nextcloud-mcp-server \
  --set observability.metrics.enabled=true \
  --set observability.tracing.enabled=true \
  --set observability.tracing.endpoint=http://opentelemetry-collector:4317 \
  --set serviceMonitor.enabled=true

Configuration

Environment Variables

Variable	Default	Description
`METRICS_ENABLED`	`true`	Enable Prometheus metrics
`METRICS_PORT`	`9090`	Port for metrics endpoint
`OTEL_EXPORTER_OTLP_ENDPOINT`	-	OTLP gRPC endpoint (e.g., `http://otel-collector:4317`). Tracing is enabled when this is set.
`OTEL_SERVICE_NAME`	`nextcloud-mcp-server`	Service name in traces
`OTEL_TRACES_SAMPLER`	`always_on`	Trace sampling strategy
`OTEL_TRACES_SAMPLER_ARG`	`1.0`	Sampling rate (0.0-1.0)
`LOG_FORMAT`	`json`	Log format (`json` or `text`)
`LOG_LEVEL`	`INFO`	Minimum log level
`LOG_INCLUDE_TRACE_CONTEXT`	`true`	Include trace IDs in logs

Helm Chart Configuration

observability:
  metrics:
    enabled: true
    port: 9090
    path: /metrics

  tracing:
    enabled: true
    endpoint: "http://opentelemetry-collector:4317"
    samplingRate: 1.0

  logging:
    format: json
    level: INFO
    includeTraceContext: true

serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s

Metrics

HTTP Server Metrics (RED)

mcp_http_requests_total - Total HTTP requests
mcp_http_request_duration_seconds - Request latency histogram
mcp_http_requests_in_progress - In-flight requests gauge

MCP Tool Metrics

mcp_tool_calls_total - Tool invocation count by status
mcp_tool_duration_seconds - Tool execution latency
mcp_tool_errors_total - Tool errors by type

Nextcloud API Metrics

mcp_nextcloud_api_requests_total - API calls by app and status
mcp_nextcloud_api_duration_seconds - API latency by app
mcp_nextcloud_api_retries_total - Retry count (429, timeout, etc.)

OAuth Flow Metrics

mcp_oauth_token_validations_total - Token validation count
mcp_oauth_token_exchange_total - Token exchange operations
mcp_oauth_token_cache_hits_total - Cache hit/miss rate
mcp_oauth_refresh_token_operations_total - Refresh token storage ops

Vector Sync Metrics (when enabled)

mcp_vector_sync_documents_scanned_total - Documents discovered
mcp_vector_sync_documents_processed_total - Processing results
mcp_vector_sync_processing_duration_seconds - Processing latency
mcp_vector_sync_queue_size - Current queue depth
mcp_qdrant_operations_total - Qdrant DB operations

Database Metrics

mcp_db_operations_total - DB operations (SQLite, Qdrant)
mcp_db_operation_duration_seconds - DB latency

Dependency Health

mcp_dependency_health - External dependency status (1=up, 0=down)
mcp_dependency_check_duration_seconds - Health check latency

Distributed Tracing

Span Hierarchy

HTTP POST /messages
├── mcp.tool.nc_notes_create_note
│   └── nextcloud.api.notes.POST
│       └── httpx request (auto-instrumented)
└── oauth.token.validate (if OAuth mode)
    └── httpx request to IdP

Span Attributes

MCP tools: mcp.tool.name, mcp.tool.args (sanitized)
Nextcloud API: nextcloud.app, http.method, http.status_code
OAuth: oauth.operation, oauth.method
Vector sync: vector_sync.operation, vector_sync.document_count

Trace Context in Logs

When tracing is enabled, all logs include trace_id and span_id:

{
  "timestamp": "2025-01-09T12:34:56.789Z",
  "level": "INFO",
  "logger": "nextcloud_mcp_server.server.notes",
  "message": "Note created successfully",
  "trace_id": "a1b2c3d4e5f6...",
  "span_id": "123456789abc...",
  "note_id": 42
}

Dashboards

Prometheus Queries

Request Rate (req/s):

sum(rate(mcp_http_requests_total[5m])) by (method, endpoint)

Error Rate (%):

sum(rate(mcp_http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(mcp_http_requests_total[5m])) * 100

P95 Latency:

histogram_quantile(0.95,
  sum(rate(mcp_http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)

Top Tools by Volume:

topk(10, sum(rate(mcp_tool_calls_total[5m])) by (tool_name))

Nextcloud API Health:

sum(rate(mcp_nextcloud_api_requests_total{status_code!~"2.."}[5m])) by (app)

Alerts

Recommended Alert Rules

Critical:

Server down for >5min
Error rate >5% for >5min
P95 latency >1s for >5min
Dependency down for >2min

Warning:

Token validation errors >1% for >10min
Vector sync queue >100 for >15min
Qdrant slow (p95 >500ms) for >10min

See charts/nextcloud-mcp-server/templates/prometheusrule.yaml for complete definitions.

Troubleshooting

Metrics Not Appearing

Check metrics are enabled: curl http://localhost:9090/metrics
Verify ServiceMonitor labels match Prometheus selector
Check Prometheus target status: http://prometheus:9090/targets

Traces Not Appearing

Verify OTLP endpoint is reachable: curl http://otel-collector:4317
Check collector logs for errors
Verify sampling rate is not 0.0
Check trace backend (Jaeger/Tempo) connectivity

High Cardinality Metrics

If you see cardinality warnings:

Middleware normalizes endpoints (e.g., /user/123 → /user/*)
OAuth tokens are never included in metric labels
User IDs are not tracked (use tracing for per-user debugging)

Performance Impact

Metrics: <1% overhead (counters/histograms are very fast)
Tracing: ~2-5% overhead at 100% sampling
JSON logging: <1% overhead vs text logging

Recommendation: Always enable metrics. Enable tracing in staging/production with 10-50% sampling.

Architecture

The observability stack integrates at multiple layers:

HTTP Layer: ObservabilityMiddleware tracks all HTTP requests
MCP Layer: Tools use @instrument_tool for automatic metrics and trace span creation
Client Layer: BaseNextcloudClient tracks all API calls
OAuth Layer: Token operations are traced and metered
Background Tasks: Vector sync operations emit metrics/traces

All components use shared Prometheus Registry and OpenTelemetry TracerProvider.

7.4 KiB Raw Permalink Blame History