Chris Coutinho a6e5f3d8ff refactor: simplify OpenTelemetry tracing configuration

Simplifies the OpenTelemetry tracing setup by removing the redundant
OTEL_ENABLED flag and using the presence of OTEL_EXPORTER_OTLP_ENDPOINT
to determine if tracing should be enabled. This follows the standard
OpenTelemetry environment variable conventions more closely.

Changes:
- Remove OTEL_ENABLED/tracing_enabled flag in favor of checking if
  OTEL_EXPORTER_OTLP_ENDPOINT is set
- Add OTEL_EXPORTER_VERIFY_SSL configuration option for OTLP endpoints
  with self-signed certificates (defaults to false for development)
- Move HTTPXClientInstrumentor initialization to module level to ensure
  httpx calls are traced across all Nextcloud API requests
- Add tracing spans to vector sync operations (scan_user_documents)
- Fix authorization header logging to only warn about missing headers
  in OAuth mode (BasicAuth mode doesn't use Authorization headers)
- Update observability documentation to reflect simplified configuration
- Refactor Dockerfile to use --no-editable flag for uv sync

Breaking changes:
- OTEL_ENABLED environment variable is removed
- Tracing is now automatically enabled when OTEL_EXPORTER_OTLP_ENDPOINT
  is set

Migration guide:
- Remove OTEL_ENABLED=true from environment configuration
- Tracing will be enabled automatically if OTEL_EXPORTER_OTLP_ENDPOINT
  is configured

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-10 22:48:37 +01:00

7.4 KiB


Observability and Monitoring

The Nextcloud MCP Server includes comprehensive observability features for production deployments:

Prometheus metrics for monitoring performance and health
OpenTelemetry distributed tracing for debugging request flows
Structured JSON logging with trace correlation
Kubernetes integration via ServiceMonitor and PrometheusRule

Quick Start

Local Development with Prometheus

# Enable metrics (enabled by default)
export METRICS_ENABLED=true
export METRICS_PORT=9090

# Enable tracing (optional - tracing is enabled when OTEL_EXPORTER_OTLP_ENDPOINT is set)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Start the server
docker-compose up -d mcp

Access metrics at: http://localhost:9090/metrics

Kubernetes Deployment

Metrics are automatically scraped if you have Prometheus Operator installed:

helm install nextcloud-mcp charts/nextcloud-mcp-server \
  --set observability.metrics.enabled=true \
  --set observability.tracing.enabled=true \
  --set observability.tracing.endpoint=http://opentelemetry-collector:4317 \
  --set serviceMonitor.enabled=true

Configuration

Environment Variables

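| Variable | Default | Description |
|----------|---------|-------------|
| `METRICS_ENABLED` | `true` | Enable Prometheus metrics |
| `METRICS_PORT` | `9090` | Port for the metrics endpoint |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | - | OTLP gRPC endpoint (e.g., `http://otel-collector:4317`). Tracing is enabled when this is set. |
| `OTEL_EXPORTER_VERIFY_SSL` | `false` | Verify TLS certificates for the OTLP endpoint. The default of `false` suits development collectors with self-signed certificates. |
| `OTEL_SERVICE_NAME` | `nextcloud-mcp-server` | Service name in traces |
| `OTEL_TRACES_SAMPLER` | `always_on` | Trace sampling strategy |
| `OTEL_TRACES_SAMPLER_ARG` | `1.0` | Sampling rate (0.0-1.0) |
| `LOG_FORMAT` | `json` | Log format (`json` or `text`) |
| `LOG_LEVEL` | `INFO` | Minimum log level |
| `LOG_INCLUDE_TRACE_CONTEXT` | `true` | Include trace IDs in logs |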
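
As a rough illustration of how these variables might be read, the sketch below loads a subset of them with the documented defaults and derives the tracing state from the presence of `OTEL_EXPORTER_OTLP_ENDPOINT`. The `ObservabilityConfig` class and `_as_bool` helper are hypothetical names chosen for this example; they are not taken from the server's code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as false."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ObservabilityConfig:
    """Hypothetical settings holder; not the server's actual class."""

    metrics_enabled: bool
    metrics_port: int
    otlp_endpoint: Optional[str]
    service_name: str
    log_format: str
    log_level: str

    @property
    def tracing_enabled(self) -> bool:
        # There is no separate on/off flag: tracing follows the endpoint.
        return self.otlp_endpoint is not None

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "ObservabilityConfig":
        env = os.environ if env is None else env
        endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
        return cls(
            metrics_enabled=_as_bool(env.get("METRICS_ENABLED", "true")),
            metrics_port=int(env.get("METRICS_PORT", "9090")),
            otlp_endpoint=endpoint or None,  # empty string means "not set"
            service_name=env.get("OTEL_SERVICE_NAME", "nextcloud-mcp-server"),
            log_format=env.get("LOG_FORMAT", "json"),
            log_level=env.get("LOG_LEVEL", "INFO"),
        )
```

Calling `ObservabilityConfig.from_env({})` yields metrics on port 9090 with tracing off, while passing a mapping that contains `OTEL_EXPORTER_OTLP_ENDPOINT` turns tracing on without any additional flag.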

Helm Chart Configuration

observability:
  metrics:
    enabled: true
    port: 9090
    path: /metrics

  tracing:
    enabled: true
    endpoint: "http://opentelemetry-collector:4317"
    samplingRate: 1.0

  logging:
    format: json
    level: INFO
    includeTraceContext: true

serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s

Metrics

HTTP Server Metrics (RED)

mcp_http_requests_total - Total HTTP requests
mcp_http_request_duration_seconds - Request latency histogram
mcp_http_requests_in_progress - In-flight requests gauge

MCP Tool Metrics

mcp_tool_calls_total - Tool invocation count by status
mcp_tool_duration_seconds - Tool execution latency
mcp_tool_errors_total - Tool errors by type

Nextcloud API Metrics

mcp_nextcloud_api_requests_total - API calls by app and status
mcp_nextcloud_api_duration_seconds - API latency by app
mcp_nextcloud_api_retries_total - Retry count (429, timeout, etc.)

OAuth Flow Metrics

mcp_oauth_token_validations_total - Token validation count
mcp_oauth_token_exchange_total - Token exchange operations
mcp_oauth_token_cache_hits_total - Cache hit/miss rate
mcp_oauth_refresh_token_operations_total - Refresh token storage ops

Vector Sync Metrics (when enabled)

mcp_vector_sync_documents_scanned_total - Documents discovered
mcp_vector_sync_documents_processed_total - Processing results
mcp_vector_sync_processing_duration_seconds - Processing latency
mcp_vector_sync_queue_size - Current queue depth
mcp_qdrant_operations_total - Qdrant DB operations

Database Metrics

mcp_db_operations_total - DB operations (SQLite, Qdrant)
mcp_db_operation_duration_seconds - DB latency

Dependency Health

mcp_dependency_health - External dependency status (1=up, 0=down)
mcp_dependency_check_duration_seconds - Health check latency

Distributed Tracing

Span Hierarchy

HTTP POST /messages
├── mcp.tool.nc_notes_create_note
│   └── nextcloud.api.notes.POST
│       └── httpx request (auto-instrumented)
└── oauth.token.validate (if OAuth mode)
    └── httpx request to IdP

Span Attributes

MCP tools: mcp.tool.name, mcp.tool.args (sanitized)
Nextcloud API: nextcloud.app, http.method, http.status_code
OAuth: oauth.operation, oauth.method
Vector sync: vector_sync.operation, vector_sync.document_count

Trace Context in Logs

When tracing is enabled and LOG_INCLUDE_TRACE_CONTEXT is true, log entries include trace_id and span_id:

{
  "timestamp": "2025-01-09T12:34:56.789Z",
  "level": "INFO",
  "logger": "nextcloud_mcp_server.server.notes",
  "message": "Note created successfully",
  "trace_id": "a1b2c3d4e5f6...",
  "span_id": "123456789abc...",
  "note_id": 42
}
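
To make the correlation mechanism concrete, here is a minimal sketch of a JSON log formatter that attaches trace identifiers when a trace is active. It uses only the standard library; the `current_trace` context variable is a hypothetical stand-in for the active OpenTelemetry span context, and `JsonTraceFormatter` is an illustrative name rather than the server's own formatter.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Hypothetical holder for (trace_id, span_id) as integers. A real setup
# would read these from the OpenTelemetry API instead.
current_trace = contextvars.ContextVar("current_trace", default=None)


class JsonTraceFormatter(logging.Formatter):
    """Render log records as JSON, adding trace fields when available."""

    def format(self, record: logging.LogRecord) -> str:
        stamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": stamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        ctx = current_trace.get()
        if ctx is not None:
            trace_id, span_id = ctx
            # Trace IDs are 128-bit and span IDs 64-bit, printed as hex.
            entry["trace_id"] = format(trace_id, "032x")
            entry["span_id"] = format(span_id, "016x")
        return json.dumps(entry)
```

Attaching the formatter to a handler (`handler.setFormatter(JsonTraceFormatter())`) is enough for every record emitted while `current_trace` is set to carry both identifiers, which is what lets a log line be matched to its trace in the backend.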

Dashboards

Prometheus Queries

Request Rate (req/s):

sum(rate(mcp_http_requests_total[5m])) by (method, endpoint)

Error Rate (%):

sum(rate(mcp_http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(mcp_http_requests_total[5m])) * 100

P95 Latency:

histogram_quantile(0.95,
  sum(rate(mcp_http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)
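
When reading this panel it helps to know that `histogram_quantile` estimates the quantile by linear interpolation inside the bucket that contains the target rank, so the result is only as precise as the bucket boundaries. The sketch below reproduces that estimate in plain Python for a single series; the bucket values in the usage note are made up for illustration.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by upper
    bound, and the last upper bound must be +Inf.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for index, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the overflow bucket: report the highest
                # finite bound, since nothing above it is known.
                return buckets[index - 1][0] if index > 0 else math.nan
            if count == prev_count:
                return prev_bound
            # Linear interpolation within the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

For example, with buckets `[(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]` the 95th percentile lands in the 0.5-1.0 bucket and is estimated at roughly 0.8125 seconds, even if the true value sits elsewhere in that bucket.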

Top Tools by Volume:

topk(10, sum(rate(mcp_tool_calls_total[5m])) by (tool_name))

Nextcloud API Health:

sum(rate(mcp_nextcloud_api_requests_total{status_code!~"2.."}[5m])) by (app)

Alerts

Recommended Alert Rules

Critical:

Server down for >5min
Error rate >5% for >5min
P95 latency >1s for >5min
Dependency down for >2min

Warning:

Token validation errors >1% for >10min
Vector sync queue >100 for >15min
Qdrant slow (p95 >500ms) for >10min

See charts/nextcloud-mcp-server/templates/prometheusrule.yaml for complete definitions.

Troubleshooting

Metrics Not Appearing

Check that metrics are enabled: curl http://localhost:9090/metrics
Verify ServiceMonitor labels match Prometheus selector
Check Prometheus target status: http://prometheus:9090/targets

Traces Not Appearing

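Verify the OTLP endpoint is reachable: curl http://otel-collector:4317 (port 4317 speaks gRPC, so curl will not return a normal HTTP response; a refused or timed-out connection is what indicates a problem)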
Check collector logs for errors
Verify sampling rate is not 0.0
Check trace backend (Jaeger/Tempo) connectivity

High Cardinality Metrics

The server is designed to keep label cardinality low. If you still see cardinality warnings, note that:

Middleware normalizes endpoints (e.g., /user/123 → /user/*)
OAuth tokens are never included in metric labels
User IDs are not tracked (use tracing for per-user debugging)
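
The endpoint normalization mentioned above can be sketched as follows. This is a minimal illustration assuming that identifiers appear as purely numeric or UUID-shaped path segments; `normalize_endpoint` is a hypothetical name and the server's middleware may apply different rules.

```python
import re

# Segments treated as identifiers: all digits, or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})$"
)


def normalize_endpoint(path: str) -> str:
    """Replace identifier-like path segments with '*' for use as a label."""
    path_only = path.split("?", 1)[0]  # query strings never become labels
    segments = path_only.split("/")
    return "/".join("*" if _ID_SEGMENT.match(seg) else seg for seg in segments)
```

With this approach `/user/123` and `/user/456` both map to `/user/*`, so the `endpoint` label has one value per route rather than one per resource.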

Performance Impact

Metrics: <1% overhead (counters/histograms are very fast)
Tracing: ~2-5% overhead at 100% sampling
JSON logging: <1% overhead vs text logging

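Recommendation: Always enable metrics. Enable tracing in staging/production with 10-50% sampling. The documented defaults (always_on sampler, rate 1.0) sample every trace, so the sampling rate has to be lowered explicitly.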
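
To show what a fractional sampling rate means in practice, the following is a simplified sketch of ratio-based sampling keyed on the trace ID. It assumes the server honors the standard OpenTelemetry ratio samplers (for example `traceidratio`) together with `OTEL_TRACES_SAMPLER_ARG`; the `should_sample` function is illustrative and is not the SDK's implementation.

```python
_MASK_64 = (1 << 64) - 1


def should_sample(trace_id: int, ratio: float) -> bool:
    """Decide deterministically whether a trace is kept.

    The low 64 bits of the trace ID are compared against a threshold
    proportional to `ratio`, so every service that sees the same trace ID
    and ratio reaches the same decision.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    threshold = round(ratio * (1 << 64))
    return (trace_id & _MASK_64) < threshold
```

Because the decision depends only on the trace ID, a rate of 0.1 keeps roughly one trace in ten while still keeping each sampled trace complete across services.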

Architecture

The observability stack integrates at multiple layers:

HTTP Layer: ObservabilityMiddleware tracks all HTTP requests
MCP Layer: Tools use @trace_mcp_tool for span creation
Client Layer: BaseNextcloudClient tracks all API calls
OAuth Layer: Token operations are traced and metered
Background Tasks: Vector sync operations emit metrics/traces

All components share a single Prometheus registry and OpenTelemetry TracerProvider.
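
The MCP-layer pattern, where a decorator wraps each tool, might look roughly like the sketch below. It is a standard-library stand-in: `instrument_tool`, `TOOL_CALLS`, `TOOL_SECONDS`, and the sample `create_note` tool are hypothetical names, and in-memory counters take the place of the real Prometheus metrics and OpenTelemetry spans that `@trace_mcp_tool` would produce.

```python
import asyncio
import functools
import time
from collections import Counter

# In-memory stand-ins for mcp_tool_calls_total and the summed tool duration.
TOOL_CALLS: Counter = Counter()
TOOL_SECONDS: Counter = Counter()


def instrument_tool(func):
    """Record call count by status and total duration for an async tool."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "success"
        try:
            return await func(*args, **kwargs)
        except Exception:
            status = "error"
            raise  # instrumentation must not swallow tool errors
        finally:
            TOOL_CALLS[(func.__name__, status)] += 1
            TOOL_SECONDS[func.__name__] += time.perf_counter() - start

    return wrapper


@instrument_tool
async def create_note(title: str) -> dict:
    """Hypothetical tool used only to exercise the decorator."""
    if not title:
        raise ValueError("title must not be empty")
    return {"title": title}
```

Running `asyncio.run(create_note("hello"))` increments `TOOL_CALLS[("create_note", "success")]`, and a failing call is counted under `"error"` before the exception propagates. Keeping the bookkeeping in a `finally` block means both outcomes are recorded with a single code path.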
