nextcloud-mcp-server/docs/observability.md

# Observability and Monitoring

The Nextcloud MCP Server includes comprehensive observability features for production deployments:

- **Prometheus metrics** for monitoring performance and health
- **OpenTelemetry distributed tracing** for debugging request flows
- **Structured JSON logging** with trace correlation
- **Kubernetes integration** via ServiceMonitor and PrometheusRule

## Quick Start

### Local Development with Prometheus

```bash
# Enable metrics (enabled by default)
export METRICS_ENABLED=true
export METRICS_PORT=9090

# Enable tracing (optional - tracing is enabled when OTEL_EXPORTER_OTLP_ENDPOINT is set)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Start the server
docker-compose up -d mcp
```

Access metrics at: `http://localhost:9090/metrics`

### Kubernetes Deployment

Metrics are automatically scraped if you have Prometheus Operator installed:

```bash
helm install nextcloud-mcp charts/nextcloud-mcp-server \
  --set observability.metrics.enabled=true \
  --set observability.tracing.enabled=true \
  --set observability.tracing.endpoint=http://opentelemetry-collector:4317 \
  --set serviceMonitor.enabled=true
```

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `METRICS_ENABLED` | `true` | Enable Prometheus metrics |
| `METRICS_PORT` | `9090` | Port for metrics endpoint |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | - | OTLP gRPC endpoint (e.g., `http://otel-collector:4317`). Tracing is enabled when this is set. |
| `OTEL_SERVICE_NAME` | `nextcloud-mcp-server` | Service name in traces |
| `OTEL_TRACES_SAMPLER` | `always_on` | Trace sampling strategy |
| `OTEL_TRACES_SAMPLER_ARG` | `1.0` | Sampling rate (0.0-1.0) |
| `LOG_FORMAT` | `json` | Log format (`json` or `text`) |
| `LOG_LEVEL` | `INFO` | Minimum log level |
| `LOG_INCLUDE_TRACE_CONTEXT` | `true` | Include trace IDs in logs |

### Helm Chart Configuration

```yaml
observability:
  metrics:
    enabled: true
    port: 9090
    path: /metrics

  tracing:
    enabled: true
    endpoint: "http://opentelemetry-collector:4317"
    samplingRate: 1.0

  logging:
    format: json
    level: INFO
    includeTraceContext: true

serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s
```

## Metrics

### HTTP Server Metrics (RED)

- `mcp_http_requests_total` - Total HTTP requests
- `mcp_http_request_duration_seconds` - Request latency histogram
- `mcp_http_requests_in_progress` - In-flight requests gauge

### MCP Tool Metrics

- `mcp_tool_calls_total` - Tool invocation count by status
- `mcp_tool_duration_seconds` - Tool execution latency
- `mcp_tool_errors_total` - Tool errors by type

### Nextcloud API Metrics

- `mcp_nextcloud_api_requests_total` - API calls by app and status
- `mcp_nextcloud_api_duration_seconds` - API latency by app
- `mcp_nextcloud_api_retries_total` - Retry count (429, timeout, etc.)

### OAuth Flow Metrics

- `mcp_oauth_token_validations_total` - Token validation count
- `mcp_oauth_token_exchange_total` - Token exchange operations
- `mcp_oauth_token_cache_hits_total` - Cache hit/miss rate
- `mcp_oauth_refresh_token_operations_total` - Refresh token storage ops

### Vector Sync Metrics (when enabled)

- `mcp_vector_sync_documents_scanned_total` - Documents discovered
- `mcp_vector_sync_documents_processed_total` - Processing results
- `mcp_vector_sync_processing_duration_seconds` - Processing latency
- `mcp_vector_sync_queue_size` - Current queue depth
- `mcp_qdrant_operations_total` - Qdrant DB operations

### Database Metrics

- `mcp_db_operations_total` - DB operations (SQLite, Qdrant)
- `mcp_db_operation_duration_seconds` - DB latency

### Dependency Health

- `mcp_dependency_health` - External dependency status (1=up, 0=down)
- `mcp_dependency_check_duration_seconds` - Health check latency

## Distributed Tracing

### Span Hierarchy

```
HTTP POST /messages
├── mcp.tool.nc_notes_create_note
│   └── nextcloud.api.notes.POST
│       └── httpx request (auto-instrumented)
└── oauth.token.validate (if OAuth mode)
    └── httpx request to IdP
```

### Span Attributes

- **MCP tools**: `mcp.tool.name`, `mcp.tool.args` (sanitized)
- **Nextcloud API**: `nextcloud.app`, `http.method`, `http.status_code`
- **OAuth**: `oauth.operation`, `oauth.method`
- **Vector sync**: `vector_sync.operation`, `vector_sync.document_count`

### Trace Context in Logs

When tracing is enabled, all logs include `trace_id` and `span_id`:

```json
{
  "timestamp": "2025-01-09T12:34:56.789Z",
  "level": "INFO",
  "logger": "nextcloud_mcp_server.server.notes",
  "message": "Note created successfully",
  "trace_id": "a1b2c3d4e5f6...",
  "span_id": "123456789abc...",
  "note_id": 42
}
```

## Dashboards

### Prometheus Queries

**Request Rate (req/s)**:
```promql
sum(rate(mcp_http_requests_total[5m])) by (method, endpoint)
```

**Error Rate (%)**:
```promql
sum(rate(mcp_http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(mcp_http_requests_total[5m])) * 100
```

**P95 Latency**:
```promql
histogram_quantile(0.95,
  sum(rate(mcp_http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)
```

**Top Tools by Volume**:
```promql
topk(10, sum(rate(mcp_tool_calls_total[5m])) by (tool_name))
```

**Nextcloud API Health**:
```promql
sum(rate(mcp_nextcloud_api_requests_total{status_code!~"2.."}[5m])) by (app)
```

## Alerts

### Recommended Alert Rules

**Critical**:
- Server down for >5min
- Error rate >5% for >5min
- P95 latency >1s for >5min
- Dependency down for >2min

**Warning**:
- Token validation errors >1% for >10min
- Vector sync queue >100 for >15min
- Qdrant slow (p95 >500ms) for >10min

See `charts/nextcloud-mcp-server/templates/prometheusrule.yaml` for complete definitions.

## Troubleshooting

### Metrics Not Appearing

1. Check metrics are enabled: `curl http://localhost:9090/metrics`
2. Verify ServiceMonitor labels match Prometheus selector
3. Check Prometheus target status: `http://prometheus:9090/targets`

### Traces Not Appearing

1. Verify OTLP endpoint is reachable: `curl http://otel-collector:4317`
2. Check collector logs for errors
3. Verify sampling rate is not 0.0
4. Check trace backend (Jaeger/Tempo) connectivity

### High Cardinality Metrics

If you see cardinality warnings:
- Middleware normalizes endpoints (e.g., `/user/123` → `/user/*`)
- OAuth tokens are never included in metric labels
- User IDs are not tracked (use tracing for per-user debugging)

## Performance Impact

- **Metrics**: <1% overhead (counters/histograms are very fast)
- **Tracing**: ~2-5% overhead at 100% sampling
- **JSON logging**: <1% overhead vs text logging

**Recommendation**: Always enable metrics. Enable tracing in staging/production with 10-50% sampling.

## Architecture

The observability stack integrates at multiple layers:

1. **HTTP Layer**: `ObservabilityMiddleware` tracks all HTTP requests
2. **MCP Layer**: Tools use `@instrument_tool` for automatic metrics and trace span creation
3. **Client Layer**: `BaseNextcloudClient` tracks all API calls
4. **OAuth Layer**: Token operations are traced and metered
5. **Background Tasks**: Vector sync operations emit metrics/traces

All components use shared Prometheus `Registry` and OpenTelemetry `TracerProvider`.

## References

- [Prometheus Best Practices](https://prometheus.io/docs/practices/)
- [OpenTelemetry Python SDK](https://opentelemetry.io/docs/languages/python/)
- [Prometheus Operator](https://prometheus-operator.dev/)
- [Grafana Dashboards](https://grafana.com/docs/grafana/latest/dashboards/)