fix: Update DCR token_type tests for OIDC app changes

The Nextcloud OIDC app has updated token_type parameter values: - Changed from "Bearer" → "opaque" for opaque tokens - Changed from "JWT" → "jwt" for JWT tokens Updated test_dcr_token_type.py to use lowercase token_type values: - token_type="jwt" for JWT-formatted tokens - token_type="opaque" for opaque/bearer tokens This fixes test failures where tests were using the old "Bearer" and "JWT" (uppercase) values which are no longer recognized by the OIDC app. Fixes test: test_dcr_respects_bearer_token_type 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
docs: Add Ollama embeddings capacity analysis and investigation
2025-10-31 22:30:58 +01:00 · 2025-10-31 03:07:44 +01:00
3 changed files with 1257 additions and 16 deletions
@@ -0,0 +1,441 @@
+# Ollama Capacity Analysis: ollama.internal.coutinho.io
+
+**Date**: 2025-10-30
+**Model**: nomic-embed-text:latest
+**Test Location**: From nextcloud-mcp-server host
+
+## Summary
+
+✅ **Ollama instance is operational and performing well**
+- Embedding generation working correctly
+- Reasonable latency for small-medium workloads
+- Good parallelism support
+- Suitable for development and small production deployments
+
+## Test Results
+
+### Model Configuration
+
+```json
+{
+  "model": "nomic-embed-text",
+  "dimensions": 768,
+  "status": "operational"
+}
+```
+
+### Performance Metrics
+
+#### 1. Single Embedding Latency
+
+**Result**: ~553ms per embedding
+- **Total time**: 0.553 seconds
+- **Includes**: Network + processing + model inference
+- **Quality**: Full 768-dimensional vector
+
+**Analysis**:
+- Higher than bare-metal benchmarks (~100ms) due to network latency
+- Acceptable for interactive search queries
+- Within expected range for remote Ollama instance
+
+#### 2. Batch Processing (5 items)
+
+**Result**: ~1.02 seconds for 5 embeddings
+- **Per-item average**: 204ms
+- **Throughput**: ~4.9 embeddings/sec
+- **Batch efficiency**: 2.7x faster than sequential
+
+**Analysis**:
+- Good batching efficiency (2.7x speedup vs 5x theoretical)
+- Optimal for background indexing
+- Network overhead amortized across batch
+
+#### 3. Batch Processing (20 items)
+
+**Result**: ~6.71 seconds for 20 embeddings
+- **Per-item average**: 336ms
+- **Throughput**: ~3.0 embeddings/sec
+- **Batch efficiency**: 1.65x faster than sequential
+
+**Analysis**:
+- Performance degrades slightly with larger batches
+- Still faster than sequential processing
+- Matches reported Ollama behavior (quality issues at batch >16)
+- **Recommendation**: Keep batch size ≤16 for best quality
+
+#### 4. Concurrent Requests (5 parallel)
+
+**Result**: ~1.27 seconds for 5 parallel requests
+- **Effective parallelism**: ~4x speedup (vs 2.77s sequential)
+- **Per-request average**: 254ms
+- **Throughput**: ~3.9 requests/sec
+
+**Analysis**:
+- Excellent parallelism support
+- Server handles concurrent requests efficiently
+- Network and compute overlap effectively
+- Good for multi-user scenarios
+
+## Capacity Planning
+
+### Current Performance Profile
+
+| Metric | Value | Rating |
+|--------|-------|--------|
+| Single embedding latency | 553ms | ⚠️ Moderate |
+| Batch (5) throughput | 4.9/sec | ✅ Good |
+| Batch (20) throughput | 3.0/sec | ⚠️ Moderate |
+| Concurrent throughput | 3.9/sec | ✅ Good |
+| Network latency | ~300-400ms | ⚠️ Significant |
+
+### Bottleneck Analysis
+
+**Primary Bottleneck**: Network latency (~300-400ms per request)
+- Model inference: ~100-200ms (estimated)
+- Network round-trip: ~300-400ms (measured overhead)
+- **Impact**: 60-70% of total latency is network
+
+**Secondary Bottleneck**: CPU/GPU capacity (unknown hardware)
+- Batch performance degrades at >16 items
+- Suggests resource constraints
+- Likely CPU-only (no GPU metrics available)
+
+### Recommended Usage Patterns
+
+#### ✅ **Excellent For:**
+
+**1. Background Indexing**
+- Use batch size of 10-15 items
+- Expected throughput: 3-5 embeddings/sec
+- **10,000 notes**: ~30-55 minutes to index
+- **1,000 notes**: ~3-5 minutes to index
+
+**2. Interactive Search**
+- Single query embedding: ~550ms
+- Acceptable for user-facing search
+- Add 100-200ms for vector search + verification
+- **Total search time**: ~650-750ms (reasonable UX)
+
+**3. Multi-User Development**
+- 5-10 concurrent users: Comfortable
+- Good parallelism support
+- Network latency dominates (shared)
+
+#### ⚠️ **Consider Alternatives For:**
+
+**1. Real-Time Applications**
+- Sub-100ms latency requirements
+- High-frequency queries (>10/sec sustained)
+- Consider: Local embeddings or Infinity
+
+**2. Large-Scale Batch Processing**
+- >100,000 documents to index
+- >10 embeddings/sec sustained
+- Consider: GPU-accelerated TEI
+
+**3. Production with >50 Users**
+- High concurrent load
+- Latency sensitivity
+- Consider: Dedicated embedding service
+
+### Deployment Scenarios
+
+#### Scenario 1: Development Environment
+
+**Profile**:
+- 1-3 developers
+- 1,000-5,000 notes total
+- Occasional searches/indexing
+
+**Verdict**: ✅ **Perfect fit**
+- Initial index: ~5-15 minutes (one-time)
+- Incremental updates: <1 minute
+- Search latency: Acceptable
+- No infrastructure changes needed
+
+**Configuration**:
+```bash
+OLLAMA_URL=https://ollama.internal.coutinho.io
+OLLAMA_MODEL=nomic-embed-text
+VECTOR_SYNC_INTERVAL=600  # 10 minutes
+VECTOR_SYNC_BATCH_SIZE=10
+```
+
+#### Scenario 2: Small Production (10-20 users)
+
+**Profile**:
+- 10-20 active users
+- 10,000-50,000 notes total
+- 50-200 searches/day
+- Nightly incremental indexing
+
+**Verdict**: ✅ **Suitable with optimizations**
+- Initial index: 1-3 hours (run overnight)
+- Incremental: 5-15 minutes/night
+- Search: Acceptable for most users
+- Monitor network latency
+
+**Configuration**:
+```bash
+OLLAMA_URL=https://ollama.internal.coutinho.io
+OLLAMA_MODEL=nomic-embed-text
+VECTOR_SYNC_INTERVAL=86400  # Daily at night
+VECTOR_SYNC_BATCH_SIZE=12  # Conservative for quality
+SEARCH_TIMEOUT_MS=1000  # Account for 550ms latency
+```
+
+**Optimizations**:
+- Run sync during off-hours
+- Cache query embeddings (common searches)
+- Use hybrid search (keyword + semantic)
+
+#### Scenario 3: Medium Production (50-100 users)
+
+**Profile**:
+- 50-100 active users
+- 100,000+ notes
+- 500-1000 searches/day
+- Real-time indexing desired
+
+**Verdict**: ⚠️ **Marginal - monitor closely**
+- Initial index: 5-10 hours
+- Search latency: May feel slow for some users
+- Concurrent load: Approaching limits
+- **Recommendation**: Plan migration to Infinity
+
+**Configuration**:
+```bash
+OLLAMA_URL=https://ollama.internal.coutinho.io
+OLLAMA_MODEL=nomic-embed-text
+VECTOR_SYNC_INTERVAL=3600  # Hourly
+VECTOR_SYNC_BATCH_SIZE=10
+SEMANTIC_WEIGHT=0.5  # Rely more on keyword search
+SEARCH_TIMEOUT_MS=2000  # Generous timeout
+```
+
+**Migration Path**:
+- Start with Ollama
+- Monitor latency metrics
+- When p95 latency >1s, migrate to Infinity
+- Keep Ollama as fallback
+
+#### Scenario 4: Large Production (>100 users)
+
+**Profile**:
+- >100 active users
+- >500,000 notes
+- >1000 searches/day
+- Real-time expectations
+
+**Verdict**: ❌ **Not recommended**
+- Latency too high for scale
+- Throughput insufficient
+- Network becomes bottleneck
+- **Recommendation**: Use Infinity or TEI from start
+
+## Network Latency Optimization
+
+### Current Overhead: ~300-400ms
+
+**If MCP server runs closer to Ollama**:
+```
+Same VPC/network: ~1-5ms (300-400ms savings!)
+Same host: <1ms (300-400ms savings!)
+```
+
+### Recommendation
+
+**Option A: Co-locate MCP server with Ollama**
+- Reduces latency from 550ms → 150-200ms
+- 2.5-3x improvement
+- Makes Ollama competitive with cloud APIs
+
+**Option B: Keep separate (current)**
+- Simpler deployment
+- Better security isolation
+- Accept 550ms latency
+
+**Option C: Add Infinity container to MCP server**
+- Best of both worlds
+- Use Infinity for speed (local)
+- Fallback to Ollama if needed
+
+## Capacity Estimates
+
+### Indexing Capacity
+
+**Sustained Throughput**: 3-4 embeddings/sec (conservative)
+
+| Document Count | Index Time | Notes |
+|----------------|------------|-------|
+| 1,000 | 4-5 min | Quick |
+| 5,000 | 20-25 min | Reasonable |
+| 10,000 | 40-50 min | Acceptable |
+| 50,000 | 3.5-4.5 hours | Overnight job |
+| 100,000 | 7-9 hours | Long batch |
+| 500,000 | 35-45 hours | Not recommended |
+
+**Incremental Updates** (10% change daily):
+- 1,000 docs: ~30 sec
+- 10,000 docs: ~5 min
+- 50,000 docs: ~25 min
+
+### Search Capacity
+
+**Query Latency Budget**:
+- Embedding: 550ms
+- Vector search: 50-100ms
+- Permission verification: 50-100ms
+- **Total**: 650-750ms
+
+**Concurrent Users** (assuming 1 search every 5 minutes):
+- 10 users: 2 queries/min → Comfortable
+- 50 users: 10 queries/min → Near limit
+- 100 users: 20 queries/min → Over capacity
+
+**Peak Load** (all users search at once):
+- Parallelism: ~4 concurrent
+- Queue time: Proportional to position
+- 10 simultaneous: ~1.5-2 sec for last user
+- 50 simultaneous: ~7-10 sec for last user
+
+## Recommendations
+
+### Immediate Actions (Development)
+
+1. **✅ Use Ollama as-is**
+   - Current setup is perfect for dev/testing
+   - No changes needed
+   - Start building semantic search
+
+2. **Configuration**:
+   ```bash
+   OLLAMA_URL=https://ollama.internal.coutinho.io
+   OLLAMA_MODEL=nomic-embed-text
+   VECTOR_SYNC_BATCH_SIZE=10
+   ```
+
+3. **Add Monitoring**:
+   ```python
+   # Track these metrics
+   - embedding_latency_seconds (histogram)
+   - embedding_batch_size (gauge)
+   - embedding_errors_total (counter)
+   ```
+
+### Short-Term (Small Production)
+
+1. **Optimize Batching**:
+   - Use batch size 10-12 (quality sweet spot)
+   - Process during off-hours
+   - Implement incremental sync
+
+2. **Add Caching**:
+   ```python
+   # Cache common query embeddings
+   @lru_cache(maxsize=1000)
+   async def embed_with_cache(query: str):
+       return await ollama.embed(query)
+   ```
+
+3. **Monitor Metrics**:
+   - P50, P95, P99 latency
+   - Throughput (embeddings/sec)
+   - Error rates
+
+### Medium-Term (If Scaling Up)
+
+1. **Add Infinity Container** (when >50 users or latency issues):
+   ```yaml
+   services:
+     infinity:
+       image: michaelf34/infinity:latest
+       # Local to MCP server - ~10-20ms latency
+   ```
+
+2. **Implement Tiered Fallback**:
+   ```
+   Infinity (local, fast) → Ollama (remote, slower) → Local model
+   ```
+
+3. **Load Testing**:
+   - Simulate 50-100 concurrent users
+   - Measure actual throughput limits
+   - Identify breaking points
+
+### Long-Term (Enterprise Scale)
+
+1. **Migrate to TEI Cluster** (when >100 users):
+   - GPU-accelerated
+   - Horizontal scaling
+   - <20ms latency
+
+2. **Consider Managed Services**:
+   - Pinecone, Qdrant Cloud
+   - Removes operational burden
+   - Better SLAs
+
+## Testing Recommendations
+
+### Load Testing Script
+
+```bash
+# Test sustained load
+for i in {1..100}; do
+  curl -s https://ollama.internal.coutinho.io/api/embed \
+    -d "{\"model\": \"nomic-embed-text\", \"input\": \"Test $i\"}" &
+
+  # Rate limit: 5 concurrent
+  if [ $(($i % 5)) -eq 0 ]; then
+    wait
+    sleep 1
+  fi
+done
+```
+
+### Metrics to Collect
+
+1. **Latency Distribution**:
+   - P50 (median)
+   - P95 (acceptable)
+   - P99 (outliers)
+
+2. **Throughput**:
+   - Embeddings/second
+   - Peak vs sustained
+
+3. **Error Rates**:
+   - Timeouts
+   - Server errors
+   - Quality issues
+
+## Conclusion
+
+**Your Ollama instance is ready for development and small production use!**
+
+**Current Capacity**:
+- ✅ Development: Unlimited
+- ✅ Small prod (10-20 users, 10k docs): Comfortable
+- ⚠️ Medium prod (50 users, 50k docs): Monitoring needed
+- ❌ Large prod (>100 users): Migrate to Infinity/TEI
+
+**Key Strengths**:
+- Fully operational
+- Good parallelism
+- Acceptable latency for most use cases
+- Easy to integrate
+
+**Key Limitations**:
+- Network latency adds 300-400ms overhead
+- Batch quality issues at >16 items
+- Limited scalability beyond 50 users
+
+**Recommendation**:
+Start using Ollama immediately for development. Add monitoring and plan for Infinity when you approach 50 users or experience latency issues. The abstraction layer in ADR-003 makes migration seamless.
+
+**Next Steps**:
+1. Configure MCP server with Ollama URL
+2. Implement semantic search tools
+3. Add basic monitoring
+4. Test with real workload
+5. Scale up as needed
@@ -0,0 +1,796 @@
+# Ollama Embeddings Investigation
+
+**Date**: 2025-10-30
+**Status**: Recommendation for Integration
+
+## Executive Summary
+
+Ollama provides a **local, self-hosted embedding solution** that is excellent for **development and small-scale deployments** but has **performance limitations** compared to specialized embedding inference engines (TEI, Infinity).
+
+**Recommendation**: Include Ollama as **Tier 2 fallback** in our embedding strategy (after cloud APIs, before local sentence-transformers), prioritizing ease of setup over maximum performance.
+
+## Overview
+
+Ollama is primarily known as a local LLM runner but added embedding model support in version 0.1.26, making it a convenient option for generating vector embeddings without external API dependencies.
+
+### Key Characteristics
+
+- **Local & Self-Hosted**: No external API calls, full privacy
+- **Easy Setup**: Single binary, simple model downloads (`ollama pull nomic-embed-text`)
+- **Unified Platform**: Same tool for both LLMs and embeddings
+- **OpenAI Compatible**: `/v1/embeddings` endpoint for drop-in replacement
+- **Multi-Platform**: Linux, macOS, Windows support
+- **GPU Support**: CUDA, ROCm, Metal acceleration
+
+## API Details
+
+### Endpoint Structure
+
+**New API** (recommended):
+```bash
+POST http://localhost:11434/api/embed
+```
+
+**OpenAI Compatible**:
+```bash
+POST http://localhost:11434/v1/embeddings
+```
+
+**Legacy API** (deprecated):
+```bash
+POST http://localhost:11434/api/embeddings
+```
+
+### Request Format
+
+**Single Text Embedding**:
+```json
+{
+  "model": "nomic-embed-text",
+  "input": "Text to embed"
+}
+```
+
+**Batch Embedding** (since v0.2.0):
+```json
+{
+  "model": "nomic-embed-text",
+  "input": [
+    "First text to embed",
+    "Second text to embed",
+    "Third text to embed"
+  ]
+}
+```
+
+### Response Format
+
+```json
+{
+  "model": "nomic-embed-text",
+  "embeddings": [
+    [0.123, -0.456, 0.789, ...],  // 768 dimensions for nomic-embed-text
+    [0.234, -0.567, 0.890, ...]
+  ]
+}
+```
+
+### Python Integration
+
+```python
+import ollama
+
+# Single embedding
+response = ollama.embed(
+    model='nomic-embed-text',
+    input='Text to embed'
+)
+embedding = response['embeddings'][0]
+
+# Batch embeddings (more efficient)
+response = ollama.embed(
+    model='nomic-embed-text',
+    input=[
+        'First text',
+        'Second text',
+        'Third text'
+    ]
+)
+embeddings = response['embeddings']
+```
+
+## Available Models
+
+### 1. nomic-embed-text (Recommended)
+
+**Specifications**:
+- **Parameters**: 137M
+- **Dimensions**: 768
+- **Context Length**: 8,192 tokens (2K effective)
+- **Size**: 274MB
+- **Architecture**: BERT-based
+
+**Performance**:
+- Outperforms OpenAI `text-embedding-ada-002` and `text-embedding-3-small`
+- Excellent for long-context tasks
+- Strong general-purpose performance
+
+**Use Cases**:
+- General RAG applications
+- Long document processing
+- Semantic search
+- Document clustering
+
+**Pull Command**:
+```bash
+ollama pull nomic-embed-text
+```
+
+### 2. mxbai-embed-large
+
+**Specifications**:
+- **Parameters**: 334M
+- **Dimensions**: 1,024
+- **Context Length**: 512 tokens
+- **Architecture**: BERT-large optimized
+
+**Performance**:
+- Claims to outperform commercial models
+- Higher precision for complex queries
+- Best quality but slower
+
+**Use Cases**:
+- High-precision semantic search
+- Enterprise knowledge bases
+- Multilingual content
+
+**Pull Command**:
+```bash
+ollama pull mxbai-embed-large
+```
+
+### 3. all-minilm
+
+**Specifications**:
+- **Parameters**: 23M
+- **Dimensions**: 384
+- **Context Length**: 256 tokens
+- **Size**: Smallest footprint
+
+**Performance**:
+- Fastest processing speed
+- Good for sentence-level tasks
+- Limited context window
+
+**Use Cases**:
+- Real-time applications
+- Resource-constrained environments
+- High-throughput scenarios
+- Development/testing
+
+**Pull Command**:
+```bash
+ollama pull all-minilm
+```
+
+## Performance Benchmarks
+
+### Throughput Comparison
+
+| Hardware | Model | Batch Size | Throughput | Notes |
+|----------|-------|------------|------------|-------|
+| RTX 4090 (24GB) | nomic-embed-text | 256 | 12,450 tok/sec | GPU-accelerated |
+| RTX 4090 (24GB) | mxbai-embed-large | 128 | 8,920 tok/sec | GPU-accelerated |
+| Intel i9-13900K (CPU) | nomic-embed-text | 32 | 3,250 tok/sec | CPU-only |
+| Intel i9-13900K (CPU) | mxbai-embed-large | 16 | 2,180 tok/sec | CPU-only |
+
+### Latency Comparison
+
+**Single Request Latency** (RTX 4060):
+- Ollama: ~99ms
+- TEI: ~20ms (5x faster)
+- Infinity: ~30-40ms (2.5-3x faster)
+
+**Batch Processing**:
+- Optimal batch size: 32-64 (model dependent)
+- Performance degrades with batches >16 (quality issues reported)
+- 2x slower than direct sentence-transformers usage
+
+### Engine Comparison
+
+Based on benchmarks from Baseten (2024):
+
+| Engine | Relative Throughput | Notes |
+|--------|---------------------|-------|
+| BEI | 9.0x (baseline) | Fastest (proprietary) |
+| TEI | 4.5x | Open source, Rust-based |
+| Infinity | 3.5x | PyTorch/ONNX optimized |
+| vLLM | 3.0x | General LLM inference |
+| **Ollama** | **1.0x** | Slowest for embeddings |
+
+**Key Insight**: Ollama is **5-9x slower** than specialized embedding engines but trades performance for ease of use and unified platform.
+
+## Integration Implementation
+
+### Python Client Wrapper
+
+```python
+# nextcloud_mcp_server/embeddings/ollama.py
+import httpx
+from typing import List
+
+
+class OllamaEmbedding:
+    """Ollama embedding provider"""
+
+    def __init__(
+        self,
+        base_url: str = "http://localhost:11434",
+        model: str = "nomic-embed-text"
+    ):
+        self.base_url = base_url.rstrip("/")
+        self.model = model
+        self.client = httpx.AsyncClient(timeout=60.0)
+
+        # Model dimension mapping
+        self.dimensions = {
+            "nomic-embed-text": 768,
+            "mxbai-embed-large": 1024,
+            "all-minilm": 384
+        }
+        self.dimension = self.dimensions.get(model, 768)
+
+    async def embed(self, text: str) -> List[float]:
+        """Generate embedding for single text"""
+        response = await self.client.post(
+            f"{self.base_url}/api/embed",
+            json={
+                "model": self.model,
+                "input": text
+            }
+        )
+        response.raise_for_status()
+        data = response.json()
+        return data["embeddings"][0]
+
+    async def embed_batch(
+        self,
+        texts: List[str],
+        batch_size: int = 32
+    ) -> List[List[float]]:
+        """
+        Generate embeddings for multiple texts in batches.
+
+        Note: Ollama has reported quality issues with batch sizes >16.
+        We use batch_size=32 as default but allow configuration.
+        """
+        all_embeddings = []
+
+        # Process in chunks to avoid batch size issues
+        for i in range(0, len(texts), batch_size):
+            batch = texts[i:i + batch_size]
+
+            response = await self.client.post(
+                f"{self.base_url}/api/embed",
+                json={
+                    "model": self.model,
+                    "input": batch
+                }
+            )
+            response.raise_for_status()
+            data = response.json()
+            all_embeddings.extend(data["embeddings"])
+
+        return all_embeddings
+
+    async def check_health(self) -> bool:
+        """Check if Ollama server is running and model is available"""
+        try:
+            # Check if server is up
+            response = await self.client.get(f"{self.base_url}/api/tags")
+            response.raise_for_status()
+
+            # Check if model is pulled
+            models = response.json().get("models", [])
+            model_names = [m["name"] for m in models]
+
+            if self.model not in model_names:
+                raise ValueError(
+                    f"Model '{self.model}' not found. "
+                    f"Run: ollama pull {self.model}"
+                )
+
+            return True
+
+        except Exception as e:
+            raise ConnectionError(f"Ollama health check failed: {e}")
+
+    async def close(self):
+        """Close HTTP client"""
+        await self.client.aclose()
+```
+
+### Auto-Detection in Embedding Service
+
+```python
+# nextcloud_mcp_server/embeddings/service.py
+from typing import Optional
+import os
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+class EmbeddingService:
+    """Unified embedding service with automatic provider detection"""
+
+    def __init__(self):
+        self.provider = None
+        self._detect_provider()
+
+    def _detect_provider(self):
+        """Auto-detect available embedding provider"""
+
+        # Tier 1: OpenAI API (best quality)
+        if os.getenv("OPENAI_API_KEY"):
+            from .openai import OpenAIEmbedding
+            self.provider = OpenAIEmbedding(
+                model=os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small"),
+                api_key=os.getenv("OPENAI_API_KEY")
+            )
+            logger.info("✓ Using OpenAI embeddings")
+            return
+
+        # Tier 2a: Infinity (optimized self-hosted)
+        if os.getenv("INFINITY_URL"):
+            from .infinity import InfinityEmbedding
+            try:
+                self.provider = InfinityEmbedding(
+                    url=os.getenv("INFINITY_URL"),
+                    model=os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
+                )
+                logger.info("✓ Using Infinity embeddings (optimized)")
+                return
+            except Exception as e:
+                logger.warning(f"Infinity unavailable: {e}")
+
+        # Tier 2b: Ollama (easy self-hosted)
+        if os.getenv("OLLAMA_URL"):
+            from .ollama import OllamaEmbedding
+            try:
+                self.provider = OllamaEmbedding(
+                    base_url=os.getenv("OLLAMA_URL", "http://localhost:11434"),
+                    model=os.getenv("OLLAMA_MODEL", "nomic-embed-text")
+                )
+                # Verify Ollama is running and model is available
+                import asyncio
+                asyncio.run(self.provider.check_health())
+                logger.info("✓ Using Ollama embeddings (easy setup)")
+                return
+            except Exception as e:
+                logger.warning(f"Ollama unavailable: {e}")
+
+        # Tier 3: Local model (fallback)
+        logger.warning("No cloud/hosted embeddings available, using local model")
+        from .local import LocalEmbedding
+        self.provider = LocalEmbedding(
+            model=os.getenv("LOCAL_EMBEDDING_MODEL", "all-MiniLM-L6-v2")
+        )
+        logger.info("✓ Using local embeddings (CPU fallback)")
+
+    async def embed(self, text: str):
+        """Generate embedding for text"""
+        return await self.provider.embed(text)
+
+    async def embed_batch(self, texts: list[str]):
+        """Generate embeddings for multiple texts"""
+        return await self.provider.embed_batch(texts)
+
+    @property
+    def dimension(self) -> int:
+        """Get embedding dimension"""
+        return self.provider.dimension
+```
+
+### Docker Compose Configuration
+
+```yaml
+services:
+  # Ollama embedding service
+  ollama:
+    image: ollama/ollama:latest
+    restart: always
+    ports:
+      - 127.0.0.1:11434:11434
+    volumes:
+      - ollama_models:/root/.ollama
+    # Optional: GPU support
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]
+    # Pull models on startup
+    entrypoint: ["/bin/sh", "-c"]
+    command:
+      - |
+        ollama serve &
+        sleep 5
+        ollama pull nomic-embed-text
+        wait
+
+  # MCP Server with Ollama embeddings
+  mcp:
+    build: .
+    depends_on:
+      - ollama
+    environment:
+      # ... other vars ...
+      - OLLAMA_URL=http://ollama:11434
+      - OLLAMA_MODEL=nomic-embed-text
+
+  # Vector sync worker
+  mcp-vector-sync:
+    build: .
+    command: ["python", "-m", "nextcloud_mcp_server.sync.vector_indexer"]
+    depends_on:
+      - ollama
+      - qdrant
+    environment:
+      # ... other vars ...
+      - OLLAMA_URL=http://ollama:11434
+      - OLLAMA_MODEL=nomic-embed-text
+
+volumes:
+  ollama_models:
+```
+
+## Advantages of Ollama
+
+### 1. **Ease of Setup**
+
+```bash
+# Install Ollama
+curl -fsSL https://ollama.com/install.sh | sh
+
+# Pull embedding model
+ollama pull nomic-embed-text
+
+# Done! API available at localhost:11434
+```
+
+No complex configuration, no Docker registries, no model conversion.
+
+### 2. **Privacy & Data Sovereignty**
+
+- All processing happens locally
+- No data leaves your infrastructure
+- No API keys or external dependencies
+- Ideal for sensitive content (medical, legal, financial)
+
+### 3. **Unified Platform**
+
+- Same tool for LLMs and embeddings
+- Consistent API across model types
+- Single point of management
+- Simplified operations
+
+### 4. **Developer Experience**
+
+- Simple API (similar to OpenAI)
+- Good documentation
+- Active community
+- Framework integrations (LangChain, LlamaIndex)
+
+### 5. **Cost**
+
+- Free and open source
+- No per-token API costs
+- Only infrastructure costs (compute)
+
+### 6. **Model Variety**
+
+Growing library of embedding models:
+- nomic-embed-text (general purpose)
+- mxbai-embed-large (high quality)
+- all-minilm (fast)
+- More models added regularly
+
+## Limitations of Ollama
+
+### 1. **Performance**
+
+- **5-9x slower** than specialized engines (TEI, Infinity)
+- Not optimized specifically for embedding inference
+- Batch processing issues at larger batch sizes (>16)
+- Higher latency compared to alternatives
+
+### 2. **Scalability**
+
+- Single-instance deployment (no native clustering)
+- Limited concurrent request handling
+- Not designed for high-throughput production
+- Resource usage per request is higher
+
+### 3. **Batch Processing Issues**
+
+- Quality degradation reported with large batches
+- Optimal batch size: 32-64 (conservative)
+- Less efficient than specialized engines
+- GitHub issues tracking batch problems (#6262)
+
+### 4. **Resource Usage**
+
+- Models stay loaded in memory (VRAM/RAM)
+- Higher memory footprint per model
+- GPU context switching overhead
+- Not as memory-efficient as specialized engines
+
+### 5. **Production Features**
+
+- No built-in load balancing
+- Limited monitoring/metrics
+- No automatic scaling
+- Basic error handling
+
+## Use Case Recommendations
+
+### ✅ **Excellent For:**
+
+1. **Development & Testing**
+   - Quick setup for prototyping
+   - Local development environments
+   - Testing embedding pipelines
+
+2. **Small Deployments**
+   - <10 users
+   - <10,000 documents
+   - Infrequent searches (<100/day)
+   - Hobbyist/personal projects
+
+3. **Privacy-Critical Applications**
+   - Medical/healthcare records
+   - Legal documents
+   - Financial data
+   - Air-gapped environments
+
+4. **Unified LLM Stack**
+   - Projects already using Ollama for LLMs
+   - Simplified operations
+   - Consistent tooling
+
+5. **Educational/Learning**
+   - Teaching RAG concepts
+   - Learning embeddings
+   - Hackathons/workshops
+
+### ⚠️ **Consider Alternatives For:**
+
+1. **Production at Scale**
+   - >100 users
+   - >100,000 documents
+   - High query volume (>1000/day)
+   - Use: TEI or Infinity
+
+2. **Performance-Critical**
+   - Real-time search (<50ms latency)
+   - High-throughput batch processing
+   - Use: TEI with GPU
+
+3. **Enterprise Deployments**
+   - Need for high availability
+   - Load balancing requirements
+   - Advanced monitoring
+   - Use: Managed services or TEI cluster
+
+4. **Large-Scale Indexing**
+   - Millions of documents
+   - Continuous high-volume ingestion
+   - Use: Infinity or commercial solutions
+
+## Integration Strategy
+
+### Recommended Tier Placement
+
+**Update ADR-003 embedding strategy:**
+
+```
+Tier 1: OpenAI API (best quality, requires API key)
+  ↓ fallback
+Tier 2a: Infinity (optimized self-hosted, complex setup)
+  ↓ fallback
+Tier 2b: Ollama (easy self-hosted, moderate performance) ← NEW
+  ↓ fallback
+Tier 3: Local sentence-transformers (CPU fallback, simplest)
+```
+
+### Configuration
+
+```bash
+# Option 1: Use Infinity (if available)
+INFINITY_URL=http://infinity:7997
+EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
+
+# Option 2: Use Ollama (if Infinity unavailable)
+OLLAMA_URL=http://ollama:11434
+OLLAMA_MODEL=nomic-embed-text
+
+# Option 3: Use local model (automatic fallback)
+# No configuration needed
+```
+
+### When to Choose Ollama
+
+**Choose Ollama if**:
+- You're already using Ollama for LLMs
+- You need privacy/data sovereignty
+- You have <10k documents and <100 users
+- Ease of setup is more important than max performance
+- You're in development/testing phase
+
+**Choose Infinity/TEI if**:
+- You need maximum throughput (>1000 embeddings/sec)
+- You have >100k documents
+- Latency is critical (<50ms)
+- You're in production with >100 users
+
+**Choose OpenAI API if**:
+- You're okay with cloud dependencies
+- You need best-in-class quality
+- Cost is not a concern (~$0.02 per 1M tokens)
+
+## Production Deployment Guidance
+
+### Small Production (Ollama Acceptable)
+
+**Profile**:
+- 5-20 users
+- 1,000-10,000 documents
+- 50-200 searches/day
+- <2 sec acceptable latency
+
+**Configuration**:
+```yaml
+ollama:
+  image: ollama/ollama:latest
+  deploy:
+    resources:
+      limits:
+        memory: 4GB
+        cpus: "2.0"
+      reservations:
+        devices:
+          - driver: nvidia  # GPU if available
+            count: 1
+            capabilities: [gpu]
+  environment:
+    - OLLAMA_NUM_PARALLEL=2  # Concurrent requests
+```
+
+**Expected Performance**:
+- Embedding latency: 100-200ms
+- Throughput: 5-10 embeddings/sec
+- Memory: 2-3GB (model loaded)
+
+### Medium Production (Use Infinity/TEI)
+
+**Profile**:
+- 20-200 users
+- 10,000-1M documents
+- 500-5,000 searches/day
+- <500ms acceptable latency
+
+**Recommendation**: Migrate to Infinity or TEI
+```yaml
+infinity:
+  image: michaelf34/infinity:latest
+  # Better throughput and latency
+```
+
+### Large Production (Use Specialized Solution)
+
+**Profile**:
+- >200 users
+- >1M documents
+- >5,000 searches/day
+- <100ms required latency
+
+**Recommendation**: Use TEI cluster or commercial service
+
+## Monitoring Considerations
+
+### Key Metrics to Track
+
+```python
+# Add Ollama-specific metrics
+from prometheus_client import Histogram, Counter, Gauge
+
+ollama_embedding_latency = Histogram(
+    'ollama_embedding_duration_seconds',
+    'Ollama embedding generation time',
+    ['model', 'batch_size']
+)
+
+ollama_batch_size = Gauge(
+    'ollama_batch_size',
+    'Current batch size being processed'
+)
+
+ollama_errors = Counter(
+    'ollama_errors_total',
+    'Ollama embedding errors',
+    ['error_type']
+)
+```
+
+### Health Checks
+
+```python
+async def ollama_health_check():
+    """Check Ollama availability"""
+    try:
+        async with httpx.AsyncClient() as client:
+            # Check server
+            response = await client.get("http://ollama:11434/api/tags")
+            response.raise_for_status()
+
+            # Verify model loaded
+            models = response.json().get("models", [])
+            if "nomic-embed-text" not in [m["name"] for m in models]:
+                return False, "Model not pulled"
+
+            return True, "OK"
+    except Exception as e:
+        return False, str(e)
+```
+
+## Migration Path
+
+### Starting with Ollama
+
+**Phase 1: Development** (Ollama)
+- Use Ollama for initial development
+- Validate embedding pipeline
+- Test search quality
+
+**Phase 2: Growth** (Ollama → Infinity)
+- Monitor performance metrics
+- When >50 users or >10k docs, migrate to Infinity
+- Simple config change, no code changes
+
+**Phase 3: Scale** (Infinity → TEI/Commercial)
+- When >200 users or performance issues
+- Consider TEI cluster or managed services
+
+### Code Compatibility
+
+All embedding providers use the same interface:
+```python
+# Works with Ollama, Infinity, OpenAI, Local
+embedding = await embedding_service.embed(text)
+embeddings = await embedding_service.embed_batch(texts)
+```
+
+**Migration is a configuration change only** - no code rewrite needed.
+
+## Conclusion
+
+**Ollama is a solid choice for:**
+- Early-stage projects
+- Development/testing
+- Privacy-critical applications
+- Small deployments (<10 users, <10k docs)
+- Unified LLM + embedding stack
+
+**But recognize its limitations:**
+- 5-9x slower than specialized engines
+- Not designed for high-throughput production
+- Batch processing can be problematic
+- Limited scalability
+
+**Recommendation**:
+✅ **Include Ollama as Tier 2b** (after Infinity, before local models) in the embedding strategy. It provides a good balance of ease-of-use and privacy for small-to-medium deployments while allowing seamless migration to more performant engines as needs grow.
+
+The key is designing the abstraction layer (as done in ADR-003) so migration between engines requires only configuration changes, not code rewrites.
@@ -3,8 +3,8 @@ Tests for Dynamic Client Registration (DCR) token_type parameter.

 These tests verify that the Nextcloud OIDC server properly honors the token_type
 parameter during client registration, issuing the correct type of access tokens:
- token_type="JWT" → JWT-formatted tokens (RFC 9068)
- token_type="Bearer" → Opaque tokens (standard OAuth2)
+- token_type="jwt" → JWT-formatted tokens (RFC 9068)
+- token_type="opaque" → Opaque tokens (standard OAuth2)

 This is critical for ensuring:
 1. Client choice is respected by the OIDC server
@@ -208,12 +208,14 @@ async def test_dcr_respects_jwt_token_type(
    oauth_callback_server,
 ):
    """
-    Test that DCR honors token_type=JWT and issues JWT-formatted tokens.
+    Test that DCR honors token_type=jwt and issues JWT-formatted tokens.

    This verifies:
-    1. Client registration with token_type="JWT" succeeds
+    1. Client registration with token_type="jwt" succeeds
    2. Tokens obtained via this client are JWT format (base64.base64.signature)
    3. JWT payload contains expected claims (sub, iss, scope, etc.)
+
+    Note: The OIDC app uses lowercase 'jwt' (not 'JWT').
    """
    nextcloud_host = os.getenv("NEXTCLOUD_HOST")
    if not nextcloud_host:
@@ -232,15 +234,15 @@ async def test_dcr_respects_jwt_token_type(
        token_endpoint = oidc_config.get("token_endpoint")
        authorization_endpoint = oidc_config.get("authorization_endpoint")

-    # Register client with token_type="JWT"
-    logger.info("Registering OAuth client with token_type=JWT...")
+    # Register client with token_type="jwt"
+    logger.info("Registering OAuth client with token_type=jwt...")
    client_info = await register_client(
        nextcloud_url=nextcloud_host,
        registration_endpoint=registration_endpoint,
        client_name="DCR Test - JWT Token Type",
        redirect_uris=[callback_url],
        scopes="openid profile email notes:read notes:write",
-        token_type="JWT",
+        token_type="jwt",
    )

    logger.info(f"Registered JWT client: {client_info.client_id[:16]}...")
@@ -278,7 +280,7 @@ async def test_dcr_respects_jwt_token_type(
    assert "notes:write" in scopes, "JWT scope claim missing notes:write"

    logger.info(
-        f"✅ DCR with token_type=JWT works correctly! "
+        f"✅ DCR with token_type=jwt works correctly! "
        f"Token is JWT format with scope claim: {payload['scope']}"
    )

@@ -290,12 +292,14 @@ async def test_dcr_respects_bearer_token_type(
    oauth_callback_server,
 ):
    """
-    Test that DCR honors token_type=Bearer and issues opaque tokens.
+    Test that DCR honors token_type=opaque and issues opaque tokens.

    This verifies:
-    1. Client registration with token_type="Bearer" succeeds
+    1. Client registration with token_type="opaque" succeeds
    2. Tokens obtained via this client are opaque (NOT JWT format)
    3. Opaque tokens are simple strings, not base64-encoded structures
+
+    Note: The OIDC app uses 'opaque' or 'jwt' as token_type values (not 'Bearer').
    """
    nextcloud_host = os.getenv("NEXTCLOUD_HOST")
    if not nextcloud_host:
@@ -314,18 +318,18 @@ async def test_dcr_respects_bearer_token_type(
        token_endpoint = oidc_config.get("token_endpoint")
        authorization_endpoint = oidc_config.get("authorization_endpoint")

-    # Register client with token_type="Bearer" (opaque tokens)
-    logger.info("Registering OAuth client with token_type=Bearer...")
+    # Register client with token_type="opaque" (opaque tokens)
+    logger.info("Registering OAuth client with token_type=opaque...")
    client_info = await register_client(
        nextcloud_url=nextcloud_host,
        registration_endpoint=registration_endpoint,
-        client_name="DCR Test - Bearer Token Type",
+        client_name="DCR Test - Opaque Token Type",
        redirect_uris=[callback_url],
        scopes="openid profile email notes:read notes:write",
-        token_type="Bearer",
+        token_type="opaque",
    )

-    logger.info(f"Registered Bearer client: {client_info.client_id[:16]}...")
+    logger.info(f"Registered Opaque token client: {client_info.client_id[:16]}...")

    # Obtain token via OAuth flow
    access_token = await get_oauth_token_with_client(
@@ -353,7 +357,7 @@ async def test_dcr_respects_bearer_token_type(
        pass

    logger.info(
-        f"✅ DCR with token_type=Bearer works correctly! "
+        f"✅ DCR with token_type=opaque works correctly! "
        f"Token is opaque (not JWT format): {access_token[:30]}..."
    )