Compare commits

...

2 Commits

Author SHA1 Message Date
Chris Coutinho 4a3b80cb98 fix: Update DCR token_type tests for OIDC app changes
The Nextcloud OIDC app has updated token_type parameter values:
- Changed from "Bearer" → "opaque" for opaque tokens
- Changed from "JWT" → "jwt" for JWT tokens

Updated test_dcr_token_type.py to use lowercase token_type values:
- token_type="jwt" for JWT-formatted tokens
- token_type="opaque" for opaque/bearer tokens

This fixes test failures where tests were using the old "Bearer" and
"JWT" (uppercase) values which are no longer recognized by the OIDC app.

Fixes test: test_dcr_respects_bearer_token_type

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-31 22:30:58 +01:00
Chris Coutinho fc3ab8d0ac docs: Add Ollama embeddings capacity analysis and investigation
Documents Ollama embedding service evaluation for ADR-003 semantic search
implementation, including performance benchmarks and capacity analysis.

## Documentation

### Ollama Capacity Analysis
- Performance metrics for ollama.internal.coutinho.io
- Model: nomic-embed-text:latest
- Embedding generation benchmarks (single, batch, parallel)
- Latency analysis and throughput measurements
- Resource usage and capacity recommendations

### Ollama Embeddings Investigation
- Evaluation of Ollama for semantic search use case
- Comparison with other embedding providers
- Integration considerations with ADR-003 architecture
- Deployment scenarios and operational requirements

## Key Findings

- Ollama instance operational and performing well
- Reasonable latency for small-medium workloads
- Good parallelism support
- Suitable for development and small production deployments

## References

- ADR-003: Vector Database Semantic Search
- Ollama API documentation
- nomic-embed-text model specifications

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-31 03:07:44 +01:00
3 changed files with 1257 additions and 16 deletions
+441
@@ -0,0 +1,441 @@
# Ollama Capacity Analysis: ollama.internal.coutinho.io
**Date**: 2025-10-30
**Model**: nomic-embed-text:latest
**Test Location**: From nextcloud-mcp-server host
## Summary
**Ollama instance is operational and performing well**
- Embedding generation working correctly
- Reasonable latency for small-medium workloads
- Good parallelism support
- Suitable for development and small production deployments
## Test Results
### Model Configuration
```json
{
"model": "nomic-embed-text",
"dimensions": 768,
"status": "operational"
}
```
### Performance Metrics
#### 1. Single Embedding Latency
**Result**: ~553ms per embedding
- **Total time**: 0.553 seconds
- **Includes**: Network + processing + model inference
- **Quality**: Full 768-dimensional vector
**Analysis**:
- Higher than bare-metal benchmarks (~100ms) due to network latency
- Acceptable for interactive search queries
- Within expected range for remote Ollama instance
#### 2. Batch Processing (5 items)
**Result**: ~1.02 seconds for 5 embeddings
- **Per-item average**: 204ms
- **Throughput**: ~4.9 embeddings/sec
- **Batch efficiency**: 2.7x faster than sequential
**Analysis**:
- Good batching efficiency (2.7x speedup vs 5x theoretical)
- Optimal for background indexing
- Network overhead amortized across batch
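The batch-efficiency figures follow directly from the measured timings. A minimal sketch of the arithmetic, assuming the sequential cost is simply the single-request latency multiplied by the batch size (the function name is illustrative, not part of the codebase):

```python
def batch_speedup(single_latency_s: float, batch_size: int, batch_time_s: float) -> float:
    """Speedup of one batched call over `batch_size` sequential single calls."""
    sequential_time_s = single_latency_s * batch_size
    return sequential_time_s / batch_time_s


# Timings reported in this analysis: 0.553 s single, 1.02 s for 5, 6.71 s for 20.
print(f"batch of 5:  {batch_speedup(0.553, 5, 1.02):.2f}x")
print(f"batch of 20: {batch_speedup(0.553, 20, 6.71):.2f}x")
```

The same helper can be reused for any new batch size that gets benchmarked.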
#### 3. Batch Processing (20 items)
**Result**: ~6.71 seconds for 20 embeddings
- **Per-item average**: 336ms
- **Throughput**: ~3.0 embeddings/sec
- **Batch efficiency**: 1.65x faster than sequential
**Analysis**:
- Performance degrades slightly with larger batches
- Still faster than sequential processing
- Matches reported Ollama behavior (quality issues at batch >16)
- **Recommendation**: Keep batch size ≤16 for best quality
#### 4. Concurrent Requests (5 parallel)
**Result**: ~1.27 seconds for 5 parallel requests
- **Effective parallelism**: ~2.2x speedup (1.27s vs 2.77s for 5 sequential requests)
- **Per-request average**: 254ms
- **Throughput**: ~3.9 requests/sec
**Analysis**:
- Excellent parallelism support
- Server handles concurrent requests efficiently
- Network and compute overlap effectively
- Good for multi-user scenarios
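A concurrency measurement like the one above can be reproduced with a small harness. This is a sketch only: `timed_concurrent` and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real embedding call so the example runs without a server.

```python
import asyncio
import time
from typing import Awaitable, Callable, Sequence


async def timed_concurrent(
    embed: Callable[[str], Awaitable[object]],
    texts: Sequence[str],
    max_parallel: int = 5,
) -> float:
    """Run `embed` over `texts` with at most `max_parallel` in flight.

    Returns the wall-clock time in seconds.
    """
    semaphore = asyncio.Semaphore(max_parallel)

    async def one(text: str) -> None:
        async with semaphore:
            await embed(text)

    start = time.perf_counter()
    await asyncio.gather(*(one(t) for t in texts))
    return time.perf_counter() - start


async def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding request (assumed ~50 ms each)."""
    await asyncio.sleep(0.05)
    return [0.0]


if __name__ == "__main__":
    elapsed = asyncio.run(timed_concurrent(fake_embed, ["a", "b", "c", "d", "e"]))
    print(f"5 parallel requests took {elapsed:.2f}s")
```

Swapping `fake_embed` for a real client call and comparing `max_parallel=1` against `max_parallel=5` gives the sequential and parallel timings directly.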
## Capacity Planning
### Current Performance Profile
| Metric | Value | Rating |
|--------|-------|--------|
| Single embedding latency | 553ms | ⚠️ Moderate |
| Batch (5) throughput | 4.9/sec | ✅ Good |
| Batch (20) throughput | 3.0/sec | ⚠️ Moderate |
| Concurrent throughput | 3.9/sec | ✅ Good |
| Network latency | ~300-400ms | ⚠️ Significant |
### Bottleneck Analysis
**Primary Bottleneck**: Network latency (~300-400ms per request)
- Model inference: ~100-200ms (estimated)
- Network round-trip: ~300-400ms (measured overhead)
- **Impact**: 60-70% of total latency is network
**Secondary Bottleneck**: CPU/GPU capacity (unknown hardware)
- Batch performance degrades at >16 items
- Suggests resource constraints
- Likely CPU-only (no GPU metrics available)
### Recommended Usage Patterns
#### ✅ **Excellent For:**
**1. Background Indexing**
- Use batch size of 10-15 items
- Expected throughput: 3-5 embeddings/sec
- **10,000 notes**: ~30-55 minutes to index
- **1,000 notes**: ~3-5 minutes to index
**2. Interactive Search**
- Single query embedding: ~550ms
- Acceptable for user-facing search
- Add 100-200ms for vector search + verification
- **Total search time**: ~650-750ms (reasonable UX)
**3. Multi-User Development**
- 5-10 concurrent users: Comfortable
- Good parallelism support
- Network latency dominates (shared)
#### ⚠️ **Consider Alternatives For:**
**1. Real-Time Applications**
- Sub-100ms latency requirements
- High-frequency queries (>10/sec sustained)
- Consider: Local embeddings or Infinity
**2. Large-Scale Batch Processing**
- >100,000 documents to index
- >10 embeddings/sec sustained
- Consider: GPU-accelerated TEI
**3. Production with >50 Users**
- High concurrent load
- Latency sensitivity
- Consider: Dedicated embedding service
### Deployment Scenarios
#### Scenario 1: Development Environment
**Profile**:
- 1-3 developers
- 1,000-5,000 notes total
- Occasional searches/indexing
**Verdict**: ✅ **Perfect fit**
- Initial index: ~5-25 minutes (one-time)
- Incremental updates: <1 minute
- Search latency: Acceptable
- No infrastructure changes needed
**Configuration**:
```bash
OLLAMA_URL=https://ollama.internal.coutinho.io
OLLAMA_MODEL=nomic-embed-text
VECTOR_SYNC_INTERVAL=600 # 10 minutes
VECTOR_SYNC_BATCH_SIZE=10
```
#### Scenario 2: Small Production (10-20 users)
**Profile**:
- 10-20 active users
- 10,000-50,000 notes total
- 50-200 searches/day
- Nightly incremental indexing
**Verdict**: ✅ **Suitable with optimizations**
- Initial index: ~40 minutes to 4.5 hours depending on corpus size (run overnight)
- Incremental: 5-15 minutes/night
- Search: Acceptable for most users
- Monitor network latency
**Configuration**:
```bash
OLLAMA_URL=https://ollama.internal.coutinho.io
OLLAMA_MODEL=nomic-embed-text
VECTOR_SYNC_INTERVAL=86400 # Daily at night
VECTOR_SYNC_BATCH_SIZE=12 # Conservative for quality
SEARCH_TIMEOUT_MS=1000 # Account for 550ms latency
```
**Optimizations**:
- Run sync during off-hours
- Cache query embeddings (common searches)
- Use hybrid search (keyword + semantic)
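Hybrid search can be as simple as a weighted blend of the two scores. The sketch below assumes both scores are already normalized to the 0-1 range; `hybrid_score` and its `semantic_weight` parameter are illustrative names, not an existing API.

```python
def hybrid_score(
    keyword_score: float,
    semantic_score: float,
    semantic_weight: float = 0.5,
) -> float:
    """Blend keyword and semantic relevance into a single ranking score."""
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    return semantic_weight * semantic_score + (1.0 - semantic_weight) * keyword_score


# Rank hypothetical (title, keyword_score, semantic_score) candidates.
candidates = [("note-a", 0.9, 0.2), ("note-b", 0.4, 0.8)]
ranked = sorted(candidates, key=lambda c: hybrid_score(c[1], c[2]), reverse=True)
print([title for title, _, _ in ranked])
```

Lowering the weight shifts ranking toward keyword matches, which keeps results usable when the embedding service is slow or unavailable.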
#### Scenario 3: Medium Production (50-100 users)
**Profile**:
- 50-100 active users
- 100,000+ notes
- 500-1000 searches/day
- Real-time indexing desired
**Verdict**: ⚠️ **Marginal - monitor closely**
- Initial index: 5-10 hours
- Search latency: May feel slow for some users
- Concurrent load: Approaching limits
- **Recommendation**: Plan migration to Infinity
**Configuration**:
```bash
OLLAMA_URL=https://ollama.internal.coutinho.io
OLLAMA_MODEL=nomic-embed-text
VECTOR_SYNC_INTERVAL=3600 # Hourly
VECTOR_SYNC_BATCH_SIZE=10
SEMANTIC_WEIGHT=0.5 # Rely more on keyword search
SEARCH_TIMEOUT_MS=2000 # Generous timeout
```
**Migration Path**:
- Start with Ollama
- Monitor latency metrics
- When p95 latency >1s, migrate to Infinity
- Keep Ollama as fallback
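The p95 trigger in the migration path can be checked from raw latency samples. A minimal sketch using the nearest-rank definition of a percentile; `percentile` and `should_migrate` are hypothetical helpers, and the 1-second threshold is the one stated above.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of `samples` for 0 < pct <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def should_migrate(latencies_s: Sequence[float], threshold_s: float = 1.0) -> bool:
    """True when the p95 latency exceeds the migration threshold."""
    return percentile(latencies_s, 95) > threshold_s
```

In practice the samples would come from the latency histogram the service already records, evaluated over a sliding window rather than all-time data.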
#### Scenario 4: Large Production (>100 users)
**Profile**:
- >100 active users
- >500,000 notes
- >1000 searches/day
- Real-time expectations
**Verdict**: ❌ **Not recommended**
- Latency too high for scale
- Throughput insufficient
- Network becomes bottleneck
- **Recommendation**: Use Infinity or TEI from start
## Network Latency Optimization
### Current Overhead: ~300-400ms
**If MCP server runs closer to Ollama**:
```
Same VPC/network: ~1-5ms (300-400ms savings!)
Same host: <1ms (300-400ms savings!)
```
### Recommendation
**Option A: Co-locate MCP server with Ollama**
- Reduces latency from 550ms → 150-200ms
- Roughly 2.75-3.7x improvement
- Makes Ollama competitive with cloud APIs
**Option B: Keep separate (current)**
- Simpler deployment
- Better security isolation
- Accept 550ms latency
**Option C: Add Infinity container to MCP server**
- Best of both worlds
- Use Infinity for speed (local)
- Fallback to Ollama if needed
## Capacity Estimates
### Indexing Capacity
**Sustained Throughput**: 3-4 embeddings/sec (conservative)
| Document Count | Index Time | Notes |
|----------------|------------|-------|
| 1,000 | 4-5 min | Quick |
| 5,000 | 20-25 min | Reasonable |
| 10,000 | 40-50 min | Acceptable |
| 50,000 | 3.5-4.5 hours | Overnight job |
| 100,000 | 7-9 hours | Long batch |
| 500,000 | 35-45 hours | Not recommended |
**Incremental Updates** (10% change daily):
- 1,000 docs: ~30 sec
- 10,000 docs: ~5 min
- 50,000 docs: ~25 min
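The index-time estimates above are document count divided by sustained throughput. A small sketch of that calculation, assuming one embedding per document (no chunking); the function name is illustrative.

```python
def index_time_minutes(doc_count: int, embeddings_per_sec: float) -> float:
    """Estimated indexing time in minutes at a sustained throughput."""
    if embeddings_per_sec <= 0:
        raise ValueError("embeddings_per_sec must be positive")
    return doc_count / embeddings_per_sec / 60


# Conservative range used in this analysis: 3-4 embeddings/sec.
for docs in (1_000, 10_000, 100_000):
    fast = index_time_minutes(docs, 4.0)
    slow = index_time_minutes(docs, 3.0)
    print(f"{docs:>7,} docs: {fast:.0f}-{slow:.0f} min")
```

If documents are split into several chunks before embedding, multiply `doc_count` by the average chunk count first.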
### Search Capacity
**Query Latency Budget**:
- Embedding: 550ms
- Vector search: 50-100ms
- Permission verification: 50-100ms
- **Total**: 650-750ms
**Concurrent Users** (assuming 1 search every 5 minutes):
- 10 users: 2 queries/min → Comfortable
- 50 users: 10 queries/min → Near limit
- 100 users: 20 queries/min → Over capacity
**Peak Load** (all users search at once):
- Parallelism: ~4 concurrent
- Queue time: Proportional to position
- 10 simultaneous: ~1.5-2 sec for last user
- 50 simultaneous: ~7-10 sec for last user
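The peak-load figures are consistent with a simple "waves" model: with about 4 requests served at a time and roughly 550ms per request, the last user waits for every wave ahead of them. This is a simplification (real servers interleave work rather than processing strict waves), and the function name is hypothetical.

```python
import math


def last_user_wait_s(
    simultaneous: int,
    parallelism: int = 4,
    request_latency_s: float = 0.55,
) -> float:
    """Approximate wait for the last of `simultaneous` requests."""
    waves = math.ceil(simultaneous / parallelism)
    return waves * request_latency_s


print(f"10 simultaneous: ~{last_user_wait_s(10):.2f}s")
print(f"50 simultaneous: ~{last_user_wait_s(50):.2f}s")
```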
## Recommendations
### Immediate Actions (Development)
1. **✅ Use Ollama as-is**
- Current setup is perfect for dev/testing
- No changes needed
- Start building semantic search
2. **Configuration**:
```bash
OLLAMA_URL=https://ollama.internal.coutinho.io
OLLAMA_MODEL=nomic-embed-text
VECTOR_SYNC_BATCH_SIZE=10
```
3. **Add Monitoring**:
```python
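# Track these metrics (sketch using prometheus_client)
from prometheus_client import Counter, Gauge, Histogram

embedding_latency_seconds = Histogram("embedding_latency_seconds", "Embedding request latency")
embedding_batch_size = Gauge("embedding_batch_size", "Items in the current batch")
embedding_errors_total = Counter("embedding_errors_total", "Embedding request errors")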
```
### Short-Term (Small Production)
1. **Optimize Batching**:
- Use batch size 10-12 (quality sweet spot)
- Process during off-hours
- Implement incremental sync
2. **Add Caching**:
```python
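# Cache common query embeddings. functools.lru_cache on a coroutine function
# would cache the coroutine object, which can only be awaited once.
_embedding_cache: dict[str, list[float]] = {}
async def embed_with_cache(query: str) -> list[float]:
    if query not in _embedding_cache:
        _embedding_cache[query] = await ollama.embed(query)
    return _embedding_cache[query]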
```
3. **Monitor Metrics**:
- P50, P95, P99 latency
- Throughput (embeddings/sec)
- Error rates
### Medium-Term (If Scaling Up)
1. **Add Infinity Container** (when >50 users or latency issues):
```yaml
services:
  infinity:
    image: michaelf34/infinity:latest
    # Local to MCP server - ~10-20ms latency
```
2. **Implement Tiered Fallback**:
```
Infinity (local, fast) → Ollama (remote, slower) → Local model
```
3. **Load Testing**:
- Simulate 50-100 concurrent users
- Measure actual throughput limits
- Identify breaking points
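The tiered fallback from step 2 can be sketched as an ordered list of providers tried in turn. Everything here is illustrative: `embed_with_fallback`, `infinity_down`, and `ollama_ok` are hypothetical stand-ins, not the project's actual provider classes.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

EmbedFn = Callable[[str], Awaitable[list[float]]]


async def embed_with_fallback(
    text: str,
    providers: Sequence[tuple[str, EmbedFn]],
) -> tuple[str, list[float]]:
    """Try each (name, embed) pair in order and return the first success."""
    errors: list[str] = []
    for name, embed in providers:
        try:
            return name, await embed(text)
        except Exception as exc:  # any failure moves on to the next tier
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all embedding providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real providers.
async def infinity_down(text: str) -> list[float]:
    raise ConnectionError("connection refused")


async def ollama_ok(text: str) -> list[float]:
    return [0.1, 0.2, 0.3]


if __name__ == "__main__":
    tiers = [("infinity", infinity_down), ("ollama", ollama_ok)]
    print(asyncio.run(embed_with_fallback("hello", tiers)))
```

One caveat: different models produce vectors of different dimensions (768 for nomic-embed-text, 384 for all-minilm), so query-time fallback only works if the index was built with the same model.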
### Long-Term (Enterprise Scale)
1. **Migrate to TEI Cluster** (when >100 users):
- GPU-accelerated
- Horizontal scaling
- <20ms latency
2. **Consider Managed Services**:
- Pinecone, Qdrant Cloud
- Removes operational burden
- Better SLAs
## Testing Recommendations
### Load Testing Script
```bash
# Test sustained load
for i in {1..100}; do
  curl -s https://ollama.internal.coutinho.io/api/embed \
    -d "{\"model\": \"nomic-embed-text\", \"input\": \"Test $i\"}" &
  # Rate limit: 5 concurrent
  if [ $((i % 5)) -eq 0 ]; then
    wait
    sleep 1
  fi
done
```
### Metrics to Collect
1. **Latency Distribution**:
- P50 (median)
- P95 (acceptable)
- P99 (outliers)
2. **Throughput**:
- Embeddings/second
- Peak vs sustained
3. **Error Rates**:
- Timeouts
- Server errors
- Quality issues
## Conclusion
**Your Ollama instance is ready for development and small production use!**
**Current Capacity**:
- ✅ Development: Ample headroom
- ✅ Small prod (10-20 users, 10k docs): Comfortable
- ⚠️ Medium prod (50 users, 50k docs): Monitoring needed
- ❌ Large prod (>100 users): Migrate to Infinity/TEI
**Key Strengths**:
- Fully operational
- Good parallelism
- Acceptable latency for most use cases
- Easy to integrate
**Key Limitations**:
- Network latency adds 300-400ms overhead
- Batch quality issues at >16 items
- Limited scalability beyond 50 users
**Recommendation**:
Start using Ollama immediately for development. Add monitoring and plan for Infinity when you approach 50 users or experience latency issues. The abstraction layer in ADR-003 makes migration seamless.
**Next Steps**:
1. Configure MCP server with Ollama URL
2. Implement semantic search tools
3. Add basic monitoring
4. Test with real workload
5. Scale up as needed
+796
@@ -0,0 +1,796 @@
# Ollama Embeddings Investigation
**Date**: 2025-10-30
**Status**: Recommendation for Integration
## Executive Summary
Ollama provides a **local, self-hosted embedding solution** that is excellent for **development and small-scale deployments** but has **performance limitations** compared to specialized embedding inference engines (TEI, Infinity).
**Recommendation**: Include Ollama as a **Tier 2b fallback** in our embedding strategy (after cloud APIs and Infinity, before local sentence-transformers), prioritizing ease of setup over maximum performance.
## Overview
Ollama is primarily known as a local LLM runner but added embedding model support in version 0.1.26, making it a convenient option for generating vector embeddings without external API dependencies.
### Key Characteristics
- **Local & Self-Hosted**: No external API calls, full privacy
- **Easy Setup**: Single binary, simple model downloads (`ollama pull nomic-embed-text`)
- **Unified Platform**: Same tool for both LLMs and embeddings
- **OpenAI Compatible**: `/v1/embeddings` endpoint for drop-in replacement
- **Multi-Platform**: Linux, macOS, Windows support
- **GPU Support**: CUDA, ROCm, Metal acceleration
## API Details
### Endpoint Structure
**New API** (recommended):
```bash
POST http://localhost:11434/api/embed
```
**OpenAI Compatible**:
```bash
POST http://localhost:11434/v1/embeddings
```
**Legacy API** (deprecated):
```bash
POST http://localhost:11434/api/embeddings
```
### Request Format
**Single Text Embedding**:
```json
{
"model": "nomic-embed-text",
"input": "Text to embed"
}
```
**Batch Embedding** (since v0.2.0):
```json
{
"model": "nomic-embed-text",
"input": [
"First text to embed",
"Second text to embed",
"Third text to embed"
]
}
```
### Response Format
```json
{
"model": "nomic-embed-text",
"embeddings": [
[0.123, -0.456, 0.789, ...], // 768 dimensions for nomic-embed-text
[0.234, -0.567, 0.890, ...]
]
}
```
### Python Integration
```python
import ollama

# Single embedding
response = ollama.embed(
    model='nomic-embed-text',
    input='Text to embed'
)
embedding = response['embeddings'][0]

# Batch embeddings (more efficient)
response = ollama.embed(
    model='nomic-embed-text',
    input=[
        'First text',
        'Second text',
        'Third text'
    ]
)
embeddings = response['embeddings']
```
## Available Models
### 1. nomic-embed-text (Recommended)
**Specifications**:
- **Parameters**: 137M
- **Dimensions**: 768
- **Context Length**: 8,192 tokens (2K effective)
- **Size**: 274MB
- **Architecture**: BERT-based
**Performance**:
- Outperforms OpenAI `text-embedding-ada-002` and `text-embedding-3-small`
- Excellent for long-context tasks
- Strong general-purpose performance
**Use Cases**:
- General RAG applications
- Long document processing
- Semantic search
- Document clustering
**Pull Command**:
```bash
ollama pull nomic-embed-text
```
### 2. mxbai-embed-large
**Specifications**:
- **Parameters**: 334M
- **Dimensions**: 1,024
- **Context Length**: 512 tokens
- **Architecture**: BERT-large optimized
**Performance**:
- Claims to outperform commercial models
- Higher precision for complex queries
- Best quality but slower
**Use Cases**:
- High-precision semantic search
- Enterprise knowledge bases
- Multilingual content
**Pull Command**:
```bash
ollama pull mxbai-embed-large
```
### 3. all-minilm
**Specifications**:
- **Parameters**: 23M
- **Dimensions**: 384
- **Context Length**: 256 tokens
- **Size**: Smallest footprint
**Performance**:
- Fastest processing speed
- Good for sentence-level tasks
- Limited context window
**Use Cases**:
- Real-time applications
- Resource-constrained environments
- High-throughput scenarios
- Development/testing
**Pull Command**:
```bash
ollama pull all-minilm
```
## Performance Benchmarks
### Throughput Comparison
| Hardware | Model | Batch Size | Throughput | Notes |
|----------|-------|------------|------------|-------|
| RTX 4090 (24GB) | nomic-embed-text | 256 | 12,450 tok/sec | GPU-accelerated |
| RTX 4090 (24GB) | mxbai-embed-large | 128 | 8,920 tok/sec | GPU-accelerated |
| Intel i9-13900K (CPU) | nomic-embed-text | 32 | 3,250 tok/sec | CPU-only |
| Intel i9-13900K (CPU) | mxbai-embed-large | 16 | 2,180 tok/sec | CPU-only |
### Latency Comparison
**Single Request Latency** (RTX 4060):
- Ollama: ~99ms
- TEI: ~20ms (5x faster)
- Infinity: ~30-40ms (2.5-3x faster)
**Batch Processing**:
- Throughput-optimal batch size: 32-64 (model dependent)
- Quality issues are reported with batches >16, so smaller batches trade some throughput for reliability
- 2x slower than direct sentence-transformers usage
### Engine Comparison
Based on benchmarks from Baseten (2024):
| Engine | Relative Throughput | Notes |
|--------|---------------------|-------|
| BEI | 9.0x | Fastest (proprietary) |
| TEI | 4.5x | Open source, Rust-based |
| Infinity | 3.5x | PyTorch/ONNX optimized |
| vLLM | 3.0x | General LLM inference |
| **Ollama** | **1.0x** (baseline) | Slowest for embeddings |
**Key Insight**: Ollama is **5-9x slower** than specialized embedding engines but trades performance for ease of use and unified platform.
## Integration Implementation
### Python Client Wrapper
```python
# nextcloud_mcp_server/embeddings/ollama.py
import httpx
from typing import List


class OllamaEmbedding:
    """Ollama embedding provider"""

    def __init__(
        self,
        base_url: str = "http://localhost:11434",
        model: str = "nomic-embed-text"
    ):
        self.base_url = base_url.rstrip("/")
        self.model = model
        self.client = httpx.AsyncClient(timeout=60.0)
        # Model dimension mapping
        self.dimensions = {
            "nomic-embed-text": 768,
            "mxbai-embed-large": 1024,
            "all-minilm": 384
        }
        self.dimension = self.dimensions.get(model, 768)

    async def embed(self, text: str) -> List[float]:
        """Generate embedding for single text"""
        response = await self.client.post(
            f"{self.base_url}/api/embed",
            json={
                "model": self.model,
                "input": text
            }
        )
        response.raise_for_status()
        data = response.json()
        return data["embeddings"][0]

    async def embed_batch(
        self,
        texts: List[str],
        batch_size: int = 16
    ) -> List[List[float]]:
        """
        Generate embeddings for multiple texts in batches.

        Note: Ollama has reported quality issues with batch sizes >16,
        so batch_size defaults to 16 but remains configurable.
        """
        all_embeddings = []
        # Process in chunks to avoid batch size issues
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            response = await self.client.post(
                f"{self.base_url}/api/embed",
                json={
                    "model": self.model,
                    "input": batch
                }
            )
            response.raise_for_status()
            data = response.json()
            all_embeddings.extend(data["embeddings"])
        return all_embeddings

    async def check_health(self) -> bool:
        """Check if Ollama server is running and model is available"""
        try:
            # Check if server is up
            response = await self.client.get(f"{self.base_url}/api/tags")
            response.raise_for_status()
            # Check if model is pulled. /api/tags lists names with a tag
            # suffix (e.g. "nomic-embed-text:latest"), so accept both forms.
            models = response.json().get("models", [])
            model_names = {m["name"] for m in models}
            model_names |= {name.split(":")[0] for name in model_names}
            if self.model not in model_names:
                raise ValueError(
                    f"Model '{self.model}' not found. "
                    f"Run: ollama pull {self.model}"
                )
            return True
        except Exception as e:
            raise ConnectionError(f"Ollama health check failed: {e}") from e

    async def close(self):
        """Close HTTP client"""
        await self.client.aclose()
```
### Auto-Detection in Embedding Service
```python
# nextcloud_mcp_server/embeddings/service.py
from typing import Optional
import asyncio
import os
import logging

logger = logging.getLogger(__name__)


class EmbeddingService:
    """Unified embedding service with automatic provider detection"""

    def __init__(self):
        self.provider = None
        self._detect_provider()

    def _detect_provider(self):
        """Auto-detect available embedding provider"""
        # Tier 1: OpenAI API (best quality)
        if os.getenv("OPENAI_API_KEY"):
            from .openai import OpenAIEmbedding
            self.provider = OpenAIEmbedding(
                model=os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small"),
                api_key=os.getenv("OPENAI_API_KEY")
            )
            logger.info("✓ Using OpenAI embeddings")
            return

        # Tier 2a: Infinity (optimized self-hosted)
        if os.getenv("INFINITY_URL"):
            from .infinity import InfinityEmbedding
            try:
                self.provider = InfinityEmbedding(
                    url=os.getenv("INFINITY_URL"),
                    model=os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
                )
                logger.info("✓ Using Infinity embeddings (optimized)")
                return
            except Exception as e:
                logger.warning(f"Infinity unavailable: {e}")

        # Tier 2b: Ollama (easy self-hosted)
        if os.getenv("OLLAMA_URL"):
            from .ollama import OllamaEmbedding
            try:
                self.provider = OllamaEmbedding(
                    base_url=os.getenv("OLLAMA_URL", "http://localhost:11434"),
                    model=os.getenv("OLLAMA_MODEL", "nomic-embed-text")
                )
                # Verify Ollama is running and model is available.
                # asyncio.run() raises RuntimeError if an event loop is already
                # running, so construct this service from synchronous startup code.
                asyncio.run(self.provider.check_health())
                logger.info("✓ Using Ollama embeddings (easy setup)")
                return
            except Exception as e:
                logger.warning(f"Ollama unavailable: {e}")

        # Tier 3: Local model (fallback)
        logger.warning("No cloud/hosted embeddings available, using local model")
        from .local import LocalEmbedding
        self.provider = LocalEmbedding(
            model=os.getenv("LOCAL_EMBEDDING_MODEL", "all-MiniLM-L6-v2")
        )
        logger.info("✓ Using local embeddings (CPU fallback)")

    async def embed(self, text: str):
        """Generate embedding for text"""
        return await self.provider.embed(text)

    async def embed_batch(self, texts: list[str]):
        """Generate embeddings for multiple texts"""
        return await self.provider.embed_batch(texts)

    @property
    def dimension(self) -> int:
        """Get embedding dimension"""
        return self.provider.dimension
```
### Docker Compose Configuration
```yaml
services:
  # Ollama embedding service
  ollama:
    image: ollama/ollama:latest
    restart: always
    ports:
      - 127.0.0.1:11434:11434
    volumes:
      - ollama_models:/root/.ollama
    # Optional: GPU support
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # Pull models on startup
    entrypoint: ["/bin/sh", "-c"]
    command:
      - |
        ollama serve &
        sleep 5
        ollama pull nomic-embed-text
        wait
  # MCP Server with Ollama embeddings
  mcp:
    build: .
    depends_on:
      - ollama
    environment:
      # ... other vars ...
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=nomic-embed-text
  # Vector sync worker
  mcp-vector-sync:
    build: .
    command: ["python", "-m", "nextcloud_mcp_server.sync.vector_indexer"]
    depends_on:
      - ollama
      - qdrant
    environment:
      # ... other vars ...
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=nomic-embed-text
volumes:
  ollama_models:
```
## Advantages of Ollama
### 1. **Ease of Setup**
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull embedding model
ollama pull nomic-embed-text
# Done! API available at localhost:11434
```
No complex configuration, no Docker registries, no model conversion.
### 2. **Privacy & Data Sovereignty**
- All processing happens locally
- No data leaves your infrastructure
- No API keys or external dependencies
- Ideal for sensitive content (medical, legal, financial)
### 3. **Unified Platform**
- Same tool for LLMs and embeddings
- Consistent API across model types
- Single point of management
- Simplified operations
### 4. **Developer Experience**
- Simple API (similar to OpenAI)
- Good documentation
- Active community
- Framework integrations (LangChain, LlamaIndex)
### 5. **Cost**
- Free and open source
- No per-token API costs
- Only infrastructure costs (compute)
### 6. **Model Variety**
Growing library of embedding models:
- nomic-embed-text (general purpose)
- mxbai-embed-large (high quality)
- all-minilm (fast)
- More models added regularly
## Limitations of Ollama
### 1. **Performance**
- **5-9x slower** than specialized engines (TEI, Infinity)
- Not optimized specifically for embedding inference
- Batch processing issues at larger batch sizes (>16)
- Higher latency compared to alternatives
### 2. **Scalability**
- Single-instance deployment (no native clustering)
- Limited concurrent request handling
- Not designed for high-throughput production
- Resource usage per request is higher
### 3. **Batch Processing Issues**
- Quality degradation reported with large batches
- Throughput peaks around batch sizes of 32-64, but quality issues are reported above 16; keep batches ≤16 when quality matters
- Less efficient than specialized engines
- GitHub issues tracking batch problems (#6262)
### 4. **Resource Usage**
- Models stay loaded in memory (VRAM/RAM)
- Higher memory footprint per model
- GPU context switching overhead
- Not as memory-efficient as specialized engines
### 5. **Production Features**
- No built-in load balancing
- Limited monitoring/metrics
- No automatic scaling
- Basic error handling
## Use Case Recommendations
### ✅ **Excellent For:**
1. **Development & Testing**
- Quick setup for prototyping
- Local development environments
- Testing embedding pipelines
2. **Small Deployments**
- <10 users
- <10,000 documents
- Infrequent searches (<100/day)
- Hobbyist/personal projects
3. **Privacy-Critical Applications**
- Medical/healthcare records
- Legal documents
- Financial data
- Air-gapped environments
4. **Unified LLM Stack**
- Projects already using Ollama for LLMs
- Simplified operations
- Consistent tooling
5. **Educational/Learning**
- Teaching RAG concepts
- Learning embeddings
- Hackathons/workshops
### ⚠️ **Consider Alternatives For:**
1. **Production at Scale**
- >100 users
- >100,000 documents
- High query volume (>1000/day)
- Use: TEI or Infinity
2. **Performance-Critical**
- Real-time search (<50ms latency)
- High-throughput batch processing
- Use: TEI with GPU
3. **Enterprise Deployments**
- Need for high availability
- Load balancing requirements
- Advanced monitoring
- Use: Managed services or TEI cluster
4. **Large-Scale Indexing**
- Millions of documents
- Continuous high-volume ingestion
- Use: Infinity or commercial solutions
## Integration Strategy
### Recommended Tier Placement
**Update ADR-003 embedding strategy:**
```
Tier 1: OpenAI API (best quality, requires API key)
↓ fallback
Tier 2a: Infinity (optimized self-hosted, complex setup)
↓ fallback
Tier 2b: Ollama (easy self-hosted, moderate performance) ← NEW
↓ fallback
Tier 3: Local sentence-transformers (CPU fallback, simplest)
```
### Configuration
```bash
# Option 1: Use Infinity (if available)
INFINITY_URL=http://infinity:7997
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
# Option 2: Use Ollama (if Infinity unavailable)
OLLAMA_URL=http://ollama:11434
OLLAMA_MODEL=nomic-embed-text
# Option 3: Use local model (automatic fallback)
# No configuration needed
```
### When to Choose Ollama
**Choose Ollama if**:
- You're already using Ollama for LLMs
- You need privacy/data sovereignty
- You have <10k documents and <100 users
- Ease of setup is more important than max performance
- You're in development/testing phase
**Choose Infinity/TEI if**:
- You need maximum throughput (>1000 embeddings/sec)
- You have >100k documents
- Latency is critical (<50ms)
- You're in production with >100 users
**Choose OpenAI API if**:
- You're okay with cloud dependencies
- You need best-in-class quality
- Cost is not a concern (~$0.02 per 1M tokens)
## Production Deployment Guidance
### Small Production (Ollama Acceptable)
**Profile**:
- 5-20 users
- 1,000-10,000 documents
- 50-200 searches/day
- <2 sec acceptable latency
**Configuration**:
```yaml
ollama:
  image: ollama/ollama:latest
  deploy:
    resources:
      limits:
        memory: 4GB
        cpus: "2.0"
      reservations:
        devices:
          - driver: nvidia # GPU if available
            count: 1
            capabilities: [gpu]
  environment:
    - OLLAMA_NUM_PARALLEL=2 # Concurrent requests
```
**Expected Performance**:
- Embedding latency: 100-200ms
- Throughput: 5-10 embeddings/sec
- Memory: 2-3GB (model loaded)
### Medium Production (Use Infinity/TEI)
**Profile**:
- 20-200 users
- 10,000-1M documents
- 500-5,000 searches/day
- <500ms acceptable latency
**Recommendation**: Migrate to Infinity or TEI
```yaml
infinity:
  image: michaelf34/infinity:latest
  # Better throughput and latency
```
### Large Production (Use Specialized Solution)
**Profile**:
- >200 users
- >1M documents
- >5,000 searches/day
- <100ms required latency
**Recommendation**: Use TEI cluster or commercial service
## Monitoring Considerations
### Key Metrics to Track
```python
# Add Ollama-specific metrics
from prometheus_client import Histogram, Counter, Gauge

ollama_embedding_latency = Histogram(
    'ollama_embedding_duration_seconds',
    'Ollama embedding generation time',
    ['model', 'batch_size']
)
ollama_batch_size = Gauge(
    'ollama_batch_size',
    'Current batch size being processed'
)
ollama_errors = Counter(
    'ollama_errors_total',
    'Ollama embedding errors',
    ['error_type']
)
```
### Health Checks
```python
import httpx


async def ollama_health_check():
    """Check Ollama availability"""
    try:
        async with httpx.AsyncClient() as client:
            # Check server
            response = await client.get("http://ollama:11434/api/tags")
            response.raise_for_status()
            # Verify model is pulled. /api/tags lists names with a tag suffix
            # (e.g. "nomic-embed-text:latest"), so compare the base name.
            models = response.json().get("models", [])
            if "nomic-embed-text" not in [m["name"].split(":")[0] for m in models]:
                return False, "Model not pulled"
            return True, "OK"
    except Exception as e:
        return False, str(e)
```
## Migration Path
### Starting with Ollama
**Phase 1: Development** (Ollama)
- Use Ollama for initial development
- Validate embedding pipeline
- Test search quality
**Phase 2: Growth** (Ollama → Infinity)
- Monitor performance metrics
- When >50 users or >10k docs, migrate to Infinity
- Simple config change, no code changes
**Phase 3: Scale** (Infinity → TEI/Commercial)
- When >200 users or performance issues
- Consider TEI cluster or managed services
### Code Compatibility
All embedding providers use the same interface:
```python
# Works with Ollama, Infinity, OpenAI, Local
embedding = await embedding_service.embed(text)
embeddings = await embedding_service.embed_batch(texts)
```
**Migration is a configuration change only** - no code rewrite needed.
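The shared interface can be made explicit with a structural type. This is a sketch of one way to express it, not the definition used in ADR-003: `EmbeddingProvider` and `FakeEmbedding` are hypothetical names, and the method signatures mirror the `embed` / `embed_batch` / `dimension` members shown earlier in this document.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmbeddingProvider(Protocol):
    """Structural interface every embedding backend is expected to satisfy."""

    dimension: int

    async def embed(self, text: str) -> list[float]: ...

    async def embed_batch(self, texts: list[str]) -> list[list[float]]: ...


class FakeEmbedding:
    """Hypothetical provider for tests: returns zero vectors."""

    dimension = 4

    async def embed(self, text: str) -> list[float]:
        return [0.0] * self.dimension

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        return [await self.embed(t) for t in texts]


if __name__ == "__main__":
    provider = FakeEmbedding()
    print(isinstance(provider, EmbeddingProvider))
    print(asyncio.run(provider.embed_batch(["a", "b"])))
```

A fake like this also lets the search pipeline be tested without any embedding server running.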
## Conclusion
**Ollama is a solid choice for:**
- Early-stage projects
- Development/testing
- Privacy-critical applications
- Small deployments (<10 users, <10k docs)
- Unified LLM + embedding stack
**But recognize its limitations:**
- 5-9x slower than specialized engines
- Not designed for high-throughput production
- Batch processing can be problematic
- Limited scalability
**Recommendation**:
**Include Ollama as Tier 2b** (after Infinity, before local models) in the embedding strategy. It provides a good balance of ease-of-use and privacy for small-to-medium deployments while allowing seamless migration to more performant engines as needs grow.
The key is designing the abstraction layer (as done in ADR-003) so migration between engines requires only configuration changes, not code rewrites.
+20 -16
@@ -3,8 +3,8 @@ Tests for Dynamic Client Registration (DCR) token_type parameter.
 These tests verify that the Nextcloud OIDC server properly honors the token_type
 parameter during client registration, issuing the correct type of access tokens:
-- token_type="JWT" → JWT-formatted tokens (RFC 9068)
-- token_type="Bearer" → Opaque tokens (standard OAuth2)
+- token_type="jwt" → JWT-formatted tokens (RFC 9068)
+- token_type="opaque" → Opaque tokens (standard OAuth2)
 This is critical for ensuring:
 1. Client choice is respected by the OIDC server
@@ -208,12 +208,14 @@ async def test_dcr_respects_jwt_token_type(
     oauth_callback_server,
 ):
     """
-    Test that DCR honors token_type=JWT and issues JWT-formatted tokens.
+    Test that DCR honors token_type=jwt and issues JWT-formatted tokens.
     This verifies:
-    1. Client registration with token_type="JWT" succeeds
+    1. Client registration with token_type="jwt" succeeds
     2. Tokens obtained via this client are JWT format (base64.base64.signature)
     3. JWT payload contains expected claims (sub, iss, scope, etc.)
+    Note: The OIDC app uses lowercase 'jwt' (not 'JWT').
     """
     nextcloud_host = os.getenv("NEXTCLOUD_HOST")
     if not nextcloud_host:
@@ -232,15 +234,15 @@ async def test_dcr_respects_jwt_token_type(
     token_endpoint = oidc_config.get("token_endpoint")
     authorization_endpoint = oidc_config.get("authorization_endpoint")
-    # Register client with token_type="JWT"
-    logger.info("Registering OAuth client with token_type=JWT...")
+    # Register client with token_type="jwt"
+    logger.info("Registering OAuth client with token_type=jwt...")
     client_info = await register_client(
         nextcloud_url=nextcloud_host,
         registration_endpoint=registration_endpoint,
         client_name="DCR Test - JWT Token Type",
         redirect_uris=[callback_url],
         scopes="openid profile email notes:read notes:write",
-        token_type="JWT",
+        token_type="jwt",
     )
     logger.info(f"Registered JWT client: {client_info.client_id[:16]}...")
@@ -278,7 +280,7 @@ async def test_dcr_respects_jwt_token_type(
assert "notes:write" in scopes, "JWT scope claim missing notes:write" assert "notes:write" in scopes, "JWT scope claim missing notes:write"
logger.info( logger.info(
f"✅ DCR with token_type=JWT works correctly! " f"✅ DCR with token_type=jwt works correctly! "
f"Token is JWT format with scope claim: {payload['scope']}" f"Token is JWT format with scope claim: {payload['scope']}"
) )
@@ -290,12 +292,14 @@ async def test_dcr_respects_bearer_token_type(
oauth_callback_server, oauth_callback_server,
): ):
""" """
Test that DCR honors token_type=Bearer and issues opaque tokens. Test that DCR honors token_type=opaque and issues opaque tokens.
This verifies: This verifies:
1. Client registration with token_type="Bearer" succeeds 1. Client registration with token_type="opaque" succeeds
2. Tokens obtained via this client are opaque (NOT JWT format) 2. Tokens obtained via this client are opaque (NOT JWT format)
3. Opaque tokens are simple strings, not base64-encoded structures 3. Opaque tokens are simple strings, not base64-encoded structures
Note: The OIDC app uses 'opaque' or 'jwt' as token_type values (not 'Bearer').
""" """
nextcloud_host = os.getenv("NEXTCLOUD_HOST") nextcloud_host = os.getenv("NEXTCLOUD_HOST")
if not nextcloud_host: if not nextcloud_host:
@@ -314,18 +318,18 @@ async def test_dcr_respects_bearer_token_type(
token_endpoint = oidc_config.get("token_endpoint") token_endpoint = oidc_config.get("token_endpoint")
authorization_endpoint = oidc_config.get("authorization_endpoint") authorization_endpoint = oidc_config.get("authorization_endpoint")
# Register client with token_type="Bearer" (opaque tokens) # Register client with token_type="opaque" (opaque tokens)
logger.info("Registering OAuth client with token_type=Bearer...") logger.info("Registering OAuth client with token_type=opaque...")
client_info = await register_client( client_info = await register_client(
nextcloud_url=nextcloud_host, nextcloud_url=nextcloud_host,
registration_endpoint=registration_endpoint, registration_endpoint=registration_endpoint,
client_name="DCR Test - Bearer Token Type", client_name="DCR Test - Opaque Token Type",
redirect_uris=[callback_url], redirect_uris=[callback_url],
scopes="openid profile email notes:read notes:write", scopes="openid profile email notes:read notes:write",
token_type="Bearer", token_type="opaque",
) )
logger.info(f"Registered Bearer client: {client_info.client_id[:16]}...") logger.info(f"Registered Opaque token client: {client_info.client_id[:16]}...")
# Obtain token via OAuth flow # Obtain token via OAuth flow
access_token = await get_oauth_token_with_client( access_token = await get_oauth_token_with_client(
@@ -353,7 +357,7 @@ async def test_dcr_respects_bearer_token_type(
pass pass
logger.info( logger.info(
f"✅ DCR with token_type=Bearer works correctly! " f"✅ DCR with token_type=opaque works correctly! "
f"Token is opaque (not JWT format): {access_token[:30]}..." f"Token is opaque (not JWT format): {access_token[:30]}..."
) )
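The two tests above distinguish JWT-formatted tokens (three dot-separated base64url segments) from opaque tokens (plain strings). A minimal sketch of one way to make that distinction follows. `looks_like_jwt` and `_b64url_decode` are hypothetical helpers written for illustration; they are not the code in `test_dcr_token_type.py`, and they do not verify the signature.

```python
import base64
import json


def _b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def looks_like_jwt(token: str) -> bool:
    """Return True if the token has three dot-separated parts whose header
    and payload both decode to JSON objects. The signature is NOT verified."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    try:
        header = json.loads(_b64url_decode(parts[0]))
        payload = json.loads(_b64url_decode(parts[1]))
    except ValueError:
        # Covers bad base64, invalid UTF-8, and invalid JSON,
        # all of which raise subclasses of ValueError.
        return False
    return isinstance(header, dict) and isinstance(payload, dict)
```

Under these assumptions, a token from a client registered with `token_type="jwt"` would be expected to pass this check, and one from a `token_type="opaque"` client would be expected to fail it.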