Compare commits
2 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| 4a3b80cb98 | |||
| fc3ab8d0ac |
@@ -0,0 +1,441 @@
|
|||||||
|
# Ollama Capacity Analysis: ollama.internal.coutinho.io
|
||||||
|
|
||||||
|
**Date**: 2025-10-30
|
||||||
|
**Model**: nomic-embed-text:latest
|
||||||
|
**Test Location**: From nextcloud-mcp-server host
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
✅ **Ollama instance is operational and performing well**
|
||||||
|
- Embedding generation working correctly
|
||||||
|
- Reasonable latency for small-medium workloads
|
||||||
|
- Good parallelism support
|
||||||
|
- Suitable for development and small production deployments
|
||||||
|
|
||||||
|
## Test Results
|
||||||
|
|
||||||
|
### Model Configuration
|
||||||
|
|
||||||
|
```json
{
  "model": "nomic-embed-text",
  "dimensions": 768,
  "status": "operational"
}
```
|
||||||
|
|
||||||
|
### Performance Metrics
|
||||||
|
|
||||||
|
#### 1. Single Embedding Latency
|
||||||
|
|
||||||
|
**Result**: ~553ms per embedding
|
||||||
|
- **Total time**: 0.553 seconds
|
||||||
|
- **Includes**: Network + processing + model inference
|
||||||
|
- **Quality**: Full 768-dimensional vector
|
||||||
|
|
||||||
|
**Analysis**:
|
||||||
|
- Higher than bare-metal benchmarks (~100ms) due to network latency
|
||||||
|
- Acceptable for interactive search queries
|
||||||
|
- Within expected range for remote Ollama instance
|
||||||
|
|
||||||
|
#### 2. Batch Processing (5 items)
|
||||||
|
|
||||||
|
**Result**: ~1.02 seconds for 5 embeddings
|
||||||
|
- **Per-item average**: 204ms
|
||||||
|
- **Throughput**: ~4.9 embeddings/sec
|
||||||
|
- **Batch efficiency**: 2.7x faster than sequential
|
||||||
|
|
||||||
|
**Analysis**:
|
||||||
|
- Good batching efficiency (2.7x speedup vs 5x theoretical)
|
||||||
|
- Optimal for background indexing
|
||||||
|
- Network overhead amortized across batch
|
||||||
|
|
||||||
|
#### 3. Batch Processing (20 items)
|
||||||
|
|
||||||
|
**Result**: ~6.71 seconds for 20 embeddings
|
||||||
|
- **Per-item average**: 336ms
|
||||||
|
- **Throughput**: ~3.0 embeddings/sec
|
||||||
|
- **Batch efficiency**: 1.65x faster than sequential
|
||||||
|
|
||||||
|
**Analysis**:
|
||||||
|
- Performance degrades slightly with larger batches
|
||||||
|
- Still faster than sequential processing
|
||||||
|
- Consistent with reports of Ollama issues at batch sizes >16 (those reports concern embedding quality; this test measured timing only)
|
||||||
|
- **Recommendation**: Keep batch size ≤16 for best quality
|
||||||
|
|
||||||
|
#### 4. Concurrent Requests (5 parallel)
|
||||||
|
|
||||||
|
**Result**: ~1.27 seconds for 5 parallel requests
|
||||||
|
- **Effective parallelism**: ~2.2x speedup (2.77s sequential vs 1.27s parallel)
|
||||||
|
- **Per-request average**: 254ms
|
||||||
|
- **Throughput**: ~3.9 requests/sec
|
||||||
|
|
||||||
|
**Analysis**:
|
||||||
|
- Excellent parallelism support
|
||||||
|
- Server handles concurrent requests efficiently
|
||||||
|
- Network and compute overlap effectively
|
||||||
|
- Good for multi-user scenarios
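
The derived figures in the test results follow directly from the raw timings. A minimal sketch of that arithmetic, assuming the sequential cost is simply the single-request latency (0.553 s) multiplied by the item count; the name `derive_metrics` is illustrative and not part of the test tooling:

```python
SINGLE_LATENCY_S = 0.553  # measured single-embedding latency from test 1


def derive_metrics(total_s: float, items: int) -> dict[str, float]:
    """Derive per-item latency, throughput, and speedup vs. sequential calls."""
    sequential_s = SINGLE_LATENCY_S * items  # assumed cost of one-at-a-time requests
    return {
        "per_item_ms": total_s / items * 1000,
        "throughput_per_s": items / total_s,
        "speedup": sequential_s / total_s,
    }


# (total seconds, item count) for each measured run
measurements = {
    "batch of 5": (1.02, 5),
    "batch of 20": (6.71, 20),
    "5 parallel requests": (1.27, 5),
}

for label, (total_s, items) in measurements.items():
    m = derive_metrics(total_s, items)
    print(
        f"{label}: {m['per_item_ms']:.0f} ms/item, "
        f"{m['throughput_per_s']:.1f}/sec, {m['speedup']:.2f}x vs sequential"
    )
```

Re-running this with fresh timings is a quick way to check whether the batch-size and concurrency conclusions still hold after a hardware or network change.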
|
||||||
|
|
||||||
|
## Capacity Planning
|
||||||
|
|
||||||
|
### Current Performance Profile
|
||||||
|
|
||||||
|
| Metric | Value | Rating |
|--------|-------|--------|
| Single embedding latency | 553ms | ⚠️ Moderate |
| Batch (5) throughput | 4.9/sec | ✅ Good |
| Batch (20) throughput | 3.0/sec | ⚠️ Moderate |
| Concurrent throughput | 3.9/sec | ✅ Good |
| Network latency | ~300-400ms | ⚠️ Significant |
|
||||||
|
|
||||||
|
### Bottleneck Analysis
|
||||||
|
|
||||||
|
**Primary Bottleneck**: Network latency (~300-400ms per request)
|
||||||
|
- Model inference: ~100-200ms (estimated)
|
||||||
|
- Network round-trip: ~300-400ms (measured overhead)
|
||||||
|
- **Impact**: roughly 55-70% of the 553ms total latency is network
|
||||||
|
|
||||||
|
**Secondary Bottleneck**: CPU/GPU capacity (unknown hardware)
|
||||||
|
- Batch performance degrades at >16 items
|
||||||
|
- Suggests resource constraints
|
||||||
|
- Likely CPU-only (no GPU metrics available)
|
||||||
|
|
||||||
|
### Recommended Usage Patterns
|
||||||
|
|
||||||
|
#### ✅ **Excellent For:**
|
||||||
|
|
||||||
|
**1. Background Indexing**
|
||||||
|
- Use batch size of 10-15 items
|
||||||
|
- Expected throughput: 3-5 embeddings/sec
|
||||||
|
- **10,000 notes**: ~30-55 minutes to index
|
||||||
|
- **1,000 notes**: ~3-5 minutes to index
|
||||||
|
|
||||||
|
**2. Interactive Search**
|
||||||
|
- Single query embedding: ~550ms
|
||||||
|
- Acceptable for user-facing search
|
||||||
|
- Add 100-200ms for vector search + verification
|
||||||
|
- **Total search time**: ~650-750ms (reasonable UX)
|
||||||
|
|
||||||
|
**3. Multi-User Development**
|
||||||
|
- 5-10 concurrent users: Comfortable
|
||||||
|
- Good parallelism support
|
||||||
|
- Network latency dominates (shared)
|
||||||
|
|
||||||
|
#### ⚠️ **Consider Alternatives For:**
|
||||||
|
|
||||||
|
**1. Real-Time Applications**
|
||||||
|
- Sub-100ms latency requirements
|
||||||
|
- High-frequency queries (>10/sec sustained)
|
||||||
|
- Consider: Local embeddings or Infinity
|
||||||
|
|
||||||
|
**2. Large-Scale Batch Processing**
|
||||||
|
- >100,000 documents to index
|
||||||
|
- >10 embeddings/sec sustained
|
||||||
|
- Consider: GPU-accelerated TEI
|
||||||
|
|
||||||
|
**3. Production with >50 Users**
|
||||||
|
- High concurrent load
|
||||||
|
- Latency sensitivity
|
||||||
|
- Consider: Dedicated embedding service
|
||||||
|
|
||||||
|
### Deployment Scenarios
|
||||||
|
|
||||||
|
#### Scenario 1: Development Environment
|
||||||
|
|
||||||
|
**Profile**:
|
||||||
|
- 1-3 developers
|
||||||
|
- 1,000-5,000 notes total
|
||||||
|
- Occasional searches/indexing
|
||||||
|
|
||||||
|
**Verdict**: ✅ **Perfect fit**
|
||||||
|
- Initial index: ~4-25 minutes depending on note count (one-time)
|
||||||
|
- Incremental updates: <1 minute
|
||||||
|
- Search latency: Acceptable
|
||||||
|
- No infrastructure changes needed
|
||||||
|
|
||||||
|
**Configuration**:
|
||||||
|
```bash
OLLAMA_URL=https://ollama.internal.coutinho.io
OLLAMA_MODEL=nomic-embed-text
VECTOR_SYNC_INTERVAL=600  # 10 minutes
VECTOR_SYNC_BATCH_SIZE=10
```
|
||||||
|
|
||||||
|
#### Scenario 2: Small Production (10-20 users)
|
||||||
|
|
||||||
|
**Profile**:
|
||||||
|
- 10-20 active users
|
||||||
|
- 10,000-50,000 notes total
|
||||||
|
- 50-200 searches/day
|
||||||
|
- Nightly incremental indexing
|
||||||
|
|
||||||
|
**Verdict**: ✅ **Suitable with optimizations**
|
||||||
|
- Initial index: ~40 minutes to 4.5 hours depending on note count (run overnight)
- Incremental: 5-25 minutes/night (assuming ~10% of notes change daily)
|
||||||
|
- Search: Acceptable for most users
|
||||||
|
- Monitor network latency
|
||||||
|
|
||||||
|
**Configuration**:
|
||||||
|
```bash
OLLAMA_URL=https://ollama.internal.coutinho.io
OLLAMA_MODEL=nomic-embed-text
VECTOR_SYNC_INTERVAL=86400  # Every 24 hours (start the first run at night)
VECTOR_SYNC_BATCH_SIZE=12   # Conservative for quality
SEARCH_TIMEOUT_MS=1000      # Account for 550ms latency
```
|
||||||
|
|
||||||
|
**Optimizations**:
|
||||||
|
- Run sync during off-hours
|
||||||
|
- Cache query embeddings (common searches)
|
||||||
|
- Use hybrid search (keyword + semantic)
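
Hybrid search can be sketched as a weighted blend of a keyword score and a semantic score. This is an illustration only: it assumes both scores are already normalized to the range 0 to 1, and `hybrid_score` and its `semantic_weight` parameter are hypothetical names rather than the server's actual implementation.

```python
def hybrid_score(keyword: float, semantic: float, semantic_weight: float = 0.5) -> float:
    """Blend two normalized relevance scores; a higher weight favors semantic."""
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    return semantic_weight * semantic + (1.0 - semantic_weight) * keyword


# Rank candidate notes given made-up (keyword_score, semantic_score) pairs.
candidates = {"note-a": (0.9, 0.2), "note-b": (0.4, 0.8), "note-c": (0.1, 0.3)}
ranked = sorted(candidates, key=lambda n: hybrid_score(*candidates[n]), reverse=True)
print(ranked)
```

Lowering the weight shifts ranking toward keyword matches, which still work when the embedding service is slow or unavailable.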
|
||||||
|
|
||||||
|
#### Scenario 3: Medium Production (50-100 users)
|
||||||
|
|
||||||
|
**Profile**:
|
||||||
|
- 50-100 active users
|
||||||
|
- 100,000+ notes
|
||||||
|
- 500-1000 searches/day
|
||||||
|
- Real-time indexing desired
|
||||||
|
|
||||||
|
**Verdict**: ⚠️ **Marginal - monitor closely**
|
||||||
|
- Initial index: 5-10 hours
|
||||||
|
- Search latency: May feel slow for some users
|
||||||
|
- Concurrent load: Approaching limits
|
||||||
|
- **Recommendation**: Plan migration to Infinity
|
||||||
|
|
||||||
|
**Configuration**:
|
||||||
|
```bash
OLLAMA_URL=https://ollama.internal.coutinho.io
OLLAMA_MODEL=nomic-embed-text
VECTOR_SYNC_INTERVAL=3600  # Hourly
VECTOR_SYNC_BATCH_SIZE=10
SEMANTIC_WEIGHT=0.5        # Rely more on keyword search
SEARCH_TIMEOUT_MS=2000     # Generous timeout
```
|
||||||
|
|
||||||
|
**Migration Path**:
|
||||||
|
- Start with Ollama
|
||||||
|
- Monitor latency metrics
|
||||||
|
- When p95 latency >1s, migrate to Infinity
|
||||||
|
- Keep Ollama as fallback
|
||||||
|
|
||||||
|
#### Scenario 4: Large Production (>100 users)
|
||||||
|
|
||||||
|
**Profile**:
|
||||||
|
- >100 active users
|
||||||
|
- >500,000 notes
|
||||||
|
- >1000 searches/day
|
||||||
|
- Real-time expectations
|
||||||
|
|
||||||
|
**Verdict**: ❌ **Not recommended**
|
||||||
|
- Latency too high for scale
|
||||||
|
- Throughput insufficient
|
||||||
|
- Network becomes bottleneck
|
||||||
|
- **Recommendation**: Use Infinity or TEI from start
|
||||||
|
|
||||||
|
## Network Latency Optimization
|
||||||
|
|
||||||
|
### Current Overhead: ~300-400ms
|
||||||
|
|
||||||
|
**If MCP server runs closer to Ollama**:
|
||||||
|
```
Same VPC/network: ~1-5ms (300-400ms savings!)
Same host: <1ms (300-400ms savings!)
```
|
||||||
|
|
||||||
|
### Recommendation
|
||||||
|
|
||||||
|
**Option A: Co-locate MCP server with Ollama**
|
||||||
|
- Reduces latency from 550ms → 150-200ms
|
||||||
|
- Roughly 2.7-3.7x improvement
|
||||||
|
- Makes Ollama competitive with cloud APIs
|
||||||
|
|
||||||
|
**Option B: Keep separate (current)**
|
||||||
|
- Simpler deployment
|
||||||
|
- Better security isolation
|
||||||
|
- Accept 550ms latency
|
||||||
|
|
||||||
|
**Option C: Add Infinity container to MCP server**
|
||||||
|
- Best of both worlds
|
||||||
|
- Use Infinity for speed (local)
|
||||||
|
- Fallback to Ollama if needed
|
||||||
|
|
||||||
|
## Capacity Estimates
|
||||||
|
|
||||||
|
### Indexing Capacity
|
||||||
|
|
||||||
|
**Sustained Throughput**: 3-4 embeddings/sec (conservative)
|
||||||
|
|
||||||
|
| Document Count | Index Time | Notes |
|----------------|------------|-------|
| 1,000 | 4-5 min | Quick |
| 5,000 | 20-25 min | Reasonable |
| 10,000 | 40-50 min | Acceptable |
| 50,000 | 3.5-4.5 hours | Overnight job |
| 100,000 | 7-9 hours | Long batch |
| 500,000 | 35-45 hours | Not recommended |
|
||||||
|
|
||||||
|
**Incremental Updates** (10% change daily):
|
||||||
|
- 1,000 docs: ~30 sec
|
||||||
|
- 10,000 docs: ~5 min
|
||||||
|
- 50,000 docs: ~25 min
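
These estimates are plain division of document count by sustained throughput. A small sketch, assuming the 3-4 embeddings/sec range stated above and one embedding per document (documents split into several chunks would take proportionally longer); `index_time_range` is an illustrative name:

```python
def index_time_range(
    doc_count: int, slow_rate: float = 3.0, fast_rate: float = 4.0
) -> tuple[float, float]:
    """Return (best, worst) indexing time in minutes for a throughput range."""
    best = doc_count / fast_rate / 60
    worst = doc_count / slow_rate / 60
    return best, worst


for docs in (1_000, 10_000, 100_000):
    best, worst = index_time_range(docs)
    print(f"{docs:>7,} docs: {best:.0f}-{worst:.0f} min")

# Incremental sync, assuming 10% of 10,000 documents change per day
daily_best, daily_worst = index_time_range(int(10_000 * 0.10))
print(f"daily incremental: {daily_best:.1f}-{daily_worst:.1f} min")
```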
|
||||||
|
|
||||||
|
### Search Capacity
|
||||||
|
|
||||||
|
**Query Latency Budget**:
|
||||||
|
- Embedding: 550ms
|
||||||
|
- Vector search: 50-100ms
|
||||||
|
- Permission verification: 50-100ms
|
||||||
|
- **Total**: 650-750ms
|
||||||
|
|
||||||
|
**Concurrent Users** (assuming 1 search every 5 minutes):
|
||||||
|
- 10 users: 2 queries/min → Comfortable
|
||||||
|
- 50 users: 10 queries/min → Near limit
|
||||||
|
- 100 users: 20 queries/min → Over capacity
|
||||||
|
|
||||||
|
**Peak Load** (all users search at once):
|
||||||
|
- Parallelism: 5 concurrent requests complete in ~1.27 sec (measured)
- Queue time: Proportional to position
- 10 simultaneous: ~2.5 sec for last user
- 50 simultaneous: ~13 sec for last user
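
A rough way to reason about peak load is to treat requests as being served in waves. The sketch below assumes waves of 5 requests, each taking the ~1.27 s measured in the concurrency test; real queueing will be less regular, and `last_user_wait_s` is an illustrative name:

```python
import math


def last_user_wait_s(simultaneous: int, concurrency: int = 5, wave_s: float = 1.27) -> float:
    """Estimate when the last of `simultaneous` requests finishes."""
    waves = math.ceil(simultaneous / concurrency)  # full or partial waves needed
    return waves * wave_s


for n in (5, 10, 50):
    print(f"{n} simultaneous: ~{last_user_wait_s(n):.1f} s for the last user")
```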
|
||||||
|
|
||||||
|
## Recommendations
|
||||||
|
|
||||||
|
### Immediate Actions (Development)
|
||||||
|
|
||||||
|
1. **✅ Use Ollama as-is**
|
||||||
|
- Current setup is perfect for dev/testing
|
||||||
|
- No changes needed
|
||||||
|
- Start building semantic search
|
||||||
|
|
||||||
|
2. **Configuration**:
|
||||||
|
```bash
OLLAMA_URL=https://ollama.internal.coutinho.io
OLLAMA_MODEL=nomic-embed-text
VECTOR_SYNC_BATCH_SIZE=10
```
|
||||||
|
|
||||||
|
3. **Add Monitoring**:
|
||||||
|
```text
# Track these metrics
- embedding_latency_seconds (histogram)
- embedding_batch_size (gauge)
- embedding_errors_total (counter)
```
|
||||||
|
|
||||||
|
### Short-Term (Small Production)
|
||||||
|
|
||||||
|
1. **Optimize Batching**:
|
||||||
|
- Use batch size 10-12 (quality sweet spot)
|
||||||
|
- Process during off-hours
|
||||||
|
- Implement incremental sync
|
||||||
|
|
||||||
|
2. **Add Caching**:
|
||||||
|
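```python
# Cache common query embeddings.
# functools.lru_cache would cache the coroutine object (awaitable only once),
# so store the awaited results in a dict instead (unbounded; add eviction if needed).
_embedding_cache: dict[str, list[float]] = {}

async def embed_with_cache(query: str) -> list[float]:
    if query not in _embedding_cache:
        _embedding_cache[query] = await ollama.embed(query)
    return _embedding_cache[query]
```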
|
||||||
|
|
||||||
|
3. **Monitor Metrics**:
|
||||||
|
- P50, P95, P99 latency
|
||||||
|
- Throughput (embeddings/sec)
|
||||||
|
- Error rates
|
||||||
|
|
||||||
|
### Medium-Term (If Scaling Up)
|
||||||
|
|
||||||
|
1. **Add Infinity Container** (when >50 users or latency issues):
|
||||||
|
```yaml
services:
  infinity:
    image: michaelf34/infinity:latest
    # Local to MCP server - ~10-20ms latency
```
|
||||||
|
|
||||||
|
2. **Implement Tiered Fallback**:
|
||||||
|
```
|
||||||
|
Infinity (local, fast) → Ollama (remote, slower) → Local model
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Load Testing**:
|
||||||
|
- Simulate 50-100 concurrent users
|
||||||
|
- Measure actual throughput limits
|
||||||
|
- Identify breaking points
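
The tiered fallback from item 2 can be sketched as a loop over providers that returns the first successful result. This is a sketch under assumptions: the provider functions below are stand-ins, and `embed_with_fallback` is a hypothetical helper rather than existing server code.

```python
import asyncio
from typing import Awaitable, Callable

Embedder = Callable[[str], Awaitable[list[float]]]


async def embed_with_fallback(
    text: str, tiers: list[tuple[str, Embedder]]
) -> tuple[str, list[float]]:
    """Try each tier in order; return the first successful (tier_name, vector)."""
    errors = []
    for name, embed in tiers:
        try:
            return name, await embed(text)
        except Exception as exc:  # any provider failure moves on to the next tier
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all embedding tiers failed: " + "; ".join(errors))


# Stand-in providers for demonstration only.
async def infinity_down(text: str) -> list[float]:
    raise ConnectionError("infinity unreachable")


async def ollama_ok(text: str) -> list[float]:
    return [0.1, 0.2, 0.3]


tier_used, vector = asyncio.run(
    embed_with_fallback("hello", [("infinity", infinity_down), ("ollama", ollama_ok)])
)
print(tier_used)
```

Vectors produced by different models are not interchangeable, so a fallback tier is only useful if it serves the same model as the primary tier or writes to a separate index.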
|
||||||
|
|
||||||
|
### Long-Term (Enterprise Scale)
|
||||||
|
|
||||||
|
1. **Migrate to TEI Cluster** (when >100 users):
|
||||||
|
- GPU-accelerated
|
||||||
|
- Horizontal scaling
|
||||||
|
- <20ms latency
|
||||||
|
|
||||||
|
2. **Consider Managed Services**:
|
||||||
|
- Pinecone, Qdrant Cloud
|
||||||
|
- Removes operational burden
|
||||||
|
- Better SLAs
|
||||||
|
|
||||||
|
## Testing Recommendations
|
||||||
|
|
||||||
|
### Load Testing Script
|
||||||
|
|
||||||
|
```bash
# Test sustained load
for i in {1..100}; do
  curl -s https://ollama.internal.coutinho.io/api/embed \
    -d "{\"model\": \"nomic-embed-text\", \"input\": \"Test $i\"}" &

  # Rate limit: 5 concurrent
  if [ $(($i % 5)) -eq 0 ]; then
    wait
    sleep 1
  fi
done
```
|
||||||
|
|
||||||
|
### Metrics to Collect
|
||||||
|
|
||||||
|
1. **Latency Distribution**:
|
||||||
|
- P50 (median)
|
||||||
|
- P95 (acceptable)
|
||||||
|
- P99 (outliers)
|
||||||
|
|
||||||
|
2. **Throughput**:
|
||||||
|
- Embeddings/second
|
||||||
|
- Peak vs sustained
|
||||||
|
|
||||||
|
3. **Error Rates**:
|
||||||
|
- Timeouts
|
||||||
|
- Server errors
|
||||||
|
- Quality issues
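
Once per-request latencies have been collected, the percentiles and error rate listed above can be computed with the standard library. A minimal sketch, using made-up sample data; `summarize` is an illustrative helper name:

```python
import statistics


def summarize(latencies_ms: list[float], errors: int = 0) -> dict[str, float]:
    """Summarize a load-test run: P50/P95/P99 latency and error rate."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    total = len(latencies_ms) + errors
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "error_rate": errors / total,
    }


sample = [float(ms) for ms in range(500, 600)]  # made-up latencies: 500..599 ms
report = summarize(sample, errors=2)
print(report)
```

A `p95_ms` above 1000 would correspond to the migration trigger named in Scenario 3.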
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
|
**Your Ollama instance is ready for development and small production use!**
|
||||||
|
|
||||||
|
**Current Capacity**:
|
||||||
|
- ✅ Development (1-3 developers, up to 5k docs): Well within capacity
- ✅ Small prod (10-20 users, 10k-50k docs): Comfortable
- ⚠️ Medium prod (50-100 users, 100k+ docs): Monitoring needed
|
||||||
|
- ❌ Large prod (>100 users): Migrate to Infinity/TEI
|
||||||
|
|
||||||
|
**Key Strengths**:
|
||||||
|
- Fully operational
|
||||||
|
- Good parallelism
|
||||||
|
- Acceptable latency for most use cases
|
||||||
|
- Easy to integrate
|
||||||
|
|
||||||
|
**Key Limitations**:
|
||||||
|
- Network latency adds 300-400ms overhead
|
||||||
|
- Batch quality issues at >16 items
|
||||||
|
- Limited scalability beyond 50 users
|
||||||
|
|
||||||
|
**Recommendation**:
|
||||||
|
Start using Ollama immediately for development. Add monitoring and plan for Infinity when you approach 50 users or experience latency issues. The abstraction layer in ADR-003 should keep that migration straightforward.
|
||||||
|
|
||||||
|
**Next Steps**:
|
||||||
|
1. Configure MCP server with Ollama URL
|
||||||
|
2. Implement semantic search tools
|
||||||
|
3. Add basic monitoring
|
||||||
|
4. Test with real workload
|
||||||
|
5. Scale up as needed
|
||||||
@@ -0,0 +1,796 @@
|
|||||||
|
# Ollama Embeddings Investigation
|
||||||
|
|
||||||
|
**Date**: 2025-10-30
|
||||||
|
**Status**: Recommendation for Integration
|
||||||
|
|
||||||
|
## Executive Summary
|
||||||
|
|
||||||
|
Ollama provides a **local, self-hosted embedding solution** that is excellent for **development and small-scale deployments** but has **performance limitations** compared to specialized embedding inference engines (TEI, Infinity).
|
||||||
|
|
||||||
|
**Recommendation**: Include Ollama as **Tier 2 fallback** in our embedding strategy (after cloud APIs, before local sentence-transformers), prioritizing ease of setup over maximum performance.
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Ollama is primarily known as a local LLM runner but added embedding model support in version 0.1.26, making it a convenient option for generating vector embeddings without external API dependencies.
|
||||||
|
|
||||||
|
### Key Characteristics
|
||||||
|
|
||||||
|
- **Local & Self-Hosted**: No external API calls, full privacy
|
||||||
|
- **Easy Setup**: Single binary, simple model downloads (`ollama pull nomic-embed-text`)
|
||||||
|
- **Unified Platform**: Same tool for both LLMs and embeddings
|
||||||
|
- **OpenAI Compatible**: `/v1/embeddings` endpoint for drop-in replacement
|
||||||
|
- **Multi-Platform**: Linux, macOS, Windows support
|
||||||
|
- **GPU Support**: CUDA, ROCm, Metal acceleration
|
||||||
|
|
||||||
|
## API Details
|
||||||
|
|
||||||
|
### Endpoint Structure
|
||||||
|
|
||||||
|
**New API** (recommended):
|
||||||
|
```bash
|
||||||
|
POST http://localhost:11434/api/embed
|
||||||
|
```
|
||||||
|
|
||||||
|
**OpenAI Compatible**:
|
||||||
|
```bash
|
||||||
|
POST http://localhost:11434/v1/embeddings
|
||||||
|
```
|
||||||
|
|
||||||
|
**Legacy API** (deprecated):
|
||||||
|
```bash
|
||||||
|
POST http://localhost:11434/api/embeddings
|
||||||
|
```
|
||||||
|
|
||||||
|
### Request Format
|
||||||
|
|
||||||
|
**Single Text Embedding**:
|
||||||
|
```json
{
  "model": "nomic-embed-text",
  "input": "Text to embed"
}
```
|
||||||
|
|
||||||
|
**Batch Embedding** (since v0.2.0):
|
||||||
|
```json
{
  "model": "nomic-embed-text",
  "input": [
    "First text to embed",
    "Second text to embed",
    "Third text to embed"
  ]
}
```
|
||||||
|
|
||||||
|
### Response Format
|
||||||
|
|
||||||
|
```json
{
  "model": "nomic-embed-text",
  "embeddings": [
    [0.123, -0.456, 0.789, ...],  // 768 dimensions for nomic-embed-text
    [0.234, -0.567, 0.890, ...]
  ]
}
```
|
||||||
|
|
||||||
|
### Python Integration
|
||||||
|
|
||||||
|
```python
import ollama

# Single embedding
response = ollama.embed(
    model='nomic-embed-text',
    input='Text to embed'
)
embedding = response['embeddings'][0]

# Batch embeddings (more efficient)
response = ollama.embed(
    model='nomic-embed-text',
    input=[
        'First text',
        'Second text',
        'Third text'
    ]
)
embeddings = response['embeddings']
```
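
Once embeddings are available, semantic search usually comes down to comparing vectors, most commonly by cosine similarity. A minimal pure-Python sketch, using toy 3-dimensional vectors in place of real 768-dimensional ones; in practice the vector database performs this comparison, and `cosine_similarity` here is only illustrative:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for a query embedding and two document embeddings.
query = [1.0, 0.0, 1.0]
docs = {"doc-a": [1.0, 0.1, 0.9], "doc-b": [-1.0, 0.5, 0.0]}
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)
```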
|
||||||
|
|
||||||
|
## Available Models
|
||||||
|
|
||||||
|
### 1. nomic-embed-text (Recommended)
|
||||||
|
|
||||||
|
**Specifications**:
|
||||||
|
- **Parameters**: 137M
|
||||||
|
- **Dimensions**: 768
|
||||||
|
- **Context Length**: 8,192 tokens (2K effective)
|
||||||
|
- **Size**: 274MB
|
||||||
|
- **Architecture**: BERT-based
|
||||||
|
|
||||||
|
**Performance**:
|
||||||
|
- Reported by Nomic to outperform OpenAI `text-embedding-ada-002` and `text-embedding-3-small`
|
||||||
|
- Excellent for long-context tasks
|
||||||
|
- Strong general-purpose performance
|
||||||
|
|
||||||
|
**Use Cases**:
|
||||||
|
- General RAG applications
|
||||||
|
- Long document processing
|
||||||
|
- Semantic search
|
||||||
|
- Document clustering
|
||||||
|
|
||||||
|
**Pull Command**:
|
||||||
|
```bash
|
||||||
|
ollama pull nomic-embed-text
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. mxbai-embed-large
|
||||||
|
|
||||||
|
**Specifications**:
|
||||||
|
- **Parameters**: 334M
|
||||||
|
- **Dimensions**: 1,024
|
||||||
|
- **Context Length**: 512 tokens
|
||||||
|
- **Architecture**: BERT-large optimized
|
||||||
|
|
||||||
|
**Performance**:
|
||||||
|
- Claims to outperform commercial models
|
||||||
|
- Higher precision for complex queries
|
||||||
|
- Best quality but slower
|
||||||
|
|
||||||
|
**Use Cases**:
|
||||||
|
- High-precision semantic search
|
||||||
|
- Enterprise knowledge bases
|
||||||
|
- Multilingual content
|
||||||
|
|
||||||
|
**Pull Command**:
|
||||||
|
```bash
|
||||||
|
ollama pull mxbai-embed-large
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. all-minilm
|
||||||
|
|
||||||
|
**Specifications**:
|
||||||
|
- **Parameters**: 23M
|
||||||
|
- **Dimensions**: 384
|
||||||
|
- **Context Length**: 256 tokens
|
||||||
|
- **Size**: Smallest footprint
|
||||||
|
|
||||||
|
**Performance**:
|
||||||
|
- Fastest processing speed
|
||||||
|
- Good for sentence-level tasks
|
||||||
|
- Limited context window
|
||||||
|
|
||||||
|
**Use Cases**:
|
||||||
|
- Real-time applications
|
||||||
|
- Resource-constrained environments
|
||||||
|
- High-throughput scenarios
|
||||||
|
- Development/testing
|
||||||
|
|
||||||
|
**Pull Command**:
|
||||||
|
```bash
|
||||||
|
ollama pull all-minilm
|
||||||
|
```
|
||||||
|
|
||||||
|
## Performance Benchmarks
|
||||||
|
|
||||||
|
### Throughput Comparison
|
||||||
|
|
||||||
|
| Hardware | Model | Batch Size | Throughput | Notes |
|----------|-------|------------|------------|-------|
| RTX 4090 (24GB) | nomic-embed-text | 256 | 12,450 tok/sec | GPU-accelerated |
| RTX 4090 (24GB) | mxbai-embed-large | 128 | 8,920 tok/sec | GPU-accelerated |
| Intel i9-13900K (CPU) | nomic-embed-text | 32 | 3,250 tok/sec | CPU-only |
| Intel i9-13900K (CPU) | mxbai-embed-large | 16 | 2,180 tok/sec | CPU-only |
|
||||||
|
|
||||||
|
### Latency Comparison
|
||||||
|
|
||||||
|
**Single Request Latency** (RTX 4060):
|
||||||
|
- Ollama: ~99ms
|
||||||
|
- TEI: ~20ms (5x faster)
|
||||||
|
- Infinity: ~30-40ms (2.5-3x faster)
|
||||||
|
|
||||||
|
**Batch Processing**:
|
||||||
|
- Throughput-oriented batch size: 32-64 (model dependent)
- Quality issues reported with batches >16, so ≤16 is the safer default
- 2x slower than direct sentence-transformers usage
|
||||||
|
|
||||||
|
### Engine Comparison
|
||||||
|
|
||||||
|
Based on benchmarks from Baseten (2024):
|
||||||
|
|
||||||
|
| Engine | Relative Throughput | Notes |
|--------|---------------------|-------|
| BEI | 9.0x | Fastest (proprietary) |
| TEI | 4.5x | Open source, Rust-based |
| Infinity | 3.5x | PyTorch/ONNX optimized |
| vLLM | 3.0x | General LLM inference |
| **Ollama** | **1.0x** (baseline) | Slowest for embeddings |
|
||||||
|
|
||||||
|
**Key Insight**: Ollama is **3.5-9x slower** than specialized embedding engines (per the table above) but trades performance for ease of use and a unified platform.
|
||||||
|
|
||||||
|
## Integration Implementation
|
||||||
|
|
||||||
|
### Python Client Wrapper
|
||||||
|
|
||||||
|
```python
# nextcloud_mcp_server/embeddings/ollama.py
import httpx
from typing import List


class OllamaEmbedding:
    """Ollama embedding provider"""

    def __init__(
        self,
        base_url: str = "http://localhost:11434",
        model: str = "nomic-embed-text"
    ):
        self.base_url = base_url.rstrip("/")
        self.model = model
        self.client = httpx.AsyncClient(timeout=60.0)

        # Model dimension mapping
        self.dimensions = {
            "nomic-embed-text": 768,
            "mxbai-embed-large": 1024,
            "all-minilm": 384
        }
        # Strip any tag suffix (e.g. ":latest") before the lookup
        self.dimension = self.dimensions.get(model.split(":")[0], 768)

    async def embed(self, text: str) -> List[float]:
        """Generate embedding for single text"""
        response = await self.client.post(
            f"{self.base_url}/api/embed",
            json={
                "model": self.model,
                "input": text
            }
        )
        response.raise_for_status()
        data = response.json()
        return data["embeddings"][0]

    async def embed_batch(
        self,
        texts: List[str],
        batch_size: int = 16
    ) -> List[List[float]]:
        """
        Generate embeddings for multiple texts in batches.

        Note: Ollama has reported quality issues with batch sizes >16,
        so batch_size defaults to 16 but remains configurable.
        """
        all_embeddings = []

        # Process in chunks to avoid batch size issues
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]

            response = await self.client.post(
                f"{self.base_url}/api/embed",
                json={
                    "model": self.model,
                    "input": batch
                }
            )
            response.raise_for_status()
            data = response.json()
            all_embeddings.extend(data["embeddings"])

        return all_embeddings

    async def check_health(self) -> bool:
        """Check if Ollama server is running and model is available"""
        try:
            # Check if server is up
            response = await self.client.get(f"{self.base_url}/api/tags")
            response.raise_for_status()

            # Check if model is pulled. /api/tags lists names with a tag
            # suffix (e.g. "nomic-embed-text:latest"), so accept both forms.
            models = response.json().get("models", [])
            model_names = {m["name"] for m in models}
            model_names |= {name.split(":")[0] for name in model_names}

            if self.model not in model_names:
                raise ValueError(
                    f"Model '{self.model}' not found. "
                    f"Run: ollama pull {self.model}"
                )

            return True

        except Exception as e:
            raise ConnectionError(f"Ollama health check failed: {e}") from e

    async def close(self):
        """Close HTTP client"""
        await self.client.aclose()
```
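
The chunking loop inside `embed_batch` can be looked at in isolation, without any network calls. A small sketch, assuming a batch size of 16; `chunked` is an illustrative helper and not part of the wrapper above:

```python
from typing import Iterator


def chunked(texts: list[str], batch_size: int = 16) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` texts, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


notes = [f"note {i}" for i in range(40)]
batches = list(chunked(notes, batch_size=16))
print([len(b) for b in batches])
```

Because the slices are taken in order, concatenating the per-batch results yields embeddings aligned with the input list.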
|
||||||
|
|
||||||
|
### Auto-Detection in Embedding Service
|
||||||
|
|
||||||
|
```python
# nextcloud_mcp_server/embeddings/service.py
from typing import Optional
import asyncio
import os
import logging

logger = logging.getLogger(__name__)


class EmbeddingService:
    """Unified embedding service with automatic provider detection"""

    def __init__(self):
        self.provider = None
        self._detect_provider()

    def _detect_provider(self):
        """Auto-detect available embedding provider"""

        # Tier 1: OpenAI API (best quality)
        if os.getenv("OPENAI_API_KEY"):
            from .openai import OpenAIEmbedding
            self.provider = OpenAIEmbedding(
                model=os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small"),
                api_key=os.getenv("OPENAI_API_KEY")
            )
            logger.info("✓ Using OpenAI embeddings")
            return

        # Tier 2a: Infinity (optimized self-hosted)
        if os.getenv("INFINITY_URL"):
            from .infinity import InfinityEmbedding
            try:
                self.provider = InfinityEmbedding(
                    url=os.getenv("INFINITY_URL"),
                    model=os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
                )
                logger.info("✓ Using Infinity embeddings (optimized)")
                return
            except Exception as e:
                logger.warning(f"Infinity unavailable: {e}")

        # Tier 2b: Ollama (easy self-hosted)
        if os.getenv("OLLAMA_URL"):
            from .ollama import OllamaEmbedding
            try:
                self.provider = OllamaEmbedding(
                    base_url=os.getenv("OLLAMA_URL", "http://localhost:11434"),
                    model=os.getenv("OLLAMA_MODEL", "nomic-embed-text")
                )
                # Verify Ollama is running and model is available.
                # asyncio.run() raises RuntimeError if an event loop is already
                # running, so this constructor must be called from sync code.
                asyncio.run(self.provider.check_health())
                logger.info("✓ Using Ollama embeddings (easy setup)")
                return
            except Exception as e:
                logger.warning(f"Ollama unavailable: {e}")

        # Tier 3: Local model (fallback)
        logger.warning("No cloud/hosted embeddings available, using local model")
        from .local import LocalEmbedding
        self.provider = LocalEmbedding(
            model=os.getenv("LOCAL_EMBEDDING_MODEL", "all-MiniLM-L6-v2")
        )
        logger.info("✓ Using local embeddings (CPU fallback)")

    async def embed(self, text: str):
        """Generate embedding for text"""
        return await self.provider.embed(text)

    async def embed_batch(self, texts: list[str]):
        """Generate embeddings for multiple texts"""
        return await self.provider.embed_batch(texts)

    @property
    def dimension(self) -> int:
        """Get embedding dimension"""
        return self.provider.dimension
```
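
Calling `asyncio.run()` from a constructor only works when no event loop is running, which may not hold inside an async server. One alternative is an async factory method that awaits the health checks directly. The sketch below uses a stand-in provider class; `AsyncEmbeddingService` and `StubProvider` are hypothetical names used for illustration only:

```python
import asyncio


class StubProvider:
    """Stand-in for a real provider; only the health check is modeled."""

    def __init__(self, healthy: bool):
        self.healthy = healthy

    async def check_health(self) -> bool:
        if not self.healthy:
            raise ConnectionError("provider unreachable")
        return True


class AsyncEmbeddingService:
    def __init__(self, provider):
        self.provider = provider

    @classmethod
    async def create(cls, candidates: list) -> "AsyncEmbeddingService":
        """Return a service bound to the first candidate that passes its health check."""
        for provider in candidates:
            try:
                await provider.check_health()
                return cls(provider)
            except Exception:
                continue  # try the next tier
        raise RuntimeError("no healthy embedding provider")


async def main() -> AsyncEmbeddingService:
    # Inside a running loop, the factory is simply awaited.
    return await AsyncEmbeddingService.create([StubProvider(False), StubProvider(True)])


service = asyncio.run(main())
print(service.provider.healthy)
```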
|
||||||
|
|
||||||
|
### Docker Compose Configuration
|
||||||
|
|
||||||
|
```yaml
services:
  # Ollama embedding service
  ollama:
    image: ollama/ollama:latest
    restart: always
    ports:
      - 127.0.0.1:11434:11434
    volumes:
      - ollama_models:/root/.ollama
    # Optional: GPU support
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # Pull models on startup
    entrypoint: ["/bin/sh", "-c"]
    command:
      - |
        ollama serve &
        sleep 5
        ollama pull nomic-embed-text
        wait

  # MCP Server with Ollama embeddings
  mcp:
    build: .
    depends_on:
      - ollama
    environment:
      # ... other vars ...
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=nomic-embed-text

  # Vector sync worker
  mcp-vector-sync:
    build: .
    command: ["python", "-m", "nextcloud_mcp_server.sync.vector_indexer"]
    depends_on:
      - ollama
      - qdrant  # qdrant service not shown here
    environment:
      # ... other vars ...
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=nomic-embed-text

volumes:
  ollama_models:
```
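
The fixed `sleep 5` in the startup command assumes the server is ready within five seconds. A more robust pattern on the client side is to poll until the server answers. The sketch below is generic: `wait_until_ready` is a hypothetical helper, and the probe is injected so the example runs without a server. A real probe might issue a GET against `/api/tags`.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool], timeout_s: float = 30.0, interval_s: float = 0.5
) -> bool:
    """Poll `probe` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # e.g. connection refused: server not up yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Demonstration with a fake probe that succeeds on the third call.
attempts = {"count": 0}


def fake_probe() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


ready = wait_until_ready(fake_probe, timeout_s=5.0, interval_s=0.01)
print(ready, attempts["count"])
```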
|
||||||
|
|
||||||
|
## Advantages of Ollama
|
||||||
|
|
||||||
|
### 1. **Ease of Setup**
|
||||||
|
|
||||||
|
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull embedding model
ollama pull nomic-embed-text

# Done! API available at localhost:11434
```
|
||||||
|
|
||||||
|
No complex configuration, no Docker registries, no model conversion.
|
||||||
|
|
||||||
|
### 2. **Privacy & Data Sovereignty**
|
||||||
|
|
||||||
|
- All processing happens locally
|
||||||
|
- No data leaves your infrastructure
|
||||||
|
- No API keys or external dependencies
|
||||||
|
- Ideal for sensitive content (medical, legal, financial)
|
||||||
|
|
||||||
|
### 3. **Unified Platform**
|
||||||
|
|
||||||
|
- Same tool for LLMs and embeddings
|
||||||
|
- Consistent API across model types
|
||||||
|
- Single point of management
|
||||||
|
- Simplified operations
|
||||||
|
|
||||||
|
### 4. **Developer Experience**
|
||||||
|
|
||||||
|
- Simple API (similar to OpenAI)
|
||||||
|
- Good documentation
|
||||||
|
- Active community
|
||||||
|
- Framework integrations (LangChain, LlamaIndex)
|
||||||
|
|
||||||
|
### 5. **Cost**
|
||||||
|
|
||||||
|
- Free and open source
|
||||||
|
- No per-token API costs
|
||||||
|
- Only infrastructure costs (compute)
|
||||||
|
|
||||||
|
### 6. **Model Variety**
|
||||||
|
|
||||||
|
Growing library of embedding models:
|
||||||
|
- nomic-embed-text (general purpose)
|
||||||
|
- mxbai-embed-large (high quality)
|
||||||
|
- all-minilm (fast)
|
||||||
|
- More models added regularly
|
||||||
|
|
||||||
|
## Limitations of Ollama
|
||||||
|
|
||||||
|
### 1. **Performance**
|
||||||
|
|
||||||
|
- **5-9x slower** than specialized engines (TEI, Infinity)
|
||||||
|
- Not optimized specifically for embedding inference
|
||||||
|
- Batch processing issues at larger batch sizes (>16)
|
||||||
|
- Higher latency compared to alternatives
|
||||||
|
|
||||||
|
### 2. **Scalability**
|
||||||
|
|
||||||
|
- Single-instance deployment (no native clustering)
|
||||||
|
- Limited concurrent request handling
|
||||||
|
- Not designed for high-throughput production
|
||||||
|
- Resource usage per request is higher
|
||||||
|
|
||||||
|
### 3. **Batch Processing Issues**
|
||||||
|
|
||||||
|
- Quality degradation reported with large batches
|
||||||
|
- Optimal batch size: 32-64 (conservative)
|
||||||
|
- Less efficient than specialized engines
|
||||||
|
- GitHub issues tracking batch problems (#6262)
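One way to stay clear of the large-batch problems listed above is to cap the batch size on the client side. The sketch below splits the input into fixed-size chunks and calls a batch-embedding function once per chunk. `embed_in_chunks`, `chunked`, and the `embed_batch` callable are hypothetical names, and the default of 16 is an assumption taken from the ">16" figure earlier in this document, not a measured optimum.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_chunks(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    max_batch_size: int = 16,  # assumption: conservative cap, tune per deployment
) -> List[List[float]]:
    """Embed `texts` in order, never sending more than max_batch_size at once."""
    vectors: List[List[float]] = []
    for chunk in chunked(texts, max_batch_size):
        result = embed_batch(list(chunk))
        # Guard against a backend silently dropping items from a batch
        if len(result) != len(chunk):
            raise RuntimeError(
                f"expected {len(chunk)} embeddings, got {len(result)}"
            )
        vectors.extend(result)
    return vectors
```

The length check is a design choice: it turns a silent batch failure into an explicit error that the indexing worker can retry with a smaller batch.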
|
||||||
|
|
||||||
|
### 4. **Resource Usage**
|
||||||
|
|
||||||
|
- Models stay loaded in memory (VRAM/RAM)
|
||||||
|
- Higher memory footprint per model
|
||||||
|
- GPU context switching overhead
|
||||||
|
- Not as memory-efficient as specialized engines
|
||||||
|
|
||||||
|
### 5. **Production Features**
|
||||||
|
|
||||||
|
- No built-in load balancing
|
||||||
|
- Limited monitoring/metrics
|
||||||
|
- No automatic scaling
|
||||||
|
- Basic error handling
|
||||||
|
|
||||||
|
## Use Case Recommendations
|
||||||
|
|
||||||
|
### ✅ **Excellent For:**
|
||||||
|
|
||||||
|
1. **Development & Testing**
|
||||||
|
- Quick setup for prototyping
|
||||||
|
- Local development environments
|
||||||
|
- Testing embedding pipelines
|
||||||
|
|
||||||
|
2. **Small Deployments**
|
||||||
|
- <10 users
|
||||||
|
- <10,000 documents
|
||||||
|
- Infrequent searches (<100/day)
|
||||||
|
- Hobbyist/personal projects
|
||||||
|
|
||||||
|
3. **Privacy-Critical Applications**
|
||||||
|
- Medical/healthcare records
|
||||||
|
- Legal documents
|
||||||
|
- Financial data
|
||||||
|
- Air-gapped environments
|
||||||
|
|
||||||
|
4. **Unified LLM Stack**
|
||||||
|
- Projects already using Ollama for LLMs
|
||||||
|
- Simplified operations
|
||||||
|
- Consistent tooling
|
||||||
|
|
||||||
|
5. **Educational/Learning**
|
||||||
|
- Teaching RAG concepts
|
||||||
|
- Learning embeddings
|
||||||
|
- Hackathons/workshops
|
||||||
|
|
||||||
|
### ⚠️ **Consider Alternatives For:**
|
||||||
|
|
||||||
|
1. **Production at Scale**
|
||||||
|
- >100 users
|
||||||
|
- >100,000 documents
|
||||||
|
- High query volume (>1000/day)
|
||||||
|
- Use: TEI or Infinity
|
||||||
|
|
||||||
|
2. **Performance-Critical**
|
||||||
|
- Real-time search (<50ms latency)
|
||||||
|
- High-throughput batch processing
|
||||||
|
- Use: TEI with GPU
|
||||||
|
|
||||||
|
3. **Enterprise Deployments**
|
||||||
|
- Need for high availability
|
||||||
|
- Load balancing requirements
|
||||||
|
- Advanced monitoring
|
||||||
|
- Use: Managed services or TEI cluster
|
||||||
|
|
||||||
|
4. **Large-Scale Indexing**
|
||||||
|
- Millions of documents
|
||||||
|
- Continuous high-volume ingestion
|
||||||
|
- Use: Infinity or commercial solutions
|
||||||
|
|
||||||
|
## Integration Strategy
|
||||||
|
|
||||||
|
### Recommended Tier Placement
|
||||||
|
|
||||||
|
**Update ADR-003 embedding strategy:**
|
||||||
|
|
||||||
|
```
Tier 1: OpenAI API (best quality, requires API key)
  ↓ fallback
Tier 2a: Infinity (optimized self-hosted, complex setup)
  ↓ fallback
Tier 2b: Ollama (easy self-hosted, moderate performance) ← NEW
  ↓ fallback
Tier 3: Local sentence-transformers (CPU fallback, simplest)
```
|
||||||
|
|
||||||
|
### Configuration
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Option 1: Use Infinity (if available)
|
||||||
|
INFINITY_URL=http://infinity:7997
|
||||||
|
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
|
||||||
|
|
||||||
|
# Option 2: Use Ollama (if Infinity unavailable)
|
||||||
|
OLLAMA_URL=http://ollama:11434
|
||||||
|
OLLAMA_MODEL=nomic-embed-text
|
||||||
|
|
||||||
|
# Option 3: Use local model (automatic fallback)
|
||||||
|
# No configuration needed
|
||||||
|
```
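The options above imply a selection order that follows the tier list. The sketch below shows one possible way to resolve the provider from environment variables. `INFINITY_URL` and `OLLAMA_URL` come from this document; `OPENAI_API_KEY` and the function name `select_embedding_provider` are assumptions for illustration, and the real server may resolve providers differently.

```python
import os
from typing import Mapping, Optional


def select_embedding_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first configured provider, following the tier order.

    Order: OpenAI (Tier 1), Infinity (Tier 2a), Ollama (Tier 2b),
    local sentence-transformers (Tier 3, needs no configuration).
    """
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):  # assumed variable name, not from the source
        return "openai"
    if env.get("INFINITY_URL"):
        return "infinity"
    if env.get("OLLAMA_URL"):
        return "ollama"
    return "local"
```

With only `OLLAMA_URL=http://ollama:11434` set, the function returns `"ollama"`; with nothing set it falls through to `"local"`.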
|
||||||
|
|
||||||
|
### When to Choose Ollama
|
||||||
|
|
||||||
|
**Choose Ollama if**:
|
||||||
|
- You're already using Ollama for LLMs
|
||||||
|
- You need privacy/data sovereignty
|
||||||
|
- You have <10k documents and <100 users
|
||||||
|
- Ease of setup is more important than max performance
|
||||||
|
- You're in development/testing phase
|
||||||
|
|
||||||
|
**Choose Infinity/TEI if**:
|
||||||
|
- You need maximum throughput (>1000 embeddings/sec)
|
||||||
|
- You have >100k documents
|
||||||
|
- Latency is critical (<50ms)
|
||||||
|
- You're in production with >100 users
|
||||||
|
|
||||||
|
**Choose OpenAI API if**:
|
||||||
|
- You're okay with cloud dependencies
|
||||||
|
- You need best-in-class quality
|
||||||
|
- Cost is not a concern (~$0.02 per 1M tokens)
|
||||||
|
|
||||||
|
## Production Deployment Guidance
|
||||||
|
|
||||||
|
### Small Production (Ollama Acceptable)
|
||||||
|
|
||||||
|
**Profile**:
|
||||||
|
- 5-20 users
|
||||||
|
- 1,000-10,000 documents
|
||||||
|
- 50-200 searches/day
|
||||||
|
- <2 sec acceptable latency
|
||||||
|
|
||||||
|
**Configuration**:
|
||||||
|
```yaml
ollama:
  image: ollama/ollama:latest
  deploy:
    resources:
      limits:
        memory: 4GB
        cpus: "2.0"
      reservations:
        devices:
          - driver: nvidia  # GPU if available
            count: 1
            capabilities: [gpu]
  environment:
    - OLLAMA_NUM_PARALLEL=2  # Concurrent requests
```
|
||||||
|
|
||||||
|
**Expected Performance**:
|
||||||
|
- Embedding latency: 100-200ms
|
||||||
|
- Throughput: 5-10 embeddings/sec
|
||||||
|
- Memory: 2-3GB (model loaded)
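To put the throughput figure in context, the following sketch estimates how long an initial index of the small-production corpus would take. It assumes one embedding per document unless `chunks_per_doc` says otherwise; real documents are often split into several chunks, so treat the result as a lower bound. The function name and parameters are hypothetical.

```python
def indexing_time_seconds(num_docs: int, embeddings_per_sec: float,
                          chunks_per_doc: int = 1) -> float:
    """Estimate wall-clock seconds to embed a corpus at a steady throughput."""
    if embeddings_per_sec <= 0:
        raise ValueError("embeddings_per_sec must be positive")
    return (num_docs * chunks_per_doc) / embeddings_per_sec


# 10,000 documents at the low and high end of the expected throughput
slow = indexing_time_seconds(10_000, 5)   # 2000.0 seconds, about 33 minutes
fast = indexing_time_seconds(10_000, 10)  # 1000.0 seconds, about 17 minutes
print(f"{slow / 60:.1f} min to {fast / 60:.1f} min")
```

At these rates a full re-index of the largest small-production corpus fits comfortably in a background job, which is consistent with the profile above.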
|
||||||
|
|
||||||
|
### Medium Production (Use Infinity/TEI)
|
||||||
|
|
||||||
|
**Profile**:
|
||||||
|
- 20-200 users
|
||||||
|
- 10,000-1M documents
|
||||||
|
- 500-5,000 searches/day
|
||||||
|
- <500ms acceptable latency
|
||||||
|
|
||||||
|
**Recommendation**: Migrate to Infinity or TEI
|
||||||
|
```yaml
infinity:
  image: michaelf34/infinity:latest
  # Better throughput and latency
```
|
||||||
|
|
||||||
|
### Large Production (Use Specialized Solution)
|
||||||
|
|
||||||
|
**Profile**:
|
||||||
|
- >200 users
|
||||||
|
- >1M documents
|
||||||
|
- >5,000 searches/day
|
||||||
|
- <100ms required latency
|
||||||
|
|
||||||
|
**Recommendation**: Use TEI cluster or commercial service
|
||||||
|
|
||||||
|
## Monitoring Considerations
|
||||||
|
|
||||||
|
### Key Metrics to Track
|
||||||
|
|
||||||
|
```python
# Add Ollama-specific metrics
from prometheus_client import Histogram, Counter, Gauge

ollama_embedding_latency = Histogram(
    'ollama_embedding_duration_seconds',
    'Ollama embedding generation time',
    ['model', 'batch_size']
)

ollama_batch_size = Gauge(
    'ollama_batch_size',
    'Current batch size being processed'
)

ollama_errors = Counter(
    'ollama_errors_total',
    'Ollama embedding errors',
    ['error_type']
)
```
|
||||||
|
|
||||||
|
### Health Checks
|
||||||
|
|
||||||
|
```python
import httpx


async def ollama_health_check():
    """Check that Ollama is reachable and the embedding model has been pulled."""
    try:
        async with httpx.AsyncClient() as client:
            # Check server
            response = await client.get("http://ollama:11434/api/tags")
            response.raise_for_status()

            # Verify model is pulled. /api/tags reports tagged names such as
            # "nomic-embed-text:latest", so compare without the tag.
            models = response.json().get("models", [])
            if "nomic-embed-text" not in [m["name"].split(":")[0] for m in models]:
                return False, "Model not pulled"

            return True, "OK"
    except Exception as e:
        return False, str(e)
```
|
||||||
|
|
||||||
|
## Migration Path
|
||||||
|
|
||||||
|
### Starting with Ollama
|
||||||
|
|
||||||
|
**Phase 1: Development** (Ollama)
|
||||||
|
- Use Ollama for initial development
|
||||||
|
- Validate embedding pipeline
|
||||||
|
- Test search quality
|
||||||
|
|
||||||
|
**Phase 2: Growth** (Ollama → Infinity)
|
||||||
|
- Monitor performance metrics
|
||||||
|
- When >50 users or >10k docs, migrate to Infinity
|
||||||
|
- Simple config change, no code changes
|
||||||
|
|
||||||
|
**Phase 3: Scale** (Infinity → TEI/Commercial)
|
||||||
|
- When >200 users or performance issues
|
||||||
|
- Consider TEI cluster or managed services
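The phase thresholds above can be encoded as a small decision helper, for example to drive an operational alert. This is a sketch only: the function name and return labels are hypothetical, and the "performance issues" trigger for Phase 3 is not modelled because it depends on measurements rather than counts.

```python
def recommend_engine(users: int, documents: int) -> str:
    """Map deployment size to the phase thresholds described above."""
    if users > 200:
        # Phase 3: scale out
        return "tei-or-commercial"
    if users > 50 or documents > 10_000:
        # Phase 2: growth
        return "infinity"
    # Phase 1: development and small deployments
    return "ollama"
```

For example, `recommend_engine(20, 5_000)` stays on Ollama, while `recommend_engine(20, 50_000)` suggests moving to Infinity because the document count crosses the 10k threshold.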
|
||||||
|
|
||||||
|
### Code Compatibility
|
||||||
|
|
||||||
|
All embedding providers use the same interface:
|
||||||
|
```python
|
||||||
|
# Works with Ollama, Infinity, OpenAI, Local
|
||||||
|
embedding = await embedding_service.embed(text)
|
||||||
|
embeddings = await embedding_service.embed_batch(texts)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Migration is a configuration change only** - no code rewrite needed.
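To illustrate why a shared interface makes migration a configuration change, here is a minimal sketch of what such an abstraction could look like. The method names `embed` and `embed_batch` come from the snippet above; the `EmbeddingService` protocol, the `FakeEmbeddingService` stand-in, and `index_texts` are assumptions made for this example and are not the project's actual classes.

```python
import asyncio
from typing import List, Protocol, Sequence


class EmbeddingService(Protocol):
    """Structural interface every provider is assumed to satisfy."""

    async def embed(self, text: str) -> List[float]: ...

    async def embed_batch(self, texts: Sequence[str]) -> List[List[float]]: ...


class FakeEmbeddingService:
    """Deterministic stand-in used here in place of a real provider."""

    def __init__(self, dimensions: int = 768) -> None:
        self.dimensions = dimensions

    async def embed(self, text: str) -> List[float]:
        # Not a real embedding: every component is the text length
        return [float(len(text))] * self.dimensions

    async def embed_batch(self, texts: Sequence[str]) -> List[List[float]]:
        return [await self.embed(t) for t in texts]


async def index_texts(service: EmbeddingService,
                      texts: Sequence[str]) -> List[List[float]]:
    """Caller code depends only on the interface, not on the provider."""
    return await service.embed_batch(texts)


vectors = asyncio.run(index_texts(FakeEmbeddingService(dimensions=4), ["a", "bb"]))
```

Because `index_texts` only relies on the two methods, swapping the stand-in for an Ollama-backed or Infinity-backed implementation would not require touching the caller.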
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
|
**Ollama is a solid choice for:**
|
||||||
|
- Early-stage projects
|
||||||
|
- Development/testing
|
||||||
|
- Privacy-critical applications
|
||||||
|
- Small deployments (<10 users, <10k docs)
|
||||||
|
- Unified LLM + embedding stack
|
||||||
|
|
||||||
|
**But recognize its limitations:**
|
||||||
|
- 5-9x slower than specialized engines
|
||||||
|
- Not designed for high-throughput production
|
||||||
|
- Batch processing can be problematic
|
||||||
|
- Limited scalability
|
||||||
|
|
||||||
|
**Recommendation**:
|
||||||
|
✅ **Include Ollama as Tier 2b** (after Infinity, before local models) in the embedding strategy. It provides a good balance of ease-of-use and privacy for small-to-medium deployments while allowing seamless migration to more performant engines as needs grow.
|
||||||
|
|
||||||
|
The key is designing the abstraction layer (as done in ADR-003) so migration between engines requires only configuration changes, not code rewrites.
|
||||||
@@ -3,8 +3,8 @@ Tests for Dynamic Client Registration (DCR) token_type parameter.
|
|||||||
|
|
||||||
These tests verify that the Nextcloud OIDC server properly honors the token_type
|
These tests verify that the Nextcloud OIDC server properly honors the token_type
|
||||||
parameter during client registration, issuing the correct type of access tokens:
|
parameter during client registration, issuing the correct type of access tokens:
|
||||||
- token_type="JWT" → JWT-formatted tokens (RFC 9068)
|
- token_type="jwt" → JWT-formatted tokens (RFC 9068)
|
||||||
- token_type="Bearer" → Opaque tokens (standard OAuth2)
|
- token_type="opaque" → Opaque tokens (standard OAuth2)
|
||||||
|
|
||||||
This is critical for ensuring:
|
This is critical for ensuring:
|
||||||
1. Client choice is respected by the OIDC server
|
1. Client choice is respected by the OIDC server
|
||||||
@@ -208,12 +208,14 @@ async def test_dcr_respects_jwt_token_type(
|
|||||||
oauth_callback_server,
|
oauth_callback_server,
|
||||||
):
|
):
|
||||||
"""
|
"""
|
||||||
Test that DCR honors token_type=JWT and issues JWT-formatted tokens.
|
Test that DCR honors token_type=jwt and issues JWT-formatted tokens.
|
||||||
|
|
||||||
This verifies:
|
This verifies:
|
||||||
1. Client registration with token_type="JWT" succeeds
|
1. Client registration with token_type="jwt" succeeds
|
||||||
2. Tokens obtained via this client are JWT format (base64.base64.signature)
|
2. Tokens obtained via this client are JWT format (base64.base64.signature)
|
||||||
3. JWT payload contains expected claims (sub, iss, scope, etc.)
|
3. JWT payload contains expected claims (sub, iss, scope, etc.)
|
||||||
|
|
||||||
|
Note: The OIDC app uses lowercase 'jwt' (not 'JWT').
|
||||||
"""
|
"""
|
||||||
nextcloud_host = os.getenv("NEXTCLOUD_HOST")
|
nextcloud_host = os.getenv("NEXTCLOUD_HOST")
|
||||||
if not nextcloud_host:
|
if not nextcloud_host:
|
||||||
@@ -232,15 +234,15 @@ async def test_dcr_respects_jwt_token_type(
|
|||||||
token_endpoint = oidc_config.get("token_endpoint")
|
token_endpoint = oidc_config.get("token_endpoint")
|
||||||
authorization_endpoint = oidc_config.get("authorization_endpoint")
|
authorization_endpoint = oidc_config.get("authorization_endpoint")
|
||||||
|
|
||||||
# Register client with token_type="JWT"
|
# Register client with token_type="jwt"
|
||||||
logger.info("Registering OAuth client with token_type=JWT...")
|
logger.info("Registering OAuth client with token_type=jwt...")
|
||||||
client_info = await register_client(
|
client_info = await register_client(
|
||||||
nextcloud_url=nextcloud_host,
|
nextcloud_url=nextcloud_host,
|
||||||
registration_endpoint=registration_endpoint,
|
registration_endpoint=registration_endpoint,
|
||||||
client_name="DCR Test - JWT Token Type",
|
client_name="DCR Test - JWT Token Type",
|
||||||
redirect_uris=[callback_url],
|
redirect_uris=[callback_url],
|
||||||
scopes="openid profile email notes:read notes:write",
|
scopes="openid profile email notes:read notes:write",
|
||||||
token_type="JWT",
|
token_type="jwt",
|
||||||
)
|
)
|
||||||
|
|
||||||
logger.info(f"Registered JWT client: {client_info.client_id[:16]}...")
|
logger.info(f"Registered JWT client: {client_info.client_id[:16]}...")
|
||||||
@@ -278,7 +280,7 @@ async def test_dcr_respects_jwt_token_type(
|
|||||||
assert "notes:write" in scopes, "JWT scope claim missing notes:write"
|
assert "notes:write" in scopes, "JWT scope claim missing notes:write"
|
||||||
|
|
||||||
logger.info(
|
logger.info(
|
||||||
f"✅ DCR with token_type=JWT works correctly! "
|
f"✅ DCR with token_type=jwt works correctly! "
|
||||||
f"Token is JWT format with scope claim: {payload['scope']}"
|
f"Token is JWT format with scope claim: {payload['scope']}"
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -290,12 +292,14 @@ async def test_dcr_respects_bearer_token_type(
|
|||||||
oauth_callback_server,
|
oauth_callback_server,
|
||||||
):
|
):
|
||||||
"""
|
"""
|
||||||
Test that DCR honors token_type=Bearer and issues opaque tokens.
|
Test that DCR honors token_type=opaque and issues opaque tokens.
|
||||||
|
|
||||||
This verifies:
|
This verifies:
|
||||||
1. Client registration with token_type="Bearer" succeeds
|
1. Client registration with token_type="opaque" succeeds
|
||||||
2. Tokens obtained via this client are opaque (NOT JWT format)
|
2. Tokens obtained via this client are opaque (NOT JWT format)
|
||||||
3. Opaque tokens are simple strings, not base64-encoded structures
|
3. Opaque tokens are simple strings, not base64-encoded structures
|
||||||
|
|
||||||
|
Note: The OIDC app uses 'opaque' or 'jwt' as token_type values (not 'Bearer').
|
||||||
"""
|
"""
|
||||||
nextcloud_host = os.getenv("NEXTCLOUD_HOST")
|
nextcloud_host = os.getenv("NEXTCLOUD_HOST")
|
||||||
if not nextcloud_host:
|
if not nextcloud_host:
|
||||||
@@ -314,18 +318,18 @@ async def test_dcr_respects_bearer_token_type(
|
|||||||
token_endpoint = oidc_config.get("token_endpoint")
|
token_endpoint = oidc_config.get("token_endpoint")
|
||||||
authorization_endpoint = oidc_config.get("authorization_endpoint")
|
authorization_endpoint = oidc_config.get("authorization_endpoint")
|
||||||
|
|
||||||
# Register client with token_type="Bearer" (opaque tokens)
|
# Register client with token_type="opaque" (opaque tokens)
|
||||||
logger.info("Registering OAuth client with token_type=Bearer...")
|
logger.info("Registering OAuth client with token_type=opaque...")
|
||||||
client_info = await register_client(
|
client_info = await register_client(
|
||||||
nextcloud_url=nextcloud_host,
|
nextcloud_url=nextcloud_host,
|
||||||
registration_endpoint=registration_endpoint,
|
registration_endpoint=registration_endpoint,
|
||||||
client_name="DCR Test - Bearer Token Type",
|
client_name="DCR Test - Opaque Token Type",
|
||||||
redirect_uris=[callback_url],
|
redirect_uris=[callback_url],
|
||||||
scopes="openid profile email notes:read notes:write",
|
scopes="openid profile email notes:read notes:write",
|
||||||
token_type="Bearer",
|
token_type="opaque",
|
||||||
)
|
)
|
||||||
|
|
||||||
logger.info(f"Registered Bearer client: {client_info.client_id[:16]}...")
|
logger.info(f"Registered Opaque token client: {client_info.client_id[:16]}...")
|
||||||
|
|
||||||
# Obtain token via OAuth flow
|
# Obtain token via OAuth flow
|
||||||
access_token = await get_oauth_token_with_client(
|
access_token = await get_oauth_token_with_client(
|
||||||
@@ -353,7 +357,7 @@ async def test_dcr_respects_bearer_token_type(
|
|||||||
pass
|
pass
|
||||||
|
|
||||||
logger.info(
|
logger.info(
|
||||||
f"✅ DCR with token_type=Bearer works correctly! "
|
f"✅ DCR with token_type=opaque works correctly! "
|
||||||
f"Token is opaque (not JWT format): {access_token[:30]}..."
|
f"Token is opaque (not JWT format): {access_token[:30]}..."
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|||||||