Files
nextcloud-mcp-server/docs/ADR-003-vector-database-semantic-search.md
Chris Coutinho dc93da2ea0 docs: add ADR-007 for background vector database synchronization
Add comprehensive ADR-007 documenting background vector database
synchronization architecture using anyio TaskGroups for in-process
concurrency. This supersedes ADR-003's conceptual background worker.

Key decisions:
- In-process architecture using anyio TaskGroups (not Celery)
- Scanner task runs hourly, detects changes via timestamp comparison
- In-memory asyncio.Queue for pending documents
- Pool of 3 concurrent processor tasks for I/O-bound embedding workloads
- Qdrant metadata as single source of truth for indexing state
- Simple user controls: enable/disable with status visibility

Benefits:
- Single container deployment (was 3: mcp, celery-worker, celery-beat)
- No distributed task queue infrastructure
- Shared process state (no volume coordination)
- Sufficient throughput for I/O-bound embedding APIs
- Simpler debugging and deployment

Update ADR-003 status to "Superseded by ADR-007" with reference link.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 20:32:49 +01:00

1119 lines
36 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-003: Vector Database and Semantic Search Architecture
## Status
Superseded by ADR-007
**Note**: This ADR was never implemented. The core technical decisions (Qdrant, embeddings, hybrid search) remain valid and are incorporated into ADR-007, which adds user-controlled background job management, task queuing, multi-user scheduling, and web UI integration. See [ADR-007: Background Vector Sync with User-Controlled Job Management](./ADR-007-background-vector-sync-job-management.md) for the implemented architecture.
## Context
### Current State
ADR-001 introduced token-based keyword search with relevance ranking, which improved upon simple substring matching. However, this approach still has fundamental limitations:
1. **Lexical Matching Only**: Requires exact word matches (e.g., "automobile" won't match "car")
2. **No Semantic Understanding**: Cannot understand intent or context (e.g., "how to bake bread" won't match "bread recipe")
3. **Language Barriers**: Poor support for synonyms, related terms, or multilingual content
4. **No Cross-Content Search**: Cannot find related content across different apps (notes, files, calendar)
5. **Scaling Issues**: Performance degrades with large content collections
### User Needs
LLM-powered applications (Claude via MCP) benefit significantly from semantic search capabilities:
- **Context Discovery**: Find relevant information based on meaning, not just keywords
- **Knowledge Retrieval**: Retrieve contextually relevant notes/files for task completion
- **Cross-Referencing**: Connect related information across different content types
- **Natural Language Queries**: Support conversational search patterns
### Technical Requirements
1. **Multi-User Environment**: OAuth-based with per-user isolation and permissions
2. **Multi-Tenant**: Single deployment serving multiple users with strict data isolation
3. **Real-Time Search**: Sub-second query latency for good UX
4. **Large Content**: Support for documents, PDFs, images with text extraction
5. **Privacy**: No external API calls for sensitive content (optionally self-hosted)
6. **Hybrid Search**: Combine semantic and keyword search for best results
## Decision
We will implement **semantic search using a vector database** with the following architecture:
### Core Components
1. **Vector Database**: Qdrant as external sidecar service
2. **Embedding Strategy**: Configurable (OpenAI API / local models / self-hosted)
3. **Search Pattern**: Hybrid search (semantic + keyword fusion)
4. **Multi-Tenancy**: Single collection with user_id filtering
5. **Authorization**: Dual-phase (vector search + Nextcloud API verification)
6. **Sync Strategy**: Background worker with incremental updates (see ADR-002)
### Architecture Diagram
```
┌─────────────────────────────────────────────────────────────┐
│ User Request (OAuth) │
│ "find notes about baking" │
└───────────────────────────┬─────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│ MCP Server (Semantic Search Tool) │
│ │
│ 1. Generate query embedding │
│ 2. Search vector DB (user_id filter) │
│ 3. Verify permissions via Nextcloud API │
│ 4. Return ranked results │
└──────────┬─────────────────────────────┬────────────────────┘
│ │
▼ ▼
┌──────────────────────┐ ┌──────────────────────────────┐
│ Embedding Service │ │ Qdrant Vector Database │
│ - OpenAI API │ │ │
│ - Local Model │ │ Collection: nextcloud_content │
│ - Self-hosted │ │ - User-filtered vectors │
└──────────────────────┘ │ - Metadata for auth │
│ - HNSW index │
└───────────────────────────────┘
│ Indexing
┌──────────┴────────────────────┐
│ Background Sync Worker │
│ (see ADR-002 for auth) │
│ │
│ 1. Fetch user content │
│ 2. Generate embeddings │
│ 3. Upsert to Qdrant │
│ 4. Update metadata │
└───────────────────────────────┘
```
## Implementation Details
### 1. Vector Database Selection: Qdrant
After evaluating multiple options, we select **Qdrant** for the following reasons:
**Qdrant Advantages:**
- ✅ Native async Python client (`qdrant-client`)
- ✅ Efficient multi-tenancy via filtered search (no collection-per-user needed)
- ✅ Built-in hybrid search support (dense + sparse vectors)
- ✅ HNSW index with excellent performance
- ✅ Lightweight Docker deployment
- ✅ Persistent storage with snapshots
- ✅ API key authentication
- ✅ Active development and documentation
**Comparison with Alternatives:**
| Feature | Qdrant | Chroma | Weaviate | pgvector |
|---------|--------|--------|----------|----------|
| Async Python | ✅ | ⚠️ Sync | ✅ | ✅ |
| Multi-tenant filtering | ✅ | ⚠️ Limited | ✅ | ✅ |
| Hybrid search | ✅ | ❌ | ✅ | ⚠️ Manual |
| Docker deployment | ✅ Easy | ✅ Easy | ✅ Complex | ⚠️ Postgres |
| Memory usage | ✅ Low | ⚠️ Medium | ⚠️ High | ✅ Low |
| Maturity | ✅ Production | ⚠️ Young | ✅ Production | ✅ Mature |
**Decision**: Qdrant provides the best balance of features, performance, and ease of deployment.
### 2. Embedding Strategy: Tiered Approach
Support multiple embedding backends with automatic fallback:
```python
class EmbeddingService:
"""Unified interface for embedding generation"""
def __init__(self):
self.provider = self._detect_provider()
def _detect_provider(self) -> EmbeddingProvider:
"""Auto-detect available embedding provider"""
# Tier 1: OpenAI API (best quality, requires API key)
if os.getenv("OPENAI_API_KEY"):
return OpenAIEmbedding(
model=os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small"),
api_key=os.getenv("OPENAI_API_KEY")
)
# Tier 2: Self-hosted embedding service (good quality, privacy-preserving)
if os.getenv("EMBEDDING_SERVICE_URL"):
return HTTPEmbedding(
url=os.getenv("EMBEDDING_SERVICE_URL"),
model=os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
)
# Tier 3: Local model (fallback, CPU-only)
logger.warning("No cloud/hosted embeddings available, using local model")
return LocalEmbedding(
model=os.getenv("LOCAL_EMBEDDING_MODEL", "all-MiniLM-L6-v2")
)
async def embed(self, text: str) -> list[float]:
"""Generate embedding vector for text"""
return await self.provider.embed(text)
async def embed_batch(self, texts: list[str]) -> list[list[float]]:
"""Generate embeddings for multiple texts (optimized)"""
return await self.provider.embed_batch(texts)
```
#### 2.1 OpenAI Embeddings (Tier 1)
```python
class OpenAIEmbedding(EmbeddingProvider):
"""OpenAI embedding API"""
def __init__(self, model: str, api_key: str):
self.client = AsyncOpenAI(api_key=api_key)
self.model = model
self.dimension = 1536 if "3-small" in model else 1536 # Model-dependent
async def embed(self, text: str) -> list[float]:
response = await self.client.embeddings.create(
model=self.model,
input=text
)
return response.data[0].embedding
async def embed_batch(self, texts: list[str]) -> list[list[float]]:
# OpenAI supports batch up to 2048 inputs
response = await self.client.embeddings.create(
model=self.model,
input=texts
)
return [item.embedding for item in response.data]
```
**Costs**: text-embedding-3-small: $0.02 per 1M tokens (~4M characters)
- 10,000 notes × 500 words avg = ~$0.10 to index
- Searches are extremely cheap (~$0.00002 per query)
#### 2.2 Self-Hosted Embeddings (Tier 2)
```python
class HTTPEmbedding(EmbeddingProvider):
"""Self-hosted embedding service (Infinity, TEI, Ollama)"""
def __init__(self, url: str, model: str):
self.client = httpx.AsyncClient()
self.url = url
self.model = model
self.dimension = 384 # Model-dependent (bge-small: 384, bge-base: 768)
async def embed(self, text: str) -> list[float]:
response = await self.client.post(
f"{self.url}/embeddings",
json={"input": text, "model": self.model}
)
response.raise_for_status()
return response.json()["data"][0]["embedding"]
```
**Self-Hosted Options**:
- **Infinity**: Lightweight, OpenAI-compatible API, GPU support
- **Text Embeddings Inference (TEI)**: HuggingFace official, optimized, Rust-based
- **Ollama**: Easy setup, multi-model support, CPU/GPU
#### 2.3 Local Embeddings (Tier 3)
```python
class LocalEmbedding(EmbeddingProvider):
"""Local embedding using sentence-transformers (CPU fallback)"""
def __init__(self, model: str):
from sentence_transformers import SentenceTransformer
self.model = SentenceTransformer(model)
self.dimension = self.model.get_sentence_embedding_dimension()
async def embed(self, text: str) -> list[float]:
# Run in thread pool to avoid blocking
loop = asyncio.get_event_loop()
embedding = await loop.run_in_executor(
None,
self.model.encode,
text
)
return embedding.tolist()
```
**Recommended Local Models**:
- `all-MiniLM-L6-v2`: 384 dims, fast, good quality
- `all-mpnet-base-v2`: 768 dims, slower, better quality
- `paraphrase-multilingual-MiniLM-L12-v2`: Multilingual support
### 3. Vector Database Schema
```python
# Qdrant collection configuration
collection_config = {
"collection_name": "nextcloud_content",
"vectors_config": {
"size": 384, # Embedding dimension (model-dependent)
"distance": "Cosine" # Cosine similarity for semantic search
},
"optimizers_config": {
"indexing_threshold": 10000 # Start indexing after 10k vectors
},
"hnsw_config": {
"m": 16, # Number of edges per node (balance speed/accuracy)
"ef_construct": 100 # Quality of index construction
}
}
# Payload schema (metadata)
payload_schema = {
"user_id": str, # Required: owner of content
"content_type": str, # "note", "file", "calendar_event"
"content_id": str, # Source ID (note_id, file_path, event_id)
"title": str, # Searchable title
"excerpt": str, # First 200 chars for preview
"category": str, # Optional: category/folder
"mime_type": str, # Optional: file MIME type
"shared_with": list[str], # Optional: list of user_ids with access
"tags": list[str], # Optional: user tags
"created_at": int, # Unix timestamp
"modified_at": int, # Unix timestamp
"indexed_at": int # Unix timestamp (for sync tracking)
}
```
#### 3.1 Multi-Tenancy via Filtering
```python
# User-specific search with filtering
search_results = await qdrant_client.search(
collection_name="nextcloud_content",
query_vector=query_embedding,
query_filter=models.Filter(
must=[
# User owns the content OR it's shared with them
models.Filter(
should=[
models.FieldCondition(
key="user_id",
match=models.MatchValue(value=current_user_id)
),
models.FieldCondition(
key="shared_with",
match=models.MatchAny(any=[current_user_id])
)
]
),
# Optional: filter by content type
models.FieldCondition(
key="content_type",
match=models.MatchValue(value="note")
)
]
),
limit=20,
score_threshold=0.7 # Only return confident matches
)
```
### 4. Hybrid Search Implementation
Combine semantic and keyword search for best results:
```python
@mcp.tool()
@require_scopes("notes:read")
async def nc_notes_hybrid_search(
query: str,
ctx: Context,
limit: int = 10,
semantic_weight: float = 0.7,
keyword_weight: float = 0.3
) -> SearchNotesResponse:
"""
Hybrid search combining semantic understanding with keyword precision.
Args:
query: Natural language search query
limit: Maximum results to return
semantic_weight: Weight for semantic similarity (0-1)
keyword_weight: Weight for keyword matching (0-1)
"""
client = get_client(ctx)
username = client.username
# Run searches in parallel
semantic_task = asyncio.create_task(
semantic_search(query, username, limit=limit * 2)
)
keyword_task = asyncio.create_task(
keyword_search(query, username, limit=limit * 2)
)
semantic_results, keyword_results = await asyncio.gather(
semantic_task, keyword_task
)
# Fusion: Combine and rerank results
fused_results = reciprocal_rank_fusion(
semantic_results,
keyword_results,
semantic_weight=semantic_weight,
keyword_weight=keyword_weight
)
# Verify permissions via Nextcloud API (dual-phase authorization)
verified_results = []
for result in fused_results[:limit * 2]: # Get extra for filtering
try:
note = await client.notes.get_note(result["note_id"])
verified_results.append({
"note": note,
"score": result["score"],
"match_type": result["match_type"] # "semantic", "keyword", "both"
})
if len(verified_results) >= limit:
break
except HTTPStatusError as e:
if e.response.status_code == 403:
continue # User lost access
raise
return SearchNotesResponse(
results=verified_results,
query=query,
total_found=len(verified_results),
search_method="hybrid"
)
def reciprocal_rank_fusion(
semantic_results: list[dict],
keyword_results: list[dict],
semantic_weight: float = 0.7,
keyword_weight: float = 0.3,
k: int = 60 # RRF constant
) -> list[dict]:
"""
Reciprocal Rank Fusion for combining search results.
RRF is more robust than score normalization because it only
depends on ranks, not absolute scores.
"""
# Build rank maps
semantic_ranks = {r["note_id"]: i for i, r in enumerate(semantic_results)}
keyword_ranks = {r["note_id"]: i for i, r in enumerate(keyword_results)}
# Get all unique note IDs
all_note_ids = set(semantic_ranks.keys()) | set(keyword_ranks.keys())
# Calculate fused scores
fused = []
for note_id in all_note_ids:
# RRF formula: score = sum(weight_i / (k + rank_i))
semantic_score = 0
keyword_score = 0
match_type = []
if note_id in semantic_ranks:
semantic_score = semantic_weight / (k + semantic_ranks[note_id])
match_type.append("semantic")
if note_id in keyword_ranks:
keyword_score = keyword_weight / (k + keyword_ranks[note_id])
match_type.append("keyword")
fused.append({
"note_id": note_id,
"score": semantic_score + keyword_score,
"match_type": "+".join(match_type)
})
# Sort by fused score
fused.sort(key=lambda x: x["score"], reverse=True)
return fused
```
### 5. Document Chunking Strategy
For large documents (>1000 tokens), implement semantic chunking:
```python
class DocumentChunker:
"""Chunk large documents for optimal embedding"""
def __init__(self, chunk_size: int = 512, overlap: int = 50):
self.chunk_size = chunk_size # tokens
self.overlap = overlap # overlapping tokens
def chunk_document(
self,
content: str,
metadata: dict
) -> list[tuple[str, dict]]:
"""
Split document into overlapping chunks with metadata.
Returns list of (chunk_text, chunk_metadata) tuples.
"""
# Tokenize (approximate with words for simplicity)
tokens = content.split()
if len(tokens) <= self.chunk_size:
# Document fits in single chunk
return [(content, metadata)]
chunks = []
start = 0
while start < len(tokens):
end = start + self.chunk_size
chunk_tokens = tokens[start:end]
chunk_text = " ".join(chunk_tokens)
# Add chunk metadata
chunk_metadata = {
**metadata,
"chunk_index": len(chunks),
"chunk_start": start,
"chunk_end": end,
"is_chunk": True
}
chunks.append((chunk_text, chunk_metadata))
# Move to next chunk with overlap
start = end - self.overlap
return chunks
# Usage in sync worker
async def index_document(doc: Document, user_id: str):
"""Index a document with chunking"""
chunker = DocumentChunker(chunk_size=512, overlap=50)
chunks = chunker.chunk_document(
content=doc.content,
metadata={
"user_id": user_id,
"content_type": "file",
"content_id": doc.path,
"title": doc.title,
"mime_type": doc.mime_type
}
)
# Generate embeddings in batch
chunk_texts = [chunk[0] for chunk in chunks]
embeddings = await embedding_service.embed_batch(chunk_texts)
# Upsert all chunks
points = []
for (chunk_text, chunk_metadata), embedding in zip(chunks, embeddings):
points.append(
models.PointStruct(
id=str(uuid.uuid4()),
vector=embedding,
payload={
**chunk_metadata,
"excerpt": chunk_text[:200] # Preview
}
)
)
await qdrant_client.upsert(
collection_name="nextcloud_content",
points=points
)
```
### 6. Background Sync Worker
```python
# nextcloud_mcp_server/sync/vector_indexer.py
class VectorIndexer:
"""Indexes content into vector database"""
def __init__(
self,
qdrant_client: AsyncQdrantClient,
embedding_service: EmbeddingService,
auth_provider: SyncAuthProvider # From ADR-002
):
self.qdrant = qdrant_client
self.embeddings = embedding_service
self.auth = auth_provider
async def sync_user_notes(self, user_id: str):
"""Sync all notes for a user"""
# Get authenticated client for user
client = await self.auth.get_user_client(user_id)
# Fetch all notes
notes = await client.notes.list_notes()
logger.info(f"Syncing {len(notes)} notes for {user_id}")
# Check which notes need updating
existing_ids = await self._get_indexed_note_ids(user_id)
notes_to_update = [
n for n in notes
if f"note_{n.id}" not in existing_ids
or n.modified > existing_ids[f"note_{n.id}"]
]
if not notes_to_update:
logger.info(f"All notes up-to-date for {user_id}")
return
# Generate embeddings in batch
contents = [f"{n.title}\n\n{n.content}" for n in notes_to_update]
embeddings = await self.embeddings.embed_batch(contents)
# Prepare points for upsert
points = []
for note, embedding in zip(notes_to_update, embeddings):
points.append(
models.PointStruct(
id=f"note_{note.id}",
vector=embedding,
payload={
"user_id": user_id,
"content_type": "note",
"content_id": str(note.id),
"note_id": note.id,
"title": note.title,
"excerpt": note.content[:200],
"category": note.category,
"created_at": note.created,
"modified_at": note.modified,
"indexed_at": int(time.time())
}
)
)
# Upsert to Qdrant
await self.qdrant.upsert(
collection_name="nextcloud_content",
points=points
)
logger.info(f"Indexed {len(points)} notes for {user_id}")
async def _get_indexed_note_ids(self, user_id: str) -> dict[str, int]:
"""Get map of note_id -> modified_at for indexed notes"""
# Query Qdrant for existing notes
scroll_result = await self.qdrant.scroll(
collection_name="nextcloud_content",
scroll_filter=models.Filter(
must=[
models.FieldCondition(
key="user_id",
match=models.MatchValue(value=user_id)
),
models.FieldCondition(
key="content_type",
match=models.MatchValue(value="note")
)
]
),
with_payload=["content_id", "modified_at"],
limit=10000
)
return {
point.payload["content_id"]: point.payload["modified_at"]
for point, _ in scroll_result
}
async def delete_note(self, user_id: str, note_id: int):
"""Remove deleted note from index"""
await self.qdrant.delete(
collection_name="nextcloud_content",
points_selector=models.FilterSelector(
filter=models.Filter(
must=[
models.FieldCondition(
key="user_id",
match=models.MatchValue(value=user_id)
),
models.FieldCondition(
key="note_id",
match=models.MatchValue(value=note_id)
)
]
)
)
)
```
### 7. Configuration
#### 7.1 Environment Variables
```bash
# Vector Database
QDRANT_URL=http://qdrant:6333
QDRANT_API_KEY=<secure-api-key>
QDRANT_COLLECTION=nextcloud_content
# Embedding Strategy (choose one)
# Option 1: OpenAI
OPENAI_API_KEY=sk-...
OPENAI_EMBEDDING_MODEL=text-embedding-3-small # or text-embedding-3-large
# Option 2: Self-hosted
EMBEDDING_SERVICE_URL=http://embeddings:7997
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
# Option 3: Local (fallback, no config needed)
# Search Configuration
SEMANTIC_SEARCH_ENABLED=true
HYBRID_SEARCH_DEFAULT_SEMANTIC_WEIGHT=0.7
HYBRID_SEARCH_DEFAULT_KEYWORD_WEIGHT=0.3
SEARCH_SCORE_THRESHOLD=0.7
# Sync Configuration
VECTOR_SYNC_INTERVAL=300 # seconds
VECTOR_SYNC_BATCH_SIZE=100
```
#### 7.2 Docker Compose
```yaml
services:
# Vector Database
qdrant:
image: qdrant/qdrant:latest
restart: always
ports:
- 127.0.0.1:6333:6333 # REST API
- 127.0.0.1:6334:6334 # gRPC
volumes:
- qdrant_storage:/qdrant/storage
environment:
- QDRANT__SERVICE__API_KEY=${QDRANT_API_KEY}
- QDRANT__SERVICE__HTTP_PORT=6333
- QDRANT__SERVICE__GRPC_PORT=6334
# Embedding Service (optional - for self-hosted)
embeddings:
image: michaelf34/infinity:latest
restart: always
ports:
- 127.0.0.1:7997:7997
volumes:
- embedding_models:/app/.cache
environment:
- MODEL_ID=BAAI/bge-small-en-v1.5
- BATCH_SIZE=32
- ENGINE=torch # or optimum for better CPU performance
# Optional: GPU support
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
# MCP Server with vector search
mcp:
build: .
command: ["--transport", "streamable-http"]
depends_on:
- app
- qdrant
- embeddings # optional
environment:
# ... existing env vars ...
- SEMANTIC_SEARCH_ENABLED=true
- QDRANT_URL=http://qdrant:6333
- QDRANT_API_KEY=${QDRANT_API_KEY}
# Choose embedding strategy
- EMBEDDING_SERVICE_URL=http://embeddings:7997
# OR
# - OPENAI_API_KEY=${OPENAI_API_KEY}
# Vector Sync Worker
mcp-vector-sync:
build: .
command: ["python", "-m", "nextcloud_mcp_server.sync.vector_indexer"]
depends_on:
- app
- qdrant
- embeddings # optional
environment:
# Nextcloud + Auth (from ADR-002)
- NEXTCLOUD_HOST=http://app:80
- ENABLE_OFFLINE_ACCESS=true
- TOKEN_ENCRYPTION_KEY=${TOKEN_ENCRYPTION_KEY}
# Vector Database
- QDRANT_URL=http://qdrant:6333
- QDRANT_API_KEY=${QDRANT_API_KEY}
# Embeddings
- EMBEDDING_SERVICE_URL=http://embeddings:7997
volumes:
- sync-tokens:/app/data
volumes:
qdrant_storage:
embedding_models:
sync-tokens:
```
### 8. Performance Optimization
#### 8.1 Indexing Performance
```python
# Batch embedding generation
async def embed_batch_chunked(
texts: list[str],
batch_size: int = 100
) -> list[list[float]]:
"""Generate embeddings in chunks to avoid memory issues"""
embeddings = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
batch_embeddings = await embedding_service.embed_batch(batch)
embeddings.extend(batch_embeddings)
await asyncio.sleep(0.1) # Rate limiting
return embeddings
# Parallel upsert with batching
async def upsert_points_batched(
points: list[models.PointStruct],
batch_size: int = 100
):
"""Upsert points in batches"""
for i in range(0, len(points), batch_size):
batch = points[i:i + batch_size]
await qdrant_client.upsert(
collection_name="nextcloud_content",
points=batch,
wait=False # Don't wait for indexing
)
```
#### 8.2 Search Performance
```python
# Search with prefetch for better accuracy
search_results = await qdrant_client.search(
collection_name="nextcloud_content",
query_vector=query_embedding,
query_filter=user_filter,
limit=20,
with_payload=True,
with_vectors=False, # Don't return vectors (saves bandwidth)
search_params=models.SearchParams(
hnsw_ef=128, # Higher = more accurate but slower
exact=False # Use HNSW index
)
)
```
#### 8.3 Caching
```python
# Cache embeddings for common queries
from functools import lru_cache
@lru_cache(maxsize=1000)
def cache_key(text: str) -> str:
return hashlib.sha256(text.encode()).hexdigest()
async def embed_with_cache(text: str) -> list[float]:
"""Generate embedding with caching"""
key = cache_key(text)
# Check Redis cache
cached = await redis.get(f"embedding:{key}")
if cached:
return json.loads(cached)
# Generate embedding
embedding = await embedding_service.embed(text)
# Cache for 1 hour
await redis.setex(
f"embedding:{key}",
3600,
json.dumps(embedding)
)
return embedding
```
### 9. Monitoring and Metrics
```python
# Prometheus metrics
from prometheus_client import Counter, Histogram, Gauge
# Search metrics
semantic_search_count = Counter(
'semantic_search_total',
'Total semantic searches',
['user_id', 'content_type']
)
semantic_search_latency = Histogram(
'semantic_search_duration_seconds',
'Semantic search latency',
['phase'] # 'embedding', 'vector_search', 'verification'
)
# Indexing metrics
documents_indexed = Counter(
'documents_indexed_total',
'Total documents indexed',
['user_id', 'content_type']
)
index_queue_size = Gauge(
'index_queue_size',
'Number of documents waiting to be indexed'
)
# Usage
async def semantic_search(query: str, user_id: str):
semantic_search_count.labels(user_id=user_id, content_type='note').inc()
with semantic_search_latency.labels(phase='embedding').time():
embedding = await embed(query)
with semantic_search_latency.labels(phase='vector_search').time():
results = await qdrant.search(...)
with semantic_search_latency.labels(phase='verification').time():
verified = await verify_access(results)
return verified
```
## Consequences
### Benefits
1. **Semantic Understanding**
- Find content by meaning, not just keywords
- Support for natural language queries
- Cross-lingual search potential
- Better context discovery for LLMs
2. **User Experience**
- More relevant search results
- Discover related content across apps
- Fast sub-second query latency
- Hybrid search combines best of both worlds
3. **Architecture**
- External sidecar (doesn't bloat MCP server)
- Configurable embedding backend (cloud/self-hosted/local)
- Multi-tenant with strict isolation
- Scales horizontally (Qdrant cluster)
4. **Privacy & Security**
- Self-hosted option available
- Dual-phase authorization enforces permissions
- Vector DB is cache, not source of truth
- Per-user audit trail
5. **Developer Experience**
- Simple async Python API
- Comprehensive monitoring
- Clear upgrade path (better embeddings, reranking)
### Limitations
1. **Complexity**
- Additional infrastructure (Qdrant + embeddings)
- More monitoring required
- Embedding generation latency
- Initial indexing time for large collections
2. **Cost**
- Storage: ~4KB per document (embedding + metadata)
- Compute: Embedding generation (API costs or GPU)
- Memory: Qdrant keeps vectors in RAM for speed
3. **Operational**
- Index maintenance and updates
- Embedding model versioning
- Handling deleted/moved content
- Cold start indexing for new users
4. **Search Accuracy**
- Quality depends on embedding model
- May miss exact keyword matches (mitigated by hybrid search)
- Cultural/domain-specific terms may not embed well
- Requires tuning score thresholds
### Performance Characteristics
| Metric | Target | Notes |
|--------|--------|-------|
| Search latency | <200ms | Embedding + vector search + verification |
| Indexing throughput | >100 docs/sec | With batch embeddings |
| Memory per 10k docs | ~40MB | Qdrant vectors + metadata |
| Disk per 10k docs | ~40MB | Persistent storage |
| Search accuracy | >90% | At score_threshold=0.7 |
### Cost Estimates
**Small Deployment** (10 users, 1000 notes each):
- Initial indexing: 10,000 notes × $0.00002 = $0.20 (OpenAI)
- Monthly searches: 1000 queries × $0.00002 = $0.02
- Infrastructure: Qdrant (40MB RAM), Embeddings (optional)
- **Total**: ~$0.25/month (API) or self-hosted (negligible)
**Medium Deployment** (100 users, 500 notes each):
- Initial indexing: 50,000 notes × $0.00002 = $1.00
- Monthly searches: 10,000 queries × $0.00002 = $0.20
- Infrastructure: Qdrant (200MB RAM)
- **Total**: ~$1.20/month or self-hosted
**Self-Hosted** (any size):
- GPU instance: ~$0.50/hour (~$360/month for 24/7)
- Or CPU-only: negligible cost, slower embeddings
### Future Enhancements
1. **Multimodal Search**
- Image embeddings (CLIP)
- PDF/document layout understanding
- Audio transcription + embedding
2. **Advanced Ranking**
- Cross-encoder reranking
- Learning-to-rank models
- User feedback signals
3. **Query Understanding**
- Query expansion
- Spell correction
- Entity extraction
4. **Performance**
- Query result caching
- Approximate nearest neighbor improvements
- Quantization for reduced memory
5. **Features**
- Saved searches
- Search analytics
- Recommended content
## Alternatives Considered
### Alternative 1: Elasticsearch/OpenSearch
**Approach**: Use traditional full-text search engine with vector plugin
**Pros**:
- Mature ecosystem
- Excellent keyword search
- Rich query DSL
**Cons**:
- Heavy infrastructure (JVM-based)
- Complex setup and tuning
- Vector search is plugin/add-on (not native)
- Higher resource usage
**Decision**: Rejected; Qdrant is purpose-built for vectors
### Alternative 2: ChromaDB
**Approach**: Embedded or client-server vector database
**Pros**:
- Simple Python API
- Easy to get started
- Good for prototyping
**Cons**:
- Sync-only Python client (no async)
- Limited multi-tenancy features
- Less mature than Qdrant
- Scaling concerns
**Decision**: Rejected; async and multi-tenancy are critical
### Alternative 3: Weaviate
**Approach**: Full-featured vector database with GraphQL
**Pros**:
- Very feature-rich
- Built-in vectorization
- Good documentation
**Cons**:
- More complex architecture
- Higher resource usage
- GraphQL adds complexity
- Overkill for our use case
**Decision**: Rejected; Qdrant provides better balance
### Alternative 4: pgvector (PostgreSQL Extension)
**Approach**: Add vector search to existing PostgreSQL
**Pros**:
- Leverages existing PostgreSQL expertise
- Transactional consistency
- Mature database ecosystem
**Cons**:
- This deployment uses MariaDB (would need PostgreSQL)
- Performance not as optimized as purpose-built vector DB
- Manual hybrid search implementation
- HNSW index limitations
**Decision**: Rejected; dedicated vector DB is better fit
### Alternative 5: Pinecone / Vertex AI Vector Search
**Approach**: Managed cloud vector database
**Pros**:
- Fully managed
- Excellent performance
- No infrastructure management
**Cons**:
- Cloud-only (no self-hosting)
- Recurring costs
- Vendor lock-in
- Data leaves premises
**Decision**: Rejected; self-hosting is important for privacy
## Related Decisions
- ADR-001: Enhanced Note Search (establishes need for better search)
- ADR-002: Vector Sync Authentication (defines how sync workers authenticate)
- [Future] ADR-004: Content Extraction and Document Processing
- [Future] ADR-005: Cross-App Semantic Search
## References
- [Qdrant Documentation](https://qdrant.tech/documentation/)
- [Sentence Transformers](https://www.sbert.net/)
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [Hybrid Search with RRF](https://qdrant.tech/articles/hybrid-search/)
- [HNSW Algorithm](https://arxiv.org/abs/1603.09320)
- [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf)