# ADR-003: Vector Database and Semantic Search Architecture

## Status

Superseded by ADR-007

**Note**: This ADR was never implemented. The core technical decisions (Qdrant, embeddings, hybrid search) remain valid and are incorporated into ADR-007, which adds user-controlled background job management, task queuing, multi-user scheduling, and web UI integration. See [ADR-007: Background Vector Sync with User-Controlled Job Management](./ADR-007-background-vector-sync-job-management.md) for the implemented architecture.

## Context

### Current State

ADR-001 introduced token-based keyword search with relevance ranking, which improved upon simple substring matching. However, this approach still has fundamental limitations:

1. **Lexical Matching Only**: Requires exact word matches (e.g., "automobile" won't match "car")
2. **No Semantic Understanding**: Cannot understand intent or context (e.g., "how to bake bread" won't match "bread recipe")
3. **Language Barriers**: Poor support for synonyms, related terms, or multilingual content
4. **No Cross-Content Search**: Cannot find related content across different apps (notes, files, calendar)
5. **Scaling Issues**: Performance degrades with large content collections

### User Needs

LLM-powered applications (Claude via MCP) benefit significantly from semantic search capabilities:

- **Context Discovery**: Find relevant information based on meaning, not just keywords
- **Knowledge Retrieval**: Retrieve contextually relevant notes/files for task completion
- **Cross-Referencing**: Connect related information across different content types
- **Natural Language Queries**: Support conversational search patterns

### Technical Requirements

1. **Multi-User Environment**: OAuth-based with per-user isolation and permissions
2. **Multi-Tenant**: Single deployment serving multiple users with strict data isolation
3. **Real-Time Search**: Sub-second query latency for good UX
4. **Large Content**: Support for documents, PDFs, images with text extraction
5. **Privacy**: No external API calls for sensitive content (optionally self-hosted)
6. **Hybrid Search**: Combine semantic and keyword search for best results

## Decision

We will implement **semantic search using a vector database** with the following architecture:

### Core Components

1. **Vector Database**: Qdrant as an external sidecar service
2. **Embedding Strategy**: Configurable (OpenAI API / local models / self-hosted)
3. **Search Pattern**: Hybrid search (semantic + keyword fusion)
4. **Multi-Tenancy**: Single collection with user_id filtering
5. **Authorization**: Dual-phase (vector search + Nextcloud API verification)
6. **Sync Strategy**: Background worker with incremental updates (see ADR-002)

### Architecture Diagram

```
┌─────────────────────────────────────────────────────────────┐
│                    User Request (OAuth)                     │
│                  "find notes about baking"                  │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│              MCP Server (Semantic Search Tool)              │
│                                                             │
│  1. Generate query embedding                                │
│  2. Search vector DB (user_id filter)                       │
│  3. Verify permissions via Nextcloud API                    │
│  4. Return ranked results                                   │
└──────────┬─────────────────────────────┬────────────────────┘
           │                             │
           ▼                             ▼
┌──────────────────────┐      ┌───────────────────────────────┐
│ Embedding Service    │      │ Qdrant Vector Database        │
│ - OpenAI API         │      │                               │
│ - Local Model        │      │ Collection: nextcloud_content │
│ - Self-hosted        │      │ - User-filtered vectors       │
└──────────────────────┘      │ - Metadata for auth           │
                              │ - HNSW index                  │
                              └───────────────────────────────┘
                                         ▲
                                         │ Indexing
                              ┌──────────┴────────────────────┐
                              │ Background Sync Worker        │
                              │ (see ADR-002 for auth)        │
                              │                               │
                              │ 1. Fetch user content         │
                              │ 2. Generate embeddings        │
                              │ 3. Upsert to Qdrant           │
                              │ 4. Update metadata            │
                              └───────────────────────────────┘
```

## Implementation Details

### 1. Vector Database Selection: Qdrant

After evaluating multiple options, we select **Qdrant** for the following reasons:

**Qdrant Advantages:**
- ✅ Native async Python client (`qdrant-client`)
- ✅ Efficient multi-tenancy via filtered search (no collection-per-user needed)
- ✅ Built-in hybrid search support (dense + sparse vectors)
- ✅ HNSW index with excellent performance
- ✅ Lightweight Docker deployment
- ✅ Persistent storage with snapshots
- ✅ API key authentication
- ✅ Active development and documentation

**Comparison with Alternatives:**

| Feature | Qdrant | Chroma | Weaviate | pgvector |
|---------|--------|--------|----------|----------|
| Async Python | ✅ | ⚠️ Sync | ✅ | ✅ |
| Multi-tenant filtering | ✅ | ⚠️ Limited | ✅ | ✅ |
| Hybrid search | ✅ | ❌ | ✅ | ⚠️ Manual |
| Docker deployment | ✅ Easy | ✅ Easy | ⚠️ Complex | ⚠️ Postgres |
| Memory usage | ✅ Low | ⚠️ Medium | ⚠️ High | ✅ Low |
| Maturity | ✅ Production | ⚠️ Young | ✅ Production | ✅ Mature |

**Decision**: Qdrant provides the best balance of features, performance, and ease of deployment.

### 2. Embedding Strategy: Tiered Approach

Support multiple embedding backends. The service selects one in priority order, based on which settings are configured:

```python
import logging
import os

logger = logging.getLogger(__name__)


class EmbeddingService:
    """Unified interface for embedding generation"""

    def __init__(self):
        self.provider = self._detect_provider()
        # Vector size of the selected model; the Qdrant collection must use the same size
        self.dimension = self.provider.dimension

    def _detect_provider(self) -> "EmbeddingProvider":
        """Pick the first embedding provider whose configuration is present"""

        # Tier 1: OpenAI API (best quality, requires API key)
        if os.getenv("OPENAI_API_KEY"):
            return OpenAIEmbedding(
                model=os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small"),
                api_key=os.getenv("OPENAI_API_KEY"),
            )

        # Tier 2: Self-hosted embedding service (good quality, privacy-preserving)
        if os.getenv("EMBEDDING_SERVICE_URL"):
            return HTTPEmbedding(
                url=os.getenv("EMBEDDING_SERVICE_URL"),
                model=os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5"),
            )

        # Tier 3: Local model (fallback, CPU-only)
        logger.warning("No cloud/hosted embeddings available, using local model")
        return LocalEmbedding(
            model=os.getenv("LOCAL_EMBEDDING_MODEL", "all-MiniLM-L6-v2")
        )

    async def embed(self, text: str) -> list[float]:
        """Generate embedding vector for text"""
        return await self.provider.embed(text)

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Generate embeddings for multiple texts (optimized)"""
        return await self.provider.embed_batch(texts)
```

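The provider classes below subclass `EmbeddingProvider`, but this ADR never defines it. Below is a minimal sketch of that contract. It assumes an abstract base class with a `dimension` attribute plus `embed` and `embed_batch` coroutines. It also includes a hypothetical `HashEmbedding` test double that returns deterministic vectors, so sync and search code can be unit-tested without a model or network access.

```python
import asyncio
import hashlib
import math
from abc import ABC, abstractmethod


class EmbeddingProvider(ABC):
    """Contract shared by all embedding backends (sketch)."""

    dimension: int  # Length of every vector this provider returns

    @abstractmethod
    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Return one vector per input text, in input order."""

    async def embed(self, text: str) -> list[float]:
        # Default single-text path delegates to the batch method
        return (await self.embed_batch([text]))[0]


class HashEmbedding(EmbeddingProvider):
    """Hypothetical test double: deterministic and unit-length, but NOT semantic."""

    def __init__(self, dimension: int = 8):
        self.dimension = dimension

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode()).digest()
            # Cycle through the 32 digest bytes, mapping each byte to [-1, 1]
            raw = [digest[i % len(digest)] / 127.5 - 1.0 for i in range(self.dimension)]
            # Never zero: no byte value maps to exactly 0.0
            norm = math.sqrt(sum(x * x for x in raw))
            vectors.append([x / norm for x in raw])
        return vectors


vector = asyncio.run(HashEmbedding(dimension=4).embed("bread recipe"))
print(len(vector))  # 4
```

These vectors carry no meaning. The double is therefore only useful for testing plumbing such as batching, ID mapping and upserts, not ranking quality.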
#### 2.1 OpenAI Embeddings (Tier 1)

```python
from openai import AsyncOpenAI


class OpenAIEmbedding(EmbeddingProvider):
    """OpenAI embedding API"""

    def __init__(self, model: str, api_key: str):
        self.client = AsyncOpenAI(api_key=api_key)
        self.model = model
        # text-embedding-3-large returns 3072 dims; 3-small (and ada-002) return 1536
        self.dimension = 3072 if "3-large" in model else 1536

    async def embed(self, text: str) -> list[float]:
        response = await self.client.embeddings.create(
            model=self.model,
            input=text,
        )
        return response.data[0].embedding

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        # OpenAI accepts up to 2048 inputs per request; split larger lists (see 8.1)
        response = await self.client.embeddings.create(
            model=self.model,
            input=texts,
        )
        return [item.embedding for item in response.data]
```

**Costs**: text-embedding-3-small: $0.02 per 1M tokens (~4M characters)
- 10,000 notes × 500 words avg ≈ 6.5M tokens, roughly $0.13 to index (assuming ~1.3 tokens per word)
- Searches are extremely cheap: even a 1,000-token query costs only $0.00002

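The indexing estimate can be reproduced with a small back-of-the-envelope helper. The 1.3 tokens-per-word ratio is an assumption for English prose, since the real count depends on the tokenizer. The price defaults to the $0.02 per 1M tokens quoted above. `embedding_cost_usd` is a hypothetical name.

```python
def embedding_cost_usd(
    num_docs: int,
    words_per_doc: float,
    tokens_per_word: float = 1.3,  # Assumed average for English prose
    usd_per_million_tokens: float = 0.02,  # text-embedding-3-small price quoted above
) -> float:
    """Rough one-time cost of embedding a whole collection."""
    total_tokens = num_docs * words_per_doc * tokens_per_word
    return total_tokens / 1_000_000 * usd_per_million_tokens


# 10,000 notes x 500 words -> about 6.5M tokens
print(f"${embedding_cost_usd(10_000, 500):.2f}")  # $0.13
```

Because the cost is linear in tokens, re-embedding only changed notes (section 6) keeps ongoing costs proportional to edit volume rather than collection size.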
#### 2.2 Self-Hosted Embeddings (Tier 2)

```python
import httpx


class HTTPEmbedding(EmbeddingProvider):
    """Self-hosted embedding service with an OpenAI-compatible API (Infinity, TEI, Ollama)

    Requests go to f"{url}/embeddings". Servers that mount their OpenAI-compatible
    routes under /v1 (e.g. TEI, Ollama) need that prefix included in the URL.
    """

    def __init__(self, url: str, model: str, dimension: int = 384):
        self.client = httpx.AsyncClient(timeout=30.0)
        self.url = url.rstrip("/")
        self.model = model
        self.dimension = dimension  # Model-dependent (bge-small: 384, bge-base: 768)

    async def embed(self, text: str) -> list[float]:
        return (await self.embed_batch([text]))[0]

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        response = await self.client.post(
            f"{self.url}/embeddings",
            json={"input": texts, "model": self.model},
        )
        response.raise_for_status()
        data = response.json()["data"]
        # Order by "index" so vectors line up with the input texts
        return [item["embedding"] for item in sorted(data, key=lambda item: item["index"])]
```

**Self-Hosted Options**:
- **Infinity**: Lightweight, OpenAI-compatible API, GPU support
- **Text Embeddings Inference (TEI)**: HuggingFace official, optimized, Rust-based
- **Ollama**: Easy setup, multi-model support, CPU/GPU

#### 2.3 Local Embeddings (Tier 3)

```python
import asyncio


class LocalEmbedding(EmbeddingProvider):
    """Local embedding using sentence-transformers (CPU fallback)"""

    def __init__(self, model: str):
        from sentence_transformers import SentenceTransformer

        self.model = SentenceTransformer(model)
        self.dimension = self.model.get_sentence_embedding_dimension()

    async def embed(self, text: str) -> list[float]:
        # encode() is CPU-bound: run it in a worker thread so the event loop stays responsive
        embedding = await asyncio.to_thread(self.model.encode, text)
        return embedding.tolist()

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        # Encoding a list returns a 2-D array, one row per text
        embeddings = await asyncio.to_thread(self.model.encode, texts)
        return embeddings.tolist()
```

**Recommended Local Models**:
- `all-MiniLM-L6-v2`: 384 dims, fast, good quality
- `all-mpnet-base-v2`: 768 dims, slower, better quality
- `paraphrase-multilingual-MiniLM-L12-v2`: 384 dims, multilingual support

### 3. Vector Database Schema

```python
from qdrant_client import AsyncQdrantClient, models


async def create_collection(qdrant_client: AsyncQdrantClient, dimension: int = 384):
    """Create the shared collection (dimension must match the embedding model)"""
    await qdrant_client.create_collection(
        collection_name="nextcloud_content",
        vectors_config=models.VectorParams(
            size=dimension,  # Embedding dimension (model-dependent)
            distance=models.Distance.COSINE,  # Cosine similarity for semantic search
        ),
        optimizers_config=models.OptimizersConfigDiff(
            # Size threshold in kilobytes, not a vector count: segments above it
            # get an HNSW index, smaller ones are searched exhaustively
            indexing_threshold=10000,
        ),
        hnsw_config=models.HnswConfigDiff(
            m=16,  # Number of edges per node (balance speed/accuracy)
            ef_construct=100,  # Quality of index construction
        ),
    )
    # Index the tenant field so every user_id-filtered search stays fast
    await qdrant_client.create_payload_index(
        collection_name="nextcloud_content",
        field_name="user_id",
        field_schema=models.PayloadSchemaType.KEYWORD,
    )


# Payload schema (metadata)
payload_schema = {
    "user_id": str,  # Required: owner of content
    "content_type": str,  # "note", "file", "calendar_event"
    "content_id": str,  # Source ID (note_id, file_path, event_id)
    "note_id": int,  # Notes only: numeric note ID (used when deleting notes)
    "title": str,  # Searchable title
    "excerpt": str,  # First 200 chars for preview
    "category": str,  # Optional: category/folder
    "mime_type": str,  # Optional: file MIME type
    "shared_with": list[str],  # Optional: list of user_ids with access
    "tags": list[str],  # Optional: user tags
    "created_at": int,  # Unix timestamp
    "modified_at": int,  # Unix timestamp
    "indexed_at": int,  # Unix timestamp (for sync tracking)
}
```

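Qdrant point IDs must be unsigned integers or UUIDs, so a string such as `note_42` cannot be used directly. One way to get stable IDs is to derive a name-based UUIDv5 from the owner, content type, content ID and chunk index. This is an assumption, not the project's confirmed scheme. With it, re-indexing the same item overwrites its points instead of adding duplicates. The namespace string and `point_id` are hypothetical.

```python
import uuid

# Hypothetical namespace for this project's point IDs; any fixed UUID works
POINT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "nextcloud-mcp-server/points")


def point_id(user_id: str, content_type: str, content_id: str, chunk_index: int = 0) -> str:
    """Deterministic Qdrant point ID for one content item (or one chunk of it)."""
    # \x1f (unit separator) keeps e.g. ("a/b", "c") and ("a", "b/c") distinct
    name = f"{user_id}\x1f{content_type}\x1f{content_id}\x1f{chunk_index}"
    return str(uuid.uuid5(POINT_NAMESPACE, name))


print(point_id("alice", "note", "42"))
```

Including `user_id` in the name keeps IDs distinct even if two users' items share a `content_id`, such as identical file paths.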
#### 3.1 Multi-Tenancy via Filtering

```python
from qdrant_client import models

# User-specific search with filtering
search_results = await qdrant_client.search(
    collection_name="nextcloud_content",
    query_vector=query_embedding,
    query_filter=models.Filter(
        must=[
            # User owns the content OR it's shared with them
            models.Filter(
                should=[
                    models.FieldCondition(
                        key="user_id",
                        match=models.MatchValue(value=current_user_id),
                    ),
                    models.FieldCondition(
                        key="shared_with",
                        match=models.MatchAny(any=[current_user_id]),
                    ),
                ]
            ),
            # Optional: filter by content type
            models.FieldCondition(
                key="content_type",
                match=models.MatchValue(value="note"),
            ),
        ]
    ),
    limit=20,
    score_threshold=0.7,  # Only return confident matches (tune per embedding model)
)
```

### 4. Hybrid Search Implementation

Combine semantic and keyword search for best results:

```python
import asyncio

from httpx import HTTPStatusError

# mcp, require_scopes, Context, get_client, SearchNotesResponse,
# semantic_search and keyword_search are provided by the MCP server codebase


@mcp.tool()
@require_scopes("notes:read")
async def nc_notes_hybrid_search(
    query: str,
    ctx: Context,
    limit: int = 10,
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> SearchNotesResponse:
    """
    Hybrid search combining semantic understanding with keyword precision.

    Args:
        query: Natural language search query
        limit: Maximum results to return
        semantic_weight: Weight for semantic similarity (0-1)
        keyword_weight: Weight for keyword matching (0-1)
    """

    client = get_client(ctx)
    username = client.username

    # Run both searches concurrently; over-fetch so filtering still leaves `limit` results
    semantic_results, keyword_results = await asyncio.gather(
        semantic_search(query, username, limit=limit * 2),
        keyword_search(query, username, limit=limit * 2),
    )

    # Fusion: combine and rerank results
    fused_results = reciprocal_rank_fusion(
        semantic_results,
        keyword_results,
        semantic_weight=semantic_weight,
        keyword_weight=keyword_weight,
    )

    # Verify permissions via Nextcloud API (dual-phase authorization)
    verified_results = []
    for result in fused_results[: limit * 2]:  # Get extra for filtering
        try:
            note = await client.notes.get_note(result["note_id"])
        except HTTPStatusError as e:
            if e.response.status_code in (403, 404):
                continue  # User lost access, or the note was deleted after indexing
            raise
        verified_results.append({
            "note": note,
            "score": result["score"],
            "match_type": result["match_type"],  # "semantic", "keyword" or "semantic+keyword"
        })
        if len(verified_results) >= limit:
            break

    return SearchNotesResponse(
        results=verified_results,
        query=query,
        total_found=len(verified_results),
        search_method="hybrid",
    )


def _rank_map(results: list[dict]) -> dict:
    """Map note_id -> best 1-based rank (a note can appear more than once if chunked)"""
    ranks: dict = {}
    for rank, r in enumerate(results, start=1):
        ranks.setdefault(r["note_id"], rank)
    return ranks


def reciprocal_rank_fusion(
    semantic_results: list[dict],
    keyword_results: list[dict],
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
    k: int = 60,  # RRF constant
) -> list[dict]:
    """
    Weighted Reciprocal Rank Fusion for combining search results.

    RRF is more robust than score normalization because it depends only
    on ranks, not on absolute scores.
    """

    # Build rank maps (1-based, as in the RRF formula)
    semantic_ranks = _rank_map(semantic_results)
    keyword_ranks = _rank_map(keyword_results)

    # Calculate fused scores over all unique note IDs
    fused = []
    for note_id in semantic_ranks.keys() | keyword_ranks.keys():
        # RRF formula: score = sum(weight_i / (k + rank_i))
        score = 0.0
        match_type = []

        if note_id in semantic_ranks:
            score += semantic_weight / (k + semantic_ranks[note_id])
            match_type.append("semantic")

        if note_id in keyword_ranks:
            score += keyword_weight / (k + keyword_ranks[note_id])
            match_type.append("keyword")

        fused.append({
            "note_id": note_id,
            "score": score,
            "match_type": "+".join(match_type),
        })

    # Sort by fused score, highest first
    fused.sort(key=lambda x: x["score"], reverse=True)
    return fused
```

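Written out, the weighted fusion computed by `reciprocal_rank_fusion` scores each note $d$ as shown in the first line below. Here $r_i(d)$ is the note's rank in result list $i$, counting from 1 as in the original RRF formulation, and a list that does not contain $d$ contributes nothing. The second line is a worked example with the defaults $k = 60$, $w_{\mathrm{sem}} = 0.7$, $w_{\mathrm{kw}} = 0.3$. A note ranked 5th semantically and 3rd by keyword outscores a note ranked 1st semantically but absent from the keyword results:

```math
\begin{aligned}
\mathrm{RRF}(d) &= \sum_{i \in \{\mathrm{sem},\,\mathrm{kw}\}} \frac{w_i}{k + r_i(d)} \\
\frac{0.7}{60+5} + \frac{0.3}{60+3} &\approx 0.01553 \;>\; \frac{0.7}{60+1} \approx 0.01148
\end{aligned}
```

The large constant $k$ flattens the gap between neighbouring ranks, so agreement between the two searches matters more than a single top position.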
### 5. Document Chunking Strategy

For documents longer than the chunk size (512 tokens by default), split the text into fixed-size, overlapping chunks:

```python
import uuid

from qdrant_client import models


class DocumentChunker:
    """Chunk large documents for optimal embedding"""

    def __init__(self, chunk_size: int = 512, overlap: int = 50):
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must be in [0, chunk_size)")  # else the loop never advances
        self.chunk_size = chunk_size  # tokens
        self.overlap = overlap  # overlapping tokens

    def chunk_document(self, content: str, metadata: dict) -> list[tuple[str, dict]]:
        """
        Split document into overlapping chunks with metadata.

        Returns list of (chunk_text, chunk_metadata) tuples.
        """

        # Tokenize (approximated with whitespace-separated words for simplicity)
        tokens = content.split()

        if len(tokens) <= self.chunk_size:
            # Document fits in a single chunk
            return [(content, metadata)]

        chunks = []
        start = 0
        while True:
            end = min(start + self.chunk_size, len(tokens))
            chunk_metadata = {
                **metadata,
                "chunk_index": len(chunks),
                "chunk_start": start,
                "chunk_end": end,
                "is_chunk": True,
            }
            chunks.append((" ".join(tokens[start:end]), chunk_metadata))

            if end == len(tokens):
                break  # Reached the end; avoids a trailing overlap-only chunk
            # Move to next chunk, re-reading `overlap` tokens
            start = end - self.overlap

        return chunks


# Usage in sync worker (Document, embedding_service and qdrant_client come from the codebase)
async def index_document(doc: Document, user_id: str):
    """Index a document with chunking"""

    chunker = DocumentChunker(chunk_size=512, overlap=50)
    chunks = chunker.chunk_document(
        content=doc.content,
        metadata={
            "user_id": user_id,
            "content_type": "file",
            "content_id": doc.path,
            "title": doc.title,
            "mime_type": doc.mime_type,
        },
    )

    # Generate embeddings in batch
    embeddings = await embedding_service.embed_batch([text for text, _ in chunks])

    # Stable UUIDs (Qdrant IDs must be ints or UUIDs) so re-indexing overwrites points
    points = [
        models.PointStruct(
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{user_id}/file/{doc.path}#{meta.get('chunk_index', 0)}")),
            vector=embedding,
            payload={**meta, "excerpt": text[:200]},  # Preview
        )
        for (text, meta), embedding in zip(chunks, embeddings)
    ]

    # Drop chunks left over from a previous (possibly longer) version, then upsert
    await qdrant_client.delete(
        collection_name="nextcloud_content",
        points_selector=models.FilterSelector(
            filter=models.Filter(must=[
                models.FieldCondition(key="user_id", match=models.MatchValue(value=user_id)),
                models.FieldCondition(key="content_id", match=models.MatchValue(value=doc.path)),
            ])
        ),
    )
    await qdrant_client.upsert(collection_name="nextcloud_content", points=points)
```

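To see where chunk boundaries fall, here is the sliding-window arithmetic isolated into a hypothetical `chunk_windows` helper. It returns `(start, end)` token offsets only, without text. Each window starts `overlap` tokens before the previous one ended, and iteration stops once a window reaches the end of the document.

```python
def chunk_windows(n_tokens: int, chunk_size: int = 512, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) token offsets of overlapping fixed-size chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if n_tokens <= chunk_size:
        return [(0, n_tokens)]
    windows = []
    start = 0
    while True:
        end = min(start + chunk_size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            return windows
        start = end - overlap  # Step back so consecutive chunks share `overlap` tokens


print(chunk_windows(1000))  # [(0, 512), (462, 974), (924, 1000)]
```

With the defaults, each additional chunk covers 462 new tokens, so a document of $n$ tokens yields about $\lceil (n - 50) / 462 \rceil$ chunks.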
### 6. Background Sync Worker

```python
# nextcloud_mcp_server/sync/vector_indexer.py
import logging
import time
import uuid

from qdrant_client import AsyncQdrantClient, models

logger = logging.getLogger(__name__)


class VectorIndexer:
    """Indexes content into vector database"""

    def __init__(
        self,
        qdrant_client: AsyncQdrantClient,
        embedding_service: EmbeddingService,
        auth_provider: SyncAuthProvider,  # From ADR-002
    ):
        self.qdrant = qdrant_client
        self.embeddings = embedding_service
        self.auth = auth_provider

    async def sync_user_notes(self, user_id: str):
        """Sync all notes for a user"""

        # Get authenticated client for user
        client = await self.auth.get_user_client(user_id)

        # Fetch all notes
        notes = await client.notes.list_notes()
        logger.info(f"Syncing {len(notes)} notes for {user_id}")

        # Keep notes that are new or modified since they were last indexed
        # (keys are payload["content_id"], i.e. str(note.id))
        indexed = await self._get_indexed_note_ids(user_id)
        notes_to_update = [
            n for n in notes
            if str(n.id) not in indexed or n.modified > indexed[str(n.id)]
        ]

        if not notes_to_update:
            logger.info(f"All notes up-to-date for {user_id}")
            return

        # Generate embeddings in batch
        contents = [f"{n.title}\n\n{n.content}" for n in notes_to_update]
        embeddings = await self.embeddings.embed_batch(contents)

        # Prepare points for upsert
        points = [
            models.PointStruct(
                # Qdrant IDs must be unsigned ints or UUIDs; derive a stable UUID per note
                id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{user_id}/note/{note.id}")),
                vector=embedding,
                payload={
                    "user_id": user_id,
                    "content_type": "note",
                    "content_id": str(note.id),
                    "note_id": note.id,
                    "title": note.title,
                    "excerpt": note.content[:200],
                    "category": note.category,
                    "created_at": note.created,
                    "modified_at": note.modified,
                    "indexed_at": int(time.time()),
                },
            )
            for note, embedding in zip(notes_to_update, embeddings)
        ]

        # Upsert to Qdrant
        await self.qdrant.upsert(collection_name="nextcloud_content", points=points)
        logger.info(f"Indexed {len(points)} notes for {user_id}")

    async def _get_indexed_note_ids(self, user_id: str) -> dict[str, int]:
        """Get map of content_id -> modified_at for this user's indexed notes"""

        user_notes = models.Filter(
            must=[
                models.FieldCondition(key="user_id", match=models.MatchValue(value=user_id)),
                models.FieldCondition(key="content_type", match=models.MatchValue(value="note")),
            ]
        )
        indexed: dict[str, int] = {}
        offset = None
        while True:
            # scroll() returns (points, next_page_offset); page until the offset is None
            points, offset = await self.qdrant.scroll(
                collection_name="nextcloud_content",
                scroll_filter=user_notes,
                with_payload=["content_id", "modified_at"],
                limit=1000,
                offset=offset,
            )
            for point in points:
                indexed[point.payload["content_id"]] = point.payload["modified_at"]
            if offset is None:
                return indexed

    async def delete_note(self, user_id: str, note_id: int):
        """Remove deleted note from index"""

        await self.qdrant.delete(
            collection_name="nextcloud_content",
            points_selector=models.FilterSelector(
                filter=models.Filter(
                    must=[
                        models.FieldCondition(key="user_id", match=models.MatchValue(value=user_id)),
                        models.FieldCondition(key="note_id", match=models.MatchValue(value=note_id)),
                    ]
                )
            ),
        )
```

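`sync_user_notes` only detects new or modified notes. Notes deleted in Nextcloud stay in the index until something calls `delete_note`. Below is a minimal sketch of a complete diff as a pure, easily tested function. It assumes both sides are available as `{note_id: modified_timestamp}` maps keyed by string ID, the shape `_get_indexed_note_ids` returns. `SyncPlan` and `plan_sync` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class SyncPlan:
    to_upsert: list[str] = field(default_factory=list)  # New or modified note IDs
    to_delete: list[str] = field(default_factory=list)  # Indexed but gone from Nextcloud


def plan_sync(remote: dict[str, int], indexed: dict[str, int]) -> SyncPlan:
    """Compare Nextcloud's notes with the Qdrant index by modification time."""
    plan = SyncPlan()
    for note_id, modified in remote.items():
        if note_id not in indexed or modified > indexed[note_id]:
            plan.to_upsert.append(note_id)
    plan.to_delete = [note_id for note_id in indexed if note_id not in remote]
    return plan


plan = plan_sync(
    remote={"1": 100, "2": 250, "4": 300},
    indexed={"1": 100, "2": 200, "3": 150},
)
print(plan)  # SyncPlan(to_upsert=['2', '4'], to_delete=['3'])
```

Each ID in `to_delete` would map to a `delete_note` call, which converts it back to the integer `note_id` stored in the payload.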
### 7. Configuration

#### 7.1 Environment Variables

```bash
# Vector Database
QDRANT_URL=http://qdrant:6333
QDRANT_API_KEY=<secure-api-key>
QDRANT_COLLECTION=nextcloud_content

# Embedding Strategy (choose one)
# Option 1: OpenAI
OPENAI_API_KEY=sk-...
OPENAI_EMBEDDING_MODEL=text-embedding-3-small  # or text-embedding-3-large

# Option 2: Self-hosted
EMBEDDING_SERVICE_URL=http://embeddings:7997
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5

# Option 3: Local (fallback, no config needed)
# LOCAL_EMBEDDING_MODEL=all-MiniLM-L6-v2  # optional override

# Search Configuration
SEMANTIC_SEARCH_ENABLED=true
HYBRID_SEARCH_DEFAULT_SEMANTIC_WEIGHT=0.7
HYBRID_SEARCH_DEFAULT_KEYWORD_WEIGHT=0.3
SEARCH_SCORE_THRESHOLD=0.7

# Sync Configuration
VECTOR_SYNC_INTERVAL=300  # seconds
VECTOR_SYNC_BATCH_SIZE=100
```

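Below is a sketch of reading the search and sync variables into a typed settings object, using the defaults listed above. One assumption is that semantic search stays disabled unless explicitly enabled. `SearchSettings` and `load_settings` are hypothetical names. The function takes the environment as a mapping (defaulting to `os.environ`), which keeps it testable.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchSettings:
    semantic_search_enabled: bool
    semantic_weight: float
    keyword_weight: float
    score_threshold: float
    sync_interval_s: int
    sync_batch_size: int


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> SearchSettings:
    return SearchSettings(
        # Assumed default: off unless explicitly enabled
        semantic_search_enabled=_as_bool(env.get("SEMANTIC_SEARCH_ENABLED", "false")),
        semantic_weight=float(env.get("HYBRID_SEARCH_DEFAULT_SEMANTIC_WEIGHT", "0.7")),
        keyword_weight=float(env.get("HYBRID_SEARCH_DEFAULT_KEYWORD_WEIGHT", "0.3")),
        score_threshold=float(env.get("SEARCH_SCORE_THRESHOLD", "0.7")),
        sync_interval_s=int(env.get("VECTOR_SYNC_INTERVAL", "300")),
        sync_batch_size=int(env.get("VECTOR_SYNC_BATCH_SIZE", "100")),
    )


settings = load_settings({"SEMANTIC_SEARCH_ENABLED": "true", "VECTOR_SYNC_INTERVAL": "600"})
print(settings.semantic_search_enabled, settings.sync_interval_s)  # True 600
```

Parsing everything once at startup surfaces a malformed value (e.g. `VECTOR_SYNC_INTERVAL=5m`) as an immediate `ValueError` rather than a failure mid-sync.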
#### 7.2 Docker Compose

```yaml
services:
  # Vector Database
  qdrant:
    image: qdrant/qdrant:latest
    restart: always
    ports:
      - 127.0.0.1:6333:6333  # REST API
      - 127.0.0.1:6334:6334  # gRPC
    volumes:
      - qdrant_storage:/qdrant/storage
    environment:
      - QDRANT__SERVICE__API_KEY=${QDRANT_API_KEY}
      - QDRANT__SERVICE__HTTP_PORT=6333
      - QDRANT__SERVICE__GRPC_PORT=6334

  # Embedding Service (optional - for self-hosted)
  embeddings:
    image: michaelf34/infinity:latest
    restart: always
    # Infinity is configured through its `v2` CLI; `--engine optimum` gives better CPU performance
    command: ["v2", "--model-id", "BAAI/bge-small-en-v1.5", "--batch-size", "32", "--engine", "torch", "--port", "7997"]
    ports:
      - 127.0.0.1:7997:7997
    volumes:
      - embedding_models:/app/.cache
    # Optional: GPU support
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # MCP Server with vector search
  mcp:
    build: .
    command: ["--transport", "streamable-http"]
    depends_on:
      - app
      - qdrant
      - embeddings  # optional
    environment:
      # ... existing env vars ...
      - SEMANTIC_SEARCH_ENABLED=true
      - QDRANT_URL=http://qdrant:6333
      - QDRANT_API_KEY=${QDRANT_API_KEY}
      # Choose embedding strategy
      - EMBEDDING_SERVICE_URL=http://embeddings:7997
      # OR
      # - OPENAI_API_KEY=${OPENAI_API_KEY}

  # Vector Sync Worker
  mcp-vector-sync:
    build: .
    command: ["python", "-m", "nextcloud_mcp_server.sync.vector_indexer"]
    depends_on:
      - app
      - qdrant
      - embeddings  # optional
    environment:
      # Nextcloud + Auth (from ADR-002)
      - NEXTCLOUD_HOST=http://app:80
      - ENABLE_OFFLINE_ACCESS=true
      - TOKEN_ENCRYPTION_KEY=${TOKEN_ENCRYPTION_KEY}
      # Vector Database
      - QDRANT_URL=http://qdrant:6333
      - QDRANT_API_KEY=${QDRANT_API_KEY}
      # Embeddings
      - EMBEDDING_SERVICE_URL=http://embeddings:7997
    volumes:
      - sync-tokens:/app/data

volumes:
  qdrant_storage:
  embedding_models:
  sync-tokens:
```

### 8. Performance Optimization

#### 8.1 Indexing Performance

```python
import asyncio

from qdrant_client import models


# Batch embedding generation
async def embed_batch_chunked(
    texts: list[str],
    batch_size: int = 100,
) -> list[list[float]]:
    """Generate embeddings in chunks to bound request size and memory use"""

    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_embeddings = await embedding_service.embed_batch(batch)
        embeddings.extend(batch_embeddings)
        await asyncio.sleep(0.1)  # Simple pacing between requests (rate limiting)

    return embeddings


# Batched upsert (batches are sent one after another)
async def upsert_points_batched(
    points: list[models.PointStruct],
    batch_size: int = 100,
):
    """Upsert points in batches"""

    for i in range(0, len(points), batch_size):
        batch = points[i:i + batch_size]
        await qdrant_client.upsert(
            collection_name="nextcloud_content",
            points=batch,
            wait=False,  # Return once the update is accepted; Qdrant applies it in the background
        )
```

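`upsert_points_batched` above sends its batches one at a time. If indexing throughput matters, the batches can be sent concurrently with a cap on in-flight requests. Below is a minimal standard-library sketch. `send_batch` stands in for a call such as `qdrant_client.upsert(...)` or `embedding_service.embed_batch(...)`, and `run_batches` is a hypothetical helper.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence
from typing import TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_batches(
    items: Sequence[T],
    send_batch: Callable[[list[T]], Awaitable[R]],
    batch_size: int = 100,
    max_concurrency: int = 4,
) -> list[R]:
    """Send fixed-size batches concurrently, at most `max_concurrency` at a time.

    Results come back in batch order, because asyncio.gather preserves order.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(batch: list[T]) -> R:
        async with semaphore:  # Blocks while max_concurrency batches are in flight
            return await send_batch(batch)

    batches = [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]
    return await asyncio.gather(*(guarded(b) for b in batches))


async def _demo() -> None:
    async def fake_upsert(batch: list[int]) -> int:
        await asyncio.sleep(0.01)  # Simulated network round trip
        return len(batch)

    print(await run_batches(list(range(250)), fake_upsert, batch_size=100))  # [100, 100, 50]


asyncio.run(_demo())
```

The cap matters for hosted embedding APIs with rate limits. If any batch raises, `gather` propagates the first exception. Callers that want partial progress could pass `return_exceptions=True` instead.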
#### 8.2 Search Performance

```python
from qdrant_client import models

# Search with a larger HNSW candidate list (hnsw_ef) for better accuracy
search_results = await qdrant_client.search(
    collection_name="nextcloud_content",
    query_vector=query_embedding,
    query_filter=user_filter,
    limit=20,
    with_payload=True,
    with_vectors=False,  # Don't return vectors (saves bandwidth)
    search_params=models.SearchParams(
        hnsw_ef=128,  # Higher = more accurate but slower
        exact=False,  # Use HNSW index (True forces a full scan)
    ),
)
```

#### 8.3 Caching

```python
# Cache embeddings for common queries
import hashlib
import json

import redis.asyncio as aioredis

# Connection settings are deployment-specific
redis = aioredis.Redis(host="redis", port=6379)


def cache_key(text: str, model_name: str) -> str:
    """Stable key per (model, text): switching models never returns stale vectors"""
    return hashlib.sha256(f"{model_name}\x00{text}".encode()).hexdigest()


async def embed_with_cache(text: str, model_name: str) -> list[float]:
    """Generate embedding with caching"""

    key = f"embedding:{cache_key(text, model_name)}"

    # Check Redis cache
    cached = await redis.get(key)
    if cached is not None:
        return json.loads(cached)

    # Generate embedding
    embedding = await embedding_service.embed(text)

    # Cache for 1 hour
    await redis.setex(key, 3600, json.dumps(embedding))

    return embedding
```

### 9. Monitoring and Metrics

```python
# Prometheus metrics
from prometheus_client import Counter, Gauge, Histogram

# Keep label cardinality low: a user_id label would create one time series per user

# Search metrics
semantic_search_count = Counter(
    'semantic_search_total',
    'Total semantic searches',
    ['content_type']
)

semantic_search_latency = Histogram(
    'semantic_search_duration_seconds',
    'Semantic search latency',
    ['phase']  # 'embedding', 'vector_search', 'verification'
)

# Indexing metrics
documents_indexed = Counter(
    'documents_indexed_total',
    'Total documents indexed',
    ['content_type']
)

index_queue_size = Gauge(
    'index_queue_size',
    'Number of documents waiting to be indexed'
)


# Usage
async def semantic_search(query: str, user_id: str):
    semantic_search_count.labels(content_type='note').inc()

    with semantic_search_latency.labels(phase='embedding').time():
        embedding = await embed(query)

    with semantic_search_latency.labels(phase='vector_search').time():
        results = await qdrant.search(...)

    with semantic_search_latency.labels(phase='verification').time():
        verified = await verify_access(results)

    return verified
```

## Consequences

### Benefits

1. **Semantic Understanding**
   - Find content by meaning, not just keywords
   - Support for natural language queries
   - Cross-lingual search potential
   - Better context discovery for LLMs

2. **User Experience**
   - More relevant search results
   - Discover related content across apps
   - Fast sub-second query latency
   - Hybrid search combines the strengths of semantic and keyword matching

3. **Architecture**
   - External sidecar (doesn't bloat MCP server)
   - Configurable embedding backend (cloud/self-hosted/local)
   - Multi-tenant with strict isolation
   - Scales horizontally (Qdrant cluster)

4. **Privacy & Security**
   - Self-hosted option available
   - Dual-phase authorization enforces permissions
   - Vector DB is a cache, not the source of truth
   - Per-user audit trail

5. **Developer Experience**
   - Simple async Python API
   - Comprehensive monitoring
   - Clear upgrade path (better embeddings, reranking)

### Limitations

1. **Complexity**
   - Additional infrastructure (Qdrant + embeddings)
   - More monitoring required
   - Embedding generation latency
   - Initial indexing time for large collections

2. **Cost**
   - Storage: ~4KB per document with 384-dimension embeddings (embedding + metadata); a 1536-dimension float32 vector alone takes ~6KB
   - Compute: Embedding generation (API costs or GPU)
   - Memory: Qdrant keeps vectors in RAM for speed

3. **Operational**
   - Index maintenance and updates
   - Embedding model versioning
   - Handling deleted/moved content
   - Cold start indexing for new users

4. **Search Accuracy**
   - Quality depends on embedding model
   - May miss exact keyword matches (mitigated by hybrid search)
   - Cultural/domain-specific terms may not embed well
   - Requires tuning score thresholds

### Performance Characteristics

| Metric | Target | Notes |
|--------|--------|-------|
| Search latency | <200ms | Embedding + vector search + verification |
| Indexing throughput | >100 docs/sec | With batch embeddings |
| Memory per 10k docs | ~40MB | Qdrant vectors + metadata |
| Disk per 10k docs | ~40MB | Persistent storage |
| Search accuracy | >90% | At score_threshold=0.7 |

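The memory figures depend heavily on the embedding dimension. Raw float32 vector storage is simply documents × dimensions × 4 bytes. The sketch below computes only that term, so HNSW graph links and payload come on top. `raw_vector_mb` is a hypothetical helper.

```python
def raw_vector_mb(num_docs: int, dimension: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in MB (10**6 bytes), excluding HNSW links and payload."""
    return num_docs * dimension * bytes_per_value / 1_000_000


for dim in (384, 768, 1536):
    print(dim, raw_vector_mb(10_000, dim))
# 384 15.36
# 768 30.72
# 1536 61.44
```

With a 1536-dimension OpenAI model, the vectors alone need about 61MB per 10k documents. The ~40MB target therefore appears to assume a 384-dimension model, or quantized/on-disk vectors for larger ones.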
### Cost Estimates

**Small Deployment** (10 users, 1000 notes each):
- Initial indexing (one-time): 10,000 notes × $0.00002 = $0.20 (OpenAI)
- Monthly searches: 1,000 queries × $0.00002 = $0.02
- Infrastructure: Qdrant (~40MB RAM), Embeddings (optional)
- **Total**: ~$0.22 in the first month, then ~$0.02/month (API), or negligible if self-hosted

**Medium Deployment** (100 users, 500 notes each):
- Initial indexing (one-time): 50,000 notes × $0.00002 = $1.00
- Monthly searches: 10,000 queries × $0.00002 = $0.20
- Infrastructure: Qdrant (~200MB RAM)
- **Total**: ~$1.20 in the first month, then ~$0.20/month (API), or self-hosted

**Self-Hosted** (any size):
- GPU instance: ~$0.50/hour (~$360/month for 24/7)
- Or CPU-only: negligible cost, slower embeddings

### Future Enhancements

1. **Multimodal Search**
   - Image embeddings (CLIP)
   - PDF/document layout understanding
   - Audio transcription + embedding

2. **Advanced Ranking**
   - Cross-encoder reranking
   - Learning-to-rank models
   - User feedback signals

3. **Query Understanding**
   - Query expansion
   - Spell correction
   - Entity extraction

4. **Performance**
   - Query result caching
   - Approximate nearest neighbor improvements
   - Quantization for reduced memory

5. **Features**
   - Saved searches
   - Search analytics
   - Recommended content

## Alternatives Considered

### Alternative 1: Elasticsearch/OpenSearch

**Approach**: Use traditional full-text search engine with vector plugin

**Pros**:
- Mature ecosystem
- Excellent keyword search
- Rich query DSL

**Cons**:
- Heavy infrastructure (JVM-based)
- Complex setup and tuning
- Vector search is plugin/add-on (not native)
- Higher resource usage

**Decision**: Rejected; Qdrant is purpose-built for vectors

### Alternative 2: ChromaDB

**Approach**: Embedded or client-server vector database

**Pros**:
- Simple Python API
- Easy to get started
- Good for prototyping

**Cons**:
- Sync-only Python client (no async)
- Limited multi-tenancy features
- Less mature than Qdrant
- Scaling concerns

**Decision**: Rejected; async and multi-tenancy are critical

### Alternative 3: Weaviate

**Approach**: Full-featured vector database with GraphQL

**Pros**:
- Very feature-rich
- Built-in vectorization
- Good documentation

**Cons**:
- More complex architecture
- Higher resource usage
- GraphQL adds complexity
- Overkill for our use case

**Decision**: Rejected; Qdrant provides better balance

### Alternative 4: pgvector (PostgreSQL Extension)

**Approach**: Add vector search to existing PostgreSQL

**Pros**:
- Leverages existing PostgreSQL expertise
- Transactional consistency
- Mature database ecosystem

**Cons**:
- This deployment uses MariaDB (would need PostgreSQL)
- Performance not as optimized as purpose-built vector DB
- Manual hybrid search implementation
- HNSW index limitations

**Decision**: Rejected; dedicated vector DB is better fit

### Alternative 5: Pinecone / Vertex AI Vector Search

**Approach**: Managed cloud vector database

**Pros**:
- Fully managed
- Excellent performance
- No infrastructure management

**Cons**:
- Cloud-only (no self-hosting)
- Recurring costs
- Vendor lock-in
- Data leaves premises

**Decision**: Rejected; self-hosting is important for privacy

## Related Decisions

- ADR-001: Enhanced Note Search (establishes need for better search)
- ADR-002: Vector Sync Authentication (defines how sync workers authenticate)
- [Future] ADR-004: Content Extraction and Document Processing
- [Future] ADR-005: Cross-App Semantic Search

## References

- [Qdrant Documentation](https://qdrant.tech/documentation/)
- [Sentence Transformers](https://www.sbert.net/)
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [Hybrid Search with RRF](https://qdrant.tech/articles/hybrid-search/)
- [HNSW Algorithm](https://arxiv.org/abs/1603.09320)
- [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf)