Compare commits
12 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| d374bfa1e5 | |||
| 2522b13d35 | |||
| 6cfd7e2729 | |||
| 3aa7128f45 | |||
| c3282534eb | |||
| 862308418e | |||
| 3464b21845 | |||
| ea01ce7673 | |||
| 216cb94383 | |||
| 5f3e0b84a3 | |||
| 392e1536b9 | |||
| 00ed3f07e5 |
@@ -5,3 +5,4 @@
|
||||
!uv.lock
|
||||
|
||||
!nextcloud_mcp_server/**/*.py
|
||||
!nextcloud_mcp_server/**/*.html
|
||||
|
||||
@@ -85,4 +85,4 @@ jobs:
|
||||
NEXTCLOUD_USERNAME: "admin"
|
||||
NEXTCLOUD_PASSWORD: "admin"
|
||||
run: |
|
||||
uv run pytest -v --log-cli-level=WARN -m smoke
|
||||
uv run pytest -v --log-cli-level=WARN -m unit -m smoke
|
||||
|
||||
+3
-3
@@ -1,6 +1,6 @@
|
||||
[submodule "oidc"]
|
||||
path = third_party/oidc
|
||||
url = https://github.com/cbcoutinho/oidc
|
||||
[submodule "third_party/oidc"]
|
||||
path = third_party/oidc
|
||||
url = https://github.com/cbcoutinho/oidc
|
||||
[submodule "third_party/notes"]
|
||||
path = third_party/notes
|
||||
url = https://github.com/cbcoutinho/notes
|
||||
|
||||
+2
-2
@@ -1,6 +1,6 @@
|
||||
FROM python:3.12-slim-trixie
|
||||
FROM docker.io/library/python:3.12-slim-trixie@sha256:d86b4c74b936c438cd4cc3a9f7256b9a7c27ad68c7caf8c205e18d9845af0164
|
||||
|
||||
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
|
||||
COPY --from=ghcr.io/astral-sh/uv:latest@sha256:f6e3549ed287fee0ddde2460a2a74a2d74366f84b04aaa34c1f19fec40da8652 /uv /uvx /bin/
|
||||
|
||||
# Install dependencies
|
||||
# 1. git (required for caldav dependency from git)
|
||||
|
||||
@@ -147,7 +147,95 @@ This decision consolidates our retrieval logic, eliminates the data consistency
|
||||
|
||||
**Benefits Realized:**
|
||||
- ✅ Consolidated architecture (single Qdrant database for both dense + sparse)
|
||||
- ✅ Native RRF fusion (database-level, more efficient)
|
||||
- ✅ Native fusion algorithms (database-level, more efficient)
|
||||
- ✅ Industry-standard BM25 (replaces custom keyword search)
|
||||
- ✅ Simplified codebase (removed 736 lines of legacy code)
|
||||
- ✅ Better relevance (handles both semantic and keyword queries)
|
||||
- ✅ Configurable fusion methods (RRF and DBSF)
|
||||
|
||||
---
|
||||
|
||||
### 7. Fusion Algorithm Options
|
||||
|
||||
**Update: 2025-11-16**
|
||||
|
||||
The BM25 hybrid search now supports two fusion algorithms for combining dense (semantic) and sparse (BM25) search results:
|
||||
|
||||
#### Reciprocal Rank Fusion (RRF)
|
||||
|
||||
**Default fusion method.** RRF is a widely-used, well-established algorithm that combines rankings from multiple retrieval systems using the reciprocal rank formula:
|
||||
|
||||
```
|
||||
RRF(doc) = Σ 1/(k + rank_i(doc))
|
||||
```
|
||||
|
||||
where `k` is a constant (typically 60) and `rank_i(doc)` is the rank of the document in retrieval system `i`.
|
||||
|
||||
**Characteristics:**
|
||||
- ✅ **General-purpose**: Works well across diverse query types and document collections
|
||||
- ✅ **Rank-based**: Focuses on relative rankings rather than absolute scores
|
||||
- ✅ **Established**: Well-tested, documented, and understood in IR literature
|
||||
- ✅ **Robust**: Less sensitive to score distribution differences between systems
|
||||
|
||||
**When to use RRF:**
|
||||
- Default choice for most use cases
|
||||
- When you have mixed query types (semantic + keyword)
|
||||
- When retrieval systems have very different score ranges
|
||||
- When you want predictable, well-understood behavior
|
||||
|
||||
#### Distribution-Based Score Fusion (DBSF)
|
||||
|
||||
**Alternative fusion method.** DBSF normalizes scores from each retrieval system using distribution statistics before combining them:
|
||||
|
||||
1. **Normalization**: For each query, calculates mean (μ) and standard deviation (σ) of scores
|
||||
2. **Outlier handling**: Uses μ ± 3σ as normalization bounds
|
||||
3. **Fusion**: Sums normalized scores across systems
|
||||
|
||||
**Characteristics:**
|
||||
- ✅ **Score-aware**: Uses actual relevance scores, not just rankings
|
||||
- ✅ **Statistical**: Normalizes based on score distribution properties
|
||||
- ⚠️ **Experimental**: Newer algorithm, less battle-tested than RRF
|
||||
- ⚠️ **Sensitive**: May behave differently depending on score distributions
|
||||
|
||||
**When to use DBSF:**
|
||||
- When retrieval systems have vastly different score ranges that RRF doesn't balance well
|
||||
- When you want to experiment with score-based (vs rank-based) fusion
|
||||
- When statistical normalization better matches your use case
|
||||
- For A/B testing against RRF to measure retrieval quality improvements
|
||||
|
||||
#### Configuration
|
||||
|
||||
Both fusion algorithms are exposed via the `fusion` parameter in MCP tools:
|
||||
|
||||
```python
|
||||
# Use RRF (default)
|
||||
response = await nc_semantic_search(
|
||||
query="async programming",
|
||||
fusion="rrf" # Can be omitted, RRF is default
|
||||
)
|
||||
|
||||
# Use DBSF
|
||||
response = await nc_semantic_search(
|
||||
query="async programming",
|
||||
fusion="dbsf"
|
||||
)
|
||||
```
|
||||
|
||||
The `nc_semantic_search_answer` tool also supports the `fusion` parameter and passes it through to the underlying search.
|
||||
|
||||
#### Future: Configurable Weights
|
||||
|
||||
**Current limitation**: Neither RRF nor DBSF currently support per-system weights (e.g., 0.8 for semantic, 0.2 for BM25). This is a Qdrant platform limitation tracked in [qdrant/qdrant#6067](https://github.com/qdrant/qdrant/issues/6067).
|
||||
|
||||
When Qdrant adds weight support, the `fusion` parameter can be extended to accept weight configurations:
|
||||
|
||||
```python
|
||||
# Hypothetical future API
|
||||
response = await nc_semantic_search(
|
||||
query="async programming",
|
||||
fusion="rrf",
|
||||
fusion_weights={"dense": 0.7, "sparse": 0.3} # Not yet implemented
|
||||
)
|
||||
```
|
||||
|
||||
**Recommendation**: Start with RRF (default). If you encounter cases where keyword matches are under- or over-weighted, experiment with DBSF. Monitor [qdrant/qdrant#6067](https://github.com/qdrant/qdrant/issues/6067) for configurable weight support.
|
||||
|
||||
@@ -1478,6 +1478,7 @@ def get_app(transport: str = "sse", enabled_apps: list[str] | None = None):
|
||||
vector_sync_status_fragment,
|
||||
)
|
||||
from nextcloud_mcp_server.auth.viz_routes import (
|
||||
chunk_context_endpoint,
|
||||
vector_visualization_html,
|
||||
vector_visualization_search,
|
||||
)
|
||||
@@ -1509,6 +1510,11 @@ def get_app(transport: str = "sse", enabled_apps: list[str] | None = None):
|
||||
vector_visualization_search,
|
||||
methods=["GET"],
|
||||
), # /app/vector-viz/search
|
||||
Route(
|
||||
"/chunk-context",
|
||||
chunk_context_endpoint,
|
||||
methods=["GET"],
|
||||
), # /app/chunk-context
|
||||
# Webhook management routes (admin-only)
|
||||
Route("/webhooks", webhook_management_pane, methods=["GET"]), # /app/webhooks
|
||||
Route(
|
||||
|
||||
@@ -0,0 +1,339 @@
|
||||
<style>
|
||||
.viz-card {
|
||||
background: white;
|
||||
border-radius: 8px;
|
||||
padding: 20px;
|
||||
margin-bottom: 20px;
|
||||
box-shadow: 0 2px 4px rgba(0,0,0,0.1);
|
||||
}
|
||||
.viz-controls {
|
||||
margin-bottom: 20px;
|
||||
}
|
||||
.viz-control-row {
|
||||
display: grid;
|
||||
grid-template-columns: 2fr 1fr auto;
|
||||
gap: 12px;
|
||||
margin-bottom: 12px;
|
||||
align-items: end;
|
||||
}
|
||||
.viz-control-group {
|
||||
margin-bottom: 15px;
|
||||
}
|
||||
.viz-control-group label {
|
||||
display: block;
|
||||
margin-bottom: 5px;
|
||||
font-weight: 500;
|
||||
color: #333;
|
||||
}
|
||||
.viz-control-group input[type="text"],
|
||||
.viz-control-group input[type="number"],
|
||||
.viz-control-group select {
|
||||
width: 100%;
|
||||
padding: 8px 12px;
|
||||
border: 1px solid #ddd;
|
||||
border-radius: 4px;
|
||||
font-size: 14px;
|
||||
}
|
||||
.viz-control-group input[type="range"] {
|
||||
width: 100%;
|
||||
}
|
||||
.viz-control-group select[multiple] {
|
||||
min-height: 100px;
|
||||
}
|
||||
.viz-weight-display {
|
||||
display: inline-block;
|
||||
min-width: 40px;
|
||||
text-align: right;
|
||||
color: #666;
|
||||
}
|
||||
.viz-btn {
|
||||
background: #0066cc;
|
||||
color: white;
|
||||
border: none;
|
||||
padding: 10px 20px;
|
||||
border-radius: 4px;
|
||||
cursor: pointer;
|
||||
font-size: 14px;
|
||||
font-weight: 500;
|
||||
}
|
||||
.viz-btn:hover {
|
||||
background: #0052a3;
|
||||
}
|
||||
.viz-btn-secondary {
|
||||
background: #6c757d;
|
||||
color: white;
|
||||
border: none;
|
||||
padding: 6px 12px;
|
||||
border-radius: 4px;
|
||||
cursor: pointer;
|
||||
font-size: 13px;
|
||||
margin-bottom: 12px;
|
||||
}
|
||||
.viz-btn-secondary:hover {
|
||||
background: #5a6268;
|
||||
}
|
||||
#viz-plot-container {
|
||||
width: 100%;
|
||||
height: 600px;
|
||||
position: relative;
|
||||
}
|
||||
#viz-plot {
|
||||
width: 100%;
|
||||
height: 100%;
|
||||
}
|
||||
.viz-loading {
|
||||
text-align: center;
|
||||
padding: 40px;
|
||||
color: #666;
|
||||
}
|
||||
.viz-loading-overlay {
|
||||
position: absolute;
|
||||
inset: 0;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
justify-content: center;
|
||||
background: white;
|
||||
color: #666;
|
||||
}
|
||||
.viz-no-results {
|
||||
text-align: center;
|
||||
padding: 40px;
|
||||
color: #666;
|
||||
font-style: italic;
|
||||
}
|
||||
.viz-advanced-section {
|
||||
margin-top: 16px;
|
||||
padding: 16px;
|
||||
background: #f8f9fa;
|
||||
border-radius: 4px;
|
||||
border: 1px solid #dee2e6;
|
||||
}
|
||||
.viz-advanced-grid {
|
||||
display: grid;
|
||||
grid-template-columns: 1fr 1fr;
|
||||
gap: 20px;
|
||||
}
|
||||
.viz-info-box {
|
||||
background: #e3f2fd;
|
||||
border-left: 4px solid #2196f3;
|
||||
padding: 12px;
|
||||
margin-bottom: 20px;
|
||||
font-size: 14px;
|
||||
}
|
||||
.chunk-toggle-btn {
|
||||
background: #6c757d;
|
||||
color: white;
|
||||
border: none;
|
||||
padding: 4px 10px;
|
||||
border-radius: 3px;
|
||||
cursor: pointer;
|
||||
font-size: 12px;
|
||||
margin-top: 6px;
|
||||
}
|
||||
.chunk-toggle-btn:hover {
|
||||
background: #5a6268;
|
||||
}
|
||||
.chunk-context {
|
||||
background: #f8f9fa;
|
||||
border: 1px solid #dee2e6;
|
||||
border-radius: 4px;
|
||||
padding: 12px;
|
||||
margin-top: 8px;
|
||||
font-family: monospace;
|
||||
font-size: 13px;
|
||||
line-height: 1.6;
|
||||
white-space: pre-wrap;
|
||||
word-wrap: break-word;
|
||||
}
|
||||
.chunk-text {
|
||||
color: #666;
|
||||
}
|
||||
.chunk-matched {
|
||||
background: #fff3cd;
|
||||
border: 1px solid #ffc107;
|
||||
padding: 2px 4px;
|
||||
border-radius: 2px;
|
||||
font-weight: 500;
|
||||
color: #333;
|
||||
}
|
||||
.chunk-ellipsis {
|
||||
color: #999;
|
||||
font-style: italic;
|
||||
}
|
||||
</style>
|
||||
|
||||
<div x-data="vizApp()">
|
||||
<div class="viz-card">
|
||||
<h2>Vector Visualization</h2>
|
||||
<div class="viz-info-box">
|
||||
Testing search algorithms on your indexed documents. User: <strong>{{ username }}</strong>
|
||||
</div>
|
||||
|
||||
<form @submit.prevent="executeSearch">
|
||||
<div class="viz-controls">
|
||||
<!-- Main Controls -->
|
||||
<div class="viz-control-group">
|
||||
<label>Search Query</label>
|
||||
<input type="text" x-model="query" placeholder="Enter search query..." required />
|
||||
</div>
|
||||
|
||||
<div class="viz-control-row">
|
||||
<div class="viz-control-group" style="margin-bottom: 0;">
|
||||
<label>Algorithm</label>
|
||||
<select x-model="algorithm">
|
||||
<option value="semantic">Semantic (Dense Vectors)</option>
|
||||
<option value="bm25_hybrid" selected>BM25 Hybrid (Dense + Sparse)</option>
|
||||
</select>
|
||||
</div>
|
||||
|
||||
<div class="viz-control-group" style="margin-bottom: 0;">
|
||||
<label>Fusion Method</label>
|
||||
<select x-model="fusion" :disabled="algorithm !== 'bm25_hybrid'" :style="algorithm !== 'bm25_hybrid' ? 'opacity: 0.5; cursor: not-allowed;' : ''">
|
||||
<option value="rrf" selected>RRF (Reciprocal Rank Fusion)</option>
|
||||
<option value="dbsf">DBSF (Distribution-Based Score Fusion)</option>
|
||||
</select>
|
||||
</div>
|
||||
|
||||
<div style="display: flex; align-items: flex-end;">
|
||||
<button type="submit" class="viz-btn" style="width: 100%;">Search & Visualize</button>
|
||||
</div>
|
||||
|
||||
<div style="display: flex; align-items: flex-end;">
|
||||
<button type="button" class="viz-btn-secondary" @click="showAdvanced = !showAdvanced" style="white-space: nowrap;">
|
||||
<span x-text="showAdvanced ? 'Hide Advanced' : 'Advanced'"></span>
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Advanced Options (Collapsible) -->
|
||||
<div class="viz-advanced-section" x-show="showAdvanced" x-transition.opacity.duration.200ms>
|
||||
<h3 style="margin-top: 0; margin-bottom: 16px; font-size: 16px;">Advanced Options</h3>
|
||||
|
||||
<div class="viz-advanced-grid">
|
||||
<div class="viz-control-group">
|
||||
<label style="display: block; margin-bottom: 8px;">Document Types</label>
|
||||
<div style="display: grid; grid-template-columns: 1fr; gap: 6px;">
|
||||
<label style="display: flex; align-items: center; cursor: pointer; font-weight: normal;">
|
||||
<input type="checkbox" x-model="docTypes" value="" style="margin-right: 8px;">
|
||||
<span>All Types</span>
|
||||
</label>
|
||||
<label style="display: flex; align-items: center; cursor: pointer; font-weight: normal;">
|
||||
<input type="checkbox" x-model="docTypes" value="note" style="margin-right: 8px;">
|
||||
<span>Notes</span>
|
||||
</label>
|
||||
<label style="display: flex; align-items: center; cursor: pointer; font-weight: normal;">
|
||||
<input type="checkbox" x-model="docTypes" value="file" style="margin-right: 8px;">
|
||||
<span>Files</span>
|
||||
</label>
|
||||
<label style="display: flex; align-items: center; cursor: pointer; font-weight: normal;">
|
||||
<input type="checkbox" x-model="docTypes" value="calendar" style="margin-right: 8px;">
|
||||
<span>Calendar Events</span>
|
||||
</label>
|
||||
<label style="display: flex; align-items: center; cursor: pointer; font-weight: normal;">
|
||||
<input type="checkbox" x-model="docTypes" value="contact" style="margin-right: 8px;">
|
||||
<span>Contacts</span>
|
||||
</label>
|
||||
<label style="display: flex; align-items: center; cursor: pointer; font-weight: normal;">
|
||||
<input type="checkbox" x-model="docTypes" value="deck" style="margin-right: 8px;">
|
||||
<span>Deck Cards</span>
|
||||
</label>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div>
|
||||
<div class="viz-control-group">
|
||||
<label>Score Threshold (Semantic/Hybrid)</label>
|
||||
<input type="number" x-model.number="scoreThreshold" min="0" max="1" step="any" />
|
||||
</div>
|
||||
|
||||
<div class="viz-control-group">
|
||||
<label>Result Limit</label>
|
||||
<input type="number" x-model.number="limit" min="1" max="100" />
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Info: BM25 Hybrid fusion methods -->
|
||||
<div x-show="algorithm === 'bm25_hybrid'" style="margin-top: 16px; padding: 12px; background: #e9ecef; border-radius: 4px;">
|
||||
<p style="margin: 0; font-size: 14px; color: #666;">
|
||||
<strong>BM25 Hybrid Search:</strong> Combines dense semantic vectors with sparse BM25 keyword vectors.
|
||||
</p>
|
||||
<p style="margin: 8px 0 0 0; font-size: 13px; color: #666;">
|
||||
<strong>RRF:</strong> Reciprocal Rank Fusion - Rank-based fusion producing scores in [0.0, 1.0]
|
||||
</p>
|
||||
<p style="margin: 4px 0 0 0; font-size: 13px; color: #666;">
|
||||
<strong>DBSF:</strong> Distribution-Based Score Fusion - Sums normalized scores (can exceed 1.0)
|
||||
</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</form>
|
||||
</div>
|
||||
|
||||
<div class="viz-card">
|
||||
<div id="viz-plot-container">
|
||||
<div x-show="loading" class="viz-loading-overlay" x-transition.opacity.duration.200ms>
|
||||
Executing search and computing PCA projection...
|
||||
</div>
|
||||
<div id="viz-plot" x-show="!loading" x-transition.opacity.duration.200ms></div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="viz-card">
|
||||
<h3>Search Results (<span x-text="loading ? '...' : results.length"></span>)</h3>
|
||||
|
||||
<div x-show="loading" class="viz-loading" x-transition.opacity.duration.200ms>
|
||||
Loading results...
|
||||
</div>
|
||||
|
||||
<div x-show="!loading && results.length === 0" class="viz-no-results" x-transition.opacity.duration.200ms>
|
||||
No results found. Try a different query or adjust your search parameters.
|
||||
</div>
|
||||
|
||||
<template x-if="!loading && results.length > 0">
|
||||
<div x-transition.opacity.duration.200ms>
|
||||
<template x-for="result in results" :key="result.id">
|
||||
<div style="padding: 12px; border-bottom: 1px solid #eee;">
|
||||
<a :href="getNextcloudUrl(result)" target="_blank" style="font-weight: 500; color: #0066cc; text-decoration: none;">
|
||||
<span x-text="result.title"></span>
|
||||
</a>
|
||||
<div style="font-size: 14px; color: #666; margin-top: 4px;" x-text="result.excerpt"></div>
|
||||
<div style="font-size: 12px; color: #999; margin-top: 4px;">
|
||||
Raw Score: <span x-text="result.original_score.toFixed(3)"></span>
|
||||
(<span x-text="(result.score * 100).toFixed(0)"></span>% relative) |
|
||||
Type: <span x-text="result.doc_type"></span>
|
||||
</div>
|
||||
|
||||
<!-- Show Chunk button (only if chunk position is available) -->
|
||||
<template x-if="hasChunkPosition(result)">
|
||||
<button
|
||||
class="chunk-toggle-btn"
|
||||
@click="toggleChunk(result)"
|
||||
x-text="isChunkExpanded(`${result.doc_type}_${result.id}`) ? 'Hide Chunk' : 'Show Chunk'"
|
||||
></button>
|
||||
</template>
|
||||
|
||||
<!-- Chunk context (expanded inline) -->
|
||||
<template x-if="isChunkExpanded(`${result.doc_type}_${result.id}`)">
|
||||
<div class="chunk-context" x-transition.opacity.duration.200ms>
|
||||
<template x-if="chunkLoading[`${result.doc_type}_${result.id}`]">
|
||||
<div style="color: #666; font-style: italic;">Loading chunk...</div>
|
||||
</template>
|
||||
<template x-if="!chunkLoading[`${result.doc_type}_${result.id}`]">
|
||||
<div>
|
||||
<template x-if="expandedChunks[`${result.doc_type}_${result.id}`]?.has_more_before">
|
||||
<span class="chunk-ellipsis">...</span>
|
||||
</template>
|
||||
<span class="chunk-text" x-text="expandedChunks[`${result.doc_type}_${result.id}`]?.before_context"></span><span class="chunk-matched" x-text="expandedChunks[`${result.doc_type}_${result.id}`]?.chunk_text"></span><span class="chunk-text" x-text="expandedChunks[`${result.doc_type}_${result.id}`]?.after_context"></span><template x-if="expandedChunks[`${result.doc_type}_${result.id}`]?.has_more_after">
|
||||
<span class="chunk-ellipsis">...</span>
|
||||
</template>
|
||||
</div>
|
||||
</template>
|
||||
</div>
|
||||
</template>
|
||||
</div>
|
||||
</template>
|
||||
</div>
|
||||
</template>
|
||||
</div>
|
||||
</div>
|
||||
@@ -677,12 +677,15 @@ async def user_info_html(request: Request) -> HTMLResponse:
|
||||
return {{
|
||||
query: '',
|
||||
algorithm: 'bm25_hybrid',
|
||||
fusion: 'rrf', // Default fusion method for BM25 Hybrid
|
||||
showAdvanced: false,
|
||||
docTypes: [''], // Default to "All Types"
|
||||
limit: 50,
|
||||
scoreThreshold: 0.0,
|
||||
loading: false,
|
||||
results: [],
|
||||
expandedChunks: {{}}, // Track which chunks are expanded (result_id -> chunk data)
|
||||
chunkLoading: {{}}, // Track loading state per result
|
||||
|
||||
async executeSearch() {{
|
||||
this.loading = true;
|
||||
@@ -696,6 +699,11 @@ async def user_info_html(request: Request) -> HTMLResponse:
|
||||
score_threshold: this.scoreThreshold,
|
||||
}});
|
||||
|
||||
// Add fusion parameter for BM25 Hybrid
|
||||
if (this.algorithm === 'bm25_hybrid') {{
|
||||
params.append('fusion', this.fusion);
|
||||
}}
|
||||
|
||||
// Add doc_types parameter (filter out empty string for "All Types")
|
||||
const selectedTypes = this.docTypes.filter(t => t !== '');
|
||||
if (selectedTypes.length > 0) {{
|
||||
@@ -729,7 +737,7 @@ async def user_info_html(request: Request) -> HTMLResponse:
|
||||
y: coordinates.map(c => c[1]),
|
||||
mode: 'markers',
|
||||
type: 'scatter',
|
||||
text: results.map(r => `${{r.title}}<br>Score: ${{r.score.toFixed(3)}}`),
|
||||
text: results.map(r => `${{r.title}}<br>Raw Score: ${{r.original_score.toFixed(3)}} (${{(r.score * 100).toFixed(0)}}% relative)`),
|
||||
marker: {{
|
||||
// Multi-channel encoding: size + opacity + color for visual hierarchy
|
||||
// Power scaling (score^2) amplifies visual differences dramatically
|
||||
@@ -778,6 +786,51 @@ async def user_info_html(request: Request) -> HTMLResponse:
|
||||
default:
|
||||
return `${{baseUrl}}`;
|
||||
}}
|
||||
}},
|
||||
|
||||
hasChunkPosition(result) {{
|
||||
// Check if result has position metadata
|
||||
return result.chunk_start_offset != null && result.chunk_end_offset != null;
|
||||
}},
|
||||
|
||||
isChunkExpanded(resultKey) {{
|
||||
return this.expandedChunks[resultKey] !== undefined;
|
||||
}},
|
||||
|
||||
async toggleChunk(result) {{
|
||||
const resultKey = `${{result.doc_type}}_${{result.id}}`;
|
||||
|
||||
// If already expanded, collapse
|
||||
if (this.isChunkExpanded(resultKey)) {{
|
||||
delete this.expandedChunks[resultKey];
|
||||
return;
|
||||
}}
|
||||
|
||||
// Otherwise, fetch and expand
|
||||
this.chunkLoading[resultKey] = true;
|
||||
|
||||
try {{
|
||||
const params = new URLSearchParams({{
|
||||
doc_type: result.doc_type,
|
||||
doc_id: result.id,
|
||||
start: result.chunk_start_offset,
|
||||
end: result.chunk_end_offset,
|
||||
context: 500 // 500 chars before/after
|
||||
}});
|
||||
|
||||
const response = await fetch(`/app/chunk-context?${{params}}`);
|
||||
const data = await response.json();
|
||||
|
||||
if (data.success) {{
|
||||
this.expandedChunks[resultKey] = data;
|
||||
}} else {{
|
||||
alert('Failed to load chunk: ' + data.error);
|
||||
}}
|
||||
}} catch (error) {{
|
||||
alert('Error loading chunk: ' + error.message);
|
||||
}} finally {{
|
||||
delete this.chunkLoading[resultKey];
|
||||
}}
|
||||
}}
|
||||
}}
|
||||
}}
|
||||
|
||||
@@ -12,8 +12,10 @@ All processing happens server-side following ADR-012:
|
||||
|
||||
import logging
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
from jinja2 import Environment, FileSystemLoader
|
||||
from starlette.authentication import requires
|
||||
from starlette.requests import Request
|
||||
from starlette.responses import HTMLResponse, JSONResponse
|
||||
@@ -28,6 +30,10 @@ from nextcloud_mcp_server.vector.qdrant_client import get_qdrant_client
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Setup Jinja2 environment for templates
|
||||
_template_dir = Path(__file__).parent / "templates"
|
||||
_jinja_env = Environment(loader=FileSystemLoader(_template_dir))
|
||||
|
||||
|
||||
@requires("authenticated", redirect="oauth_login")
|
||||
async def vector_visualization_html(request: Request) -> HTMLResponse:
|
||||
@@ -63,252 +69,9 @@ async def vector_visualization_html(request: Request) -> HTMLResponse:
|
||||
else "unknown"
|
||||
)
|
||||
|
||||
html_content = f"""
|
||||
<style>
|
||||
.viz-card {{
|
||||
background: white;
|
||||
border-radius: 8px;
|
||||
padding: 20px;
|
||||
margin-bottom: 20px;
|
||||
box-shadow: 0 2px 4px rgba(0,0,0,0.1);
|
||||
}}
|
||||
.viz-controls {{
|
||||
margin-bottom: 20px;
|
||||
}}
|
||||
.viz-control-row {{
|
||||
display: grid;
|
||||
grid-template-columns: 2fr 1fr auto;
|
||||
gap: 12px;
|
||||
margin-bottom: 12px;
|
||||
align-items: end;
|
||||
}}
|
||||
.viz-control-group {{
|
||||
margin-bottom: 15px;
|
||||
}}
|
||||
.viz-control-group label {{
|
||||
display: block;
|
||||
margin-bottom: 5px;
|
||||
font-weight: 500;
|
||||
color: #333;
|
||||
}}
|
||||
.viz-control-group input[type="text"],
|
||||
.viz-control-group input[type="number"],
|
||||
.viz-control-group select {{
|
||||
width: 100%;
|
||||
padding: 8px 12px;
|
||||
border: 1px solid #ddd;
|
||||
border-radius: 4px;
|
||||
font-size: 14px;
|
||||
}}
|
||||
.viz-control-group input[type="range"] {{
|
||||
width: 100%;
|
||||
}}
|
||||
.viz-control-group select[multiple] {{
|
||||
min-height: 100px;
|
||||
}}
|
||||
.viz-weight-display {{
|
||||
display: inline-block;
|
||||
min-width: 40px;
|
||||
text-align: right;
|
||||
color: #666;
|
||||
}}
|
||||
.viz-btn {{
|
||||
background: #0066cc;
|
||||
color: white;
|
||||
border: none;
|
||||
padding: 10px 20px;
|
||||
border-radius: 4px;
|
||||
cursor: pointer;
|
||||
font-size: 14px;
|
||||
font-weight: 500;
|
||||
}}
|
||||
.viz-btn:hover {{
|
||||
background: #0052a3;
|
||||
}}
|
||||
.viz-btn-secondary {{
|
||||
background: #6c757d;
|
||||
color: white;
|
||||
border: none;
|
||||
padding: 6px 12px;
|
||||
border-radius: 4px;
|
||||
cursor: pointer;
|
||||
font-size: 13px;
|
||||
margin-bottom: 12px;
|
||||
}}
|
||||
.viz-btn-secondary:hover {{
|
||||
background: #5a6268;
|
||||
}}
|
||||
#viz-plot-container {{
|
||||
width: 100%;
|
||||
height: 600px;
|
||||
position: relative;
|
||||
}}
|
||||
#viz-plot {{
|
||||
width: 100%;
|
||||
height: 100%;
|
||||
}}
|
||||
.viz-loading {{
|
||||
text-align: center;
|
||||
padding: 40px;
|
||||
color: #666;
|
||||
}}
|
||||
.viz-loading-overlay {{
|
||||
position: absolute;
|
||||
inset: 0;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
justify-content: center;
|
||||
background: white;
|
||||
color: #666;
|
||||
}}
|
||||
.viz-no-results {{
|
||||
text-align: center;
|
||||
padding: 40px;
|
||||
color: #666;
|
||||
font-style: italic;
|
||||
}}
|
||||
.viz-advanced-section {{
|
||||
margin-top: 16px;
|
||||
padding: 16px;
|
||||
background: #f8f9fa;
|
||||
border-radius: 4px;
|
||||
border: 1px solid #dee2e6;
|
||||
}}
|
||||
.viz-advanced-grid {{
|
||||
display: grid;
|
||||
grid-template-columns: 1fr 1fr;
|
||||
gap: 20px;
|
||||
}}
|
||||
.viz-info-box {{
|
||||
background: #e3f2fd;
|
||||
border-left: 4px solid #2196f3;
|
||||
padding: 12px;
|
||||
margin-bottom: 20px;
|
||||
font-size: 14px;
|
||||
}}
|
||||
</style>
|
||||
|
||||
<div x-data="vizApp()">
|
||||
<div class="viz-card">
|
||||
<h2>Vector Visualization</h2>
|
||||
<div class="viz-info-box">
|
||||
Testing search algorithms on your indexed documents. User: <strong>{username}</strong>
|
||||
</div>
|
||||
|
||||
<form @submit.prevent="executeSearch">
|
||||
<div class="viz-controls">
|
||||
<!-- Main Controls -->
|
||||
<div class="viz-control-group">
|
||||
<label>Search Query</label>
|
||||
<input type="text" x-model="query" placeholder="Enter search query..." required />
|
||||
</div>
|
||||
|
||||
<div class="viz-control-row">
|
||||
<div class="viz-control-group" style="margin-bottom: 0;">
|
||||
<label>Algorithm</label>
|
||||
<select x-model="algorithm">
|
||||
<option value="semantic">Semantic (Dense Vectors)</option>
|
||||
<option value="bm25_hybrid" selected>BM25 Hybrid (Dense + Sparse RRF)</option>
|
||||
</select>
|
||||
</div>
|
||||
|
||||
<div style="display: flex; align-items: flex-end;">
|
||||
<button type="submit" class="viz-btn" style="width: 100%;">Search & Visualize</button>
|
||||
</div>
|
||||
|
||||
<div style="display: flex; align-items: flex-end;">
|
||||
<button type="button" class="viz-btn-secondary" @click="showAdvanced = !showAdvanced" style="white-space: nowrap;">
|
||||
<span x-text="showAdvanced ? 'Hide Advanced' : 'Advanced'"></span>
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Advanced Options (Collapsible) -->
|
||||
<div class="viz-advanced-section" x-show="showAdvanced" x-transition.opacity.duration.200ms>
|
||||
<h3 style="margin-top: 0; margin-bottom: 16px; font-size: 16px;">Advanced Options</h3>
|
||||
|
||||
<div class="viz-advanced-grid">
|
||||
<div class="viz-control-group">
|
||||
<label>Document Types</label>
|
||||
<select x-model="docTypes" multiple>
|
||||
<option value="">All Types (cross-app search)</option>
|
||||
<option value="note">Notes</option>
|
||||
<option value="file">Files</option>
|
||||
<option value="calendar">Calendar Events</option>
|
||||
<option value="contact">Contacts</option>
|
||||
<option value="deck">Deck Cards</option>
|
||||
</select>
|
||||
<small style="color: #666; display: block; margin-top: 4px;">
|
||||
Hold Ctrl/Cmd to select multiple
|
||||
</small>
|
||||
</div>
|
||||
|
||||
<div>
|
||||
<div class="viz-control-group">
|
||||
<label>Score Threshold (Semantic/Hybrid)</label>
|
||||
<input type="number" x-model.number="scoreThreshold" min="0" max="1" step="0.1" />
|
||||
</div>
|
||||
|
||||
<div class="viz-control-group">
|
||||
<label>Result Limit</label>
|
||||
<input type="number" x-model.number="limit" min="1" max="100" />
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Info: BM25 Hybrid uses native RRF fusion (no manual weights) -->
|
||||
<div x-show="algorithm === 'bm25_hybrid'" style="margin-top: 16px; padding: 12px; background: #e9ecef; border-radius: 4px;">
|
||||
<p style="margin: 0; font-size: 14px; color: #666;">
|
||||
<strong>BM25 Hybrid Search:</strong> Uses Qdrant's native Reciprocal Rank Fusion (RRF)
|
||||
to automatically combine dense semantic vectors with sparse BM25 keyword vectors.
|
||||
No manual weight tuning required.
|
||||
</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</form>
|
||||
</div>
|
||||
|
||||
<div class="viz-card">
|
||||
<div id="viz-plot-container">
|
||||
<div x-show="loading" class="viz-loading-overlay" x-transition.opacity.duration.200ms>
|
||||
Executing search and computing PCA projection...
|
||||
</div>
|
||||
<div id="viz-plot" x-show="!loading" x-transition.opacity.duration.200ms></div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="viz-card">
|
||||
<h3>Search Results (<span x-text="loading ? '...' : results.length"></span>)</h3>
|
||||
|
||||
<div x-show="loading" class="viz-loading" x-transition.opacity.duration.200ms>
|
||||
Loading results...
|
||||
</div>
|
||||
|
||||
<div x-show="!loading && results.length === 0" class="viz-no-results" x-transition.opacity.duration.200ms>
|
||||
No results found. Try a different query or adjust your search parameters.
|
||||
</div>
|
||||
|
||||
<template x-if="!loading && results.length > 0">
|
||||
<div x-transition.opacity.duration.200ms>
|
||||
<template x-for="result in results" :key="result.id">
|
||||
<div style="padding: 12px; border-bottom: 1px solid #eee;">
|
||||
<a :href="getNextcloudUrl(result)" target="_blank" style="font-weight: 500; color: #0066cc; text-decoration: none;">
|
||||
<span x-text="result.title"></span>
|
||||
</a>
|
||||
<div style="font-size: 14px; color: #666; margin-top: 4px;" x-text="result.excerpt"></div>
|
||||
<div style="font-size: 12px; color: #999; margin-top: 4px;">
|
||||
Score: <span x-text="result.score.toFixed(3)"></span> |
|
||||
Type: <span x-text="result.doc_type"></span>
|
||||
</div>
|
||||
</div>
|
||||
</template>
|
||||
</div>
|
||||
</template>
|
||||
</div>
|
||||
</div>
|
||||
"""
|
||||
|
||||
# Load and render template
|
||||
template = _jinja_env.get_template("vector_viz.html")
|
||||
html_content = template.render(username=username)
|
||||
return HTMLResponse(content=html_content)
|
||||
|
||||
|
||||
@@ -352,6 +115,7 @@ async def vector_visualization_search(request: Request) -> JSONResponse:
|
||||
algorithm = request.query_params.get("algorithm", "bm25_hybrid")
|
||||
limit = int(request.query_params.get("limit", "50"))
|
||||
score_threshold = float(request.query_params.get("score_threshold", "0.0"))
|
||||
fusion = request.query_params.get("fusion", "rrf") # Default to RRF
|
||||
|
||||
# Parse doc_types (comma-separated list, None = all types)
|
||||
doc_types_param = request.query_params.get("doc_types", "")
|
||||
@@ -359,7 +123,7 @@ async def vector_visualization_search(request: Request) -> JSONResponse:
|
||||
|
||||
logger.info(
|
||||
f"Viz search: user={username}, query='{query}', "
|
||||
f"algorithm={algorithm}, limit={limit}, doc_types={doc_types}"
|
||||
f"algorithm={algorithm}, fusion={fusion}, limit={limit}, doc_types={doc_types}"
|
||||
)
|
||||
|
||||
try:
|
||||
@@ -377,7 +141,9 @@ async def vector_visualization_search(request: Request) -> JSONResponse:
|
||||
if algorithm == "semantic":
|
||||
search_algo = SemanticSearchAlgorithm(score_threshold=score_threshold)
|
||||
elif algorithm == "bm25_hybrid":
|
||||
search_algo = BM25HybridSearchAlgorithm(score_threshold=score_threshold)
|
||||
search_algo = BM25HybridSearchAlgorithm(
|
||||
score_threshold=score_threshold, fusion=fusion
|
||||
)
|
||||
else:
|
||||
return JSONResponse(
|
||||
{"success": False, "error": f"Unknown algorithm: {algorithm}"},
|
||||
@@ -418,7 +184,7 @@ async def vector_visualization_search(request: Request) -> JSONResponse:
|
||||
search_results = all_results[:limit]
|
||||
search_duration = time.perf_counter() - search_start
|
||||
|
||||
# Normalize scores relative to this result set for better visualization
|
||||
# Store original scores and normalize for visualization
|
||||
# (best result = 1.0, worst result = 0.0 within THIS result set)
|
||||
# This makes visual encoding meaningful regardless of RRF normalization
|
||||
if search_results:
|
||||
@@ -431,8 +197,11 @@ async def vector_visualization_search(request: Request) -> JSONResponse:
|
||||
f"→ [0.0, 1.0]"
|
||||
)
|
||||
|
||||
# Rescale each result's score to 0-1 within this result set
|
||||
# Store original score and rescale to 0-1 for visualization
|
||||
for r in search_results:
|
||||
# Store original score before normalization
|
||||
r.original_score = r.score
|
||||
# Rescale for visual encoding
|
||||
r.score = (r.score - min_score) / score_range
|
||||
|
||||
if not search_results:
|
||||
@@ -551,7 +320,12 @@ async def vector_visualization_search(request: Request) -> JSONResponse:
|
||||
"doc_type": r.doc_type,
|
||||
"title": r.title,
|
||||
"excerpt": r.excerpt,
|
||||
"score": r.score,
|
||||
"score": r.score, # Normalized score for visual encoding (0-1)
|
||||
"original_score": getattr(
|
||||
r, "original_score", r.score
|
||||
), # Raw score from algorithm
|
||||
"chunk_start_offset": r.chunk_start_offset,
|
||||
"chunk_end_offset": r.chunk_end_offset,
|
||||
}
|
||||
for r in search_results
|
||||
]
|
||||
@@ -594,3 +368,125 @@ async def vector_visualization_search(request: Request) -> JSONResponse:
|
||||
{"success": False, "error": str(e)},
|
||||
status_code=500,
|
||||
)
|
||||
|
||||
|
||||
@requires("authenticated", redirect="oauth_login")
|
||||
async def chunk_context_endpoint(request: Request) -> JSONResponse:
|
||||
"""Fetch chunk text with surrounding context for visualization.
|
||||
|
||||
This endpoint retrieves the matched chunk along with surrounding text
|
||||
to provide context for the search result. Used by the viz pane to
|
||||
display chunks inline.
|
||||
|
||||
Query parameters:
|
||||
doc_type: Document type (e.g., "note")
|
||||
doc_id: Document ID
|
||||
start: Chunk start offset (character position)
|
||||
end: Chunk end offset (character position)
|
||||
context: Characters of context before/after (default: 500)
|
||||
|
||||
Returns:
|
||||
JSON with chunk_text, before_context, after_context, and flags
|
||||
"""
|
||||
try:
|
||||
# Get query parameters
|
||||
doc_type = request.query_params.get("doc_type")
|
||||
doc_id = request.query_params.get("doc_id")
|
||||
start_str = request.query_params.get("start")
|
||||
end_str = request.query_params.get("end")
|
||||
context_chars = int(request.query_params.get("context", "500"))
|
||||
|
||||
# Validate required parameters
|
||||
if not all([doc_type, doc_id, start_str, end_str]):
|
||||
return JSONResponse(
|
||||
{
|
||||
"success": False,
|
||||
"error": "Missing required parameters: doc_type, doc_id, start, end",
|
||||
},
|
||||
status_code=400,
|
||||
)
|
||||
|
||||
start = int(start_str)
|
||||
end = int(end_str)
|
||||
|
||||
# Currently only support notes
|
||||
if doc_type != "note":
|
||||
return JSONResponse(
|
||||
{"success": False, "error": f"Unsupported doc_type: {doc_type}"},
|
||||
status_code=400,
|
||||
)
|
||||
|
||||
# Get authenticated HTTP client and fetch note
|
||||
from nextcloud_mcp_server.auth.userinfo_routes import (
|
||||
_get_authenticated_client_for_userinfo,
|
||||
)
|
||||
from nextcloud_mcp_server.client.notes import NotesClient
|
||||
|
||||
# Get username from request auth
|
||||
username = (
|
||||
request.user.display_name
|
||||
if hasattr(request.user, "display_name")
|
||||
else "unknown"
|
||||
)
|
||||
|
||||
# Create notes client with authenticated HTTP client
|
||||
http_client = await _get_authenticated_client_for_userinfo(request)
|
||||
notes_client = NotesClient(http_client, username)
|
||||
|
||||
# Fetch full note content
|
||||
note = await notes_client.get_note(int(doc_id))
|
||||
full_content = f"{note['title']}\n\n{note['content']}"
|
||||
|
||||
# Validate offsets
|
||||
if start < 0 or end > len(full_content) or start >= end:
|
||||
return JSONResponse(
|
||||
{
|
||||
"success": False,
|
||||
"error": f"Invalid offsets: start={start}, end={end}, content_length={len(full_content)}",
|
||||
},
|
||||
status_code=400,
|
||||
)
|
||||
|
||||
# Extract chunk
|
||||
chunk_text = full_content[start:end]
|
||||
|
||||
# Extract context before and after
|
||||
before_start = max(0, start - context_chars)
|
||||
before_context = full_content[before_start:start]
|
||||
|
||||
after_end = min(len(full_content), end + context_chars)
|
||||
after_context = full_content[end:after_end]
|
||||
|
||||
# Determine if there's more content
|
||||
has_more_before = before_start > 0
|
||||
has_more_after = after_end < len(full_content)
|
||||
|
||||
logger.info(
|
||||
f"Fetched chunk context for {doc_type}_{doc_id}: "
|
||||
f"chunk_len={len(chunk_text)}, before_len={len(before_context)}, "
|
||||
f"after_len={len(after_context)}"
|
||||
)
|
||||
|
||||
return JSONResponse(
|
||||
{
|
||||
"success": True,
|
||||
"chunk_text": chunk_text,
|
||||
"before_context": before_context,
|
||||
"after_context": after_context,
|
||||
"has_more_before": has_more_before,
|
||||
"has_more_after": has_more_after,
|
||||
}
|
||||
)
|
||||
|
||||
except ValueError as e:
|
||||
logger.error(f"Invalid parameter format: {e}")
|
||||
return JSONResponse(
|
||||
{"success": False, "error": f"Invalid parameter format: {e}"},
|
||||
status_code=400,
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"Chunk context error: {e}", exc_info=True)
|
||||
return JSONResponse(
|
||||
{"success": False, "error": str(e)},
|
||||
status_code=500,
|
||||
)
|
||||
|
||||
@@ -19,9 +19,22 @@ class SemanticSearchResult(BaseModel):
|
||||
default="", description="Document category (notes) or location (calendar)"
|
||||
)
|
||||
excerpt: str = Field(description="Excerpt from matching chunk")
|
||||
score: float = Field(description="Semantic similarity score (0-1)")
|
||||
score: float = Field(
|
||||
description=(
|
||||
"Relevance score (≥ 0.0, higher is better). "
|
||||
"Score range depends on fusion method: "
|
||||
"RRF produces scores in [0.0, 1.0], "
|
||||
"DBSF can exceed 1.0 (sum of normalized scores from multiple systems)"
|
||||
)
|
||||
)
|
||||
chunk_index: int = Field(description="Index of matching chunk in document")
|
||||
total_chunks: int = Field(description="Total number of chunks in document")
|
||||
chunk_start_offset: Optional[int] = Field(
|
||||
default=None, description="Character position where chunk starts in document"
|
||||
)
|
||||
chunk_end_offset: Optional[int] = Field(
|
||||
default=None, description="Character position where chunk ends in document"
|
||||
)
|
||||
|
||||
|
||||
class SemanticSearchResponse(BaseResponse):
|
||||
|
||||
@@ -127,8 +127,12 @@ class SearchResult:
|
||||
doc_type: Document type (note, file, calendar, contact, etc.)
|
||||
title: Document title
|
||||
excerpt: Content excerpt showing match context
|
||||
score: Relevance score (0.0-1.0, higher is better)
|
||||
score: Relevance score (≥ 0.0, higher is better)
|
||||
- RRF fusion: scores in [0.0, 1.0]
|
||||
- DBSF fusion: scores can exceed 1.0 (sum of normalized scores)
|
||||
metadata: Additional algorithm-specific metadata
|
||||
chunk_start_offset: Character position where chunk starts (None if not available)
|
||||
chunk_end_offset: Character position where chunk ends (None if not available)
|
||||
"""
|
||||
|
||||
id: int
|
||||
@@ -137,11 +141,20 @@ class SearchResult:
|
||||
excerpt: str
|
||||
score: float
|
||||
metadata: dict[str, Any] | None = None
|
||||
chunk_start_offset: int | None = None
|
||||
chunk_end_offset: int | None = None
|
||||
|
||||
def __post_init__(self):
|
||||
"""Validate score is in valid range."""
|
||||
if not 0.0 <= self.score <= 1.0:
|
||||
raise ValueError(f"Score must be between 0.0 and 1.0, got {self.score}")
|
||||
"""Validate score is non-negative.
|
||||
|
||||
Note: Different fusion methods produce different score ranges:
|
||||
- RRF (Reciprocal Rank Fusion): Bounded to [0.0, 1.0]
|
||||
- DBSF (Distribution-Based Score Fusion): Unbounded (can exceed 1.0)
|
||||
DBSF sums normalized scores from multiple systems, so scores can be
|
||||
1.5, 2.0, etc. when multiple systems agree a document is highly relevant.
|
||||
"""
|
||||
if self.score < 0.0:
|
||||
raise ValueError(f"Score must be non-negative, got {self.score}")
|
||||
|
||||
|
||||
class SearchAlgorithm(ABC):
|
||||
|
||||
@@ -28,15 +28,27 @@ class BM25HybridSearchAlgorithm(SearchAlgorithm):
|
||||
eliminating the need for application-layer result merging.
|
||||
"""
|
||||
|
||||
def __init__(self, score_threshold: float = 0.0):
|
||||
def __init__(self, score_threshold: float = 0.0, fusion: str = "rrf"):
|
||||
"""
|
||||
Initialize BM25 hybrid search algorithm.
|
||||
|
||||
Args:
|
||||
score_threshold: Minimum RRF score (0-1, default: 0.0 to allow RRF scoring)
|
||||
Note: RRF produces normalized scores, so threshold is typically lower
|
||||
score_threshold: Minimum fusion score (0-1, default: 0.0 to allow fusion scoring)
|
||||
Note: Both RRF and DBSF produce normalized scores
|
||||
fusion: Fusion algorithm to use: "rrf" (Reciprocal Rank Fusion, default)
|
||||
or "dbsf" (Distribution-Based Score Fusion)
|
||||
|
||||
Raises:
|
||||
ValueError: If fusion is not "rrf" or "dbsf"
|
||||
"""
|
||||
if fusion not in ("rrf", "dbsf"):
|
||||
raise ValueError(
|
||||
f"Invalid fusion algorithm '{fusion}'. Must be 'rrf' or 'dbsf'"
|
||||
)
|
||||
|
||||
self.score_threshold = score_threshold
|
||||
self.fusion = models.Fusion.RRF if fusion == "rrf" else models.Fusion.DBSF
|
||||
self.fusion_name = fusion
|
||||
|
||||
@property
|
||||
def name(self) -> str:
|
||||
@@ -78,7 +90,8 @@ class BM25HybridSearchAlgorithm(SearchAlgorithm):
|
||||
|
||||
logger.info(
|
||||
f"BM25 hybrid search: query='{query}', user={user_id}, "
|
||||
f"limit={limit}, score_threshold={score_threshold}, doc_type={doc_type}"
|
||||
f"limit={limit}, score_threshold={score_threshold}, doc_type={doc_type}, "
|
||||
f"fusion={self.fusion_name}"
|
||||
)
|
||||
|
||||
# Generate dense embedding for semantic search
|
||||
@@ -139,8 +152,8 @@ class BM25HybridSearchAlgorithm(SearchAlgorithm):
|
||||
filter=query_filter,
|
||||
),
|
||||
],
|
||||
# RRF fusion query (no additional query needed, just fusion)
|
||||
query=models.FusionQuery(fusion=models.Fusion.RRF),
|
||||
# Fusion query (RRF or DBSF based on initialization)
|
||||
query=models.FusionQuery(fusion=self.fusion),
|
||||
limit=limit * 2, # Get extra for deduplication
|
||||
score_threshold=score_threshold,
|
||||
with_payload=True,
|
||||
@@ -152,14 +165,16 @@ class BM25HybridSearchAlgorithm(SearchAlgorithm):
|
||||
raise
|
||||
|
||||
logger.info(
|
||||
f"Qdrant RRF fusion returned {len(search_response.points)} results "
|
||||
f"Qdrant {self.fusion_name.upper()} fusion returned {len(search_response.points)} results "
|
||||
f"(before deduplication)"
|
||||
)
|
||||
|
||||
if search_response.points:
|
||||
# Log top 3 RRF scores to help with threshold tuning
|
||||
# Log top 3 fusion scores to help with threshold tuning
|
||||
top_scores = [p.score for p in search_response.points[:3]]
|
||||
logger.debug(f"Top 3 RRF fusion scores: {top_scores}")
|
||||
logger.debug(
|
||||
f"Top 3 {self.fusion_name.upper()} fusion scores: {top_scores}"
|
||||
)
|
||||
|
||||
# Deduplicate by (doc_id, doc_type) - multiple chunks per document
|
||||
seen_docs = set()
|
||||
@@ -183,12 +198,14 @@ class BM25HybridSearchAlgorithm(SearchAlgorithm):
|
||||
doc_type=doc_type,
|
||||
title=result.payload.get("title", "Untitled"),
|
||||
excerpt=result.payload.get("excerpt", ""),
|
||||
score=result.score, # RRF fusion score
|
||||
score=result.score, # Fusion score (RRF or DBSF)
|
||||
metadata={
|
||||
"chunk_index": result.payload.get("chunk_index"),
|
||||
"total_chunks": result.payload.get("total_chunks"),
|
||||
"search_method": "bm25_hybrid_rrf",
|
||||
"search_method": f"bm25_hybrid_{self.fusion_name}",
|
||||
},
|
||||
chunk_start_offset=result.payload.get("chunk_start_offset"),
|
||||
chunk_end_offset=result.payload.get("chunk_end_offset"),
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
@@ -150,6 +150,8 @@ class SemanticSearchAlgorithm(SearchAlgorithm):
|
||||
"chunk_index": result.payload.get("chunk_index"),
|
||||
"total_chunks": result.payload.get("total_chunks"),
|
||||
},
|
||||
chunk_start_offset=result.payload.get("chunk_start_offset"),
|
||||
chunk_end_offset=result.payload.get("chunk_end_offset"),
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
@@ -42,6 +42,7 @@ def configure_semantic_tools(mcp: FastMCP):
|
||||
limit: int = 10,
|
||||
doc_types: list[str] | None = None,
|
||||
score_threshold: float = 0.0,
|
||||
fusion: str = "rrf",
|
||||
) -> SemanticSearchResponse:
|
||||
"""
|
||||
Search Nextcloud content using BM25 hybrid search with cross-app support.
|
||||
@@ -50,7 +51,7 @@ def configure_semantic_tools(mcp: FastMCP):
|
||||
- Dense semantic vectors: For conceptual similarity and natural language queries
|
||||
- BM25 sparse vectors: For precise keyword matching, acronyms, and specific terms
|
||||
|
||||
Results are automatically fused using Reciprocal Rank Fusion (RRF) in the
|
||||
Results are automatically fused using the selected fusion algorithm in the
|
||||
database for optimal relevance. This provides the best of both semantic
|
||||
understanding and keyword precision.
|
||||
|
||||
@@ -61,10 +62,13 @@ def configure_semantic_tools(mcp: FastMCP):
|
||||
query: Natural language or keyword search query
|
||||
limit: Maximum number of results to return (default: 10)
|
||||
doc_types: Document types to search (e.g., ["note", "file"]). None = search all indexed types (default)
|
||||
score_threshold: Minimum RRF fusion score (0-1, default: 0.0 for RRF scoring)
|
||||
score_threshold: Minimum fusion score (0-1, default: 0.0)
|
||||
fusion: Fusion algorithm: "rrf" (Reciprocal Rank Fusion, default) or "dbsf" (Distribution-Based Score Fusion)
|
||||
RRF: Good general-purpose fusion using reciprocal ranks
|
||||
DBSF: Uses distribution-based normalization, may better balance different score ranges
|
||||
|
||||
Returns:
|
||||
SemanticSearchResponse with matching documents ranked by RRF fusion scores
|
||||
SemanticSearchResponse with matching documents ranked by fusion scores
|
||||
"""
|
||||
from nextcloud_mcp_server.config import get_settings
|
||||
|
||||
@@ -74,7 +78,7 @@ def configure_semantic_tools(mcp: FastMCP):
|
||||
|
||||
logger.info(
|
||||
f"BM25 hybrid search: query='{query}', user={username}, "
|
||||
f"limit={limit}, score_threshold={score_threshold}"
|
||||
f"limit={limit}, score_threshold={score_threshold}, fusion={fusion}"
|
||||
)
|
||||
|
||||
# Check that vector sync is enabled
|
||||
@@ -87,8 +91,10 @@ def configure_semantic_tools(mcp: FastMCP):
|
||||
)
|
||||
|
||||
try:
|
||||
# Create BM25 hybrid search algorithm
|
||||
search_algo = BM25HybridSearchAlgorithm(score_threshold=score_threshold)
|
||||
# Create BM25 hybrid search algorithm with specified fusion
|
||||
search_algo = BM25HybridSearchAlgorithm(
|
||||
score_threshold=score_threshold, fusion=fusion
|
||||
)
|
||||
|
||||
# Execute search across requested document types
|
||||
# If doc_types is None, search all indexed types (cross-app search)
|
||||
@@ -152,6 +158,8 @@ def configure_semantic_tools(mcp: FastMCP):
|
||||
total_chunks=r.metadata.get("total_chunks", 1)
|
||||
if r.metadata
|
||||
else 1,
|
||||
chunk_start_offset=r.chunk_start_offset,
|
||||
chunk_end_offset=r.chunk_end_offset,
|
||||
)
|
||||
)
|
||||
|
||||
@@ -161,7 +169,7 @@ def configure_semantic_tools(mcp: FastMCP):
|
||||
results=results,
|
||||
query=query,
|
||||
total_found=len(results),
|
||||
search_method="bm25_hybrid",
|
||||
search_method=f"bm25_hybrid_{fusion}",
|
||||
)
|
||||
|
||||
except ValueError as e:
|
||||
@@ -193,6 +201,7 @@ def configure_semantic_tools(mcp: FastMCP):
|
||||
limit: int = 5,
|
||||
score_threshold: float = 0.7,
|
||||
max_answer_tokens: int = 500,
|
||||
fusion: str = "rrf",
|
||||
) -> SamplingSearchResponse:
|
||||
"""
|
||||
Semantic search with LLM-generated answer using MCP sampling.
|
||||
@@ -217,6 +226,7 @@ def configure_semantic_tools(mcp: FastMCP):
|
||||
limit: Maximum number of documents to retrieve (default: 5)
|
||||
score_threshold: Minimum similarity score 0-1 (default: 0.7)
|
||||
max_answer_tokens: Maximum tokens for generated answer (default: 500)
|
||||
fusion: Fusion algorithm: "rrf" (Reciprocal Rank Fusion, default) or "dbsf" (Distribution-Based Score Fusion)
|
||||
|
||||
Returns:
|
||||
SamplingSearchResponse containing:
|
||||
@@ -256,6 +266,7 @@ def configure_semantic_tools(mcp: FastMCP):
|
||||
ctx=ctx,
|
||||
limit=limit,
|
||||
score_threshold=score_threshold,
|
||||
fusion=fusion,
|
||||
)
|
||||
|
||||
# 2. Handle no results case - don't waste a sampling call
|
||||
|
||||
@@ -1,10 +1,21 @@
|
||||
"""Document chunking for large texts."""
|
||||
|
||||
import logging
|
||||
import re
|
||||
from dataclasses import dataclass
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
@dataclass
|
||||
class ChunkWithPosition:
|
||||
"""A text chunk with its character position in the original document."""
|
||||
|
||||
text: str
|
||||
start_offset: int # Character position where chunk starts
|
||||
end_offset: int # Character position where chunk ends (exclusive)
|
||||
|
||||
|
||||
class DocumentChunker:
|
||||
"""Chunk large documents for optimal embedding."""
|
||||
|
||||
@@ -19,33 +30,66 @@ class DocumentChunker:
|
||||
self.chunk_size = chunk_size
|
||||
self.overlap = overlap
|
||||
|
||||
def chunk_text(self, content: str) -> list[str]:
|
||||
def chunk_text(self, content: str) -> list[ChunkWithPosition]:
|
||||
"""
|
||||
Split text into overlapping chunks.
|
||||
Split text into overlapping chunks with position tracking.
|
||||
|
||||
Uses simple word-based chunking with configurable overlap to preserve
|
||||
context across chunk boundaries.
|
||||
context across chunk boundaries. Tracks character positions for each chunk.
|
||||
|
||||
Args:
|
||||
content: Text content to chunk
|
||||
|
||||
Returns:
|
||||
List of text chunks (may be single item if content is small)
|
||||
List of chunks with their character positions in the original content
|
||||
"""
|
||||
# Simple word-based chunking
|
||||
words = content.split()
|
||||
# Use regex to find all words and their positions
|
||||
# This preserves the original spacing and allows accurate position tracking
|
||||
word_pattern = re.compile(r"\S+")
|
||||
word_matches = list(word_pattern.finditer(content))
|
||||
|
||||
if len(words) <= self.chunk_size:
|
||||
return [content]
|
||||
if len(word_matches) <= self.chunk_size:
|
||||
# Single chunk - use entire content
|
||||
return [
|
||||
ChunkWithPosition(text=content, start_offset=0, end_offset=len(content))
|
||||
]
|
||||
|
||||
chunks = []
|
||||
start = 0
|
||||
start_idx = 0
|
||||
|
||||
while start < len(words):
|
||||
end = start + self.chunk_size
|
||||
chunk_words = words[start:end]
|
||||
chunks.append(" ".join(chunk_words))
|
||||
start = end - self.overlap
|
||||
while start_idx < len(word_matches):
|
||||
end_idx = min(start_idx + self.chunk_size, len(word_matches))
|
||||
|
||||
logger.debug(f"Chunked document into {len(chunks)} chunks ({len(words)} words)")
|
||||
# Get the first and last word positions
|
||||
first_word = word_matches[start_idx]
|
||||
last_word = word_matches[end_idx - 1]
|
||||
|
||||
# Extract chunk using character positions
|
||||
start_offset = first_word.start()
|
||||
end_offset = last_word.end()
|
||||
chunk_text = content[start_offset:end_offset]
|
||||
|
||||
chunks.append(
|
||||
ChunkWithPosition(
|
||||
text=chunk_text, start_offset=start_offset, end_offset=end_offset
|
||||
)
|
||||
)
|
||||
|
||||
# If we've reached the end, break
|
||||
if end_idx >= len(word_matches):
|
||||
break
|
||||
|
||||
# Move to next chunk with overlap
|
||||
next_start_idx = end_idx - self.overlap
|
||||
|
||||
# Safety check: ensure we're making forward progress
|
||||
# If we're not advancing (overlap >= chunk processed), break to prevent infinite loop
|
||||
if next_start_idx <= start_idx:
|
||||
break
|
||||
|
||||
start_idx = next_start_idx
|
||||
|
||||
logger.debug(
|
||||
f"Chunked document into {len(chunks)} chunks ({len(word_matches)} words)"
|
||||
)
|
||||
return chunks
|
||||
|
||||
@@ -233,13 +233,16 @@ async def _index_document(
|
||||
)
|
||||
chunks = chunker.chunk_text(content)
|
||||
|
||||
# Extract chunk texts for embedding
|
||||
chunk_texts = [chunk.text for chunk in chunks]
|
||||
|
||||
# Generate dense embeddings (I/O bound - external API call)
|
||||
embedding_service = get_embedding_service()
|
||||
dense_embeddings = await embedding_service.embed_batch(chunks)
|
||||
dense_embeddings = await embedding_service.embed_batch(chunk_texts)
|
||||
|
||||
# Generate sparse embeddings (BM25 for keyword matching)
|
||||
bm25_service = get_bm25_service()
|
||||
sparse_embeddings = bm25_service.encode_batch(chunks)
|
||||
sparse_embeddings = bm25_service.encode_batch(chunk_texts)
|
||||
|
||||
# Prepare Qdrant points
|
||||
indexed_at = int(time.time())
|
||||
@@ -265,12 +268,15 @@ async def _index_document(
|
||||
"doc_id": doc_task.doc_id,
|
||||
"doc_type": doc_task.doc_type,
|
||||
"title": title,
|
||||
"excerpt": chunk[:200],
|
||||
"excerpt": chunk.text[:200],
|
||||
"indexed_at": indexed_at,
|
||||
"modified_at": doc_task.modified_at,
|
||||
"etag": etag,
|
||||
"chunk_index": i,
|
||||
"total_chunks": len(chunks),
|
||||
"chunk_start_offset": chunk.start_offset,
|
||||
"chunk_end_offset": chunk.end_offset,
|
||||
"metadata_version": 2, # v2 includes position metadata
|
||||
},
|
||||
)
|
||||
)
|
||||
|
||||
+5
-4
@@ -12,7 +12,7 @@ keywords = ["nextcloud", "mcp", "model-context-protocol", "llm", "ai", "claude",
|
||||
dependencies = [
|
||||
"mcp[cli] (>=1.21,<1.22)",
|
||||
"httpx (>=0.28.1,<0.29.0)",
|
||||
"pillow (>=10.3.0,<12.0.0)", # Compatible with fastembed
|
||||
"pillow (>=10.3.0,<12.0.0)", # Compatible with fastembed
|
||||
"icalendar (>=6.0.0,<7.0.0)",
|
||||
"pythonvcard4>=0.2.0",
|
||||
"pydantic>=2.11.4",
|
||||
@@ -22,7 +22,9 @@ dependencies = [
|
||||
"aiosqlite>=0.20.0", # Async SQLite for refresh token storage
|
||||
"authlib>=1.6.5",
|
||||
"qdrant-client>=1.7.0",
|
||||
"fastembed>=0.4.2", # BM25 sparse vector embeddings for hybrid search
|
||||
"fastembed>=0.4.2", # BM25 sparse vector embeddings for hybrid search
|
||||
"anthropic>=0.42.0", # For RAG evaluation with Anthropic LLMs
|
||||
"boto3>=1.35.0", # For Amazon Bedrock provider (optional)
|
||||
# Observability dependencies
|
||||
"prometheus-client>=0.21.0", # Prometheus metrics
|
||||
"opentelemetry-api>=1.28.2", # OpenTelemetry API
|
||||
@@ -32,6 +34,7 @@ dependencies = [
|
||||
"opentelemetry-instrumentation-logging>=0.49b2", # Logging integration
|
||||
"opentelemetry-exporter-otlp-proto-grpc>=1.28.2", # OTLP gRPC exporter
|
||||
"python-json-logger>=3.2.0", # Structured JSON logging
|
||||
"jinja2>=3.1.6",
|
||||
]
|
||||
classifiers = [
|
||||
"Development Status :: 4 - Beta",
|
||||
@@ -103,8 +106,6 @@ module-root = ""
|
||||
|
||||
[dependency-groups]
|
||||
dev = [
|
||||
"anthropic>=0.42.0", # For RAG evaluation with Anthropic LLMs
|
||||
"boto3>=1.35.0", # For Amazon Bedrock provider (optional)
|
||||
"commitizen>=4.8.2",
|
||||
"datasets>=3.3.0", # For BeIR nfcorpus dataset loading
|
||||
"ipython>=9.2.0",
|
||||
|
||||
@@ -0,0 +1 @@
|
||||
"""Unit tests for search algorithms."""
|
||||
@@ -0,0 +1,54 @@
|
||||
"""Unit tests for BM25 hybrid search algorithm."""
|
||||
|
||||
import pytest
|
||||
from qdrant_client import models
|
||||
|
||||
from nextcloud_mcp_server.search.bm25_hybrid import BM25HybridSearchAlgorithm
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_bm25_hybrid_initialization_default():
|
||||
"""Test BM25HybridSearchAlgorithm initializes with default RRF fusion."""
|
||||
algo = BM25HybridSearchAlgorithm()
|
||||
|
||||
assert algo.score_threshold == 0.0
|
||||
assert algo.fusion == models.Fusion.RRF
|
||||
assert algo.fusion_name == "rrf"
|
||||
assert algo.name == "bm25_hybrid"
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_bm25_hybrid_initialization_with_rrf():
|
||||
"""Test BM25HybridSearchAlgorithm initializes with explicit RRF fusion."""
|
||||
algo = BM25HybridSearchAlgorithm(score_threshold=0.5, fusion="rrf")
|
||||
|
||||
assert algo.score_threshold == 0.5
|
||||
assert algo.fusion == models.Fusion.RRF
|
||||
assert algo.fusion_name == "rrf"
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_bm25_hybrid_initialization_with_dbsf():
|
||||
"""Test BM25HybridSearchAlgorithm initializes with DBSF fusion."""
|
||||
algo = BM25HybridSearchAlgorithm(score_threshold=0.7, fusion="dbsf")
|
||||
|
||||
assert algo.score_threshold == 0.7
|
||||
assert algo.fusion == models.Fusion.DBSF
|
||||
assert algo.fusion_name == "dbsf"
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_bm25_hybrid_invalid_fusion_raises_error():
|
||||
"""Test BM25HybridSearchAlgorithm raises ValueError for invalid fusion."""
|
||||
with pytest.raises(ValueError) as exc_info:
|
||||
BM25HybridSearchAlgorithm(fusion="invalid")
|
||||
|
||||
assert "Invalid fusion algorithm 'invalid'" in str(exc_info.value)
|
||||
assert "Must be 'rrf' or 'dbsf'" in str(exc_info.value)
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_bm25_hybrid_requires_vector_db():
|
||||
"""Test BM25HybridSearchAlgorithm reports it requires vector database."""
|
||||
algo = BM25HybridSearchAlgorithm()
|
||||
assert algo.requires_vector_db is True
|
||||
@@ -0,0 +1,135 @@
|
||||
"""Unit tests for SearchResult validation."""
|
||||
|
||||
import pytest
|
||||
|
||||
from nextcloud_mcp_server.search.algorithms import SearchResult
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_search_result_rrf_score_in_range():
|
||||
"""Test SearchResult accepts RRF scores in [0.0, 1.0] range."""
|
||||
result = SearchResult(
|
||||
id=1,
|
||||
doc_type="note",
|
||||
title="Test Note",
|
||||
excerpt="Test excerpt",
|
||||
score=0.85,
|
||||
)
|
||||
|
||||
assert result.score == 0.85
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_search_result_rrf_score_at_lower_bound():
|
||||
"""Test SearchResult accepts RRF score at lower bound (0.0)."""
|
||||
result = SearchResult(
|
||||
id=1,
|
||||
doc_type="note",
|
||||
title="Test Note",
|
||||
excerpt="Test excerpt",
|
||||
score=0.0,
|
||||
)
|
||||
|
||||
assert result.score == 0.0
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_search_result_rrf_score_at_upper_bound():
|
||||
"""Test SearchResult accepts RRF score at upper bound (1.0)."""
|
||||
result = SearchResult(
|
||||
id=1,
|
||||
doc_type="note",
|
||||
title="Test Note",
|
||||
excerpt="Test excerpt",
|
||||
score=1.0,
|
||||
)
|
||||
|
||||
assert result.score == 1.0
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_search_result_dbsf_score_above_one():
|
||||
"""Test SearchResult accepts DBSF scores > 1.0.
|
||||
|
||||
DBSF (Distribution-Based Score Fusion) sums normalized scores from multiple
|
||||
systems (dense semantic + sparse BM25), so scores can exceed 1.0 when both
|
||||
systems strongly agree a document is relevant.
|
||||
"""
|
||||
# Typical DBSF score when both systems agree
|
||||
result = SearchResult(
|
||||
id=1,
|
||||
doc_type="note",
|
||||
title="Highly Relevant Note",
|
||||
excerpt="Contains keywords and is semantically similar",
|
||||
score=1.55,
|
||||
)
|
||||
|
||||
assert result.score == 1.55
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_search_result_dbsf_score_edge_case():
|
||||
"""Test SearchResult accepts DBSF maximum theoretical score (2.0).
|
||||
|
||||
Maximum DBSF score with 2 systems: 1.0 (dense) + 1.0 (sparse) = 2.0
|
||||
"""
|
||||
result = SearchResult(
|
||||
id=1,
|
||||
doc_type="note",
|
||||
title="Perfect Match",
|
||||
excerpt="Perfect semantic and keyword match",
|
||||
score=2.0,
|
||||
)
|
||||
|
||||
assert result.score == 2.0
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_search_result_negative_score_raises_error():
|
||||
"""Test SearchResult rejects negative scores."""
|
||||
with pytest.raises(ValueError) as exc_info:
|
||||
SearchResult(
|
||||
id=1,
|
||||
doc_type="note",
|
||||
title="Test Note",
|
||||
excerpt="Test excerpt",
|
||||
score=-0.1,
|
||||
)
|
||||
|
||||
assert "Score must be non-negative" in str(exc_info.value)
|
||||
assert "got -0.1" in str(exc_info.value)
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_search_result_with_metadata():
|
||||
"""Test SearchResult with optional metadata field."""
|
||||
result = SearchResult(
|
||||
id=1,
|
||||
doc_type="note",
|
||||
title="Test Note",
|
||||
excerpt="Test excerpt",
|
||||
score=1.25,
|
||||
metadata={"fusion_method": "dbsf", "dense_score": 0.8, "sparse_score": 0.45},
|
||||
)
|
||||
|
||||
assert result.score == 1.25
|
||||
assert result.metadata["fusion_method"] == "dbsf"
|
||||
assert result.metadata["dense_score"] == 0.8
|
||||
assert result.metadata["sparse_score"] == 0.45
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_search_result_with_chunk_offsets():
|
||||
"""Test SearchResult with chunk offset information."""
|
||||
result = SearchResult(
|
||||
id=1,
|
||||
doc_type="note",
|
||||
title="Test Note",
|
||||
excerpt="matching chunk text",
|
||||
score=0.9,
|
||||
chunk_start_offset=100,
|
||||
chunk_end_offset=500,
|
||||
)
|
||||
|
||||
assert result.chunk_start_offset == 100
|
||||
assert result.chunk_end_offset == 500
|
||||
@@ -0,0 +1,190 @@
|
||||
"""Unit tests for DocumentChunker with position tracking."""
|
||||
|
||||
from nextcloud_mcp_server.vector.document_chunker import (
|
||||
ChunkWithPosition,
|
||||
DocumentChunker,
|
||||
)
|
||||
|
||||
|
||||
class TestDocumentChunkerPositions:
|
||||
"""Test suite for DocumentChunker position tracking functionality."""
|
||||
|
||||
def test_single_chunk_simple_text(self):
|
||||
"""Test that single-chunk documents return correct positions."""
|
||||
chunker = DocumentChunker(chunk_size=512, overlap=50)
|
||||
content = "This is a short document."
|
||||
|
||||
chunks = chunker.chunk_text(content)
|
||||
|
||||
assert len(chunks) == 1
|
||||
assert isinstance(chunks[0], ChunkWithPosition)
|
||||
assert chunks[0].text == content
|
||||
assert chunks[0].start_offset == 0
|
||||
assert chunks[0].end_offset == len(content)
|
||||
|
||||
def test_multiple_chunks_positions(self):
|
||||
"""Test that multi-chunk documents have correct positions."""
|
||||
chunker = DocumentChunker(chunk_size=10, overlap=2) # Small chunks for testing
|
||||
# Create content with exactly 30 words
|
||||
words = [f"word{i:02d}" for i in range(30)]
|
||||
content = " ".join(words)
|
||||
|
||||
chunks = chunker.chunk_text(content)
|
||||
|
||||
# Verify we got multiple chunks (30 words, 10 per chunk, 2 overlap = 4 chunks)
|
||||
assert len(chunks) == 4
|
||||
|
||||
# Verify all chunks are ChunkWithPosition
|
||||
for chunk in chunks:
|
||||
assert isinstance(chunk, ChunkWithPosition)
|
||||
|
||||
# Verify first chunk starts at 0
|
||||
assert chunks[0].start_offset == 0
|
||||
|
||||
# Verify last chunk ends at content length
|
||||
assert chunks[-1].end_offset == len(content)
|
||||
|
||||
# Verify chunks are contiguous or overlap (no gaps)
|
||||
for i in range(len(chunks) - 1):
|
||||
# Next chunk should start at or before current chunk ends
|
||||
assert chunks[i + 1].start_offset <= chunks[i].end_offset
|
||||
|
||||
# Verify we can reconstruct the content using positions
|
||||
for chunk in chunks:
|
||||
extracted = content[chunk.start_offset : chunk.end_offset]
|
||||
assert extracted == chunk.text
|
||||
|
||||
def test_chunk_positions_with_whitespace(self):
|
||||
"""Test position tracking with various whitespace."""
|
||||
chunker = DocumentChunker(chunk_size=5, overlap=1)
|
||||
content = "word1 word2\n\nword3\tword4 word5 word6"
|
||||
|
||||
chunks = chunker.chunk_text(content)
|
||||
|
||||
# Verify positions correctly handle whitespace
|
||||
for chunk in chunks:
|
||||
extracted = content[chunk.start_offset : chunk.end_offset]
|
||||
assert extracted == chunk.text
|
||||
# Verify no leading/trailing whitespace unless in original
|
||||
if chunk != chunks[0] and chunk != chunks[-1]:
|
||||
# Middle chunks should be extracted correctly
|
||||
assert len(chunk.text.strip()) > 0
|
||||
|
||||
def test_empty_content(self):
|
||||
"""Test that empty content returns empty chunk."""
|
||||
chunker = DocumentChunker(chunk_size=512, overlap=50)
|
||||
content = ""
|
||||
|
||||
chunks = chunker.chunk_text(content)
|
||||
|
||||
assert len(chunks) == 1
|
||||
assert chunks[0].text == ""
|
||||
assert chunks[0].start_offset == 0
|
||||
assert chunks[0].end_offset == 0
|
||||
|
||||
def test_chunk_overlap_positions(self):
|
||||
"""Test that overlapping chunks have correct positions."""
|
||||
chunker = DocumentChunker(chunk_size=10, overlap=3)
|
||||
words = [f"word{i:02d}" for i in range(25)]
|
||||
content = " ".join(words)
|
||||
|
||||
chunks = chunker.chunk_text(content)
|
||||
|
||||
# Verify overlap exists
|
||||
for i in range(len(chunks) - 1):
|
||||
current_chunk = chunks[i]
|
||||
next_chunk = chunks[i + 1]
|
||||
|
||||
# Next chunk should start before current ends (overlap)
|
||||
# This happens because we move back by overlap words
|
||||
# The actual character overlap depends on word lengths
|
||||
assert next_chunk.start_offset >= 0
|
||||
assert current_chunk.end_offset <= len(content)
|
||||
|
||||
def test_unicode_content_positions(self):
|
||||
"""Test position tracking with Unicode characters."""
|
||||
chunker = DocumentChunker(chunk_size=10, overlap=2)
|
||||
content = "Hello 世界 こんにちは мир Привет שלום مرحبا 你好"
|
||||
|
||||
chunks = chunker.chunk_text(content)
|
||||
|
||||
# Verify all chunks extract correctly
|
||||
for chunk in chunks:
|
||||
extracted = content[chunk.start_offset : chunk.end_offset]
|
||||
assert extracted == chunk.text
|
||||
|
||||
# Verify full coverage
|
||||
if len(chunks) == 1:
|
||||
assert chunks[0].start_offset == 0
|
||||
assert chunks[0].end_offset == len(content)
|
||||
|
||||
def test_single_word_chunks(self):
|
||||
"""Test position tracking with single-word chunks."""
|
||||
chunker = DocumentChunker(chunk_size=1, overlap=0)
|
||||
content = "one two three"
|
||||
|
||||
chunks = chunker.chunk_text(content)
|
||||
|
||||
assert len(chunks) == 3
|
||||
assert chunks[0].text == "one"
|
||||
assert chunks[1].text == "two"
|
||||
assert chunks[2].text == "three"
|
||||
|
||||
# Verify positions
|
||||
assert content[chunks[0].start_offset : chunks[0].end_offset] == "one"
|
||||
assert content[chunks[1].start_offset : chunks[1].end_offset] == "two"
|
||||
assert content[chunks[2].start_offset : chunks[2].end_offset] == "three"
|
||||
|
||||
def test_realistic_note_content(self):
|
||||
"""Test with realistic note content similar to Nextcloud Notes."""
|
||||
chunker = DocumentChunker(chunk_size=50, overlap=10)
|
||||
content = """My Project Notes
|
||||
|
||||
This is a note about my project. It contains several paragraphs of text
|
||||
that should be chunked appropriately for embedding.
|
||||
|
||||
## Key Points
|
||||
|
||||
- First important point with some details
|
||||
- Second point that needs to be remembered
|
||||
- Third point for future reference
|
||||
|
||||
The document continues with more content here. We want to make sure that
|
||||
the chunking preserves context across boundaries while maintaining proper
|
||||
position tracking for each chunk.
|
||||
|
||||
This allows us to highlight the exact chunk that matched a search query,
|
||||
which builds trust in the RAG system."""
|
||||
|
||||
chunks = chunker.chunk_text(content)
|
||||
|
||||
# Should have multiple chunks
|
||||
assert len(chunks) > 1
|
||||
|
||||
# Verify all chunks
|
||||
for chunk in chunks:
|
||||
assert isinstance(chunk, ChunkWithPosition)
|
||||
# Verify extraction
|
||||
extracted = content[chunk.start_offset : chunk.end_offset]
|
||||
assert extracted == chunk.text
|
||||
# Verify positions are valid
|
||||
assert chunk.start_offset >= 0
|
||||
assert chunk.end_offset <= len(content)
|
||||
assert chunk.start_offset < chunk.end_offset
|
||||
|
||||
def test_chunk_boundaries(self):
|
||||
"""Test that chunk boundaries are word-aligned."""
|
||||
chunker = DocumentChunker(chunk_size=10, overlap=2)
|
||||
words = [f"word{i:02d}" for i in range(30)]
|
||||
content = " ".join(words)
|
||||
|
||||
chunks = chunker.chunk_text(content)
|
||||
|
||||
for chunk in chunks:
|
||||
# Verify chunk text starts and ends with word characters (no split words)
|
||||
# Unless it's the full content
|
||||
if len(chunks) > 1:
|
||||
# Each chunk should start with a word (not whitespace)
|
||||
assert chunk.text[0].strip() != ""
|
||||
# Each chunk should end with a word (not whitespace)
|
||||
assert chunk.text[-1].strip() != ""
|
||||
+1
Submodule third_party/notes added at ce08e985a7
Vendored
+1
-1
Submodule third_party/oidc updated: 9616294911...5670bc7e30
@@ -1861,12 +1861,15 @@ version = "0.40.0"
|
||||
source = { editable = "." }
|
||||
dependencies = [
|
||||
{ name = "aiosqlite" },
|
||||
{ name = "anthropic" },
|
||||
{ name = "authlib" },
|
||||
{ name = "boto3" },
|
||||
{ name = "caldav" },
|
||||
{ name = "click" },
|
||||
{ name = "fastembed" },
|
||||
{ name = "httpx" },
|
||||
{ name = "icalendar" },
|
||||
{ name = "jinja2" },
|
||||
{ name = "mcp", extra = ["cli"] },
|
||||
{ name = "opentelemetry-api" },
|
||||
{ name = "opentelemetry-exporter-otlp-proto-grpc" },
|
||||
@@ -1885,8 +1888,6 @@ dependencies = [
|
||||
|
||||
[package.dev-dependencies]
|
||||
dev = [
|
||||
{ name = "anthropic" },
|
||||
{ name = "boto3" },
|
||||
{ name = "commitizen" },
|
||||
{ name = "datasets" },
|
||||
{ name = "ipython" },
|
||||
@@ -1904,12 +1905,15 @@ dev = [
|
||||
[package.metadata]
|
||||
requires-dist = [
|
||||
{ name = "aiosqlite", specifier = ">=0.20.0" },
|
||||
{ name = "anthropic", specifier = ">=0.42.0" },
|
||||
{ name = "authlib", specifier = ">=1.6.5" },
|
||||
{ name = "boto3", specifier = ">=1.35.0" },
|
||||
{ name = "caldav", git = "https://github.com/cbcoutinho/caldav?branch=feature%2Fhttpx" },
|
||||
{ name = "click", specifier = ">=8.1.8" },
|
||||
{ name = "fastembed", specifier = ">=0.4.2" },
|
||||
{ name = "httpx", specifier = ">=0.28.1,<0.29.0" },
|
||||
{ name = "icalendar", specifier = ">=6.0.0,<7.0.0" },
|
||||
{ name = "jinja2", specifier = ">=3.1.6" },
|
||||
{ name = "mcp", extras = ["cli"], specifier = ">=1.21,<1.22" },
|
||||
{ name = "opentelemetry-api", specifier = ">=1.28.2" },
|
||||
{ name = "opentelemetry-exporter-otlp-proto-grpc", specifier = ">=1.28.2" },
|
||||
@@ -1928,8 +1932,6 @@ requires-dist = [
|
||||
|
||||
[package.metadata.requires-dev]
|
||||
dev = [
|
||||
{ name = "anthropic", specifier = ">=0.42.0" },
|
||||
{ name = "boto3", specifier = ">=1.35.0" },
|
||||
{ name = "commitizen", specifier = ">=4.8.2" },
|
||||
{ name = "datasets", specifier = ">=3.3.0" },
|
||||
{ name = "ipython", specifier = ">=9.2.0" },
|
||||
|
||||
Reference in New Issue
Block a user