Notes are indexed as "{title}\n\n{content}" in processor.py but were
being retrieved as just content during chunk expansion, causing
chunk_start_offset and chunk_end_offset to be misaligned.
This fix reconstructs the full content structure when fetching notes
for chunk expansion, ensuring the displayed chunks match the excerpts
shown in search results.
Fixes chunk/excerpt mismatch reported in vector visualization.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Major improvements to vector visualization page:
- Refactor PCA to display individual chunks instead of averaged documents
- Add context expansion module for fetching surrounding text from notes and PDFs
- Update deduplication to use (doc_id, doc_type, chunk_start, chunk_end) keys
- Fix Alpine.js rendering with chunk-specific keys including offsets
- Refactor authentication helper to return NextcloudClient for better reuse
- Add async context manager support to NextcloudClient
Technical details:
- viz_routes.py: Fetch specific chunk vectors instead of averaging per document
- context.py: New module supporting both notes and PDF text extraction via PyMuPDF
- search algorithms: Extract page_number, chunk_index, total_chunks from Qdrant
- vector-viz.js/html: Use chunk positions in expansion tracking keys
This enables users to see which specific chunks match their query
and view them with surrounding context in the PCA visualization.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit addresses multiple issues with async operations, PDF metadata
extraction, and type safety in document processing and search.
## Async/Await Fixes
- processor.py:259 - Added await for chunker.chunk_text(content)
- processor.py:270 - Added await for bm25_service.encode_batch(chunk_texts)
- tests/unit/test_document_chunker.py - Converted all 12 test methods to async
## PDF Metadata Enhancement
- pymupdf.py:143 - Added file_size metadata extraction
- pymupdf.py:145-206 - Refactored to extract text page-by-page
- Manually loop through pages instead of using page_chunks=True
- Generate page_boundaries metadata for precise page tracking
- Works around pymupdf.layout.activate() breaking page_chunks=True
- processor.py:32-66 - Added assign_page_numbers() helper function
- Assigns page numbers to chunks based on overlap with page boundaries
- Handles chunks spanning multiple pages
- processor.py:298-300 - Call assign_page_numbers() for PDF files
## Type Safety Fixes
- bm25_hybrid.py:184 - Removed int() conversion of doc_id
- semantic.py:131 - Removed int() conversion of doc_id
- viz_routes.py:275 - Removed int() conversion of doc_id
- Added comments documenting that doc_id can be int (notes) or str (file paths)
## Testing
- All 18 tests passing (12 unit + 6 integration)
- No type errors in modified files
- Container logs show successful processing
- Vector viz searches working correctly
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Get container dimensions before creating Plotly layout to render at correct size immediately
- Add init() method with window resize listener for responsive plot sizing
- Remove post-render resize call (no longer needed with explicit dimensions)
- Improve colorbar positioning and scene domain configuration
This eliminates the visual "jump" during initial render and ensures the plot resizes smoothly when the browser window changes size.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Two fixes for the vector visualization page:
1. **CSS Loading Fix**: Moved CSS <link> from vector_viz.html fragment
to user_info.html <head> block. HTMX fragments don't process <link>
tags in <head>, causing unstyled page. Now CSS loads correctly.
2. **Camera Preservation**: Modified renderPlot() to preserve camera
position when toggling query point visibility. Previously, toggling
the "Show Query Point" checkbox would reset zoom/rotation to default.
Now reads existing camera settings from plot before updating.
Related: nextcloud_mcp_server/auth/static/vector-viz.js:123-130
Related: nextcloud_mcp_server/auth/templates/user_info.html:12
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Extract CSS and JavaScript into separate static files
- Created nextcloud_mcp_server/auth/static/vector-viz.css
- Created nextcloud_mcp_server/auth/static/vector-viz.js
- Updated templates to reference external assets
- Fix vector visualization issues:
- Normalize vectors before PCA to match Qdrant's cosine distance
- Add zero-norm and NaN detection/handling for large datasets
- Enable responsive Plotly sizing (autosize + responsive config)
- Widen plot area to full viewport width with minimized margins
- Improve visualization accuracy:
- Query point now positioned correctly relative to documents
- Handles 200+ points without JSON serialization errors
- Full-width plot maximizes screen space utilization
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit updates the web interface to better align with Nextcloud's
design system and improve the Vector Viz layout.
Changes:
- Replace emoji icons with Material Design SVG icons for better
consistency with Nextcloud apps
- Simplify navigation styling with minimal padding and subtle active
states (250px width)
- Update CSS variables to match Nextcloud design system
- Restructure Vector Viz from two-column to single-column vertical
layout for better plot visibility
- Move search controls to compact horizontal grid at top
- Make navigation toggle always visible (not just on mobile)
- Fix plot container sizing with overflow:visible to prevent colorbar
clipping
- Remove heavy shadows and custom card styling for cleaner aesthetic
- Add error and success page templates with consistent styling
Technical details:
- Preserve Alpine.js for reactive functionality
- Use CSS Grid for responsive horizontal controls layout
- Add smooth transitions for navigation collapse/expand
- Maintain HTMX for dynamic content loading
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Migrates from custom word-based chunking to LangChain's MarkdownTextSplitter
for better semantic search quality. This implements the chunking portion of
ADR-011.
Changes:
- Replace custom regex word chunker with MarkdownTextSplitter
- Optimized for Markdown content (headers, code blocks, lists)
- Convert from word-based (512 words) to character-based (2048 chars) chunking
- Maintain backward-compatible ChunkWithPosition interface
- Update configuration defaults and validation
- Update all unit tests (12/12 passing)
Benefits:
- Respects markdown structure boundaries
- Never breaks code blocks or headers mid-chunk
- Preserves semantic coherence within chunks
- Expected 20-30% improvement in recall quality
- Industry-standard approach (used by production RAG systems)
Note: Full reindex required to apply new chunking to existing documents.
Current vector database still contains old word-based chunks.
Related: ADR-011 (Improving Semantic Search Quality)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit enhances the vector visualization interface with better score
transparency and improved UX:
**Dual-Score Display:**
- Store original algorithm scores before normalization (viz_routes.py:203)
- Display both raw and normalized scores: "Raw Score: 0.842 (89% relative)"
- Update plot hover text with dual scores (userinfo_routes.py:740)
- Fixes issue where all queries showed at least one 100% match regardless
of actual relevance (normalization artifact)
**UI Improvements:**
1. Fusion Method dropdown: Changed from x-show to :disabled
- Prevents jarring layout shift when switching algorithms
- Dropdown stays visible but grayed out when Semantic is selected
- Better UX with opacity: 0.5 and cursor: not-allowed
2. Score Threshold: Changed step from 0.1 to "any"
- Allows arbitrary float precision (0.7, 0.85, 0.123)
- Users can now fine-tune threshold values
3. Document Types: Converted multi-select to checkbox grid
- Replaced clunky Ctrl/Cmd multi-select listbox
- Checkbox grid with cleaner layout
- Positioned left of Score Threshold and Result Limit inputs
- More intuitive UX
**Technical Details:**
- Raw score ranges vary by algorithm:
- Semantic: 0.0-1.0 (cosine similarity)
- BM25 RRF: ~0.001-0.033 (Reciprocal Rank Fusion)
- BM25 DBSF: Can exceed 1.0 (Distribution-Based Score Fusion)
- Normalized scores (0-1) used for visual encoding (marker size, color)
- Original scores preserved in API response via getattr fallback
Files modified:
- nextcloud_mcp_server/auth/viz_routes.py (store original_score)
- nextcloud_mcp_server/auth/templates/vector_viz.html (UI controls)
- nextcloud_mcp_server/auth/userinfo_routes.py (plot hover text)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added support for two fusion algorithms (RRF and DBSF) to combine dense
semantic and sparse BM25 search results, with comprehensive documentation
and unit tests.
Changes:
- Added fusion parameter to nc_semantic_search and nc_semantic_search_answer tools
- Updated ADR-014 with detailed comparison of RRF vs DBSF fusion algorithms
- Added unit tests for fusion algorithm initialization and validation
- Updated search_method in responses to include fusion type (e.g., "bm25_hybrid_rrf")
Fusion Algorithms:
- RRF (Reciprocal Rank Fusion): Default, rank-based, general-purpose
- DBSF (Distribution-Based Score Fusion): Score normalization using statistics
RRF is recommended for most use cases due to its robustness and established
track record. DBSF may provide better results when retrieval systems have
very different score distributions.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Track character offsets (start_offset, end_offset) for each chunk in vector
database metadata, enabling precise chunk highlighting in visualization pane.
Changes:
- processor.py: Store chunk_start_offset and chunk_end_offset in Qdrant metadata
- processor.py: Added metadata_version=2 to indicate position tracking support
- search/semantic.py: Return chunk positions from search results
- server/semantic.py: Expose chunk positions in API responses (SemanticSearchResult)
Enables viz pane to:
1. Display exact matched chunk with surrounding context
2. Highlight the precise portion of text that matched the query
3. Build user trust by showing what the RAG system actually retrieved
Position tracking uses ChunkWithPosition dataclass from document_chunker.py
which provides character-accurate offsets in the original document.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>