- Replace sequential Qdrant scroll calls with batch retrieve
(50 HTTP requests → 1 request, ~50x faster vector fetch)
- Add point_id to SearchResult to enable batch retrieval by Qdrant point ID
- Reuse query embedding from search algorithm in viz_routes
(eliminates redundant embedding call, saves ~30ms)
- Make BM25 encode() async with thread pool to avoid blocking event loop
(~4.4s was blocking, now properly async)
- Run PCA computation in thread pool to avoid blocking event loop
(~1.2s was blocking, now properly async)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Introduces a placeholder-based state tracking system to prevent duplicate
document processing during the gap between scanner queuing and processor
completion.
**Key Changes:**
1. **Placeholder Helper Functions** (`vector/placeholder.py`):
- `write_placeholder_point()` - Creates zero-vector placeholder when queuing
- `query_document_metadata()` - Queries for existing entry (placeholder or real)
- `delete_placeholder_point()` - Removes placeholder before writing real vectors
- `get_placeholder_filter()` - Filters placeholders from user-facing queries
2. **Scanner Updates** (`vector/scanner.py`):
- Replace `indexed_at` comparison with `modified_at` comparison
- Write placeholder before queuing each document
- Query per-document metadata instead of bulk-querying indexed_at
- Fixes bug where files were resubmitted every scan cycle
3. **Processor Updates** (`vector/processor.py`):
- Delete placeholder before upserting real vectors
- Ensures no duplicate points in Qdrant
4. **Query Filters** (all search files):
- Add `get_placeholder_filter()` to all user-facing queries
- Ensures placeholders never appear in search results or visualizations
- Applied to: bm25_hybrid.py, semantic.py, viz_routes.py, algorithms.py
**Architecture:**
- Placeholders use zero vectors with dimension from embedding service
- Payload includes `is_placeholder: True` flag for filtering
- Status field tracks: "pending", "processing", "completed", "failed"
- Deterministic UUIDs using uuid5 for consistent point IDs
**Impact:**
- Eliminates duplicate processing of same documents
- Fixes race condition where long-running documents get queued multiple times
- Prevents scanner from resubmitting files every scan cycle
- Maintains clean separation between in-flight and indexed documents
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Major improvements to vector visualization page:
- Refactor PCA to display individual chunks instead of averaged documents
- Add context expansion module for fetching surrounding text from notes and PDFs
- Update deduplication to use (doc_id, doc_type, chunk_start, chunk_end) keys
- Fix Alpine.js rendering with chunk-specific keys including offsets
- Refactor authentication helper to return NextcloudClient for better reuse
- Add async context manager support to NextcloudClient
Technical details:
- viz_routes.py: Fetch specific chunk vectors instead of averaging per document
- context.py: New module supporting both notes and PDF text extraction via PyMuPDF
- search algorithms: Extract page_number, chunk_index, total_chunks from Qdrant
- vector-viz.js/html: Use chunk positions in expansion tracking keys
This enables users to see which specific chunks match their query
and view them with surrounding context in the PCA visualization.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit addresses multiple issues with async operations, PDF metadata
extraction, and type safety in document processing and search.
## Async/Await Fixes
- processor.py:259 - Added await for chunker.chunk_text(content)
- processor.py:270 - Added await for bm25_service.encode_batch(chunk_texts)
- tests/unit/test_document_chunker.py - Converted all 12 test methods to async
## PDF Metadata Enhancement
- pymupdf.py:143 - Added file_size metadata extraction
- pymupdf.py:145-206 - Refactored to extract text page-by-page
- Manually loop through pages instead of using page_chunks=True
- Generate page_boundaries metadata for precise page tracking
- Works around pymupdf.layout.activate() breaking page_chunks=True
- processor.py:32-66 - Added assign_page_numbers() helper function
- Assigns page numbers to chunks based on overlap with page boundaries
- Handles chunks spanning multiple pages
- processor.py:298-300 - Call assign_page_numbers() for PDF files
## Type Safety Fixes
- bm25_hybrid.py:184 - Removed int() conversion of doc_id
- semantic.py:131 - Removed int() conversion of doc_id
- viz_routes.py:275 - Removed int() conversion of doc_id
- Added comments documenting that doc_id can be int (notes) or str (file paths)
## Testing
- All 18 tests passing (12 unit + 6 integration)
- No type errors in modified files
- Container logs show successful processing
- Vector viz searches working correctly
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Fix false-positive validation error where DBSF (Distribution-Based Score
Fusion) correctly produces scores > 1.0 but SearchResult validation
incorrectly rejected them.
**Root Cause**: SearchResult.__post_init__() enforced scores in [0.0, 1.0]
range, but DBSF sums normalized scores from multiple retrieval systems
(dense semantic + sparse BM25), resulting in scores like 1.55 when both
systems strongly agree a document is relevant.
**Changes**:
- Relaxed validation to allow any score ≥ 0.0 (algorithms.py:147-157)
- Updated SearchResult and SemanticSearchResult documentation to explain
score ranges for RRF ([0.0, 1.0]) vs DBSF (unbounded)
- Added comprehensive test coverage for both fusion methods
- Added DBSF fusion option to vector visualization UI
- Updated viz routes and vizApp() to support fusion parameter selection
**Testing**: All 157 unit tests pass, type checking passes, ruff passes
Fixes error: "Configuration error: Score must be between 0.0 and 1.0, got 1.1528953"
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>