Changes:
- Add file_path to metadata in semantic and BM25 hybrid search algorithms
for PDF viewer integration (search/semantic.py:161-163, search/bm25_hybrid.py:230-232)
- Include chunk_start_offset, chunk_end_offset, page_number, and page_count
in search results for rich chunk display (api/management.py:981-1004)
- Add point_id field to SearchResult for batch retrieval (models/semantic.py)
- Fix type narrowing for chunk context API parameters (api/management.py:1102-1111)
- Fix None-safety in doc_types discovery (search/algorithms.py:114)
This enables the Astroglobe UI to display PDF pages at the correct
location for matched chunks.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Adds comprehensive vector search support for Nextcloud Deck cards,
including semantic search indexing, chunk preview in the vector viz UI,
and proper deep linking to cards.
**Vector Search Indexing**
- Add deck_card scanning in scanner.py (scan_deck_cards function)
- Index cards from non-archived, non-deleted boards
- Store metadata: board_id, board_title, stack_id, stack_title, card_type, duedate, owner
- Content structure: title + "\n\n" + description (matches indexing format)
- Incremental sync based on lastModified timestamp
- Deletion tracking with grace period
**Vector Visualization Support**
- Add deck_card handler in context.py for chunk preview expansion
- Include board_id in search result metadata (bm25_hybrid.py, semantic.py)
- Expose metadata in viz_routes.py JSON responses
- Update vector-viz.js to construct proper Deck URLs: /apps/deck/board/{board_id}/card/{card_id}
- Update vector_viz.html filter label from "Deck" to "Deck Cards"
**Bug Fixes**
- Skip soft-deleted boards (deletedAt > 0) to prevent 403 Forbidden errors
- Applies to scanner, processor, and context expansion code paths
- Deck API returns deleted boards but rejects stack access with 403
**Testing**
- Add integration tests in test_deck_vector_search.py:
- test_deck_card_semantic_search: Filtered search with doc_type="deck_card"
- test_deck_card_appears_in_cross_app_search: Cross-app search includes deck cards
- test_deck_card_chunk_context: Chunk context fetching for viz preview
**Documentation**
- Update README.md: Add Deck cards to semantic search feature list
- Update semantic-search-architecture.md: Document deck_card support
- Update nc_semantic_search tool documentation
**Type Safety**
- Fix type narrowing for page_boundaries (could be None) using cast()
- Fix scanner.py payload None check for type safety
Resolves vector search for Deck cards across indexing, search, and visualization.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed 8 type checker errors across the codebase:
- vector/scanner.py: Handle None scroll results with null-safe iteration
- search/{bm25_hybrid,semantic}.py: Add None checks for result.payload
- auth/{unified_verifier,webhook_routes}.py: Assert non-None auth credentials
- client/webdav.py: Add None checks before int() conversions
- providers/openai.py: Assert embedding_model is not None
- search/algorithms.py: Explicitly type doc_types set and cast values
- observability/logging_config.py: Match parent class signature (log_data)
Also fixed test_create_tag_creates_system_tag to match WebDAV implementation
(was testing OCS API endpoint, now tests correct WebDAV endpoint with
Content-Location header).
Type checker: 0 errors (down from 8), 20 warnings (ignored)
Tests: All 192 unit tests passing
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Move scanner/processor tasks from FastMCP session lifespan to Starlette
server lifespan (correct architecture: background tasks run once at
server level, not per-session)
- Change default CLI transport from SSE to streamable-http
- Remove SSE transport option from CLI (SSE is deprecated)
- Remove SSE client session factory from test fixtures
- Add tracing instrumentation to BM25 hybrid search operations for
better observability
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Replace sequential Qdrant scroll calls with batch retrieve
(50 HTTP requests → 1 request, ~50x faster vector fetch)
- Add point_id to SearchResult to enable batch retrieval by Qdrant point ID
- Reuse query embedding from search algorithm in viz_routes
(eliminates redundant embedding call, saves ~30ms)
- Make BM25 encode() async with thread pool to avoid blocking event loop
(~4.4s was blocking, now properly async)
- Run PCA computation in thread pool to avoid blocking event loop
(~1.2s was blocking, now properly async)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Introduces a placeholder-based state tracking system to prevent duplicate
document processing during the gap between scanner queuing and processor
completion.
**Key Changes:**
1. **Placeholder Helper Functions** (`vector/placeholder.py`):
- `write_placeholder_point()` - Creates zero-vector placeholder when queuing
- `query_document_metadata()` - Queries for existing entry (placeholder or real)
- `delete_placeholder_point()` - Removes placeholder before writing real vectors
- `get_placeholder_filter()` - Filters placeholders from user-facing queries
2. **Scanner Updates** (`vector/scanner.py`):
- Replace `indexed_at` comparison with `modified_at` comparison
- Write placeholder before queuing each document
- Query per-document metadata instead of bulk-querying indexed_at
- Fixes bug where files were resubmitted every scan cycle
3. **Processor Updates** (`vector/processor.py`):
- Delete placeholder before upserting real vectors
- Ensures no duplicate points in Qdrant
4. **Query Filters** (all search files):
- Add `get_placeholder_filter()` to all user-facing queries
- Ensures placeholders never appear in search results or visualizations
- Applied to: bm25_hybrid.py, semantic.py, viz_routes.py, algorithms.py
**Architecture:**
- Placeholders use zero vectors with dimension from embedding service
- Payload includes `is_placeholder: True` flag for filtering
- Status field tracks: "pending", "processing", "completed", "failed"
- Deterministic UUIDs using uuid5 for consistent point IDs
**Impact:**
- Eliminates duplicate processing of same documents
- Fixes race condition where long-running documents get queued multiple times
- Prevents scanner from resubmitting files every scan cycle
- Maintains clean separation between in-flight and indexed documents
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Major improvements to vector visualization page:
- Refactor PCA to display individual chunks instead of averaged documents
- Add context expansion module for fetching surrounding text from notes and PDFs
- Update deduplication to use (doc_id, doc_type, chunk_start, chunk_end) keys
- Fix Alpine.js rendering with chunk-specific keys including offsets
- Refactor authentication helper to return NextcloudClient for better reuse
- Add async context manager support to NextcloudClient
Technical details:
- viz_routes.py: Fetch specific chunk vectors instead of averaging per document
- context.py: New module supporting both notes and PDF text extraction via PyMuPDF
- search algorithms: Extract page_number, chunk_index, total_chunks from Qdrant
- vector-viz.js/html: Use chunk positions in expansion tracking keys
This enables users to see which specific chunks match their query
and view them with surrounding context in the PCA visualization.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit addresses multiple issues with async operations, PDF metadata
extraction, and type safety in document processing and search.
## Async/Await Fixes
- processor.py:259 - Added await for chunker.chunk_text(content)
- processor.py:270 - Added await for bm25_service.encode_batch(chunk_texts)
- tests/unit/test_document_chunker.py - Converted all 12 test methods to async
## PDF Metadata Enhancement
- pymupdf.py:143 - Added file_size metadata extraction
- pymupdf.py:145-206 - Refactored to extract text page-by-page
- Manually loop through pages instead of using page_chunks=True
- Generate page_boundaries metadata for precise page tracking
- Works around pymupdf.layout.activate() breaking page_chunks=True
- processor.py:32-66 - Added assign_page_numbers() helper function
- Assigns page numbers to chunks based on overlap with page boundaries
- Handles chunks spanning multiple pages
- processor.py:298-300 - Call assign_page_numbers() for PDF files
## Type Safety Fixes
- bm25_hybrid.py:184 - Removed int() conversion of doc_id
- semantic.py:131 - Removed int() conversion of doc_id
- viz_routes.py:275 - Removed int() conversion of doc_id
- Added comments documenting that doc_id can be int (notes) or str (file paths)
## Testing
- All 18 tests passing (12 unit + 6 integration)
- No type errors in modified files
- Container logs show successful processing
- Vector viz searches working correctly
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Fix false-positive validation error where DBSF (Distribution-Based Score
Fusion) correctly produces scores > 1.0 but SearchResult validation
incorrectly rejected them.
**Root Cause**: SearchResult.__post_init__() enforced scores in [0.0, 1.0]
range, but DBSF sums normalized scores from multiple retrieval systems
(dense semantic + sparse BM25), resulting in scores like 1.55 when both
systems strongly agree a document is relevant.
**Changes**:
- Relaxed validation to allow any score ≥ 0.0 (algorithms.py:147-157)
- Updated SearchResult and SemanticSearchResult documentation to explain
score ranges for RRF ([0.0, 1.0]) vs DBSF (unbounded)
- Added comprehensive test coverage for both fusion methods
- Added DBSF fusion option to vector visualization UI
- Updated viz routes and vizApp() to support fusion parameter selection
**Testing**: All 157 unit tests pass, type checking passes, ruff passes
Fixes error: "Configuration error: Score must be between 0.0 and 1.0, got 1.1528953"
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>