feat: Implement Qdrant placeholder state management

Introduces a placeholder-based state tracking system to prevent duplicate
document processing during the gap between scanner queuing and processor
completion.

**Key Changes:**

1. **Placeholder Helper Functions** (`vector/placeholder.py`):
   - `write_placeholder_point()` - Creates zero-vector placeholder when queuing
   - `query_document_metadata()` - Queries for existing entry (placeholder or real)
   - `delete_placeholder_point()` - Removes placeholder before writing real vectors
   - `get_placeholder_filter()` - Filters placeholders from user-facing queries

2. **Scanner Updates** (`vector/scanner.py`):
   - Replace `indexed_at` comparison with `modified_at` comparison
   - Write placeholder before queuing each document
   - Query per-document metadata instead of bulk-querying indexed_at
   - Fixes bug where files were resubmitted every scan cycle

3. **Processor Updates** (`vector/processor.py`):
   - Delete placeholder before upserting real vectors
   - Ensures no duplicate points in Qdrant

4. **Query Filters** (all search files):
   - Add `get_placeholder_filter()` to all user-facing queries
   - Ensures placeholders never appear in search results or visualizations
   - Applied to: bm25_hybrid.py, semantic.py, viz_routes.py, algorithms.py

**Architecture:**
- Placeholders use zero vectors with dimension from embedding service
- Payload includes `is_placeholder: True` flag for filtering
- Status field tracks: "pending", "processing", "completed", "failed"
- Deterministic UUIDs using uuid5 for consistent point IDs

**Impact:**
- Eliminates duplicate processing of same documents
- Fixes race condition where long-running documents get queued multiple times
- Prevents scanner from resubmitting files every scan cycle
- Maintains clean separation between in-flight and indexed documents

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
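The placeholder scheme described above can be sketched in plain Python without a Qdrant client. This is only an illustration under stated assumptions: the namespace, the `doc_type:user_id:doc_id` key format, the payload fields other than `is_placeholder` and `status`, and the helper names `placeholder_point_id`, `build_placeholder` and `needs_queuing` are all hypothetical and are not taken from `vector/placeholder.py` or `vector/scanner.py`.

```python
import uuid
from typing import Any, Dict, Optional

# Hypothetical namespace; the real module may use a different one.
PLACEHOLDER_NAMESPACE = uuid.NAMESPACE_URL


def placeholder_point_id(doc_id: str, doc_type: str, user_id: str) -> str:
    """Derive a deterministic ID so repeated scans address the same point."""
    key = f"{doc_type}:{user_id}:{doc_id}"  # assumed key format
    return str(uuid.uuid5(PLACEHOLDER_NAMESPACE, key))


def build_placeholder(
    doc_id: str, doc_type: str, user_id: str, modified_at: int, dimension: int
) -> Dict[str, Any]:
    """Build a zero-vector point flagged as a placeholder."""
    return {
        "id": placeholder_point_id(doc_id, doc_type, user_id),
        "vector": [0.0] * dimension,  # dimension would come from the embedding service
        "payload": {
            "doc_id": doc_id,
            "doc_type": doc_type,
            "user_id": user_id,
            "modified_at": modified_at,  # assumed to be an epoch timestamp
            "is_placeholder": True,
            "status": "pending",
        },
    }


def needs_queuing(existing: Optional[Dict[str, Any]], modified_at: int) -> bool:
    """Queue a document only if it is unknown or changed since it was recorded."""
    if existing is None:
        return True
    return existing.get("modified_at", 0) < modified_at
```

Because the ID is derived with `uuid.uuid5`, writing the placeholder twice for the same document overwrites one point rather than creating two, and comparing against the stored `modified_at` keeps an in-flight document from being queued again on the next scan.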
@@ -23,6 +23,7 @@ from nextcloud_mcp_server.observability.metrics import (
 )
 from nextcloud_mcp_server.observability.tracing import trace_operation
 from nextcloud_mcp_server.vector.document_chunker import DocumentChunker
+from nextcloud_mcp_server.vector.placeholder import delete_placeholder_point
 from nextcloud_mcp_server.vector.qdrant_client import get_qdrant_client
 from nextcloud_mcp_server.vector.scanner import DocumentTask
 
@@ -418,6 +419,20 @@ async def _index_document(
             )
         )
 
+    # Delete placeholder before writing real vectors
+    # This prevents duplicates and cleans up the placeholder state
+    try:
+        await delete_placeholder_point(
+            doc_id=doc_task.doc_id,
+            doc_type=doc_task.doc_type,
+            user_id=doc_task.user_id,
+        )
+    except Exception as e:
+        # Log but don't fail indexing if placeholder deletion fails
+        logger.warning(
+            f"Failed to delete placeholder for {doc_task.doc_type}_{doc_task.doc_id}: {e}"
+        )
+
     # Upsert to Qdrant
     await qdrant_client.upsert(
         collection_name=settings.get_collection_name(),
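The delete-then-upsert step in the hunk above treats placeholder cleanup as best effort: a failed delete is logged and indexing continues. A minimal sketch of that control flow, using a hypothetical in-memory store (`InMemoryStore` and `index_document` are illustrative names, not the Qdrant client API or the project's processor):

```python
import asyncio
import logging
from typing import Dict

logger = logging.getLogger(__name__)


class InMemoryStore:
    """Hypothetical stand-in for a vector store, used only for illustration."""

    def __init__(self, fail_delete: bool = False) -> None:
        self.points: Dict[str, dict] = {}
        self.fail_delete = fail_delete

    async def delete(self, point_id: str) -> None:
        if self.fail_delete:
            raise RuntimeError("simulated delete failure")
        self.points.pop(point_id, None)

    async def upsert(self, points: Dict[str, dict]) -> None:
        self.points.update(points)


async def index_document(
    store: InMemoryStore, placeholder_id: str, real_points: Dict[str, dict]
) -> None:
    """Remove the placeholder if possible, then always write the real points."""
    try:
        await store.delete(placeholder_id)
    except Exception as exc:
        # Cleanup is best effort: log and continue with indexing.
        logger.warning("Failed to delete placeholder %s: %s", placeholder_id, exc)
    await store.upsert(real_points)
```

In the failure path the placeholder stays in the store next to the real points, which is one reason the placeholder filter on user-facing queries still matters after indexing succeeds.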