perf: fix vector viz search performance and visual encoding

This commit addresses critical performance issues with vector visualization search (reducing time from 40s to ~2s) and improves result visualization through better visual encoding.

## Performance Fixes

### 1. Fix blocking sleep in retry decorator (base.py:51)

- Changed `time.sleep(5)` to `await anyio.sleep(5)` in @retry_on_429
- Prevents entire event loop from freezing during rate limit retries
- Impact: Reduced search time from 22s to 16s initially

### 2. Add concurrency limiting for verification (verification.py:77-93)

- Added `anyio.Semaphore(20)` to limit concurrent HTTP requests
- Prevents connection pool exhaustion (RequestError) from 90+ simultaneous requests
- Fixes false filtering (was filtering 77/90 results incorrectly)
- Note: Semaphore still in code but verification removed from viz endpoint

### 3. Remove unnecessary verification from viz endpoint (viz_routes.py:483-486)

- Visualization only needs Qdrant metadata (title, excerpt), not full content
- Verification only required for sampling (LLM needs full note content)
- Impact: Reduced search time from 43.7s to ~2s (final fix)

### 4. Restore streaming scanner pattern (scanner.py)

- Process notes one-at-a-time using async generator
- Avoids loading all notes into memory

## Visualization Improvements

### 5. Result-relative score normalization (viz_routes.py:489-504)

- Normalize scores within result set: best=1.0, worst=0.0
- Removes arbitrary RRF normalization (theoretical max didn't make sense)
- Makes visual encoding meaningful regardless of algorithm scores

### 6. Power scaling for marker sizes (userinfo_routes.py:743)

- Changed from linear `8 + (score * 12)` to power `6 + (score² * 14)`
- Creates dramatic visual contrast: 0.0→6px, 0.5→9.5px, 1.0→20px
- Combined with opacity (0.2-1.0) for clear visual hierarchy

### 7. Multi-channel visual encoding (userinfo_routes.py:740-745)

- Size: Power-scaled with score²
- Opacity: Linear 0.2-1.0 (keeps all points visible)
- Color: Viridis gradient (blue→yellow)
- Effect: Top results are large/bright/opaque, context results small/dim/transparent

## Result

- Search time: 40s → ~2s (20x faster)
- Visual contrast: Subtle → dramatic (clear result hierarchy)
- No arbitrary cutoffs: All results visible, best naturally highlighted

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
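
The visual encoding described in the commit message can be sketched as follows. This is an illustration only, not the code in viz_routes.py or userinfo_routes.py: the function names are hypothetical, the handling of an all-equal result set is an assumption, and the opacity formula is one linear mapping that happens to cover the stated 0.2-1.0 range.

```python
def normalize_scores(scores):
    """Rescale scores within one result set: best -> 1.0, worst -> 0.0."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: when every result ties, treat them all as "best".
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def marker_size(score):
    """Power scaling from the commit message: 6 + score^2 * 14 (pixels)."""
    return 6 + (score ** 2) * 14


def marker_opacity(score):
    """Assumed linear mapping of a normalized score onto 0.2-1.0."""
    return 0.2 + score * 0.8


# Raw scores can come from any ranking algorithm; only their relative order
# and spacing within the result set matter after normalization.
normalized = normalize_scores([2.0, 4.0, 6.0])   # [0.0, 0.5, 1.0]
sizes = [marker_size(s) for s in normalized]     # [6.0, 9.5, 20.0]
```

Compared with the earlier linear `8 + (score * 12)`, the squared term keeps mid-ranked results close to the minimum size, so only the top of the result set stands out.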
2025-11-16 07:01:35 +01:00
parent c8d9cc24e0
commit 137d1d6c75
5 changed files with 161 additions and 133 deletions
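
The two event-loop fixes from the commit message (a non-blocking sleep and a cap on concurrent requests) follow a common pattern, sketched below. To stay standard-library only, this sketch swaps in `asyncio.Semaphore` and `asyncio.sleep` for the `anyio.Semaphore(20)` and `anyio.sleep(5)` the commit names; `verify_all`, `fake_verify`, and `demo` are hypothetical names and do not appear in verification.py or base.py.

```python
import asyncio


async def verify_all(items, verify_one, limit=20):
    """Run verify_one over items with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        # Tasks beyond the limit wait here instead of opening more connections.
        async with semaphore:
            return await verify_one(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def demo():
    """Simulate 90 verifications and record the peak concurrency."""
    in_flight = 0
    peak = 0

    async def fake_verify(item):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        # Awaited sleep yields to the event loop; time.sleep() here would
        # stall every other task for the full duration.
        await asyncio.sleep(0.01)
        in_flight -= 1
        return item

    results = await verify_all(range(90), fake_verify, limit=20)
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(f"{len(results)} items verified, peak concurrency {peak}")
```

The semaphore bounds how many requests compete for the connection pool at once, which is the failure mode the commit attributes to 90+ simultaneous requests.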
@@ -182,73 +182,43 @@ async def scan_user_documents(
                f"[SCAN-{scan_id}] Using pruneBefore={prune_before} to optimize data transfer"
            )

-        # Fetch all notes from Nextcloud
-        notes = [
-            note
-            async for note in nc_client.notes.get_all_notes(prune_before=prune_before)
-        ]
-        logger.info(f"[SCAN-{scan_id}] Found {len(notes)} notes for {user_id}")
+        # Get indexed state from Qdrant first (for incremental sync)
+        indexed_docs = {}
+        if not initial_sync:
+            qdrant_client = await get_qdrant_client()
+            scroll_result = await qdrant_client.scroll(
+                collection_name=get_settings().get_collection_name(),
+                scroll_filter=Filter(
+                    must=[
+                        FieldCondition(key="user_id", match=MatchValue(value=user_id)),
+                        FieldCondition(key="doc_type", match=MatchValue(value="note")),
+                    ]
+                ),
+                with_payload=["doc_id", "indexed_at"],
+                with_vectors=False,
+                limit=10000,
+            )

-        # Record documents scanned
-        record_vector_sync_scan(len(notes))
+            indexed_docs = {
+                point.payload["doc_id"]: point.payload["indexed_at"]
+                for point in scroll_result[0]
+            }

-        if initial_sync:
-            # Send everything on first sync
-            for note in notes:
-                modified_at = note.get("modified", 0)
-                await send_stream.send(
-                    DocumentTask(
-                        user_id=user_id,
-                        doc_id=str(note["id"]),
-                        doc_type="note",
-                        operation="index",
-                        modified_at=modified_at,
-                    )
-                )
-            logger.info(f"Sent {len(notes)} documents for initial sync: {user_id}")
-            return
+            logger.debug(f"Found {len(indexed_docs)} indexed documents in Qdrant")

-        # Get indexed state from Qdrant
-        qdrant_client = await get_qdrant_client()
-        scroll_result = await qdrant_client.scroll(
-            collection_name=get_settings().get_collection_name(),
-            scroll_filter=Filter(
-                must=[
-                    FieldCondition(key="user_id", match=MatchValue(value=user_id)),
-                    FieldCondition(key="doc_type", match=MatchValue(value="note")),
-                ]
-            ),
-            with_payload=["doc_id", "indexed_at"],
-            with_vectors=False,
-            limit=10000,
-        )
-
-        indexed_docs = {
-            point.payload["doc_id"]: point.payload["indexed_at"]
-            for point in scroll_result[0]
-        }
-
-        logger.debug(f"Found {len(indexed_docs)} indexed documents in Qdrant")
-
-        # Compare and queue changes
+        # Stream notes from Nextcloud and process immediately
+        note_count = 0
        queued = 0
-        nextcloud_doc_ids = {str(note["id"]) for note in notes}
+        nextcloud_doc_ids = set()

-        for note in notes:
+        async for note in nc_client.notes.get_all_notes(prune_before=prune_before):
+            note_count += 1
            doc_id = str(note["id"])
-            indexed_at = indexed_docs.get(doc_id)
+            nextcloud_doc_ids.add(doc_id)
            modified_at = note.get("modified", 0)

-            # If document reappeared, remove from potentially_deleted
-            doc_key = (user_id, doc_id)
-            if doc_key in _potentially_deleted:
-                logger.debug(
-                    f"Document {doc_id} reappeared, removing from deletion grace period"
-                )
-                del _potentially_deleted[doc_key]
-
-            # Send if never indexed or modified since last index
-            if indexed_at is None or modified_at > indexed_at:
+            if initial_sync:
+                # Send everything on first sync
                await send_stream.send(
                    DocumentTask(
                        user_id=user_id,
@@ -259,6 +229,38 @@ async def scan_user_documents(
                    )
                )
                queued += 1
+            else:
+                # Incremental sync: compare with indexed state
+                indexed_at = indexed_docs.get(doc_id)
+
+                # If document reappeared, remove from potentially_deleted
+                doc_key = (user_id, doc_id)
+                if doc_key in _potentially_deleted:
+                    logger.debug(
+                        f"Document {doc_id} reappeared, removing from deletion grace period"
+                    )
+                    del _potentially_deleted[doc_key]
+
+                # Send if never indexed or modified since last index
+                if indexed_at is None or modified_at > indexed_at:
+                    await send_stream.send(
+                        DocumentTask(
+                            user_id=user_id,
+                            doc_id=doc_id,
+                            doc_type="note",
+                            operation="index",
+                            modified_at=modified_at,
+                        )
+                    )
+                    queued += 1
+
+        # Log and record metrics after streaming
+        logger.info(f"[SCAN-{scan_id}] Found {note_count} notes for {user_id}")
+        record_vector_sync_scan(note_count)
+
+        if initial_sync:
+            logger.info(f"Sent {queued} documents for initial sync: {user_id}")
+            return

        # Check for deleted documents (in Qdrant but not in Nextcloud)
        # Use grace period: only delete after 2 consecutive scans confirm absence