refactor: migrate vector sync from asyncio.Queue to anyio memory object streams

Replace asyncio.Queue with anyio.create_memory_object_stream() throughout the vector sync system for better library consistency and improved shutdown semantics.

## Changes Made

**scanner.py**:
- Changed parameter type from `asyncio.Queue` to `MemoryObjectSendStream[DocumentTask]`
- Replaced all `await document_queue.put()` calls with `await send_stream.send()`
- Wrapped scanner loop in `async with send_stream:` context manager for automatic cleanup
- Updated log messages: "Queued" → "Sent"
- Removed `import asyncio` (no longer needed)

**processor.py**:
- Changed parameter type from `asyncio.Queue` to `MemoryObjectReceiveStream[DocumentTask]`
- Replaced `asyncio.wait_for(document_queue.get(), timeout=1.0)` with `anyio.fail_after(1.0)` + `await receive_stream.receive()`
- Removed all `document_queue.task_done()` calls (not needed with streams)
- Added `anyio.EndOfStream` exception handling for graceful shutdown when scanner closes
- Removed `import asyncio` (no longer needed)

**app.py**:
- Removed `import asyncio` from top-level imports
- Added `from anyio.streams.memory import MemoryObjectReceiveStream, MemoryObjectSendStream`
- Updated AppContext dataclass:
  - Replaced `document_queue: Optional[asyncio.Queue]` with:
    - `document_send_stream: Optional[MemoryObjectSendStream]`
    - `document_receive_stream: Optional[MemoryObjectReceiveStream]`
- Updated `app_lifespan_basic()`:
  - Replaced `asyncio.Queue(maxsize=...)` with `anyio.create_memory_object_stream(max_buffer_size=...)`
  - Pass `send_stream` to scanner_task
  - Pass `receive_stream.clone()` to each processor_task (enables multiple consumers)
  - Updated AppContext yield to include both streams
- Updated `starlette_lifespan()`:
  - Same changes as app_lifespan_basic for streamable-http transport
  - Removed `import asyncio as asyncio_module` (no longer needed)
  - Updated app.state storage to use send_stream and receive_stream

**semantic.py**:
- Updated `nc_get_vector_sync_status()` tool:
  - Access `document_receive_stream` instead of `document_queue` from lifespan context
  - Use `stream_stats.current_buffer_used` instead of `queue.qsize()` for pending count
  - More reliable metrics (qsize() was not guaranteed accurate)

## Benefits

1. **Library Consistency**: Pure anyio throughout codebase (was mixing asyncio.Queue with anyio.Event and anyio.create_task_group)
2. **Graceful Shutdown**: `async with send_stream:` automatically closes stream on exit, signaling EndOfStream to all processors
3. **Better Timeout Handling**: `anyio.fail_after()` is more idiomatic than `asyncio.wait_for()`
4. **Stream Cloning**: Easy to add multiple consumers via `receive_stream.clone()`
5. **Better Statistics**: `.statistics()` provides accurate buffer metrics (qsize() was unreliable)
6. **Type Safety**: Separate send/receive types prevent accidental misuse
7. **No task_done() tracking**: Streams handle completion automatically

## Testing

- ✅ All 69 unit tests passing
- ✅ All 5 smoke tests passing
- ✅ No regressions in functionality
- ✅ Graceful shutdown behavior improved

## References

- https://anyio.readthedocs.io/en/stable/why.html#queue-fix
- https://anyio.readthedocs.io/en/stable/streams.html#memory-object-streams

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-09 06:43:44 +01:00
parent 4b026e9aa0
commit 72232f937a
4 changed files with 83 additions and 78 deletions
@@ -1,14 +1,14 @@
 """Processor task for vector database synchronization.

-Processes documents from queue: fetches content, generates embeddings, stores in Qdrant.
+Processes documents from stream: fetches content, generates embeddings, stores in Qdrant.
 """

-import asyncio
 import logging
 import time
 import uuid

 import anyio
+from anyio.streams.memory import MemoryObjectReceiveStream
 from httpx import HTTPStatusError
 from qdrant_client.models import FieldCondition, Filter, MatchValue, PointStruct

@@ -24,27 +24,26 @@ logger = logging.getLogger(__name__)

 async def processor_task(
    worker_id: int,
-    document_queue: asyncio.Queue,
+    receive_stream: MemoryObjectReceiveStream[DocumentTask],
    shutdown_event: anyio.Event,
    nc_client: NextcloudClient,
    user_id: str,
 ):
    """
-    Process documents from queue concurrently.
+    Process documents from stream concurrently.

    Each processor task runs in a loop:
-    1. Pull document from queue (with timeout)
+    1. Receive document from stream (with timeout)
    2. Fetch content from Nextcloud
    3. Tokenize and chunk text
    4. Generate embeddings (I/O bound - external API)
    5. Upload vectors to Qdrant
-    6. Mark task complete

    Multiple processors run concurrently for I/O parallelism.

    Args:
        worker_id: Worker identifier for logging
-        document_queue: Queue to pull documents from
+        receive_stream: Stream to receive documents from
        shutdown_event: Event signaling shutdown
        nc_client: Authenticated Nextcloud client
        user_id: User being processed
@@ -54,32 +53,28 @@ async def processor_task(
    while not shutdown_event.is_set():
        try:
            # Get document with timeout (allows checking shutdown)
-            doc_task = await asyncio.wait_for(
-                document_queue.get(),
-                timeout=1.0,
-            )
+            with anyio.fail_after(1.0):
+                doc_task = await receive_stream.receive()

            # Process document
            await process_document(doc_task, nc_client)

-            # Mark complete
-            document_queue.task_done()
-
-        except asyncio.TimeoutError:
+        except TimeoutError:
            # No documents available, continue
            continue

+        except anyio.EndOfStream:
+            # Scanner finished and closed stream, exit gracefully
+            logger.info(f"Processor {worker_id}: Scanner finished, exiting")
+            break
+
        except Exception as e:
            logger.error(
                f"Processor {worker_id} error processing "
                f"{doc_task.doc_type}_{doc_task.doc_id}: {e}",
                exc_info=True,
            )
-            # Mark task done even on error to prevent queue blocking
-            try:
-                document_queue.task_done()
-            except ValueError:
-                pass
+            # Continue to next document (no task_done() needed with streams)

    logger.info(f"Processor {worker_id} stopped")