refactor: migrate vector sync from asyncio.Queue to anyio memory object streams
Replace asyncio.Queue with anyio.create_memory_object_stream() throughout
the vector sync system for better library consistency and improved shutdown
semantics.
## Changes Made
**scanner.py**:
- Changed parameter type from `asyncio.Queue` to `MemoryObjectSendStream[DocumentTask]`
- Replaced all `await document_queue.put()` calls with `await send_stream.send()`
- Wrapped scanner loop in `async with send_stream:` context manager for automatic cleanup
- Updated log messages: "Queued" → "Sent"
- Removed `import asyncio` (no longer needed)
**processor.py**:
- Changed parameter type from `asyncio.Queue` to `MemoryObjectReceiveStream[DocumentTask]`
- Replaced `asyncio.wait_for(document_queue.get(), timeout=1.0)` with `anyio.fail_after(1.0)` + `await receive_stream.receive()`
- Removed all `document_queue.task_done()` calls (not needed with streams)
- Added `anyio.EndOfStream` exception handling for graceful shutdown when scanner closes
- Removed `import asyncio` (no longer needed)
**app.py**:
- Removed `import asyncio` from top-level imports
- Added `from anyio.streams.memory import MemoryObjectReceiveStream, MemoryObjectSendStream`
- Updated AppContext dataclass:
- Replaced `document_queue: Optional[asyncio.Queue]` with:
- `document_send_stream: Optional[MemoryObjectSendStream]`
- `document_receive_stream: Optional[MemoryObjectReceiveStream]`
- Updated `app_lifespan_basic()`:
- Replaced `asyncio.Queue(maxsize=...)` with `anyio.create_memory_object_stream(max_buffer_size=...)`
- Pass `send_stream` to scanner_task
- Pass `receive_stream.clone()` to each processor_task (enables multiple consumers)
- Updated AppContext yield to include both streams
- Updated `starlette_lifespan()`:
- Same changes as app_lifespan_basic for streamable-http transport
- Removed `import asyncio as asyncio_module` (no longer needed)
- Updated app.state storage to use send_stream and receive_stream
**semantic.py**:
- Updated `nc_get_vector_sync_status()` tool:
- Access `document_receive_stream` instead of `document_queue` from lifespan context
- Use `stream_stats.current_buffer_used` instead of `queue.qsize()` for pending count
- More reliable metrics (qsize() was not guaranteed accurate)
## Benefits
1. **Library Consistency**: Pure anyio throughout codebase (was mixing asyncio.Queue with anyio.Event and anyio.create_task_group)
2. **Graceful Shutdown**: `async with send_stream:` automatically closes stream on exit, signaling EndOfStream to all processors
3. **Better Timeout Handling**: `anyio.fail_after()` is more idiomatic than `asyncio.wait_for()`
4. **Stream Cloning**: Easy to add multiple consumers via `receive_stream.clone()`
5. **Better Statistics**: `.statistics()` provides accurate buffer metrics (qsize() was unreliable)
6. **Type Safety**: Separate send/receive types prevent accidental misuse
7. **No task_done() tracking**: Streams handle completion automatically
## Testing
- ✅ All 69 unit tests passing
- ✅ All 5 smoke tests passing
- ✅ No regressions in functionality
- ✅ Graceful shutdown behavior improved
## References
- https://anyio.readthedocs.io/en/stable/why.html#queue-fix
- https://anyio.readthedocs.io/en/stable/streams.html#memory-object-streams
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
@@ -1,14 +1,14 @@
|
||||
"""Processor task for vector database synchronization.
|
||||
|
||||
Processes documents from queue: fetches content, generates embeddings, stores in Qdrant.
|
||||
Processes documents from stream: fetches content, generates embeddings, stores in Qdrant.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import logging
|
||||
import time
|
||||
import uuid
|
||||
|
||||
import anyio
|
||||
from anyio.streams.memory import MemoryObjectReceiveStream
|
||||
from httpx import HTTPStatusError
|
||||
from qdrant_client.models import FieldCondition, Filter, MatchValue, PointStruct
|
||||
|
||||
@@ -24,27 +24,26 @@ logger = logging.getLogger(__name__)
|
||||
|
||||
async def processor_task(
|
||||
worker_id: int,
|
||||
document_queue: asyncio.Queue,
|
||||
receive_stream: MemoryObjectReceiveStream[DocumentTask],
|
||||
shutdown_event: anyio.Event,
|
||||
nc_client: NextcloudClient,
|
||||
user_id: str,
|
||||
):
|
||||
"""
|
||||
Process documents from queue concurrently.
|
||||
Process documents from stream concurrently.
|
||||
|
||||
Each processor task runs in a loop:
|
||||
1. Pull document from queue (with timeout)
|
||||
1. Receive document from stream (with timeout)
|
||||
2. Fetch content from Nextcloud
|
||||
3. Tokenize and chunk text
|
||||
4. Generate embeddings (I/O bound - external API)
|
||||
5. Upload vectors to Qdrant
|
||||
6. Mark task complete
|
||||
|
||||
Multiple processors run concurrently for I/O parallelism.
|
||||
|
||||
Args:
|
||||
worker_id: Worker identifier for logging
|
||||
document_queue: Queue to pull documents from
|
||||
receive_stream: Stream to receive documents from
|
||||
shutdown_event: Event signaling shutdown
|
||||
nc_client: Authenticated Nextcloud client
|
||||
user_id: User being processed
|
||||
@@ -54,32 +53,28 @@ async def processor_task(
|
||||
while not shutdown_event.is_set():
|
||||
try:
|
||||
# Get document with timeout (allows checking shutdown)
|
||||
doc_task = await asyncio.wait_for(
|
||||
document_queue.get(),
|
||||
timeout=1.0,
|
||||
)
|
||||
with anyio.fail_after(1.0):
|
||||
doc_task = await receive_stream.receive()
|
||||
|
||||
# Process document
|
||||
await process_document(doc_task, nc_client)
|
||||
|
||||
# Mark complete
|
||||
document_queue.task_done()
|
||||
|
||||
except asyncio.TimeoutError:
|
||||
except TimeoutError:
|
||||
# No documents available, continue
|
||||
continue
|
||||
|
||||
except anyio.EndOfStream:
|
||||
# Scanner finished and closed stream, exit gracefully
|
||||
logger.info(f"Processor {worker_id}: Scanner finished, exiting")
|
||||
break
|
||||
|
||||
except Exception as e:
|
||||
logger.error(
|
||||
f"Processor {worker_id} error processing "
|
||||
f"{doc_task.doc_type}_{doc_task.doc_id}: {e}",
|
||||
exc_info=True,
|
||||
)
|
||||
# Mark task done even on error to prevent queue blocking
|
||||
try:
|
||||
document_queue.task_done()
|
||||
except ValueError:
|
||||
pass
|
||||
# Continue to next document (no task_done() needed with streams)
|
||||
|
||||
logger.info(f"Processor {worker_id} stopped")
|
||||
|
||||
|
||||
Reference in New Issue
Block a user