13b2d0048c
Introduces a placeholder-based state tracking system to prevent duplicate document processing during the gap between scanner queuing and processor completion. **Key Changes:** 1. **Placeholder Helper Functions** (`vector/placeholder.py`): - `write_placeholder_point()` - Creates zero-vector placeholder when queuing - `query_document_metadata()` - Queries for existing entry (placeholder or real) - `delete_placeholder_point()` - Removes placeholder before writing real vectors - `get_placeholder_filter()` - Filters placeholders from user-facing queries 2. **Scanner Updates** (`vector/scanner.py`): - Replace `indexed_at` comparison with `modified_at` comparison - Write placeholder before queuing each document - Query per-document metadata instead of bulk-querying indexed_at - Fixes bug where files were resubmitted every scan cycle 3. **Processor Updates** (`vector/processor.py`): - Delete placeholder before upserting real vectors - Ensures no duplicate points in Qdrant 4. **Query Filters** (all search files): - Add `get_placeholder_filter()` to all user-facing queries - Ensures placeholders never appear in search results or visualizations - Applied to: bm25_hybrid.py, semantic.py, viz_routes.py, algorithms.py **Architecture:** - Placeholders use zero vectors with dimension from embedding service - Payload includes `is_placeholder: True` flag for filtering - Status field tracks: "pending", "processing", "completed", "failed" - Deterministic UUIDs using uuid5 for consistent point IDs **Impact:** - Eliminates duplicate processing of same documents - Fixes race condition where long-running documents get queued multiple times - Prevents scanner from resubmitting files every scan cycle - Maintains clean separation between in-flight and indexed documents 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
24 lines
760 B
Docker
24 lines
760 B
Docker
FROM docker.io/library/python:3.12-slim-trixie@sha256:2e683fc3e18a248aa23b8022f2a3474b072b04fb851efe9b49f6b516a8944939
|
|
|
|
COPY --from=ghcr.io/astral-sh/uv:0.9.10@sha256:29bd45092ea8902c0bbb7f0a338f0494a382b1f4b18355df5be270ade679ff1d /uv /uvx /bin/
|
|
|
|
# Install dependencies
|
|
# 1. git (required for caldav dependency from git)
|
|
# 2. sqlite for development with token db
|
|
RUN apt update && apt install --no-install-recommends --no-install-suggests -y \
|
|
git \
|
|
tesseract-ocr \
|
|
sqlite3 && apt clean
|
|
|
|
WORKDIR /app
|
|
|
|
COPY . .
|
|
|
|
RUN uv sync --locked --no-dev --no-editable --no-cache
|
|
|
|
ENV PYTHONUNBUFFERED=1
|
|
ENV VIRTUAL_ENV=/app/.venv
|
|
ENV TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata
|
|
|
|
ENTRYPOINT ["/app/.venv/bin/nextcloud-mcp-server", "--host", "0.0.0.0"]
|