Files
Chris Coutinho 30a4d84458 feat: add concurrent uploads and --force flag to upload command
- Add --force flag to delete all existing notes in target category before upload
- Implement concurrent uploads using anyio task groups (20 concurrent max)
- Add semaphore to limit concurrent requests and avoid overwhelming server
- Improve progress reporting with upload count and error tracking
- Update README with --force flag documentation

Performance improvement: Concurrent uploads significantly reduce upload time
from ~10-15 minutes to ~2-3 minutes for 3,633 documents.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-16 00:41:00 +01:00

9.0 KiB

RAG Evaluation Tests

This directory contains tests for evaluating the Retrieval-Augmented Generation (RAG) system in the Nextcloud MCP server, specifically the nc_semantic_search_answer tool.

Architecture

The RAG system has two components that are tested independently:

  1. Retrieval - Vector sync/embedding pipeline (indexed Nextcloud documents → vector database)
  2. Generation - MCP client LLM synthesis (retrieved context → natural language answer)

See ADR-013 for full architectural details.

Test Structure

tests/rag_evaluation/
├── README.md                       # This file
├── conftest.py                     # Pytest fixtures
├── llm_providers.py                # LLM provider abstraction (Ollama/Anthropic)
├── fixtures/
│   └── ground_truth.json           # Pre-generated reference answers
├── test_retrieval_quality.py       # Retrieval evaluation (Context Recall)
└── test_generation_quality.py      # Generation evaluation (Answer Correctness)

Metrics

Retrieval Evaluation

  • Metric: Context Recall
  • Method: Heuristic - Check if ground-truth document IDs appear in top-k results
  • Target: ≥80% recall

Generation Evaluation

  • Metric: Answer Correctness
  • Method: LLM-as-judge - Compare RAG answer vs ground truth (binary true/false)
  • Evaluation: External LLM evaluates semantic equivalence

Dataset

BeIR/nfcorpus - Medical/biomedical corpus with ~3,600 documents

Test Queries (5 selected):

  1. PLAIN-2630: "Alkylphenol Endocrine Disruptors and Allergies" (21 relevant docs)
  2. PLAIN-2660: "How Long to Detox From Fish Before Pregnancy?" (20 relevant docs)
  3. PLAIN-2510: "Coffee and Artery Function" (16 relevant docs)
  4. PLAIN-2430: "Preventing Brain Loss with B Vitamins?" (15 relevant docs)
  5. PLAIN-2690: "Chronic Headaches and Pork Tapeworms" (14 relevant docs)

Setup

1. Install Dependencies

uv sync --group dev

This installs:

  • anthropic>=0.42.0 - For Anthropic LLM evaluation
  • click>=8.1.8 - For CLI interface
  • datasets>=3.3.0 - For BeIR nfcorpus dataset loading

2. Configure LLM Provider

Set environment variables for your LLM provider:

Option A: Ollama (default, local/remote)

export RAG_EVAL_PROVIDER=ollama
export OLLAMA_HOST=https://ollama.example.com  # or RAG_EVAL_OLLAMA_BASE_URL
export RAG_EVAL_OLLAMA_MODEL=llama3.2:1b

Option B: Anthropic (cloud)

export RAG_EVAL_PROVIDER=anthropic
export RAG_EVAL_ANTHROPIC_API_KEY=sk-ant-...
export RAG_EVAL_ANTHROPIC_MODEL=claude-3-5-sonnet-20241022

3. One-Time Setup: Generate Ground Truth

Generate synthetic reference answers for the 5 test queries:

uv run python tools/rag_eval_cli.py generate

What this does:

  • Downloads nfcorpus dataset to tests/rag_evaluation/fixtures/nfcorpus/ (cached locally)
  • For each of the 5 selected queries, extracts highly relevant documents
  • Uses configured LLM to synthesize a reference answer
  • Saves to tests/rag_evaluation/fixtures/ground_truth.json

Optional flags:

  • --provider ollama|anthropic - Override LLM provider
  • --model MODEL_NAME - Override model name
  • --force-download - Re-download nfcorpus dataset

4. One-Time Setup: Upload Corpus to Nextcloud

Upload all 3,633 nfcorpus documents as Nextcloud notes:

uv run python tools/rag_eval_cli.py upload \
    --nextcloud-url http://localhost:8000 \
    --username admin \
    --password admin

What this does:

  • Downloads nfcorpus dataset (if not already cached)
  • Uploads all documents as notes in Nextcloud
  • Saves document ID → note ID mapping to tests/rag_evaluation/fixtures/note_mapping.json

Optional flags:

  • --category CATEGORY - Custom category for notes (default: nfcorpus_rag_eval)
  • --force-download - Re-download nfcorpus dataset
  • --force - Delete all existing notes in the target category before uploading (efficient corpus refresh)

Important: This step requires:

  • A running Nextcloud instance with vector sync enabled
  • Notes app installed
  • Valid credentials

Duration: ~10-15 minutes to upload 3,633 documents

Running Tests

Run All RAG Evaluation Tests

uv run pytest tests/rag_evaluation/ -v

Run Specific Test Suites

Retrieval Quality Only:

uv run pytest tests/rag_evaluation/test_retrieval_quality.py -v

Generation Quality Only:

uv run pytest tests/rag_evaluation/test_generation_quality.py -v

Run Individual Tests

uv run pytest tests/rag_evaluation/test_retrieval_quality.py::test_retrieval_context_recall -v
uv run pytest tests/rag_evaluation/test_generation_quality.py::test_answer_correctness -v

Test Execution Flow

Prerequisites (one-time setup):

  1. Generated ground truth (tools/rag_eval_cli.py generate)
  2. Uploaded corpus to Nextcloud (tools/rag_eval_cli.py upload)

Retrieval Quality Tests

  1. Setup (nfcorpus_test_data fixture):

    • Loads pre-generated ground truth from fixtures/ground_truth.json
    • Loads note mapping from fixtures/note_mapping.json
    • Returns test cases with expected note IDs
  2. Test (test_retrieval_context_recall):

    • For each query: Perform semantic search (top-10)
    • Extract retrieved note IDs
    • Calculate Context Recall = (expected ∩ retrieved) / expected
    • Assert recall ≥ 80%
  3. Cleanup:

    • None required (notes persist in Nextcloud for reuse)

Generation Quality Tests

  1. Setup:

    • Same as retrieval tests (reuses nfcorpus_test_data fixture)
    • Creates evaluation LLM provider
  2. Test (test_answer_correctness):

    • For each query: Call nc_semantic_search_answer MCP tool
    • Extract generated answer
    • Use LLM-as-judge to compare vs ground truth
    • Assert semantic equivalence (TRUE/FALSE)
  3. Cleanup:

    • LLM provider closed

Expected Test Duration

One-time setup:

  • Generate ground truth: ~5-10 minutes (5 queries with LLM generation)
  • Upload corpus: ~10-15 minutes (3,633 documents)
  • Total setup: ~15-25 minutes

Test execution (after setup):

  • Retrieval tests: ~1-2 minutes (5 queries, no upload/cleanup)
  • Generation tests: ~5-10 minutes (RAG generation + LLM evaluation)
  • Total per run: ~6-12 minutes

Note: These are NOT smoke tests and are NOT run in CI.

Limitations & Future Work

Current Limitations:

  • Only 5 test queries (limited statistical confidence)
  • Medical domain bias (may not represent production use cases)
  • Synthetic ground truth (LLM-generated, not human-validated)
  • Manual test execution (requires external LLM access)

Future Enhancements:

  • Expand to 50-100 queries for statistical significance
  • Add custom test dataset with production-representative documents
  • Implement additional metrics (faithfulness, context relevance, answer relevance)
  • Create automated benchmarking dashboard
  • Test multi-hop reasoning (synthesis questions)
  • Evaluate out-of-scope handling ("I don't know" responses)

Troubleshooting

Tests Fail with "Ground truth file not found"

Run the generate command first:

uv run python tools/rag_eval_cli.py generate

Tests Fail with "Note mapping file not found"

Run the upload command first:

uv run python tools/rag_eval_cli.py upload --nextcloud-url http://localhost:8000 --username admin --password admin

Tests Fail with "MCP sampling client not yet implemented"

The mcp_sampling_client fixture is a placeholder. You need to implement MCP client creation with sampling support. See the TODO in conftest.py.

Upload Command Fails

Common issues:

  1. Nextcloud not running: Ensure Nextcloud is accessible at the URL
  2. Invalid credentials: Verify username/password
  3. Notes app not installed: Install Notes app in Nextcloud
  4. Network timeout: Increase timeout in CLI (currently 60s)

LLM Timeout

If ground truth generation times out:

  1. Increase timeout in llm_providers.py (currently 10 min)
  2. Use a faster model: --model llama3.2:1b
  3. Check Ollama/Anthropic service availability

Dataset Download Fails

The nfcorpus dataset is downloaded automatically. If download fails:

  1. Check internet connection
  2. Manually download from: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip
  3. Extract to tests/rag_evaluation/fixtures/nfcorpus/
  4. Or use HuggingFace datasets cache: ~/.cache/huggingface/datasets/BeIR___nfcorpus/

Vector Sync Not Indexing Documents

After uploading, vector sync must index the documents:

  1. Check vector sync is enabled in Nextcloud
  2. Trigger manual sync if needed
  3. Wait for background job to process all documents
  4. Verify in Qdrant that vectors exist for uploaded notes

References