Chris Coutinho c272ddd82d feat: implement RAG evaluation framework with CLI tooling

- Add ADR-013 documenting RAG evaluation architecture
- Implement two-part evaluation: Context Recall (retrieval) + Answer Correctness (generation)
- Create Click CLI for ground truth generation and corpus upload
- Add pytest fixtures and tests for retrieval/generation quality
- Use BeIR/nfcorpus dataset with 5 selected test queries
- Support Ollama and Anthropic LLM providers
- Generate synthetic ground truth answers offline
- Add comprehensive documentation in tests/rag_evaluation/README.md

The framework separates one-time setup (generate/upload) from test execution,
making tests much faster (~6-12 min vs ~15-25 min per run).

Tests are manual only (not in CI) and require external LLM access.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-15 23:11:21 +01:00

RAG Evaluation Tests

This directory contains tests for evaluating the Retrieval-Augmented Generation (RAG) system in the Nextcloud MCP server, specifically the nc_semantic_search_answer tool.

Architecture

The RAG system has two components that are tested independently:

Retrieval - Vector sync/embedding pipeline (indexed Nextcloud documents → vector database)
Generation - MCP client LLM synthesis (retrieved context → natural language answer)

See ADR-013 for full architectural details.

Test Structure

tests/rag_evaluation/
├── README.md                       # This file
├── conftest.py                     # Pytest fixtures
├── llm_providers.py                # LLM provider abstraction (Ollama/Anthropic)
├── fixtures/
│   └── ground_truth.json           # Pre-generated reference answers
├── test_retrieval_quality.py       # Retrieval evaluation (Context Recall)
└── test_generation_quality.py      # Generation evaluation (Answer Correctness)

Metrics

Retrieval Evaluation

Metric: Context Recall
Method: Heuristic - Check whether the notes mapped from the ground-truth document IDs appear in the top-k search results (the retrieval test uses k = 10)
Target: ≥80% recall
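As a minimal sketch of the Context Recall heuristic (not the code in test_retrieval_quality.py), the metric can be computed as a set overlap. The function name `context_recall` and the note IDs below are hypothetical and only illustrate the calculation.

```python
from typing import Iterable, Sequence


def context_recall(expected_ids: Iterable[int],
                   retrieved_ids: Sequence[int],
                   k: int = 10) -> float:
    """Fraction of expected IDs that appear among the first k retrieved IDs."""
    expected = set(expected_ids)
    if not expected:
        raise ValueError("expected_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(expected & top_k) / len(expected)


# Hypothetical note IDs, for illustration only.
expected = [101, 102, 103, 104, 105]
retrieved = [101, 7, 103, 104, 9, 105, 11, 12, 13, 14]

recall = context_recall(expected, retrieved, k=10)
print(f"recall = {recall:.2f}")  # 4 of the 5 expected IDs were retrieved
assert recall >= 0.80            # the target stated above
```

Recall computed this way is bounded by k divided by the number of expected IDs, so with k = 10 a query with more than 12 expected IDs cannot reach 80%. Whether the expected set is limited to a subset of the relevant documents is not stated in this section.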

Generation Evaluation

Metric: Answer Correctness
Method: LLM-as-judge - Compare RAG answer vs ground truth (binary true/false)
Evaluation: External LLM evaluates semantic equivalence
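The LLM-as-judge step could look roughly like the sketch below. The prompt wording, the `judge_answer` function, and the `complete` callable (standing in for whatever llm_providers.py exposes) are assumptions made for illustration, not the project's actual interface.

```python
from typing import Callable

# Hypothetical judge prompt; the real prompt used by the tests may differ.
JUDGE_PROMPT = """You are grading a question-answering system.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Is the candidate semantically equivalent to the reference?
Reply with exactly one word: TRUE or FALSE."""


def judge_answer(question: str,
                 reference: str,
                 candidate: str,
                 complete: Callable[[str], str]) -> bool:
    """Ask an external LLM for a binary verdict and parse it strictly."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    verdict = complete(prompt).strip().upper()
    if verdict.startswith("TRUE"):
        return True
    if verdict.startswith("FALSE"):
        return False
    # Fail loudly instead of guessing when the judge goes off-script.
    raise ValueError(f"Unparseable judge verdict: {verdict!r}")


def fake_llm(prompt: str) -> str:
    # Stand-in for a real provider call; always answers TRUE.
    return "TRUE\n"


print(judge_answer("Does coffee affect artery function?",
                   "reference text", "candidate text", fake_llm))
```

Raising on an unparseable verdict keeps a malformed judge reply from being silently counted as a pass or a fail.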

Dataset

BeIR/nfcorpus - Medical/biomedical corpus with ~3,600 documents

Test Queries (5 selected):

PLAIN-2630: "Alkylphenol Endocrine Disruptors and Allergies" (21 relevant docs)
PLAIN-2660: "How Long to Detox From Fish Before Pregnancy?" (20 relevant docs)
PLAIN-2510: "Coffee and Artery Function" (16 relevant docs)
PLAIN-2430: "Preventing Brain Loss with B Vitamins?" (15 relevant docs)
PLAIN-2690: "Chronic Headaches and Pork Tapeworms" (14 relevant docs)

Setup

1. Install Dependencies

uv sync --group dev

This installs the dev dependency group, which includes:

anthropic>=0.42.0 - For Anthropic LLM evaluation
click>=8.1.8 - For CLI interface
datasets>=3.3.0 - For BeIR nfcorpus dataset loading

2. Configure LLM Provider

Set environment variables for your LLM provider:

Option A: Ollama (default, local/remote)

export RAG_EVAL_PROVIDER=ollama
export OLLAMA_HOST=https://ollama.example.com  # or RAG_EVAL_OLLAMA_BASE_URL
export RAG_EVAL_OLLAMA_MODEL=llama3.2:1b

Option B: Anthropic (cloud)

export RAG_EVAL_PROVIDER=anthropic
export RAG_EVAL_ANTHROPIC_API_KEY=sk-ant-...
export RAG_EVAL_ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
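For orientation, the environment variables above could be resolved into a provider configuration along these lines. This is a sketch under assumptions: the `EvalLLMConfig` dataclass and `load_eval_llm_config` function are hypothetical names, and the precedence of RAG_EVAL_OLLAMA_BASE_URL over OLLAMA_HOST is a guess, since this README does not say which one wins.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EvalLLMConfig:
    provider: str
    model: Optional[str]
    base_url: Optional[str] = None
    api_key: Optional[str] = None


def load_eval_llm_config(env: Mapping[str, str] = os.environ) -> EvalLLMConfig:
    """Build a provider config from RAG_EVAL_* variables (Ollama is the default)."""
    provider = env.get("RAG_EVAL_PROVIDER", "ollama").lower()
    if provider == "ollama":
        # Assumption: the RAG_EVAL_-prefixed URL takes precedence over OLLAMA_HOST.
        base_url = env.get("RAG_EVAL_OLLAMA_BASE_URL") or env.get("OLLAMA_HOST")
        if not base_url:
            raise ValueError("Set OLLAMA_HOST or RAG_EVAL_OLLAMA_BASE_URL")
        return EvalLLMConfig(provider, env.get("RAG_EVAL_OLLAMA_MODEL"),
                             base_url=base_url)
    if provider == "anthropic":
        api_key = env.get("RAG_EVAL_ANTHROPIC_API_KEY")
        if not api_key:
            raise ValueError("Set RAG_EVAL_ANTHROPIC_API_KEY")
        return EvalLLMConfig(provider, env.get("RAG_EVAL_ANTHROPIC_MODEL"),
                             api_key=api_key)
    raise ValueError(f"Unknown RAG_EVAL_PROVIDER: {provider!r}")


# Example with an explicit mapping instead of the real process environment.
config = load_eval_llm_config({
    "RAG_EVAL_PROVIDER": "ollama",
    "OLLAMA_HOST": "https://ollama.example.com",
    "RAG_EVAL_OLLAMA_MODEL": "llama3.2:1b",
})
print(config.provider, config.base_url, config.model)
```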

3. One-Time Setup: Generate Ground Truth

Generate synthetic reference answers for the 5 test queries:

uv run python tools/rag_eval_cli.py generate

What this does:

Downloads nfcorpus dataset to tests/rag_evaluation/fixtures/nfcorpus/ (cached locally)
For each of the 5 selected queries, extracts highly relevant documents
Uses configured LLM to synthesize a reference answer
Saves to tests/rag_evaluation/fixtures/ground_truth.json

Optional flags:

--provider ollama|anthropic - Override LLM provider
--model MODEL_NAME - Override model name
--force-download - Re-download nfcorpus dataset

4. One-Time Setup: Upload Corpus to Nextcloud

Upload all 3,633 nfcorpus documents as Nextcloud notes:

uv run python tools/rag_eval_cli.py upload \
    --nextcloud-url http://localhost:8000 \
    --username admin \
    --password admin

What this does:

Downloads nfcorpus dataset (if not already cached)
Uploads all documents as notes in Nextcloud
Saves document ID → note ID mapping to tests/rag_evaluation/fixtures/note_mapping.json

Optional flags:

--category CATEGORY - Custom category for notes (default: nfcorpus_rag_eval)
--force-download - Re-download nfcorpus dataset

Important: This step requires:

A running Nextcloud instance with vector sync enabled
Notes app installed
Valid credentials

Duration: ~10-15 minutes to upload 3,633 documents
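The document ID → note ID mapping written by the upload step could be produced roughly as follows. This is an illustrative sketch, not the code in tools/rag_eval_cli.py: `upload_corpus` and the injected `create_note` callable are hypothetical, and the flat JSON object layout of note_mapping.json is an assumption.

```python
import json
import tempfile
from itertools import count
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def upload_corpus(documents: Iterable[Tuple[str, str, str]],
                  create_note: Callable[[str, str, str], int],
                  mapping_path: Path,
                  category: str = "nfcorpus_rag_eval") -> Dict[str, int]:
    """Create one note per (doc_id, title, text) and persist doc ID -> note ID."""
    mapping: Dict[str, int] = {}
    for doc_id, title, text in documents:
        # create_note is expected to return the ID of the newly created note.
        mapping[doc_id] = create_note(title, text, category)
    mapping_path.parent.mkdir(parents=True, exist_ok=True)
    mapping_path.write_text(json.dumps(mapping, indent=2), encoding="utf-8")
    return mapping


# Demo with a fake note creator; a real one would call the Nextcloud Notes API.
fake_ids = count(start=1)
docs = [("MED-10", "Title A", "Text A"), ("MED-14", "Title B", "Text B")]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fixtures" / "note_mapping.json"
    mapping = upload_corpus(docs, lambda title, text, cat: next(fake_ids), path)
    print(path.read_text(encoding="utf-8"))
```

Injecting `create_note` keeps the mapping logic testable without a running Nextcloud instance.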

Running Tests

Run All RAG Evaluation Tests

uv run pytest tests/rag_evaluation/ -v

Run Specific Test Suites

Retrieval Quality Only:

uv run pytest tests/rag_evaluation/test_retrieval_quality.py -v

Generation Quality Only:

uv run pytest tests/rag_evaluation/test_generation_quality.py -v

Run Individual Tests

uv run pytest tests/rag_evaluation/test_retrieval_quality.py::test_retrieval_context_recall -v
uv run pytest tests/rag_evaluation/test_generation_quality.py::test_answer_correctness -v

Test Execution Flow

Prerequisites (one-time setup):

Generated ground truth (tools/rag_eval_cli.py generate)
Uploaded corpus to Nextcloud (tools/rag_eval_cli.py upload)

Retrieval Quality Tests

Setup (nfcorpus_test_data fixture):
- Loads pre-generated ground truth from fixtures/ground_truth.json
- Loads note mapping from fixtures/note_mapping.json
- Returns test cases with expected note IDs
Test (test_retrieval_context_recall):
- For each query: Perform semantic search (top-10)
- Extract retrieved note IDs
- Calculate Context Recall = |expected ∩ retrieved| / |expected|
- Assert recall ≥ 80%
Cleanup:
- None required (notes persist in Nextcloud for reuse)
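The setup step above, which joins the ground truth with the note mapping, might be sketched as below. The JSON field names (`query_id`, `query_text`, `ground_truth_answer`, `expected_document_ids`) are assumptions about the fixture schema, and `build_test_cases` / `load_test_cases` are hypothetical helpers rather than the actual nfcorpus_test_data fixture.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def build_test_cases(ground_truth: List[Dict[str, Any]],
                     note_mapping: Dict[str, int]) -> List[Dict[str, Any]]:
    """Translate expected document IDs into the note IDs created at upload time."""
    cases = []
    for entry in ground_truth:
        doc_ids = entry["expected_document_ids"]
        missing = [d for d in doc_ids if d not in note_mapping]
        if missing:
            raise KeyError(f"{entry['query_id']}: documents not uploaded: {missing}")
        cases.append({
            "query_id": entry["query_id"],
            "query_text": entry["query_text"],
            "ground_truth_answer": entry["ground_truth_answer"],
            "expected_note_ids": [note_mapping[d] for d in doc_ids],
        })
    return cases


def load_test_cases(fixtures_dir: Path) -> List[Dict[str, Any]]:
    """Read both fixture files and combine them (file names as used in this README)."""
    ground_truth = json.loads(
        (fixtures_dir / "ground_truth.json").read_text(encoding="utf-8"))
    note_mapping = json.loads(
        (fixtures_dir / "note_mapping.json").read_text(encoding="utf-8"))
    return build_test_cases(ground_truth, note_mapping)


# In-memory example with made-up values.
sample_truth = [{
    "query_id": "PLAIN-2510",
    "query_text": "Coffee and Artery Function",
    "ground_truth_answer": "placeholder reference answer",
    "expected_document_ids": ["MED-10", "MED-14"],
}]
print(build_test_cases(sample_truth, {"MED-10": 501, "MED-14": 502}))
```

Failing early on unmapped documents surfaces an incomplete upload before any recall numbers are computed.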

Generation Quality Tests

Setup:
- Same as retrieval tests (reuses nfcorpus_test_data fixture)
- Creates evaluation LLM provider
Test (test_answer_correctness):
- For each query: Call nc_semantic_search_answer MCP tool
- Extract generated answer
- Use LLM-as-judge to compare vs ground truth
- Assert semantic equivalence (TRUE/FALSE)
Cleanup:
- LLM provider closed

Expected Test Duration

One-time setup:

Generate ground truth: ~5-10 minutes (5 queries with LLM generation)
Upload corpus: ~10-15 minutes (3,633 documents)
Total setup: ~15-25 minutes

Test execution (after setup):

Retrieval tests: ~1-2 minutes (5 queries, no upload/cleanup)
Generation tests: ~5-10 minutes (RAG generation + LLM evaluation)
Total per run: ~6-12 minutes

Note: These are NOT smoke tests and are NOT run in CI.

Limitations & Future Work

Current Limitations:

Only 5 test queries (limited statistical confidence)
Medical domain bias (may not represent production use cases)
Synthetic ground truth (LLM-generated, not human-validated)
Manual test execution (requires external LLM access)

Future Enhancements:

Expand to 50-100 queries for statistical significance
Add custom test dataset with production-representative documents
Implement additional metrics (faithfulness, context relevance, answer relevance)
Create automated benchmarking dashboard
Test multi-hop reasoning (synthesis questions)
Evaluate out-of-scope handling ("I don't know" responses)

Troubleshooting

Tests Fail with "Ground truth file not found"

Run the generate command first:

uv run python tools/rag_eval_cli.py generate

Tests Fail with "Note mapping file not found"

Run the upload command first:

uv run python tools/rag_eval_cli.py upload --nextcloud-url http://localhost:8000 --username admin --password admin

Tests Fail with "MCP sampling client not yet implemented"

The mcp_sampling_client fixture is a placeholder. You need to implement MCP client creation with sampling support. See the TODO in conftest.py.

Upload Command Fails

Common issues:

Nextcloud not running: Ensure Nextcloud is accessible at the URL
Invalid credentials: Verify username/password
Notes app not installed: Install Notes app in Nextcloud
Network timeout: Increase timeout in CLI (currently 60s)

LLM Timeout

If ground truth generation times out:

Increase timeout in llm_providers.py (currently 10 min)
Use a faster model: --model llama3.2:1b
Check Ollama/Anthropic service availability

Dataset Download Fails

The nfcorpus dataset is downloaded automatically. If download fails:

Check internet connection
Manually download from: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip
Extract to tests/rag_evaluation/fixtures/nfcorpus/
Or use HuggingFace datasets cache: ~/.cache/huggingface/datasets/BeIR___nfcorpus/

Vector Sync Not Indexing Documents

After uploading, vector sync must index the documents:

Check vector sync is enabled in Nextcloud
Trigger manual sync if needed
Wait for background job to process all documents
Verify in Qdrant that vectors exist for uploaded notes
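Waiting for the background job can be automated with a small polling helper such as the sketch below. `wait_until_indexed` and the `count_indexed` callable are hypothetical; in practice the callable would query the vector store, for example through a Qdrant client count on whichever collection the server uses (the collection name is not given in this README).

```python
import time
from typing import Callable


def wait_until_indexed(count_indexed: Callable[[], int],
                       expected: int,
                       timeout_s: float = 1800.0,
                       poll_s: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> int:
    """Poll count_indexed() until it reaches expected or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        indexed = count_indexed()
        if indexed >= expected:
            return indexed
        if clock() >= deadline:
            raise TimeoutError(
                f"only {indexed}/{expected} items indexed after {timeout_s:.0f}s")
        sleep(poll_s)


# Demo with a fake counter that grows on each poll; no real waiting happens.
progress = iter([0, 1200, 3633])
done = wait_until_indexed(lambda: next(progress), expected=3633,
                          sleep=lambda seconds: None)
print(f"indexed: {done}")
```

If notes are split into chunks, the vector count can exceed the number of notes, so a count-based check is only a rough signal that indexing has finished.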
