- Add ADR-013 documenting RAG evaluation architecture
- Implement two-part evaluation: Context Recall (retrieval) + Answer Correctness (generation)
- Create Click CLI for ground truth generation and corpus upload
- Add pytest fixtures and tests for retrieval/generation quality
- Use BeIR/nfcorpus dataset with 5 selected test queries
- Support Ollama and Anthropic LLM providers
- Generate synthetic ground truth answers offline
- Add comprehensive documentation in tests/rag_evaluation/README.md

The framework separates one-time setup (generate/upload) from test execution, making tests much faster (~6-12 min vs ~15-25 min per run). Tests are manual only (not in CI) and require external LLM access.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
ADR-013: RAG Evaluation Testing Framework
Status: Proposed
Date: 2025-11-15
Context
The nc_semantic_search_answer tool implements a Retrieval-Augmented Generation (RAG) system where:
- Retrieval: Vector sync pipeline indexes Nextcloud documents (notes, calendar, contacts, etc.) into a vector database
- Generation: MCP client's LLM synthesizes answers from retrieved documents via MCP sampling (ADR-008)
We need a testing framework to evaluate RAG system performance and identify whether failures occur in retrieval (wrong documents found) or generation (poor answer quality). This framework must use industry-standard evaluation methodologies while remaining practical to implement and maintain.
To establish a baseline, we will use the BeIR/nfcorpus dataset, a medical/biomedical corpus with 3,633 documents, established queries, and relevance judgments that map each query to the documents that answer it.
Homepage: https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/
Download: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip
Decision
We will implement a two-part evaluation framework that independently tests retrieval and generation quality using pytest fixtures.
In Scope
1. Retrieval Evaluation
Tests the vector sync/embedding pipeline's ability to find relevant documents.
- Metric: Context Recall (Did we retrieve documents containing the answer?)
- Evaluation method: Heuristic - Check if ground-truth document IDs appear in top-k retrieval results
- Test: Query → Semantic search → Assert expected doc IDs present
2. Generation Evaluation
Tests the MCP client LLM's ability to synthesize correct answers from retrieved context.
- Metric: Answer Correctness (Is the generated answer factually correct?)
- Evaluation method: LLM-as-judge - Compare RAG answer against ground-truth answer
- Test: Query → nc_semantic_search_answer → LLM evaluates answer vs. ground truth (binary true/false)
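The retrieval heuristic above can be sketched as a small pure function. This is a minimal illustration, not the framework's actual helper: the names `context_recall_at_k` and `recall_ceiling` and the document IDs in the usage lines are hypothetical. The second function shows a property worth keeping in mind when choosing a threshold: if a query has more expected documents than k, recall at k cannot reach 1.0.

```python
from typing import Iterable, Sequence


def context_recall_at_k(
    retrieved_ids: Sequence[str], expected_ids: Iterable[str], k: int
) -> float:
    """Fraction of expected documents that appear in the first k results."""
    expected = set(expected_ids)
    if not expected:
        raise ValueError("expected_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(expected & top_k) / len(expected)


def recall_ceiling(num_expected: int, k: int) -> float:
    """Best recall achievable when only k results are returned."""
    if num_expected <= 0 or k <= 0:
        raise ValueError("num_expected and k must be positive")
    return min(num_expected, k) / num_expected


# Hypothetical document IDs, for illustration only.
retrieved = ["MED-10", "MED-42", "MED-7", "MED-99"]
expected = ["MED-42", "MED-7", "MED-500"]
print(context_recall_at_k(retrieved, expected, k=3))  # 2 of 3 expected docs found
print(recall_ceiling(num_expected=14, k=10))          # upper bound on recall for k=10
```

With 14 expected documents and k=10, the ceiling is 10/14, so any recall threshold has to be chosen with the per-query number of relevant documents in mind.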
Out of Scope (Initial Implementation)
- Context Relevance/Precision: Measuring irrelevant documents in retrieval results
- Faithfulness/Groundedness: Detecting hallucinations not supported by retrieved context
- Answer Relevance: Whether answer addresses the specific question asked
- Out-of-Scope Handling: Testing "I don't know" responses when answer isn't in context
- Continuous benchmarking: Automated tracking of metric trends over time
- Custom domain datasets: Production-specific test data (medical corpus used initially)
These remain valuable for future iterations but add complexity beyond our initial goals.
Implementation
Test Structure
Location: tests/rag_evaluation/
- test_retrieval_quality.py - Retrieval evaluation tests
- test_generation_quality.py - Generation evaluation tests
- conftest.py - Fixtures for test data, MCP clients, and evaluation LLMs
Required Pytest Fixtures
- nfcorpus_test_data (session-scoped)
  - Downloads/caches BeIR nfcorpus dataset at runtime
  - Loads 5 pre-selected test queries with:
    - Query text
    - Pre-generated ground-truth answer (from tests/rag_evaluation/fixtures/ground_truth.json)
    - Expected document IDs (from qrels with score=2)
  - Uploads all corpus documents as notes in test Nextcloud instance
  - Triggers vector sync to index documents
  - Waits for indexing completion
  - Returns test case data structure
- mcp_sampling_client (session-scoped)
  - Creates MCP client that supports sampling
  - Configurable LLM provider (ollama or anthropic) via environment:
    - RAG_EVAL_PROVIDER=ollama (default) or anthropic
    - RAG_EVAL_OLLAMA_BASE_URL=http://localhost:11434
    - RAG_EVAL_OLLAMA_MODEL=llama3.1:8b
    - RAG_EVAL_ANTHROPIC_API_KEY=sk-...
    - RAG_EVAL_ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
  - Returns configured MCP client fixture
- evaluation_llm (session-scoped)
  - Separate LLM instance for evaluation (independent from MCP client)
  - Same provider configuration as mcp_sampling_client
  - Returns callable: async def evaluate(prompt: str) -> str
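One way the provider configuration could be read from the environment is sketched below. The variable names come from the list above; the `ProviderConfig` dataclass and `load_provider_config` function are hypothetical, and treating the listed URL and model values as defaults is an assumption (only `ollama` is stated as a default).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderConfig:
    """Hypothetical container for the RAG_EVAL_* settings."""
    provider: str
    model: str
    base_url: Optional[str] = None
    api_key: Optional[str] = None


def load_provider_config(env: Mapping[str, str] = os.environ) -> ProviderConfig:
    """Build a provider config from RAG_EVAL_* variables, failing fast on bad input."""
    provider = env.get("RAG_EVAL_PROVIDER", "ollama").lower()
    if provider == "ollama":
        return ProviderConfig(
            provider="ollama",
            # Assumed defaults, taken from the example values in the ADR.
            model=env.get("RAG_EVAL_OLLAMA_MODEL", "llama3.1:8b"),
            base_url=env.get("RAG_EVAL_OLLAMA_BASE_URL", "http://localhost:11434"),
        )
    if provider == "anthropic":
        api_key = env.get("RAG_EVAL_ANTHROPIC_API_KEY")
        if not api_key:
            raise ValueError("RAG_EVAL_ANTHROPIC_API_KEY is required for the anthropic provider")
        return ProviderConfig(
            provider="anthropic",
            model=env.get("RAG_EVAL_ANTHROPIC_MODEL", "claude-3-5-sonnet-20241022"),
            api_key=api_key,
        )
    raise ValueError(f"Unsupported RAG_EVAL_PROVIDER: {provider!r}")


# Passing an explicit mapping keeps the function easy to test.
config = load_provider_config({"RAG_EVAL_PROVIDER": "ollama"})
print(config.provider, config.model, config.base_url)
```

Both `mcp_sampling_client` and `evaluation_llm` could share a loader like this, since the ADR states they use the same provider configuration.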
Test Implementation Examples
# tests/rag_evaluation/test_retrieval_quality.py
async def test_retrieval_recall(nc_client, nfcorpus_test_data):
    """Test that semantic search retrieves documents containing the answer."""
    top_k = 10
    for test_case in nfcorpus_test_data:
        # Perform semantic search (retrieval only, no generation)
        results = await nc_client.notes.semantic_search(
            query=test_case.query,
            limit=top_k,
        )
        retrieved_doc_ids = {r.document_id for r in results}
        expected_doc_ids = set(test_case.expected_document_ids)
        assert expected_doc_ids, f"No expected documents for query: {test_case.query}"

        # Context Recall: Are expected documents in top-k results?
        # The selected queries have 14-21 highly relevant documents, so at most
        # top_k of them can be retrieved. Cap the denominator at top_k so that
        # the 0.8 threshold stays reachable.
        max_hits = min(len(expected_doc_ids), top_k)
        recall = len(expected_doc_ids & retrieved_doc_ids) / max_hits
        assert recall >= 0.8, f"Recall {recall:.2f} below threshold for query: {test_case.query}"
# tests/rag_evaluation/test_generation_quality.py
async def test_answer_correctness(mcp_sampling_client, evaluation_llm, nfcorpus_test_data):
    """Test that RAG system generates factually correct answers."""
    for test_case in nfcorpus_test_data:
        # Execute full RAG pipeline (retrieval + generation)
        result = await mcp_sampling_client.call_tool(
            "nc_semantic_search_answer",
            arguments={"query": test_case.query, "limit": 5},
        )
        rag_answer = result["generated_answer"]

        # LLM-as-judge evaluation
        evaluation_prompt = f"""Compare these two answers and respond with only TRUE or FALSE.

Question: {test_case.query}

Generated Answer: {rag_answer}

Ground Truth Answer: {test_case.ground_truth}

Are these answers semantically equivalent (do they convey the same factual information)?
Respond with only: TRUE or FALSE"""

        evaluation_result = await evaluation_llm(evaluation_prompt)
        assert evaluation_result.strip().upper() == "TRUE", (
            f"Answer mismatch for query: {test_case.query}\n"
            f"Got: {rag_answer}\nExpected: {test_case.ground_truth}"
        )
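Judge models do not always return a bare `TRUE` or `FALSE`; replies such as `True.` or `**FALSE**` are common. The following sketch shows one possible way to normalize the verdict before asserting on it. The `parse_verdict` helper is hypothetical and not part of the framework described here; it tolerates surrounding punctuation and whitespace but rejects any reply that carries extra words, so an ambiguous judgment is surfaced as an error rather than silently counted as a failure.

```python
import re


def parse_verdict(raw: str) -> bool:
    """Map a judge reply to a boolean, tolerating case and stray punctuation."""
    match = re.fullmatch(r"\W*(TRUE|FALSE)\W*", raw.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Judge reply is not a bare TRUE/FALSE verdict: {raw!r}")
    return match.group(1).upper() == "TRUE"


print(parse_verdict(" true.\n"))
print(parse_verdict("**FALSE**"))
```

Raising on unparseable replies keeps "the judge misbehaved" distinguishable from "the answer was wrong" when reading test output.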
Dataset Integration
The BeIR nfcorpus dataset structure:
- corpus.jsonl: 3,633 medical/biomedical documents (articles from PubMed)
- queries.jsonl: 3,237 queries (questions)
- qrels/*.tsv: Relevance judgments mapping query IDs to document IDs with scores (2=highly relevant, 1=somewhat relevant)
Important: The dataset provides relevance judgments (which documents answer which queries) but does NOT include ground truth answers. We must generate synthetic ground truth offline.
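As an illustration of how the relevance judgments could be turned into expected document IDs, the sketch below parses a qrels file in the tab-separated BEIR layout (a header row with `query-id`, `corpus-id`, and `score` columns) and keeps only documents at or above a minimum score. The function name `load_highly_relevant` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Set, TextIO


def load_highly_relevant(qrels_file: TextIO, min_score: int = 2) -> Dict[str, Set[str]]:
    """Map each query ID to the set of document IDs with score >= min_score."""
    reader = csv.DictReader(qrels_file, delimiter="\t")
    relevant: Dict[str, Set[str]] = defaultdict(set)
    for row in reader:
        if int(row["score"]) >= min_score:
            relevant[row["query-id"]].add(row["corpus-id"])
    return dict(relevant)


# Made-up rows in the qrels layout, for illustration only.
sample = io.StringIO(
    "query-id\tcorpus-id\tscore\n"
    "PLAIN-2510\tMED-1\t2\n"
    "PLAIN-2510\tMED-2\t1\n"
    "PLAIN-2630\tMED-3\t2\n"
)
print(load_highly_relevant(sample))
```

In practice the file handle would come from something like `open(path, newline="")` on one of the `qrels/*.tsv` files.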
Selected Test Queries (5 diverse candidates):
- PLAIN-2630: "Alkylphenol Endocrine Disruptors and Allergies" (5 words, 21 highly relevant docs)
- PLAIN-2660: "How Long to Detox From Fish Before Pregnancy?" (8 words, 20 highly relevant docs)
- PLAIN-2510: "Coffee and Artery Function" (4 words, 16 highly relevant docs)
- PLAIN-2430: "Preventing Brain Loss with B Vitamins?" (6 words, 15 highly relevant docs)
- PLAIN-2690: "Chronic Headaches and Pork Tapeworms" (5 words, 14 highly relevant docs)
Ground Truth Generation (offline, pre-test):
Ground truth answers will be generated offline using a script that:
- Loads nfcorpus dataset
- For each selected query, extracts top 3-5 highly relevant documents
- Uses an LLM (ollama/anthropic) to synthesize a reference answer
- Stores ground truth in tests/rag_evaluation/fixtures/ground_truth.json
# tools/generate_rag_ground_truth.py
from typing import List

# Import path assumed from the location of llm_providers.py in this ADR.
from tests.rag_evaluation.llm_providers import LLMProvider


async def generate_ground_truth(query: str, relevant_docs: List[dict], llm: LLMProvider) -> str:
    """Generate synthetic ground truth answer from highly relevant documents."""
    # Use at most five documents to keep the prompt short.
    context = "\n\n".join([
        f"Document {i+1}:\nTitle: {doc['title']}\n{doc['text']}"
        for i, doc in enumerate(relevant_docs[:5])
    ])

    prompt = f"""Based on the following documents, provide a comprehensive answer to this question:

Question: {query}

{context}

Provide a factual, well-structured answer that synthesizes information from the documents.
Focus on accuracy and completeness."""

    return await llm.generate(prompt, max_tokens=500)
Dataset Loading at Test Runtime (in nfcorpus_test_data fixture):
- Download nfcorpus dataset (cached in pytest temp directory)
- Load corpus, queries, and qrels (relevance judgments)
- Load pre-generated ground truth from tests/rag_evaluation/fixtures/ground_truth.json
- Upload all corpus documents as Nextcloud notes
- Trigger vector sync to index documents
- Wait for indexing completion
- Return test cases with query, ground truth, and expected doc IDs
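The final step, returning test cases, could look roughly like the sketch below. The attribute names `query`, `ground_truth`, and `expected_document_ids` match what the test examples in this ADR read from each test case; everything else is an assumption, in particular the JSON layout (a mapping from query ID to an object with `query` and `ground_truth` keys), the `RagTestCase` dataclass, and the `build_test_cases` function.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class RagTestCase:
    """Hypothetical test case record consumed by the evaluation tests."""
    query_id: str
    query: str
    ground_truth: str
    expected_document_ids: List[str]


def build_test_cases(ground_truth_json: str, relevant: Dict[str, Set[str]]) -> List[RagTestCase]:
    """Join pre-generated answers with highly relevant document IDs per query."""
    entries = json.loads(ground_truth_json)
    cases = []
    for query_id, entry in entries.items():
        doc_ids = sorted(relevant.get(query_id, set()))
        if not doc_ids:
            # A query without expected documents would make recall undefined.
            raise ValueError(f"No relevant documents found for query {query_id}")
        cases.append(
            RagTestCase(
                query_id=query_id,
                query=entry["query"],
                ground_truth=entry["ground_truth"],
                expected_document_ids=doc_ids,
            )
        )
    return cases


# Assumed fixture layout with placeholder content.
fixture = json.dumps(
    {"PLAIN-2510": {"query": "Coffee and Artery Function", "ground_truth": "placeholder answer"}}
)
cases = build_test_cases(fixture, {"PLAIN-2510": {"MED-2", "MED-1"}})
print(cases[0].query, cases[0].expected_document_ids)
```

Failing early on a query with no expected documents avoids a division by zero later in the recall calculation.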
LLM Provider Abstraction
# tests/rag_evaluation/llm_providers.py
from typing import Protocol

import anthropic


class LLMProvider(Protocol):
    """Minimal interface shared by all evaluation LLM backends."""

    async def generate(self, prompt: str, max_tokens: int = 100) -> str: ...


class OllamaProvider:
    def __init__(self, base_url: str, model: str):
        self.base_url = base_url
        self.model = model

    async def generate(self, prompt: str, max_tokens: int = 100) -> str:
        # Use httpx to call Ollama API
        ...


class AnthropicProvider:
    def __init__(self, api_key: str, model: str):
        self.client = anthropic.AsyncAnthropic(api_key=api_key)
        self.model = model

    async def generate(self, prompt: str, max_tokens: int = 100) -> str:
        message = await self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        # The first content block holds the generated text.
        return message.content[0].text
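The Ollama call is left open above. A minimal sketch of how it might be filled in follows, assuming Ollama's non-streaming `/api/generate` endpoint, where `num_predict` under `options` caps the number of generated tokens and the reply text is returned in the `response` field. The helper names, the class name `OllamaProviderSketch`, and the 120-second timeout are assumptions, and `httpx` is imported inside the method only so the pure helpers can be exercised without it installed.

```python
from typing import Any, Dict


def build_ollama_payload(model: str, prompt: str, max_tokens: int) -> Dict[str, Any]:
    """Request body for a single, non-streaming completion."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": max_tokens},
    }


def extract_ollama_text(body: Dict[str, Any]) -> str:
    """Pull the generated text out of a non-streaming /api/generate reply."""
    text = body.get("response")
    if not isinstance(text, str):
        raise ValueError("Ollama reply has no 'response' text")
    return text


class OllamaProviderSketch:
    """Hypothetical stand-in for OllamaProvider with the HTTP call filled in."""

    def __init__(self, base_url: str, model: str, timeout: float = 120.0):
        self.base_url = base_url
        self.model = model
        self.timeout = timeout  # assumed value; local inference can be slow

    async def generate(self, prompt: str, max_tokens: int = 100) -> str:
        import httpx  # third-party dependency, imported lazily in this sketch

        payload = build_ollama_payload(self.model, prompt, max_tokens)
        async with httpx.AsyncClient(base_url=self.base_url, timeout=self.timeout) as client:
            response = await client.post("/api/generate", json=payload)
            response.raise_for_status()
            return extract_ollama_text(response.json())
```

Keeping payload construction and response parsing as plain functions makes them testable without a running Ollama server.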
Consequences
Positive:
- Actionable debugging: Separate retrieval/generation tests pinpoint failure location
- Industry-standard metrics: Context Recall and Answer Correctness are recognized RAG evaluation metrics
- Simple initial implementation: Binary LLM evaluation (true/false) is straightforward to implement and interpret
- Extensible framework: Easy to add more metrics (faithfulness, relevance) later
- Standardized benchmark: nfcorpus provides objective comparison against published RAG systems
- Hybrid evaluation: Combines efficiency (heuristics for retrieval) with quality (LLM-as-judge for generation)
- Provider flexibility: Supports both local (Ollama) and cloud (Anthropic) LLM evaluation
Negative:
- Medical domain bias: nfcorpus is medical/biomedical content and may not represent production use cases (personal notes, calendar events, etc.)
- Manual test execution: Tests require external LLM access and are not integrated into CI pipeline
- Limited initial coverage: Starting with only 5 queries provides limited statistical confidence
- Evaluation cost: LLM-as-judge for generation evaluation incurs API costs (Anthropic) or requires local inference (Ollama)
- Single metric per component: Initial scope tests only one metric per component, missing other important quality dimensions
- Synthetic ground truth: Ground truth answers are LLM-generated, not human-validated, which may introduce evaluation bias
- Large corpus upload: Uploading 3,633 documents at test runtime may be slow; caching strategy needed
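To make the statistical-confidence limitation concrete, the sketch below computes a Wilson score interval for a pass rate. This is an illustrative calculation, not part of the framework; the function name `wilson_interval` is hypothetical and the default `z=1.96` corresponds to a 95% interval.

```python
import math
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial pass rate."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("require 0 <= passed <= total and total > 0")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


for passed, total in [(5, 5), (50, 50)]:
    low, high = wilson_interval(passed, total)
    print(f"{passed}/{total} passed: 95% interval {low:.2f} to {high:.2f}")
```

Even when all 5 queries pass, the lower bound is only about 0.57, whereas 50 out of 50 raises it to roughly 0.93.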
Future Work:
- Expand to 50-100 queries for statistical significance
- Add custom test dataset with production-representative documents (meeting notes, task lists, etc.)
- Implement additional metrics (faithfulness, context relevance, answer relevance)
- Create automated benchmarking dashboard to track metric trends
- Test multi-hop reasoning (synthesis questions requiring multiple documents)
- Evaluate out-of-scope handling ("I don't know" responses)