feat: implement RAG evaluation framework with CLI tooling

- Add ADR-013 documenting RAG evaluation architecture
- Implement two-part evaluation: Context Recall (retrieval) + Answer Correctness (generation)
- Create Click CLI for ground truth generation and corpus upload
- Add pytest fixtures and tests for retrieval/generation quality
- Use BeIR/nfcorpus dataset with 5 selected test queries
- Support Ollama and Anthropic LLM providers
- Generate synthetic ground truth answers offline
- Add comprehensive documentation in tests/rag_evaluation/README.md

The framework separates one-time setup (generate/upload) from test execution,
making tests much faster (~6-12 min vs ~15-25 min per run).

Tests are manual only (not in CI) and require external LLM access.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Chris Coutinho
2025-11-15 23:11:06 +01:00
parent dc78d92e5b
commit c272ddd82d
10 changed files with 2158 additions and 0 deletions
+3
@@ -13,3 +13,6 @@ docker-compose.override.yml
# Generated by pytest used to login users
.nextcloud_oauth_*.json
.playwright-mcp/
# RAG Evaluation
tests/rag_evaluation/fixtures/
+254
@@ -0,0 +1,254 @@
## ADR-013: RAG Evaluation Testing Framework
**Status:** Proposed
**Date:** 2025-11-15
### Context
The `nc_semantic_search_answer` tool implements a Retrieval-Augmented Generation (RAG) system where:
1. **Retrieval**: Vector sync pipeline indexes Nextcloud documents (notes, calendar, contacts, etc.) into a vector database
2. **Generation**: MCP client's LLM synthesizes answers from retrieved documents via MCP sampling (ADR-008)
We need a testing framework to evaluate RAG system performance and identify whether failures occur in retrieval (wrong documents found) or generation (poor answer quality). This framework must use industry-standard evaluation methodologies while remaining practical to implement and maintain.
To establish a baseline, we will use the **BeIR/nfcorpus** dataset (medical/biomedical corpus) with 3,633 documents and established query/relevance-judgment pairs.
Homepage: https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/
Download: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip
### Decision
We will implement a **two-part evaluation framework** that independently tests retrieval and generation quality using pytest fixtures.
#### In Scope
**1. Retrieval Evaluation**
Tests the vector sync/embedding pipeline's ability to find relevant documents.
- **Metric: Context Recall** (Did we retrieve documents containing the answer?)
- **Evaluation method**: Heuristic - Check if ground-truth document IDs appear in top-k retrieval results
- **Test**: Query → Semantic search → Assert expected doc IDs present
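The recall heuristic above can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not the framework's own code: the name `context_recall`, the decision to score an empty expected set as 0.0, and the document IDs are all hypothetical.

```python
def context_recall(expected_ids: set[str], retrieved_ids: set[str]) -> float:
    """Fraction of expected (ground-truth) document IDs present in the retrieved set."""
    if not expected_ids:
        # Assumption: a query with no expected documents scores 0.0 rather than
        # raising ZeroDivisionError.
        return 0.0
    return len(expected_ids & retrieved_ids) / len(expected_ids)


# Hypothetical IDs, for illustration only.
expected = {"doc-a", "doc-b", "doc-c", "doc-d"}
retrieved = {"doc-a", "doc-c", "doc-x"}
recall = context_recall(expected, retrieved)  # 2 of the 4 expected documents were retrieved
```

Keeping the metric separate from the search call makes it possible to unit test the arithmetic without a running Nextcloud instance.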
**2. Generation Evaluation**
Tests the MCP client LLM's ability to synthesize correct answers from retrieved context.
- **Metric: Answer Correctness** (Is the generated answer factually correct?)
- **Evaluation method**: LLM-as-judge - Compare RAG answer against ground-truth answer
- **Test**: Query → `nc_semantic_search_answer` → LLM evaluates answer vs. ground truth (binary true/false)
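The binary verdict returned by the judge has to be turned into a boolean. The sketch below shows one tolerant way to do that; `parse_judge_verdict` is a hypothetical helper, and raising on an unparseable reply (instead of treating it as FALSE) is an assumption, not documented behaviour of the framework.

```python
def parse_judge_verdict(raw: str) -> bool:
    """Interpret a judge reply as a boolean, ignoring case, whitespace,
    surrounding quotes, and trailing punctuation."""
    token = raw.strip().strip(".!\"'").upper()
    if token == "TRUE":
        return True
    if token == "FALSE":
        return False
    # Assumption: anything else is surfaced as an error so that a chatty or
    # malformed judge reply is not silently counted as an incorrect answer.
    raise ValueError(f"Unparseable judge verdict: {raw!r}")
```

A strict comparison such as `reply.strip().upper() == "TRUE"` is simpler, but it counts replies like `TRUE.` as failures; whether that strictness is desirable is a judgment call.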
#### Out of Scope (Initial Implementation)
- **Context Relevance/Precision**: Measuring irrelevant documents in retrieval results
- **Faithfulness/Groundedness**: Detecting hallucinations not supported by retrieved context
- **Answer Relevance**: Whether answer addresses the specific question asked
- **Out-of-Scope Handling**: Testing "I don't know" responses when answer isn't in context
- **Continuous benchmarking**: Automated tracking of metric trends over time
- **Custom domain datasets**: Production-specific test data (medical corpus used initially)
These remain valuable for future iterations but add complexity beyond our initial goals.
#### Implementation
**Test Structure**
Location: `tests/rag_evaluation/`
- `test_retrieval_quality.py` - Retrieval evaluation tests
- `test_generation_quality.py` - Generation evaluation tests
- `conftest.py` - Fixtures for test data, MCP clients, and evaluation LLMs
**Required Pytest Fixtures**
1. **`nfcorpus_test_data`** (session-scoped)
- Downloads/caches BeIR nfcorpus dataset at runtime
- Loads 5 pre-selected test queries with:
- Query text
- Pre-generated ground-truth answer (from `tests/rag_evaluation/fixtures/ground_truth.json`)
- Expected document IDs (from qrels with score=2)
- Uploads all corpus documents as notes in test Nextcloud instance
- Triggers vector sync to index documents
- Waits for indexing completion
- Returns test case data structure
2. **`mcp_sampling_client`** (session-scoped)
- Creates MCP client that supports sampling
- Configurable LLM provider (ollama or anthropic) via environment:
- `RAG_EVAL_PROVIDER=ollama` (default) or `anthropic`
- `RAG_EVAL_OLLAMA_BASE_URL=http://localhost:11434`
- `RAG_EVAL_OLLAMA_MODEL=llama3.1:8b`
- `RAG_EVAL_ANTHROPIC_API_KEY=sk-...`
- `RAG_EVAL_ANTHROPIC_MODEL=claude-3-5-sonnet-20241022`
- Returns configured MCP client fixture
3. **`evaluation_llm`** (session-scoped)
- Separate LLM instance for evaluation (independent from MCP client)
- Same provider configuration as `mcp_sampling_client`
- Returns callable: `async def evaluate(prompt: str) -> str`
**Test Implementation Examples**
```python
# tests/rag_evaluation/test_retrieval_quality.py
async def test_retrieval_recall(nc_client, nfcorpus_test_data):
    """Test that semantic search retrieves documents containing the answer."""
    for test_case in nfcorpus_test_data:
        # Perform semantic search (retrieval only, no generation)
        results = await nc_client.notes.semantic_search(
            query=test_case.query,
            limit=10
        )
        retrieved_doc_ids = {r.document_id for r in results}
        expected_doc_ids = set(test_case.expected_document_ids)
        # Context Recall: Are expected documents in top-k results?
        recall = len(expected_doc_ids & retrieved_doc_ids) / len(expected_doc_ids)
        assert recall >= 0.8, f"Recall {recall} below threshold for query: {test_case.query}"


# tests/rag_evaluation/test_generation_quality.py
async def test_answer_correctness(mcp_sampling_client, evaluation_llm, nfcorpus_test_data):
    """Test that RAG system generates factually correct answers."""
    for test_case in nfcorpus_test_data:
        # Execute full RAG pipeline (retrieval + generation)
        result = await mcp_sampling_client.call_tool(
            "nc_semantic_search_answer",
            arguments={"query": test_case.query, "limit": 5}
        )
        rag_answer = result["generated_answer"]
        # LLM-as-judge evaluation
        evaluation_prompt = f"""Compare these two answers and respond with only TRUE or FALSE.
Question: {test_case.query}
Generated Answer: {rag_answer}
Ground Truth Answer: {test_case.ground_truth}
Are these answers semantically equivalent (do they convey the same factual information)?
Respond with only: TRUE or FALSE"""
        evaluation_result = await evaluation_llm(evaluation_prompt)
        assert evaluation_result.strip().upper() == "TRUE", \
            f"Answer mismatch for query: {test_case.query}\nGot: {rag_answer}\nExpected: {test_case.ground_truth}"
```
**Dataset Integration**
The BeIR nfcorpus dataset structure:
- **corpus.jsonl**: 3,633 medical/biomedical documents (articles from PubMed)
- **queries.jsonl**: 3,237 queries (questions)
- **qrels/*.tsv**: Relevance judgments mapping query IDs to document IDs with scores (2=highly relevant, 1=somewhat relevant)
**Important**: The dataset provides relevance judgments (which documents answer which queries) but does NOT include ground truth answers. We must generate synthetic ground truth offline.
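The qrels layout described above can be read with the standard library alone. The following is a minimal sketch: it assumes the tab-separated BEIR layout with a `query-id`, `corpus-id`, `score` header row, and the sample rows are invented for illustration (they are not real nfcorpus judgments). The helper name `highly_relevant_docs` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Tiny inline sample in the assumed qrels layout (made-up rows).
SAMPLE_QRELS = (
    "query-id\tcorpus-id\tscore\n"
    "Q1\tD1\t2\n"
    "Q1\tD2\t1\n"
    "Q1\tD3\t2\n"
    "Q2\tD1\t1\n"
)


def highly_relevant_docs(qrels_tsv: str, min_score: int = 2) -> dict[str, list[str]]:
    """Group document IDs by query ID, keeping only judgments with score >= min_score."""
    reader = csv.DictReader(io.StringIO(qrels_tsv), delimiter="\t")
    by_query: dict[str, list[str]] = defaultdict(list)
    for row in reader:
        if int(row["score"]) >= min_score:
            by_query[row["query-id"]].append(row["corpus-id"])
    return dict(by_query)


relevant = highly_relevant_docs(SAMPLE_QRELS)
```

Queries with no judgment at or above the threshold (`Q2` in the sample) are simply absent from the result.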
**Selected Test Queries** (5 diverse candidates):
1. **PLAIN-2630**: "Alkylphenol Endocrine Disruptors and Allergies" (5 words, 21 highly relevant docs)
2. **PLAIN-2660**: "How Long to Detox From Fish Before Pregnancy?" (8 words, 20 highly relevant docs)
3. **PLAIN-2510**: "Coffee and Artery Function" (4 words, 16 highly relevant docs)
4. **PLAIN-2430**: "Preventing Brain Loss with B Vitamins?" (6 words, 15 highly relevant docs)
5. **PLAIN-2690**: "Chronic Headaches and Pork Tapeworms" (5 words, 14 highly relevant docs)
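One consequence of these counts is worth checking with simple arithmetic. If every highly relevant document counts as expected and each search result corresponds to a distinct document (both are assumptions about how the test is wired), recall at top-k cannot exceed k divided by the number of expected documents. The sketch below computes that ceiling for the five queries with `limit=10`; the function name `max_recall_at_k` is hypothetical.

```python
def max_recall_at_k(num_expected: int, k: int) -> float:
    """Upper bound on recall when at most k distinct documents are retrieved."""
    if num_expected <= 0:
        raise ValueError("num_expected must be positive")
    return min(k, num_expected) / num_expected


# Highly relevant document counts for the selected queries, as listed above.
highly_relevant_counts = {
    "PLAIN-2630": 21,
    "PLAIN-2660": 20,
    "PLAIN-2510": 16,
    "PLAIN-2430": 15,
    "PLAIN-2690": 14,
}

# Best achievable recall per query for a top-10 search.
ceilings = {qid: max_recall_at_k(n, k=10) for qid, n in highly_relevant_counts.items()}
```

Under those assumptions every ceiling falls below 0.8 (the best case is 10/14 for PLAIN-2690), so a recall threshold of 0.8 would be unreachable with a top-10 search. Raising the limit or restricting the expected set to a smaller subset would change the bound.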
**Ground Truth Generation** (offline, pre-test):
Ground truth answers will be generated offline using a script that:
1. Loads nfcorpus dataset
2. For each selected query, extracts top 3-5 highly relevant documents
3. Uses an LLM (ollama/anthropic) to synthesize a reference answer
4. Stores ground truth in `tests/rag_evaluation/fixtures/ground_truth.json`
```python
# tools/generate_rag_ground_truth.py
from tests.rag_evaluation.llm_providers import LLMProvider


async def generate_ground_truth(query: str, relevant_docs: list[dict], llm: LLMProvider) -> str:
    """Generate synthetic ground truth answer from highly relevant documents."""
    # Use at most the first five highly relevant documents as context
    context = "\n\n".join([
        f"Document {i+1}:\nTitle: {doc['title']}\n{doc['text']}"
        for i, doc in enumerate(relevant_docs[:5])
    ])
    prompt = f"""Based on the following documents, provide a comprehensive answer to this question:
Question: {query}
{context}
Provide a factual, well-structured answer that synthesizes information from the documents.
Focus on accuracy and completeness."""
    return await llm.generate(prompt, max_tokens=500)
```
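The generated answers end up as records in `ground_truth.json`. The sketch below shows one plausible record shape using the field names that the test fixtures read (`query_id`, `query_text`, `ground_truth_answer`, `expected_document_ids`, `highly_relevant_count`). The helper `build_record`, the placeholder answer, the document IDs, and the assumption that `highly_relevant_count` equals the number of expected IDs are all illustrative.

```python
import json
import tempfile
from pathlib import Path


def build_record(query_id: str, query_text: str, answer: str,
                 expected_document_ids: list[str]) -> dict:
    """Assemble one ground-truth record (hypothetical helper)."""
    return {
        "query_id": query_id,
        "query_text": query_text,
        "ground_truth_answer": answer,
        "expected_document_ids": list(expected_document_ids),
        # Assumption: the count mirrors the number of expected documents.
        "highly_relevant_count": len(expected_document_ids),
    }


record = build_record(
    "PLAIN-2510",
    "Coffee and Artery Function",
    "placeholder answer text",   # a real run would store the LLM output here
    ["DOC-1", "DOC-2"],          # made-up document IDs
)

# Round-trip through a temporary file to show the on-disk format is plain JSON.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ground_truth.json"
    path.write_text(json.dumps([record], indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))
```

Storing a list of flat records keeps the file easy to inspect and diff by hand.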
**Dataset Loading at Test Runtime** (in `nfcorpus_test_data` fixture):
1. Download nfcorpus dataset (cached in pytest temp directory)
2. Load corpus, queries, and qrels (relevance judgments)
3. Load pre-generated ground truth from `tests/rag_evaluation/fixtures/ground_truth.json`
4. Upload all corpus documents as Nextcloud notes
5. Trigger vector sync to index documents
6. Wait for indexing completion
7. Return test cases with query, ground truth, and expected doc IDs
**LLM Provider Abstraction**
```python
# tests/rag_evaluation/llm_providers.py
from typing import Protocol

import anthropic


class LLMProvider(Protocol):
    async def generate(self, prompt: str, max_tokens: int = 100) -> str: ...


class OllamaProvider:
    def __init__(self, base_url: str, model: str):
        self.base_url = base_url
        self.model = model

    async def generate(self, prompt: str, max_tokens: int = 100) -> str:
        # Use httpx to call Ollama API
        ...


class AnthropicProvider:
    def __init__(self, api_key: str, model: str):
        self.client = anthropic.AsyncAnthropic(api_key=api_key)
        self.model = model

    async def generate(self, prompt: str, max_tokens: int = 100) -> str:
        message = await self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text
```
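Because `LLMProvider` is a `Protocol`, any object with a matching `generate` coroutine satisfies it without inheritance. The sketch below uses that to exercise judge logic offline with a canned reply. `StaticProvider` and `judge` are hypothetical names introduced for illustration, and the prompt is abbreviated; neither is part of the proposed framework.

```python
import asyncio
from typing import Protocol


class LLMProvider(Protocol):
    async def generate(self, prompt: str, max_tokens: int = 100) -> str: ...


class StaticProvider:
    """Hypothetical test double: returns a fixed reply and records the prompts it saw."""

    def __init__(self, reply: str):
        self.reply = reply
        self.prompts: list[str] = []

    async def generate(self, prompt: str, max_tokens: int = 100) -> str:
        self.prompts.append(prompt)
        return self.reply


async def judge(llm: LLMProvider, question: str, answer: str, ground_truth: str) -> bool:
    """Ask the provider for a TRUE/FALSE verdict and convert it to a boolean."""
    prompt = (
        "Compare these two answers and respond with only TRUE or FALSE.\n"
        f"Question: {question}\n"
        f"Generated Answer: {answer}\n"
        f"Ground Truth Answer: {ground_truth}"
    )
    reply = await llm.generate(prompt, max_tokens=10)
    return reply.strip().upper() == "TRUE"


stub = StaticProvider("TRUE")
verdict = asyncio.run(judge(stub, "What is 2+2?", "4", "four"))
```

A stub like this only checks the plumbing around the judge; it says nothing about how well a real model judges semantic equivalence.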
### Consequences
**Positive:**
* **Actionable debugging**: Separate retrieval/generation tests pinpoint failure location
* **Industry-standard metrics**: Context Recall and Answer Correctness are recognized RAG evaluation metrics
* **Simple initial implementation**: Binary LLM evaluation (true/false) is straightforward to implement and interpret
* **Extensible framework**: Easy to add more metrics (faithfulness, relevance) later
* **Standardized benchmark**: nfcorpus provides objective comparison against published RAG systems
* **Hybrid evaluation**: Combines efficiency (heuristics for retrieval) with quality (LLM-as-judge for generation)
* **Provider flexibility**: Supports both local (Ollama) and cloud (Anthropic) LLM evaluation
**Negative:**
* **Medical domain bias**: nfcorpus is medical/biomedical content, may not represent production use cases (personal notes, calendar events, etc.)
* **Manual test execution**: Tests require external LLM access and are not integrated into CI pipeline
* **Limited initial coverage**: Starting with only 5 queries provides limited statistical confidence
* **Evaluation cost**: LLM-as-judge for generation evaluation incurs API costs (Anthropic) or requires local inference (Ollama)
* **Single metric per component**: Initial scope tests only one metric per component, missing other important quality dimensions
* **Synthetic ground truth**: Ground truth answers are LLM-generated, not human-validated, which may introduce evaluation bias
* **Large corpus upload**: Uploading 3,633 documents at test runtime may be slow; caching strategy needed
**Future Work:**
* Expand to 50-100 queries for statistical significance
* Add custom test dataset with production-representative documents (meeting notes, task lists, etc.)
* Implement additional metrics (faithfulness, context relevance, answer relevance)
* Create automated benchmarking dashboard to track metric trends
* Test multi-hop reasoning (synthesis questions requiring multiple documents)
* Evaluate out-of-scope handling ("I don't know" responses)
+2
@@ -102,7 +102,9 @@ module-root = ""
[dependency-groups]
dev = [
    "anthropic>=0.42.0", # For RAG evaluation with Anthropic LLMs
    "commitizen>=4.8.2",
    "datasets>=3.3.0", # For BeIR nfcorpus dataset loading
    "ipython>=9.2.0",
    "playwright>=1.49.1",
    "pytest>=8.3.5",
+277
@@ -0,0 +1,277 @@
# RAG Evaluation Tests
This directory contains tests for evaluating the Retrieval-Augmented Generation (RAG) system in the Nextcloud MCP server, specifically the `nc_semantic_search_answer` tool.
## Architecture
The RAG system has two components that are tested independently:
1. **Retrieval** - Vector sync/embedding pipeline (indexed Nextcloud documents → vector database)
2. **Generation** - MCP client LLM synthesis (retrieved context → natural language answer)
See [ADR-013](../../docs/ADR-013-rag-evaluation.md) for full architectural details.
## Test Structure
```
tests/rag_evaluation/
├── README.md # This file
├── conftest.py # Pytest fixtures
├── llm_providers.py # LLM provider abstraction (Ollama/Anthropic)
├── fixtures/
│ └── ground_truth.json # Pre-generated reference answers
├── test_retrieval_quality.py # Retrieval evaluation (Context Recall)
└── test_generation_quality.py # Generation evaluation (Answer Correctness)
```
## Metrics
### Retrieval Evaluation
- **Metric**: Context Recall
- **Method**: Heuristic - Check if ground-truth document IDs appear in top-k results
- **Target**: ≥80% recall
### Generation Evaluation
- **Metric**: Answer Correctness
- **Method**: LLM-as-judge - Compare RAG answer vs ground truth (binary true/false)
- **Evaluation**: External LLM evaluates semantic equivalence
## Dataset
**BeIR/nfcorpus** - Medical/biomedical corpus with ~3,600 documents
**Test Queries** (5 selected):
1. PLAIN-2630: "Alkylphenol Endocrine Disruptors and Allergies" (21 relevant docs)
2. PLAIN-2660: "How Long to Detox From Fish Before Pregnancy?" (20 relevant docs)
3. PLAIN-2510: "Coffee and Artery Function" (16 relevant docs)
4. PLAIN-2430: "Preventing Brain Loss with B Vitamins?" (15 relevant docs)
5. PLAIN-2690: "Chronic Headaches and Pork Tapeworms" (14 relevant docs)
## Setup
### 1. Install Dependencies
```bash
uv sync --group dev
```
This installs:
- `anthropic>=0.42.0` - For Anthropic LLM evaluation
- `click>=8.1.8` - For CLI interface
- `datasets>=3.3.0` - For BeIR nfcorpus dataset loading
### 2. Configure LLM Provider
Set environment variables for your LLM provider:
**Option A: Ollama (default, local/remote)**
```bash
export RAG_EVAL_PROVIDER=ollama
export OLLAMA_HOST=https://ollama.example.com # or RAG_EVAL_OLLAMA_BASE_URL
export RAG_EVAL_OLLAMA_MODEL=llama3.2:1b
```
**Option B: Anthropic (cloud)**
```bash
export RAG_EVAL_PROVIDER=anthropic
export RAG_EVAL_ANTHROPIC_API_KEY=sk-ant-...
export RAG_EVAL_ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
```
### 3. One-Time Setup: Generate Ground Truth
Generate synthetic reference answers for the 5 test queries:
```bash
uv run python tools/rag_eval_cli.py generate
```
**What this does:**
- Downloads nfcorpus dataset to `tests/rag_evaluation/fixtures/nfcorpus/` (cached locally)
- For each of the 5 selected queries, extracts highly relevant documents
- Uses configured LLM to synthesize a reference answer
- Saves to `tests/rag_evaluation/fixtures/ground_truth.json`
**Optional flags:**
- `--provider ollama|anthropic` - Override LLM provider
- `--model MODEL_NAME` - Override model name
- `--force-download` - Re-download nfcorpus dataset
### 4. One-Time Setup: Upload Corpus to Nextcloud
Upload all 3,633 nfcorpus documents as Nextcloud notes:
```bash
uv run python tools/rag_eval_cli.py upload \
--nextcloud-url http://localhost:8000 \
--username admin \
--password admin
```
**What this does:**
- Downloads nfcorpus dataset (if not already cached)
- Uploads all documents as notes in Nextcloud
- Saves document ID → note ID mapping to `tests/rag_evaluation/fixtures/note_mapping.json`
**Optional flags:**
- `--category CATEGORY` - Custom category for notes (default: `nfcorpus_rag_eval`)
- `--force-download` - Re-download nfcorpus dataset
**Important:** This step requires:
- A running Nextcloud instance with vector sync enabled
- Notes app installed
- Valid credentials
**Duration:** ~10-15 minutes to upload 3,633 documents
## Running Tests
### Run All RAG Evaluation Tests
```bash
uv run pytest tests/rag_evaluation/ -v
```
### Run Specific Test Suites
**Retrieval Quality Only:**
```bash
uv run pytest tests/rag_evaluation/test_retrieval_quality.py -v
```
**Generation Quality Only:**
```bash
uv run pytest tests/rag_evaluation/test_generation_quality.py -v
```
### Run Individual Tests
```bash
uv run pytest tests/rag_evaluation/test_retrieval_quality.py::test_retrieval_context_recall -v
uv run pytest tests/rag_evaluation/test_generation_quality.py::test_answer_correctness -v
```
## Test Execution Flow
**Prerequisites** (one-time setup):
1. Generated ground truth (`tools/rag_eval_cli.py generate`)
2. Uploaded corpus to Nextcloud (`tools/rag_eval_cli.py upload`)
### Retrieval Quality Tests
1. **Setup** (`nfcorpus_test_data` fixture):
- Loads pre-generated ground truth from `fixtures/ground_truth.json`
- Loads note mapping from `fixtures/note_mapping.json`
- Returns test cases with expected note IDs
2. **Test** (`test_retrieval_context_recall`):
- For each query: Perform semantic search (top-10)
- Extract retrieved note IDs
- Calculate Context Recall = (expected ∩ retrieved) / expected
- Assert recall ≥ 80%
3. **Cleanup**:
- None required (notes persist in Nextcloud for reuse)
### Generation Quality Tests
1. **Setup**:
- Same as retrieval tests (reuses `nfcorpus_test_data` fixture)
- Creates evaluation LLM provider
2. **Test** (`test_answer_correctness`):
- For each query: Call `nc_semantic_search_answer` MCP tool
- Extract generated answer
- Use LLM-as-judge to compare vs ground truth
- Assert semantic equivalence (TRUE/FALSE)
3. **Cleanup**:
- LLM provider closed
## Expected Test Duration
**One-time setup:**
- **Generate ground truth**: ~5-10 minutes (5 queries with LLM generation)
- **Upload corpus**: ~10-15 minutes (3,633 documents)
- **Total setup**: ~15-25 minutes
**Test execution** (after setup):
- **Retrieval tests**: ~1-2 minutes (5 queries, no upload/cleanup)
- **Generation tests**: ~5-10 minutes (RAG generation + LLM evaluation)
- **Total per run**: ~6-12 minutes
**Note**: These are NOT smoke tests and are NOT run in CI.
## Limitations & Future Work
**Current Limitations:**
- Only 5 test queries (limited statistical confidence)
- Medical domain bias (may not represent production use cases)
- Synthetic ground truth (LLM-generated, not human-validated)
- Manual test execution (requires external LLM access)
**Future Enhancements:**
- Expand to 50-100 queries for statistical significance
- Add custom test dataset with production-representative documents
- Implement additional metrics (faithfulness, context relevance, answer relevance)
- Create automated benchmarking dashboard
- Test multi-hop reasoning (synthesis questions)
- Evaluate out-of-scope handling ("I don't know" responses)
## Troubleshooting
### Tests Fail with "Ground truth file not found"
Run the generate command first:
```bash
uv run python tools/rag_eval_cli.py generate
```
### Tests Fail with "Note mapping file not found"
Run the upload command first:
```bash
uv run python tools/rag_eval_cli.py upload --nextcloud-url http://localhost:8000 --username admin --password admin
```
### Tests Fail with "MCP sampling client not yet implemented"
The `mcp_sampling_client` fixture is a placeholder. You need to implement MCP client creation with sampling support. See the TODO in `conftest.py`.
### Upload Command Fails
Common issues:
1. **Nextcloud not running**: Ensure Nextcloud is accessible at the URL
2. **Invalid credentials**: Verify username/password
3. **Notes app not installed**: Install Notes app in Nextcloud
4. **Network timeout**: Increase timeout in CLI (currently 60s)
### LLM Timeout
If ground truth generation times out:
1. Increase timeout in `llm_providers.py` (currently 10 min)
2. Use a faster model: `--model llama3.2:1b`
3. Check Ollama/Anthropic service availability
### Dataset Download Fails
The nfcorpus dataset is downloaded automatically. If download fails:
1. Check internet connection
2. Manually download from: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip
3. Extract to `tests/rag_evaluation/fixtures/nfcorpus/`
4. Or use HuggingFace datasets cache: `~/.cache/huggingface/datasets/BeIR___nfcorpus/`
### Vector Sync Not Indexing Documents
After uploading, vector sync must index the documents:
1. Check vector sync is enabled in Nextcloud
2. Trigger manual sync if needed
3. Wait for background job to process all documents
4. Verify in Qdrant that vectors exist for uploaded notes
## References
- [ADR-013: RAG Evaluation Testing Framework](../../docs/ADR-013-rag-evaluation.md)
- [ADR-008: MCP Sampling for Semantic Search](../../docs/ADR-008-mcp-sampling-for-semantic-search.md)
- [BeIR Benchmark](https://github.com/beir-cellar/beir)
- [NFCorpus Dataset](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/)
+1
@@ -0,0 +1 @@
"""RAG evaluation tests for the Nextcloud MCP semantic search system."""
+145
@@ -0,0 +1,145 @@
"""Pytest fixtures for RAG evaluation tests.

IMPORTANT: Before running these tests, you must:
1. Generate ground truth: uv run python tools/rag_eval_cli.py generate
2. Upload corpus: uv run python tools/rag_eval_cli.py upload --nextcloud-url http://localhost:8000 --username admin --password admin

This ensures that the ground truth and note mappings are available.
"""

import json
from pathlib import Path
from typing import Any

import pytest

from tests.rag_evaluation.llm_providers import create_llm_provider

# Paths
FIXTURES_DIR = Path(__file__).parent / "fixtures"
GROUND_TRUTH_FILE = FIXTURES_DIR / "ground_truth.json"
NOTE_MAPPING_FILE = FIXTURES_DIR / "note_mapping.json"


@pytest.fixture(scope="session")
def ground_truth_data() -> list[dict[str, Any]]:
    """Load pre-generated ground truth data.

    Returns:
        List of test cases with query, ground truth answer, and expected doc IDs

    Raises:
        FileNotFoundError: If ground_truth.json doesn't exist
    """
    if not GROUND_TRUTH_FILE.exists():
        raise FileNotFoundError(
            f"Ground truth file not found: {GROUND_TRUTH_FILE}\n"
            "Run: uv run python tools/rag_eval_cli.py generate"
        )
    with open(GROUND_TRUTH_FILE) as f:
        return json.load(f)


@pytest.fixture(scope="session")
def note_mapping() -> dict[str, int]:
    """Load document ID → note ID mapping.

    Returns:
        Dict mapping nfcorpus document ID to Nextcloud note ID

    Raises:
        FileNotFoundError: If note_mapping.json doesn't exist
    """
    if not NOTE_MAPPING_FILE.exists():
        raise FileNotFoundError(
            f"Note mapping file not found: {NOTE_MAPPING_FILE}\n"
            "Run: uv run python tools/rag_eval_cli.py upload --nextcloud-url ... --username ... --password ..."
        )
    with open(NOTE_MAPPING_FILE) as f:
        return json.load(f)


@pytest.fixture(scope="session")
def nfcorpus_test_data(
    ground_truth_data: list[dict[str, Any]],
    note_mapping: dict[str, int],
):
    """Prepare nfcorpus test data for evaluation.

    This fixture combines ground truth answers with note mappings to create
    test cases ready for retrieval and generation quality tests.

    Args:
        ground_truth_data: Pre-generated ground truth answers
        note_mapping: Document ID → note ID mapping

    Returns:
        List of test cases with query, ground truth, expected doc IDs, and note IDs
    """
    test_cases = []
    for gt in ground_truth_data:
        # Map expected document IDs to note IDs
        expected_note_ids = [
            note_mapping.get(doc_id)
            for doc_id in gt["expected_document_ids"]
            if doc_id in note_mapping
        ]
        # Filter out None values (docs that weren't uploaded)
        expected_note_ids = [nid for nid in expected_note_ids if nid is not None]
        test_cases.append(
            {
                "query_id": gt["query_id"],
                "query_text": gt["query_text"],
                "ground_truth_answer": gt["ground_truth_answer"],
                "expected_document_ids": gt["expected_document_ids"],
                "expected_note_ids": expected_note_ids,
                "highly_relevant_count": gt["highly_relevant_count"],
            }
        )
    return test_cases


@pytest.fixture(scope="session")
async def evaluation_llm():
    """Create LLM provider for evaluation (separate from MCP client).

    Environment variables:
        RAG_EVAL_PROVIDER: Provider type (ollama or anthropic)
        RAG_EVAL_OLLAMA_BASE_URL: Ollama base URL (or OLLAMA_HOST)
        RAG_EVAL_OLLAMA_MODEL: Ollama model name
        RAG_EVAL_ANTHROPIC_API_KEY: Anthropic API key
        RAG_EVAL_ANTHROPIC_MODEL: Anthropic model name

    Returns:
        LLM provider instance (OllamaProvider or AnthropicProvider)
    """
    llm = create_llm_provider()
    yield llm
    await llm.close()


@pytest.fixture(scope="session")
async def mcp_sampling_client():
    """Create MCP client that supports sampling for RAG generation.

    This fixture creates an MCP client configured to support sampling,
    which is required for testing the nc_semantic_search_answer tool.

    TODO: Implement MCP client with sampling support
    For now, this is a placeholder.

    Returns:
        MCP client instance with sampling enabled
    """
    # TODO: Implement MCP client creation with sampling support
    # This will require:
    # 1. Creating an MCP client configured for sampling
    # 2. Authenticating with Nextcloud
    # 3. Ensuring sampling is enabled
    pytest.skip("MCP sampling client not yet implemented")
+145
@@ -0,0 +1,145 @@
"""LLM provider abstraction for RAG evaluation.

Supports Ollama (local) and Anthropic (cloud) providers for both ground truth
generation and evaluation.
"""

import os
from typing import Protocol

import httpx
from anthropic import AsyncAnthropic


class LLMProvider(Protocol):
    """Protocol for LLM providers."""

    async def generate(self, prompt: str, max_tokens: int = 500) -> str:
        """Generate text from a prompt.

        Args:
            prompt: The prompt to generate from
            max_tokens: Maximum tokens to generate

        Returns:
            Generated text
        """
        ...


class OllamaProvider:
    """Ollama provider for local LLM inference."""

    def __init__(self, base_url: str, model: str):
        """Initialize Ollama provider.

        Args:
            base_url: Ollama API base URL (e.g., http://localhost:11434)
            model: Model name (e.g., llama3.1:8b)
        """
        self.base_url = base_url.rstrip("/")
        self.model = model
        self.client = httpx.AsyncClient(timeout=600.0)  # 10 min timeout for generation

    async def generate(self, prompt: str, max_tokens: int = 500) -> str:
        """Generate text using Ollama API."""
        response = await self.client.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model,
                "prompt": prompt,
                "stream": False,
                "options": {
                    "num_predict": max_tokens,
                    "temperature": 0.7,
                },
            },
        )
        response.raise_for_status()
        data = response.json()
        return data["response"]

    async def close(self):
        """Close the HTTP client."""
        await self.client.aclose()


class AnthropicProvider:
    """Anthropic provider for cloud LLM inference."""

    def __init__(self, api_key: str, model: str):
        """Initialize Anthropic provider.

        Args:
            api_key: Anthropic API key
            model: Model name (e.g., claude-3-5-sonnet-20241022)
        """
        self.client = AsyncAnthropic(api_key=api_key)
        self.model = model

    async def generate(self, prompt: str, max_tokens: int = 500) -> str:
        """Generate text using Anthropic API."""
        message = await self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            temperature=0.7,
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text

    async def close(self):
        """Close the client (no-op for Anthropic)."""
        pass


def create_llm_provider(
    provider: str | None = None,
    ollama_base_url: str | None = None,
    ollama_model: str | None = None,
    anthropic_api_key: str | None = None,
    anthropic_model: str | None = None,
) -> LLMProvider:
    """Create an LLM provider from environment variables or arguments.

    Args:
        provider: Provider type ('ollama' or 'anthropic'). Defaults to RAG_EVAL_PROVIDER env var or 'ollama'
        ollama_base_url: Ollama base URL. Defaults to RAG_EVAL_OLLAMA_BASE_URL, then OLLAMA_HOST, then 'http://localhost:11434'
        ollama_model: Ollama model. Defaults to RAG_EVAL_OLLAMA_MODEL or 'llama3.2:1b'
        anthropic_api_key: Anthropic API key. Defaults to RAG_EVAL_ANTHROPIC_API_KEY env var
        anthropic_model: Anthropic model. Defaults to RAG_EVAL_ANTHROPIC_MODEL or 'claude-3-5-sonnet-20241022'

    Returns:
        LLMProvider instance

    Raises:
        ValueError: If provider is invalid or required credentials are missing
    """
    # Get provider from args or env
    provider = provider or os.environ.get("RAG_EVAL_PROVIDER", "ollama")
    if provider == "ollama":
        # Try RAG_EVAL_OLLAMA_BASE_URL, then OLLAMA_HOST, then default
        base_url = (
            ollama_base_url
            or os.environ.get("RAG_EVAL_OLLAMA_BASE_URL")
            or os.environ.get("OLLAMA_HOST")
            or "http://localhost:11434"
        )
        model = ollama_model or os.environ.get("RAG_EVAL_OLLAMA_MODEL", "llama3.2:1b")
        return OllamaProvider(base_url=base_url, model=model)
    elif provider == "anthropic":
        api_key = anthropic_api_key or os.environ.get("RAG_EVAL_ANTHROPIC_API_KEY")
        if not api_key:
            raise ValueError(
                "Anthropic API key required. Set RAG_EVAL_ANTHROPIC_API_KEY environment variable."
            )
        model = anthropic_model or os.environ.get(
            "RAG_EVAL_ANTHROPIC_MODEL", "claude-3-5-sonnet-20241022"
        )
        return AnthropicProvider(api_key=api_key, model=model)
    else:
        raise ValueError(
            f"Invalid provider: {provider}. Must be 'ollama' or 'anthropic'."
        )
@@ -0,0 +1,139 @@
"""Tests for RAG generation quality (Answer Correctness metric).
These tests evaluate whether the MCP client LLM generates factually correct
answers from retrieved context using the nc_semantic_search_answer tool.
Metric: Answer Correctness
- Measures: Is the generated answer factually correct?
- Method: LLM-as-judge - Compare RAG answer vs ground truth (binary true/false)
- Evaluation: External LLM evaluates semantic equivalence
"""
import pytest


@pytest.mark.integration
async def test_answer_correctness(
    mcp_sampling_client,
    evaluation_llm,
    nfcorpus_test_data,
):
    """Test that RAG system generates factually correct answers.

    For each test query:
    1. Execute full RAG pipeline via nc_semantic_search_answer MCP tool
    2. Extract generated answer from RAG response
    3. Use LLM-as-judge to compare against ground truth (binary true/false)
    4. Record whether the answer is semantically equivalent to ground truth

    All queries are evaluated before asserting, so the summary covers every
    query even when some answers are incorrect.

    This tests the quality of the generation component (MCP client LLM).
    """
    results_summary = []
    failures = []

    for test_case in nfcorpus_test_data:
        query = test_case["query_text"]
        ground_truth = test_case["ground_truth_answer"]

        print(f"\n{'=' * 80}")
        print(f"Query: {query}")

        # Execute full RAG pipeline
        print("Executing RAG pipeline...")
        rag_result = await mcp_sampling_client.call_tool(
            "nc_semantic_search_answer",
            arguments={"query": query, "limit": 5},
        )
        rag_answer = rag_result["generated_answer"]

        print(f"RAG Answer preview: {rag_answer[:200]}...")
        print(f"Ground Truth preview: {ground_truth[:200]}...")

        # LLM-as-judge evaluation
        evaluation_prompt = f"""Compare these two answers and respond with only TRUE or FALSE.
Question: {query}
Generated Answer: {rag_answer}
Ground Truth Answer: {ground_truth}
Are these answers semantically equivalent (do they convey the same factual information)?
Respond with only: TRUE or FALSE"""

        print("Evaluating answer correctness...")
        evaluation_result = await evaluation_llm.generate(
            evaluation_prompt,
            max_tokens=10,
        )
        verdict = evaluation_result.strip()
        # Accept "TRUE", "True." etc.; anything else counts as incorrect
        is_correct = verdict.upper().startswith("TRUE")

        result = {
            "query_id": test_case["query_id"],
            "query": query,
            "rag_answer_length": len(rag_answer),
            "ground_truth_length": len(ground_truth),
            "is_correct": is_correct,
            "evaluation_result": verdict,
        }
        results_summary.append(result)

        print(f" Evaluation: {verdict}")
        print(f" Status: {'✓ CORRECT' if is_correct else '✗ INCORRECT'}")

        # Defer the assertion so one bad answer does not hide the other results
        if not is_correct:
            failures.append(
                f"Answer mismatch for query: {query}\n\n"
                f"Generated Answer:\n{rag_answer}\n\n"
                f"Ground Truth:\n{ground_truth}\n\n"
                f"Evaluation: {verdict}"
            )

    assert results_summary, "No test cases found in nfcorpus_test_data"

    # Print summary
    correct = sum(r["is_correct"] for r in results_summary)
    print(f"\n{'=' * 80}")
    print("Answer Correctness Summary:")
    print(f" Total queries: {len(results_summary)}")
    print(f" Correct: {correct}")
    print(f" Incorrect: {len(results_summary) - correct}")
    print(f" Accuracy: {correct / len(results_summary):.2%}")
    print(f"{'=' * 80}")

    # Assert answer correctness across all queries
    assert not failures, "\n\n".join(failures)
@pytest.mark.integration
async def test_answer_contains_sources(mcp_sampling_client, nfcorpus_test_data):
    """Test that RAG answers include source citations.

    This is a basic quality check - we verify that the nc_semantic_search_answer
    tool returns both a generated answer and source documents.
    """
    for test_case in nfcorpus_test_data:
        query = test_case["query_text"]

        # Execute RAG pipeline
        rag_result = await mcp_sampling_client.call_tool(
            "nc_semantic_search_answer",
            arguments={"query": query, "limit": 5},
        )

        # Check response structure
        assert "generated_answer" in rag_result, "Response missing 'generated_answer'"
        assert "sources" in rag_result, "Response missing 'sources'"

        # Check sources are provided
        sources = rag_result["sources"]
        assert len(sources) > 0, f"No sources returned for query: {query}"

        # Check each source has required fields
        for i, source in enumerate(sources):
            assert "document_id" in source or "id" in source, (
                f"Source {i} missing document ID"
            )
            assert "excerpt" in source or "content" in source or "text" in source, (
                f"Source {i} missing content"
            )

        print(f"Query: {query}")
        print(f" Sources provided: {len(sources)}")
        print(" Status: ✓ PASS")
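The LLM-as-judge step in `test_answer_correctness` reduces the judge's reply to a boolean, and judge models do not always reply with a bare `TRUE` or `FALSE`. The following is a minimal sketch of a more tolerant parser. The helper name `parse_verdict` is hypothetical and not part of this commit; it assumes the judge's reply is a plain string.

```python
import re
from typing import Optional


def parse_verdict(raw: str) -> Optional[bool]:
    """Map a judge reply to True/False, or None when the reply is ambiguous.

    Hypothetical helper for illustration only.
    """
    # Collect every standalone TRUE/FALSE token, ignoring case and punctuation
    verdicts = set(re.findall(r"\b(TRUE|FALSE)\b", raw.upper()))
    if len(verdicts) != 1:
        # Neither token present, or both present: do not guess
        return None
    return verdicts.pop() == "TRUE"
```

Returning `None` for ambiguous replies lets a test distinguish "the judge said the answer is wrong" from "the judge's reply could not be read", which could be reported separately instead of counting as an incorrect answer.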
@@ -0,0 +1,143 @@
"""Tests for RAG retrieval quality (Context Recall metric).
These tests evaluate whether the vector sync/embedding pipeline successfully
retrieves documents containing the answer to a query.
Metric: Context Recall
- Measures: Did we retrieve documents containing the answer?
- Method: Heuristic - Check if ground-truth document IDs appear in top-k results
- Target: ≥80% recall (at least 80% of expected docs in top-10 results)
"""
import pytest


@pytest.mark.integration
async def test_retrieval_context_recall(nc_client, nfcorpus_test_data):
    """Test that semantic search retrieves documents containing the answer.

    For each test query:
    1. Perform semantic search (retrieval only, no generation)
    2. Extract retrieved document IDs from top-k results
    3. Calculate Context Recall: |retrieved ∩ expected| / |expected|
    4. Record whether recall meets threshold (≥80%)

    All queries are evaluated before asserting, so the summary covers every
    query even when some fall below the threshold.

    This tests the quality of the vector sync/embedding pipeline.
    """
    # Top-k documents to retrieve
    k = 10
    # Minimum acceptable recall
    min_recall = 0.8

    results_summary = []
    failures = []

    for test_case in nfcorpus_test_data:
        query = test_case["query_text"]
        expected_note_ids = set(test_case["expected_note_ids"])

        # Perform semantic search (retrieval only)
        search_results = await nc_client.notes.semantic_search(
            query=query,
            limit=k,
        )

        # Extract retrieved note IDs
        retrieved_note_ids = {result["id"] for result in search_results}

        # Calculate Context Recall (0.0 when no documents are expected)
        intersection = expected_note_ids & retrieved_note_ids
        recall = len(intersection) / len(expected_note_ids) if expected_note_ids else 0.0

        # Store results
        result = {
            "query_id": test_case["query_id"],
            "query": query,
            "expected_count": len(expected_note_ids),
            "retrieved_count": len(retrieved_note_ids),
            "intersection_count": len(intersection),
            "recall": recall,
            "passed": recall >= min_recall,
        }
        results_summary.append(result)

        # Print detailed result for this query
        print(f"\n{'=' * 80}")
        print(f"Query: {query}")
        print(f" Expected docs: {len(expected_note_ids)}")
        print(f" Retrieved (top-{k}): {len(retrieved_note_ids)}")
        print(f" Intersection: {len(intersection)}")
        print(f" Context Recall: {recall:.2%}")
        print(f" Status: {'✓ PASS' if result['passed'] else '✗ FAIL'}")

        # Defer the assertion so one failing query does not hide the others
        if not result["passed"]:
            failures.append(
                f"Context Recall {recall:.2%} below threshold {min_recall:.2%} "
                f"for query: {query}\n"
                f"Expected {len(expected_note_ids)} docs, found {len(intersection)} in top-{k}"
            )

    assert results_summary, "No test cases found in nfcorpus_test_data"

    # Print summary
    passed = sum(r["passed"] for r in results_summary)
    average_recall = sum(r["recall"] for r in results_summary) / len(results_summary)
    print(f"\n{'=' * 80}")
    print("Context Recall Summary:")
    print(f" Total queries: {len(results_summary)}")
    print(f" Passed: {passed}")
    print(f" Failed: {len(results_summary) - passed}")
    print(f" Average recall: {average_recall:.2%}")
    print(f"{'=' * 80}")

    # Assert recall meets threshold for every query
    assert not failures, "\n\n".join(failures)
@pytest.mark.integration
async def test_retrieval_top1_precision(nc_client, nfcorpus_test_data):
    """Test that the top-1 retrieved document is highly relevant.

    This is a stricter test than context recall - we verify that
    the single most relevant document (rank 1) is in the expected set.

    This tests whether the ranking is good, not just retrieval.
    """
    results_summary = []

    for test_case in nfcorpus_test_data:
        query = test_case["query_text"]
        expected_note_ids = set(test_case["expected_note_ids"])

        # Perform semantic search
        search_results = await nc_client.notes.semantic_search(
            query=query,
            limit=1,  # Only top-1
        )

        # Check if top result is in expected set (no results counts as not relevant)
        top_result_id = search_results[0]["id"] if search_results else None
        is_relevant = top_result_id in expected_note_ids

        result = {
            "query_id": test_case["query_id"],
            "query": query,
            "top_result_id": top_result_id,
            "is_relevant": is_relevant,
        }
        results_summary.append(result)

        print(f"\nQuery: {query}")
        print(f" Top-1 relevant: {'✓ YES' if is_relevant else '✗ NO'}")

        # This is informational - we don't assert here
        # Some queries may have multiple valid top results

    # Print summary
    relevant_count = sum(r["is_relevant"] for r in results_summary)
    precision_at_1 = relevant_count / len(results_summary)
    print(f"\n{'=' * 80}")
    print(f"Precision@1: {precision_at_1:.2%}")
    print(f" ({relevant_count}/{len(results_summary)} queries)")
    print(f"{'=' * 80}")
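The Context Recall heuristic used in `test_retrieval_context_recall` can be expressed as a small pure function, which makes the arithmetic easy to check without a running Nextcloud instance. This is a sketch only; the function name `context_recall` is hypothetical and not part of this commit.

```python
from typing import Hashable, Iterable


def context_recall(
    expected_ids: Iterable[Hashable], retrieved_ids: Iterable[Hashable]
) -> float:
    """Fraction of expected document IDs that appear among the retrieved IDs.

    Hypothetical helper for illustration only.
    """
    expected = set(expected_ids)
    if not expected:
        # Same convention as the test: no expected documents gives recall 0
        return 0.0
    return len(expected & set(retrieved_ids)) / len(expected)


# Four of the five expected documents were retrieved
print(context_recall([1, 2, 3, 4, 5], [1, 2, 3, 4, 9, 10]))
```

With five expected documents of which four are retrieved, recall is 4/5 = 0.8, which just meets the 80% threshold. A query with only two expected documents must retrieve both, since 1/2 = 0.5 falls below it.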