feat: implement RAG evaluation framework with CLI tooling
- Add ADR-013 documenting RAG evaluation architecture
- Implement two-part evaluation: Context Recall (retrieval) + Answer Correctness (generation)
- Create Click CLI for ground truth generation and corpus upload
- Add pytest fixtures and tests for retrieval/generation quality
- Use BeIR/nfcorpus dataset with 5 selected test queries
- Support Ollama and Anthropic LLM providers
- Generate synthetic ground truth answers offline
- Add comprehensive documentation in tests/rag_evaluation/README.md

The framework separates one-time setup (generate/upload) from test execution, making tests much faster (~6-12 min vs ~15-25 min per run). Tests are manual only (not in CI) and require external LLM access.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -13,3 +13,6 @@ docker-compose.override.yml
|
||||
# Generated by pytest used to login users
|
||||
.nextcloud_oauth_*.json
|
||||
.playwright-mcp/
|
||||
|
||||
# RAG Evaluation
|
||||
tests/rag_evaluation/fixtures/
|
||||
|
||||
@@ -0,0 +1,254 @@
|
||||
## ADR-013: RAG Evaluation Testing Framework
|
||||
|
||||
**Status:** Proposed
|
||||
|
||||
**Date:** 2025-11-15
|
||||
|
||||
### Context
|
||||
|
||||
The `nc_semantic_search_answer` tool implements a Retrieval-Augmented Generation (RAG) system where:
|
||||
1. **Retrieval**: Vector sync pipeline indexes Nextcloud documents (notes, calendar, contacts, etc.) into a vector database
|
||||
2. **Generation**: MCP client's LLM synthesizes answers from retrieved documents via MCP sampling (ADR-008)
|
||||
|
||||
We need a testing framework to evaluate RAG system performance and identify whether failures occur in retrieval (wrong documents found) or generation (poor answer quality). This framework must use industry-standard evaluation methodologies while remaining practical to implement and maintain.
|
||||
|
||||
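To establish a baseline, we will use the **BeIR/nfcorpus** dataset (medical/biomedical corpus), which provides 3,633 documents plus established queries and relevance judgments. It does not ship reference answers, so those are generated separately (see Dataset Integration).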
|
||||
|
||||
Homepage: https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/
|
||||
Download: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip
|
||||
|
||||
### Decision
|
||||
|
||||
We will implement a **two-part evaluation framework** that independently tests retrieval and generation quality using pytest fixtures.
|
||||
|
||||
#### In Scope
|
||||
|
||||
**1. Retrieval Evaluation**
|
||||
Tests the vector sync/embedding pipeline's ability to find relevant documents.
|
||||
|
||||
- **Metric: Context Recall** (Did we retrieve documents containing the answer?)
|
||||
- **Evaluation method**: Heuristic - Check if ground-truth document IDs appear in top-k retrieval results
|
||||
- **Test**: Query → Semantic search → Assert expected doc IDs present
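
The Context Recall heuristic above can be sketched as a small set computation. The helper names `context_recall` and `max_recall_at_k` are illustrative and not part of the test suite; the second helper is included only to show the ceiling that a top-k cutoff places on recall.

```python
def context_recall(expected_ids: set[str], retrieved_ids: set[str]) -> float:
    """Fraction of expected documents that appear in the retrieved set."""
    if not expected_ids:
        raise ValueError("expected_ids must not be empty")
    return len(expected_ids & retrieved_ids) / len(expected_ids)


def max_recall_at_k(num_expected: int, k: int) -> float:
    """Upper bound on recall when only k results are returned."""
    if num_expected <= 0 or k < 0:
        raise ValueError("num_expected must be positive and k non-negative")
    return min(k, num_expected) / num_expected


# Hypothetical document IDs, for illustration only.
expected = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d3", "d9"}
print(context_recall(expected, retrieved))     # 0.5
print(max_recall_at_k(num_expected=20, k=10))  # 0.5
```

If the expected set contains every score=2 document for a query (14 to 21 for the selected queries) and only 10 results are requested, the ceiling would sit between 10/21 ≈ 0.48 and 10/14 ≈ 0.71. A 0.8 threshold would then be unreachable, so the threshold, the limit, or the size of the expected set may need adjusting.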
|
||||
|
||||
**2. Generation Evaluation**
|
||||
Tests the MCP client LLM's ability to synthesize correct answers from retrieved context.
|
||||
|
||||
- **Metric: Answer Correctness** (Is the generated answer factually correct?)
|
||||
- **Evaluation method**: LLM-as-judge - Compare RAG answer against ground-truth answer
|
||||
- **Test**: Query → `nc_semantic_search_answer` → LLM evaluates answer vs. ground truth (binary true/false)
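
A binary LLM-as-judge check depends on turning the judge's free-text reply into a boolean. The following is a minimal sketch of one way to do that, assuming the judge is asked to reply with TRUE or FALSE; `parse_verdict` is a hypothetical helper, not something defined in this commit. It tolerates trailing punctuation and surrounding words, and refuses to guess when the reply is ambiguous.

```python
import re


def parse_verdict(raw: str) -> bool:
    """Map a judge reply to True/False; raise if the reply is ambiguous."""
    # Collect every standalone TRUE/FALSE token, ignoring case.
    verdicts = set(re.findall(r"\b(TRUE|FALSE)\b", raw.upper()))
    if verdicts == {"TRUE"}:
        return True
    if verdicts == {"FALSE"}:
        return False
    # Either no verdict or both verdicts were present.
    raise ValueError(f"Unparseable judge reply: {raw!r}")
```

Raising on ambiguous output keeps a malformed judge reply from being silently counted as a wrong answer, which would otherwise blur generation failures with evaluation failures.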
|
||||
|
||||
#### Out of Scope (Initial Implementation)
|
||||
|
||||
- **Context Relevance/Precision**: Measuring irrelevant documents in retrieval results
|
||||
- **Faithfulness/Groundedness**: Detecting hallucinations not supported by retrieved context
|
||||
- **Answer Relevance**: Whether answer addresses the specific question asked
|
||||
- **Out-of-Scope Handling**: Testing "I don't know" responses when answer isn't in context
|
||||
- **Continuous benchmarking**: Automated tracking of metric trends over time
|
||||
- **Custom domain datasets**: Production-specific test data (medical corpus used initially)
|
||||
|
||||
These remain valuable for future iterations but add complexity beyond our initial goals.
|
||||
|
||||
#### Implementation
|
||||
|
||||
**Test Structure**
|
||||
|
||||
Location: `tests/rag_evaluation/`
|
||||
- `test_retrieval_quality.py` - Retrieval evaluation tests
|
||||
- `test_generation_quality.py` - Generation evaluation tests
|
||||
- `conftest.py` - Fixtures for test data, MCP clients, and evaluation LLMs
|
||||
|
||||
**Required Pytest Fixtures**
|
||||
|
||||
1. **`nfcorpus_test_data`** (session-scoped)
   - Downloads/caches BeIR nfcorpus dataset at runtime
   - Loads 5 pre-selected test queries with:
     - Query text
     - Pre-generated ground-truth answer (from `tests/rag_evaluation/fixtures/ground_truth.json`)
     - Expected document IDs (from qrels with score=2)
   - Uploads all corpus documents as notes in test Nextcloud instance
   - Triggers vector sync to index documents
   - Waits for indexing completion
   - Returns test case data structure

2. **`mcp_sampling_client`** (session-scoped)
   - Creates MCP client that supports sampling
   - Configurable LLM provider (ollama or anthropic) via environment:
     - `RAG_EVAL_PROVIDER=ollama` (default) or `anthropic`
     - `RAG_EVAL_OLLAMA_BASE_URL=http://localhost:11434`
     - `RAG_EVAL_OLLAMA_MODEL=llama3.1:8b`
     - `RAG_EVAL_ANTHROPIC_API_KEY=sk-...`
     - `RAG_EVAL_ANTHROPIC_MODEL=claude-3-5-sonnet-20241022`
   - Returns configured MCP client fixture

3. **`evaluation_llm`** (session-scoped)
   - Separate LLM instance for evaluation (independent from MCP client)
   - Same provider configuration as `mcp_sampling_client`
   - Returns callable: `async def evaluate(prompt: str) -> str`
|
||||
|
||||
**Test Implementation Examples**
|
||||
|
||||
```python
# tests/rag_evaluation/test_retrieval_quality.py
async def test_retrieval_recall(nc_client, nfcorpus_test_data):
    """Test that semantic search retrieves documents containing the answer."""
    for test_case in nfcorpus_test_data:
        # Perform semantic search (retrieval only, no generation)
        results = await nc_client.notes.semantic_search(
            query=test_case.query,
            limit=10
        )

        retrieved_doc_ids = {r.document_id for r in results}
        expected_doc_ids = set(test_case.expected_document_ids)

        # Context Recall: Are expected documents in top-k results?
        recall = len(expected_doc_ids & retrieved_doc_ids) / len(expected_doc_ids)
        assert recall >= 0.8, f"Recall {recall} below threshold for query: {test_case.query}"


# tests/rag_evaluation/test_generation_quality.py
async def test_answer_correctness(mcp_sampling_client, evaluation_llm, nfcorpus_test_data):
    """Test that RAG system generates factually correct answers."""
    for test_case in nfcorpus_test_data:
        # Execute full RAG pipeline (retrieval + generation)
        result = await mcp_sampling_client.call_tool(
            "nc_semantic_search_answer",
            arguments={"query": test_case.query, "limit": 5}
        )

        rag_answer = result["generated_answer"]

        # LLM-as-judge evaluation
        evaluation_prompt = f"""Compare these two answers and respond with only TRUE or FALSE.

Question: {test_case.query}

Generated Answer: {rag_answer}

Ground Truth Answer: {test_case.ground_truth}

Are these answers semantically equivalent (do they convey the same factual information)?
Respond with only: TRUE or FALSE"""

        evaluation_result = await evaluation_llm(evaluation_prompt)

        assert evaluation_result.strip().upper() == "TRUE", \
            f"Answer mismatch for query: {test_case.query}\nGot: {rag_answer}\nExpected: {test_case.ground_truth}"
```
|
||||
|
||||
**Dataset Integration**
|
||||
|
||||
The BeIR nfcorpus dataset structure:
|
||||
- **corpus.jsonl**: 3,633 medical/biomedical documents (articles from PubMed)
|
||||
- **queries.jsonl**: 3,237 queries (questions)
|
||||
- **qrels/*.tsv**: Relevance judgments mapping query IDs to document IDs with scores (2=highly relevant, 1=somewhat relevant)
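
Selecting the expected documents for a query amounts to filtering the relevance judgments by score. The sketch below assumes the BeIR qrels layout of a tab-separated file with a `query-id`, `corpus-id`, `score` header row; the sample rows and the `highly_relevant_docs` helper are illustrative, and the document IDs are placeholders rather than real judgments.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows in the assumed qrels layout (not real judgments).
SAMPLE_QRELS = (
    "query-id\tcorpus-id\tscore\n"
    "PLAIN-2630\tDOC-1\t2\n"
    "PLAIN-2630\tDOC-2\t1\n"
    "PLAIN-2630\tDOC-3\t2\n"
    "PLAIN-2510\tDOC-7\t2\n"
)


def highly_relevant_docs(qrels_tsv: str, min_score: int = 2) -> dict[str, list[str]]:
    """Group document IDs by query, keeping only rows at or above min_score."""
    reader = csv.DictReader(io.StringIO(qrels_tsv), delimiter="\t")
    by_query: dict[str, list[str]] = defaultdict(list)
    for row in reader:
        if int(row["score"]) >= min_score:
            by_query[row["query-id"]].append(row["corpus-id"])
    return dict(by_query)


print(highly_relevant_docs(SAMPLE_QRELS))
# {'PLAIN-2630': ['DOC-1', 'DOC-3'], 'PLAIN-2510': ['DOC-7']}
```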
|
||||
|
||||
**Important**: The dataset provides relevance judgments (which documents answer which queries) but does NOT include ground truth answers. We must generate synthetic ground truth offline.
|
||||
|
||||
**Selected Test Queries** (5 diverse candidates):
|
||||
|
||||
1. **PLAIN-2630**: "Alkylphenol Endocrine Disruptors and Allergies" (5 words, 21 highly relevant docs)
|
||||
2. **PLAIN-2660**: "How Long to Detox From Fish Before Pregnancy?" (8 words, 20 highly relevant docs)
|
||||
3. **PLAIN-2510**: "Coffee and Artery Function" (4 words, 16 highly relevant docs)
|
||||
4. **PLAIN-2430**: "Preventing Brain Loss with B Vitamins?" (6 words, 15 highly relevant docs)
|
||||
5. **PLAIN-2690**: "Chronic Headaches and Pork Tapeworms" (5 words, 14 highly relevant docs)
|
||||
|
||||
**Ground Truth Generation** (offline, pre-test):
|
||||
|
||||
Ground truth answers will be generated offline using a script that:
|
||||
1. Loads nfcorpus dataset
|
||||
2. For each selected query, extracts top 3-5 highly relevant documents
|
||||
3. Uses an LLM (ollama/anthropic) to synthesize a reference answer
|
||||
4. Stores ground truth in `tests/rag_evaluation/fixtures/ground_truth.json`
|
||||
|
||||
```python
# tools/generate_rag_ground_truth.py
from tests.rag_evaluation.llm_providers import LLMProvider


async def generate_ground_truth(query: str, relevant_docs: list[dict], llm: LLMProvider) -> str:
    """Generate synthetic ground truth answer from highly relevant documents."""
    # Use at most the first five highly relevant documents as context
    context = "\n\n".join([
        f"Document {i+1}:\nTitle: {doc['title']}\n{doc['text']}"
        for i, doc in enumerate(relevant_docs[:5])
    ])

    prompt = f"""Based on the following documents, provide a comprehensive answer to this question:

Question: {query}

{context}

Provide a factual, well-structured answer that synthesizes information from the documents.
Focus on accuracy and completeness."""

    return await llm.generate(prompt, max_tokens=500)
```
|
||||
|
||||
**Dataset Loading at Test Runtime** (in `nfcorpus_test_data` fixture):
|
||||
|
||||
1. Download nfcorpus dataset (cached in pytest temp directory)
|
||||
2. Load corpus, queries, and qrels (relevance judgments)
|
||||
3. Load pre-generated ground truth from `tests/rag_evaluation/fixtures/ground_truth.json`
|
||||
4. Upload all corpus documents as Nextcloud notes
|
||||
5. Trigger vector sync to index documents
|
||||
6. Wait for indexing completion
|
||||
7. Return test cases with query, ground truth, and expected doc IDs
|
||||
|
||||
**LLM Provider Abstraction**
|
||||
|
||||
```python
# tests/rag_evaluation/llm_providers.py
from typing import Protocol

import anthropic


class LLMProvider(Protocol):
    async def generate(self, prompt: str, max_tokens: int = 100) -> str: ...


class OllamaProvider:
    def __init__(self, base_url: str, model: str):
        self.base_url = base_url
        self.model = model

    async def generate(self, prompt: str, max_tokens: int = 100) -> str:
        # Use httpx to call Ollama API
        ...


class AnthropicProvider:
    def __init__(self, api_key: str, model: str):
        self.client = anthropic.AsyncAnthropic(api_key=api_key)
        self.model = model

    async def generate(self, prompt: str, max_tokens: int = 100) -> str:
        message = await self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}]
        )
        # The reply text is in the first content block
        return message.content[0].text
```
|
||||
|
||||
### Consequences
|
||||
|
||||
**Positive:**
|
||||
|
||||
* **Actionable debugging**: Separate retrieval/generation tests pinpoint failure location
|
||||
* **Industry-standard metrics**: Context Recall and Answer Correctness are recognized RAG evaluation metrics
|
||||
* **Simple initial implementation**: Binary LLM evaluation (true/false) is straightforward to implement and interpret
|
||||
* **Extensible framework**: Easy to add more metrics (faithfulness, relevance) later
|
||||
* **Standardized benchmark**: nfcorpus provides objective comparison against published RAG systems
|
||||
* **Hybrid evaluation**: Combines efficiency (heuristics for retrieval) with quality (LLM-as-judge for generation)
|
||||
* **Provider flexibility**: Supports both local (Ollama) and cloud (Anthropic) LLM evaluation
|
||||
|
||||
**Negative:**
|
||||
|
||||
* **Medical domain bias**: nfcorpus is medical/biomedical content, may not represent production use cases (personal notes, calendar events, etc.)
|
||||
* **Manual test execution**: Tests require external LLM access and are not integrated into CI pipeline
|
||||
* **Limited initial coverage**: Starting with only 5 queries provides limited statistical confidence
|
||||
* **Evaluation cost**: LLM-as-judge for generation evaluation incurs API costs (Anthropic) or requires local inference (Ollama)
|
||||
* **Single metric per component**: Initial scope tests only one metric per component, missing other important quality dimensions
|
||||
* **Synthetic ground truth**: Ground truth answers are LLM-generated, not human-validated, which may introduce evaluation bias
|
||||
* **Large corpus upload**: Uploading 3,633 documents at test runtime may be slow; caching strategy needed
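
To make the "limited statistical confidence" point concrete, the sketch below computes a Wilson score interval for a pass rate. This is a standard formula brought in for illustration; the ADR does not prescribe it, and `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(successes=5, trials=5)
print(f"{low:.2f} to {high:.2f}")  # 0.57 to 1.00
```

Even when all 5 queries pass, the interval's lower bound is only about 0.57. With 50 of 50 passing it rises to about 0.93, which is the motivation for expanding the query set.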
|
||||
|
||||
**Future Work:**
|
||||
|
||||
* Expand to 50-100 queries for statistical significance
|
||||
* Add custom test dataset with production-representative documents (meeting notes, task lists, etc.)
|
||||
* Implement additional metrics (faithfulness, context relevance, answer relevance)
|
||||
* Create automated benchmarking dashboard to track metric trends
|
||||
* Test multi-hop reasoning (synthesis questions requiring multiple documents)
|
||||
* Evaluate out-of-scope handling ("I don't know" responses)
|
||||
@@ -102,7 +102,9 @@ module-root = ""
|
||||
|
||||
[dependency-groups]
|
||||
dev = [
|
||||
"anthropic>=0.42.0", # For RAG evaluation with Anthropic LLMs
|
||||
"commitizen>=4.8.2",
|
||||
"datasets>=3.3.0", # For BeIR nfcorpus dataset loading
|
||||
"ipython>=9.2.0",
|
||||
"playwright>=1.49.1",
|
||||
"pytest>=8.3.5",
|
||||
|
||||
@@ -0,0 +1,277 @@
|
||||
# RAG Evaluation Tests
|
||||
|
||||
This directory contains tests for evaluating the Retrieval-Augmented Generation (RAG) system in the Nextcloud MCP server, specifically the `nc_semantic_search_answer` tool.
|
||||
|
||||
## Architecture
|
||||
|
||||
The RAG system has two components that are tested independently:
|
||||
|
||||
1. **Retrieval** - Vector sync/embedding pipeline (indexed Nextcloud documents → vector database)
|
||||
2. **Generation** - MCP client LLM synthesis (retrieved context → natural language answer)
|
||||
|
||||
See [ADR-013](../../docs/ADR-013-rag-evaluation.md) for full architectural details.
|
||||
|
||||
## Test Structure
|
||||
|
||||
```
tests/rag_evaluation/
├── README.md                      # This file
├── conftest.py                    # Pytest fixtures
├── llm_providers.py               # LLM provider abstraction (Ollama/Anthropic)
├── fixtures/
│   └── ground_truth.json          # Pre-generated reference answers
├── test_retrieval_quality.py      # Retrieval evaluation (Context Recall)
└── test_generation_quality.py     # Generation evaluation (Answer Correctness)
```
|
||||
|
||||
## Metrics
|
||||
|
||||
### Retrieval Evaluation
|
||||
- **Metric**: Context Recall
|
||||
- **Method**: Heuristic - Check if ground-truth document IDs appear in top-k results
|
||||
- **Target**: ≥80% recall
|
||||
|
||||
### Generation Evaluation
|
||||
- **Metric**: Answer Correctness
|
||||
- **Method**: LLM-as-judge - Compare RAG answer vs ground truth (binary true/false)
|
||||
- **Evaluation**: External LLM evaluates semantic equivalence
|
||||
|
||||
## Dataset
|
||||
|
||||
**BeIR/nfcorpus** - Medical/biomedical corpus with ~3,600 documents
|
||||
|
||||
**Test Queries** (5 selected):
|
||||
1. PLAIN-2630: "Alkylphenol Endocrine Disruptors and Allergies" (21 relevant docs)
|
||||
2. PLAIN-2660: "How Long to Detox From Fish Before Pregnancy?" (20 relevant docs)
|
||||
3. PLAIN-2510: "Coffee and Artery Function" (16 relevant docs)
|
||||
4. PLAIN-2430: "Preventing Brain Loss with B Vitamins?" (15 relevant docs)
|
||||
5. PLAIN-2690: "Chronic Headaches and Pork Tapeworms" (14 relevant docs)
|
||||
|
||||
## Setup
|
||||
|
||||
### 1. Install Dependencies
|
||||
|
||||
```bash
|
||||
uv sync --group dev
|
||||
```
|
||||
|
||||
This installs:
|
||||
- `anthropic>=0.42.0` - For Anthropic LLM evaluation
|
||||
- `click>=8.1.8` - For CLI interface
|
||||
- `datasets>=3.3.0` - For BeIR nfcorpus dataset loading
|
||||
|
||||
### 2. Configure LLM Provider
|
||||
|
||||
Set environment variables for your LLM provider:
|
||||
|
||||
**Option A: Ollama (default, local/remote)**
|
||||
```bash
|
||||
export RAG_EVAL_PROVIDER=ollama
|
||||
export OLLAMA_HOST=https://ollama.example.com # or RAG_EVAL_OLLAMA_BASE_URL
|
||||
export RAG_EVAL_OLLAMA_MODEL=llama3.2:1b
|
||||
```
|
||||
|
||||
**Option B: Anthropic (cloud)**
|
||||
```bash
|
||||
export RAG_EVAL_PROVIDER=anthropic
|
||||
export RAG_EVAL_ANTHROPIC_API_KEY=sk-ant-...
|
||||
export RAG_EVAL_ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
|
||||
```
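
The way these variables resolve to a provider configuration can be sketched as follows. This is not the `create_llm_provider` factory from `llm_providers.py`; `load_provider_config` is a hypothetical helper, and the order in which `RAG_EVAL_OLLAMA_BASE_URL` and `OLLAMA_HOST` are consulted is an assumption. The default values are the ones documented for the factory.

```python
import os
from collections.abc import Mapping


def load_provider_config(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Resolve provider settings from environment-style variables."""
    provider = env.get("RAG_EVAL_PROVIDER", "ollama").lower()

    if provider == "ollama":
        # Assumed precedence: explicit RAG_EVAL_* variable, then OLLAMA_HOST.
        base_url = (
            env.get("RAG_EVAL_OLLAMA_BASE_URL")
            or env.get("OLLAMA_HOST")
            or "http://localhost:11434"
        )
        return {
            "provider": "ollama",
            "base_url": base_url.rstrip("/"),
            "model": env.get("RAG_EVAL_OLLAMA_MODEL", "llama3.1:8b"),
        }

    if provider == "anthropic":
        api_key = env.get("RAG_EVAL_ANTHROPIC_API_KEY")
        if not api_key:
            raise ValueError("RAG_EVAL_ANTHROPIC_API_KEY is required for anthropic")
        return {
            "provider": "anthropic",
            "api_key": api_key,
            "model": env.get("RAG_EVAL_ANTHROPIC_MODEL", "claude-3-5-sonnet-20241022"),
        }

    raise ValueError(f"Unknown RAG_EVAL_PROVIDER: {provider!r}")
```

Passing a plain dict instead of `os.environ` makes the resolution logic easy to check without touching the real environment.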
|
||||
|
||||
### 3. One-Time Setup: Generate Ground Truth
|
||||
|
||||
Generate synthetic reference answers for the 5 test queries:
|
||||
|
||||
```bash
|
||||
uv run python tools/rag_eval_cli.py generate
|
||||
```
|
||||
|
||||
**What this does:**
|
||||
- Downloads nfcorpus dataset to `tests/rag_evaluation/fixtures/nfcorpus/` (cached locally)
|
||||
- For each of the 5 selected queries, extracts highly relevant documents
|
||||
- Uses configured LLM to synthesize a reference answer
|
||||
- Saves to `tests/rag_evaluation/fixtures/ground_truth.json`
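
A quick sanity check of the generated file can catch an incomplete run before the tests do. The sketch below validates entries against the keys that the `nfcorpus_test_data` fixture reads; `validate_ground_truth` is a hypothetical helper, and the sample entry contains placeholder values, not real dataset content.

```python
import json

# Keys read by the nfcorpus_test_data fixture in conftest.py.
REQUIRED_KEYS = {
    "query_id",
    "query_text",
    "ground_truth_answer",
    "expected_document_ids",
    "highly_relevant_count",
}


def validate_ground_truth(entries: list[dict]) -> None:
    """Raise ValueError if any entry lacks a required key or expected documents."""
    for index, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"Entry {index} is missing keys: {sorted(missing)}")
        if not entry["expected_document_ids"]:
            raise ValueError(f"Entry {index} has no expected document IDs")


# Placeholder entry for illustration only.
sample = json.loads("""
[
  {
    "query_id": "EXAMPLE-1",
    "query_text": "Example query",
    "ground_truth_answer": "Example reference answer.",
    "expected_document_ids": ["DOC-1", "DOC-2"],
    "highly_relevant_count": 2
  }
]
""")
validate_ground_truth(sample)  # passes silently
```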
|
||||
|
||||
**Optional flags:**
|
||||
- `--provider ollama|anthropic` - Override LLM provider
|
||||
- `--model MODEL_NAME` - Override model name
|
||||
- `--force-download` - Re-download nfcorpus dataset
|
||||
|
||||
### 4. One-Time Setup: Upload Corpus to Nextcloud
|
||||
|
||||
Upload all 3,633 nfcorpus documents as Nextcloud notes:
|
||||
|
||||
```bash
|
||||
uv run python tools/rag_eval_cli.py upload \
|
||||
--nextcloud-url http://localhost:8000 \
|
||||
--username admin \
|
||||
--password admin
|
||||
```
|
||||
|
||||
**What this does:**
|
||||
- Downloads nfcorpus dataset (if not already cached)
|
||||
- Uploads all documents as notes in Nextcloud
|
||||
- Saves document ID → note ID mapping to `tests/rag_evaluation/fixtures/note_mapping.json`
|
||||
|
||||
**Optional flags:**
|
||||
- `--category CATEGORY` - Custom category for notes (default: `nfcorpus_rag_eval`)
|
||||
- `--force-download` - Re-download nfcorpus dataset
|
||||
|
||||
**Important:** This step requires:
|
||||
- A running Nextcloud instance with vector sync enabled
|
||||
- Notes app installed
|
||||
- Valid credentials
|
||||
|
||||
**Duration:** ~10-15 minutes to upload 3,633 documents
|
||||
|
||||
## Running Tests
|
||||
|
||||
### Run All RAG Evaluation Tests
|
||||
|
||||
```bash
|
||||
uv run pytest tests/rag_evaluation/ -v
|
||||
```
|
||||
|
||||
### Run Specific Test Suites
|
||||
|
||||
**Retrieval Quality Only:**
|
||||
```bash
|
||||
uv run pytest tests/rag_evaluation/test_retrieval_quality.py -v
|
||||
```
|
||||
|
||||
**Generation Quality Only:**
|
||||
```bash
|
||||
uv run pytest tests/rag_evaluation/test_generation_quality.py -v
|
||||
```
|
||||
|
||||
### Run Individual Tests
|
||||
|
||||
```bash
|
||||
uv run pytest tests/rag_evaluation/test_retrieval_quality.py::test_retrieval_context_recall -v
|
||||
uv run pytest tests/rag_evaluation/test_generation_quality.py::test_answer_correctness -v
|
||||
```
|
||||
|
||||
## Test Execution Flow
|
||||
|
||||
**Prerequisites** (one-time setup):
|
||||
1. Generated ground truth (`tools/rag_eval_cli.py generate`)
|
||||
2. Uploaded corpus to Nextcloud (`tools/rag_eval_cli.py upload`)
|
||||
|
||||
### Retrieval Quality Tests
|
||||
|
||||
1. **Setup** (`nfcorpus_test_data` fixture):
   - Loads pre-generated ground truth from `fixtures/ground_truth.json`
   - Loads note mapping from `fixtures/note_mapping.json`
   - Returns test cases with expected note IDs

2. **Test** (`test_retrieval_context_recall`):
   - For each query: Perform semantic search (top-10)
   - Extract retrieved note IDs
   - Calculate Context Recall = |expected ∩ retrieved| / |expected|
   - Assert recall ≥ 80%

3. **Cleanup**:
   - None required (notes persist in Nextcloud for reuse)
|
||||
|
||||
### Generation Quality Tests
|
||||
|
||||
1. **Setup**:
   - Same as retrieval tests (reuses `nfcorpus_test_data` fixture)
   - Creates evaluation LLM provider

2. **Test** (`test_answer_correctness`):
   - For each query: Call `nc_semantic_search_answer` MCP tool
   - Extract generated answer
   - Use LLM-as-judge to compare vs ground truth
   - Assert semantic equivalence (TRUE/FALSE)

3. **Cleanup**:
   - LLM provider closed
|
||||
|
||||
## Expected Test Duration
|
||||
|
||||
**One-time setup:**
|
||||
- **Generate ground truth**: ~5-10 minutes (5 queries with LLM generation)
|
||||
- **Upload corpus**: ~10-15 minutes (3,633 documents)
|
||||
- **Total setup**: ~15-25 minutes
|
||||
|
||||
**Test execution** (after setup):
|
||||
- **Retrieval tests**: ~1-2 minutes (5 queries, no upload/cleanup)
|
||||
- **Generation tests**: ~5-10 minutes (RAG generation + LLM evaluation)
|
||||
- **Total per run**: ~6-12 minutes
|
||||
|
||||
**Note**: These are NOT smoke tests and are NOT run in CI.
|
||||
|
||||
## Limitations & Future Work
|
||||
|
||||
**Current Limitations:**
|
||||
- Only 5 test queries (limited statistical confidence)
|
||||
- Medical domain bias (may not represent production use cases)
|
||||
- Synthetic ground truth (LLM-generated, not human-validated)
|
||||
- Manual test execution (requires external LLM access)
|
||||
|
||||
**Future Enhancements:**
|
||||
- Expand to 50-100 queries for statistical significance
|
||||
- Add custom test dataset with production-representative documents
|
||||
- Implement additional metrics (faithfulness, context relevance, answer relevance)
|
||||
- Create automated benchmarking dashboard
|
||||
- Test multi-hop reasoning (synthesis questions)
|
||||
- Evaluate out-of-scope handling ("I don't know" responses)
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Tests Fail with "Ground truth file not found"
|
||||
|
||||
Run the generate command first:
|
||||
```bash
|
||||
uv run python tools/rag_eval_cli.py generate
|
||||
```
|
||||
|
||||
### Tests Fail with "Note mapping file not found"
|
||||
|
||||
Run the upload command first:
|
||||
```bash
|
||||
uv run python tools/rag_eval_cli.py upload --nextcloud-url http://localhost:8000 --username admin --password admin
|
||||
```
|
||||
|
||||
### Tests Skipped with "MCP sampling client not yet implemented"

The `mcp_sampling_client` fixture is a placeholder that calls `pytest.skip(...)`, so tests depending on it are reported as skipped rather than failed. To run them, implement MCP client creation with sampling support. See the TODO in `conftest.py`.
|
||||
|
||||
### Upload Command Fails
|
||||
|
||||
Common issues:
|
||||
1. **Nextcloud not running**: Ensure Nextcloud is accessible at the URL
|
||||
2. **Invalid credentials**: Verify username/password
|
||||
3. **Notes app not installed**: Install Notes app in Nextcloud
|
||||
4. **Network timeout**: Increase timeout in CLI (currently 60s)
|
||||
|
||||
### LLM Timeout
|
||||
|
||||
If ground truth generation times out:
|
||||
1. Increase timeout in `llm_providers.py` (currently 10 min)
|
||||
2. Use a faster model: `--model llama3.2:1b`
|
||||
3. Check Ollama/Anthropic service availability
|
||||
|
||||
### Dataset Download Fails
|
||||
|
||||
The nfcorpus dataset is downloaded automatically. If download fails:
|
||||
1. Check internet connection
|
||||
2. Manually download from: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip
|
||||
3. Extract to `tests/rag_evaluation/fixtures/nfcorpus/`
|
||||
4. Or use HuggingFace datasets cache: `~/.cache/huggingface/datasets/BeIR___nfcorpus/`
|
||||
|
||||
### Vector Sync Not Indexing Documents
|
||||
|
||||
After uploading, vector sync must index the documents:
|
||||
1. Check vector sync is enabled in Nextcloud
|
||||
2. Trigger manual sync if needed
|
||||
3. Wait for background job to process all documents
|
||||
4. Verify in Qdrant that vectors exist for uploaded notes
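
Waiting for the background job can be scripted as a polling loop. The sketch below is deliberately generic: `wait_for_indexing` is a hypothetical helper, and `count_indexed` is any coroutine you supply that returns the number of indexed documents (for example, one that queries your vector database). No specific Qdrant or Nextcloud call is assumed here.

```python
import asyncio
import time
from collections.abc import Awaitable, Callable


async def wait_for_indexing(
    count_indexed: Callable[[], Awaitable[int]],
    expected: int,
    timeout: float = 600.0,
    interval: float = 5.0,
) -> int:
    """Poll count_indexed until it reports at least `expected` documents."""
    deadline = time.monotonic() + timeout
    while True:
        indexed = await count_indexed()
        if indexed >= expected:
            return indexed
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Only {indexed}/{expected} documents indexed after {timeout:.0f}s"
            )
        await asyncio.sleep(interval)
```

For the full corpus, `expected` would be the number of uploaded notes (3,633 if every document was uploaded); the timeout and interval defaults are arbitrary starting points.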
|
||||
|
||||
## References
|
||||
|
||||
- [ADR-013: RAG Evaluation Testing Framework](../../docs/ADR-013-rag-evaluation.md)
|
||||
- [ADR-008: MCP Sampling for Semantic Search](../../docs/ADR-008-mcp-sampling-for-semantic-search.md)
|
||||
- [BeIR Benchmark](https://github.com/beir-cellar/beir)
|
||||
- [NFCorpus Dataset](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/)
|
||||
@@ -0,0 +1 @@
|
||||
"""RAG evaluation tests for the Nextcloud MCP semantic search system."""
|
||||
@@ -0,0 +1,145 @@
"""Pytest fixtures for RAG evaluation tests.

IMPORTANT: Before running these tests, you must:
1. Generate ground truth: uv run python tools/rag_eval_cli.py generate
2. Upload corpus: uv run python tools/rag_eval_cli.py upload --nextcloud-url http://localhost:8000 --username admin --password admin

This ensures that the ground truth and note mappings are available.
"""

import json
from pathlib import Path
from typing import Any

import pytest

from tests.rag_evaluation.llm_providers import create_llm_provider

# Paths
FIXTURES_DIR = Path(__file__).parent / "fixtures"
GROUND_TRUTH_FILE = FIXTURES_DIR / "ground_truth.json"
NOTE_MAPPING_FILE = FIXTURES_DIR / "note_mapping.json"


@pytest.fixture(scope="session")
def ground_truth_data() -> list[dict[str, Any]]:
    """Load pre-generated ground truth data.

    Returns:
        List of test cases with query, ground truth answer, and expected doc IDs

    Raises:
        FileNotFoundError: If ground_truth.json doesn't exist
    """
    if not GROUND_TRUTH_FILE.exists():
        raise FileNotFoundError(
            f"Ground truth file not found: {GROUND_TRUTH_FILE}\n"
            "Run: uv run python tools/rag_eval_cli.py generate"
        )

    with open(GROUND_TRUTH_FILE) as f:
        return json.load(f)


@pytest.fixture(scope="session")
def note_mapping() -> dict[str, int]:
    """Load document ID → note ID mapping.

    Returns:
        Dict mapping nfcorpus document ID to Nextcloud note ID

    Raises:
        FileNotFoundError: If note_mapping.json doesn't exist
    """
    if not NOTE_MAPPING_FILE.exists():
        raise FileNotFoundError(
            f"Note mapping file not found: {NOTE_MAPPING_FILE}\n"
            "Run: uv run python tools/rag_eval_cli.py upload --nextcloud-url ... --username ... --password ..."
        )

    with open(NOTE_MAPPING_FILE) as f:
        return json.load(f)


@pytest.fixture(scope="session")
def nfcorpus_test_data(
    ground_truth_data: list[dict[str, Any]],
    note_mapping: dict[str, int],
):
    """Prepare nfcorpus test data for evaluation.

    This fixture combines ground truth answers with note mappings to create
    test cases ready for retrieval and generation quality tests.

    Args:
        ground_truth_data: Pre-generated ground truth answers
        note_mapping: Document ID → note ID mapping

    Returns:
        List of test cases with query, ground truth, expected doc IDs, and note IDs
    """
    test_cases = []

    for gt in ground_truth_data:
        # Map expected document IDs to note IDs
        expected_note_ids = [
            note_mapping.get(doc_id)
            for doc_id in gt["expected_document_ids"]
            if doc_id in note_mapping
        ]

        # Filter out None values (docs that weren't uploaded)
        expected_note_ids = [nid for nid in expected_note_ids if nid is not None]

        test_cases.append(
            {
                "query_id": gt["query_id"],
                "query_text": gt["query_text"],
                "ground_truth_answer": gt["ground_truth_answer"],
                "expected_document_ids": gt["expected_document_ids"],
                "expected_note_ids": expected_note_ids,
                "highly_relevant_count": gt["highly_relevant_count"],
            }
        )

    return test_cases


@pytest.fixture(scope="session")
async def evaluation_llm():
    """Create LLM provider for evaluation (separate from MCP client).

    Environment variables:
        RAG_EVAL_PROVIDER: Provider type (ollama or anthropic)
        RAG_EVAL_OLLAMA_BASE_URL: Ollama base URL (or OLLAMA_HOST)
        RAG_EVAL_OLLAMA_MODEL: Ollama model name
        RAG_EVAL_ANTHROPIC_API_KEY: Anthropic API key
        RAG_EVAL_ANTHROPIC_MODEL: Anthropic model name

    Returns:
        LLM provider instance (OllamaProvider or AnthropicProvider)
    """
    llm = create_llm_provider()
    yield llm
    await llm.close()


@pytest.fixture(scope="session")
async def mcp_sampling_client():
    """Create MCP client that supports sampling for RAG generation.

    This fixture creates an MCP client configured to support sampling,
    which is required for testing the nc_semantic_search_answer tool.

    TODO: Implement MCP client with sampling support
    For now, this is a placeholder.

    Returns:
        MCP client instance with sampling enabled
    """
    # TODO: Implement MCP client creation with sampling support
    # This will require:
    # 1. Creating an MCP client configured for sampling
    # 2. Authenticating with Nextcloud
    # 3. Ensuring sampling is enabled
    pytest.skip("MCP sampling client not yet implemented")
|
||||
@@ -0,0 +1,145 @@
"""LLM provider abstraction for RAG evaluation.

Supports Ollama (local) and Anthropic (cloud) providers for both ground truth
generation and evaluation.
"""

import os
from typing import Protocol

import httpx
from anthropic import AsyncAnthropic


class LLMProvider(Protocol):
    """Protocol for LLM providers."""

    async def generate(self, prompt: str, max_tokens: int = 500) -> str:
        """Generate text from a prompt.

        Args:
            prompt: The prompt to generate from
            max_tokens: Maximum tokens to generate

        Returns:
            Generated text
        """
        ...


class OllamaProvider:
    """Ollama provider for local LLM inference."""

    def __init__(self, base_url: str, model: str):
        """Initialize Ollama provider.

        Args:
            base_url: Ollama API base URL (e.g., http://localhost:11434)
            model: Model name (e.g., llama3.1:8b)
        """
        self.base_url = base_url.rstrip("/")
        self.model = model
        self.client = httpx.AsyncClient(timeout=600.0)  # 10 min timeout for generation

    async def generate(self, prompt: str, max_tokens: int = 500) -> str:
        """Generate text using Ollama API."""
        response = await self.client.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model,
                "prompt": prompt,
                "stream": False,
                "options": {
                    "num_predict": max_tokens,
                    "temperature": 0.7,
                },
            },
        )
        response.raise_for_status()
        data = response.json()
        return data["response"]

    async def close(self):
        """Close the HTTP client."""
        await self.client.aclose()


class AnthropicProvider:
    """Anthropic provider for cloud LLM inference."""

    def __init__(self, api_key: str, model: str):
        """Initialize Anthropic provider.

        Args:
            api_key: Anthropic API key
            model: Model name (e.g., claude-3-5-sonnet-20241022)
        """
        self.client = AsyncAnthropic(api_key=api_key)
        self.model = model

    async def generate(self, prompt: str, max_tokens: int = 500) -> str:
        """Generate text using Anthropic API."""
        message = await self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            temperature=0.7,
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text

    async def close(self):
        """Close the client (no-op for Anthropic)."""
        pass
|
||||
|
||||
|
||||
def create_llm_provider(
|
||||
provider: str | None = None,
|
||||
ollama_base_url: str | None = None,
|
||||
ollama_model: str | None = None,
|
||||
anthropic_api_key: str | None = None,
|
||||
anthropic_model: str | None = None,
|
||||
) -> LLMProvider:
|
||||
"""Create an LLM provider from environment variables or arguments.
|
||||
|
||||
Args:
|
||||
provider: Provider type ('ollama' or 'anthropic'). Defaults to RAG_EVAL_PROVIDER env var or 'ollama'
|
||||
ollama_base_url: Ollama base URL. Defaults to RAG_EVAL_OLLAMA_BASE_URL or 'http://localhost:11434'
|
||||
ollama_model: Ollama model. Defaults to RAG_EVAL_OLLAMA_MODEL or 'llama3.1:8b'
|
||||
anthropic_api_key: Anthropic API key. Defaults to RAG_EVAL_ANTHROPIC_API_KEY env var
|
||||
anthropic_model: Anthropic model. Defaults to RAG_EVAL_ANTHROPIC_MODEL or 'claude-3-5-sonnet-20241022'
|
||||
|
||||
Returns:
|
||||
LLMProvider instance
|
||||
|
||||
Raises:
|
||||
ValueError: If provider is invalid or required credentials are missing
|
||||
"""
|
||||
# Get provider from args or env
|
||||
provider = provider or os.environ.get("RAG_EVAL_PROVIDER", "ollama")
|
||||
|
||||
if provider == "ollama":
|
||||
# Try RAG_EVAL_OLLAMA_BASE_URL, then OLLAMA_HOST, then default
|
||||
base_url = (
|
||||
ollama_base_url
|
||||
or os.environ.get("RAG_EVAL_OLLAMA_BASE_URL")
|
||||
or os.environ.get("OLLAMA_HOST")
|
||||
or "http://localhost:11434"
|
||||
)
|
||||
model = ollama_model or os.environ.get("RAG_EVAL_OLLAMA_MODEL", "llama3.2:1b")
|
||||
return OllamaProvider(base_url=base_url, model=model)
|
||||
|
||||
elif provider == "anthropic":
|
||||
api_key = anthropic_api_key or os.environ.get("RAG_EVAL_ANTHROPIC_API_KEY")
|
||||
if not api_key:
|
||||
raise ValueError(
|
||||
"Anthropic API key required. Set RAG_EVAL_ANTHROPIC_API_KEY environment variable."
|
||||
)
|
||||
model = anthropic_model or os.environ.get(
|
||||
"RAG_EVAL_ANTHROPIC_MODEL", "claude-3-5-sonnet-20241022"
|
||||
)
|
||||
return AnthropicProvider(api_key=api_key, model=model)
|
||||
|
||||
else:
|
||||
raise ValueError(
|
||||
f"Invalid provider: {provider}. Must be 'ollama' or 'anthropic'."
|
||||
)
|
||||
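Review note: because `LLMProvider` is a `typing.Protocol`, any object with a matching async `generate()` method should be usable wherever a provider is expected, with no inheritance needed. The sketch below illustrates that with a stand-in provider that never touches the network. `CannedProvider` and `judge` are hypothetical names made up for this example and are not part of the commit; the protocol is redeclared here only so the snippet runs on its own.

```python
import asyncio
from typing import Protocol


class LLMProvider(Protocol):
    """Same shape as the protocol in the diff: one async generate() method."""

    async def generate(self, prompt: str, max_tokens: int = 500) -> str: ...


class CannedProvider:
    """Hypothetical test double: returns a fixed reply and records prompts."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.prompts: list[str] = []

    async def generate(self, prompt: str, max_tokens: int = 500) -> str:
        self.prompts.append(prompt)
        # Crude stand-in for a token limit: cap the reply by character count.
        return self.reply[:max_tokens]

    async def close(self) -> None:
        """Nothing to release; present only to mirror the real providers."""


async def judge(llm: LLMProvider, question: str) -> str:
    """Accepts any object whose generate() matches the protocol."""
    return await llm.generate(f"Answer briefly: {question}", max_tokens=10)


stub = CannedProvider("TRUE")
verdict = asyncio.run(judge(stub, "Is the sky blue?"))
print(verdict)
```

A double like this could let the judge-prompt plumbing be exercised without Ollama or Anthropic access, although the commit itself only ships the two real providers.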
@@ -0,0 +1,139 @@
"""Tests for RAG generation quality (Answer Correctness metric).

These tests evaluate whether the MCP client LLM generates factually correct
answers from retrieved context using the nc_semantic_search_answer tool.

Metric: Answer Correctness
- Measures: Is the generated answer factually correct?
- Method: LLM-as-judge - Compare RAG answer vs ground truth (binary true/false)
- Evaluation: External LLM evaluates semantic equivalence
"""

import pytest


@pytest.mark.integration
async def test_answer_correctness(
    mcp_sampling_client,
    evaluation_llm,
    nfcorpus_test_data,
):
    """Test that RAG system generates factually correct answers.

    For each test query:
    1. Execute full RAG pipeline via nc_semantic_search_answer MCP tool
    2. Extract generated answer from RAG response
    3. Use LLM-as-judge to compare against ground truth (binary true/false)
    4. Assert answer is semantically equivalent to ground truth

    This tests the quality of the generation component (MCP client LLM).
    """
    results_summary = []

    for test_case in nfcorpus_test_data:
        query = test_case["query_text"]
        ground_truth = test_case["ground_truth_answer"]

        print(f"\n{'=' * 80}")
        print(f"Query: {query}")

        # Execute full RAG pipeline
        print("Executing RAG pipeline...")
        rag_result = await mcp_sampling_client.call_tool(
            "nc_semantic_search_answer",
            arguments={"query": query, "limit": 5},
        )

        rag_answer = rag_result["generated_answer"]

        print(f"RAG Answer preview: {rag_answer[:200]}...")
        print(f"Ground Truth preview: {ground_truth[:200]}...")

        # LLM-as-judge evaluation
        evaluation_prompt = f"""Compare these two answers and respond with only TRUE or FALSE.

Question: {query}

Generated Answer: {rag_answer}

Ground Truth Answer: {ground_truth}

Are these answers semantically equivalent (do they convey the same factual information)?
Respond with only: TRUE or FALSE"""

        print("Evaluating answer correctness...")
        evaluation_result = await evaluation_llm.generate(
            evaluation_prompt,
            max_tokens=10,
        )

        is_correct = evaluation_result.strip().upper() == "TRUE"

        result = {
            "query_id": test_case["query_id"],
            "query": query,
            "rag_answer_length": len(rag_answer),
            "ground_truth_length": len(ground_truth),
            "is_correct": is_correct,
            "evaluation_result": evaluation_result.strip(),
        }
        results_summary.append(result)

        print(f" Evaluation: {evaluation_result.strip()}")
        print(f" Status: {'✓ CORRECT' if is_correct else '✗ INCORRECT'}")

        # Assert answer correctness
        assert is_correct, (
            f"Answer mismatch for query: {query}\n\n"
            f"Generated Answer:\n{rag_answer}\n\n"
            f"Ground Truth:\n{ground_truth}\n\n"
            f"Evaluation: {evaluation_result.strip()}"
        )

    # Print summary
    print(f"\n{'=' * 80}")
    print("Answer Correctness Summary:")
    print(f" Total queries: {len(results_summary)}")
    print(f" Correct: {sum(r['is_correct'] for r in results_summary)}")
    print(f" Incorrect: {sum(not r['is_correct'] for r in results_summary)}")
    accuracy = sum(r["is_correct"] for r in results_summary) / len(results_summary)
    print(f" Accuracy: {accuracy:.2%}")
    print(f"{'=' * 80}")


@pytest.mark.integration
async def test_answer_contains_sources(mcp_sampling_client, nfcorpus_test_data):
    """Test that RAG answers include source citations.

    This is a basic quality check - we verify that the nc_semantic_search_answer
    tool returns both a generated answer and source documents.
    """
    for test_case in nfcorpus_test_data:
        query = test_case["query_text"]

        # Execute RAG pipeline
        rag_result = await mcp_sampling_client.call_tool(
            "nc_semantic_search_answer",
            arguments={"query": query, "limit": 5},
        )

        # Check response structure
        assert "generated_answer" in rag_result, "Response missing 'generated_answer'"
        assert "sources" in rag_result, "Response missing 'sources'"

        # Check sources are provided
        sources = rag_result["sources"]
        assert len(sources) > 0, f"No sources returned for query: {query}"

        # Check each source has required fields
        for i, source in enumerate(sources):
            assert "document_id" in source or "id" in source, (
                f"Source {i} missing document ID"
            )
            assert "excerpt" in source or "content" in source or "text" in source, (
                f"Source {i} missing content"
            )

        print(f"Query: {query}")
        print(f" Sources provided: {len(sources)}")
        print(" Status: ✓ PASS")
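Review note: `test_answer_correctness` treats the judge's reply as correct only when it is exactly `TRUE` after stripping whitespace and upper-casing, so a reply such as `TRUE.` would be scored as incorrect. If that turns out to be too strict in practice, a more tolerant parser could look roughly like the sketch below. `parse_verdict` is a hypothetical helper and is not part of this commit; returning `None` for ambiguous replies is a design choice of this sketch, meant to let the test distinguish "judge said FALSE" from "judge reply was unusable".

```python
import re
from typing import Optional


def parse_verdict(raw: str) -> Optional[bool]:
    """Map a judge reply to True/False, or None when it is ambiguous."""
    # Split into alphabetic words so punctuation like "TRUE." is ignored.
    tokens = re.findall(r"[A-Za-z]+", raw.upper())
    has_true = "TRUE" in tokens
    has_false = "FALSE" in tokens
    if has_true == has_false:
        # Neither word, or both (e.g. the model echoed "TRUE or FALSE").
        return None
    return has_true


for reply in ["TRUE", "true.", " False\n", "TRUE or FALSE", "maybe"]:
    print(repr(reply), "->", parse_verdict(reply))
```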
@@ -0,0 +1,143 @@
"""Tests for RAG retrieval quality (Context Recall metric).

These tests evaluate whether the vector sync/embedding pipeline successfully
retrieves documents containing the answer to a query.

Metric: Context Recall
- Measures: Did we retrieve documents containing the answer?
- Method: Heuristic - Check if ground-truth document IDs appear in top-k results
- Target: ≥80% recall (at least 80% of expected docs in top-10 results)
"""

import pytest


@pytest.mark.integration
async def test_retrieval_context_recall(nc_client, nfcorpus_test_data):
    """Test that semantic search retrieves documents containing the answer.

    For each test query:
    1. Perform semantic search (retrieval only, no generation)
    2. Extract retrieved document IDs from top-k results
    3. Calculate Context Recall: intersection of retrieved and expected docs
    4. Assert recall meets threshold (≥80%)

    This tests the quality of the vector sync/embedding pipeline.
    """
    # Top-k documents to retrieve
    k = 10

    # Minimum acceptable recall
    min_recall = 0.8

    results_summary = []

    for test_case in nfcorpus_test_data:
        query = test_case["query_text"]
        expected_note_ids = set(test_case["expected_note_ids"])

        # Perform semantic search (retrieval only)
        search_results = await nc_client.notes.semantic_search(
            query=query,
            limit=k,
        )

        # Extract retrieved note IDs
        retrieved_note_ids = {result["id"] for result in search_results}

        # Calculate Context Recall
        intersection = expected_note_ids & retrieved_note_ids
        recall = len(intersection) / len(expected_note_ids) if expected_note_ids else 0

        # Store results
        result = {
            "query_id": test_case["query_id"],
            "query": query,
            "expected_count": len(expected_note_ids),
            "retrieved_count": len(retrieved_note_ids),
            "intersection_count": len(intersection),
            "recall": recall,
            "passed": recall >= min_recall,
        }
        results_summary.append(result)

        # Print detailed result for this query
        print(f"\n{'=' * 80}")
        print(f"Query: {query}")
        print(f" Expected docs: {len(expected_note_ids)}")
        print(f" Retrieved (top-{k}): {len(retrieved_note_ids)}")
        print(f" Intersection: {len(intersection)}")
        print(f" Context Recall: {recall:.2%}")
        print(f" Status: {'✓ PASS' if result['passed'] else '✗ FAIL'}")

        # Assert recall meets threshold
        assert recall >= min_recall, (
            f"Context Recall {recall:.2%} below threshold {min_recall:.2%} "
            f"for query: {query}\n"
            f"Expected {len(expected_note_ids)} docs, found {len(intersection)} in top-{k}"
        )

    # Print summary
    print(f"\n{'=' * 80}")
    print("Context Recall Summary:")
    print(f" Total queries: {len(results_summary)}")
    print(f" Passed: {sum(r['passed'] for r in results_summary)}")
    print(f" Failed: {sum(not r['passed'] for r in results_summary)}")
    print(
        f" Average recall: {sum(r['recall'] for r in results_summary) / len(results_summary):.2%}"
    )
    print(f"{'=' * 80}")


@pytest.mark.integration
async def test_retrieval_top1_precision(nc_client, nfcorpus_test_data):
    """Test that the top-1 retrieved document is highly relevant.

    This is a stricter test than context recall - we verify that
    the single most relevant document (rank 1) is in the expected set.

    This tests whether the ranking is good, not just retrieval.
    """
    results_summary = []

    for test_case in nfcorpus_test_data:
        query = test_case["query_text"]
        expected_note_ids = set(test_case["expected_note_ids"])

        # Perform semantic search
        search_results = await nc_client.notes.semantic_search(
            query=query,
            limit=1,  # Only top-1
        )

        # Check if top result is in expected set
        if search_results:
            top_result_id = search_results[0]["id"]
            is_relevant = top_result_id in expected_note_ids
        else:
            is_relevant = False

        result = {
            "query_id": test_case["query_id"],
            "query": query,
            "top_result_id": search_results[0]["id"] if search_results else None,
            "is_relevant": is_relevant,
        }
        results_summary.append(result)

        print(f"\nQuery: {query}")
        print(f" Top-1 relevant: {'✓ YES' if is_relevant else '✗ NO'}")

        # This is informational - we don't assert here
        # Some queries may have multiple valid top results

    # Print summary
    precision_at_1 = sum(r["is_relevant"] for r in results_summary) / len(
        results_summary
    )
    print(f"\n{'=' * 80}")
    print(f"Precision@1: {precision_at_1:.2%}")
    print(
        f" ({sum(r['is_relevant'] for r in results_summary)}/{len(results_summary)} queries)"
    )
    print(f"{'=' * 80}")
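Review note: the two retrieval metrics above reduce to a few lines of set arithmetic, which may be easier to sanity-check in isolation than through a live Nextcloud instance. The sketch below restates them as pure functions and works one example by hand. The function names and the document IDs are made up for illustration and do not come from the nfcorpus fixtures.

```python
def context_recall(expected: set[int], retrieved: list[int]) -> float:
    """Fraction of expected documents that appear anywhere in the results."""
    if not expected:
        # Mirrors the test's guard against dividing by zero.
        return 0.0
    return len(expected & set(retrieved)) / len(expected)


def top1_is_relevant(expected: set[int], retrieved: list[int]) -> bool:
    """True when the rank-1 result is one of the expected documents."""
    return bool(retrieved) and retrieved[0] in expected


# Illustrative IDs only: 5 expected docs, 10 retrieved, 4 of them expected.
expected_ids = {11, 12, 13, 14, 15}
top_10 = [12, 40, 11, 41, 13, 42, 43, 14, 44, 45]

recall = context_recall(expected_ids, top_10)
print(f"recall={recall:.2%}, passes={recall >= 0.8}")
print(f"top-1 relevant: {top1_is_relevant(expected_ids, top_10)}")
```

Here 4 of the 5 expected documents are retrieved, so recall is 4/5 = 0.8, which just meets the 80% threshold used by `test_retrieval_context_recall`. Note that recall ignores rank order, which is why the separate top-1 check exists.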