feat: implement RAG evaluation framework with CLI tooling
- Add ADR-013 documenting RAG evaluation architecture - Implement two-part evaluation: Context Recall (retrieval) + Answer Correctness (generation) - Create Click CLI for ground truth generation and corpus upload - Add pytest fixtures and tests for retrieval/generation quality - Use BeIR/nfcorpus dataset with 5 selected test queries - Support Ollama and Anthropic LLM providers - Generate synthetic ground truth answers offline - Add comprehensive documentation in tests/rag_evaluation/README.md The framework separates one-time setup (generate/upload) from test execution, making tests much faster (~6-12 min vs ~15-25 min per run). Tests are manual only (not in CI) and require external LLM access. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
@@ -13,3 +13,6 @@ docker-compose.override.yml
|
||||
# Generated by pytest used to login users
|
||||
.nextcloud_oauth_*.json
|
||||
.playwright-mcp/
|
||||
|
||||
# RAG Evaluation
|
||||
tests/rag_evaluation/fixtures/
|
||||
|
||||
Reference in New Issue
Block a user