Refactored the LLM provider infrastructure so that new providers, each offering embeddings, text generation, or both, can be added without reworking existing code.
## Major Changes
### Unified Provider Architecture (ADR-015)
- Created `nextcloud_mcp_server/providers/` with unified Provider ABC
- Providers now support optional capabilities (embeddings and/or generation)
- Auto-detection registry with priority: Bedrock → Ollama → Simple
- Backward compatible - existing code continues to work
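The optional-capability idea above can be sketched roughly as follows. This is a minimal illustration, not the code in `nextcloud_mcp_server/providers/`: the attribute and method names (`supports_embeddings`, `supports_generation`, `embed`, `generate`) and the toy `BucketEmbeddingProvider` are assumptions, and the sketch is synchronous for brevity.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Hypothetical unified interface; all names here are illustrative."""

    # Capability flags default to False; concrete providers opt in.
    supports_embeddings: bool = False
    supports_generation: bool = False

    @property
    @abstractmethod
    def name(self) -> str:
        """Short identifier for the provider."""

    def embed(self, text: str) -> list[float]:
        raise NotImplementedError(f"{self.name} does not support embeddings")

    def generate(self, prompt: str) -> str:
        raise NotImplementedError(f"{self.name} does not support generation")


class BucketEmbeddingProvider(Provider):
    """Toy embeddings-only provider used to exercise the interface."""

    supports_embeddings = True

    @property
    def name(self) -> str:
        return "bucket"

    def embed(self, text: str) -> list[float]:
        # Count characters into 8 buckets by code point (not a real embedding).
        vector = [0.0] * 8
        for ch in text:
            vector[ord(ch) % 8] += 1.0
        return vector
```

Callers can check the flags before invoking a capability, so an embeddings-only provider never has to stub out generation.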
### New Providers
- **BedrockProvider**: Full Amazon Bedrock integration
  - Embeddings: Titan Embed, Cohere Embed models
  - Generation: Claude, Llama, Titan Text, Mistral models
  - Model-specific request/response handling
  - AWS credential chain integration
- **OllamaProvider**: Migrated; supports both embeddings and generation
- **AnthropicProvider**: Moved from test code to production providers
- **SimpleProvider**: Migrated in-memory fallback provider
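To illustrate what "model-specific request/response handling" can look like for embeddings, here is a small sketch. It is not the `BedrockProvider` implementation: the function names are hypothetical, and the body shapes reflect my understanding of the Titan Embed and Cohere Embed formats on Bedrock, so check them against the model documentation before relying on them.

```python
import json


def build_embedding_request(model_id: str, text: str) -> str:
    """Return the JSON request body for an embedding model (hypothetical helper)."""
    if model_id.startswith("amazon.titan-embed"):
        body = {"inputText": text}
    elif model_id.startswith("cohere.embed"):
        # Cohere takes a batch of texts plus an input type.
        body = {"texts": [text], "input_type": "search_document"}
    else:
        raise ValueError(f"Unsupported embedding model: {model_id}")
    return json.dumps(body)


def parse_embedding_response(model_id: str, payload: str) -> list[float]:
    """Extract a single embedding vector from the JSON response body."""
    data = json.loads(payload)
    if model_id.startswith("amazon.titan-embed"):
        return data["embedding"]
    if model_id.startswith("cohere.embed"):
        # One vector per input text; this sketch sends exactly one text.
        return data["embeddings"][0]
    raise ValueError(f"Unsupported embedding model: {model_id}")
```

In a real call, the request string would be passed as `body` to the boto3 `bedrock-runtime` client's `invoke_model(modelId=..., body=...)`, and the response payload read from `response["body"]`.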
### Breaking Changes
None - full backward compatibility maintained:
- `embedding.get_embedding_service()` still works
- RAG evaluation tests updated to use unified providers
- All existing tests pass (127 unit tests)
### Testing
- Added 9 comprehensive Bedrock unit tests with mocked boto3
- All existing unit tests pass
- Type checking (ty) and linting (ruff) pass
- Verified backward compatibility
### Documentation
- `docs/ADR-015-unified-provider-architecture.md`: Comprehensive ADR
- `docs/bedrock-setup.md`: AWS setup guide with IAM permissions
- `CLAUDE.md`: Updated with provider architecture section
### Dependencies
- Added `boto3>=1.35.0` to dev dependencies (optional)
## Environment Variables
### Bedrock
- `AWS_REGION`: AWS region (e.g., "us-east-1")
- `BEDROCK_EMBEDDING_MODEL`: Model ID for embeddings
- `BEDROCK_GENERATION_MODEL`: Model ID for generation
- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`: Optional credentials
### Ollama
- `OLLAMA_BASE_URL`: API URL
- `OLLAMA_EMBEDDING_MODEL`: Embedding model (default: "nomic-embed-text")
- `OLLAMA_GENERATION_MODEL`: Generation model
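As a rough sketch of how these variables could drive the Bedrock → Ollama → Simple priority, consider the function below. The detection conditions are assumptions made for illustration (for example, that Bedrock is selected only when `AWS_REGION` and at least one Bedrock model ID are set); the actual registry may use different rules.

```python
import os
from collections.abc import Mapping
from typing import Optional


def detect_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a provider name from environment variables (hypothetical rules)."""
    env = os.environ if env is None else env

    # Assumed rule: Bedrock needs a region and at least one model ID.
    has_bedrock_model = bool(
        env.get("BEDROCK_EMBEDDING_MODEL") or env.get("BEDROCK_GENERATION_MODEL")
    )
    if env.get("AWS_REGION") and has_bedrock_model:
        return "bedrock"

    # Assumed rule: Ollama is selected when its base URL is configured.
    if env.get("OLLAMA_BASE_URL"):
        return "ollama"

    # In-memory fallback when nothing else is configured.
    return "simple"
```

Passing a plain dictionary instead of `os.environ` makes the priority order easy to exercise in unit tests.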
## AWS Bedrock Permissions Required
Minimal IAM policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": ["arn:aws:bedrock:*::foundation-model/*"]
    }
  ]
}
```
See `docs/bedrock-setup.md` for detailed setup instructions.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
RAG Evaluation Tests
This directory contains tests for evaluating the Retrieval-Augmented Generation (RAG) system in the Nextcloud MCP server, specifically the `nc_semantic_search_answer` tool.
Architecture
The RAG system has two components that are tested independently:
- Retrieval - Vector sync/embedding pipeline (indexed Nextcloud documents → vector database)
- Generation - MCP client LLM synthesis (retrieved context → natural language answer)
See ADR-013 for full architectural details.
Test Structure
tests/rag_evaluation/
├── README.md # This file
├── conftest.py # Pytest fixtures
├── llm_providers.py # LLM provider abstraction (Ollama/Anthropic)
├── fixtures/
│ └── ground_truth.json # Pre-generated reference answers
├── test_retrieval_quality.py # Retrieval evaluation (Context Recall)
└── test_generation_quality.py # Generation evaluation (Answer Correctness)
Metrics
Retrieval Evaluation
- Metric: Context Recall
- Method: Heuristic - Check if ground-truth document IDs appear in top-k results
- Target: ≥80% recall
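The heuristic can be written down in a few lines. This is a sketch under stated assumptions: the function name and signature are illustrative rather than taken from `test_retrieval_quality.py`, and IDs are compared as plain strings.

```python
def context_recall(expected_ids: set[str], retrieved_ids: list[str], k: int = 10) -> float:
    """Fraction of expected document IDs found among the top-k retrieved IDs."""
    if not expected_ids:
        raise ValueError("expected_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(expected_ids & top_k) / len(expected_ids)
```

Note that recall is bounded above by k / |expected| whenever the expected set has more than k entries, which matters when choosing how many ground-truth documents to expect per query.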
Generation Evaluation
- Metric: Answer Correctness
- Method: LLM-as-judge - Compare RAG answer vs ground truth (binary true/false)
- Evaluation: External LLM evaluates semantic equivalence
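A binary LLM-as-judge check might be structured along these lines. The prompt wording, the function names, and the `judge` callable (any function that sends a prompt to an LLM and returns its text reply) are all assumptions for illustration; the actual tests may phrase and parse this differently.

```python
from typing import Callable

# Hypothetical prompt template; the real evaluation prompt may differ.
JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Is the candidate semantically equivalent to the reference? "
    "Reply with exactly TRUE or FALSE."
)


def parse_verdict(reply: str) -> bool:
    """Map the judge's reply to a boolean, rejecting anything ambiguous."""
    token = reply.strip().upper()
    if token.startswith("TRUE"):
        return True
    if token.startswith("FALSE"):
        return False
    raise ValueError(f"Unparseable judge reply: {reply!r}")


def answers_match(
    question: str, candidate: str, reference: str, judge: Callable[[str], str]
) -> bool:
    """Ask the judge LLM whether the candidate matches the reference answer."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    return parse_verdict(judge(prompt))
```

Raising on an unparseable reply, rather than defaulting to FALSE, keeps judge failures distinguishable from genuinely incorrect answers.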
Dataset
BeIR/nfcorpus - Medical/biomedical corpus with ~3,600 documents
Test Queries (5 selected):
- PLAIN-2630: "Alkylphenol Endocrine Disruptors and Allergies" (21 relevant docs)
- PLAIN-2660: "How Long to Detox From Fish Before Pregnancy?" (20 relevant docs)
- PLAIN-2510: "Coffee and Artery Function" (16 relevant docs)
- PLAIN-2430: "Preventing Brain Loss with B Vitamins?" (15 relevant docs)
- PLAIN-2690: "Chronic Headaches and Pork Tapeworms" (14 relevant docs)
Setup
1. Install Dependencies
uv sync --group dev
This installs:
- `anthropic>=0.42.0` - For Anthropic LLM evaluation
- `click>=8.1.8` - For CLI interface
- `datasets>=3.3.0` - For BeIR nfcorpus dataset loading
2. Configure LLM Provider
Set environment variables for your LLM provider:
Option A: Ollama (default, local/remote)
export RAG_EVAL_PROVIDER=ollama
export OLLAMA_HOST=https://ollama.example.com # or RAG_EVAL_OLLAMA_BASE_URL
export RAG_EVAL_OLLAMA_MODEL=llama3.2:1b
Option B: Anthropic (cloud)
export RAG_EVAL_PROVIDER=anthropic
export RAG_EVAL_ANTHROPIC_API_KEY=sk-ant-...
export RAG_EVAL_ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
3. One-Time Setup: Generate Ground Truth
Generate synthetic reference answers for the 5 test queries:
uv run python tools/rag_eval_cli.py generate
What this does:
- Downloads nfcorpus dataset to `tests/rag_evaluation/fixtures/nfcorpus/` (cached locally)
- For each of the 5 selected queries, extracts highly relevant documents
- Uses configured LLM to synthesize a reference answer
- Saves to `tests/rag_evaluation/fixtures/ground_truth.json`
Optional flags:
- `--provider ollama|anthropic` - Override LLM provider
- `--model MODEL_NAME` - Override model name
- `--force-download` - Re-download nfcorpus dataset
4. One-Time Setup: Upload Corpus to Nextcloud
Upload all 3,633 nfcorpus documents as Nextcloud notes:
uv run python tools/rag_eval_cli.py upload \
--nextcloud-url http://localhost:8000 \
--username admin \
--password admin
What this does:
- Downloads nfcorpus dataset (if not already cached)
- Uploads all documents as notes in Nextcloud
- Saves document ID → note ID mapping to `tests/rag_evaluation/fixtures/note_mapping.json`
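The saved mapping is what lets the tests translate ground-truth document IDs into the note IDs that Nextcloud returns. The sketch below assumes the file is a flat JSON object of document ID → note ID; that layout, the function names, and the example IDs are assumptions, not confirmed details of `note_mapping.json`.

```python
import json
from pathlib import Path


def load_note_mapping(path: Path) -> dict[str, int]:
    """Load the document ID -> note ID mapping (assumed to be a flat JSON object)."""
    return json.loads(path.read_text(encoding="utf-8"))


def expected_note_ids(relevant_doc_ids: list[str], mapping: dict[str, int]) -> set[int]:
    """Translate ground-truth document IDs into note IDs, skipping unmapped documents."""
    return {mapping[doc_id] for doc_id in relevant_doc_ids if doc_id in mapping}
```

Skipping unmapped IDs silently is a design choice for the sketch; a stricter variant could raise so that an incomplete upload is caught early.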
Optional flags:
- `--category CATEGORY` - Custom category for notes (default: `nfcorpus_rag_eval`)
- `--force-download` - Re-download nfcorpus dataset
- `--force` - Delete all existing notes in the target category before uploading (efficient corpus refresh)
Important: This step requires:
- A running Nextcloud instance with vector sync enabled
- Notes app installed
- Valid credentials
Duration: ~10-15 minutes to upload 3,633 documents
Running Tests
Run All RAG Evaluation Tests
uv run pytest tests/rag_evaluation/ -v
Run Specific Test Suites
Retrieval Quality Only:
uv run pytest tests/rag_evaluation/test_retrieval_quality.py -v
Generation Quality Only:
uv run pytest tests/rag_evaluation/test_generation_quality.py -v
Run Individual Tests
uv run pytest tests/rag_evaluation/test_retrieval_quality.py::test_retrieval_context_recall -v
uv run pytest tests/rag_evaluation/test_generation_quality.py::test_answer_correctness -v
Test Execution Flow
Prerequisites (one-time setup):
- Generated ground truth (`tools/rag_eval_cli.py generate`)
- Uploaded corpus to Nextcloud (`tools/rag_eval_cli.py upload`)
Retrieval Quality Tests
- Setup (`nfcorpus_test_data` fixture):
  - Loads pre-generated ground truth from `fixtures/ground_truth.json`
  - Loads note mapping from `fixtures/note_mapping.json`
  - Returns test cases with expected note IDs
- Test (`test_retrieval_context_recall`):
  - For each query: Perform semantic search (top-10)
  - Extract retrieved note IDs
  - Calculate Context Recall = |expected ∩ retrieved| / |expected|
  - Assert recall ≥ 80%
- Cleanup:
  - None required (notes persist in Nextcloud for reuse)
Generation Quality Tests
- Setup:
  - Same as retrieval tests (reuses `nfcorpus_test_data` fixture)
  - Creates evaluation LLM provider
- Test (`test_answer_correctness`):
  - For each query: Call `nc_semantic_search_answer` MCP tool
  - Extract generated answer
  - Use LLM-as-judge to compare vs ground truth
  - Assert semantic equivalence (TRUE/FALSE)
- Cleanup:
  - LLM provider closed
Expected Test Duration
One-time setup:
- Generate ground truth: ~5-10 minutes (5 queries with LLM generation)
- Upload corpus: ~10-15 minutes (3,633 documents)
- Total setup: ~15-25 minutes
Test execution (after setup):
- Retrieval tests: ~1-2 minutes (5 queries, no upload/cleanup)
- Generation tests: ~5-10 minutes (RAG generation + LLM evaluation)
- Total per run: ~6-12 minutes
Note: These are NOT smoke tests and are NOT run in CI.
Limitations & Future Work
Current Limitations:
- Only 5 test queries (limited statistical confidence)
- Medical domain bias (may not represent production use cases)
- Synthetic ground truth (LLM-generated, not human-validated)
- Manual test execution (requires external LLM access)
Future Enhancements:
- Expand to 50-100 queries for statistical significance
- Add custom test dataset with production-representative documents
- Implement additional metrics (faithfulness, context relevance, answer relevance)
- Create automated benchmarking dashboard
- Test multi-hop reasoning (synthesis questions)
- Evaluate out-of-scope handling ("I don't know" responses)
Troubleshooting
Tests Fail with "Ground truth file not found"
Run the generate command first:
uv run python tools/rag_eval_cli.py generate
Tests Fail with "Note mapping file not found"
Run the upload command first:
uv run python tools/rag_eval_cli.py upload --nextcloud-url http://localhost:8000 --username admin --password admin
Tests Fail with "MCP sampling client not yet implemented"
The mcp_sampling_client fixture is a placeholder. You need to implement MCP client creation with sampling support. See the TODO in conftest.py.
Upload Command Fails
Common issues:
- Nextcloud not running: Ensure Nextcloud is accessible at the URL
- Invalid credentials: Verify username/password
- Notes app not installed: Install Notes app in Nextcloud
- Network timeout: Increase timeout in CLI (currently 60s)
LLM Timeout
If ground truth generation times out:
- Increase timeout in `llm_providers.py` (currently 10 min)
- Use a faster model: `--model llama3.2:1b`
- Check Ollama/Anthropic service availability
Dataset Download Fails
The nfcorpus dataset is downloaded automatically. If download fails:
- Check internet connection
- Manually download from: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip
- Extract to `tests/rag_evaluation/fixtures/nfcorpus/`
- Or use HuggingFace datasets cache: `~/.cache/huggingface/datasets/BeIR___nfcorpus/`
Vector Sync Not Indexing Documents
After uploading, vector sync must index the documents:
- Check vector sync is enabled in Nextcloud
- Trigger manual sync if needed
- Wait for background job to process all documents
- Verify in Qdrant that vectors exist for uploaded notes