Chris Coutinho 5b484c9226 feat: add unified provider architecture with Amazon Bedrock support

Refactored the LLM provider infrastructure so that new providers can be added sustainably, each offering embeddings, text generation, or both.

## Major Changes

### Unified Provider Architecture (ADR-015)
- Created `nextcloud_mcp_server/providers/` with unified Provider ABC
- Providers now support optional capabilities (embeddings and/or generation)
- Auto-detection registry with priority: Bedrock → Ollama → Simple
- Backward compatible - existing code continues to work
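
A minimal sketch of what "optional capabilities" on a shared base class could look like. The class names, the capability flags, and the synchronous method signatures here are illustrative assumptions, not the actual contents of `nextcloud_mcp_server/providers/`:

```python
from abc import ABC


class Provider(ABC):
    """Hypothetical base class: a provider may embed, generate, or both."""

    # Subclasses flip these flags for the capabilities they implement.
    supports_embeddings: bool = False
    supports_generation: bool = False

    def embed(self, text: str) -> list[float]:
        raise NotImplementedError(f"{type(self).__name__} does not support embeddings")

    def generate(self, prompt: str) -> str:
        raise NotImplementedError(f"{type(self).__name__} does not support generation")


class ToyEmbeddingProvider(Provider):
    """Embeddings-only toy provider (an assumption, for illustration only)."""

    supports_embeddings = True

    def embed(self, text: str) -> list[float]:
        # Deterministic 4-dimensional toy vector; not a real embedding model.
        return [float((len(text) + i) % 7) for i in range(4)]


provider = ToyEmbeddingProvider()
vector = provider.embed("hello")
```

Callers can check `supports_generation` before calling `generate()`, so an embeddings-only provider does not need to stub out methods it cannot serve.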

### New Providers
- **BedrockProvider**: Full Amazon Bedrock integration
  - Embeddings: Titan Embed, Cohere Embed models
  - Generation: Claude, Llama, Titan Text, Mistral models
  - Model-specific request/response handling
  - AWS credential chain integration
- **OllamaProvider**: Migrated, with support for both capabilities
- **AnthropicProvider**: Moved from test code to production providers
- **SimpleProvider**: Migrated in-memory fallback provider

### Breaking Changes
None - full backward compatibility maintained:
- `embedding.get_embedding_service()` still works
- RAG evaluation tests updated to use unified providers
- All existing tests pass (127 unit tests)

### Testing
- Added 9 comprehensive Bedrock unit tests with mocked boto3
- All existing unit tests pass
- Type checking (ty) and linting (ruff) pass
- Verified backward compatibility

### Documentation
- `docs/ADR-015-unified-provider-architecture.md`: Comprehensive ADR
- `docs/bedrock-setup.md`: AWS setup guide with IAM permissions
- `CLAUDE.md`: Updated with provider architecture section

### Dependencies
- Added `boto3>=1.35.0` to dev dependencies (optional)

## Environment Variables

### Bedrock
- `AWS_REGION`: AWS region (e.g., "us-east-1")
- `BEDROCK_EMBEDDING_MODEL`: Model ID for embeddings
- `BEDROCK_GENERATION_MODEL`: Model ID for generation
- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`: Optional credentials

### Ollama
- `OLLAMA_BASE_URL`: API URL
- `OLLAMA_EMBEDDING_MODEL`: Embedding model (default: "nomic-embed-text")
- `OLLAMA_GENERATION_MODEL`: Generation model
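
To illustrate how the Bedrock → Ollama → Simple priority might be resolved from these variables, here is a hedged sketch. The trigger conditions (which variables must be set for a provider to be selected) and the function name `detect_provider` are assumptions; the real registry may use different rules:

```python
import os
from collections.abc import Mapping


def detect_provider(env: Mapping[str, str] = os.environ) -> str:
    """Return the first provider whose (assumed) trigger variables are set."""
    # Assumption: Bedrock is selected when a region and at least one model ID are set.
    if env.get("AWS_REGION") and (
        env.get("BEDROCK_EMBEDDING_MODEL") or env.get("BEDROCK_GENERATION_MODEL")
    ):
        return "bedrock"
    # Assumption: Ollama is selected when its base URL is set.
    if env.get("OLLAMA_BASE_URL"):
        return "ollama"
    # Fallback: the in-memory provider needs no configuration.
    return "simple"
```

Passing the environment as a parameter keeps the function easy to exercise in unit tests without mutating `os.environ`.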

## AWS Bedrock Permissions Required

Minimal IAM policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": ["arn:aws:bedrock:*::foundation-model/*"]
    }
  ]
}
```

See `docs/bedrock-setup.md` for detailed setup instructions.
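
For context, `bedrock:InvokeModel` is the permission exercised by a call like the one below. This is a sketch of a Titan embedding request through boto3's `bedrock-runtime` client, not the code in `BedrockProvider`; the helper name `embed_with_titan` and the default model ID are assumptions chosen for illustration:

```python
import json


def embed_with_titan(client, text: str, model_id: str = "amazon.titan-embed-text-v2:0") -> list[float]:
    """Call InvokeModel on a bedrock-runtime client and return the embedding."""
    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps({"inputText": text}),
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream of JSON bytes.
    payload = json.loads(response["body"].read())
    return payload["embedding"]


# Usage (requires boto3, AWS credentials, and model access in the chosen region):
# import boto3
# client = boto3.client("bedrock-runtime", region_name="us-east-1")
# vector = embed_with_titan(client, "hello world")
```

Taking the client as an argument lets tests substitute a mock, in the spirit of the mocked-boto3 unit tests mentioned above.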

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-16 11:36:58 +01:00

| File | Last commit | Date |
|------|-------------|------|
| __init__.py | feat: implement RAG evaluation framework with CLI tooling | 2025-11-15 23:11:21 +01:00 |
| conftest.py | feat: implement RAG evaluation framework with CLI tooling | 2025-11-15 23:11:21 +01:00 |
| llm_providers.py | feat: add unified provider architecture with Amazon Bedrock support | 2025-11-16 11:36:58 +01:00 |
| README.md | feat: add concurrent uploads and --force flag to upload command | 2025-11-16 00:41:00 +01:00 |
| test_generation_quality.py | feat: implement RAG evaluation framework with CLI tooling | 2025-11-15 23:11:21 +01:00 |
| test_retrieval_quality.py | feat: implement RAG evaluation framework with CLI tooling | 2025-11-15 23:11:21 +01:00 |

README.md

RAG Evaluation Tests

This directory contains tests for evaluating the Retrieval-Augmented Generation (RAG) system in the Nextcloud MCP server, specifically the nc_semantic_search_answer tool.

Architecture

The RAG system has two components that are tested independently:

Retrieval - Vector sync/embedding pipeline (indexed Nextcloud documents → vector database)
Generation - MCP client LLM synthesis (retrieved context → natural language answer)

See ADR-013 for full architectural details.

Test Structure

tests/rag_evaluation/
├── README.md                       # This file
├── conftest.py                     # Pytest fixtures
├── llm_providers.py                # LLM provider abstraction (Ollama/Anthropic)
├── fixtures/
│   └── ground_truth.json           # Pre-generated reference answers
├── test_retrieval_quality.py       # Retrieval evaluation (Context Recall)
└── test_generation_quality.py      # Generation evaluation (Answer Correctness)

Metrics

Retrieval Evaluation

Metric: Context Recall
Method: Heuristic - Check if ground-truth document IDs appear in top-k results
Target: ≥80% recall

Generation Evaluation

Metric: Answer Correctness
Method: LLM-as-judge - Compare RAG answer vs ground truth (binary true/false)
Evaluation: External LLM evaluates semantic equivalence

Dataset

BeIR/nfcorpus - Medical/biomedical corpus with 3,633 documents

Test Queries (5 selected):

PLAIN-2630: "Alkylphenol Endocrine Disruptors and Allergies" (21 relevant docs)
PLAIN-2660: "How Long to Detox From Fish Before Pregnancy?" (20 relevant docs)
PLAIN-2510: "Coffee and Artery Function" (16 relevant docs)
PLAIN-2430: "Preventing Brain Loss with B Vitamins?" (15 relevant docs)
PLAIN-2690: "Chronic Headaches and Pork Tapeworms" (14 relevant docs)

Setup

1. Install Dependencies

uv sync --group dev

This installs:

anthropic>=0.42.0 - For Anthropic LLM evaluation
click>=8.1.8 - For CLI interface
datasets>=3.3.0 - For BeIR nfcorpus dataset loading

2. Configure LLM Provider

Set environment variables for your LLM provider:

Option A: Ollama (default, local/remote)

export RAG_EVAL_PROVIDER=ollama
export OLLAMA_HOST=https://ollama.example.com  # or RAG_EVAL_OLLAMA_BASE_URL
export RAG_EVAL_OLLAMA_MODEL=llama3.2:1b

Option B: Anthropic (cloud)

export RAG_EVAL_PROVIDER=anthropic
export RAG_EVAL_ANTHROPIC_API_KEY=sk-ant-...
export RAG_EVAL_ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
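
As a rough illustration of how these variables could be read on the Python side, consider the sketch below. The `EvalConfig` dataclass, the `load_eval_config` function, and the precedence of `RAG_EVAL_OLLAMA_BASE_URL` over `OLLAMA_HOST` are assumptions; the actual logic in llm_providers.py may differ:

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalConfig:
    provider: str
    model: Optional[str]
    base_url: Optional[str] = None
    api_key: Optional[str] = None


def load_eval_config(env: Mapping[str, str] = os.environ) -> EvalConfig:
    # Ollama is the documented default provider.
    provider = env.get("RAG_EVAL_PROVIDER", "ollama")
    if provider == "ollama":
        # Assumption: the RAG_EVAL_-prefixed URL wins over OLLAMA_HOST when both are set.
        base_url = env.get("RAG_EVAL_OLLAMA_BASE_URL") or env.get("OLLAMA_HOST")
        return EvalConfig(provider, env.get("RAG_EVAL_OLLAMA_MODEL"), base_url=base_url)
    if provider == "anthropic":
        api_key = env.get("RAG_EVAL_ANTHROPIC_API_KEY")
        if not api_key:
            raise ValueError("RAG_EVAL_ANTHROPIC_API_KEY must be set for the anthropic provider")
        return EvalConfig(provider, env.get("RAG_EVAL_ANTHROPIC_MODEL"), api_key=api_key)
    raise ValueError(f"Unknown RAG_EVAL_PROVIDER: {provider!r}")
```

Failing early on a missing API key or an unknown provider name surfaces configuration mistakes before any long-running generation step starts.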

3. One-Time Setup: Generate Ground Truth

Generate synthetic reference answers for the 5 test queries:

uv run python tools/rag_eval_cli.py generate

What this does:

Downloads nfcorpus dataset to tests/rag_evaluation/fixtures/nfcorpus/ (cached locally)
For each of the 5 selected queries, extracts highly relevant documents
Uses configured LLM to synthesize a reference answer
Saves to tests/rag_evaluation/fixtures/ground_truth.json

Optional flags:

--provider ollama|anthropic - Override LLM provider
--model MODEL_NAME - Override model name
--force-download - Re-download nfcorpus dataset

4. One-Time Setup: Upload Corpus to Nextcloud

Upload all 3,633 nfcorpus documents as Nextcloud notes:

uv run python tools/rag_eval_cli.py upload \
    --nextcloud-url http://localhost:8000 \
    --username admin \
    --password admin

What this does:

Downloads nfcorpus dataset (if not already cached)
Uploads all documents as notes in Nextcloud
Saves document ID → note ID mapping to tests/rag_evaluation/fixtures/note_mapping.json

Optional flags:

--category CATEGORY - Custom category for notes (default: nfcorpus_rag_eval)
--force-download - Re-download nfcorpus dataset
--force - Delete all existing notes in the target category before uploading (efficient corpus refresh)

Important: This step requires:

A running Nextcloud instance with vector sync enabled
Notes app installed
Valid credentials

Duration: ~10-15 minutes to upload 3,633 documents

Running Tests

Run All RAG Evaluation Tests

uv run pytest tests/rag_evaluation/ -v

Run Specific Test Suites

Retrieval Quality Only:

uv run pytest tests/rag_evaluation/test_retrieval_quality.py -v

Generation Quality Only:

uv run pytest tests/rag_evaluation/test_generation_quality.py -v

Run Individual Tests

uv run pytest tests/rag_evaluation/test_retrieval_quality.py::test_retrieval_context_recall -v
uv run pytest tests/rag_evaluation/test_generation_quality.py::test_answer_correctness -v

Test Execution Flow

Prerequisites (one-time setup):

Generated ground truth (tools/rag_eval_cli.py generate)
Uploaded corpus to Nextcloud (tools/rag_eval_cli.py upload)

Retrieval Quality Tests

Setup (nfcorpus_test_data fixture):
- Loads pre-generated ground truth from fixtures/ground_truth.json
- Loads note mapping from fixtures/note_mapping.json
- Returns test cases with expected note IDs
Test (test_retrieval_context_recall):
- For each query: Perform semantic search (top-10)
- Extract retrieved note IDs
- Calculate Context Recall = |expected ∩ retrieved| / |expected|
- Assert recall ≥ 80%
Cleanup:
- None required (notes persist in Nextcloud for reuse)
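
The recall calculation in this flow can be sketched as follows. The document IDs, note IDs, and the function name `context_recall` are made up for illustration; only the formula and the top-10 cutoff come from the steps above:

```python
def context_recall(expected_ids: set[int], retrieved_ids: list[int], k: int = 10) -> float:
    """Fraction of expected note IDs that appear in the top-k retrieved results."""
    if not expected_ids:
        raise ValueError("expected_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(expected_ids & top_k) / len(expected_ids)


# Hypothetical mapping from nfcorpus document IDs to Nextcloud note IDs.
note_mapping = {"MED-1": 101, "MED-2": 102, "MED-3": 103, "MED-4": 104}
relevant_docs = ["MED-1", "MED-2", "MED-3", "MED-4"]
expected = {note_mapping[doc_id] for doc_id in relevant_docs}

retrieved = [102, 555, 101, 777, 104]  # note IDs returned by a search, best first
recall = context_recall(expected, retrieved)
print(f"Context Recall: {recall:.2f}")  # prints "Context Recall: 0.75"
```

One property worth keeping in mind: with a fixed cutoff k, recall cannot exceed k / |expected|, so a query with more than ten expected notes cannot reach 100% at top-10.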

Generation Quality Tests

Setup:
- Same as retrieval tests (reuses nfcorpus_test_data fixture)
- Creates evaluation LLM provider
Test (test_answer_correctness):
- For each query: Call nc_semantic_search_answer MCP tool
- Extract generated answer
- Use LLM-as-judge to compare vs ground truth
- Assert semantic equivalence (TRUE/FALSE)
Cleanup:
- LLM provider closed
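
A hedged sketch of the LLM-as-judge step follows. The prompt wording, the function names, and the strict TRUE/FALSE parsing are assumptions, and `generate` stands in for whatever text-generation call the evaluation provider exposes:

```python
from typing import Callable

# Illustrative prompt; the wording used by the real test suite is not shown in this README.
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Is the candidate semantically equivalent to the reference? Reply with exactly TRUE or FALSE."""


def parse_verdict(reply: str) -> bool:
    """Map the judge's reply to a boolean; anything but TRUE/FALSE is an error."""
    verdict = reply.strip().upper()
    if verdict.startswith("TRUE"):
        return True
    if verdict.startswith("FALSE"):
        return False
    raise ValueError(f"Unparseable judge reply: {reply!r}")


def judge_answer(generate: Callable[[str], str], question: str, reference: str, candidate: str) -> bool:
    """Ask the judge model for a verdict and convert it to a boolean."""
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, candidate=candidate)
    return parse_verdict(generate(prompt))
```

Raising on an unparseable reply, rather than treating it as FALSE, keeps judge failures distinguishable from genuinely incorrect answers.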

Expected Test Duration

One-time setup:

Generate ground truth: ~5-10 minutes (5 queries with LLM generation)
Upload corpus: ~10-15 minutes (3,633 documents)
Total setup: ~15-25 minutes

Test execution (after setup):

Retrieval tests: ~1-2 minutes (5 queries, no upload/cleanup)
Generation tests: ~5-10 minutes (RAG generation + LLM evaluation)
Total per run: ~6-12 minutes

Note: These are NOT smoke tests and are NOT run in CI.

Limitations & Future Work

Current Limitations:

Only 5 test queries (limited statistical confidence)
Medical domain bias (may not represent production use cases)
Synthetic ground truth (LLM-generated, not human-validated)
Manual test execution (requires external LLM access)

Future Enhancements:

Expand to 50-100 queries for statistical significance
Add custom test dataset with production-representative documents
Implement additional metrics (faithfulness, context relevance, answer relevance)
Create automated benchmarking dashboard
Test multi-hop reasoning (synthesis questions)
Evaluate out-of-scope handling ("I don't know" responses)

Troubleshooting

Tests Fail with "Ground truth file not found"

Run the generate command first:

uv run python tools/rag_eval_cli.py generate

Tests Fail with "Note mapping file not found"

Run the upload command first:

uv run python tools/rag_eval_cli.py upload --nextcloud-url http://localhost:8000 --username admin --password admin
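
Both missing-file failures can be detected before invoking pytest. The helper below is a small sketch, not part of the repository; the function name `missing_fixtures` is hypothetical, while the file names and subcommands are the ones used in the setup steps:

```python
from pathlib import Path

# Fixture file -> CLI subcommand that creates it (taken from the setup steps).
FIXTURE_COMMANDS = {
    "ground_truth.json": "generate",
    "note_mapping.json": "upload",
}


def missing_fixtures(fixtures_dir: Path) -> dict[str, str]:
    """Return {missing file name: subcommand to run} for absent fixtures."""
    return {
        name: command
        for name, command in FIXTURE_COMMANDS.items()
        if not (fixtures_dir / name).is_file()
    }


if __name__ == "__main__":
    for name, command in missing_fixtures(Path("tests/rag_evaluation/fixtures")).items():
        print(f"{name} is missing; run: uv run python tools/rag_eval_cli.py {command}")
```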

Tests Fail with "MCP sampling client not yet implemented"

The mcp_sampling_client fixture is a placeholder. You need to implement MCP client creation with sampling support. See the TODO in conftest.py.

Upload Command Fails

Common issues:

Nextcloud not running: Ensure Nextcloud is accessible at the URL
Invalid credentials: Verify username/password
Notes app not installed: Install Notes app in Nextcloud
Network timeout: Increase timeout in CLI (currently 60s)

LLM Timeout

If ground truth generation times out:

Increase timeout in llm_providers.py (currently 10 min)
Use a faster model: --model llama3.2:1b
Check Ollama/Anthropic service availability

Dataset Download Fails

The nfcorpus dataset is downloaded automatically. If download fails:

Check internet connection
Manually download from: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip
Extract to tests/rag_evaluation/fixtures/nfcorpus/
Or use HuggingFace datasets cache: ~/.cache/huggingface/datasets/BeIR___nfcorpus/

Vector Sync Not Indexing Documents

After uploading, vector sync must index the documents:

Check vector sync is enabled in Nextcloud
Trigger manual sync if needed
Wait for background job to process all documents
Verify in Qdrant that vectors exist for uploaded notes
