feat: Replace custom document chunker with LangChain MarkdownTextSplitter

Migrates from custom word-based chunking to LangChain's MarkdownTextSplitter
for better semantic search quality. This implements the chunking portion of
ADR-011.

Changes:
- Replace custom regex word chunker with MarkdownTextSplitter
- Optimized for Markdown content (headers, code blocks, lists)
- Convert from word-based (512 words) to character-based (2048 chars) chunking
- Maintain backward-compatible ChunkWithPosition interface
- Update configuration defaults and validation
- Update all unit tests (12/12 passing)

Benefits:
- Respects markdown structure boundaries
- Never breaks code blocks or headers mid-chunk
- Preserves semantic coherence within chunks
- Expected 20-30% improvement in recall quality
- Industry-standard approach (used by production RAG systems)

Note: Full reindex required to apply new chunking to existing documents.
Current vector database still contains old word-based chunks.

Related: ADR-011 (Improving Semantic Search Quality)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
Chris Coutinho
2025-11-18 12:17:23 +01:00
parent b72aeca55f
commit eec923eff5
7 changed files with 505 additions and 127 deletions
@@ -1,7 +1,8 @@
# ADR-011: Improving Semantic Search Quality Through Better Chunking and Embeddings
**Status**: Proposed
**Status**: Partially Implemented (Chunking Complete, Embeddings Pending)
**Date**: 2025-11-12
**Implementation Date**: 2025-11-18 (Chunking)
**Authors**: Development Team
**Related**: ADR-003 (Vector Database Architecture), ADR-008 (MCP Sampling for RAG)
@@ -893,3 +894,50 @@ This ADR addresses the root causes of poor semantic search recall:
- No new infrastructure or ongoing costs
**Next Steps**: Approve ADR → Implement changes → Reindex → Validate → Production rollout
## Implementation Status
### Completed (2025-11-18)
**✅ Semantic Markdown-Aware Chunking (Option C1 + C3 Hybrid)**
Implementation details:
- Replaced custom word-based chunking with `MarkdownTextSplitter` from LangChain
- Optimized for Nextcloud Notes markdown content with special handling for:
- Headers (`#`, `##`, `###`, etc.)
- Code blocks (` ``` `)
- Lists (`-`, `*`, `1.`)
- Horizontal rules (`---`)
- Paragraphs and sentences
- Maintained `ChunkWithPosition` interface for backward compatibility
- Updated configuration defaults:
- `DOCUMENT_CHUNK_SIZE`: 512 words → 2048 characters
- `DOCUMENT_CHUNK_OVERLAP`: 50 words → 200 characters
- Updated unit tests to verify position tracking and boundary preservation
- All tests passing with markdown-aware character-based chunking
**Files Modified**:
- `nextcloud_mcp_server/vector/document_chunker.py` - LangChain integration
- `nextcloud_mcp_server/config.py` - Character-based defaults
- `tests/unit/test_document_chunker.py` - Updated test suite
**Dependencies Added**:
- `langchain-text-splitters>=1.0.0` (already present in `pyproject.toml`)
**Migration Required**:
- ⚠️ Full reindex required to apply new chunking strategy
- Existing documents in vector database use old word-based chunks
- See "Migration Strategy" section above for reindexing process
### Pending
**⏳ Embedding Model Upgrade (Option E1)**
Still to be implemented:
- Switch from `nomic-embed-text` (768-dim) to `mxbai-embed-large-v1` (1024-dim)
- Implement dynamic dimension detection in `ollama_provider.py`
- Create migration script for collection reindexing
- Run benchmarking to validate improvement
- Deploy to production with atomic collection swap
**Estimated Timeline**: 1-2 weeks for implementation and validation