feat: Replace custom document chunker with LangChain MarkdownTextSplitter
Migrates from custom word-based chunking to LangChain's MarkdownTextSplitter for better semantic search quality. This implements the chunking portion of ADR-011. Changes: - Replace custom regex word chunker with MarkdownTextSplitter - Optimized for Markdown content (headers, code blocks, lists) - Convert from word-based (512 words) to character-based (2048 chars) chunking - Maintain backward-compatible ChunkWithPosition interface - Update configuration defaults and validation - Update all unit tests (12/12 passing) Benefits: - Respects markdown structure boundaries - Never breaks code blocks or headers mid-chunk - Preserves semantic coherence within chunks - Expected 20-30% improvement in recall quality - Industry-standard approach (used by production RAG systems) Note: Full reindex required to apply new chunking to existing documents. Current vector database still contains old word-based chunks. Related: ADR-011 (Improving Semantic Search Quality) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
@@ -1,7 +1,8 @@
|
||||
# ADR-011: Improving Semantic Search Quality Through Better Chunking and Embeddings
|
||||
|
||||
**Status**: Proposed
|
||||
**Status**: Partially Implemented (Chunking Complete, Embeddings Pending)
|
||||
**Date**: 2025-11-12
|
||||
**Implementation Date**: 2025-11-18 (Chunking)
|
||||
**Authors**: Development Team
|
||||
**Related**: ADR-003 (Vector Database Architecture), ADR-008 (MCP Sampling for RAG)
|
||||
|
||||
@@ -893,3 +894,50 @@ This ADR addresses the root causes of poor semantic search recall:
|
||||
- No new infrastructure or ongoing costs
|
||||
|
||||
**Next Steps**: Approve ADR → Implement changes → Reindex → Validate → Production rollout
|
||||
|
||||
## Implementation Status
|
||||
|
||||
### Completed (2025-11-18)
|
||||
|
||||
**✅ Semantic Markdown-Aware Chunking (Option C1 + C3 Hybrid)**
|
||||
|
||||
Implementation details:
|
||||
- Replaced custom word-based chunking with `MarkdownTextSplitter` from LangChain
|
||||
- Optimized for Nextcloud Notes markdown content with special handling for:
|
||||
- Headers (`#`, `##`, `###`, etc.)
|
||||
- Code blocks (` ``` `)
|
||||
- Lists (`-`, `*`, `1.`)
|
||||
- Horizontal rules (`---`)
|
||||
- Paragraphs and sentences
|
||||
- Maintained `ChunkWithPosition` interface for backward compatibility
|
||||
- Updated configuration defaults:
|
||||
- `DOCUMENT_CHUNK_SIZE`: 512 words → 2048 characters
|
||||
- `DOCUMENT_CHUNK_OVERLAP`: 50 words → 200 characters
|
||||
- Updated unit tests to verify position tracking and boundary preservation
|
||||
- All tests passing with markdown-aware character-based chunking
|
||||
|
||||
**Files Modified**:
|
||||
- `nextcloud_mcp_server/vector/document_chunker.py` - LangChain integration
|
||||
- `nextcloud_mcp_server/config.py` - Character-based defaults
|
||||
- `tests/unit/test_document_chunker.py` - Updated test suite
|
||||
|
||||
**Dependencies Added**:
|
||||
- `langchain-text-splitters>=1.0.0` (already present in `pyproject.toml`)
|
||||
|
||||
**Migration Required**:
|
||||
- ⚠️ Full reindex required to apply new chunking strategy
|
||||
- Existing documents in vector database use old word-based chunks
|
||||
- See "Migration Strategy" section above for reindexing process
|
||||
|
||||
### Pending
|
||||
|
||||
**⏳ Embedding Model Upgrade (Option E1)**
|
||||
|
||||
Still to be implemented:
|
||||
- Switch from `nomic-embed-text` (768-dim) to `mxbai-embed-large-v1` (1024-dim)
|
||||
- Implement dynamic dimension detection in `ollama_provider.py`
|
||||
- Create migration script for collection reindexing
|
||||
- Run benchmarking to validate improvement
|
||||
- Deploy to production with atomic collection swap
|
||||
|
||||
**Estimated Timeline**: 1-2 weeks for implementation and validation
|
||||
|
||||
Reference in New Issue
Block a user