Commit Graph

3 Commits

Author SHA1 Message Date
Chris Coutinho b8010270c1 fix: Add async/await, PDF metadata, and type safety fixes
This commit addresses multiple issues with async operations, PDF metadata
extraction, and type safety in document processing and search.

## Async/Await Fixes
- processor.py:259 - Added await for chunker.chunk_text(content)
- processor.py:270 - Added await for bm25_service.encode_batch(chunk_texts)
- tests/unit/test_document_chunker.py - Converted all 12 test methods to async

## PDF Metadata Enhancement
- pymupdf.py:143 - Added file_size metadata extraction
- pymupdf.py:145-206 - Refactored to extract text page-by-page
  - Manually loop through pages instead of using page_chunks=True
  - Generate page_boundaries metadata for precise page tracking
  - Works around pymupdf.layout.activate() breaking page_chunks=True
- processor.py:32-66 - Added assign_page_numbers() helper function
  - Assigns page numbers to chunks based on overlap with page boundaries
  - Handles chunks spanning multiple pages
- processor.py:298-300 - Call assign_page_numbers() for PDF files

## Type Safety Fixes
- bm25_hybrid.py:184 - Removed int() conversion of doc_id
- semantic.py:131 - Removed int() conversion of doc_id
- viz_routes.py:275 - Removed int() conversion of doc_id
- Added comments documenting that doc_id can be int (notes) or str (file paths)

## Testing
- All 18 tests passing (12 unit + 6 integration)
- No type errors in modified files
- Container logs show successful processing
- Vector viz searches working correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 02:37:07 +01:00
Chris Coutinho eec923eff5 feat: Replace custom document chunker with LangChain MarkdownTextSplitter
Migrates from custom word-based chunking to LangChain's MarkdownTextSplitter
for better semantic search quality. This implements the chunking portion of
ADR-011.

Changes:
- Replace custom regex word chunker with MarkdownTextSplitter
- Optimized for Markdown content (headers, code blocks, lists)
- Convert from word-based (512 words) to character-based (2048 chars) chunking
- Maintain backward-compatible ChunkWithPosition interface
- Update configuration defaults and validation
- Update all unit tests (12/12 passing)

Benefits:
- Respects markdown structure boundaries
- Never breaks code blocks or headers mid-chunk
- Preserves semantic coherence within chunks
- Expected 20-30% improvement in recall quality
- Industry-standard approach (used by production RAG systems)

Note: Full reindex required to apply new chunking to existing documents.
Current vector database still contains old word-based chunks.

Related: ADR-011 (Improving Semantic Search Quality)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 12:17:23 +01:00
Chris Coutinho 862308418e fix: prevent infinite loop in DocumentChunker with position tracking
Fixed a critical infinite loop bug in document_chunker.py that occurred
when the overlap parameter caused the chunker to not make forward progress.

Changes:
- Added ChunkWithPosition dataclass to track character positions
- Refactored chunk_text() to use regex word matching for accurate position tracking
- Added safety check to ensure forward progress (next_start_idx > start_idx)
- Changed return type from list[str] to list[ChunkWithPosition]

The bug manifested when:
1. end_idx reached len(word_matches) (processing last chunk)
2. next_start_idx = end_idx - overlap would not advance past start_idx
3. Loop would continue indefinitely without making progress

Fix ensures chunker always terminates by breaking when not advancing.

All 9 unit tests now pass in 1.66s (previously timing out at 180s).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-17 06:39:15 +01:00