brandon/nextcloud-mcp-server

Chris Coutinho c1763ebc6a ADR search and handling categories in notes

2025-05-07 23:06:22 +02:00

ADR-001: Enhanced Note Search with Token-Based Relevance Ranking

Status

Proposed

Context

The current search implementation in the Nextcloud MCP server performs simple substring matching without relevance ranking. The existing method:

Fetches all notes
Performs case-insensitive substring matching on title and content
Returns matches without any ordering by relevance

This approach has several limitations:

Requires the query to appear verbatim as one substring, so a multi-word query misses notes that contain the same words in a different order
No ranking by relevance
Cannot prioritize more important matches (e.g., title vs content)
Inefficient for large note collections, because every search fetches all notes
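
To make the limitations above concrete, here is a minimal sketch of the substring behaviour described in this section. The function name `substring_search` and the sample note are hypothetical and are not taken from the server code; the sketch only assumes that notes are dicts with `title` and `content` strings.

```python
def substring_search(notes: list[dict], query: str) -> list[dict]:
    # Hypothetical stand-in for the current behaviour: a case-insensitive
    # substring test on title and content, with no ranking of the results.
    q = query.lower()
    return [
        note for note in notes
        if q in note["title"].lower() or q in note["content"].lower()
    ]


notes = [{"id": 1, "title": "Notes from the meeting", "content": "Budget review"}]

# A single word that appears in the title is found.
print([n["id"] for n in substring_search(notes, "meeting")])

# Both words are present, but not as one contiguous string, so nothing matches.
print(substring_search(notes, "meeting notes"))
```

The second query is the case that token-based matching is meant to handle.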

We need to improve search quality without adding external dependencies, so that users get better results while the implementation stays simple.

Decision

We will implement a token-based search with relevance ranking that:

Splits queries and note content into individual tokens (words)
Matches based on tokens rather than complete substrings
Applies weighted scoring with title matches valued higher than content matches
Sorts results by relevance score
Maintains backward compatibility with the existing API

Implementation Details

1. Query Processing

The search query will be normalized (converted to lowercase), tokenized (split on whitespace), and stripped of very short tokens. Stop word filtering can be added later if necessary:

def process_query(query: str) -> list[str]:
    # Convert to lowercase and split into tokens
    tokens = query.lower().split()
    # Filter out very short tokens (optional)
    tokens = [token for token in tokens if len(token) > 1]
    # Could add stop word removal here
    return tokens
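
One caveat with plain `split()` is that punctuation stays attached to words, so the token `server,` does not equal `server`. A possible variant, shown here only as a sketch, extracts word characters with the standard `re` module instead. The name `tokenize` and the `min_length` parameter are assumptions for illustration and are not part of the proposed API.

```python
import re

# \w+ matches runs of letters, digits, and underscores, so punctuation
# and hyphens act as token separators.
_WORD_RE = re.compile(r"\w+")


def tokenize(text: str, min_length: int = 2) -> list[str]:
    # Lowercase first, then keep only tokens of at least min_length characters.
    return [t for t in _WORD_RE.findall(text.lower()) if len(t) >= min_length]


print(tokenize("Nextcloud, MCP-server!"))  # ['nextcloud', 'mcp', 'server']
```

If this variant were adopted, the same function would need to be applied to note titles and content so that query and note tokens stay comparable.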

2. Note Content Processing

Each note's title and content will be processed in a similar way:

def process_note_content(note: dict) -> tuple[list[str], list[str]]:
    # Process title; fall back to "" when the field is missing or None
    title = (note.get("title") or "").lower()
    title_tokens = title.split()

    # Process content; same fallback as for the title
    content = (note.get("content") or "").lower()
    content_tokens = content.split()

    return title_tokens, content_tokens

3. Scoring Algorithm

We'll implement a scoring function that:

Assigns higher weight to title matches (e.g., 3x more important than content matches)
Considers the percentage of query tokens that match
Factors in the frequency of matches (not yet reflected in the code below, which counts each query token at most once per field)

def calculate_score(query_tokens: list[str], title_tokens: list[str], content_tokens: list[str]) -> float:
    # Constants for weighting
    TITLE_WEIGHT = 3.0
    CONTENT_WEIGHT = 1.0

    # An empty query cannot match anything (also avoids division by zero)
    if not query_tokens:
        return 0.0

    # Sets make each membership test O(1) instead of scanning the token list
    title_set = set(title_tokens)
    content_set = set(content_tokens)

    # Count how many query tokens appear in the title and in the content
    title_matches = sum(1 for qt in query_tokens if qt in title_set)
    content_matches = sum(1 for qt in query_tokens if qt in content_set)

    # Weighted sum of the matched fractions; this is 0.0 when nothing matched
    title_match_ratio = title_matches / len(query_tokens)
    content_match_ratio = content_matches / len(query_tokens)
    return TITLE_WEIGHT * title_match_ratio + CONTENT_WEIGHT * content_match_ratio
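
As a worked example of the weighting, the sketch below uses a compact, self-contained restatement of the scoring rule (3.0 for the title ratio, 1.0 for the content ratio). The function name `score` and the sample tokens are illustrative only. With these weights the maximum possible score is 4.0, reached when every query token appears in both title and content.

```python
TITLE_WEIGHT = 3.0
CONTENT_WEIGHT = 1.0


def score(query_tokens: list[str], title_tokens: list[str], content_tokens: list[str]) -> float:
    # Compact restatement of the scoring rule, for illustration only.
    if not query_tokens:
        return 0.0
    title_set, content_set = set(title_tokens), set(content_tokens)
    title_ratio = sum(t in title_set for t in query_tokens) / len(query_tokens)
    content_ratio = sum(t in content_set for t in query_tokens) / len(query_tokens)
    return TITLE_WEIGHT * title_ratio + CONTENT_WEIGHT * content_ratio


query = ["meeting", "notes"]

# One of two tokens in the title, both in the content: 3.0 * 0.5 + 1.0 * 1.0
print(score(query, ["meeting", "agenda"], ["notes", "from", "the", "meeting"]))  # 2.5

# No title match, one of two tokens in the content: 3.0 * 0.0 + 1.0 * 0.5
print(score(query, ["shopping", "list"], ["meeting", "snacks"]))  # 0.5
```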

4. Enhanced Search Implementation

def notes_search_notes(self, *, query: str):
    """
    Search notes using token-based matching with relevance ranking.
    Returns notes sorted by relevance score.
    """
    # Process the query
    query_tokens = process_query(query)

    # If empty query after processing, return empty results
    # (checked first so that no notes are fetched needlessly)
    if not query_tokens:
        return []

    all_notes = self.notes_get_all()
    search_results = []
    
    # Process and score each note
    for note in all_notes:
        title_tokens, content_tokens = process_note_content(note)
        score = calculate_score(query_tokens, title_tokens, content_tokens)
        
        # Only include notes with a non-zero score
        if score > 0:
            search_results.append({
                "id": note.get("id"),
                "title": note.get("title"),
                "category": note.get("category"),
                "modified": note.get("modified"),
                "_score": score  # Include score for sorting (optional field)
            })
    
    # Sort by score in descending order
    search_results.sort(key=lambda x: x["_score"], reverse=True)
    
    # Remove score field before returning (optional)
    for result in search_results:
        result.pop("_score", None)
    
    return search_results

5. Performance Considerations

The enhanced search still retrieves all notes from the server, which could be inefficient for large collections
Future improvements could include caching or building an in-memory index
For very large note collections, consider adding pagination to the API
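
The in-memory index mentioned above could look roughly like the following inverted index, which maps each token to the ids of the notes containing it. This is a sketch under assumptions: `build_index` and `candidate_ids` are hypothetical names, notes are assumed to be dicts with `id`, `title`, and `content`, and cache invalidation when notes change is left out entirely.

```python
from collections import defaultdict


def build_index(notes: list[dict]) -> dict[str, set[int]]:
    # Map every lowercase token to the set of note ids whose title or
    # content contains it.
    index: dict[str, set[int]] = defaultdict(set)
    for note in notes:
        text = f"{note.get('title') or ''} {note.get('content') or ''}"
        for token in text.lower().split():
            index[token].add(note["id"])
    return index


def candidate_ids(index: dict[str, set[int]], query_tokens: list[str]) -> set[int]:
    # Union of the posting sets: every note that matches at least one token.
    ids: set[int] = set()
    for token in query_tokens:
        ids |= index.get(token, set())
    return ids


notes = [
    {"id": 1, "title": "Meeting notes", "content": "Budget review"},
    {"id": 2, "title": "Shopping", "content": "Snacks for the meeting"},
    {"id": 3, "title": "Recipes", "content": "Pasta"},
]
index = build_index(notes)
print(sorted(candidate_ids(index, ["meeting", "budget"])))  # [1, 2]
```

Only the candidate notes would then need to be scored, instead of every note in the collection.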

Consequences

Benefits

Better search results with matches on individual words instead of exact phrases
Relevant results appear first due to ranking
Title matches are prioritized, matching user expectations
No additional dependencies required
Maintains backward compatibility with existing API

Limitations

Slightly increased complexity in the search implementation
Still requires fetching all notes for each search operation
No handling of typos or similar words (would require fuzzy matching)
No stemming/lemmatization to match word variations

Future Potential Enhancements

Add support for phrase queries (exact matches)
Implement an in-memory index for faster repeated searches
Add basic natural language processing features (stemming, stop words)
Support for fuzzy matching to handle typos
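
Fuzzy matching does not necessarily require a new dependency: the standard library's `difflib.get_close_matches` could serve as a first approximation. The helper name `fuzzy_match` and the cutoff of 0.8 below are assumptions chosen for illustration, not tuned values.

```python
import difflib


def fuzzy_match(query_token: str, note_tokens: list[str], cutoff: float = 0.8) -> bool:
    # True when at least one note token is similar enough to the query token.
    # The cutoff is a similarity ratio between 0.0 and 1.0.
    return bool(difflib.get_close_matches(query_token, note_tokens, n=1, cutoff=cutoff))


print(fuzzy_match("meetng", ["meeting", "notes"]))  # True: close to "meeting"
print(fuzzy_match("budget", ["meeting", "notes"]))  # False: nothing similar
```

Comparing every query token against every note token is considerably slower than exact set lookups, so this would likely need to be combined with an index or limited to queries that return no exact matches.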

Alternatives Considered

Implementing a full-text search engine (e.g., integrating with Elasticsearch)
Using vector-based semantic search with embeddings
Adding external NLP libraries for more sophisticated text processing

These alternatives were not selected for the initial implementation because they add dependencies and complexity, but they could be reconsidered for future enhancements.
