Vespa provides powerful full-text search capabilities with BM25 ranking, linguistic processing, and advanced text matching operators. The search engine handles tokenization, stemming, and relevance ranking out of the box.
Basic Text Indexing
Text fields are defined in the schema. The index: enable-bm25 directive enables BM25 ranking features for the field.
Text Matching Operators
Contains (Term Matching)
Matches individual terms with linguistic processing:
- Tokenizes the query
- Applies stemming (“searching” → “search”)
- Handles case normalization
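Phrase Matching
Exact phrase matching uses the phrase() operator inside contains. The matches operator performs regular-expression matching, not phrase matching.
Prefix Matching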
Substring Matching
Suffix Matching
BM25 Ranking
Vespa implements the BM25 ranking algorithm, a widely used text relevance function. The implementation is in searchlib/src/vespa/searchlib/features/bm25_feature.cpp.
BM25 Formula
The formula is implemented in searchlib/src/vespa/searchlib/features/bm25_feature.cpp:59-78.
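As a rough illustration, the following Python sketch implements the standard Okapi BM25 form with the defaults mentioned in this guide (k1 = 1.2, b = 0.75). It is not copied from the C++ source, and the function and parameter names are hypothetical; the assumption is that the per-term score combines an inverse document frequency weight with a saturating, length-normalized term frequency.

```python
import math


def bm25_term_score(tf, doc_count, docs_with_term, field_len, avg_field_len,
                    k1=1.2, b=0.75):
    """Score one query term against one field using the Okapi BM25 form."""
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(1.0 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: fields longer than average are penalized when b > 0.
    norm = 1.0 - b + b * (field_len / avg_field_len)
    # Term frequency saturates as tf grows; k1 controls how quickly.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)


def bm25(term_stats, field_len, avg_field_len, doc_count, k1=1.2, b=0.75):
    """Sum per-term scores; term_stats is a list of (tf, docs_with_term) pairs."""
    return sum(
        bm25_term_score(tf, doc_count, n, field_len, avg_field_len, k1, b)
        for tf, n in term_stats
    )
```

With equal term frequency, a term found in 10 of 1000 documents scores higher than one found in 500 of them, which is the behavior the IDF weight is meant to produce.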
Using BM25 in Ranking
BM25 Parameters
Customize the k1 and b parameters.
Understanding k1 parameter
k1 controls term frequency saturation:
- Lower values (0.5-1.0): Less emphasis on term frequency
- Default (1.2): Balanced
- Higher values (1.5-2.0): More emphasis on term frequency
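The saturation effect can be made concrete with a small Python sketch. It assumes the standard BM25 term-frequency factor and switches length normalization off (b = 0) so that only k1 varies; the helper names are hypothetical.

```python
def tf_component(tf, k1):
    """BM25 term-frequency factor with length normalization switched off (b = 0)."""
    return tf * (k1 + 1.0) / (tf + k1)


def gain_from_repetition(k1, tf=10):
    """How much more a term occurring tf times contributes than one occurring once."""
    return tf_component(tf, k1) / tf_component(1, k1)


for k1 in (0.5, 1.2, 2.0):
    print(f"k1={k1}: 10 occurrences count {gain_from_repetition(k1):.2f}x a single occurrence")
```

Under these assumptions, ten occurrences are worth roughly 1.43, 1.96, and 2.50 times a single occurrence for k1 = 0.5, 1.2, and 2.0 respectively, so higher k1 rewards repetition more.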
Understanding b parameter
b controls document length normalization:
- b = 0: No length normalization
- b = 0.75: Default, balanced normalization
- b = 1.0: Full length normalization
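A similar sketch shows the effect of b. It assumes the standard BM25 length normalization and compares the same term frequency in a field half the average length and a field twice the average length; the function name and the example lengths are hypothetical.

```python
def length_adjusted_tf(tf, field_len, avg_field_len, b, k1=1.2):
    """BM25 term-frequency factor including length normalization."""
    norm = 1.0 - b + b * (field_len / avg_field_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


for b in (0.0, 0.75, 1.0):
    short_field = length_adjusted_tf(2, field_len=50, avg_field_len=100, b=b)
    long_field = length_adjusted_tf(2, field_len=200, avg_field_len=100, b=b)
    print(f"b={b}: short field {short_field:.3f}, long field {long_field:.3f}")
```

With b = 0 both fields get the same factor; as b approaches 1.0 the short field is increasingly favored over the long one.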
Linguistic Processing
Vespa applies the following linguistic processing steps (see container-search/src/main/java/com/yahoo/search/yql/YqlParser.java:26-31):
- Tokenization: Split text into tokens
- Normalization: Case folding, Unicode normalization
- Stemming: Reduce words to root form
- Segmentation: Language-specific word segmentation (e.g., Chinese, Japanese)
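The first three steps can be illustrated with a deliberately simplified Python sketch. This is a toy pipeline, not Vespa's linguistics implementation: the suffix list is a crude stand-in for a real stemmer, the regular expression does not perform language-specific segmentation, and all function names are hypothetical.

```python
import re
import unicodedata

SUFFIXES = ("ing", "es", "s")  # toy rule set, far cruder than a real stemmer


def normalize(text):
    """Unicode-normalize and case-fold the text."""
    return unicodedata.normalize("NFKC", text).casefold()


def tokenize(text):
    """Split text into word tokens on non-word characters."""
    return re.findall(r"\w+", text)


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Normalize, tokenize, then stem: the order used in this sketch."""
    return [stem(token) for token in tokenize(normalize(text))]
```

Because the same analysis would apply to both documents and queries, "Searching" in a query and "searches" in a document reduce to the same token and therefore match.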
Controlling Linguistic Processing
Disable stemming for specific terms.
Language Detection
Vespa can detect document language automatically.
Multi-Field Search
Search across multiple fields.
Fieldsets
Define fieldsets for convenience.
Advanced Text Operators
Fuzzy Search
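Fuzzy matching accepts indexed terms that lie within a bounded edit distance of the query term. The Python sketch below computes plain Levenshtein distance (insertions, deletions, substitutions) and applies a threshold of 2, the default this guide gives for maxEditDistance. Treating the metric as plain Levenshtein is an assumption, and the helper names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using one rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_match(query_term, indexed_term, max_edit_distance=2):
    """True when the two terms are within the allowed number of edits."""
    return edit_distance(query_term, indexed_term) <= max_edit_distance
```

Here the misspelling "serch" is one edit away from "search" and would match, while "kitten" and "sitting" are three edits apart and would not.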
Tolerate typos and misspellings. maxEditDistance controls how many character edits are allowed (default: 2).
Regular Expressions
WAND (Weak AND)
Efficiently find documents matching any of many terms.
WeakAnd
Similar to WAND but for boolean queries. See application/src/test/java/com/yahoo/application/ApplicationTest.java for an example.
Text Ranking Features
Vespa provides many text-based ranking features.
Term Frequency Features
Available Features
- bm25(field): BM25 score for field
- fieldMatch(field): Advanced field matching score
- fieldLength(field): Document field length
- fieldTermMatch(field, term_idx): Per-term matching info
- term(idx).significance: Term significance (IDF-based)
- termDistance(field, term1_idx, term2_idx): Distance between terms
Query Annotations
Fine-tune query term behavior.
Text Search Performance
Indexing Performance
Query Performance
Use weakAnd for large result sets
When a query matches many documents, weakAnd provides better latency than a plain OR because it can skip documents that cannot reach the top hits.
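One way to compare the two query shapes is to build both YQL strings from the same term list. The Python sketch below only constructs query strings and does not contact a Vespa instance. The helper names, the title field, and the `sources *` default are hypothetical, and the weakAnd(...) form reflects YQL syntax as understood here, so verify it against the query language reference before relying on it.

```python
def quote(term):
    """Wrap a term in double quotes, escaping backslashes and quotes."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def or_yql(field, terms, source="sources *"):
    """Plain OR query: every document matching any term is a candidate."""
    clauses = " or ".join(f"{field} contains {quote(t)}" for t in terms)
    return f"select * from {source} where {clauses}"


def weak_and_yql(field, terms, source="sources *"):
    """weakAnd query: same terms, evaluated with the weakAnd operator."""
    clauses = ", ".join(f"{field} contains {quote(t)}" for t in terms)
    return f"select * from {source} where weakAnd({clauses})"


terms = ["vespa", "text", "search"]
print(or_yql("title", terms))
print(weak_and_yql("title", terms))
```

Either string could then be sent as the yql parameter of a query request, which makes it straightforward to measure the latency difference on real data.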
Stopwords
Configure stopwords to filter common words.
Highlighting
Enable result highlighting.
Best Practices
- Enable BM25 for text relevance: Use index: enable-bm25 on important text fields
- Use appropriate match modes: contains for terms, phrase() for phrases
- Leverage linguistic processing: Let Vespa handle stemming and normalization
- Combine with filters: Use structured filters to narrow results before text matching
- Monitor field lengths: Very long fields can impact ranking quality
Next Steps
- Learn about Ranking Expressions to combine text signals
- Explore Vector Search for semantic search
- Read about Grouping & Aggregation for result organization