Vespa provides built-in embedder components that transform text into vector representations for semantic search, similarity matching, and retrieval tasks.
Overview
Embedders implement the Embedder interface and can be used during:
Document processing - Embed text fields when indexing documents
Query processing - Embed query text for semantic search
Custom processing - Use embedders in custom components (see the sketch after this list)
All embedders support:
Automatic tokenization and text preprocessing
Caching of embedding results
Configurable model parameters
ONNX model inference
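For custom processing, an embedder component can be injected into any container component, such as a searcher. Below is a minimal sketch using the com.yahoo.language.process.Embedder API; the searcher class, the "text" query property, and the tensor dimension are illustrative assumptions, not part of the built-in embedders:
import com.yahoo.component.annotation.Inject;
import com.yahoo.language.process.Embedder;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;
import com.yahoo.tensor.Tensor;
import com.yahoo.tensor.TensorType;

// Hypothetical searcher: embeds the user's query text and passes the
// resulting vector to ranking as the query(q) tensor input.
public class EmbeddingSearcher extends Searcher {

    private final Embedder embedder;

    @Inject
    public EmbeddingSearcher(Embedder embedder) {
        this.embedder = embedder;
    }

    @Override
    public Result search(Query query, Execution execution) {
        String text = query.properties().getString("text", "");
        Embedder.Context context = new Embedder.Context("query");  // destination id, used for caching/errors
        Tensor vector = embedder.embed(text, context, TensorType.fromSpec("tensor<float>(x[384])"));
        query.getRanking().getFeatures().put("query(q)", vector);
        return execution.search(query);
    }
}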
Available Embedders
BertBaseEmbedder - BERT-based models with WordPiece tokenization
HuggingFaceEmbedder - Generic Hugging Face transformer models
ColBertEmbedder - Multi-vector token-level embeddings
SpladeEmbedder - Sparse learned embeddings
BertBaseEmbedder
The BertBaseEmbedder supports BERT and BERT-compatible models (DistilBERT, RoBERTa, etc.).
Configuration
<container id="default" version="1.0">
    <component id="myBertEmbedder" class="ai.vespa.embedding.BertBaseEmbedder" bundle="model-integration">
        <config name="embedding.bert-base-embedder">
            <tokenizerVocab>
                <model>models/vocab.txt</model>
            </tokenizerVocab>
            <transformerModel>
                <model>models/bert_model.onnx</model>
            </transformerModel>
            <transformerMaxTokens>384</transformerMaxTokens>
            <transformerInputIds>input_ids</transformerInputIds>
            <transformerAttentionMask>attention_mask</transformerAttentionMask>
            <transformerTokenTypeIds>token_type_ids</transformerTokenTypeIds>
            <transformerOutput>output_0</transformerOutput>
            <poolingStrategy>mean</poolingStrategy>
        </config>
    </component>
</container>
Model Requirements
BERT-compatible models must have three inputs:
// From: model-integration/src/main/java/ai/vespa/embedding/BertBaseEmbedder.java:23-30
/**
* A BERT Base compatible embedder. This embedder uses a WordPiece embedder to
* produce a token sequence that is then input to a transformer model. A BERT base
* compatible transformer model must have three inputs:
*
* - A token sequence (input_ids)
* - An attention mask (attention_mask)
* - Token types for cross encoding (token_type_ids)
*/
Pooling Strategies
Configure how token embeddings are pooled into a sentence embedding:
Average all token embeddings (recommended for most cases): <poolingStrategy>mean</poolingStrategy>
Use only the [CLS] token embedding: <poolingStrategy>cls</poolingStrategy>
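Conceptually, mean pooling averages only the token positions marked as real input by the attention mask, so padding does not dilute the embedding. A minimal standalone sketch over plain arrays (illustrative only, not Vespa's actual implementation):
// Masked mean pooling: average token embeddings where attentionMask[t] == 1.
// tokenEmbeddings: [sequenceLength][hiddenSize], attentionMask: [sequenceLength]
static float[] meanPool(float[][] tokenEmbeddings, int[] attentionMask) {
    int hiddenSize = tokenEmbeddings[0].length;
    float[] pooled = new float[hiddenSize];
    int count = 0;
    for (int t = 0; t < tokenEmbeddings.length; t++) {
        if (attentionMask[t] == 0) continue;  // skip padding positions
        for (int d = 0; d < hiddenSize; d++) {
            pooled[d] += tokenEmbeddings[t][d];
        }
        count++;
    }
    for (int d = 0; d < hiddenSize; d++) {
        pooled[d] /= count;
    }
    return pooled;
}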
Schema Integration
schema doc {
    document doc {
        field text type string {}
    }
    field embedding type tensor<float>(x[384]) {
        indexing: input text | embed myBertEmbedder | attribute
        attribute {
            distance-metric: angular
        }
    }
    rank-profile semantic {
        inputs {
            query(q) tensor<float>(x[384])
        }
        first-phase {
            expression: closeness(field, embedding)
        }
    }
}
HuggingFaceEmbedder
The HuggingFaceEmbedder supports any Hugging Face model exported to ONNX format.
Configuration
<component id="hf" class="ai.vespa.embedding.HuggingFaceEmbedder" bundle="model-integration">
    <config name="embedding.hugging-face-embedder">
        <tokenizerPath>
            <model>models/tokenizer.json</model>
        </tokenizerPath>
        <transformerModel>
            <model>models/model.onnx</model>
        </transformerModel>
        <transformerMaxTokens>512</transformerMaxTokens>
        <transformerInputIds>input_ids</transformerInputIds>
        <transformerAttentionMask>attention_mask</transformerAttentionMask>
        <transformerOutput>last_hidden_state</transformerOutput>
        <normalize>true</normalize>
        <poolingStrategy>mean</poolingStrategy>
    </config>
</component>
The embedder automatically detects the number of inputs your model requires:
// From: model-integration/src/main/java/ai/vespa/embedding/HuggingFaceEmbedder.java:54-73
static ModelAnalysis analyze(OnnxEvaluator evaluator, HuggingFaceEmbedderConfig config) {
    Map<String, TensorType> inputs = evaluator.getInputInfo();
    int numInputs = inputs.size();
    String inputIdsName = config.transformerInputIds();
    String attentionMaskName = "";
    String tokenTypeIdsName = "";
    validateName(inputs, inputIdsName, "input");
    // some new models have only 1 input
    if (numInputs > 1) {
        attentionMaskName = config.transformerAttentionMask();
        validateName(inputs, attentionMaskName, "input");
        // newer models have only 2 inputs (they do not use token type IDs)
        if (numInputs > 2) {
            tokenTypeIdsName = config.transformerTokenTypeIds();
            validateName(inputs, tokenTypeIdsName, "input");
        }
    }
    // ...
}
Normalization
Enable L2 normalization for cosine similarity:
<normalize>true</normalize>
This normalizes embeddings to unit length, making cosine similarity equivalent to dot product.
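To see why, note that dividing each vector by its Euclidean length makes every vector unit length, so the dot product of two vectors equals their cosine similarity. A minimal sketch of L2 normalization (illustrative only, not Vespa's EmbeddingNormalizer):
// L2-normalize: divide each component by the vector's Euclidean length.
// For unit vectors, dot(a, b) == cosine(a, b).
static float[] l2Normalize(float[] v) {
    double sumSquares = 0.0;
    for (float x : v) sumSquares += (double) x * x;
    double norm = Math.sqrt(sumSquares);
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
        out[i] = (float) (v[i] / norm);
    }
    return out;
}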
Query and Document Instructions
Some models require different prompts for queries vs documents:
<prependQuery>query: </prependQuery>
<prependDocument>passage: </prependDocument>
Binary Quantization
Reduce memory usage by binarizing embeddings and packing them into int8 values:
field embedding type tensor<int8>(x[64]) {
    indexing: input text | embed hf | attribute
}
// From: model-integration/src/main/java/ai/vespa/embedding/HuggingFaceEmbedder.java:216-234
private Tensor binaryQuantization(HuggingFaceEmbedder.HFEmbeddingResult embeddingResult, TensorType targetType) {
    long outputDimensions = embeddingResult.output().shape()[2];
    long targetDimensions = targetType.dimensions().get(0).size().get();
    // 🪆 flexibility - packing only the first 8*targetDimension float values from the model output
    long targetUnpackagedDimensions = 8 * targetDimensions;
    if (targetUnpackagedDimensions > outputDimensions) {
        throw new IllegalArgumentException("Cannot pack " + outputDimensions + " into " + targetDimensions + " int8's");
    }
    // pool and normalize using float version before binary quantization
    TensorType poolingType = new TensorType.Builder(TensorType.Value.FLOAT)
            .indexed(targetType.indexedSubtype().dimensions().get(0).name(), targetUnpackagedDimensions)
            .build();
    Tensor result = analysis.poolingStrategy().toSentenceEmbedding(poolingType, embeddingResult.output(), embeddingResult.attentionMask());
    result = normalize ? EmbeddingNormalizer.normalize(result, poolingType) : result;
    Tensor packedResult = Tensors.packBits(result);
    return packedResult;
}
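The packing step maps each float to a single sign bit and stores 8 bits per int8, which is why a tensor<int8>(x[64]) field can hold a 512-dimensional binarized embedding. A standalone sketch of the idea (the actual packing lives in Tensors.packBits; the bit order shown here is an assumption):
// Binarize then pack: bit = 1 if the float is positive, 8 bits per byte.
// 512 floats -> 64 int8 values, a 32x reduction versus float storage.
static byte[] packBits(float[] embedding) {
    byte[] packed = new byte[embedding.length / 8];
    for (int i = 0; i < embedding.length; i++) {
        if (embedding[i] > 0) {
            packed[i / 8] |= (byte) (1 << (7 - (i % 8)));  // most-significant-bit-first order (assumed)
        }
    }
    return packed;
}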
ColBertEmbedder
ColBERT produces multiple vectors per text (one per token), enabling fine-grained similarity matching.
Configuration
<component id="colbert" class="ai.vespa.embedding.ColBertEmbedder" bundle="model-integration">
    <config name="embedding.col-bert-embedder">
        <tokenizerPath>
            <model>models/tokenizer.json</model>
        </tokenizerPath>
        <transformerModel>
            <model>models/colbert.onnx</model>
        </transformerModel>
        <transformerMaxTokens>512</transformerMaxTokens>
        <maxQueryTokens>32</maxQueryTokens>
        <maxDocumentTokens>256</maxDocumentTokens>
        <queryTokenId>1</queryTokenId>
        <documentTokenId>2</documentTokenId>
    </config>
</component>
Multi-Vector Schema
ColBERT requires a mixed tensor type:
schema doc {
    document doc {
        field text type string {}
    }
    field colbert type tensor<float>(token{}, x[128]) {
        indexing: input text | embed colbert | attribute
    }
    rank-profile colbert {
        inputs {
            query(qt) tensor<float>(qt{}, x[128])
        }
        first-phase {
            expression: sum(
                reduce(
                    sum(
                        query(qt) * attribute(colbert), x
                    ),
                    max, token
                ),
                qt
            )
        }
    }
}
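The first-phase expression is the ColBERT MaxSim operator: for each query token, take the maximum dot product against all document tokens, then sum those maxima over the query tokens. A standalone sketch of the same computation over plain arrays (illustrative only; in Vespa this runs as the tensor expression above):
// MaxSim: sum over query tokens of the max dot product with any document token.
// queryTokens: [numQueryTokens][dim], docTokens: [numDocTokens][dim]
static double maxSim(float[][] queryTokens, float[][] docTokens) {
    double score = 0.0;
    for (float[] q : queryTokens) {
        double best = Double.NEGATIVE_INFINITY;
        for (float[] d : docTokens) {
            double dot = 0.0;
            for (int i = 0; i < q.length; i++) {
                dot += (double) q[i] * d[i];
            }
            best = Math.max(best, dot);
        }
        score += best;
    }
    return score;
}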
Token Filtering
ColBERT automatically filters punctuation tokens for documents:
// From: model-integration/src/main/java/ai/vespa/embedding/ColBertEmbedder.java:45-105
private static final String PUNCTUATION = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

protected TransformerInput buildTransformerInput(List<Long> tokens, int maxTokens, boolean isQuery) {
    if (!isQuery) {
        tokens = tokens.stream().filter(token -> !skipTokens.contains(token)).toList();
    }
    // ...
}
SpladeEmbedder
SPLADE creates sparse embeddings using learned term importance weights.
Configuration
<component id="splade" class="ai.vespa.embedding.SpladeEmbedder" bundle="model-integration">
    <config name="embedding.splade-embedder">
        <tokenizerPath>
            <model>models/tokenizer.json</model>
        </tokenizerPath>
        <transformerModel>
            <model>models/splade.onnx</model>
        </transformerModel>
        <termScoreThreshold>0.0</termScoreThreshold>
    </config>
</component>
Sparse Tensor Output
SPLADE produces a mapped tensor with vocabulary terms as labels:
field splade type tensor<float>(term{}) {
    indexing: input text | embed splade | attribute
    attribute: fast-search
}
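Ranking with these mapped tensors reduces to a sparse dot product: only terms present in both the query and document tensors contribute, each weighted by the product of its learned scores. A standalone sketch over plain maps (illustrative only; in Vespa this is a tensor expression over the term{} dimension):
import java.util.Map;

// Sparse dot product over learned term weights: iterate the query's terms
// and multiply matching document entries.
static double spladeScore(Map<String, Double> queryTerms, Map<String, Double> docTerms) {
    double score = 0.0;
    for (Map.Entry<String, Double> e : queryTerms.entrySet()) {
        Double docWeight = docTerms.get(e.getKey());
        if (docWeight != null) {
            score += e.getValue() * docWeight;
        }
    }
    return score;
}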
Custom Reduction
SPLADE uses optimized reduction for performance:
// From: model-integration/src/main/java/ai/vespa/embedding/SpladeEmbedder.java:177-222
public Tensor sparsifyCustomReduce(IndexedTensor modelOutput, TensorType tensorType) {
    var builder = Tensor.Builder.of(tensorType);
    long[] shape = modelOutput.shape();
    int sequenceLength = (int) shape[1];
    int vocabSize = (int) shape[2];
    String dimension = tensorType.dimensions().get(0).name();
    long[] tokens = new long[1];
    DirectIndexedAddress directAddress = modelOutput.directAddress();
    directAddress.setIndex(0, 0);
    for (int v = 0; v < vocabSize; v++) {
        double maxValue = 0.0d;
        directAddress.setIndex(2, v);
        long increment = directAddress.getStride(1);
        long directIndex = directAddress.getDirectIndex();
        for (int s = 0; s < sequenceLength; s++) {
            double value = modelOutput.get(directIndex + s * increment);
            if (value > maxValue) {
                maxValue = value;
            }
        }
        double logOfRelu = Math.log(1 + maxValue);
        if (logOfRelu > termScoreThreshold) {
            tokens[0] = v;
            String term = tokenizer.decode(tokens);
            builder.cell()
                    .label(dimension, term)
                    .value(logOfRelu);
        }
    }
    return builder.build();
}
Exporting Models to ONNX
From Hugging Face
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L6-v2"

# Export model to ONNX
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)
model.save_pretrained("exported_model")

# Save tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("exported_model")
From PyTorch
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("model-name")
tokenizer = AutoTokenizer.from_pretrained("model-name")

# Create dummy input
dummy_input = tokenizer("sample text", return_tensors="pt")

# Export to ONNX
torch.onnx.export(
    model,
    (dummy_input["input_ids"],
     dummy_input["attention_mask"],
     dummy_input["token_type_ids"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "token_type_ids": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"}
    },
    opset_version=14
)
Caching
Embedders automatically cache results per request:
// From: model-integration/src/main/java/ai/vespa/embedding/HuggingFaceEmbedder.java:180-183
private HuggingFaceEmbedder.HFEmbeddingResult lookupOrEvaluate(Context context, String text) {
    var key = new HFEmbedderCacheKey(context.getEmbedderId(), text);
    return context.computeCachedValueIfAbsent(key, () -> evaluate(context, text));
}
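Because the cache key is (embedder id, text) and the cache is scoped to a single request, embedding the same text several times in one request runs the model only once. A simplified sketch of the compute-if-absent pattern involved (not the actual Context implementation):
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Per-request cache: the first lookup computes and stores the value,
// later lookups with the same key reuse it.
final class RequestCache {
    private final Map<Object, Object> values = new HashMap<>();

    @SuppressWarnings("unchecked")
    <T> T computeCachedValueIfAbsent(Object key, Supplier<T> supplier) {
        return (T) values.computeIfAbsent(key, k -> supplier.get());
    }
}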
Thread Configuration
Configure ONNX Runtime threads via the onnx-evaluator config (defined in onnx-evaluator.def):
<config name="ai.vespa.modelintegration.evaluator.onnx-evaluator">
    <executionMode>sequential</executionMode>
    <interOpThreads>1</interOpThreads>
    <intraOpThreads>-4</intraOpThreads> <!-- CPUs / 4 -->
</config>
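Per the comment above, a negative thread count is interpreted relative to the machine's CPU count. A sketch of that assumed interpretation:
// Assumed interpretation of a negative thread setting: -N means CPUs / N,
// so -4 on a 32-core machine yields 8 intra-op threads.
static int resolveThreads(int setting) {
    int cpus = Runtime.getRuntime().availableProcessors();
    return setting < 0 ? Math.max(1, cpus / -setting) : setting;
}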
GPU Acceleration
Enable GPU inference:
<gpuDevice>0</gpuDevice> <!-- Use first GPU, -1 for CPU -->
Next Steps
ONNX Models - Learn about ONNX model deployment
Semantic Search - Build semantic search with embeddings
RAG Applications - Combine embeddings with generation
Performance - Optimize embedding performance