Vespa provides native support for ONNX (Open Neural Network Exchange) models, enabling you to deploy machine learning models from PyTorch, TensorFlow, scikit-learn, and other frameworks.
Overview
ONNX models can be used for:
Ranking - Score documents during search
Embeddings - Generate vector representations
Feature extraction - Transform data for downstream tasks
Stateless inference - Serve predictions via REST API
Vespa evaluates ONNX models using ONNX Runtime, providing high-performance inference on CPU and GPU.
Adding ONNX Models
Export Your Model to ONNX
Convert your trained model to ONNX format:
import torch
import torch.onnx

# Load your PyTorch model
model = MyModel()
model.load_state_dict(torch.load('model.pt'))
model.eval()

# Create dummy input matching the model's expected shape
dummy_input = torch.randn(1, 10)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "my_model.onnx",
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    },
    opset_version=14
)
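Before deploying, it can help to verify that the exported file is well-formed. A minimal sketch using the onnx package's checker (the file name my_model.onnx matches the export above):
import onnx

# Load the exported model and run ONNX's structural validation;
# check_model raises if the graph is malformed
model = onnx.load("my_model.onnx")
onnx.checker.check_model(model)
print("Model is well-formed; opset:", model.opset_import[0].version)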
Add Model to Application Package
Place the ONNX file in your application's models/ directory:
my-app/
├── services.xml
├── schemas/
│   └── doc.sd
└── models/
    └── my_model.onnx
Declare Model in Schema
Reference the model in your schema file:
schema doc {
    onnx-model my_model {
        file: models/my_model.onnx
        input input: my_input_expression
        output output: my_output
    }
}
Use Model in Ranking
Reference the model in rank profiles (the output is addressed by its mapped Vespa name, my_output):
rank-profile with_onnx {
    function my_input_expression() {
        expression: tensor<float>(d0[10]):[1,2,3,4,5,6,7,8,9,10]
    }
    first-phase {
        expression: onnx(my_model).my_output
    }
}
Model Configuration
Basic Declaration
Declare an ONNX model in your schema:
onnx-model classifier {
    file: models/classifier.onnx
}
Input Mapping
Map ONNX input names to Vespa expressions:
onnx-model scorer {
    file: models/scorer.onnx
    # Map ONNX inputs to Vespa features
    input input_ids: tokenSequence
    input attention_mask: tokenMask
    input segment_ids: tokenTypes
}
// From: config-model/src/main/java/com/yahoo/schema/OnnxModel.java:57-84
private String validateInputSource(String source) {
    var optRef = Reference.simple(source);
    if (optRef.isPresent()) {
        Reference ref = optRef.get();
        // input can be one of:
        // attribute(foo), query(foo), constant(foo)
        if (FeatureNames.isSimpleFeature(ref)) {
            return ref.toString();
        }
        // or a function (evaluated by backend)
        if (ref.isSimpleRankingExpressionWrapper()) {
            var arg = ref.simpleArgument();
            if (arg.isPresent()) {
                return ref.toString();
            }
        }
    } else {
        // otherwise it must be an identifier
        Reference ref = Reference.fromIdentifier(source);
        return ref.toString();
    }
    // invalid input source
    throw new IllegalArgumentException("invalid input for ONNX model " + getName() + ": " + source);
}
Valid input sources:
attribute(field_name) - Document attribute
query(param_name) - Query parameter
constant(const_name) - Ranking constant
Function names defined in the rank profile
Output Mapping
Map ONNX output names to Vespa identifiers:
onnx-model encoder {
    file: models/encoder.onnx
    output last_hidden_state: embeddings
    output pooler_output: pooled
}
Reference outputs in ranking:
rank-profile semantic {
    first-phase {
        expression: onnx(encoder).embeddings
    }
}
ONNX Runtime Configuration
Configure ONNX Runtime execution in services.xml:
<container id="default" version="1.0">
    <component id="ai.vespa.modelintegration.evaluator.OnnxRuntime"
               bundle="model-integration">
        <config name="ai.vespa.modelintegration.evaluator.onnx-evaluator">
            <!-- Execution mode: sequential or parallel -->
            <executionMode>sequential</executionMode>
            <!-- Number of threads for parallel execution -->
            <interOpThreads>1</interOpThreads>
            <!-- Intra-op threads: -4 means CPUs/4, 0 means CPUs, >0 is explicit count -->
            <intraOpThreads>-4</intraOpThreads>
            <!-- GPU device: 0+ for GPU device ID, -1 for CPU -->
            <gpuDevice>-1</gpuDevice>
        </config>
    </component>
</container>
Execution Modes
Sequential (Default)
Single-threaded execution, best for low-latency inference:
<executionMode>sequential</executionMode>
<interOpThreads>1</interOpThreads>
Parallel
Multi-threaded execution for batch processing:
<executionMode>parallel</executionMode>
<interOpThreads>4</interOpThreads>
<intraOpThreads>2</intraOpThreads>
GPU Acceleration
Enable GPU inference with CUDA:
<gpuDevice>0</gpuDevice> <!-- Use first GPU -->
GPU support requires ONNX Runtime with CUDA provider. Ensure your deployment environment has compatible CUDA drivers.
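To check whether the ONNX Runtime build in a given Python environment ships the CUDA provider, you can list its execution providers. Note this is only a sanity check of your local environment; Vespa bundles its own ONNX Runtime:
import onnxruntime as ort

# Lists the execution providers compiled into this onnxruntime build;
# CUDA inference requires "CUDAExecutionProvider" to be present
print(ort.get_available_providers())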
Using ONNX Models
In Ranking Expressions
Reference ONNX models in rank profiles:
schema product {
    document product {
        field title type string {
            indexing: index | summary
        }
        field price type float {
            indexing: attribute
        }
        field category type string {
            indexing: attribute
        }
        field popularity type float {
            indexing: attribute
        }
        field timestamp type long {
            indexing: attribute
        }
    }
    onnx-model ranker {
        file: models/ranker.onnx
        input features: featureVector
    }
    rank-profile ml_ranking {
        function featureVector() {
            expression: tensor<float>(d0[5]):[
                attribute(price),
                query(user_score),
                fieldMatch(title).completeness,
                attribute(popularity),
                freshness(timestamp)
            ]
        }
        first-phase {
            expression: onnx(ranker).output
        }
    }
}
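The featureVector function above reads query(user_score) from the request. A hedged sketch of supplying it at query time (the endpoint, query text, and value 0.8 are placeholders; recent Vespa versions also expect the query feature declared in the rank profile via inputs { query(user_score) double }):
import requests

# Hypothetical search request passing the query(user_score) ranking feature
# to the ml_ranking profile defined above
response = requests.post(
    "http://localhost:8080/search/",
    json={
        "yql": "select * from product where userQuery()",
        "query": "running shoes",
        "ranking": "ml_ranking",
        "input.query(user_score)": 0.8,
    },
)
print(response.json())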
With Multiple Outputs
Access specific model outputs:
onnx-model multi_output {
    file: models/multi.onnx
    output output_scores: scores
    output output_embeddings: embeddings
}
rank-profile combined {
    first-phase {
        expression: onnx(multi_output).scores
    }
    second-phase {
        expression: sum(onnx(multi_output).embeddings * query(q_vec))
    }
}
Stateless Evaluation API
Use the ModelsEvaluator API for stateless inference:
// From: model-evaluation/src/main/java/ai/vespa/models/evaluation/ModelsEvaluator.java:17-24
/**
 * Evaluates machine-learned models added to Vespa applications and available as config form.
 * Usage:
 * <code>Tensor result = evaluator.bind("foo", value).bind("bar", value).evaluate()</code>
 *
 * @author bratseth
 */
public class ModelsEvaluator extends AbstractComponent {

    public FunctionEvaluator evaluatorOf(String modelName, String... names) {
        return requireModel(modelName).evaluatorOf(names);
    }
}
Access via REST API:
curl 'http://localhost:8080/model-evaluation/v1/my_model/eval' \
-d '{"input": [1.0, 2.0, 3.0, 4.0, 5.0]}'
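The same call from Python, as a minimal sketch (the model name my_model and the JSON payload mirror the curl example above; adjust to your model's actual input names):
import requests

# POST inputs to the stateless model-evaluation endpoint shown above
response = requests.post(
    "http://localhost:8080/model-evaluation/v1/my_model/eval",
    json={"input": [1.0, 2.0, 3.0, 4.0, 5.0]},
)
print(response.json())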
Model Optimization
Model Quantization
Reduce model size and improve performance with quantization:
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_quantized.onnx",
    weight_type=QuantType.QUInt8
)
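A quick way to see what quantization bought you, comparing on-disk sizes (file names follow the snippet above):
import os

# Compare model file sizes before and after dynamic quantization
original = os.path.getsize("model.onnx")
quantized = os.path.getsize("model_quantized.onnx")
print(f"original: {original / 1e6:.1f} MB, quantized: {quantized / 1e6:.1f} MB "
      f"({100 * quantized / original:.0f}% of original)")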
Model Simplification
Simplify ONNX graphs:
import onnx
from onnxsim import simplify

# Load and simplify model
model = onnx.load("model.onnx")
model_simplified, check = simplify(model)
assert check, "Simplified model is invalid"
onnx.save(model_simplified, "model_simplified.onnx")
Dynamic Shapes
Support variable batch sizes and sequence lengths:
# dummy_input is a tuple of (input_ids, attention_mask) tensors here;
# dynamic_axes keys must match the declared input and output names
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=['input_ids', 'attention_mask'],
    output_names=['output'],
    dynamic_axes={
        'input_ids': {0: 'batch', 1: 'sequence'},
        'attention_mask': {0: 'batch', 1: 'sequence'},
        'output': {0: 'batch', 1: 'sequence'}
    }
)
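To confirm the exported graph really accepts variable shapes, you can load it with onnxruntime and run two different batch and sequence sizes. A sketch; the input and output names match the export above, and the dummy token IDs are arbitrary:
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
for batch, seq in [(1, 16), (4, 32)]:
    # Arbitrary token IDs and an all-ones mask, just to exercise the shapes
    feeds = {
        "input_ids": np.random.randint(0, 1000, (batch, seq), dtype=np.int64),
        "attention_mask": np.ones((batch, seq), dtype=np.int64),
    }
    (output,) = session.run(["output"], feeds)
    print(batch, seq, "->", output.shape)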
OnnxEvaluator Interface
The core evaluation interface:
// From: model-integration/src/main/java/ai/vespa/modelintegration/evaluator/OnnxEvaluator.java:10-29
/**
 * Evaluator for ONNX models.
 *
 * @author bjorncs
 */
public interface OnnxEvaluator extends AutoCloseable {

    record IdAndType(String id, TensorType type) { }

    Tensor evaluate(Map<String, Tensor> inputs, String output);
    Map<String, Tensor> evaluate(Map<String, Tensor> inputs);
    Map<String, OnnxEvaluator.IdAndType> getInputs();
    Map<String, OnnxEvaluator.IdAndType> getOutputs();
    Map<String, TensorType> getInputInfo();
    Map<String, TensorType> getOutputInfo();

    @Override void close();
}
Common Model Types
Classification Models
onnx-model classifier {
    file: models/classifier.onnx
    input features: featureVector
    output output: logits
}
rank-profile classify {
    function featureVector() {
        expression: tensor<float>(d0[100]):[...]
    }
    first-phase {
        expression: onnx(classifier).logits
    }
}
Reranking Models
onnx-model cross_encoder {
    file: models/cross_encoder.onnx
    input input_ids: inputSequence
    input attention_mask: inputMask
}
rank-profile rerank {
    first-phase {
        expression: bm25(title) + bm25(body)
    }
    second-phase {
        expression: onnx(cross_encoder).logits{d0:0,d1:0}
        rerank-count: 100
    }
}
Embedding Models
See the Embeddings page for embedding-specific models.
Troubleshooting
Model Validation
Vespa validates models at deployment:
vespa deploy
# Check for errors like:
# "Model does not contain required input: 'input_ids'"
# "Model contains: input_tokens, attention_scores"
Inspect the model's declared inputs and outputs with the onnx Python package:
import onnx

model = onnx.load("model.onnx")

print("Inputs:")
for inp in model.graph.input:
    print(f"{inp.name}: {inp.type}")

print("Outputs:")
for out in model.graph.output:
    print(f"{out.name}: {out.type}")
Performance Tips
Reduce model size with quantization (int8, uint8)
Use dynamic batching for throughput
Enable GPU acceleration
Optimize intra-op thread count
Limit the number of concurrent evaluations
Monitor model size against available RAM
Debugging Checklist
Verify input tensor shapes and types
Check input/output name mappings
Validate that preprocessing matches training
Test the model with onnxruntime directly (see the sketch after this list)
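A minimal sketch of that last check, assuming a single-input model such as the classifier above (replace the file name and dummy shape with your own):
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")

# Print what the model actually expects, then run one dummy inference
for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)

name = session.get_inputs()[0].name
dummy = np.random.rand(1, 10).astype(np.float32)
print("output:", session.run(None, {name: dummy}))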
Examples
TensorFlow to ONNX
import tensorflow as tf
import tf2onnx

# Load TensorFlow model
model = tf.keras.models.load_model('model.h5')

# Convert to ONNX
spec = (tf.TensorSpec((None, 10), tf.float32, name="input"),)
output_path = "model.onnx"
model_proto, _ = tf2onnx.convert.from_keras(
    model,
    input_signature=spec,
    opset=14,
    output_path=output_path
)
scikit-learn to ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.ensemble import RandomForestClassifier

# Train sklearn model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Convert to ONNX
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
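It is worth confirming the converted model agrees with the original before deploying. A sketch reusing model, X_train, and the float_input name from above (skl2onnx classifiers typically emit a label output followed by probabilities):
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
sample = X_train[:5].astype(np.float32)

# First output of a converted sklearn classifier is the predicted label
labels, _ = session.run(None, {"float_input": sample})
assert np.array_equal(labels, model.predict(X_train[:5])), "ONNX and sklearn disagree"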
Next Steps
Embeddings - Use ONNX models for text embeddings
Model Evaluation - Stateless vs ranking evaluation
RAG Applications - Combine models with retrieval
Performance Tuning - Optimize model inference