This guide covers common Vespa operational issues, their symptoms, root causes, and solutions.
Diagnostic Approach
1. Identify symptoms: gather error messages, metrics, and logs.
2. Check service health: verify all services are running (see the sketch after this list).
vespa-model-inspect services
curl http://localhost:19050/state/v1/health
3. Review metrics: look for anomalies in key metrics.
4. Examine logs: check for errors.
vespa-logfmt -l all -f /opt/vespa/logs/vespa/vespa.log
5. Apply fix: implement the solution and verify resolution.
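A minimal Python sketch of steps 2 and 3, assuming the default localhost port used in the commands above and the /state/v1 JSON shapes shown in the jq examples later in this guide:

# Poll the health and metrics endpoints and print anything that looks anomalous.
import json
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())

health = fetch("http://localhost:19050/state/v1/health")
print("health:", health.get("status", {}).get("code"))

metrics = fetch("http://localhost:19050/state/v1/metrics")
for entry in metrics.get("metrics", {}).get("values", []):
    name = entry.get("name", "")
    if "error" in name or "latency" in name:
        print(name, entry.get("values"))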
High Query Latency
Symptoms:
Slow query responses (> 1 second)
Increasing query latency over time
Timeout errors
Check these metrics:
QUERY_LATENCY // Overall latency
QUERY_CONTAINER_LATENCY // Container processing time
CONTENT_PROTON_DOCUMENTDB_MATCHING_QUERY_LATENCY // Content node latency
CONTENT_PROTON_DOCSUM_LATENCY // Document summary latency
Common Causes and Solutions
1. Inefficient Rank Profiles
# Identify slow rank profiles
curl http://localhost:19050/state/v1/metrics | \
jq '.metrics.values[] | select(.name | contains("rank_profile.query_latency"))'
Solution:
rank-profile optimized {
    first-phase {
        expression: bm25(title)  # Fast first phase
    }
    second-phase {
        expression: lightgbm("model.json")
        rerank-count: 100  # Limit expensive reranking
    }
}
2. Thread Pool Saturation
// Check for thread pool issues
JDISC_THREAD_POOL_WORK_QUEUE_SIZE // Growing queue
JDISC_THREAD_POOL_REJECTED_TASKS // Rejected requests
CONTENT_PROTON_EXECUTOR_MATCH_QUEUESIZE // Match queue depth
Solution:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <requestthreads>
        <count>16</count> <!-- Increase threads -->
      </requestthreads>
    </searchnode>
  </tuning>
</content>
3. Large Result Sets
Symptom: Queries with hits=1000 are slow
Solution:
# Limit result size
/search/?query=foo&hits=100
# Use pagination
/search/?query=foo&hits=20&offset=20
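A minimal Python sketch of paging through results with hits and offset, assuming the default query endpoint on localhost:8080 (the query string "foo" is just the placeholder used above):

# Fetch results page by page instead of asking for one huge result set.
import json
import urllib.parse
import urllib.request

def search_page(query, hits=20, offset=0):
    params = urllib.parse.urlencode({"query": query, "hits": hits, "offset": offset})
    with urllib.request.urlopen(f"http://localhost:8080/search/?{params}", timeout=10) as resp:
        return json.loads(resp.read())

offset, page_size = 0, 20
while True:
    page = search_page("foo", hits=page_size, offset=offset)
    children = page.get("root", {}).get("children", [])
    if not children:
        break
    for hit in children:
        print(hit.get("id"))
    offset += page_size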
4. Expensive Grouping Operations
Solution: Optimize grouping queries:
# Before (slow)
select * from sources * where true |
    all(group(category) each(output(count())))
# After (faster)
select * from sources * where true |
    all(group(category) max(100) each(output(count())))
Query Errors
Timeout Errors
Backend Communication Errors
Illegal Query Errors
Error: Query timed out after 5000ms
Check:
- ERROR_TIMEOUT metric
- CONTENT_PROTON_DOCUMENTDB_MATCHING_SOFT_DOOMED_QUERIES
Solutions:
1. Increase the query timeout (see the sketch below)
2. Optimize slow queries
3. Add more content nodes
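As a sketch of the first solution, the timeout can be raised per request with the standard timeout query parameter (assuming the default query endpoint on localhost:8080; 10s is an arbitrary example value):

# Raise the query timeout for a single request.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"query": "foo", "timeout": "10s"})
url = f"http://localhost:8080/search/?{params}"
with urllib.request.urlopen(url, timeout=15) as resp:  # client timeout > query timeout
    result = json.loads(resp.read())
print(result.get("root", {}).get("fields", {}).get("totalCount"))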
Feeding Issues
Slow Document Ingestion
Symptoms:
Low feed throughput (< 100 docs/sec on capable hardware)
High feed latency
Growing queue of pending operations
Check these metrics:
HTTPAPI_LATENCY // Feed operation latency
HTTPAPI_PENDING // Pending operations
HTTPAPI_QUEUED_OPERATIONS // Queued operations
HTTPAPI_FAILED_TIMEOUT // Timeout failures
1. Use Async Operations
import asyncio
import aiohttp

async def feed_document(session, doc):
    # The document path below is a placeholder; use each document's real namespace/doctype/docid.
    async with session.post(
        'http://localhost:8080/document/v1/namespace/doctype/docid',
        json=doc,
        params={'timeout': '180s'}
    ) as response:
        return await response.json()

async def batch_feed(documents):
    async with aiohttp.ClientSession() as session:
        tasks = [feed_document(session, doc) for doc in documents]
        results = await asyncio.gather(*tasks)
        return results
2. Check Resource Limits
// Monitor feeding blocked status
CONTENT_PROTON_RESOURCE_USAGE_FEEDING_BLOCKED // 1 = blocked
CONTENT_PROTON_RESOURCE_USAGE_MEMORY
CONTENT_PROTON_RESOURCE_USAGE_DISK
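A feed client can also poll the feeding-blocked metric and back off instead of piling up failed operations. A minimal sketch, assuming the metric is exposed on the same /state/v1/metrics endpoint used above under the dotted name content.proton.resource_usage.feeding_blocked, with 1 meaning blocked:

# Back off feeding while the content node reports that feeding is blocked.
import json
import time
import urllib.request

METRICS_URL = "http://localhost:19050/state/v1/metrics"  # assumed endpoint, as used above

def feeding_blocked():
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        data = json.loads(resp.read())
    for entry in data.get("metrics", {}).get("values", []):
        # Assumed metric name; adjust to what your metrics endpoint actually reports.
        if entry.get("name") == "content.proton.resource_usage.feeding_blocked":
            return entry.get("values", {}).get("last", 0) >= 1
    return False

def feed_with_backoff(batches, send_batch):
    for batch in batches:
        while feeding_blocked():
            time.sleep(5)      # wait for resource usage to drop below the limits
        send_batch(batch)      # e.g. drive the batch_feed() coroutine shown above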
If feeding is blocked, raise the resource limits:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <resource-limits>
        <memory>0.90</memory> <!-- Increase limit -->
        <disk>0.85</disk>
      </resource-limits>
    </searchnode>
  </tuning>
</content>
3. Optimize Document Structure
Remove unnecessary fields
Use appropriate field types
Enable compression for large text fields
Feed Failures
Condition Not Met
Document Not Found
Insufficient Storage
Error: Condition did not match document
Metric: HTTPAPI_CONDITION_NOT_MET
Cause: Test-and-set condition failed
Solution:
- Verify the document exists
- Check the condition logic
- Handle concurrent updates by retrying (see the sketch below)
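A minimal sketch of retrying a failed test-and-set condition, assuming the /document/v1 API on localhost:8080, that a failed condition is reported as HTTP 412, and that the namespace/doctype path, field names, and condition string are placeholders:

# Conditional (test-and-set) update that retries when the condition does not match.
import json
import time
import urllib.error
import urllib.parse
import urllib.request

def conditional_update(doc_id, fields, condition, retries=3):
    query = urllib.parse.urlencode({"condition": condition})
    url = f"http://localhost:8080/document/v1/namespace/doctype/docid/{doc_id}?{query}"
    body = json.dumps({"fields": fields}).encode()
    for attempt in range(retries):
        req = urllib.request.Request(url, data=body, method="PUT",
                                     headers={"Content-Type": "application/json"})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return json.loads(resp.read())
        except urllib.error.HTTPError as err:
            if err.code == 412:                  # condition did not match (assumed status code)
                time.sleep(0.1 * (attempt + 1))  # back off, re-read current state, then retry
                continue
            raise
    raise RuntimeError(f"condition still not met after {retries} attempts")

# Example (placeholder names): only apply the update if the stored price is still 100.
# conditional_update("doc1", {"price": {"assign": 90}}, "doctype.price==100")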
Memory Issues
High Memory Usage
Critical Symptoms:
Memory usage > 90%
Feeding blocked due to memory
Frequent GC pauses
OOM errors
Identify Memory Consumers
# Check memory metrics
curl http://localhost:19050/state/v1/metrics | \
jq '.metrics.values[] | select(.name | contains("memory_usage"))'
Key metrics:
CONTENT_PROTON_DOCUMENTDB_MEMORY_USAGE_USED_BYTES
CONTENT_PROTON_DOCUMENTDB_ATTRIBUTE_MEMORY_USAGE_USED_BYTES
CONTENT_PROTON_DOCUMENTDB_INDEX_MEMORY_USAGE_USED_BYTES
CONTENT_PROTON_DOCUMENTDB_READY_DOCUMENT_STORE_MEMORY_USAGE_USED_BYTES
Check Attribute Usage
Attributes are stored in memory. Large or high-cardinality attributes consume significant memory.
# Check attribute address space usage
curl http://localhost:19050/state/v1/metrics | \
jq '.metrics.values[] | select(.name == "content.proton.documentdb.attribute.resource_usage.address_space")'
Apply Solutions
Reduce attribute memory:
schema product {
    document product {
        # Remove attribute from large fields
        field description type string {
            indexing: summary | index  # Don't use attribute
        }
        # Use paged attributes for large datasets
        field tags type array<string> {
            indexing: summary | attribute
            attribute: paged  # Store on disk, page into memory
        }
    }
}
Increase memory limits:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <resource-limits>
        <memory>0.90</memory>
      </resource-limits>
    </searchnode>
  </tuning>
</content>
JVM Memory Issues (Containers)
// Check GC metrics
JDISC_GC_MS // GC pause time
JDISC_GC_COUNT // GC frequency
MEM_HEAP_USED // Heap usage
Symptoms:
GC pauses > 500ms
GC consuming > 10% CPU
Heap consistently > 80% used
Solutions:
Increase heap size:
<container version="1.0" id="default">
  <nodes>
    <jvm options="-Xms16g -Xmx16g" />
  </nodes>
</container>
Tune GC:
<jvm options="-Xms16g -Xmx16g
              -XX:+UseG1GC
              -XX:MaxGCPauseMillis=200
              -XX:InitiatingHeapOccupancyPercent=70" />
Identify memory leaks:
# Generate heap dump
jmap -dump:format=b,file=heap.bin <pid>
# Analyze with MAT or VisualVM
Disk Issues
Disk Full
Check Disk Usage
CONTENT_PROTON_RESOURCE_USAGE_DISK // Overall usage
CONTENT_PROTON_DOCUMENTDB_DISK_USAGE // Per document DB
CONTENT_PROTON_DOCUMENTDB_READY_DOCUMENT_STORE_DISK_USAGE
CONTENT_PROTON_DOCUMENTDB_READY_DOCUMENT_STORE_DISK_BLOAT
CONTENT_PROTON_TRANSACTIONLOG_DISK_USAGE
Identify Causes
High bloat:
# Check disk bloat
curl http://localhost:19050/state/v1/metrics | \
jq '.metrics.values[] | select(.name | contains("disk_bloat"))'
Bloat > 30% indicates inefficient storage.
Apply Solutions
1. Enable compression:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <summary>
        <store>
          <compression>
            <type>lz4</type>
            <level>9</level>
          </compression>
        </store>
      </summary>
    </searchnode>
  </tuning>
</content>
2. Trigger compaction:
# Compact document store
vespa-visit -v -s 'id.user==1234' --fieldset '[document]' > /dev/null
3. Remove old documents:
# Remove documents by selection
curl -X DELETE 'http://localhost:8080/document/v1/namespace/doctype/docid?selection=timestamp<1640000000'
4. Add more nodes:
Scale horizontally to distribute data.
Slow Disk I/O
Symptoms:
High query latency
Slow feed operations
High disk queue depth
Check:
CONTENT_PROTON_DOCUMENTDB_INDEX_IO_SEARCH_READ_BYTES
CONTENT_PROTON_DOCUMENTDB_INDEX_IO_SEARCH_CACHED_READ_BYTES
Solutions:
Optimize cache usage:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <!-- Increase document store cache -->
      <summary>
        <store>
          <cache>
            <maxsize-percent>10</maxsize-percent>
          </cache>
        </store>
      </summary>
      <!-- Increase index cache -->
      <diskindexcache>
        <size>4294967296</size> <!-- 4 GB -->
      </diskindexcache>
    </searchnode>
  </tuning>
</content>
Monitor cache hit rates:
CONTENT_PROTON_DOCUMENTDB_READY_DOCUMENT_STORE_CACHE_HIT_RATE
CONTENT_PROTON_INDEX_CACHE_POSTINGLIST_HIT_RATE
Target hit rate > 90%.
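A minimal sketch of checking these two hit-rate metrics against the 90% target, assuming they appear on the /state/v1/metrics endpoint under dotted names (the names below are assumptions derived from the constants above) and report a rate between 0 and 1:

# Warn when document-store or postinglist cache hit rates fall below the target.
import json
import urllib.request

TARGET = 0.90
WATCHED = (
    "document_store.cache.hit_rate",     # assumed metric name fragments
    "cache.postinglist.hit_rate",
)

with urllib.request.urlopen("http://localhost:19050/state/v1/metrics", timeout=5) as resp:
    data = json.loads(resp.read())

for entry in data.get("metrics", {}).get("values", []):
    name = entry.get("name", "")
    if any(fragment in name for fragment in WATCHED):
        rate = entry.get("values", {}).get("average", 0.0)
        status = "OK" if rate >= TARGET else "LOW"
        print(f"{status} {name}: {rate:.2%}")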
Use faster storage:
NVMe SSDs for best performance
Ensure sufficient IOPS provisioned
Network Issues
Connection Timeouts
Client to Container
Container to Content
Symptom: Connection timeout to container
Check:
- Container service status
- Network connectivity: ping <container-host>
- Port accessibility: telnet <container-host> 8080
Solution:
- Verify container is running
- Check firewall rules
- Review load balancer configuration
High Connection Count
// Monitor connections
SERVER_NUM_OPEN_CONNECTIONS // Current connections
SERVER_CONNECTIONS_OPEN_MAX // Peak connections
SERVER_CONNECTION_DURATION_MEAN // Average duration
If connections are high:
Check for connection leaks in clients
Tune connection timeouts:
<container version="1.0" id="default">
  <http>
    <server id="default" port="8080">
      <config name="jdisc.http.connector">
        <idleTimeout>30.0</idleTimeout>
        <maxConnectionLife>300.0</maxConnectionLife>
      </config>
    </server>
  </http>
</container>
Cluster State Issues
Node Down
Identify Down Node
vespa-model-inspect services
curl http://localhost:19050/state/v1/health
Check Node Logs
# On the affected node
vespa-logfmt -l all -f /opt/vespa/logs/vespa/vespa.log | tail -100
Common Issues
Out of memory: java.lang.OutOfMemoryError: Java heap space
Solution: Increase heap size or reduce memory usage
Disk full:
Solution: Free disk space or add capacity
Configuration error: Config error: Invalid configuration
Solution: Review and fix services.xml
Restart Service
# Restart Vespa on the node
vespa-stop-services
vespa-start-services
Split Brain / Cluster State Divergence
Symptoms:
Nodes report different cluster states
Inconsistent query results
Feed operations fail intermittently
Solution:
1. Check the cluster controller status
2. Force a cluster state update:
vespa-set-node-state -c <cluster> -t storage -i <node> up
3. If issues persist, restart the cluster controller
Performance Degradation
1. Compare metrics before/after:
# Export current metrics
curl http://localhost:19092/metrics/v2/values > metrics-current.json
# Compare with baseline
# Look for changes in:
# - query_latency
# - throughput (QPS)
# - resource utilization
# - error rates
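A minimal sketch of comparing a current snapshot against a saved baseline and flagging large relative changes. It assumes both files were exported from /metrics/v2/values as shown above, that list positions are stable between the two snapshots, and that metrics-baseline.json is a file you saved earlier; the 20% threshold is arbitrary:

# Flatten two metrics snapshots and report values that changed by more than 20%.
import json

def flatten(obj, prefix=""):
    # Collect numeric leaves keyed by their JSON path.
    out = {}
    if isinstance(obj, dict):
        for key, val in obj.items():
            out.update(flatten(val, f"{prefix}.{key}" if prefix else key))
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            out.update(flatten(val, f"{prefix}[{i}]"))
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        out[prefix] = float(obj)
    return out

with open("metrics-baseline.json") as f:
    baseline = flatten(json.load(f))
with open("metrics-current.json") as f:
    current = flatten(json.load(f))

for path, old in sorted(baseline.items()):
    new = current.get(path)
    if new is None or old == 0:
        continue
    change = (new - old) / old
    if abs(change) > 0.20:  # arbitrary 20% threshold
        print(f"{change:+.0%}  {path}: {old} -> {new}")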
2. Review recent changes:
Configuration changes
Schema modifications
Application updates
Infrastructure changes
3. Check resource saturation:
// CPU
CONTENT_PROTON_RESOURCE_USAGE_CPU_UTIL_READ
CONTENT_PROTON_RESOURCE_USAGE_CPU_UTIL_WRITE
// Memory
MEM_HEAP_USED
CONTENT_PROTON_RESOURCE_USAGE_MEMORY
// Thread pools
JDISC_THREAD_POOL_ACTIVE_THREADS
CONTENT_PROTON_EXECUTOR_MATCH_UTILIZATION
4. Profile queries:
# Enable query tracing
/search/?query=test&trace.level=5
# Analyze trace output for bottlenecks
Log Analysis
# Filter by log level
vespa-logfmt -l error,warning -f /opt/vespa/logs/vespa/vespa.log
# Filter by component
vespa-logfmt -c searchnode -f /opt/vespa/logs/vespa/vespa.log
# Follow logs in real-time
vespa-logfmt -f -F /opt/vespa/logs/vespa/vespa.log
Metric Queries
# Get specific metric
curl http://localhost:19050/state/v1/metrics | \
jq '.metrics.values[] | select(.name == "query_latency")'
# Get all metrics for a component
curl http://localhost:19050/state/v1/metrics | \
jq '.metrics.values[] | select(.name | startswith("content.proton"))'
Query Tracing
# Basic trace
curl 'http://localhost:8080/search/?query=test&trace.level=2'
# Detailed trace
curl 'http://localhost:8080/search/?query=test&trace.level=5'
# Trace specific components
curl 'http://localhost:8080/search/?query=test&trace.rules=rank:5'
Getting Help
Vespa Slack: join the community for real-time help
GitHub Issues: report bugs and request features
Stack Overflow: search existing questions or ask new ones
Documentation: browse the official documentation
Next Steps
Monitoring: set up proactive monitoring
Tuning: optimize performance
Scaling: scale your cluster