Indexing and Document Processing

Overview

When documents are fed to Vespa, they pass through an indexing pipeline that transforms them before storage. This page covers three parts of that pipeline:

Indexing Language - Declarative expressions for field transformations
Document Processors - Custom Java components for complex processing
Indexing Pipeline - The complete flow from ingestion to storage

Indexing Language

The indexing language is a domain-specific language for transforming document fields during indexing.

Basic Syntax

Define indexing statements in your schema:

schema music {
    document music {
        field title type string {
            indexing: summary | index
        }
        
        field artist type string {
            indexing: summary | attribute
        }
        
        field year type int {
            indexing: summary | attribute
        }
    }
}

Indexing Expressions

The indexing language supports various expressions for field manipulation:

Input Expression

Read a field value from the document:

field my_field type string {
    indexing: input title | lowercase | index
}

Reference: indexinglanguage/src/main/java/com/yahoo/vespa/indexinglanguage/expressions/InputExpression.java:19

Output Expressions

Specify where to store the processed value:

index

Store in memory index for full-text search:

field title type string {
    indexing: input title | lowercase | index
}

attribute

Store as an in-memory attribute for fast access, filtering, and sorting:

field year type int {
    indexing: attribute
}

summary

Store in document summary for retrieval:

field description type string {
    indexing: summary
}

Reference: indexinglanguage/src/main/java/com/yahoo/vespa/indexinglanguage/expressions/IndexExpression.java:7

Transformation Expressions

The indexing language provides 87+ built-in expressions for data transformation:

field normalized_title type string {
    indexing: input title | lowercase | trim | normalize | index
}

field tokens type string {
    indexing: input text | tokenize | index
}
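Each | passes the output of one expression to the next. As a rough mental model only (this is not how Vespa implements it), the same left-to-right composition can be sketched in plain Java. PipelineSketch and pipe are hypothetical names, and String.toLowerCase and String.trim merely approximate the lowercase and trim expressions.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Analogy only: plain Java functions stand in for indexing expressions.
final class PipelineSketch {

    // Applies each step to the result of the previous one, left to right.
    static String pipe(String input, List<UnaryOperator<String>> steps) {
        String value = input;
        for (UnaryOperator<String> step : steps) {
            value = step.apply(value);
        }
        return value;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> steps = List.of(
                s -> s.toLowerCase(Locale.ROOT), // stand-in for lowercase
                String::trim);                   // stand-in for trim
        System.out.println(pipe("  Blue Train ", steps));
    }
}
```

The order of the steps matters in the same way the order of expressions matters in an indexing statement: each step only sees what the previous one produced.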

Common Expressions

Expression	Description	Example
`input`	Read field value	`input title`
`lowercase`	Convert to lowercase	`lowercase`
`tokenize`	Split into tokens	`tokenize`
`normalize`	Unicode normalization	`normalize`
`trim`	Remove leading and trailing whitespace	`trim`
`index`	Store in index	`index`
`attribute`	Store as attribute	`attribute`
`summary`	Include in summary	`summary`
`embed`	Generate embeddings	`embed embedder_name`
`flatten`	Flatten nested structures	`flatten`
`for_each`	Process array elements	`for_each { ... }`

Control Flow

Choice Expression

Use the first alternative that produces a value, falling back from left to right:

field display_title type string {
    indexing: (input title || input name || "Untitled") | summary
}
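The fallback behavior of || can be pictured in plain Java as "take the first alternative that yields a value". The sketch below is an analogy under stated assumptions: ChoiceSketch and firstPresent are hypothetical names, a Map stands in for the document, and none of this is Vespa API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Analogy only: a Map stands in for a document, and firstPresent mimics
// how a choice expression falls through to the next alternative.
final class ChoiceSketch {

    // Returns the value of the first listed field that is present in the
    // document, or the fallback when none of them is.
    static String firstPresent(Map<String, String> doc, List<String> fields, String fallback) {
        for (String field : fields) {
            String value = doc.get(field);
            if (value != null) {
                return value;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("name", "Kind of Blue");
        // "title" is missing, so the second alternative is used.
        System.out.println(firstPresent(doc, List.of("title", "name"), "Untitled"));
    }
}
```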

ForEach Expression

Process array elements:

field normalized_tags type array<string> {
    indexing: input tags | for_each { lowercase | trim } | index
}
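Conceptually, for_each applies the inner expressions to every element and keeps the results in the same order, much like a map over a list. A minimal plain-Java sketch of that idea follows; ForEachSketch and normalizeTags are hypothetical names, and the Java string methods only approximate lowercase and trim.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Analogy only: a stream map stands in for for_each { lowercase | trim }.
final class ForEachSketch {

    // Lowercases and trims every tag, preserving element order.
    static List<String> normalizeTags(List<String> tags) {
        return tags.stream()
                .map(tag -> tag.toLowerCase(Locale.ROOT).trim())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(normalizeTags(List.of(" Jazz", "BEBOP ", "modal")));
    }
}
```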

Script Expressions

Chain multiple operations. A statement that spans several lines goes inside an indexing { ... } block:

field processed_text type string {
    indexing {
        input raw_text |
            lowercase |
            trim |
            tokenize |
            normalize |
            index |
            summary;
    }
}

Embedding Generation

Generate embeddings during indexing:

schema doc {
    document doc {
        field text type string {
            indexing: summary | index
        }
    }
    
    field embedding type tensor<float>(x[384]) {
        indexing: input text | embed embedder | attribute
    }
}

The embed expression requires configuring an embedder in your services.xml.

Document Processors

Document processors are Java components that perform custom processing on documents before they’re indexed.

Creating a Document Processor

Extend DocumentProcessor and implement the process method:

import com.yahoo.docproc.DocumentProcessor;
import com.yahoo.docproc.Processing;
import com.yahoo.document.Document;
import com.yahoo.document.DocumentOperation;
import com.yahoo.document.DocumentPut;
import com.yahoo.document.datatypes.FieldValue;
import com.yahoo.document.datatypes.StringFieldValue;

public class MusicEnricherProcessor extends DocumentProcessor {

    @Override
    public Progress process(Processing processing) {
        for (DocumentOperation op : processing.getDocumentOperations()) {
            // Only puts carry a full document to enrich
            if (op instanceof DocumentPut) {
                enrichDocument(((DocumentPut) op).getDocument());
            }
        }
        return Progress.DONE;
    }

    private void enrichDocument(Document doc) {
        // getFieldValue returns a FieldValue wrapper, not a java.lang.String
        FieldValue artist = doc.getFieldValue("artist");
        if (artist instanceof StringFieldValue) {
            String normalized = ((StringFieldValue) artist).getString().toLowerCase().trim();
            // Assumes the schema declares an artist_normalized string field
            doc.setFieldValue("artist_normalized", new StringFieldValue(normalized));
        }
    }
}

Reference: docproc/src/main/java/com/yahoo/docproc/DocumentProcessor.java:45

Processing Return Values

Document processors return a Progress value indicating the outcome:

// Processing completed successfully
return Progress.DONE;

Reference: docproc/src/main/java/com/yahoo/docproc/DocumentProcessor.java:108-150

Accessing Document Operations

The Processing object contains all document operations:

import com.yahoo.docproc.Processing;
import com.yahoo.document.DocumentOperation;
import com.yahoo.document.DocumentPut;
import com.yahoo.document.DocumentUpdate;
import com.yahoo.document.DocumentRemove;

@Override
public Progress process(Processing processing) {
    for (DocumentOperation op : processing.getDocumentOperations()) {
        if (op instanceof DocumentPut) {
            DocumentPut put = (DocumentPut) op;
            processPut(put.getDocument());
        } else if (op instanceof DocumentUpdate) {
            DocumentUpdate update = (DocumentUpdate) op;
            processUpdate(update);
        } else if (op instanceof DocumentRemove) {
            DocumentRemove remove = (DocumentRemove) op;
            processRemove(remove.getId());
        }
    }
    return Progress.DONE;
}

Reference: docproc/src/main/java/com/yahoo/docproc/Processing.java:204-207

Context Variables

Store and retrieve context data across processors:

@Override
public Progress process(Processing processing) {
    // Set context variable
    processing.setVariable("start_time", System.currentTimeMillis());
    
    // Get context variable
    Long startTime = (Long) processing.getVariable("start_time");
    
    // Check if variable exists
    if (processing.hasVariable("user_id")) {
        String userId = (String) processing.getVariable("user_id");
    }
    
    return Progress.DONE;
}

Reference: docproc/src/main/java/com/yahoo/docproc/Processing.java:140-176

Asynchronous Processing

For operations requiring external calls:

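import java.util.concurrent.CompletableFuture;

import com.yahoo.docproc.DocumentProcessor;
import com.yahoo.docproc.Processing;
import com.yahoo.document.Document;
import com.yahoo.document.DocumentOperation;
import com.yahoo.document.DocumentPut;
import com.yahoo.document.datatypes.FieldValue;
import com.yahoo.document.datatypes.StringFieldValue;

public class AsyncEnricherProcessor extends DocumentProcessor {

    @Override
    @SuppressWarnings("unchecked")
    public Progress process(Processing processing) {
        for (DocumentOperation op : processing.getDocumentOperations()) {
            if (!(op instanceof DocumentPut)) {
                continue;
            }
            Document doc = ((DocumentPut) op).getDocument();
            FieldValue artist = doc.getFieldValue("artist");
            if (!(artist instanceof StringFieldValue)) {
                continue;
            }

            // Keep the pending lookup in the processing context so a repeated
            // call resumes it instead of starting a new one
            String key = "enrichment_" + doc.getId();
            CompletableFuture<ArtistInfo> future =
                (CompletableFuture<ArtistInfo>) processing.getVariable(key);

            if (future == null) {
                // First call: start the lookup once and remember it
                // (fetchArtistInfo and ArtistInfo are application code, not shown)
                processing.setVariable(key,
                    fetchArtistInfo(((StringFieldValue) artist).getString()));
                return Progress.LATER;
            }
            if (!future.isDone()) {
                // Still waiting: ask to be called again
                return Progress.LATER;
            }
            if (!future.isCompletedExceptionally()) {
                // Apply the result on the processing thread, not in a callback
                doc.setFieldValue("genre", new StringFieldValue(future.join().getGenre()));
            }
        }
        return Progress.DONE;
    }
}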

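When a processor returns Progress.LATER, it is called again for the same processing. Track state in the processing context so that a repeated call resumes the pending work instead of restarting it or looping forever.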
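To see why state tracking matters, the following self-contained simulation re-invokes a processor until it reports completion. It does not use the Vespa API: LaterSketch, its Progress enum, and the Map used as context are hypothetical stand-ins for DocumentProcessor.Progress and the Processing variables, and the "async" work is a CompletableFuture completed by hand.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Simulation only: none of these types are Vespa classes.
final class LaterSketch {

    enum Progress { DONE, LATER }

    // Returns the lookup stored in the context, or null if none was started.
    @SuppressWarnings("unchecked")
    static CompletableFuture<String> pending(Map<String, Object> context) {
        return (CompletableFuture<String>) context.get("pending");
    }

    // One invocation of the processor. The pending lookup lives in the
    // context, so a repeated call resumes it instead of starting over.
    static Progress process(Map<String, Object> context, Map<String, String> doc) {
        CompletableFuture<String> lookup = pending(context);
        if (lookup == null) {
            int started = (Integer) context.getOrDefault("lookups_started", 0);
            context.put("lookups_started", started + 1);
            context.put("pending", new CompletableFuture<String>());
            return Progress.LATER;
        }
        if (!lookup.isDone()) {
            return Progress.LATER;
        }
        doc.put("genre", lookup.join());
        return Progress.DONE;
    }

    public static void main(String[] args) {
        Map<String, Object> context = new HashMap<>();
        Map<String, String> doc = new HashMap<>();
        System.out.println(process(context, doc)); // starts the lookup
        System.out.println(process(context, doc)); // lookup still pending
        pending(context).complete("jazz");         // the "remote call" finishes
        System.out.println(process(context, doc)); // result is applied
        System.out.println(context.get("lookups_started"));
    }
}
```

However many times the processor is re-invoked while the lookup is pending, only one lookup is started, because the decision to start is based on what is already stored in the context.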

Configuring Document Processors

Define processors in services.xml:

<services version="1.0">
    <container version="1.0" id="default">
        <document-processing>
            <chain id="default" inherits="indexing">
                <documentprocessor id="com.example.MusicEnricherProcessor"/>
                <documentprocessor id="com.example.ValidationProcessor"/>
            </chain>
        </document-processing>
        
        <nodes>
            <node hostalias="node1"/>
        </nodes>
    </container>
</services>

Multiple Processing Chains

Create different chains for different document types:

<document-processing>
    <chain id="music-chain" inherits="indexing">
        <documentprocessor id="com.example.MusicEnricherProcessor"/>
    </chain>
    
    <chain id="user-chain" inherits="indexing">
        <documentprocessor id="com.example.UserValidationProcessor"/>
    </chain>
</document-processing>

Indexing Pipeline

The complete indexing flow:

Document Reception

Vespa receives the document via feed client or HTTP API.

Document Processing

Document processors in the chain execute sequentially:

Document → Processor 1 → Processor 2 → ... → Processor N

Indexing Language Execution

Field-level transformations defined in the schema are applied.

Storage

Processed document is stored:

Fields marked index go to memory index

Fields marked attribute go to attribute storage

Fields marked summary go to document summary
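Sequential execution in the Document Processing step means each processor sees the changes made by the ones before it. A minimal plain-Java sketch of that idea follows, with a Map standing in for the document; ChainSketch, run, and exampleChain are hypothetical names, and this is not the Vespa chain implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Analogy only: Consumers stand in for document processors.
final class ChainSketch {

    // Runs every processor in order against the same document.
    static void run(Map<String, String> doc, List<Consumer<Map<String, String>>> chain) {
        for (Consumer<Map<String, String>> processor : chain) {
            processor.accept(doc);
        }
    }

    // A two-step example chain: the second step reads what the first wrote.
    static List<Consumer<Map<String, String>>> exampleChain() {
        Consumer<Map<String, String>> trimArtist =
                doc -> doc.computeIfPresent("artist", (field, value) -> value.trim());
        Consumer<Map<String, String>> buildLabel =
                doc -> doc.put("label", doc.get("artist") + " (" + doc.get("year") + ")");
        return List.of(trimArtist, buildLabel);
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("artist", "  John Coltrane ");
        doc.put("year", "1958");
        run(doc, exampleChain());
        System.out.println(doc.get("label"));
    }
}
```

Swapping the two steps would build the label from the untrimmed artist value, which is why the order of processors in a chain is part of its behavior.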

Error Handling

Handle errors in document processors:

@Override
public Progress process(Processing processing) {
    try {
        for (DocumentOperation op : processing.getDocumentOperations()) {
            validateOperation(op);
        }
        return Progress.DONE;
    } catch (ValidationException e) {
        log.warning("Validation failed: " + e.getMessage());
        return Progress.FAILED.withReason(e.getMessage());
    } catch (Exception e) {
        log.severe("Unexpected error: " + e.getMessage());
        return Progress.PERMANENT_FAILURE;
    }
}

Timeouts

Monitor and enforce timeouts:

import java.time.Duration;

@Override
public Progress process(Processing processing) {
    Duration timeLeft = processing.timeLeft();

    if (timeLeft.toMillis() < 1000) {
        log.warning("Processing timeout approaching");
        // Report a failure with a reason when the time budget is nearly spent
        return Progress.FAILED.withReason("Less than 1 second left to process");
    }

    // Process with remaining time
    return Progress.DONE;
}

Reference: docproc/src/main/java/com/yahoo/docproc/Processing.java:232-237
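The budget check itself is a plain deadline comparison and can be sketched independently of Vespa. In the example below, DeadlineSketch, timeLeft, and hasBudget are hypothetical names, and only java.time is used; how Vespa derives the value returned by processing.timeLeft() is not shown here.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch only: a hand-rolled deadline check, not Vespa code.
final class DeadlineSketch {

    // Time remaining until the deadline, never negative.
    static Duration timeLeft(Instant now, Instant deadline) {
        Duration left = Duration.between(now, deadline);
        return left.isNegative() ? Duration.ZERO : left;
    }

    // True when the remaining time covers the estimated cost of the work.
    static boolean hasBudget(Duration left, Duration needed) {
        return left.compareTo(needed) >= 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Duration left = timeLeft(now, now.plusMillis(500));
        // 500 ms remain, but the work is estimated at one second.
        System.out.println(hasBudget(left, Duration.ofSeconds(1)));
    }
}
```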

Best Practices

Keep Indexing Expressions Simple

Use indexing language for simple transformations. Move complex logic to document processors:

# Good: simple transformation
field title type string {
    indexing: input title | lowercase | index
}

# Complex logic: use a document processor instead

Make Processors Stateless

Document processors must be thread-safe. Avoid mutable instance variables:

public class SafeProcessor extends DocumentProcessor {
    // Good: immutable configuration (placeholder value for illustration)
    private final String apiEndpoint = "https://api.example.com";

    // Bad: mutable state shared between threads
    // private int counter;

    @Override
    public Progress process(Processing processing) {
        // Use local variables for per-call state
        int localCounter = 0;
        return Progress.DONE;
    }
}
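The same rule can be exercised outside Vespa. In the hedged sketch below, one instance with only final fields is called from several threads and every call returns the same result; StatelessSketch and its prefix field are hypothetical and stand in for a processor and its configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: a stand-in for a processor, not a Vespa component.
final class StatelessSketch {

    // Immutable configuration: safe to share between threads.
    private final String prefix;

    StatelessSketch(String prefix) {
        this.prefix = prefix;
    }

    // All working state lives in local variables, so concurrent calls
    // cannot interfere with each other.
    String process(String artist) {
        String normalized = artist.trim().toLowerCase(Locale.ROOT);
        return prefix + normalized;
    }

    public static void main(String[] args) throws Exception {
        StatelessSketch processor = new StatelessSketch("artist:");
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> results = new ArrayList<>();
            for (int i = 0; i < 100; i++) {
                results.add(pool.submit(() -> processor.process("  Miles Davis ")));
            }
            for (Future<String> result : results) {
                if (!result.get().equals("artist:miles davis")) {
                    throw new IllegalStateException("unexpected result");
                }
            }
            System.out.println("all calls agreed");
        } finally {
            pool.shutdown();
        }
    }
}
```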

Handle Async Operations Properly

Track async operation state to avoid reprocessing:

if (!processing.hasVariable("async_started")) {
    // Start async operation
    startAsyncOperation();
    processing.setVariable("async_started", true);
    return Progress.LATER;
}

Use Appropriate Progress Codes

Return the correct progress code:

DONE - Processing complete

LATER - Need more time (async operation)

FAILED - This document failed (temporary)

PERMANENT_FAILURE - Critical error (disables processor)
