Caterpillar is a powerful data ingestion and processing pipeline system written in Go. It's designed to handle complex data workflows by connecting multiple processing tasks in a pipeline, similar to how Unix pipes work in shell scripting. Each task processes data and passes it to the next task, creating a flexible and scalable data processing system.
Caterpillar enables you to:
- Ingest data from various sources (HTTP APIs, files, S3, SQS, etc.)
- Transform data using JQ queries, XPath, and custom logic
- Process data with sampling, filtering, and aggregation
- Output data to files, HTTP endpoints, or other destinations
- Chain operations in a pipeline for complex workflows
The system is particularly useful for:
- ETL (Extract, Transform, Load) processes
- API data aggregation and transformation
- Log processing and analysis
- Data migration and synchronization
- Real-time data streaming workflows
- Go 1.24.5 or later
- AWS CLI configured (for AWS-related tasks)
- Clone the repository:
  git clone <repository-url>
  cd caterpillar
- Build the project:
  go build -o caterpillar cmd/caterpillar/caterpillar.go
- Run a pipeline:
  ./caterpillar -conf test/pipelines/hello_name.yaml
Caterpillar is built around the concept of tasks that process records in a pipeline. Here's how it works:
- Records: Each piece of data in the pipeline is wrapped in a Record object containing:
  - Data: The actual data string
  - Origin: The name of the task that created this record
  - ID: A unique identifier for the record
  - Context: Additional metadata that can be shared between tasks
- Task Processing: Each task:
  - Receives records from its input channel
  - Processes the data according to its specific function
  - Sends processed records to its output channel
  - Closes the output channel when done processing all records
- Pipeline Flow: Records flow through the pipeline like this (a concrete example follows this list):
  Task A → Task B → Task C → Task D
  When Task A finishes processing all records, it closes its output channel. This triggers Task B to finish processing and close its output channel, creating a chain reaction that propagates to the end of the pipeline.
- Architecture Components:
  - Tasks are connected in sequence and process data
  - Channels pass data between tasks with configurable buffer sizes
  - Records contain data and metadata that flow through the pipeline
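As a rough illustration of this flow, here is a minimal four-task pipeline (task names, file paths, and the JQ query are illustrative; it assumes the file task acts as a source when reading and as a sink when writing, as noted in the task reference below):

tasks:
  - name: read_input        # Task A: source, emits records from a local file
    type: file
    path: input/orders.json
  - name: extract_orders    # Task B: transforms each record with a JQ query
    type: jq
    path: .orders
  - name: debug_print       # Task C: prints each record to the console
    type: echo
  - name: write_output      # Task D: sink, writes the processed records
    type: file
    path: output/orders_{{ macro "timestamp" }}.json

Once read_input has emitted all of its records and closed its output channel, the close propagates through extract_orders and debug_print until write_output finishes and the pipeline completes.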
By default, tasks handle errors gracefully:
- Non-critical errors are logged but don't stop the pipeline
- Task failures are recorded but processing continues
- Pipeline completion returns a count of processed records
However, you can configure tasks to fail the entire pipeline on error:
tasks:
- name: critical_task
type: http
fail_on_error: true # Pipeline stops if this task fails
When fail_on_error is set, the pipeline will return a non-zero exit code if any task configured with it encounters an error.
Context variables allow you to store data on a record that can be accessed later in the pipeline. This is especially useful when tasks are separated by one or more intermediate tasks.
Setting Context: Tasks can set context variables using JQ expressions:
tasks:
- name: extract_user_data
type: jq
path: .user.id
context:
user_id: . # Stores the user ID in context
user_name: .name # Stores the user name in context
Using Context: Later tasks can reference context variables using the {{ context "key" }} syntax:
tasks:
- name: use_context
type: file
path: users/{{ context "user_id" }}_{{ context "user_name" }}.json
Context Persistence: Context variables persist throughout the pipeline, so you can set them early and use them much later:
tasks:
- name: extract_data
type: jq
context:
timestamp: .timestamp
user_id: .user.id
- name: process_data
type: http
# Uses context set 2 tasks ago
- name: save_result
type: file
path: output/{{ context "user_id" }}_{{ context "timestamp" }}.txt
Important Notes:
- Context variables are tied to individual records
- When using explode: true in JQ tasks, new records inherit context from the original record (see the sketch after this list)
- Context variables are evaluated at runtime, not at pipeline startup
- Each task can set its own context variables that will be available to downstream tasks
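As a sketch of how explode and context interact (the field names, and the assumption that explode: true turns each element of the JQ result into its own record, are illustrative; the notes above only guarantee that exploded records inherit the parent record's context):

tasks:
  - name: tag_order          # set context on the original record
    type: jq
    path: .                  # identity query: pass the record through unchanged
    context:
      order_id: .order.id    # available to every downstream task
  - name: explode_items      # assumed: one output record per element of .order.items
    type: jq
    path: .order.items
    explode: true
  - name: write_items        # each exploded record still carries order_id
    type: file
    path: orders/{{ context "order_id" }}/{{ macro "uuid" }}.json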
Caterpillar supports several template functions for dynamic configuration. These are evaluated at different times:
- Pipeline initialization: env and secret functions are evaluated once when the pipeline starts
- Per record: macro and context functions are evaluated for each record as it's processed
Macros are evaluated per record and produce dynamic values:
- {{ macro "unixtime" }} - Current Unix timestamp
- {{ macro "timestamp" }} - Current ISO timestamp (2006-01-02T15:04:05Z07:00)
- {{ macro "microtimestamp" }} - Current microsecond timestamp
- {{ macro "uuid" }} - Generate a new UUID
Example:
tasks:
- name: write_file
type: file
path: output/data_{{ macro "timestamp" }}.txt
Access environment variables using the env function (evaluated at pipeline initialization):
tasks:
- name: api_call
type: http
endpoint: {{ env "API_ENDPOINT" }}
headers:
Authorization: Bearer {{ env "API_TOKEN" }}
Retrieve secrets from AWS Parameter Store (evaluated at pipeline initialization):
tasks:
- name: secure_api
type: http
headers:
Authorization: Bearer {{ secret "/prod/api/token" }}
Reference context variables set by upstream tasks (evaluated per record):
tasks:
- name: use_context
type: file
path: users/{{ context "user_id" }}_{{ context "user_name" }}.json
Caterpillar supports the following tasks, each of which can serve different roles depending on their configuration (a sketch combining several of them follows the list):
- aws_parameter_store - Read parameters from AWS Systems Manager Parameter Store
- compress - Compress or decompress data using various algorithms
- converter - Convert data between different formats (CSV, HTML, JSON, XML, SST)
- delay - Add controlled delays between record processing
- echo - Print data to console for debugging and monitoring
- file - Read from or write to local files and S3 (acts as source or sink)
- flatten - Flatten nested JSON structures into single-level key-value pairs
- heimdall - Submit jobs to Heimdall data orchestration platform and return results
- http - Make HTTP requests with OAuth support and retry logic
- http_server - Start an HTTP server to receive incoming data
- join - Combine multiple records into a single record
- jq - Transform JSON data using JQ queries
- replace - Perform regex-based text replacement and transformation
- sample - Sample data using various strategies (random, head, tail, nth, percent)
- split - Split data by specified delimiters
- sqs - Read from or write to AWS SQS queues (acts as source or sink)
- xpath - Extract data from XML/HTML using XPath expressions
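As a sketch of how several of these tasks can be combined (the endpoint URL and JQ query are illustrative; option names follow the examples elsewhere in this document):

tasks:
  - name: fetch_users        # http: pull JSON from an API
    type: http
    endpoint: https://api.example.com/users
    headers:
      Authorization: Bearer {{ env "API_TOKEN" }}
  - name: extract_users      # jq: keep only the field of interest
    type: jq
    path: .users
  - name: save_results       # file: write the output with a per-run timestamp
    type: file
    path: output/users_{{ macro "timestamp" }}.json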
Control the buffer size between tasks:
channel_size: 10000 # Default is 10,000
tasks:
- name: task1
type: echo
Each task supports common configuration options:
tasks:
- name: my_task
type: http
fail_on_error: true # Stop pipeline on error
context:
extracted_value: .data.value # Set context for downstream tasks
Tasks can be configured to fail the entire pipeline on error:
tasks:
- name: critical_task
type: http
fail_on_error: true # Pipeline stops if this task fails
See the test/pipelines/ directory for comprehensive examples of different pipeline configurations and task combinations.
Caterpillar was created at Pattern, Inc by Stan Babourine, with contributions from Will Graham, Prasad Lohakpure, Mahesh Kamble, Shubham Khanna, Ivan Hladush, Amol Udage, Divyanshu Tiwari, Dnyaneshwar Mane and Narayan Attarde.