Merged
1 change: 1 addition & 0 deletions docs/features/index.md
@@ -310,6 +310,7 @@
- [Calcite Query Engine](sql/calcite-query-engine.md)
- [Flint Index Operations](sql/flint-index-operations.md)
- [Flint Query Scheduler](sql/flint-query-scheduler.md)
- [PPL Aggregate Functions](sql/ppl-aggregate-functions.md)
- [PPL Documentation](sql/ppl-documentation.md)
- [PPL Patterns Command](sql/ppl-patterns-command.md)
- [PPL Rename Command](sql/ppl-rename-command.md)
198 changes: 198 additions & 0 deletions docs/features/sql/ppl-aggregate-functions.md
@@ -0,0 +1,198 @@
# PPL Aggregate Functions

## Summary

PPL (Piped Processing Language) aggregate functions enable statistical analysis and data aggregation in OpenSearch queries. These functions work with the `stats` and `eventstats` commands to compute aggregations across documents, supporting operations like counting, averaging, collecting values into arrays, and retrieving first/last values based on document or time order.

## Details

### Architecture

```mermaid
graph TB
subgraph "PPL Query Processing"
Q[PPL Query] --> Parser[PPL Parser]
Parser --> AST[Abstract Syntax Tree]
AST --> Visitor[Calcite Visitor]
end

subgraph "Aggregate Function Resolution"
Visitor --> FR[Function Registry]
FR --> BF[BuiltinFunctionName]
BF --> AGG[Aggregator Functions]
end

subgraph "Aggregator Implementations"
AGG --> FIRST[FirstAggregator]
AGG --> LAST[LastAggregator]
AGG --> LIST[ListAggregator]
AGG --> VALUES[ValuesAggregator]
AGG --> DC[DistinctCount]
AGG --> EL[Earliest/Latest]
end

subgraph "OpenSearch Execution"
FIRST --> TH[top_hits agg]
LAST --> TH
DC --> CARD[cardinality agg]
LIST --> ARRAY[ARRAY_AGG]
end
```

### Data Flow

```mermaid
flowchart TB
subgraph Input
SRC[Source Data]
end

subgraph "stats Command"
SRC --> STATS[stats aggregation]
STATS --> GRP[Group By Fields]
GRP --> AGG1[Aggregate Functions]
end

subgraph "eventstats Command"
SRC --> EVST[eventstats]
EVST --> WIN[Window Functions]
WIN --> PART[PARTITION BY]
end

subgraph Output
AGG1 --> RES1[Aggregated Results]
PART --> RES2[Enriched Documents]
end
```
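
The difference between the two branches of the diagram can be sketched in Python. This is only an illustration of the semantics, not the plugin's implementation: `stats_dc` and `eventstats_dc` are hypothetical helper names, and the sketch counts exactly, whereas `distinct_count()` in `eventstats` is documented as approximate.

```python
from collections import defaultdict


def stats_dc(rows, field, by):
    """Collapse rows to one distinct count per group, like `stats dc(field) by <by>`."""
    groups = defaultdict(set)
    for row in rows:
        bucket = groups[row[by]]          # create the group even if the value is NULL
        if row.get(field) is not None:    # NULL values are ignored
            bucket.add(row[field])
    return {key: len(values) for key, values in groups.items()}


def eventstats_dc(rows, field, by, alias):
    """Keep every row and attach its group's count, like `eventstats dc(field) as alias by <by>`."""
    counts = stats_dc(rows, field, by)
    return [{**row, alias: counts[row[by]]} for row in rows]


# Hypothetical sample documents.
accounts = [
    {"gender": "F", "state": "CA"},
    {"gender": "F", "state": "NY"},
    {"gender": "M", "state": "CA"},
    {"gender": "F", "state": "CA"},
]

aggregated = stats_dc(accounts, "state", "gender")                   # one entry per group
enriched = eventstats_dc(accounts, "state", "gender", "state_count")  # one entry per input row
```

`aggregated` has one entry per `gender` value, while `enriched` has the same number of rows as the input, each carrying an extra `state_count` field.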

### Components

| Component | Description |
|-----------|-------------|
| `BuiltinFunctionName` | Enum defining all built-in function names including aggregate functions |
| `AggregatorFunctions` | Registry for aggregate function implementations |
| `FirstAggregator` | Returns first value in document order |
| `LastAggregator` | Returns last value in document order |
| `ListAggregator` | Collects values into an array preserving duplicates |
| `ValuesAggregator` | Collects unique values into an array |
| `CalciteAggCallVisitor` | Translates PPL aggregations to Calcite SQL |

### Configuration

| Setting | Description | Default |
|---------|-------------|---------|
| `plugins.calcite.enabled` | Enable Calcite engine for advanced PPL features | `false` |
| List max values | Maximum values collected by `list()` function | 100 |
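
As a rough sketch, the request body for enabling the Calcite engine could be built as follows. It assumes that `plugins.calcite.enabled` can be updated dynamically through the cluster settings API (`PUT _cluster/settings`), which this document does not state. The function name `calcite_settings_body` is hypothetical, and the sketch only builds the JSON payload without sending it.

```python
import json


def calcite_settings_body(enabled=True, scope="transient"):
    """Build a cluster-settings payload for `plugins.calcite.enabled`.

    Assumption: the setting is dynamic and accepted under the standard
    `transient` or `persistent` scopes of the cluster settings API.
    """
    if scope not in ("transient", "persistent"):
        raise ValueError("scope must be 'transient' or 'persistent'")
    return json.dumps({scope: {"plugins.calcite.enabled": enabled}})


body = calcite_settings_body()
```

The resulting string would be sent as the body of the settings request by whichever HTTP client is in use.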

### Aggregate Functions Reference

#### Standard Aggregate Functions (stats command)

| Function | Description | NULL Handling |
|----------|-------------|---------------|
| `COUNT(field)` | Count of non-null values | Not counted |
| `SUM(field)` | Sum of values | Ignored |
| `AVG(field)` | Average of values | Ignored |
| `MAX(field)` | Maximum value | Ignored |
| `MIN(field)` | Minimum value | Ignored |
| `FIRST(field)` | First value in document order | Returns NULL if the field is NULL or no records exist |
| `LAST(field)` | Last value in document order | Returns NULL if the field is NULL or no records exist |
| `list(field)` | Array of all values | Filtered out |
| `values(field)` | Array of unique values | Filtered out |
| `distinct_count(field)` / `dc(field)` | Count of distinct values | Ignored |
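
The contrast between `list()` and `values()` in the table can be illustrated with a small Python sketch. The helper names `ppl_list` and `ppl_values` are hypothetical, the default limit of 100 is taken from this document, and the output order of `ppl_values` (first-seen order) is an assumption made for the sketch, since the ordering of the real function is not specified here.

```python
def ppl_list(values, limit=100):
    """Mimic `list(field)`: drop NULLs, keep duplicates and order, cap at `limit`."""
    non_null = [v for v in values if v is not None]
    return non_null[:limit]


def ppl_values(values):
    """Mimic `values(field)`: drop NULLs and duplicates.

    Assumption: results are returned in first-seen order.
    """
    return list(dict.fromkeys(v for v in values if v is not None))


user_ids = ["u1", None, "u2", "u1"]
all_users = ppl_list(user_ids)       # duplicates kept, NULL filtered out
unique_users = ppl_values(user_ids)  # duplicates and NULL filtered out
```

With more than 100 non-null inputs, `ppl_list` truncates the result, mirroring the default cap described for `list()`.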

#### Window Aggregate Functions (eventstats command)

| Function | Description | Translation |
|----------|-------------|-------------|
| `distinct_count(field)` / `dc(field)` | Distinct count per partition | `APPROX_DISTINCT_COUNT(field) OVER (...)` |
| `earliest(field)` | Earliest value by time | Time-based window function |
| `latest(field)` | Latest value by time | Time-based window function |

### Usage Examples

#### Basic Aggregations
```ppl
# Calculate average age
source=accounts | stats avg(age)

# Group by field
source=accounts | stats avg(age), sum(age) by gender

# Multiple aggregations
source=accounts | stats max(age), min(age) by gender
```

#### First/Last Functions
```ppl
# Get first and last values
source=logs | stats first(message), last(status) by host

# Combined with sorting
source=events | sort timestamp | stats first(event_type), last(event_data) by session_id
```

#### Multi-value Functions
```ppl
# Collect all values
source=logs | stats list(user_id) as all_users by status

# Collect unique values
source=events | stats values(source_ip) as unique_ips by hour
```

#### Eventstats with Window Functions
```ppl
# Distinct count per partition
source=accounts | eventstats dc(state) as distinct_states
source=accounts | eventstats distinct_count(country) as unique_countries

# With partitioning
source=accounts | eventstats dc(state) as state_count by gender

# Earliest/Latest
source=transactions | eventstats earliest(amount), latest(amount) by account_id
```

### Supported Data Types

Both `list()` and `values()` functions support:

| Category | Types |
|----------|-------|
| Numeric | INTEGER, LONG, FLOAT, DOUBLE |
| String | STRING, TEXT |
| Boolean | BOOLEAN |
| Date/Time | DATE, TIME, TIMESTAMP |
| Complex | STRUCT, ARRAY |

## Limitations

- `first()` and `last()` use document order, not time-based ordering
- `list()` function returns a maximum of 100 values by default
- `values()` function has no default limit but can be configured
- `distinct_count()` in eventstats uses approximate counting
- Window functions require `plugins.calcite.enabled=true`
- Aggregate functions in eventstats are executed on the coordinating node

## Related PRs

| Version | PR | Description |
|---------|-----|-------------|
| v3.3.0 | [#4223](https://github.com/opensearch-project/sql/pull/4223) | Support first/last aggregate functions for PPL |
| v3.3.0 | [#4161](https://github.com/opensearch-project/sql/pull/4161) | Add support for `list()` multi-value stats function |
| v3.3.0 | [#4084](https://github.com/opensearch-project/sql/pull/4084) | Support distinct_count/dc in eventstats |
| v3.3.0 | [#4212](https://github.com/opensearch-project/sql/pull/4212) | Add earliest/latest aggregate function for eventstats |

## References

- [Issue #4203](https://github.com/opensearch-project/sql/issues/4203): PPL first/last aggregate function
- [Issue #4026](https://github.com/opensearch-project/sql/issues/4026): Multivalue Statistics Functions for PPL Calcite Engine
- [Issue #4052](https://github.com/opensearch-project/sql/issues/4052): PPL distinct_count/dc function support for eventstats
- [Issue #4047](https://github.com/opensearch-project/sql/issues/4047): PPL eventstats command enhancement
- [PPL Commands Documentation](https://docs.opensearch.org/3.0/search-plugins/sql/ppl/functions/)
- [SQL Aggregate Functions](https://docs.opensearch.org/3.0/search-plugins/sql/sql/aggregations/)

## Change History

- **v3.3.0** (2026-01-11): Added first/last, list, earliest/latest aggregate functions; extended distinct_count/dc to eventstats command
155 changes: 155 additions & 0 deletions docs/releases/v3.3.0/features/sql/ppl-aggregate-functions.md
@@ -0,0 +1,155 @@
# PPL Aggregate Functions

## Summary

OpenSearch v3.3.0 expands PPL (Piped Processing Language) aggregate function capabilities for the `stats` and `eventstats` commands. This release adds `first()`, `last()`, and `list()` to `stats`, adds `earliest()` and `latest()` to `eventstats`, and extends `distinct_count()`/`dc()` to `eventstats`, enabling more powerful data analysis and aggregation workflows.

## Details

### What's New in v3.3.0

This release introduces several new aggregate functions and extends existing functions to work with the `eventstats` command:

| Function | Command Support | Description |
|----------|-----------------|-------------|
| `first(field)` | `stats` | Returns the first value in natural document order |
| `last(field)` | `stats` | Returns the last value in natural document order |
| `list(field)` | `stats` | Collects all values into an array (preserves duplicates) |
| `earliest(field)` | `eventstats` | Returns the earliest value based on time |
| `latest(field)` | `eventstats` | Returns the latest value based on time |
| `distinct_count(field)` / `dc(field)` | `eventstats` | Counts distinct values as a window function |

### Technical Changes

#### Architecture Changes

```mermaid
graph TB
subgraph PPL Query
Q[PPL Query] --> P[Parser]
P --> AST[AST Builder]
end

subgraph Calcite Engine
AST --> CV[CalciteVisitor]
CV --> AGG[Aggregate Functions]
AGG --> FIRST[FirstAggregator]
AGG --> LAST[LastAggregator]
AGG --> LIST[ListAggregator]
AGG --> DC[DistinctCountAggregator]
AGG --> EL[Earliest/Latest]
end

subgraph OpenSearch
AGG --> TH[top_hits aggregation]
AGG --> CARD[cardinality aggregation]
end
```

#### New Components

| Component | Description |
|-----------|-------------|
| `FIRST` | Aggregate function returning first value in document order |
| `LAST` | Aggregate function returning last value in document order |
| `LIST` | Multi-value aggregate collecting values into arrays |
| `EARLIEST` | Time-based aggregate for eventstats |
| `LATEST` | Time-based aggregate for eventstats |
| `distinct_count`/`dc` | Distinct count support for eventstats window functions |

#### Function Specifications

**FIRST Function**
- Syntax: `FIRST(field)`
- Returns: First value of the field in natural document order
- Return Type: Same as input field type (nullable)
- Behavior: Uses `top_hits` with `size: 1` for OpenSearch pushdown

**LAST Function**
- Syntax: `LAST(field)`
- Returns: Last value of the field in natural document order
- Return Type: Same as input field type (nullable)
- Behavior: Uses `top_hits` with `size: 1` and reverse sort

**LIST Function**
- Syntax: `list(field)`
- Returns: Array of all values (preserves duplicates and order)
- Return Type: `ARRAY<T>` where T is the input field type
- Behavior: Collects up to 100 values by default

**DISTINCT_COUNT/DC for eventstats**
- Syntax: `distinct_count(field)` or `dc(field)`
- Returns: Count of distinct values
- Translation: `APPROX_DISTINCT_COUNT(field) OVER (PARTITION BY ...)`

### Usage Examples

**First/Last aggregate functions:**
```ppl
# Basic usage
source=logs | stats first(message), last(status) by host

# Combined with other aggregations
source=metrics | stats first(cpu_usage), last(memory_usage), count(), avg(response_time) by server

# Sequential processing after sorting
source=events | sort timestamp | stats first(event_type), last(event_data) by session_id
```

**List function:**
```ppl
# Collect all user IDs for each status
source=access_logs | stats list(user_id) as all_users by response_status

# Combined with other statistics
source=ecommerce | stats count(*) as total_orders, list(product_id) as all_products by customer_segment
```

**Distinct count in eventstats:**
```ppl
# Basic distinct count
source=accounts | eventstats dc(state) as distinct_states

# With partitioning
source=accounts | eventstats dc(state) as state_count by gender
```

**Earliest/Latest in eventstats:**
```ppl
# Get earliest and latest values
source=transactions | eventstats earliest(amount), latest(amount) by account_id
```

### Migration Notes

- The `first()` and `last()` functions use natural document order, not time-based ordering
- For time-based ordering, use `earliest()` and `latest()` with eventstats
- The `list()` function has a default limit of 100 values per group
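
The distinction drawn in these notes can be sketched in Python. This is an illustration of the semantics only: the helper names are hypothetical, and the use of `@timestamp` as the time field is an assumption not stated in this document.

```python
def first(rows, field):
    """Value from the first row in document order, or None if there are no rows."""
    return rows[0].get(field) if rows else None


def last(rows, field):
    """Value from the last row in document order, or None if there are no rows."""
    return rows[-1].get(field) if rows else None


def earliest(rows, field, time_field="@timestamp"):
    """Value from the row with the smallest timestamp (time field name is assumed)."""
    return min(rows, key=lambda r: r[time_field]).get(field) if rows else None


def latest(rows, field, time_field="@timestamp"):
    """Value from the row with the largest timestamp (time field name is assumed)."""
    return max(rows, key=lambda r: r[time_field]).get(field) if rows else None


# Hypothetical documents, deliberately stored out of time order.
# ISO-8601 strings in the same format and zone compare correctly as plain strings.
transactions = [
    {"@timestamp": "2025-01-01T10:00:00Z", "amount": 30},
    {"@timestamp": "2025-01-01T08:00:00Z", "amount": 10},
    {"@timestamp": "2025-01-01T09:00:00Z", "amount": 20},
]

by_position = (first(transactions, "amount"), last(transactions, "amount"))
by_time = (earliest(transactions, "amount"), latest(transactions, "amount"))
```

Here `by_position` depends on storage order while `by_time` depends only on the timestamps, which is why the two pairs differ whenever documents are not indexed in time order.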

## Limitations

- `first()` and `last()` return NULL if no records exist or if the field is NULL
- `list()` returns a maximum of 100 values by default
- `distinct_count()`/`dc()` in eventstats uses approximate counting via `APPROX_DISTINCT_COUNT`
- Window function argument validation was added for eventstats commands

## Related PRs

| PR | Description |
|----|-------------|
| [#4223](https://github.com/opensearch-project/sql/pull/4223) | Support first/last aggregate functions for PPL |
| [#4161](https://github.com/opensearch-project/sql/pull/4161) | Add support for `list()` multi-value stats function |
| [#4084](https://github.com/opensearch-project/sql/pull/4084) | Support distinct_count/dc in eventstats |
| [#4212](https://github.com/opensearch-project/sql/pull/4212) | Add earliest/latest aggregate function for eventstats PPL command |

## References

- [Issue #4203](https://github.com/opensearch-project/sql/issues/4203): PPL first/last aggregate function request
- [Issue #4026](https://github.com/opensearch-project/sql/issues/4026): Multivalue Statistics Functions for PPL Calcite Engine
- [Issue #4052](https://github.com/opensearch-project/sql/issues/4052): PPL distinct_count/dc function support for eventstats
- [Issue #4047](https://github.com/opensearch-project/sql/issues/4047): PPL eventstats command enhancement
- [PPL Commands Documentation](https://docs.opensearch.org/3.0/search-plugins/sql/ppl/functions/)

## Related Feature Report

- [Full feature documentation](../../../features/sql/ppl-aggregate-functions.md)
1 change: 1 addition & 0 deletions docs/releases/v3.3.0/index.md
@@ -136,6 +136,7 @@

### SQL

- [PPL Aggregate Functions](features/sql/ppl-aggregate-functions.md)
- [PPL Patterns Command Enhancements](features/sql/ppl-patterns-command.md)
- [PPL Rename Command - Wildcard Support](features/sql/ppl-rename-command.md)
- [PPL Rex and Regex Commands](features/sql/ppl-rex-and-regex-commands.md)