From 3f9eb5ba8da83dc755c6efe9a4edf5e11ecbaad1 Mon Sep 17 00:00:00 2001
From: enomott
Date: Sun, 11 Jan 2026 11:14:02 +0900
Subject: [PATCH] docs: add PPL Aggregate Functions report for v3.3.0

---
 docs/features/index.md                        |   1 +
 docs/features/sql/ppl-aggregate-functions.md  | 198 ++++++++++++++++++
 .../features/sql/ppl-aggregate-functions.md   | 155 ++++++++++++++
 docs/releases/v3.3.0/index.md                 |   1 +
 4 files changed, 355 insertions(+)
 create mode 100644 docs/features/sql/ppl-aggregate-functions.md
 create mode 100644 docs/releases/v3.3.0/features/sql/ppl-aggregate-functions.md

diff --git a/docs/features/index.md b/docs/features/index.md
index 38dbb9267..42d115fc1 100644
--- a/docs/features/index.md
+++ b/docs/features/index.md
@@ -310,6 +310,7 @@
 - [Calcite Query Engine](sql/calcite-query-engine.md)
 - [Flint Index Operations](sql/flint-index-operations.md)
 - [Flint Query Scheduler](sql/flint-query-scheduler.md)
+- [PPL Aggregate Functions](sql/ppl-aggregate-functions.md)
 - [PPL Documentation](sql/ppl-documentation.md)
 - [PPL Patterns Command](sql/ppl-patterns-command.md)
 - [PPL Rename Command](sql/ppl-rename-command.md)
diff --git a/docs/features/sql/ppl-aggregate-functions.md b/docs/features/sql/ppl-aggregate-functions.md
new file mode 100644
index 000000000..25b836f7d
--- /dev/null
+++ b/docs/features/sql/ppl-aggregate-functions.md
@@ -0,0 +1,198 @@
+# PPL Aggregate Functions
+
+## Summary
+
+PPL (Piped Processing Language) aggregate functions enable statistical analysis and data aggregation in OpenSearch queries. These functions work with the `stats` and `eventstats` commands to compute aggregations across documents, supporting operations like counting, averaging, collecting values into arrays, and retrieving first/last values based on document or time order.
+
+## Details
+
+### Architecture
+
+```mermaid
+graph TB
+    subgraph "PPL Query Processing"
+        Q[PPL Query] --> Parser[PPL Parser]
+        Parser --> AST[Abstract Syntax Tree]
+        AST --> Visitor[Calcite Visitor]
+    end
+
+    subgraph "Aggregate Function Resolution"
+        Visitor --> FR[Function Registry]
+        FR --> BF[BuiltinFunctionName]
+        BF --> AGG[Aggregator Functions]
+    end
+
+    subgraph "Aggregator Implementations"
+        AGG --> FIRST[FirstAggregator]
+        AGG --> LAST[LastAggregator]
+        AGG --> LIST[ListAggregator]
+        AGG --> VALUES[ValuesAggregator]
+        AGG --> DC[DistinctCount]
+        AGG --> EL[Earliest/Latest]
+    end
+
+    subgraph "OpenSearch Execution"
+        FIRST --> TH[top_hits agg]
+        LAST --> TH
+        DC --> CARD[cardinality agg]
+        LIST --> ARRAY[ARRAY_AGG]
+    end
+```
+
+### Data Flow
+
+```mermaid
+flowchart TB
+    subgraph Input
+        SRC[Source Data]
+    end
+
+    subgraph "stats Command"
+        SRC --> STATS[stats aggregation]
+        STATS --> GRP[Group By Fields]
+        GRP --> AGG1[Aggregate Functions]
+    end
+
+    subgraph "eventstats Command"
+        SRC --> EVST[eventstats]
+        EVST --> WIN[Window Functions]
+        WIN --> PART[PARTITION BY]
+    end
+
+    subgraph Output
+        AGG1 --> RES1[Aggregated Results]
+        PART --> RES2[Enriched Documents]
+    end
+```
+
+### Components
+
+| Component | Description |
+|-----------|-------------|
+| `BuiltinFunctionName` | Enum defining all built-in function names including aggregate functions |
+| `AggregatorFunctions` | Registry for aggregate function implementations |
+| `FirstAggregator` | Returns the first value in document order |
+| `LastAggregator` | Returns the last value in document order |
+| `ListAggregator` | Collects values into an array preserving duplicates |
+| `ValuesAggregator` | Collects unique values into an array |
+| `CalciteAggCallVisitor` | Translates PPL aggregations to Calcite SQL |
+
+### Configuration
+
+| Setting | Description | Default |
+|---------|-------------|---------|
+| `plugins.calcite.enabled` | Enable Calcite engine for advanced PPL features | `false` |
+| List max values | Maximum values collected by `list()` function | 100 |
+
+### Aggregate Functions Reference
+
+#### Standard Aggregate Functions (stats command)
+
+| Function | Description | NULL Handling |
+|----------|-------------|---------------|
+| `COUNT(field)` | Count of non-null values | Not counted |
+| `SUM(field)` | Sum of values | Ignored |
+| `AVG(field)` | Average of values | Ignored |
+| `MAX(field)` | Maximum value | Ignored |
+| `MIN(field)` | Minimum value | Ignored |
+| `FIRST(field)` | First value in document order | Returns NULL |
+| `LAST(field)` | Last value in document order | Returns NULL |
+| `list(field)` | Array of all values | Filtered out |
+| `values(field)` | Array of unique values | Filtered out |
+| `distinct_count(field)` / `dc(field)` | Count of distinct values | Ignored |
+
+#### Window Aggregate Functions (eventstats command)
+
+| Function | Description | Translation |
+|----------|-------------|-------------|
+| `distinct_count(field)` / `dc(field)` | Distinct count per partition | `APPROX_DISTINCT_COUNT(field) OVER (...)` |
+| `earliest(field)` | Earliest value by time | Time-based window function |
+| `latest(field)` | Latest value by time | Time-based window function |
+
+### Usage Examples
+
+#### Basic Aggregations
+```ppl
+# Calculate average age
+source=accounts | stats avg(age)
+
+# Group by field
+source=accounts | stats avg(age), sum(age) by gender
+
+# Multiple aggregations
+source=accounts | stats max(age), min(age) by gender
+```
+
+#### First/Last Functions
+```ppl
+# Get first and last values
+source=logs | stats first(message), last(status) by host
+
+# Combined with sorting
+source=events | sort timestamp | stats first(event_type), last(event_data) by session_id
+```
+
+#### Multi-value Functions
+```ppl
+# Collect all values
+source=logs | stats list(user_id) as all_users by status
+
+# Collect unique values
+source=events | stats values(source_ip) as unique_ips by hour
+```
+
+#### Eventstats with Window Functions
+```ppl
+# Distinct count per partition
+source=accounts | eventstats dc(state) as distinct_states
+source=accounts | eventstats distinct_count(country) as unique_countries
+
+# With partitioning
+source=accounts | eventstats dc(state) as state_count by gender
+
+# Earliest/Latest
+source=transactions | eventstats earliest(amount), latest(amount) by account_id
+```
+
+### Supported Data Types
+
+Both `list()` and `values()` functions support:
+
+| Category | Types |
+|----------|-------|
+| Numeric | INTEGER, LONG, FLOAT, DOUBLE |
+| String | STRING, TEXT |
+| Boolean | BOOLEAN |
+| Date/Time | DATE, TIME, TIMESTAMP |
+| Complex | STRUCT, ARRAY |
+
+## Limitations
+
+- `first()` and `last()` use document order, not time-based ordering
+- `list()` returns a maximum of 100 values by default
+- `values()` has no default limit, but a limit can be configured
+- `distinct_count()` in eventstats uses approximate counting
+- Window functions require `plugins.calcite.enabled=true`
+- Aggregate functions in eventstats are executed on the coordinating node
+
+## Related PRs
+
+| Version | PR | Description |
+|---------|-----|-------------|
+| v3.3.0 | [#4223](https://github.com/opensearch-project/sql/pull/4223) | Support first/last aggregate functions for PPL |
+| v3.3.0 | [#4161](https://github.com/opensearch-project/sql/pull/4161) | Add support for `list()` multi-value stats function |
+| v3.3.0 | [#4084](https://github.com/opensearch-project/sql/pull/4084) | Support distinct_count/dc in eventstats |
+| v3.3.0 | [#4212](https://github.com/opensearch-project/sql/pull/4212) | Add earliest/latest aggregate function for eventstats |
+
+## References
+
+- [Issue #4203](https://github.com/opensearch-project/sql/issues/4203): PPL first/last aggregate function
+- [Issue #4026](https://github.com/opensearch-project/sql/issues/4026): Multivalue Statistics Functions for PPL Calcite Engine
+- [Issue #4052](https://github.com/opensearch-project/sql/issues/4052): PPL distinct_count/dc function support for eventstats
+- [Issue #4047](https://github.com/opensearch-project/sql/issues/4047): PPL eventstats command enhancement
+- [PPL Commands Documentation](https://docs.opensearch.org/3.0/search-plugins/sql/ppl/functions/)
+- [SQL Aggregate Functions](https://docs.opensearch.org/3.0/search-plugins/sql/sql/aggregations/)
+
+## Change History
+
+- **v3.3.0** (2026-01-11): Added first/last, list, earliest/latest aggregate functions; extended distinct_count/dc to eventstats command
diff --git a/docs/releases/v3.3.0/features/sql/ppl-aggregate-functions.md b/docs/releases/v3.3.0/features/sql/ppl-aggregate-functions.md
new file mode 100644
index 000000000..112eba66b
--- /dev/null
+++ b/docs/releases/v3.3.0/features/sql/ppl-aggregate-functions.md
@@ -0,0 +1,155 @@
+# PPL Aggregate Functions
+
+## Summary
+
+OpenSearch v3.3.0 expands PPL (Piped Processing Language) aggregate function capabilities with new functions for the `stats` and `eventstats` commands. This release adds `first()`, `last()`, and `list()` to `stats`, and adds `earliest()`, `latest()`, and `distinct_count()`/`dc()` to `eventstats`.
+
+## Details
+
+### What's New in v3.3.0
+
+This release introduces several new aggregate functions and extends existing functions to work with the `eventstats` command:
+
+| Function | Command Support | Description |
+|----------|-----------------|-------------|
+| `first(field)` | `stats` | Returns the first value in natural document order |
+| `last(field)` | `stats` | Returns the last value in natural document order |
+| `list(field)` | `stats` | Collects all values into an array (preserves duplicates) |
+| `earliest(field)` | `eventstats` | Returns the earliest value based on time |
+| `latest(field)` | `eventstats` | Returns the latest value based on time |
+| `distinct_count(field)` / `dc(field)` | `eventstats` | Counts distinct values as a window function |
+
+### Technical Changes
+
+#### Architecture Changes
+
+```mermaid
+graph TB
+    subgraph PPL Query
+        Q[PPL Query] --> P[Parser]
+        P --> AST[AST Builder]
+    end
+
+    subgraph Calcite Engine
+        AST --> CV[CalciteVisitor]
+        CV --> AGG[Aggregate Functions]
+        AGG --> FIRST[FirstAggregator]
+        AGG --> LAST[LastAggregator]
+        AGG --> LIST[ListAggregator]
+        AGG --> DC[DistinctCountAggregator]
+        AGG --> EL[Earliest/Latest]
+    end
+
+    subgraph OpenSearch
+        AGG --> TH[top_hits aggregation]
+        AGG --> CARD[cardinality aggregation]
+    end
+```
+
+#### New Components
+
+| Component | Description |
+|-----------|-------------|
+| `FIRST` | Aggregate function returning the first value in document order |
+| `LAST` | Aggregate function returning the last value in document order |
+| `LIST` | Multi-value aggregate collecting values into arrays |
+| `EARLIEST` | Time-based aggregate for eventstats |
+| `LATEST` | Time-based aggregate for eventstats |
+| `distinct_count`/`dc` | Distinct count support for eventstats window functions |
+
+#### Function Specifications
+
+**FIRST Function**
+- Syntax: `FIRST(field)`
+- Returns: First value of the field in natural document order
+- Return Type: Same as input field type (nullable)
+- Behavior: Uses `top_hits` with `size: 1` for OpenSearch pushdown
+
+**LAST Function**
+- Syntax: `LAST(field)`
+- Returns: Last value of the field in natural document order
+- Return Type: Same as input field type (nullable)
+- Behavior: Uses `top_hits` with `size: 1` and reverse sort
+
+**LIST Function**
+- Syntax: `list(field)`
+- Returns: Array of all values (preserves duplicates and order)
+- Return Type: `ARRAY<T>` where T is the input field type
+- Behavior: Collects up to 100 values by default
+
+**DISTINCT_COUNT/DC for eventstats**
+- Syntax: `distinct_count(field)` or `dc(field)`
+- Returns: Count of distinct values
+- Translation: `APPROX_DISTINCT_COUNT(field) OVER (PARTITION BY ...)`
+
+### Usage Examples
+
+**First/Last aggregate functions:**
+```ppl
+# Basic usage
+source=logs | stats first(message), last(status) by host
+
+# Combined with other aggregations
+source=metrics | stats first(cpu_usage), last(memory_usage), count(), avg(response_time) by server
+
+# Sequential processing after sorting
+source=events | sort timestamp | stats first(event_type), last(event_data) by session_id
+```
+
+**List function:**
+```ppl
+# Collect all user IDs for each status
+source=access_logs | stats list(user_id) as all_users by response_status
+
+# Combined with other statistics
+source=ecommerce | stats count(*) as total_orders, list(product_id) as all_products by customer_segment
+```
+
+**Distinct count in eventstats:**
+```ppl
+# Basic distinct count
+source=accounts | eventstats dc(state) as distinct_states
+
+# With partitioning
+source=accounts | eventstats dc(state) as state_count by gender
+```
+
+**Earliest/Latest in eventstats:**
+```ppl
+# Get earliest and latest values
+source=transactions | eventstats earliest(amount), latest(amount) by account_id
+```
+
+### Migration Notes
+
+- The `first()` and `last()` functions use natural document order, not time-based ordering
+- For time-based ordering, use `earliest()` and `latest()` with eventstats
+- The `list()` function has a default limit of 100 values per group
+
+## Limitations
+
+- `first()` and `last()` return NULL if no records exist or if the field is NULL
+- `list()` returns a maximum of 100 values by default
+- `distinct_count()`/`dc()` in eventstats uses approximate counting via `APPROX_DISTINCT_COUNT`
+- Window function argument validation was added for eventstats commands
+
+## Related PRs
+
+| PR | Description |
+|----|-------------|
+| [#4223](https://github.com/opensearch-project/sql/pull/4223) | Support first/last aggregate functions for PPL |
+| [#4161](https://github.com/opensearch-project/sql/pull/4161) | Add support for `list()` multi-value stats function |
+| [#4084](https://github.com/opensearch-project/sql/pull/4084) | Support distinct_count/dc in eventstats |
+| [#4212](https://github.com/opensearch-project/sql/pull/4212) | Add earliest/latest aggregate function for eventstats PPL command |
+
+## References
+
+- [Issue #4203](https://github.com/opensearch-project/sql/issues/4203): PPL first/last aggregate function request
+- [Issue #4026](https://github.com/opensearch-project/sql/issues/4026): Multivalue Statistics Functions for PPL Calcite Engine
+- [Issue #4052](https://github.com/opensearch-project/sql/issues/4052): PPL distinct_count/dc function support for eventstats
+- [Issue #4047](https://github.com/opensearch-project/sql/issues/4047): PPL eventstats command enhancement
+- [PPL Commands Documentation](https://docs.opensearch.org/3.0/search-plugins/sql/ppl/functions/)
+
+## Related Feature Report
+
+- [Full feature documentation](../../../features/sql/ppl-aggregate-functions.md)
diff --git a/docs/releases/v3.3.0/index.md b/docs/releases/v3.3.0/index.md
index a78e15882..a2c0bd3a4 100644
--- a/docs/releases/v3.3.0/index.md
+++ b/docs/releases/v3.3.0/index.md
@@ -136,6 +136,7 @@
 
 ### SQL
 
+- [PPL Aggregate Functions](features/sql/ppl-aggregate-functions.md)
 - [PPL Patterns Command Enhancements](features/sql/ppl-patterns-command.md)
 - [PPL Rename Command - Wildcard Support](features/sql/ppl-rename-command.md)
 - [PPL Rex and Regex Commands](features/sql/ppl-rex-and-regex-commands.md)
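
Reviewer note: the difference this patch documents between `stats` (one output row per group) and `eventstats` (every input row kept and enriched), and between `list()` and `values()`, can be modeled in a few lines of Python. This is only an illustrative sketch, not the plugin's implementation. The helper names (`stats_list`, `stats_values`, `eventstats_dc`) and the sample rows are hypothetical. It assumes NULLs are skipped, that `list()` stops at the documented default of 100 values, and that `values()` keeps first-seen order (the patch does not specify an order). It also counts distinct values exactly, whereas the patch says `eventstats` uses approximate counting.

```python
from collections import defaultdict

LIST_LIMIT = 100  # default cap for list() as stated in the patch

# Hypothetical sample documents, in "document order".
SAMPLE_ROWS = [
    {"gender": "F", "state": "CA"},
    {"gender": "F", "state": "CA"},
    {"gender": "M", "state": "NY"},
    {"gender": "F", "state": "WA"},
    {"gender": "M", "state": None},
]


def stats_list(rows, field, by):
    """Model of `stats list(field) by <by>`: one entry per group,
    duplicates kept, NULLs skipped, capped at LIST_LIMIT values."""
    groups = defaultdict(list)
    for row in rows:
        bucket = groups[row[by]]
        value = row.get(field)
        if value is not None and len(bucket) < LIST_LIMIT:
            bucket.append(value)
    return dict(groups)


def stats_values(rows, field, by):
    """Model of `stats values(field) by <by>`: unique non-NULL values per
    group. First-seen order is an assumption of this sketch."""
    groups = defaultdict(list)
    for row in rows:
        bucket = groups[row[by]]
        value = row.get(field)
        if value is not None and value not in bucket:
            bucket.append(value)
    return dict(groups)


def eventstats_dc(rows, field, by, alias):
    """Model of `eventstats dc(field) as <alias> by <by>`: every input row
    is kept and gains the distinct count of its partition (exact here)."""
    distinct = defaultdict(set)
    for row in rows:
        if row.get(field) is not None:
            distinct[row[by]].add(row[field])
    return [{**row, alias: len(distinct[row[by]])} for row in rows]


if __name__ == "__main__":
    print(stats_list(SAMPLE_ROWS, "state", "gender"))
    print(stats_values(SAMPLE_ROWS, "state", "gender"))
    for enriched in eventstats_dc(SAMPLE_ROWS, "state", "gender", "state_count"):
        print(enriched)
```

The `stats` helpers collapse five documents into two groups, while `eventstats_dc` returns all five documents with an extra `state_count` field, which mirrors the "Aggregated Results" versus "Enriched Documents" outputs in the Data Flow diagram.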